OpenAI's newest frontier model — improved reasoning over 5.4 at the same 1.05M context, configurable thinking budget, and full tool-use support.
Text-to-image generation with photorealistic output, accurate text rendering, and strong prompt adherence.
Reference-guided video from prompt plus optional images, videos, and audio references.
Text-to-video generation up to 1080P with configurable aspect ratio and duration.
Animates images into video up to 15s at 1080P with first/last-frame guidance, video continuation, and optional driving audio.
Reference-guided video generation with character consistency, multi-character support, optional reference voices, and up to 1080P output.
Text-to-video with multi-shot generation, up to 1080P, 2-15s duration, and optional driving audio.
Cheap, fast DeepSeek V4 — 13B active params over a 1M context, well-suited to high-volume traffic and as a default daily-driver.
Flagship DeepSeek V4 with a 1M-token context, 49B active params, and native tool use — strong reasoning and code at a fraction of frontier-tier pricing.
OpenAI's premium 5.5 variant — top-of-line reasoning at a higher price than 5.5 base, for the hardest agentic and research workloads.
Image-to-image editing with prompt-guided transformations and multi-reference composition.
Anthropic's most capable model, with a step-change jump in agentic coding over Opus 4.6 and a native 1M-token context window.
Fast-tier image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.
Fast-tier reference-guided video from prompt plus optional images, videos, and audio references.
Edit videos via text instructions or reference images with style transfer, up to 1080p, and flexible audio handling.
Image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 1080p, and optional synchronized audio.
Text-to-video with flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.
Flagship 31B dense multimodal model supporting text, image, and video input with 256K context window. Achieves competitive performance with much larger models.
Lightweight 2.3B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.
Efficient 4.5B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.
Alibaba's latest flagship closed model for advanced reasoning, coding, and complex text generation.
Video restoration and upscaling model from Topaz for detail-preserving 1080p/4k outputs.
Strongest OpenAI mini model for coding and agentic workloads, with 400K context, 128K max output, multimodal input, and broad tool support.
Video-to-depth estimation with temporal consistency, selectable model size, colormaps, and optional raw depth export.
Baseten-configured LTX 2.3 Pro 22B model with IC/Union-Control support for text-to-video and image-conditioned video generation.
Frontier model for complex professional work with 1.05M context, configurable reasoning, and extensive tool support including computer use and MCP.
Generates high-res 4K@25FPS videos from image+text, camera control, and synced audio.
Audio-to-video generation from image + audio input, 1080p output with synchronized visuals.
Text-to-video generation up to 4K@50FPS with optional audio and camera motion.
Fast, low-cost Gemini 3.1 model for high-throughput multimodal workloads, with configurable reasoning and a 1M-token context window.
GPT-5.3 Instant model for ChatGPT with 128K context, text and image inputs, and optimized conversational performance.
Hybrid Mamba-Transformer MoE with 1M context, optimized for agentic reasoning; 120B total, 12B active parameters.
Nano Banana 2 is a text-to-image model that generates images from text descriptions.
Nano Banana 2 Edit is an image editing model that enables blending multiple images, maintaining character consistency, targeted transformations using natural language, and leveraging world knowledge for precise edits.
Compact multimodal model with dual reasoning modes, native vision capabilities, support for over 200 languages, and long-context processing up to 262,144 tokens.
Multimodal LLM with native vision, image and video understanding, tool calling, optional thinking mode, support for 201 languages, and long-context processing up to 262,144 tokens.
Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).
Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).
Image generation with built-in reasoning, example-based editing, multi-reference control (up to 14 images), and 3K resolution support.
Flagship Gemini 3 reasoning model for complex multimodal and agentic workflows with a 1M-token context window.
Anthropic's latest Sonnet model with strong coding and agent performance, fast latency, and improved long-context reasoning.
Pro version of Qwen Image 2 with enhanced text rendering, realism, and semantic adherence for high-quality image generation and editing.
Anthropic's most advanced model, excelling in coding, agentic workflows, computer use, reasoning, math, and domain expertise in finance, law, and STEM.
Animates a static image into native 4K motion with optional start/end frame anchoring and synchronized audio.
Native 4K reference-to-video generation from element and style references with optional frame anchoring and synchronized audio.
Native 4K text-to-video generation with cinema-grade detail and optional synchronized audio.
Kling Video O3 Pro is an advanced image-to-video generation model that animates static images into high-quality videos based on text prompts.
Kling o3 Pro reference-to-video model generates videos from a reference image and text prompt describing motion and cinematic intent.
Edit videos using text prompts and reference images for character consistency or object replacement.
Motion transfer from reference video to character image. Cost-effective for portraits and simple animations.
Grok Imagine text-to-image is a high-quality image generation model from xAI that produces cinematic, stylistically consistent images from text prompts.
Grok Imagine - Image Edit is a high-quality image editing model from xAI that applies cinematic, stylistically consistent edits to existing images guided by text prompts.
Video editing model for prompt-driven modifications like object swapping, scene restyling, and character animation with synced native audio.
FLUX.2 Klein 4B is a compact 4 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.
FLUX.2 Klein 9B is a compact 9 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.
Multimodal LLM for targeted video editing: regenerate 2-16s segments (video/audio/both) via prompts, preserving motion, lighting, and continuity.
Generates high-res 4K@25FPS videos from image+text, camera control, and synced audio.
Native multimodal agentic model with vision, Agent Swarm (up to 100 sub-agents, 1,500 tool calls), coding from visual specs, and 256K context.
Frontier open LLM with advanced coding, agentic, and reasoning capabilities; 744B MoE with DSA for efficient 200K context.
Generative image model with improved photorealistic human portraits, finer natural scenes (landscapes, animal fur, and other natural elements), and better overall text rendering.
Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.
Fast multimodal model with configurable reasoning, strong agentic workflows, long context, and tool use for interactive chat, coding, and complex tasks.
Transforms static images into cinematic videos with synchronized audio, dialogue, and sound effects in 1080p.
Generates 1080p videos from text with native synchronized audio, including dialogue, sound effects, and lip-sync.
Diffusion model for high‑fidelity image generation and editing, with strong prompt adherence, preserved composition and lighting, and adjustable quality controls.
Animates images into 15s, 1080p videos with preserved identity, native audio, lip-sync, and multi-shot sequences guided by reference videos.
Generates videos from reference videos, maintaining character consistency, with multi-shot narratives, up to 15s duration, and native audio sync.
Frontier model for professional work with configurable reasoning effort, 400K context, structured outputs, and distillation support.
GPT-5.2 model optimized for ChatGPT with 128K context, text and image input support, streaming, and structured outputs.
High-fidelity text-to-image and image-to-image generation with multi-reference control (up to 10 images), 4K support, and batch output.
Transforms images (with text and up to 7 references) into cinematic video clips with stable characters, controlled motion, and consistent environments.
Multimodal video model for reference-guided generation, preserving characters and styles from reference images.
Text-guided video-to-video editing that preserves motion and continuity while enabling character swaps, style changes, motion transfer, and scene transformations.
Fast photorealistic text-to-image model with accurate English and Chinese on-image text, ideal for interactive design, marketing visuals, and UI/UX workflows.
Generates photorealistic images with precise multi-reference editing, excels at legible text and infographics, and supports rapid LoRA fine-tuning workflows.
Delivers high-quality image generation and editing with advanced text rendering, multi-image reference for style consistency, and precise, JSON-based prompt control.
Delivers photorealistic, high-resolution images with advanced multi-reference editing, precise pose and color control, and reliable prompt and text adherence for professionals.
Excels at long-horizon reasoning, advanced coding, dynamic effort control, robust multimodal tasks, and detailed computer interface inspection for complex workflows.
Delivers high-fidelity images with advanced text rendering, consistent character identities, and precise prompt following for professional visual design and branding.
Zero-shot image segmentation with text/visual prompts; exhaustive instance detection and presence head reduce false positives.
Detects, segments, and tracks objects across video frames using text, exemplars, points, or masks, with memory for occlusions and real-time streaming.
Automatically routes prompts to fast or deep reasoning modes, with adaptive effort, enhanced tone and style controls, and improved coding and math.
Generates high-fidelity videos with native synced audio, offering strong narrative control, scene consistency, image-to-video animation, and multi-shot support.
Animates an input image into short videos with controllable motion, duration, aspect ratio, resolution, and optional audio.
Animates a single image into short videos with controllable motion, duration, aspect ratio, and cost-efficient quality settings.
Generates high-quality 1080p videos up to 12s with synced native audio, multi-scene reasoning, timeline prompting, and realistic physics.
Optimized for rapid, high-volume multimodal tasks with a 1M-token context window, delivering strong reasoning and cost efficiency for enterprise workflows.
Transforms single images into smooth, cinematic videos with natural motion, realistic camera work like dolly zooms, and preserved style.
Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.
Generates photorealistic images with precise prompt and text rendering, mask-free editing, and layout-aware outpainting, ideal for creative and multilingual content.
Delivers ultra-fast, high-resolution image generation, precise natural-language editing, and consistent multi-image output—ideal for creative, batch, or professional workflows.
Anthropic's most advanced AI model, excelling in coding, agent-based tasks, and computer use. It delivers high performance in reasoning, math, and domain-specific knowledge across fields like finance, law, and STEM.
Enables precise bilingual text and semantic edits with strong consistency, advanced multi-image editing, and native pose/control support for creative compositions.
Lightweight multimodal model for visual Q&A, multilingual OCR, document and UI understanding, and agentic screen interpretation in constrained environments.
Multimodal LLM for text and images, excelling in visual QA, document/UI understanding, spatial reasoning, image captioning, and multimodal coding.
Versatile multimodal large language model that understands and generates both text and images. Built on the Qwen3 architecture, it delivers strong general reasoning, detailed image interpretation, and instruction following in a compact 8B-parameter size.
Open-weight text-to-image model with advanced prompt adherence, anatomically accurate details, and powerful tools for inpainting, outpainting, and structural edits.
Handles complex reasoning, code generation, and multimodal inputs with improved accuracy, long context retention, and robust multilingual and personalization features.
Optimized for cost and speed, handles long contexts, supports text and image input, and excels at structured outputs and tool integration for precise tasks.
Multimodal model optimized for ultra-fast, cost-efficient summarization and classification, supporting both text and image inputs with real-time streaming output.
Excels at complex coding, autonomous research, and agent workflows, with advanced reasoning and a 200,000-token context for deep analysis and synthesis.
Built on a Mixture-of-Experts design, it delivers efficient, transparent reasoning, tool use, and agentic capabilities across a 128K-token context window.
Delivers strong reasoning and chain-of-thought, agentic features, and multilingual support, optimized for local deployment and efficient use on modest hardware.
An image generation foundation model in the Qwen series with significant advances in complex text rendering and support for a wide range of artistic styles, from photorealistic scenes to impressionist paintings, anime aesthetics, and minimalist design.
Delivers high-fidelity text-to-video synthesis at 480p/720p using dual expert models for scene layout and fine motion detail, ideal for creative production.
Unified text-to-video and image-to-video model generates high-definition 720p, 24fps video clips efficiently on consumer GPUs, with advanced compression for speed.
Fast, cost-efficient multimodal reasoning model with million-token context for high-volume applications requiring speed and versatility.
Delivers precise, iterative image editing and generation with consistent character, style, and text changes—using multimodal input for seamless scene transformations.
Excels at deep reasoning, complex coding, and autonomous agent workflows with sustained performance, extended thinking, tool use, and memory across tasks.
Balances intelligence with efficiency for coding, research, and automation tasks; excels in reasoning, content generation, and nuanced instruction following.
Generates realistic text- and image-conditioned videos with native synchronized audio, including dialogue, ambient sound, and effects.
Excels at building interactive web apps, advanced code editing and agentic workflows, with native multimodality and strong video-to-code capabilities.
Dual reasoning modes enable rapid or step-by-step responses, with robust support for over 100 languages and long-context processing up to 262,144 tokens.
Efficient conversational AI for resource-limited devices with multilingual support, document summarization, translation, code generation, and simple information retrieval.
Excels at advanced reasoning, coding, math, and visual tasks with simulated reasoning, tool use, web browsing, and image understanding integration.
Optimized for fast, affordable reasoning with strong coding and visual skills, large 200k-token context, and efficient handling of complex tasks.
Excels in coding and instruction following with million-token context window, enabling superior performance on complex, multi-step tasks.
Powerful mid-sized model with GPT-4o-level performance at lower cost and latency, featuring a 1 million token context window for complex tasks.
OpenAI's fastest, most cost-effective model with a full 1 million token context, optimized for classification, autocompletion, and real-time AI agent tasks.
A lightweight, versatile 24B multimodal model handling text and images with extensive multilingual support and 128k token context window.
Gemma 3 has a large 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Performs exhaustive, multi-step research by autonomously searching and synthesizing hundreds of sources into detailed, expert-level reports across domains.
Premium reasoning model for complex, multi-step analysis. Delivers detailed explanations, real-time web search, and extensive citations for thorough answers.
Efficient, multilingual instruction-tuned model designed for privacy-focused, on-device dialogue, summarization, and agentic retrieval across mobile and edge platforms.
Generates high-fidelity, temporally consistent videos from text or images, with readable English and Chinese text, sound effects, and customizable aspect ratios.
Optimized for search-augmented tasks, delivering fast, accurate answers with real-time web data and detailed citations. Excels in research and fact-checking.
Excels at complex, multi-step queries with real-time web search, detailed answers, extensive citations, and customizable information retrieval.
Efficient edge model with native function calling and interleaved sliding-window attention for fast, memory-efficient processing in resource-constrained environments.
Multimodal model handling text and images at native resolution with 128K context window, excelling in visual reasoning tasks like document analysis and image captioning.
Cost-efficient, fast model with 128K context window, supporting text/vision inputs and improved multilingual performance.
Next-generation general-purpose Topaz video upscaler with tunable detail, noise, blur, and grain controls.
Generates vector representations capturing semantic meaning/context for tasks like semantic search, text classification, and clustering. Multilingual support with versatile applications.
Multimodal LLM for real-time text, audio, and visual processing with multilingual support, emotional audio responses, and image generation.
Efficient sparse MoE architecture with 39B active parameters; excels in multilingual tasks, math, and coding, and handles 64K-token contexts.
Powerful LLM with 123B parameters, excelling in multilingual tasks, coding, and reasoning, optimized for single-node inference and long-context applications.
Generates high-quality embeddings for complex text analysis and multilingual applications with 8,191 token context.
Generates compact, efficient embeddings for NLP tasks with multilingual support, balancing performance and low latency.
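The embedding models above are typically paired with a similarity search step. A minimal sketch of ranking documents by cosine similarity; the toy vectors below stand in for output from an embeddings endpoint:

```python
# Sketch: ranking documents by cosine similarity over embedding vectors.
# The toy 3-dimensional vectors are placeholders for real embedding output.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
docs = {
    "doc_a": [1.0, 0.0, 0.0],  # close to the query direction
    "doc_b": [0.0, 1.0, 0.0],  # nearly orthogonal to the query
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # doc_a ranks first
```

Real pipelines use the same ranking logic, usually with a vector index in place of the brute-force sort.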
General-purpose Topaz video upscaling and enhancement with tunable detail, noise, blur, and grain controls.
Efficient Mixture of Experts (8 experts) with 13B active parameters, optimized for multilingual tasks and cost-performance balance.
Balanced performance in natural language and code tasks, efficiently handling longer sequences with innovative attention mechanisms.
Multimodal LLM for agentic applications, handling real-time data integration and multi-step tasks with enhanced reasoning via Thinking Mode and integration with Google tools and third-party functions.
Optimized for edge computing with function-calling capabilities, excelling in knowledge retrieval and commonsense reasoning with 128k token context.
Specializes in complex reasoning through chain-of-thought processing, excelling in STEM tasks like coding, math, and scientific analysis.
Optimized for STEM reasoning and problem-solving, excelling in complex tasks like advanced math and coding with improved cost efficiency.
Efficiently generates multilingual text and code, with dual modes for rapid chat or detailed reasoning; ideal for lightweight AI, agents, and education.
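Several entries above expose a configurable thinking or reasoning budget. A minimal sketch of assembling such a request; the `reasoning` field name and model id are assumptions for illustration, not any specific provider's API, so check your provider's reference for the exact parameter:

```python
# Sketch: building an OpenAI-style chat payload with a reasoning budget cap.
# "example-frontier-model" and the "reasoning" field are hypothetical.
import json

def build_request(prompt: str, budget_tokens: int) -> dict:
    """Assemble a chat-completions-style payload with a thinking budget."""
    return {
        "model": "example-frontier-model",           # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"max_tokens": budget_tokens},  # provider-specific field
    }

payload = build_request("Summarize the trade-offs of MoE models.", 2048)
print(json.dumps(payload, indent=2))
```

Raising the budget generally trades latency and cost for deeper multi-step reasoning; low or zero budgets suit the fast-chat modes several entries describe.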