Welcome to your Workbench

Access 100+ AI models for image generation, video creation, and more.

OpenAI's newest frontier model — improved reasoning over 5.4 at the same 1.05M context, configurable thinking budget, and full tool-use support.

multi-to-text
$6.50/1M tokens

Text-to-image generation with photorealistic output, accurate text rendering, and strong prompt adherence.

text-to-image
$0.22/image

Animates images into video up to 15s at 1080P with first/last-frame guidance, video continuation, and optional driving audio.

multi-to-video
$0.10/sec

Reference-guided video generation with character consistency, multi-character support, optional reference voices, and up to 1080P output.

multi-to-video
$0.10/sec

Text-to-video with multi-shot generation, up to 1080P, 2-15s duration, and optional driving audio.

text-to-video
$0.10/sec

Cheap, fast DeepSeek V4 — 13B active params over a 1M context, well-suited to high-volume traffic and as a default daily-driver.

text-to-text
$0.18/1M tokens

Flagship DeepSeek V4 with a 1M-token context, 49B active params, and native tool use — strong reasoning and code at a fraction of frontier-tier pricing.

text-to-text
$0.57/1M tokens

OpenAI's premium 5.5 variant — top-of-line reasoning at a higher price than 5.5 base, for the hardest agentic and research workloads.

multi-to-text
$39.00/1M tokens

Image-to-image editing with prompt-guided transformations and multi-reference composition.

multi-to-image
$0.22/image

Anthropic's most capable model, with a step-change jump in agentic coding over Opus 4.6 and a native 1M-token context window.

multi-to-text
$5.00/1M tokens

Fast-tier image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.

multi-to-video
$0.31/sec

Edit videos via text instructions or reference images with style transfer, up to 1080p, and flexible audio handling.

multi-to-video
$0.10/sec

Image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 1080p, and optional synchronized audio.

multi-to-video
$0.40/sec

Text-to-video with flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.

text-to-video
$0.40/sec

Flagship 31B dense multimodal model supporting text, image, and video input with 256K context window. Achieves competitive performance with much larger models.

multi-to-text
$0.14/1M tokens

Lightweight 2.3B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.

multi-to-text
$0.00/1M tokens
Fine-tunable

Efficient 4.5B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.

multi-to-text
$0.00/1M tokens
Fine-tunable

Alibaba's latest flagship closed model for advanced reasoning, coding, and complex text generation.

multi-to-text
$0.50/1M tokens

Strongest OpenAI mini model for coding and agentic workloads, with 400K context, 128K max output, multimodal input, and broad tool support.

multi-to-text
$0.75/1M tokens

Video-to-depth estimation with temporal consistency, selectable model size, colormaps, and optional raw depth export.

video-to-video
$0.05/sec

Baseten-configured LTX 2.3 Pro 22B model with IC/Union-Control support for text-to-video and image-conditioned video generation.

multi-to-video
$0.0028/sec

Frontier model for complex professional work with 1.05M context, configurable reasoning, and extensive tool support including computer use and MCP.

multi-to-text
$2.50/1M tokens

Generates high-resolution 4K video at 25 FPS from image and text input, with camera control and synced audio.

multi-to-video
$0.12/sec
Fine-tunable

Audio-to-video generation from image + audio input, 1080p output with synchronized visuals.

multi-to-video
$0.12/sec

Targeted video segment editing: replace video, audio, or both via prompts.

multi-to-video
$0.12/sec

Fast, low-cost Gemini 3.1 model for high-throughput multimodal workloads, with configurable reasoning and a 1M-token context window.

multi-to-text
$0.25/1M tokens

GPT-5.3 Instant model for ChatGPT with 128K context, text and image inputs, and optimized conversational performance.

multi-to-text
$1.75/1M tokens

Hybrid Mamba-Transformer MoE with 1M context, optimized for agentic reasoning; 120B total, 12B active parameters.

text-to-text
$0.30/1M tokens

Nano Banana 2 is a text-to-image model that generates images from text descriptions.

text-to-image
$0.08/image

Nano Banana 2 Edit is an image editing model that enables blending multiple images, maintaining character consistency, targeted transformations using natural language, and leveraging world knowledge for precise edits.

image-to-image
$0.08/image

Compact multimodal model with dual reasoning modes, native vision capabilities, support for over 200 languages, and long-context processing up to 262,144 tokens.

multi-to-text
$0.0014/sec
Fine-tunable

Multimodal LLM with native vision, image and video understanding, tool calling, optional thinking mode, support for 201 languages, and long-context processing up to 262,144 tokens.

multi-to-text
$0.0014/sec
Fine-tunable

Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).

multi-to-text
$0.0014/sec
Fine-tunable

Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).

multi-to-text
$0.0014/sec
Fine-tunable

Image generation with built-in reasoning, example-based editing, multi-reference control (up to 14 images), and 3K resolution support.

text-to-image
$0.04/image

Flagship Gemini 3 reasoning model for complex multimodal and agentic workflows with a 1M-token context window.

multi-to-text
$2.00/1M tokens

Anthropic's latest Sonnet model with strong coding and agent performance, fast latency, and improved long-context reasoning.

multi-to-text
$3.00/1M tokens

Pro version of Qwen Image 2 with enhanced text rendering, realism, and semantic adherence for high-quality image generation and editing.

multi-to-image
$0.08/image

Anthropic's most advanced model, excelling in coding, agentic workflows, computer use, reasoning, math, and domain expertise in finance, law, and STEM.

multi-to-text
$5.00/1M tokens

Animates a static image into native 4K motion with optional start/end frame anchoring and synchronized audio.

multi-to-video
$0.55/sec

Native 4K reference-to-video generation from element and style references with optional frame anchoring and synchronized audio.

multi-to-video
$0.55/sec

Native 4K text-to-video generation with cinema-grade detail and optional synchronized audio.

text-to-video
$0.55/sec

Kling Video O3 Pro is an advanced image-to-video generation model that animates static images into high-quality videos based on text prompts.

multi-to-video
$0.22/sec

Kling o3 Pro reference-to-video model generates videos from a reference image and text prompt describing motion and cinematic intent.

multi-to-video
$0.22/sec

Edit videos using text prompts and reference images for character consistency or object replacement.

video-to-video
$0.34/sec

Motion transfer from reference video to character image. Cost-effective for portraits and simple animations.

multi-to-video
$0.17/sec

Grok Imagine text-to-image is a high-quality image generation model from xAI that produces cinematic, stylistically consistent images from text prompts.

text-to-image
$0.02/image

Grok Imagine - Image Edit is a high-quality image generation model from xAI that produces cinematic, stylistically consistent images from text prompts.

image-to-image
$0.02/image

Video editing model for prompt-driven modifications like object swapping, scene restyling, and character animation with synced native audio.

video-to-video
$0.08/sec

FLUX.2 Klein 4B is a compact 4 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.

multi-to-image
$0.01/image
Fine-tunable

FLUX.2 Klein 9B is a compact 9 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.

multi-to-image
$0.02/image
Fine-tunable

Multimodal LLM for targeted video editing: regenerate 2-16s segments (video/audio/both) via prompts, preserving motion, lighting, and continuity.

video-to-video
$0.10/sec

Generates high-resolution 4K video at 25 FPS from image and text input, with camera control and synced audio.

multi-to-video
$0.12/sec
Fine-tunable

Native multimodal agentic model with vision, Agent Swarm (up to 100 sub-agents, 1,500 tool calls), coding from visual specs, and 256K context.

multi-to-text
$0.60/1M tokens

Frontier open LLM with advanced coding, agentic, and reasoning capabilities; 744B MoE with DSA for efficient 200K context.

text-to-text
$0.95/1M tokens

Generative image model with improved photorealistic human portraits, finer natural scenes (landscapes, animal fur, and other natural elements), and better text rendering overall.

text-to-image
$0.20/image
Fine-tunable

Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.

image-to-image
$0.03/image
Fine-tunable

Fast multimodal model with configurable reasoning, strong agentic workflows, long context, and tool use for interactive chat, coding, and complex tasks.

multi-to-text
$0.05/1M tokens

Transforms static images into cinematic videos with synchronized audio, dialogue, and sound effects in 1080p.

image-to-video
$0.07/sec

Generates 1080p videos from text with native synchronized audio, including dialogue, sound effects, and lip-sync.

text-to-video
$0.07/sec

Diffusion model for high‑fidelity image generation and editing, with strong prompt adherence, preserved composition and lighting, and adjustable quality controls.

text-to-image
$8.00/1M tokens

Animates images into 15s, 1080p videos with preserved identity, native audio, lip-sync, and multi-shot sequences guided by reference videos.

multi-to-video
$0.10/sec

Generates videos from reference videos, maintaining character consistency, with multi-shot narratives, up to 15s duration, and native audio sync.

video-to-video
$0.10/sec

Frontier model for professional work with configurable reasoning effort, 400K context, structured outputs, and distillation support.

multi-to-text
$1.75/1M tokens

GPT-5.2 model optimized for ChatGPT with 128K context, text and image input support, streaming, and structured outputs.

multi-to-text
$1.75/1M tokens

High-fidelity text-to-image and image-to-image generation with multi-reference control (up to 10 images), 4K support, and batch output.

text-to-image
$0.04/image

Transforms images (with text and up to 7 references) into cinematic video clips with stable characters, controlled motion, and consistent environments.

multi-to-video
$0.12/sec

Multimodal video model for reference-guided generation, preserving characters and styles from reference images.

image-to-video
$0.11/sec

Text-guided video-to-video editing that preserves motion and continuity while enabling character swaps, style changes, motion transfer, and scene transformations.

video-to-video
$0.18/sec

Fast photorealistic text-to-image model with accurate English and Chinese on-image text, ideal for interactive design, marketing visuals, and UI/UX workflows.

text-to-image
$0.01/image
Fine-tunable

Generates photorealistic images with precise multi-reference editing, excels at legible text and infographics, and supports rapid LoRA fine-tuning workflows.

multi-to-image
$0.01/image
Fine-tunable

Delivers high-quality image generation and editing with advanced text rendering, multi-image reference for style consistency, and precise, JSON-based prompt control.

multi-to-image
$0.12/image

Delivers photorealistic, high-resolution images with advanced multi-reference editing, precise pose and color control, and reliable prompt and text adherence for professionals.

multi-to-image
$0.10/image

Excels at long-horizon reasoning, advanced coding, dynamic effort control, robust multimodal tasks, and detailed computer interface inspection for complex workflows.

multi-to-text
$5.00/1M tokens

Delivers high-fidelity images with advanced text rendering, consistent character identities, and precise prompt following for professional visual design and branding.

image-to-image
$0.15/image

Zero-shot image segmentation with text/visual prompts; exhaustive instance detection and presence head reduce false positives.

image-to-image
$0.01/image

Detects, segments, and tracks objects across video frames using text, exemplars, points, or masks, with memory for occlusions and real-time streaming.

video-to-video
$0.02/image

Automatically routes prompts to fast or deep reasoning modes, with adaptive effort, enhanced tone and style controls, and improved coding and math.

multi-to-text
$1.25/1M tokens

Professional-grade image upscaling powered by AI, from Topaz Labs.

image-to-image
$0.05/image

Professional-grade video upscaling powered by AI, from Topaz Labs.

video-to-video
$0.04/sec

Generates high-fidelity videos with native synced audio, offering strong narrative control, scene consistency, image-to-video animation, and multi-shot support.

multi-to-video
$0.20/sec

Animates an input image into short videos with controllable motion, duration, aspect ratio, resolution, and optional audio.

multi-to-video
$0.10/sec

Animates a single image into short videos with controllable motion, duration, aspect ratio, and cost-efficient quality settings.

multi-to-video
$0.05/sec

Generates high-quality 1080p videos up to 12s with synced native audio, multi-scene reasoning, timeline prompting, and realistic physics.

multi-to-video
$0.30/sec

Optimized for rapid, high-volume multimodal tasks with a 1M-token context window, delivering strong reasoning and cost efficiency for enterprise workflows.

multi-to-text
$0.10/1M tokens

Transforms single images into smooth, cinematic videos with natural motion, realistic camera work like dolly zooms, and preserved style.

multi-to-video
$0.07/sec

Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.

image-to-image
$0.03/image
Fine-tunable

Generates photorealistic images with precise prompt and text rendering, mask-free editing, and layout-aware outpainting, ideal for creative and multilingual content.

image-to-image
$0.04/image

Delivers ultra-fast, high-resolution image generation, precise natural-language editing, and consistent multi-image output—ideal for creative, batch, or professional workflows.

text-to-image
$0.03/image

Anthropic's most advanced AI model, excelling in coding, agent-based tasks, and computer usage. It delivers high performance in reasoning, math, and domain-specific knowledge across fields like finance, law, and STEM.

multi-to-text
$3.00/1M tokens

Enables precise bilingual text and semantic edits with strong consistency, advanced multi-image editing, and native pose/control support for creative compositions.

image-to-image
$0.03/image
Fine-tunable

Lightweight multimodal model for visual Q&A, multilingual OCR, document and UI understanding, and agentic screen interpretation in constrained environments.

multi-to-text
$0.0014/sec
Fine-tunable

Multimodal LLM for text and images, excelling in visual QA, document/UI understanding, spatial reasoning, image captioning, and multimodal coding.

multi-to-text
$0.0014/sec
Fine-tunable

Versatile multimodal large language model capable of understanding and generating both text and images. Built on the Qwen3 architecture, it provides strong general reasoning, detailed image interpretation, and instruction-following performance in a compact 8B parameter size.

multi-to-text
$0.0014/sec
Fine-tunable

Open-weight text-to-image model with advanced prompt adherence, anatomically accurate details, and powerful tools for inpainting, outpainting, and structural edits.

text-to-image
$0.03/image
Fine-tunable

Handles complex reasoning, code generation, and multimodal inputs with improved accuracy, long context retention, and robust multilingual and personalization features.

text-to-text
$1.25/1M tokens

Optimized for cost and speed, handles long contexts, supports text and image input, and excels at structured outputs and tool integration for precise tasks.

text-to-text
$0.25/1M tokens

Multimodal model optimized for ultra-fast, cost-efficient summarization and classification, supporting both text and image inputs with real-time streaming output.

text-to-text
$0.05/1M tokens

Excels at complex coding, autonomous research, and agent workflows, with advanced reasoning and a 200,000-token context for deep analysis and synthesis.

text-to-text
$15.00/1M tokens

Built with a Mixture-of-Experts design, delivers efficient, transparent reasoning, tool use, and agentic capabilities, even with 128K token context windows.

text-to-text
$0.15/1M tokens

Delivers strong reasoning and chain-of-thought, agentic features, and multilingual support, optimized for local deployment and efficient use on modest hardware.

text-to-text
$0.07/1M tokens
Fine-tunable

An image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and supports a wide range of artistic styles, from photorealistic scenes and impressionist paintings to anime aesthetics and minimalist design.

text-to-image
$0.03/image
Fine-tunable

Delivers high-fidelity text-to-video synthesis at 480p/720p using dual expert models for scene layout and fine motion detail, ideal for creative production.

text-to-video
$0.0014/sec
Fine-tunable

Unified text-to-video and image-to-video model generates high-definition 720p, 24fps video clips efficiently on consumer GPUs, with advanced compression for speed.

text-to-video
$0.0014/sec
Fine-tunable

Fast, cost-efficient multimodal reasoning model with million-token context for high-volume applications requiring speed and versatility.

multi-to-text
$0.30/1M tokens

Delivers precise, iterative image editing and generation with consistent character, style, and text changes—using multimodal input for seamless scene transformations.

image-to-image
$0.03/image
Fine-tunable

Excels at deep reasoning, complex coding, and autonomous agent workflows with sustained performance, extended thinking, tool use, and memory across tasks.

text-to-text
$15.00/1M tokens

Balances intelligence with efficiency for coding, research, and automation tasks; excels in reasoning, content generation, and nuanced instruction following.

multi-to-text
$3.00/1M tokens

Generates realistic text- and image-conditioned videos with native synchronized audio, including dialogue, ambient sound, and effects.

multi-to-video
$0.20/sec

Excels at building interactive web apps, advanced code editing and agentic workflows, with native multimodality and strong video-to-code capabilities.

multi-to-text
$1.25/1M tokens

Dual reasoning modes enable rapid or step-by-step responses, with robust support for over 100 languages and long-context processing up to 262,144 tokens.

text-to-text
$0.0014/sec
Fine-tunable

Efficient conversational AI for resource-limited devices with multilingual support, document summarization, translation, code generation, and simple information retrieval.

text-to-text
$0.00046/sec
Fine-tunable

Excels at advanced reasoning, coding, math, and visual tasks with simulated reasoning, tool use, web browsing, and image understanding integration.

multi-to-text
$2.00/1M tokens

Optimized for fast, affordable reasoning with strong coding and visual skills, large 200k-token context, and efficient handling of complex tasks.

multi-to-text
$1.10/1M tokens

Excels in coding and instruction following with million-token context window, enabling superior performance on complex, multi-step tasks.

multi-to-text
$2.00/1M tokens

Powerful mid-sized model with GPT-4o-level performance at lower cost and latency, featuring a 1 million token context window for complex tasks.

multi-to-text
$0.40/1M tokens

OpenAI's fastest, cost-effective model with full 1 million token context, optimized for classification, autocompletion, and real-time AI agent tasks.

multi-to-text
$0.10/1M tokens

A lightweight, versatile 24B multimodal model handling text and images with extensive multilingual support and 128k token context window.

text-to-text
$0.10/1M tokens

Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.

text-to-text
$0.20/1M tokens

Performs exhaustive, multi-step research by autonomously searching and synthesizing hundreds of sources into detailed, expert-level reports across domains.

text-to-text
$5.00/1M tokens

Premium reasoning model for complex, multi-step analysis. Delivers detailed explanations, real-time web search, and double citations for thorough answers.

text-to-text
$3.00/1M tokens

Efficient, multilingual instruction-tuned model designed for privacy-focused, on-device dialogue, summarization, and agentic retrieval across mobile and edge platforms.

text-to-text
$0.00046/sec
Fine-tunable

Generates high-fidelity, temporally consistent videos from text or images, with readable English and Chinese text, sound effects, and customizable aspect ratios.

text-to-video
$0.0014/sec
Fine-tunable

Optimized for search-augmented tasks, delivering fast, accurate answers with real-time web data and detailed citations. Excels in research and fact-checking.

text-to-text
$2.00/1M tokens

Excels at complex, multi-step queries with real-time web search, detailed answers, extensive citations, and customizable information retrieval.

text-to-text
$5.00/1M tokens

Efficient edge model with native function calling and interleaved sliding-window attention for fast, memory-efficient processing in resource-constrained environments.

text-to-text
$0.10/1M tokens

Multimodal model handling text and images at native resolution with 128K context window, excelling in visual reasoning tasks like document analysis and image captioning.

text-to-text
$0.15/1M tokens

Cost-efficient, fast model with 128K context window, supporting text/vision inputs and improved multilingual performance.

multi-to-text
$0.15/1M tokens

Generates vector representations capturing semantic meaning/context for tasks like semantic search, text classification, and clustering. Multilingual support with versatile applications.

text-to-embeddings
$0.02/1M tokens

Multimodal LLM for real-time text, audio, and visual processing with multilingual support, emotional audio responses, and image generation.

multi-to-text
$2.50/1M tokens

Efficient sparse MoE architecture with 39B active parameters that excels in multilingual tasks, math, and coding, and handles 64K-token contexts.

text-to-text
$2.00/1M tokens

Powerful LLM with 123B parameters, excelling in multilingual tasks, coding, and reasoning, optimized for single-node inference and long-context applications.

text-to-text
$2.00/1M tokens

Generates high-quality embeddings for complex text analysis and multilingual applications with 8,191 token context.

text-to-embeddings
$0.13/1M tokens

Generates compact, efficient embeddings for NLP tasks with multilingual support, balancing performance and low latency.

text-to-embeddings
$0.02/1M tokens

Efficient Mixture of Experts (8 experts) with 13B active parameters, optimized for multilingual tasks and cost-performance balance.

text-to-text
$0.70/1M tokens

Balanced performance in natural language and code tasks, efficiently handling longer sequences with innovative attention mechanisms.

text-to-text
$0.25/1M tokens

Multimodal LLM for agentic applications, handling real-time data integration and multi-step tasks with enhanced reasoning via Thinking Mode, integrating Google tools and third-party functions.

multi-to-text
$0.10/1M tokens

Optimized for edge computing with function-calling capabilities, excelling in knowledge retrieval and commonsense reasoning with 128k token context.

text-to-text
$0.04/1M tokens

Specializes in complex reasoning through chain-of-thought processing, excelling in STEM tasks like coding, math, and scientific analysis.

text-to-text
$15.00/1M tokens

Optimized for STEM reasoning and problem-solving, excelling in complex tasks like advanced math and coding with improved cost efficiency.

text-to-text
$1.10/1M tokens

Efficiently generates multilingual text and code, with dual modes for rapid chat or detailed reasoning; ideal for lightweight AI, agents, and education.

text-to-text
$0.00046/sec
Fine-tunable

Generates 480P videos from text prompts on consumer GPUs, with multilingual support, image-to-video, aspect ratio control, and audio integration features.

text-to-video
$0.0014/sec
Fine-tunable