OpenRouter Release Notes

Name: OpenRouter
Brand: OpenRouter


118 release notes curated from 115 sources by the Releasebot Team. Last updated: Jul 17, 2026

Jul 16, 2026
Date parsed from source:
Jul 16, 2026

First seen by Releasebot:
Jul 17, 2026
OpenRouter

Every Modality Through One API

OpenRouter adds a unified multimodal API for chat, images, video, audio, transcription, and embeddings through one OpenAI-compatible base URL. It highlights one key, one bill, provider routing and failover, plus dedicated endpoints for generation and transcription.
Tl;dr

Set one base URL (https://openrouter.ai/api/v1) and call image, video, audio, embeddings, and transcription through it. Switch modality by changing the model string and the content type.

Most input modalities ride the /chat/completions endpoint. Five have dedicated endpoints: /images, /videos, /audio/speech, /audio/transcriptions, and /embeddings.

The same provider routing object (failover, data_collection: "deny", cost/latency sort) works on an embeddings call exactly as it does on a chat call.

One API key, one bill, one OpenAI-shaped request format across all 5 modalities. We don’t mark up provider pricing, and failed requests aren’t billed.

Real limits to plan around: embeddings don’t stream, audio input is base64 only, and video URL support is provider-specific.

Can one API handle image, video, audio, embeddings, and transcription?

Yes. One base URL serves every modality, and you switch between them by changing the model string and the request content type. Set https://openrouter.ai/api/v1 as your base URL, pass your API key as a Bearer token, and you reach the full catalog through one OpenAI-compatible interface.

Wiring up 4 provider SDKs means each provider brings its own auth refresh, retry and backoff semantics, rate-limit headers, streaming format, and error schema. You write that glue 4 times, maintain it 4 times, and a fix for one provider only ever fixes its own corner.

We’re a drop-in replacement for the OpenAI Chat API, so the same request format carries across every modality that rides /chat/completions, and the TTS endpoint follows the OpenAI Audio API. The dedicated endpoints (image generation, video generation, transcription, and embeddings) each have their own request shape, but our official SDKs (@openrouter/sdk for TypeScript, openrouter for Python) wrap all of it behind one interface.

Which endpoint does each modality use?

Most input modalities ride /chat/completions and differ only by content type. Five modalities have dedicated endpoints. Here’s the full map, grounded in our multimodal overview and embeddings reference.
Modality | Endpoint | How you call it
Text / chat | POST /api/v1/chat/completions | messages array
Image input (vision) | POST /api/v1/chat/completions | image_url content type
PDF | POST /api/v1/chat/completions | file content type
Audio input | POST /api/v1/chat/completions | input_audio content type
Video input | POST /api/v1/chat/completions | video_url content type
Image generation | POST /api/v1/images | prompt in, base64 images out
Video generation | POST /api/v1/videos (async) | submit prompt, get job ID, poll
Text-to-speech | POST /api/v1/audio/speech | text in, MP3/PCM bytes out
Transcription (STT) | POST /api/v1/audio/transcriptions | base64 audio in, JSON text + usage out
Embeddings | POST /api/v1/embeddings | text or text+image, vectors out

Five of these modalities run on /chat/completions and change only the content type in the message array. Five have their own endpoints because their call shape is different: image generation takes a prompt plus image-specific knobs (resolution, aspect ratio, output format) and returns base64 images, video generation is asynchronous (you poll a job), speech and transcription move raw audio bytes, and embeddings return vectors instead of completions. Single-provider docs rarely lay this out side by side, because no single provider serves all of them.
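
The endpoint map can be captured as a small lookup. This is a minimal Python sketch that assumes only the base URL and paths listed in the table; the `endpoint_for` helper and the modality keys are illustrative names, not part of any SDK.

```python
BASE_URL = "https://openrouter.ai/api/v1"

# Paths copied from the endpoint map; the dictionary keys are illustrative labels.
DEDICATED_ENDPOINTS = {
    "image_generation": "/images",
    "video_generation": "/videos",
    "text_to_speech": "/audio/speech",
    "transcription": "/audio/transcriptions",
    "embeddings": "/embeddings",
}

# Input modalities that ride /chat/completions, with the content type each uses.
CHAT_CONTENT_TYPES = {
    "image_input": "image_url",
    "pdf": "file",
    "audio_input": "input_audio",
    "video_input": "video_url",
}


def endpoint_for(modality: str) -> str:
    """Return the full URL a given modality is called on."""
    if modality == "text" or modality in CHAT_CONTENT_TYPES:
        return BASE_URL + "/chat/completions"
    if modality in DEDICATED_ENDPOINTS:
        return BASE_URL + DEDICATED_ENDPOINTS[modality]
    raise ValueError(f"unknown modality: {modality}")


if __name__ == "__main__":
    for name in ("text", "image_input", "video_generation", "embeddings"):
        print(name, "->", endpoint_for(name))
```

Keeping the map in one place means a new modality is one dictionary entry rather than a new client.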

You can test multimodal inputs without paying. The free tier needs no credit card, and free models run under low daily rate limits that rise once you’ve added credits. That’s enough to send an image to a vision model or generate a batch of embeddings before you commit.

When should you use each modality?

Generation and understanding are different tasks even within the same media type, and the endpoint you reach for depends on which one you’re doing.

Image: generation vs. understanding.

Use image generation when you need a new image, and image input when you have one to analyze. Generation produces assets, mockups, and illustrations from a text prompt through a POST to the dedicated /api/v1/images endpoint, with optional reference images for image-to-image work. Vision input goes the other way: you send an image_url on /chat/completions, and the model does OCR, description, or detection. See the image generation docs for the full walkthrough.
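
A vision request body might look like the sketch below. It assumes the content-part layout of the OpenAI Chat format, which the post describes as the shape /chat/completions accepts; the `vision_request` helper and the model slug are placeholders, so check the models page for a real vision-capable slug.

```python
import json


def vision_request(model: str, question: str, image_url: str) -> dict:
    """Build a /chat/completions body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    # The image travels as an image_url content part.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


if __name__ == "__main__":
    body = vision_request(
        "example/vision-model",  # hypothetical slug, not a real catalog entry
        "What text appears in this image?",
        "https://example.com/receipt.png",
    )
    print(json.dumps(body, indent=2))
```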

Video: input vs. generation.

Use the async /videos endpoint to produce clips and video_url on chat to understand them. Video generation submits a prompt and returns a job ID you poll until the clip is ready, with configurable resolution, aspect ratio, and duration. Video understanding sends a video_url to a video-capable model for analysis, action recognition, or object detection. More in the video generation announcement.
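
The submit-then-poll pattern can be sketched without tying it to a specific response schema. In this Python sketch the status fetch is passed in as a function, and the status names ("completed", "failed") are assumptions for illustration; the real field names belong to the video generation docs.

```python
import time
from typing import Callable


def poll_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        # "completed" and "failed" are assumed status names, not confirmed ones.
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("video job did not finish in time")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated job that finishes on the third poll; no network involved.
    states = iter(["pending", "in_progress", "completed"])
    print(poll_job(lambda: {"status": next(states)}, interval_s=0.0))
```

In a real integration, `fetch_status` would wrap a request for the job ID returned by the submit call.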

Audio and speech: output vs. analysis.

Use /audio/speech for voice output and audio input on chat for analysis. Text-to-speech sends text to /api/v1/audio/speech and returns MP3 or PCM bytes through an OpenAI Audio-compatible endpoint, so OpenAI client libraries work against it. Audio input rides /chat/completions with the input_audio content type for tasks like sentiment or content analysis. Details in the audio APIs announcement.
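
Because audio input is base64 only, the encoding step happens on the client. This is a minimal sketch that assumes the OpenAI-style `input_audio` content part (a `data` string plus a `format` label); the helper names and the model slug are illustrative.

```python
import base64


def audio_part(audio_bytes: bytes, audio_format: str) -> dict:
    """Wrap raw audio bytes as a base64 input_audio content part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": encoded, "format": audio_format},
    }


def audio_request(model: str, prompt: str, audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Build a /chat/completions body that pairs a text prompt with audio."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    audio_part(audio_bytes, audio_format),
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Stand-in bytes; a real call would read them from a local audio file.
    body = audio_request("example/audio-model", "Summarize the tone of this clip.", b"RIFF")
    print(body["messages"][0]["content"][1])
```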

Embeddings: retrieval and similarity.

Use embeddings when you need retrieval or similarity, not generation. The embeddings docs name 6 jobs: RAG, semantic search, recommendations, clustering, duplicate detection, and anomaly detection. You can batch many inputs in one request, and some models accept text and an image together to produce a single joint vector (nvidia/llama-nemotron-embed-vl-1b-v2 is one).
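
Retrieval and similarity usually come down to comparing vectors. The sketch below ranks documents by cosine similarity in plain Python; the vectors are toy values standing in for what /embeddings would return, and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank(query: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document IDs ordered from most to least similar to the query."""
    return sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]), reverse=True)


if __name__ == "__main__":
    docs = {"billing": [0.9, 0.1], "routing": [0.1, 0.9]}
    print(rank([1.0, 0.0], docs))
```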

Transcription: speech to text.

Use /audio/transcriptions for speech-to-text. You send base64-encoded audio and get back JSON with the transcribed text plus usage statistics. It fits meeting notes, voice commands, and captioning.

Do routing and failover work for embeddings and image calls too?

Yes. The same provider object you use on a chat call works identically on an embeddings call: provider order, automatic failover, data-collection policy, and cost or latency sort. Here’s the exact shape from the embeddings docs:

{
  "model": "openai/text-embedding-3-small",
  "input": "Your text here",
  "provider": {
    "order": ["openai", "azure"],
    "allow_fallbacks": true,
    "data_collection": "deny"
  }
}

The routing controls carry to the dedicated image endpoint too: /api/v1/images accepts provider.order, provider.allow_fallbacks, provider.only, provider.ignore, and provider.sort, so failover, ordering, and cost/latency sort work the same way on an image generation call as on a chat call.

An embedding model served by more than one provider can fall back from one to another when the first returns an error. The same cross-provider failover applies to embeddings, image, audio, and chat alike through the provider object.
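
Since the provider object is the same across endpoints, one helper can attach it to any request body. This is a sketch using only the fields shown in the embeddings example (order, allow_fallbacks, data_collection); the `with_routing` helper and the chat model slug are hypothetical.

```python
import copy


def with_routing(body, order, allow_fallbacks=True, data_collection="deny"):
    """Return a copy of a request body with a provider routing object attached."""
    routed = copy.deepcopy(body)
    routed["provider"] = {
        "order": list(order),
        "allow_fallbacks": allow_fallbacks,
        "data_collection": data_collection,
    }
    return routed


if __name__ == "__main__":
    embeddings_body = with_routing(
        {"model": "openai/text-embedding-3-small", "input": "Your text here"},
        order=["openai", "azure"],
    )
    chat_body = with_routing(
        # Hypothetical chat slug; any chat model would carry the same object.
        {"model": "example/chat-model", "messages": [{"role": "user", "content": "Hi"}]},
        order=["openai", "azure"],
    )
    # The routing object is identical on both calls.
    print(embeddings_body["provider"] == chat_body["provider"])
```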

We don’t mark up provider pricing: the rate in the model catalog is what you pay. Zero Completion Insurance means a failed run isn’t billed, so a request that fails over and never completes costs nothing. That holds across modalities.

What do you actually save by consolidating?

One API key, one bill, one request format across every modality.

The same Bearer token authorizes a vision call, a TTS call, and an embeddings call. There’s no separate key vault per provider and no per-modality onboarding. When you add a new capability, say you start doing RAG, you call /embeddings with the key you already have.

Consolidated billing means usage across modalities lands on a single OpenRouter statement at catalog rates. You can compare what image generation cost versus embeddings in one place, rather than exporting CSVs from 4 dashboards. That reconciliation friction is exactly what the r/ShowYourApp builder ran into when the all-in-one stack got harder than expected.

What are the limits to plan around?

Each modality has constraints worth knowing before you build. These are our own current limits, stated plainly so you can design around them rather than discover them in production.
Limit | Modality | What it means for you
No streaming | Embeddings | Responses come back complete, not token by token. Plan synchronous handling.
Deterministic output | Embeddings | Same input gives the same vector. Cache aggressively.
Base64 only | Audio input | Audio can’t be passed by URL. Encode local files first.
Provider-specific URLs | Video input | URL support varies. Gemini on AI Studio accepts only YouTube links.
Model-by-model support | All | Not every model supports every modality. We auto-filter by content.
Free-tier rate limits | All | Free models have low daily limits that rise once you add credits.

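Deterministic embeddings make caching straightforward: the same model and input can reuse a stored vector. The sketch below is one possible in-memory cache, with the embedding call injected as a function so the example runs offline; `EmbeddingCache` and the fake embedder are illustrative, not part of any SDK.

```python
import hashlib
import json
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by a hash of (model, input text)."""

    def __init__(self, embed: Callable[[str, str], List[float]]):
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # counts how often the underlying call was made

    def get(self, model: str, text: str) -> List[float]:
        key = hashlib.sha256(json.dumps([model, text]).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(model, text)
        return self._store[key]


if __name__ == "__main__":
    # Fake embedder standing in for a real /embeddings request.
    cache = EmbeddingCache(lambda model, text: [float(len(text))])
    cache.get("openai/text-embedding-3-small", "hello")
    cache.get("openai/text-embedding-3-small", "hello")
    print("underlying calls:", cache.misses)
```

The model is part of the key because different models produce different vectors for the same text.
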
The unified API earns its place when you need more than one media type, when swapping models is a one-string change, or when a provider outage shouldn’t take your feature down. If your entire app is a single chat feature against one general-purpose model from one provider, a direct integration is simpler and the consolidation upside is thinner.

Start with one call

Send a single embeddings request against the base URL.

Example Python code:

import requests

response = requests.post(
    "https://openrouter.ai/api/v1/embeddings",
    headers={
        "Authorization": "Bearer <OPENROUTER_API_KEY>",
        "Content-Type": "application/json",
    },
    json={
        "model": "openai/text-embedding-3-small",
        "input": "The quick brown fox jumps over the lazy dog",
    },
)

print(response.json()["data"][0]["embedding"][:5])

Example TypeScript code:

import { OpenRouter } from '@openrouter/sdk';

const openRouter = new OpenRouter({ apiKey: process.env.OPENROUTER_API_KEY });

const response = await openRouter.embeddings.generate({
  model: 'openai/text-embedding-3-small',
  input: 'The quick brown fox jumps over the lazy dog',
});

console.log(response.data[0].embedding);

Example curl:

curl https://openrouter.ai/api/v1/embeddings \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/text-embedding-3-small", "input": "The quick brown fox jumps over the lazy dog"}'

From here, go deeper per modality with the multimodal overview, or browse models by output modality to find what fits each call.

Frequently asked questions

Can I use one API for image generation, embeddings, and transcription?

Yes. All three run through our base URL at https://openrouter.ai/api/v1, with one API key. Image generation uses the dedicated /images endpoint, embeddings use /embeddings, and transcription uses /audio/transcriptions. You change the endpoint and content type, not the integration, the auth, or the API key.

Does OpenRouter support embeddings?

Yes. Embeddings run through POST /api/v1/embeddings and return vectors for RAG, semantic search, recommendations, clustering, duplicate detection, and anomaly detection. You can batch multiple inputs in one request, and some models accept text and an image together for a joint vector.

Which modalities use the chat endpoint vs. a dedicated endpoint?

Text, image input, PDF, audio input, and video input all use /chat/completions and differ only by content type. Image generation (/images), video generation (/videos), text-to-speech (/audio/speech), transcription (/audio/transcriptions), and embeddings (/embeddings) use dedicated endpoints, because their call shapes differ: prompt-to-image requests, async jobs, raw audio bytes, or returned vectors instead of completions.

Are there free AI APIs that support multimodal inputs?

Yes. We have a free tier at OpenRouter, no credit card required. Free models run under low daily rate limits that rise once you’ve added credits, which is enough to send images, generate embeddings, or test other modalities before you commit.

Can I send text and an image in one embeddings request?

Yes, with multimodal embedding models. You wrap the input in a content array containing text and image_url objects, and the model returns a single joint vector that captures both. nvidia/llama-nemotron-embed-vl-1b-v2 is one model, useful when you want text and images to share a single retrieval space.

Do provider routing and failover work for embeddings and image calls too?

Yes. The same provider routing controls (order, allow_fallbacks, cost/latency sort) apply to embeddings, image, audio, and chat calls. If a provider errors, the call fails over to the next one serving that model, and a failed run is never billed.
Jun 25, 2026
Date parsed from source:
Jun 25, 2026

First seen by Releasebot:
Jun 26, 2026
OpenRouter

The OpenRouter MCP Server

OpenRouter launches the MCP server, giving coding agents live model rankings, pricing, docs, benchmarks, and test inference so they can choose the best model with current data instead of stale guesses. It also adds OAuth-based setup, capped keys, and provider-aware model testing.
Your coding agent is incredible at writing code.

But when it comes to choosing the right model for, say, coding without blowing through your monthly budget in one day, or the best model for designing a landing page, it really struggles.

Your agent can make an approximate guess of the “best” model, but it’s guessing from training data that is months stale, with no knowledge of how much it costs, how well it performs for a given task, which provider you should pin it to, etc.

No more.

Today, we’re very excited to announce the release of the OpenRouter MCP.

The OpenRouter MCP server puts live model data, benchmark rankings, pricing, docs, and test inference directly in your agent’s hands, so you and your agent can make the right decision about which model to use. Install in one command, and your favorite agent can answer “which model is the best at coding without bankrupting me” with the most up-to-date data from Artificial Analysis, Design Arena, and OpenRouter’s own model rankings. Hint: it’s GLM-5.2.

Connect now | Docs

Install in one command

Claude Code:

claude mcp add --transport http openrouter https://mcp.openrouter.ai/mcp
claude mcp login openrouter

Codex CLI:

codex mcp add openrouter --url https://mcp.openrouter.ai/mcp
codex mcp login openrouter

Cursor: Add to ~/.cursor/mcp.json :

{
  "mcpServers": {
    "openrouter": {
      "url": "https://mcp.openrouter.ai/mcp"
    }
  }
}

See the connect guide for OAuth login and every supported client.

Pick the right model without tab-switching

You’re building a feature that needs structured JSON output, and you want a model that’s fast, cheap, and actually good at it. Normally you’d open the OpenRouter website, browse the model list, compare benchmarks, check pricing, maybe run a few test prompts in the playground. That’s 15 minutes of context-switching before you write a single line of integration code.

With the MCP server connected, you can do this all in your coding agent:

You: "I need a model for structured JSON extraction from legal documents.
Fast, under $1/M input tokens, good at following schemas."
Agent: [calls models-list with filters] → [calls benchmarks] → [calls model-endpoints]
Agent: "google/gemini-3-flash-preview fits well: $0.10/M input,
138k context, strong structured output support. Here's the
endpoint with the lowest latency..."

The agent pulls from the live model catalog, cross-references Artificial Analysis intelligence scores and Design Arena ELO rankings, and checks per-provider pricing and latency. The recommendation is fully grounded in current data, not whatever was true when the model was last trained 6 months ago.

Test before you commit

chat-send lets your agent fire off a test prompt to any model and see the response, cost, and which provider served it. Your agent can compare answers across models side-by-side:

You: "Compare how Claude Opus 4.8, GPT-5.5, and DeepSeek V4 Pro
handle this structured extraction prompt."
Agent: [sends the same prompt to all three via chat-send]
[calls generation-get for each to get cost breakdowns]
Agent: "All three produced valid JSON. Opus 4.8 nailed the edge
case in row 12. GPT-5.5 was 40% cheaper. DeepSeek V4 Pro
was fastest at 180ms TTFB."

Model slugs support suffixes:
:online for web search
:nitro for speed
:floor for the lowest price
:free for free endpoints

Your agent can test across variants without you memorizing the syntax.
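
The suffix rule is simple enough to express in a few lines. This Python sketch covers only the four suffixes listed above; the `with_variant` helper is an illustrative name, and the slug in the usage example is the one from the agent transcript earlier in this post.

```python
# Suffixes named in this post; other variants may exist in the docs.
VARIANTS = {"online", "nitro", "floor", "free"}


def with_variant(slug: str, variant: str) -> str:
    """Return the model slug with a routing variant suffix, replacing any existing one."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    base = slug.split(":", 1)[0]  # drop a suffix that is already present
    return f"{base}:{variant}"


if __name__ == "__main__":
    for variant in sorted(VARIANTS):
        print(with_variant("google/gemini-3-flash-preview", variant))
```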

Search the docs without leaving your editor

Your agent has docs-search, which runs a full-text search across the OpenRouter documentation. “How do I pin a model to a specific provider?” “What’s the format for tool calling?” “How does prompt caching work?” Your agent finds the answer and applies it, all in one flow.

This is where the MCP server earns its keep as a development assistant. Your agent can look up the exact API parameter it needs, check the right request format, and wire it into your code without you having to find and read the docs page yourself.

A dedicated, capped key

The server is remote (nothing installed locally), and the first login runs an OAuth flow that mints a dedicated API key with a 7-day expiry and a $10 spend cap (editable on the approval screen). It’s separate from your other keys and shows up on your keys dashboard. You can revoke it any time.

See the connect guide for setup in OpenCode, Claude Desktop, and every other supported client.

What’s in the toolbox
Tool | What it does
models-list | Search the live model catalog with filters: price range, context length, modality, provider, model family, and more
model-get | Full details for one model: capabilities, pricing, context window, supported parameters
model-endpoints | Per-provider breakdown: price, latency, throughput, data policy
benchmarks | Third-party quality scores from Artificial Analysis and Design Arena
rankings-daily | Which models are most used and trending by token volume
chat-send | Send a test prompt to any model, get the response and cost
generation-get | Cost, token counts, and serving provider for a specific generation
docs-search | Full-text search across OpenRouter docs
credits-get | Your remaining account credit
providers-list | Available providers for routing preferences
app-rankings | Which apps drive the most OpenRouter traffic, by category

All tools except chat-send are read-only lookups. chat-send makes a billable inference call using your MCP key’s balance.

FAQ

Does this replace the OpenRouter API?

No. The MCP server is a development assistant for your coding agent. It pulls live OpenRouter data and can send test messages so your agent makes informed decisions while you build. Your app should still call the OpenRouter API directly.

How does authentication work?

Your MCP client triggers an OAuth flow that opens an OpenRouter consent page in your browser. You approve a dedicated API key with a 7-day expiry and a $10 spend cap. The key is separate from your other keys and can be disconnected anytime from your dashboard.

Does my source code get sent anywhere?

No. The tools are read-only lookups against the OpenRouter API. The only exception is chat-send, which sends the message you explicitly pass to it to a model. No source code leaves your machine unless you include it in a chat-send call.

Try it now: connect your agent and ask “what’s the best model for my use case?”
Jun 23, 2026
Date parsed from source:
Jun 23, 2026

First seen by Releasebot:
Jun 24, 2026
OpenRouter

Introducing the Unified Image API

OpenRouter launches a dedicated Image API with unified access to 30+ models, standardized request handling, per-model capability discovery, granular endpoint pricing, and native streaming previews for GPT image models.
Image generation on OpenRouter now has a dedicated API with unified access to 30+ models.

Like all our media generation APIs, we’ve standardized the interface for easy model switching, allowed passthrough for unique model capabilities, and provided programmatic access to discover the details of each individual model. We support models from Google, OpenAI, Black Forest Labs, Recraft, ByteDance, Sourceful, Microsoft, and xAI, with more being added all the time.

Browse image models | API docs | Try it in the playground

Know What Each Model Can Do

Image models differ in ways that break requests. Seedream 4.5 supports 18 aspect ratios; Gemini 3.1 Flash Image supports 14 (overlapping, but not identical). Some models generate up to 10 images per call; others cap at 1. Some accept 16 input references; others accept 4.

The /api/v1/images/models endpoint returns typed capability descriptors for every model:

{
  "id": "bytedance-seed/seedream-4.5",
  "supported_parameters": {
    "resolution": {"type": "enum", "values": ["1K", "2K", "4K"]},
    "aspect_ratio": {"type": "enum", "values": ["1:1", "16:9", "9:16", "..."]},
    "n": {"type": "range", "min": 1, "max": 10},
    "input_references": {"type": "range", "min": 0, "max": 14},
    "seed": {"type": "boolean"}
  },
  "supports_streaming": false
}

Your code can adapt to any model without hardcoding provider differences or battling 400 errors over unacceptable parameters.

This is especially useful for agents. Give your coding agent the /api/v1/images/models response and it has everything it needs to pick a model, validate inputs, and generate images without trial-and-error.
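
A client-side check against the capability descriptor might look like the sketch below. It assumes only the two descriptor types whose meaning is clear from the example response (enum and range) and leaves other types unchecked; `validate_params` is an illustrative helper, and the aspect-ratio list is truncated exactly as in the example.

```python
# Descriptor copied from the example response; the aspect_ratio list is truncated there too.
SEEDREAM_SUPPORTED = {
    "resolution": {"type": "enum", "values": ["1K", "2K", "4K"]},
    "aspect_ratio": {"type": "enum", "values": ["1:1", "16:9", "9:16"]},
    "n": {"type": "range", "min": 1, "max": 10},
    "input_references": {"type": "range", "min": 0, "max": 14},
    "seed": {"type": "boolean"},
}


def validate_params(params: dict, supported: dict) -> list:
    """Return a list of human-readable problems; an empty list means the params look valid."""
    problems = []
    for name, value in params.items():
        spec = supported.get(name)
        if spec is None:
            problems.append(f"{name}: not supported by this model")
        elif spec["type"] == "enum" and value not in spec["values"]:
            problems.append(f"{name}: {value!r} is not one of {spec['values']}")
        elif spec["type"] == "range" and not (spec["min"] <= value <= spec["max"]):
            problems.append(f"{name}: {value} is outside {spec['min']}..{spec['max']}")
        # Other descriptor types are passed through without checks in this sketch.
    return problems


if __name__ == "__main__":
    print(validate_params({"resolution": "2K", "n": 4}, SEEDREAM_SUPPORTED))
    print(validate_params({"resolution": "8K", "n": 11}, SEEDREAM_SUPPORTED))
```

Running the check before the request turns a 400 from the provider into a local error message.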

Per-Provider Granularity

Each model may be served by multiple providers. The per-endpoint records (/api/v1/images/models/{id}/endpoints) give you the definitive truth for each one: which parameters this specific endpoint accepts, what passthrough keys are allowed, streaming support, and granular pricing.

curl "https://openrouter.ai/api/v1/images/models/google/gemini-3.1-flash-image/endpoints"

Each endpoint also returns a pricing array with the exact billing structure. Different providers charge in different units:

"pricing": [ { "billable": "output_image", "unit": "image", "cost_usd": 0.04 } ]

Seedream 4.5 charges a flat $0.04 per image. FLUX.2 Pro bills at $0.03 per megapixel (so resolution affects cost). GPT-5.4 Image 2 and Gemini 3.1 Flash Image bill per token. No more guessing why a generation cost what it did; the usage object in every response includes the exact cost in USD.
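
The difference between flat and resolution-based billing is easy to see with a little arithmetic. The sketch below estimates cost from a single pricing line; the "image" unit comes from the example above, while the "megapixel" unit name and the convention of 1,000,000 pixels per megapixel are assumptions made for illustration. Per-token pricing is left out.

```python
def estimate_cost(pricing_line: dict, images: int = 1, width: int = 0, height: int = 0) -> float:
    """Estimate the USD cost of a generation from one pricing line."""
    unit = pricing_line["unit"]
    rate = pricing_line["cost_usd"]
    if unit == "image":
        return rate * images
    if unit == "megapixel":  # assumed unit name, not confirmed by the source
        megapixels = width * height / 1_000_000  # assumed definition of a megapixel
        return rate * megapixels * images
    raise ValueError(f"unit not handled in this sketch: {unit}")


if __name__ == "__main__":
    flat = {"billable": "output_image", "unit": "image", "cost_usd": 0.04}
    per_mp = {"billable": "output_image", "unit": "megapixel", "cost_usd": 0.03}
    print(round(estimate_cost(flat, images=3), 4))
    print(round(estimate_cost(per_mp, width=1024, height=1024), 4))
    print(round(estimate_cost(per_mp, width=2048, height=2048), 4))
```

Under these assumptions, doubling both dimensions quadruples the per-megapixel cost while the flat rate stays the same. The usage object in the response remains the authoritative figure.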

One Request Shape, Any Model

The API normalizes the fragmented world of image generation into one schema:

curl -X POST "https://openrouter.ai/api/v1/images" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bytedance-seed/seedream-4.5",
    "prompt": "a red panda astronaut floating in space, studio lighting",
    "resolution": "2K",
    "aspect_ratio": "16:9"
  }'

Resolution, aspect ratio, quality, output format, background transparency, input references, streaming: all normalized across every provider. When you need provider-specific features (like Black Forest Labs’ steps or guidance), pass them through provider.options keyed by the provider slug from the endpoints API.
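
Passthrough options can be attached with a small helper that also rejects keys the endpoint does not list. This is a sketch under stated assumptions: the provider slug "black-forest-labs" and the model slug are placeholders (the real slug comes from the endpoints API), the allowed-key list is passed in by the caller, and `add_passthrough` is an illustrative name.

```python
def add_passthrough(body: dict, provider_slug: str, options: dict, allowed_keys) -> dict:
    """Return a copy of body with provider-specific options nested under provider.options."""
    rejected = sorted(set(options) - set(allowed_keys))
    if rejected:
        raise ValueError(f"keys not accepted by {provider_slug}: {rejected}")
    result = dict(body)
    provider = dict(result.get("provider", {}))
    provider_options = dict(provider.get("options", {}))
    provider_options[provider_slug] = dict(options)  # keyed by the provider slug
    provider["options"] = provider_options
    result["provider"] = provider
    return result


if __name__ == "__main__":
    body = add_passthrough(
        {"model": "example/image-model", "prompt": "a lighthouse at dusk"},  # hypothetical slug
        "black-forest-labs",  # hypothetical provider slug
        {"steps": 30, "guidance": 3.5},
        allowed_keys=["steps", "guidance"],
    )
    print(body["provider"])
```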

Streaming Previews for GPT Image Models

OpenAI’s GPT Image models (GPT-5 Image, GPT-5 Image Mini, GPT-5.4 Image 2) support native SSE streaming through the Image API. Set "stream": true and you’ll receive partial image previews as they’re rendered, so users see progress instead of waiting for the full generation. Check the supports_streaming field on any endpoint to see if it’s available.

FAQ

What happens to image generation through chat completions?

Until now, we supported image generation via completions and responses. All existing image models continue to be supported there, but new image models will be added exclusively to the dedicated Image API.

If you’re using openai/gpt-5-image, openai/gpt-5-image-mini, or openai/gpt-5.4-image-2, we recommend switching to one of the dedicated image models. The GPT-5 and GPT-5.4 versions generate images through an LLM, so they don’t provide access to the full set of supported parameters and may incur extra inference cost.

Can I use provider-specific features?

Yes. Each endpoint exposes an allowed_passthrough_parameters list. Pass provider-specific keys under provider.options keyed by the provider slug. The endpoints API tells you exactly which keys are accepted.

How does pricing work?

Each endpoint returns granular pricing lines with a billable unit, cost in USD, and optional variant tiers (e.g., resolution-based pricing). The usage object in every response includes the exact cost.

Tell us what you think and which models you want next in #feedback on Discord.
Jun 18, 2026
Date parsed from source:
Jun 18, 2026

First seen by Releasebot:
Jun 19, 2026
OpenRouter

Connect OpenClaw to OpenRouter

OpenRouter now supports OpenClaw with one-command setup, unified billing, automatic provider failover, model fallbacks, and cost controls across 300+ models from 70+ providers.
OpenClaw runs AI agents across Telegram, Discord, Slack, Signal, iMessage, and WhatsApp from one place. It’s open source and it needs a model provider behind it. Point it at one provider directly and you own that relationship: one key, one bill, and an agent that stops the moment that provider has a bad minute.

OpenClaw ships with built-in OpenRouter support, so one key reaches 300+ models across 70+ providers, billing lands in one place, and requests fail over to another provider automatically. The connection is a single command. This guide covers that setup, then the model format, failover, cost controls, and the errors that come up most.

Connect OpenClaw to OpenRouter in one command

Run the onboard command with your key:

openclaw onboard --auth-choice apiKey --token-provider openrouter --token "$OPENROUTER_API_KEY"

That writes your credential to ~/.openclaw/openclaw.json and sets the openrouter/auto model. You’re connected.

If you’d rather edit the config by hand, the file lives at ~/.openclaw/openclaw.json under the home directory of the user running OpenClaw. A minimal config needs your key and one model:

{
  "env": {
    "OPENROUTER_API_KEY": "sk-or-..."
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/openrouter/auto"
      },
      "models": {
        "openrouter/openrouter/auto": {}
      }
    }
  }
}

On a server, set the key in the env block rather than a shell profile. A service that runs under a different user or shell won’t pick up an interactive profile, while the env block is injected into the process at start time. To change the key later, edit env.OPENROUTER_API_KEY and restart with openclaw gateway run.
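
A hand-edited config is easy to get subtly wrong, so a quick local check can help. The Python sketch below inspects a parsed config for the two things this guide calls out: the key in the env block and the primary model being listed under agents.defaults.models. It is an illustrative lint, not OpenClaw's own validation, and `check_config` is a hypothetical helper.

```python
def check_config(config: dict) -> list:
    """Return a list of problems found in a parsed openclaw.json; empty means none found."""
    problems = []
    if not config.get("env", {}).get("OPENROUTER_API_KEY"):
        problems.append("env.OPENROUTER_API_KEY is missing")
    defaults = config.get("agents", {}).get("defaults", {})
    primary = defaults.get("model", {}).get("primary")
    if not primary:
        problems.append("agents.defaults.model.primary is missing")
    elif primary not in defaults.get("models", {}):
        problems.append(f"{primary} is not listed under agents.defaults.models")
    return problems


if __name__ == "__main__":
    minimal = {
        "env": {"OPENROUTER_API_KEY": "sk-or-..."},
        "agents": {
            "defaults": {
                "model": {"primary": "openrouter/openrouter/auto"},
                "models": {"openrouter/openrouter/auto": {}},
            }
        },
    }
    print(check_config(minimal))
```

To run it against the real file, load ~/.openclaw/openclaw.json with `json.load` and pass the result in.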

Then confirm your models loaded:

openclaw models list

Reference models with the openrouter/<author>/<model> format

OpenClaw references OpenRouter models as openrouter/<author>/<model>. Prefix the author with ~ to track the latest version in a family, or drop it to pin an exact version. Check the current slug on the models page before you commit one, since identifiers change as new versions ship.

Append a variant suffix to change routing on the same model. :free routes to a free endpoint, :nitro sorts providers by throughput, and :thinking asks for extended reasoning. To change an agent’s model later, update agents.defaults.model.primary and restart the gateway.

The Auto Router is referenced as openrouter/openrouter/auto: the author is openrouter and the model is auto. That double openrouter is easy to get wrong, and it’s the fix for the unknown model error below.
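
The reference rules above (author, model, optional ~ prefix, optional variant suffix) can be assembled by a small helper. This is an illustrative Python sketch; `model_ref` is not an OpenClaw function, and it accepts only the three variants named in this guide.

```python
# Variant suffixes named in this guide.
VARIANTS = {"free", "nitro", "thinking"}


def model_ref(author: str, model: str, track_latest: bool = False, variant: str = "") -> str:
    """Build an OpenClaw model reference for an OpenRouter model."""
    if variant and variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    prefix = "~" if track_latest else ""  # ~ tracks the latest version in a family
    ref = f"openrouter/{prefix}{author}/{model}"
    return f"{ref}:{variant}" if variant else ref


if __name__ == "__main__":
    print(model_ref("openrouter", "auto"))
    print(model_ref("anthropic", "claude-sonnet-latest", track_latest=True))
    print(model_ref("meta-llama", "llama-3.3-70b-instruct", variant="free"))
```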

Keep agents running when a provider drops

A one-off API call that fails is easy to retry. An OpenClaw agent holding state across a multi-step Telegram conversation is not, since a mid-run failure can leave a message that looks sent but wasn’t, or a tool call that never returned. OpenRouter handles that at two levels.

Provider failover is automatic. Most models are served by more than one provider, and if the first one OpenRouter tries is down or rate-limiting you, it routes the same request to another. You don’t configure it, and you’re billed only for the request that completes.

Model fallbacks cover the case where a model is unavailable everywhere. Add a fallbacks array and OpenRouter tries each model in order:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/~anthropic/claude-sonnet-latest",
        "fallbacks": [
          "openrouter/~google/gemini-flash-latest",
          "openrouter/deepseek/deepseek-chat"
        ]
      }
    }
  }
}

The two stack. Provider failover swaps providers behind one model; the fallback array swaps models entirely. Check the model field in the response to see which one ran. See the model fallbacks docs for the full configuration.
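
How the two levels stack can be shown with a local simulation. The sketch below is not OpenRouter's implementation, which runs server-side; it only mirrors the described order of attempts (every provider for the first model, then the next model), and all model names, provider names, and the `demo_call` stub are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Raised when a provider is down or rate-limiting."""


def run_with_failover(
    models: List[str],
    providers: Dict[str, List[str]],
    call: Callable[[str, str], str],
) -> Tuple[str, str, str]:
    """Try each provider for each model in order; return (model, provider, response)."""
    for model in models:  # outer loop: model fallbacks
        for provider in providers.get(model, []):  # inner loop: provider failover
            try:
                return model, provider, call(model, provider)
            except ProviderError:
                continue
    raise ProviderError("all models and providers failed")


def demo_call(model: str, provider: str) -> str:
    """Stub backend: only model-b on provider p2 is healthy."""
    if (model, provider) == ("model-b", "p2"):
        return "ok"
    raise ProviderError(f"{provider} unavailable for {model}")


if __name__ == "__main__":
    routes = {"model-a": ["p1", "p2"], "model-b": ["p1", "p2"]}
    print(run_with_failover(["model-a", "model-b"], routes, demo_call))
```

The returned model name plays the role of the model field in the response: it tells you which one actually ran.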

If your prompts carry data-residency or compliance requirements, restrict routing to providers that don’t retain request data using the data_collection and zdr provider-routing controls. The provider selection docs cover the parameters, and provider logging lists which providers qualify.

Match models to agents to control cost

Running one capable model for every agent action wastes money on work that doesn’t need it. A research agent reading long documents needs a frontier model. A summarizer handling short text runs fine on a free Llama. A bot fielding quick questions runs fine on Gemini Flash.

The Auto Router (openrouter/openrouter/auto), powered by NotDiamond, picks a cost-effective model per request and charges that model’s standard rate with no extra routing fee. It’s a good default for agent traffic that’s mostly low-stakes work like heartbeats and status checks.

When you want explicit control, OpenClaw lets you split models by agent. Set a per-agent override under agents.overrides.<agent>.model:

{
  "agents": {
    "overrides": {
      "researcher": {
        "model": {
          "primary": "openrouter/anthropic/claude-opus-4.6"
        }
      },
      "summarizer": {
        "model": {
          "primary": "openrouter/meta-llama/llama-3.3-70b-instruct:free"
        }
      }
    }
  }
}

On cost: OpenRouter doesn’t mark up provider pricing. On pay-as-you-go the platform fee is 5.5%, and that one fee covers unified billing, failover, and one key across every provider. For low-stakes actions, 20+ free models cost nothing per token. If you bring your own provider keys, the BYOK fee is 5%, waived for the first 1M requests each month. Track spend per model on the Activity dashboard.

The unified endpoint earns its keep once you run more than one model, want requests to survive an outage, or want to switch models by editing a string.

Fix the most common connection errors

“No API key found for provider ‘openrouter’” means the key isn’t reaching OpenClaw. Run echo $OPENROUTER_API_KEY to check it, verify your auth config with openclaw auth list, or re-run the onboard command. On a VPS, the usual cause is the variable loading in your interactive shell but not in the service’s shell, so set it in the config env block.

“unknown model: openrouter/auto” means the Auto Router reference is wrong. Use openrouter/openrouter/auto and list it under agents.defaults.models. OpenClaw expects the full openrouter/<author>/<model> path.

“OpenRouter not responding” means requests are going out with nothing coming back. Work through four checks: confirm your credit balance at openrouter.ai/keys, run openclaw models list to confirm the slug resolves, run openclaw logs --follow to read the actual error, and make sure your host can reach https://openrouter.ai/api/v1. An egress rule blocking that host produces exactly this no-response.

401 or 403 errors are account-side: the key is invalid, revoked, or out of credits. Check it at openrouter.ai/keys, update env.OPENROUTER_API_KEY, and restart the gateway.

Frequently asked questions

How do I connect OpenClaw to OpenRouter?

Run openclaw onboard --auth-choice apiKey --token-provider openrouter --token "$OPENROUTER_API_KEY". It writes your credential and sets the openrouter/auto model. You don’t need a base URL or a models.providers block.

What model-reference format does OpenClaw use?

openrouter//, for example openrouter/deepseek/deepseek-chat. Add a ~ before the author to track the latest version in a family (openrouter/~anthropic/claude-sonnet-latest), or append :free, :nitro, or :thinking to change routing behavior.
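To make the reference format concrete, here is a minimal Python sketch that splits a reference into its parts. It assumes the shape is the literal prefix openrouter, then an author, then a model name, with an optional ~ on the author and an optional :variant suffix, as the examples above suggest. ModelRef and parse_model_ref are hypothetical names for illustration, not part of OpenClaw.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRef:
    author: str
    model: str
    variant: Optional[str] = None  # "free", "nitro", or "thinking" in the examples above
    track_latest: bool = False     # True when the author carries a "~" prefix


def parse_model_ref(ref: str) -> ModelRef:
    """Split a reference of the assumed shape openrouter/AUTHOR/MODEL[:VARIANT]."""
    parts = ref.split("/")
    if len(parts) != 3 or parts[0] != "openrouter":
        raise ValueError(f"expected openrouter/AUTHOR/MODEL, got {ref!r}")
    author, model = parts[1], parts[2]
    track_latest = author.startswith("~")
    if track_latest:
        author = author[1:]
    model, _, variant = model.partition(":")
    if not author or not model:
        raise ValueError(f"empty author or model in {ref!r}")
    return ModelRef(author, model, variant or None, track_latest)


print(parse_model_ref("openrouter/meta-llama/llama-3.3-70b-instruct:free"))
```

Under this reading, openrouter/auto is rejected because it has only two segments, while openrouter/openrouter/auto parses with openrouter as the author.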

How do I fix “unknown model: openrouter/auto”?

Use openrouter/openrouter/auto and list it under agents.defaults.models. OpenClaw expects the full openrouter// path, and the Auto Router’s author is openrouter.

Can I use free OpenRouter models with OpenClaw?

Yes. Append :free to the reference, such as openrouter/meta-llama/llama-3.3-70b-instruct:free. Pair it with a fallback so the agent keeps running when a free slot is busy.

Do I need to set a base URL for OpenClaw?

No. OpenClaw’s built-in OpenRouter support handles routing internally. Set the API key and reference models with openrouter//.
Jun 16, 2026
Date parsed from source:
Jun 16, 2026

First seen by Releasebot:
Jun 17, 2026
OpenRouter

Subagent: Let Your Model Delegate the Busywork

OpenRouter adds the openrouter:subagent tool, letting models delegate routine tasks like summarization, data extraction, boilerplate writing, and format conversion to a cheaper worker model while the frontier model keeps orchestrating. It also supports worker tools, billing separation, and recursion limits.
Find subagent opportunities in your codebase

Paste this prompt into your coding agent to have it scan your project for places where subagent delegation would cut costs:

Read through this codebase and identify places where an OpenRouter API call
could benefit from the openrouter:subagent server tool. Look for patterns where
a frontier model is doing mechanical sub-tasks inline: summarization, data
extraction, reformatting, boilerplate generation, or schema conversion.
For each candidate, explain:

Which file and function

What the sub-task is

Why it's a good fit for delegation (self-contained, predictable output, doesn't need the full conversation context)

A code snippet showing how to add the subagent tool to that call
Reference docs: https://openrouter.ai/docs/guides/features/server-tools/subagent
Cookbook recipe: https://openrouter.ai/docs/cookbook/building-agents/subagent-server-tool

Frontier brain, budget hands

Claude Opus 4.8 costs $5 per million input tokens. GPT-5.5 costs $5. GLM 5.2 costs $1.40. That’s a 3.6x spread on input between frontier and worker, 5.7x on output. (Claude Fable 5 was $10/$50 per M tokens before it got yanked, RIP.)

A frontier model doing a code review doesn’t need to spend its own tokens summarizing a 2,000-line changelog or reformatting a JSON blob. Those are mechanical tasks with clear instructions and predictable output. The subagent handles them at GLM prices while the orchestrator focuses on the parts that actually require reasoning.

In a complex agentic workflow with 20 tool calls, maybe 5-8 are subagent delegations: summarization, data extraction, template filling, format conversion. The frontier model orchestrates and judges. You’ve cut your per-request cost without touching the quality ceiling on the hard parts.

How it works under the hood

The worker model sees only what the delegating model explicitly passes in the task_description. No parent conversation, no prior context, no memory between tasks. Each delegation is a clean, isolated unit of work.

Any model can be the worker. Pin it with parameters.model (anything in the model catalog works). Open-source models like z-ai/glm-5.2 work well for mechanical tasks. If you don’t specify a model, it falls back to the outer request model.

Workers get their own tools. Give the worker openrouter:web_search and it can ground its output in fresh sources before responding. The worker runs its own tool loop internally; only the final text comes back to your model.

Recursion is blocked. The subagent can’t call itself. A depth header and self-reference check prevent unbounded nesting, and delegations are capped at 10 per request.

Subagent vs. advisor

These two tools point in opposite directions. The advisor escalates hard decisions to a stronger model. The subagent delegates routine work to a cheaper one.

Use both in the same request. Your frontier model consults the advisor on architectural decisions and delegates summarization to the subagent. Different tools for different kinds of work.

Billing

Subagent tokens bill at the worker model’s rates, separate from the orchestrator. If your orchestrator is Claude Opus 4.8 ($5/$25 per M tokens) and the worker is GLM 5.2 ($1.40/$4.40 per M tokens), each model’s tokens bill at their own price. Both show up on your activity page.
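A quick worked example of that split, using the per-million prices quoted above. The token counts are made up for illustration, and the sketch ignores the orchestrator tokens spent writing the task_description and reading the worker's result, so treat it as a rough comparison rather than a billing model.

```python
from decimal import Decimal

# USD per million tokens as (input, output), taken from the prices quoted in the text.
PRICES = {
    "anthropic/claude-opus-4.8": (Decimal("5"), Decimal("25")),
    "z-ai/glm-5.2": (Decimal("1.40"), Decimal("4.40")),
}


def token_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD for one call, billed at the named model's own rates."""
    in_price, out_price = PRICES[model]
    return (in_price * input_tokens + out_price * output_tokens) / Decimal(1_000_000)


# Hypothetical summarization sub-task: 200k tokens in, 2k tokens out.
inline = token_cost("anthropic/claude-opus-4.8", 200_000, 2_000)
delegated = token_cost("z-ai/glm-5.2", 200_000, 2_000)
print(f"inline on orchestrator: ${inline}")
print(f"delegated to worker:    ${delegated}")
```

For this made-up task the orchestrator would pay about $1.05 to do the work inline, while the worker does it for about $0.29.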

Get started

One line in your tools array:

{ "type": "openrouter:subagent", "parameters": { "model": "z-ai/glm-5.2" } }
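For context, here is a sketch of where that line sits in a full request body. It assumes a standard chat/completions payload with the orchestrator in model and the tool entry in tools. build_request is a hypothetical helper, and the prompt is a placeholder.

```python
import json


def build_request(orchestrator: str, worker: str, user_prompt: str) -> dict:
    """Assemble a chat request that offers the subagent tool to the orchestrator."""
    return {
        "model": orchestrator,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [
            # The tool entry from the text; the worker model is pinned via parameters.model.
            {"type": "openrouter:subagent", "parameters": {"model": worker}},
        ],
    }


body = build_request(
    orchestrator="anthropic/claude-opus-4.8",
    worker="z-ai/glm-5.2",
    user_prompt="Review this pull request and summarize the changelog first.",
)
print(json.dumps(body, indent=2))
```

Dropping the parameters.model key would, per the fallback rule described earlier, leave the worker on the same model as the outer request.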

The model decides when to use it. Read the full docs for all parameters, worker tools, and recursion details, or follow the cookbook recipe for a working integration.
Jun 15, 2026
Date parsed from source:
Jun 15, 2026

First seen by Releasebot:
Jun 17, 2026
OpenRouter

Keep Your Agent Running When Models Disappear

OpenRouter introduces presets for server-side model routing, making it easier to set fallback chains, provider rules, parameters, and system prompts in one place. The update helps teams survive model deprecations and provider restrictions without editing code or redeploying.
Providers retire and restrict models routinely

More than 70 models have been pulled or deprecated by providers in the last few years. Anthropic’s Fable being pulled recently is perhaps the most high-profile and impactful example of this we’ve seen, but the pattern isn’t new and isn’t going away.

OpenRouter already handles one layer of this for you. When a model runs on several providers and one of them fails or rate-limits, the marketplace reroutes to another provider automatically, with no configuration. We cover how that failover works in a separate post.

That keeps a single model reachable through provider trouble, but it can’t help once the model itself is gone. For that you want model failover, where requests move to a different model when your first choice disappears. Presets are how you set that up.

Hard-coding a model slug pins your choice inside every service that uses it. When that model goes away, the only fix is to edit the code and redeploy each service, and requests keep failing until you do.

A preset takes that choice out of your code. It’s a named, server-side configuration (model, fallback models, provider rules, parameters, and a system prompt) that you reference by slug. The model lives in the preset instead of the code, so you change it in one place and every service that calls the preset picks it up with no redeploy.

Here’s a simple preset definition. Copy it and adjust the models:

{ "models": [ "anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5" ], "provider": { "allow_fallbacks": true } }

The models array is your fallback chain, in priority order. If the first model is unavailable, OpenRouter tries the next one.

Hard-coded slug vs preset reference

When a provider restricts the model: hard-coded - every service breaks until you edit and redeploy; preset - edit the preset once, callers keep running.

Who ships the fix: hard-coded - whoever owns each codebase; preset - whoever owns the preset.

Blast radius of a change: hard-coded - one edit per repo, per service; preset - one edit, applied everywhere.

Data policy (ZDR, retention): hard-coded - re-stated in every request; preset - set once on the preset.

Rollback: hard-coded - revert a commit and redeploy; preset - re-designate a previous version.

Give this to your agent

"I want to stop hard-coding model slugs so one provider change can't take down my app. Set up an OpenRouter preset and route my calls through it.

Create a preset named "customer-support" with a fallback chain: a primary model plus 2 backups in priority order, using the models array.

Set provider rules on the preset: allow_fallbacks true, and zdr true if my data policy requires Zero Data Retention.

Capture it by POSTing a known-good chat/completions body to https://openrouter.ai/api/v1/presets/customer-support/chat/completions with my OpenRouter API key.

Replace the model field in my inference calls with "@preset/customer-support".

Keep my OpenRouter API key in an environment variable. Never hard-code it.

Use these references for current shapes:

Presets: https://openrouter.ai/docs/guides/features/presets

Provider routing and fallbacks: https://openrouter.ai/docs/guides/routing/provider-selection"

Capture a working request as a preset

You can build a preset in the dashboard, or capture one from a request body you already trust.
Send a known-good chat/completions body to the preset capture endpoint. OpenRouter persists the fields that overlap with the preset config (models, provider, temperature, and so on) and ignores transient fields like messages:

curl https://openrouter.ai/api/v1/presets/customer-support/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [
      "anthropic/claude-fable-5",
      "anthropic/claude-opus-4.8",
      "openai/gpt-5.5"
    ],
    "provider": { "allow_fallbacks": true },
    "messages": [
      { "role": "system", "content": "You are a concise support assistant." },
      { "role": "user", "content": "Summarize this ticket in one sentence." }
    ]
  }'

If a preset with that slug already exists, this creates a new version and designates it active. If it doesn’t exist, it creates the preset. Pick a slug that isn’t already in use, since capturing onto an existing slug overwrites its live config with a new active version.

Example response excerpt:

{ "data": { "name": "customer-support", "slug": "customer-support", "status": "active", "designated_version": { "version": 1, "system_prompt": "You are a concise support assistant.", "config": { "models": [ "anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5" ], "provider": { "allow_fallbacks": true } } } } }
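If you script the capture, it helps to confirm what actually became active. The sketch below reads the response excerpt shown above and pulls out the designated version and its fallback chain. It assumes the real response keeps the same field names as the excerpt, and active_version is a hypothetical helper.

```python
import json

# The response excerpt from the text, pasted as-is.
RESPONSE = """
{ "data": { "name": "customer-support", "slug": "customer-support", "status": "active",
  "designated_version": { "version": 1,
    "system_prompt": "You are a concise support assistant.",
    "config": { "models": [ "anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5" ],
                "provider": { "allow_fallbacks": true } } } } }
"""


def active_version(payload: dict) -> tuple[int, list[str]]:
    """Return (version number, fallback chain) of the designated preset version."""
    version = payload["data"]["designated_version"]
    return version["version"], version["config"]["models"]


number, chain = active_version(json.loads(RESPONSE))
print(f"active version {number}: {' -> '.join(chain)}")
```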

Reference the preset from your code

Now point your inference calls at @preset/customer-support. The model choice lives in the preset, so this line stays the same when the underlying model changes.

Install an SDK first:

Python:

pip install openrouter

TypeScript:

npm install @openrouter/sdk

Usage example in Python:

from openrouter import OpenRouter
import os

client = OpenRouter(api_key=os.getenv("OPENROUTER_API_KEY"))
response = client.chat.send(
    model="@preset/customer-support",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}]
)
print(response.choices[0].message.content)

Usage example in TypeScript/JavaScript:

import { OpenRouter } from '@openrouter/sdk';

const client = new OpenRouter({ apiKey: process.env.OPENROUTER_API_KEY });
const response = await client.chat.send({
  chatRequest: {
    model: '@preset/customer-support',
    messages: [{ role: 'user', content: 'Summarize this ticket in one sentence.' }],
  },
});
console.log(response.choices[0]?.message.content);

Default to the @preset/slug form shown above. Reach for the combined model@preset/slug form when you want to name a base model and layer a preset’s config on top of it, or use a separate preset field alongside model if you’d rather keep them as distinct request fields. The presets docs cover all 3.
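Side by side, the 3 forms might look like the request bodies below. This is a sketch: the helper names are hypothetical, and the assumption that the separate preset field takes the bare slug should be checked against the presets docs.

```python
def preset_only(slug: str, messages: list) -> dict:
    """Form 1: the preset supplies the model and everything else."""
    return {"model": f"@preset/{slug}", "messages": messages}


def model_with_preset(model: str, slug: str, messages: list) -> dict:
    """Form 2: name a base model and layer the preset's config on top."""
    return {"model": f"{model}@preset/{slug}", "messages": messages}


def preset_field(model: str, slug: str, messages: list) -> dict:
    """Form 3: keep model and preset as distinct request fields (bare slug assumed)."""
    return {"model": model, "preset": slug, "messages": messages}


msgs = [{"role": "user", "content": "Summarize this ticket in one sentence."}]
for body in (
    preset_only("customer-support", msgs),
    model_with_preset("openai/gpt-5.5", "customer-support", msgs),
    preset_field("openai/gpt-5.5", "customer-support", msgs),
):
    print({k: v for k, v in body.items() if k != "messages"})
```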

Add fallback models so requests keep flowing

The models array is the part that survives a deprecation. Pass models in priority order, and OpenRouter walks the list when one is unavailable.
Put the model you trust most as the last entry, so your final fallback is a floor you’re comfortable shipping. For a coding workload that leaned on Fable 5, a chain like anthropic/claude-fable-5, then anthropic/claude-opus-4.8, then openai/gpt-5.5 keeps strong models in reserve.

Example fallback firing:

With Fable 5 restricted, a request naming the chain above succeeds anyway, and the response’s model and provider fields name what actually served it:

{ "model": "anthropic/claude-4.8-opus-20260528", "provider": "Anthropic", "choices": [{ "message": { "role": "assistant", "content": "Customer cannot log in because password reset emails are not being received, despite checking spam and confirming the correct email address." } }] }

OpenRouter skipped the restricted primary and served the next model in the array. Your code didn’t change.
The model field reports the concrete version that actually served the request, so it reads differently from the slug you sent. Here anthropic/claude-opus-4.8 resolved to the dated build anthropic/claude-4.8-opus-20260528 on Anthropic.

Two layers of recovery stack here. Provider-layer failover is automatic: for one model served by several providers, OpenRouter retries the next provider on a 5xx or rate-limit. Model-layer fallbacks are the models array, which moves to a different model when the whole primary is gone. For the mechanics of each, see reliability and automatic failover and model routing.
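The way the two layers nest can be pictured with a small local simulation. This is not OpenRouter's routing code: route, ProviderError, and the provider names are hypothetical, and the real router weighs more than availability. It only shows the order in which the layers are tried.

```python
from typing import Callable


class ProviderError(Exception):
    """Stands in for a 5xx or rate-limit response from one provider."""


def route(models: list[str],
          providers_for: dict[str, list[str]],
          call: Callable[[str, str], str]) -> tuple[str, str, str]:
    """Try each provider of a model (provider layer) before moving to the next model (model layer)."""
    for model in models:
        for provider in providers_for.get(model, []):
            try:
                return model, provider, call(model, provider)
            except ProviderError:
                continue  # provider-layer failover: same model, next provider
        # No provider could serve this model: model-layer fallback to the next entry.
    raise RuntimeError("every model in the chain is unavailable")


def fake_call(model: str, provider: str) -> str:
    if provider == "provider-a":
        raise ProviderError("rate limited")
    return "ok"


chain = ["anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5"]
providers = {
    "anthropic/claude-fable-5": [],  # restricted: nobody serves it
    "anthropic/claude-opus-4.8": ["provider-a", "provider-b"],
    "openai/gpt-5.5": ["provider-c"],
}
print(route(chain, providers, fake_call))
```

In this toy run the restricted primary is skipped at the model layer, and the rate-limited first provider is skipped at the provider layer, so the second model is served by its second provider.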

Set your data policy in the preset

Provider rules ride along in the same preset, so a routing policy applies to every caller without a code change.
Example preset with Zero Data Retention and data collection rules:

{ "models": [ "anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5" ], "provider": { "zdr": true, "data_collection": "deny", "allow_fallbacks": true } }

zdr: true keeps requests on endpoints that honor Zero Data Retention.
data_collection: "deny" blocks providers that train on or store prompts. You can also pin or exclude specific providers with only, ignore, and order. See provider routing for the full list.

This is where the Fable 5 situation gets concrete. Its model page notes that Anthropic’s policy “does not allow zero data retention.” With zdr: true set on the preset, routing skips Fable 5 because it can’t satisfy the rule, and falls through to the next model in your array that can. One switch, enforced server-side, for every request that names the preset.
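As a toy model of that fall-through, the sketch below filters a chain by whether each model can satisfy ZDR. The capability set is illustrative: only Fable 5's lack of ZDR comes from the text, and the other entries are assumptions. eligible_models is a hypothetical helper, not an OpenRouter API.

```python
def eligible_models(chain: list[str], zdr_capable: set[str], require_zdr: bool) -> list[str]:
    """Keep the chain's order, dropping models that cannot satisfy the ZDR rule."""
    if not require_zdr:
        return list(chain)
    return [model for model in chain if model in zdr_capable]


chain = ["anthropic/claude-fable-5", "anthropic/claude-opus-4.8", "openai/gpt-5.5"]
# Assumed for illustration: the two backups have at least one ZDR endpoint.
zdr_capable = {"anthropic/claude-opus-4.8", "openai/gpt-5.5"}

print(eligible_models(chain, zdr_capable, require_zdr=True))
print(eligible_models(chain, zdr_capable, require_zdr=False))
```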

Roll out and roll back across your team

On an organization account, every member can use organization presets, so a routing decision made once is shared instead of copied into each repo.
Every capture or edit creates a new version and marks it active. Version history is kept, so a bad change is one re-designation away from a rollback. Through the API, the latest designated version is always the one that runs. You re-designate versions and delete presets from the dashboard; the API captures and reads presets but has no delete endpoint.
Parameters you pass in a request override the preset’s values, shallow-merged. Request fields win, and preset fields you don’t send are preserved. That lets a single call bump temperature without forking the preset.
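The sketch below shows what a shallow merge means in practice, using plain dictionaries. It illustrates the term rather than OpenRouter's implementation. Under a strictly shallow merge, a nested object such as provider is replaced as a unit, so check the presets docs before relying on nested fields surviving a partial override.

```python
def effective_config(preset: dict, request: dict) -> dict:
    """Shallow merge: top-level request keys win, untouched preset keys survive."""
    return {**preset, **request}


preset = {
    "models": ["anthropic/claude-opus-4.8", "openai/gpt-5.5"],
    "temperature": 0.2,
    "provider": {"zdr": True, "allow_fallbacks": True},
}

# Bumping temperature leaves the rest of the preset alone.
bumped = effective_config(preset, {"temperature": 0.7})
print(bumped["temperature"], bumped["provider"])

# Sending a partial provider object replaces the preset's provider wholesale.
replaced = effective_config(preset, {"provider": {"allow_fallbacks": False}})
print(replaced["provider"])
```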

Wire it all together

The data-policy config above already holds all 3 layers: the model chain, the provider policy, and (once you add one) the system prompt. Capture it once with the curl call, reference @preset/customer-support everywhere, and the next time a provider restricts a model you edit one config instead of chasing slugs through every service.
If you’d rather not pin a primary at all, point the chain at a self-updating alias like ~anthropic/claude-opus-latest, which always resolves to the newest model in that family.

Start with one preset

Pick your highest-traffic call, create a preset for it at openrouter.ai/settings/presets, give it a fallback chain, and swap the model string for @preset/your-slug. That one move turns a forced migration into a config edit.
Presets also pair well with governance work: the same control point that survives a deprecation is where you enforce data-handling rules, as covered in human oversight for AI agents.

Note: This post covers engineering patterns, not legal advice. For export-control, data-residency, or retention obligations, consult counsel about your specific use case and jurisdiction.

FAQ

What happens to my app when a model is deprecated or restricted?
If your code hard-codes the model slug, requests to that model start failing and every service that used it breaks until you edit the code and redeploy. Route through a preset and you edit one config; callers pick up the change with no redeploy. A models fallback array can keep requests succeeding on a backup while you decide what to do.

How is a preset different from passing a models fallback array in code?
A models array sets the fallback order for one request. A preset stores that array, plus provider rules, parameters, and a system prompt, on the server under a slug. You reference it with @preset/slug, so the configuration lives in one place across every service and changes without a code edit.

Do request parameters override preset values?
Yes. Request parameters take priority over the preset’s values, shallow-merged. Request-level fields override matching preset fields, and preset fields you don’t send are preserved.

Can I enforce Zero Data Retention with a preset?
Yes. Set provider.zdr to true. OpenRouter routes only to endpoints that honor Zero Data Retention, skips models or providers that can’t, and falls through your models array to the next qualifying option.

How do I roll back a preset change?
Every capture or edit creates a new version and designates it active. Version history is kept, so you can re-designate a previous version. Through the API, the latest designated version is always used.
Jun 12, 2026
Date parsed from source:
Jun 12, 2026

First seen by Releasebot:
Jun 16, 2026
OpenRouter

Surpassing Frontier Performance with Fusion

OpenRouter launches Fusion, a server-side tool that fuses answers from multiple models into one grounded response. It adds customizable panels, a judge model for synthesis, and support through chatroom, API, plugin, or server tool for deeper research and beyond-frontier results.
We’ve found that synthesizing the results of multiple models can significantly outperform what individual models are capable of. Introducing Fusion: a tool for getting these combined results just as easily as calling a single model. It allows you to choose a panel of participant models alongside a judge model responsible for fusing the individual results together.

To understand the benefits of Fusion, we used a deep research benchmark that tests the combination of reasoning, tool usage, and knowledge. We found that:

Panels consistently outperform individual models

Beyond-frontier performance can be achieved with frontier panels

Panels of budget models can surpass frontier models and get close to frontier panel performance

Try Fusion now in a chatroom, or check out the API docs to build it into your application.

Panels of Models Consistently Outperform on Deep Research

We tested Fusion on 100 deep research tasks from the DRACO benchmark. Some highlights of what we found:

Fable 5 + GPT-5.5 fused together scored 69.0%, surpassing every individual model, including Fable 5 alone at 65.3%.

A budget panel (Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro) beat GPT-5.5 and Opus 4.8. It came within 1% of Fable 5’s score while being 50% of the cost.

** 7 of the 100 DRACO tasks were not completed because Fable 5’s content filters blocked them from executing. We chose not to fall back to Opus 4.8 for those tasks, so the Fable results reflect 93 scored tasks rather than the full 100. This gives the most accurate picture of Fable’s own performance, but means direct score comparisons against models that completed all 100 tasks are slightly uneven.

We believe this demonstrates the benefits of model diversity, similar to the benefits seen on human team performance. Bringing multiple different perspectives to complex problems yields superior results.

One API call that fuses the best output of multiple models

When you send a prompt to Fusion, we dispatch it to a panel of models in parallel, each with web search and web fetch enabled. A judge model reads every panel response and produces structured analysis: consensus points, contradictions, partial coverage, unique insights, blind spots. The calling model then writes the final answer grounded in that analysis.

The whole pipeline runs server-side so it can be called just like you would an individual model.
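To picture the three stages, here is a local sketch with stub functions standing in for the models. It is not the server-side implementation: fuse and the stubs are hypothetical, web search and web fetch are left out, and the judge here returns plain text, not the structured analysis described above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Model = Callable[[str], str]  # a model is anything that maps a prompt to text


def fuse(prompt: str, panel: dict[str, Model], judge: Model, writer: Model) -> str:
    """Fan out to the panel in parallel, have a judge compare, then write the final answer."""
    if not panel:
        raise ValueError("panel must contain at least one model")
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        answers = {name: future.result() for name, future in futures.items()}
    transcript = "\n".join(f"[{name}] {text}" for name, text in answers.items())
    analysis = judge(transcript)
    return writer(f"Question: {prompt}\nAnalysis: {analysis}")


# Stubs for illustration only.
panel = {"model-a": lambda p: "yes, because X", "model-b": lambda p: "no, because Y"}
judge = lambda transcript: f"{transcript.count('[')} responses compared"
writer = lambda brief: brief

print(fuse("Is a carbon tax effective?", panel, judge, writer))
```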

Call Fusion directly with a single model slug:

{ "model": "openrouter/fusion", "messages": [ { "role": "user", "content": "What are the strongest arguments for and against carbon taxes?" } ] }

Or customize the panel:

{ "model": "openrouter/fusion", "messages": [{ "role": "user", "content": "..." }], "plugins": [{ "id": "fusion", "model": "google/gemini-3-flash-preview", "analysis_models": [ "google/gemini-3-flash-preview", "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro" ] }] }

We chose DRACO to test reasoning, tool calling, and succinctness

We needed a benchmark that could tell the difference between a model that sounds thorough and one that actually is. Standard benchmarks test factual recall or reasoning puzzles. They don’t test the thing Fusion is built for: researching a complex question, synthesizing multiple sources, and producing a comprehensive, well-cited analysis.

DRACO (by Perplexity AI) is designed for this. It contains 100 deep research tasks spanning 10 domains: academic research, finance, law, medicine, technology, UX design, general knowledge, needle-in-a-haystack retrieval, personalized assistance, and product comparison.

Each task comes with a rubric of roughly 39 weighted criteria across four categories:

Factual Accuracy (~20 criteria): verifiable claims the response must get right

Breadth & Depth (~9 criteria): synthesis quality, trade-off analysis, actionable guidance

Presentation Quality (~6 criteria): terminology, formatting, readability

Citation Quality (~5 criteria): primary source citations with working references

Criteria can carry negative weights. Meeting a negative criterion means the response contains an error. For example, dangerous medical advice carries a big penalty. These negative criteria also make it hard to game the score by being verbose: a model that confidently states wrong things gets punished.

Each response is graded per-criterion by a judge model, three independent times. We report the mean normalized score (0-100) across all tasks.
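One plausible reading of that scoring, as a sketch: sum the weights of the criteria the judge marks as met (negative weights subtract), divide by the best achievable total, clamp to 0-100, then average the three gradings. The exact normalization is defined in the DRACO paper and is not reproduced in this post, so the formula and the example weights below are assumptions.

```python
from statistics import mean


def normalized_score(weights: list[float], met: list[bool]) -> float:
    """Score one grading pass on a 0-100 scale (assumed normalization)."""
    best = sum(w for w in weights if w > 0)           # all positives met, no negatives triggered
    raw = sum(w for w, hit in zip(weights, met) if hit)
    return max(0.0, min(100.0, 100.0 * raw / best))


def task_score(weights: list[float], gradings: list[list[bool]]) -> float:
    """Average the independent grading passes for one task."""
    return mean(normalized_score(weights, met) for met in gradings)


# Hypothetical rubric: three positive criteria and one negative (an error penalty).
weights = [5, 3, 2, -4]
gradings = [
    [True, True, False, False],   # misses one criterion
    [True, True, True, False],    # perfect pass
    [True, True, True, True],     # everything met, but the error is flagged too
]
print(task_score(weights, gradings))
```

Note how the third pass is pulled down even though every positive criterion is met: the flagged error costs more than the extra criterion earns.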

DRACO has limitations the authors acknowledge: it evaluates text-only, English-only interactions, and its static task set may not fully generalize to future deep research applications. Absolute scores also depend on judge model choice (the paper reports 10-25 point shifts between judges), though relative system rankings remain stable.

Preventing the Models from Cheating

When we gave the panel models web search, we discovered something alarming: they were finding the DRACO grading rubric online. While this was coincidental from search terms rather than intentional cheating, it still exposed a real contamination risk.

We solved this by excluding the locations where the results are hosted from web search and web fetch, preventing models from accessing pages related to the benchmark rubric. OpenRouter’s server tools support these exclude lists universally across all models by using a third-party provider like Exa or Parallel, so applying them was a one-line config change rather than per-model patching. All results in this post were produced after the exclusion lists were in place.

If you are running your own evals, the same mechanism is available: pass excluded_domains to web_search or blocked_domains to web_fetch in your tool definitions to prevent the panel from accessing specific sources.
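A sketch of what those tool definitions might look like. The field names excluded_domains and blocked_domains come from the text; nesting them under a parameters key mirrors the subagent tool's shape and is an assumption, as is the placeholder domain. research_tools is a hypothetical helper.

```python
def research_tools(blocked: list[str]) -> list[dict]:
    """Build web_search and web_fetch tool entries that skip the given domains."""
    return [
        # Assumed shape: options nested under "parameters".
        {"type": "openrouter:web_search", "parameters": {"excluded_domains": list(blocked)}},
        {"type": "openrouter:web_fetch", "parameters": {"blocked_domains": list(blocked)}},
    ]


# Placeholder domain; substitute wherever your benchmark's rubric is hosted.
tools = research_tools(["rubric-host.example"])
for tool in tools:
    print(tool)
```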

Significant boost from fusing a model with itself

We ran Opus 4.8 partnered with itself as a two-model panel, with Opus 4.8 also serving as the synthesizer. The result: 65.5%, a 6.7-point jump over solo Opus 4.8 (58.8%). This suggests that a meaningful chunk of Fusion’s lift comes from the synthesis step itself, not just from combining different model architectures. Running the same prompt twice produces different reasoning paths, different tool calls, different source selections. That’s not enough to outperform a diverse set of models, but it helps us understand the impact of the synthesis itself.

Notes on our DRACO implementation

We carefully replicated the methodology described in the DRACO paper with the exception of using Gemini 3.1 Pro Preview as judge instead of the paper’s choice of Gemini 3 Pro. This means our scores are not directly comparable to the original paper’s published results.

We wanted to preserve the high human-LLM alignment properties that led to the authors’ selection, while capturing the discernment of the newer model. We sanity-checked our judging with Claude Sonnet 4.6 after Gemini 3.1 Pro Preview scored low on the benchmark itself, finding that it preserved the qualities that led to the authors’ selection as judge. Our goal was to show relative differences between Fusion and individual models.

Give Fusion a try

API: Send "model": "openrouter/fusion" to directly call Fusion, or add {"type": "openrouter:fusion"} to your tools array to let the model decide when to use it. Fusion docs
Chatroom: Open openrouter.ai/fusion and pick a preset or build a custom panel.

6/14 Update: FAQ from the Launch

The response to Fusion has been incredible. Thank you! We are reviewing all of the feedback, suggestions, and bug reports. Several improvements have already shipped, and we’ll continue to address the rest over the next couple of days. Here are answers to some of the most common questions we’ve been asked:

Is Fusion a drop-in replacement for Fable?

No. The benchmark shows that fusing multiple models together can reach and surpass Fable-level performance on deep research tasks. We benchmarked one class of tasks (DRACO deep research), but the approach likely extends to many other workflows we haven’t tested yet. We’d love to hear about other use cases where you find it works well.

DRACO also doesn’t include long-horizon tasks, which is where Fable shines.

How should I use Fusion for coding?

Fusion isn’t a drop-in replacement for coding models. Instead, it gives your coding model access to a server tool. The base model handles routine coding directly and can choose to call Fusion selectively on questions worth spending more time and money to get a thorough answer (e.g. architecture decisions or research on best practice approaches). The model decides when the question warrants multiple perspectives.

What tools did the benchmark models have access to?

Every model, both in Fusion panels and solo runs, had the same three server tools:

openrouter:web_search (via Exa)

openrouter:web_fetch (via Exa)

openrouter:bash

Keeping the tool set identical across all configurations ensured a fair comparison. The Fusion panels and solo runs differed only in whether multiple models’ outputs were synthesized, not in what tools were available.

DeepSeek V4 Pro’s performance was surprising. Is that accurate?

We were surprised by how well DeepSeek scored. At 60.3%, it performed similarly to both Opus 4.8 and GPT-5.5.

One hypothesis: Opus 4.8 would score higher with a larger tool-calling budget. It seems to be a hungrier model that performs better with more time and more tool use. Fable, by contrast, was better at using the tool-call budget judiciously and thinking for longer before acting. The benchmark’s fixed tool-call budget may have compressed the gap between models with different tool-use strategies.

Is it slow? How much slower?

The model you make a request to performs the same as it would normally. The responses are only slower when your model encounters a problem that it thinks will benefit from using Fusion. When Fusion is invoked, it kicks off a multi-step process that is often 2-3x longer than a standard call. During this time it sends your prompt to multiple models, waits for them all to finish, then processes the results to produce the fused response. We did it this way to balance the speed of normal model execution with the availability of beyond-frontier answers to questions when you need it.

What are all the ways I can use Fusion?

There are four ways to use Fusion and they all use the same underlying logic:

Chatroom. Open openrouter.ai/fusion and pick a preset or build a custom panel. No code needed.

Model slug. Send "model": "openrouter/fusion" to any of our inference endpoints and the Fusion plugin is auto-injected with a default panel of frontier models. You can simply swap your model string to use it. Docs

Server tool. Add { "type": "openrouter:fusion" } to your tools array. Most control: pick the model you want to do the fusion, and combine Fusion with other tools. The model you send the request to will decide when and if it invokes Fusion. Docs

Plugin. Make a call to completions or responses like you would normally then add "plugins": [{ "id": "fusion", ... }] with your selected panel. The model you specify in the call will be the one that fuses the results. Docs
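The three API routes differ only in where Fusion appears in the request body. The sketch below builds one body per route. The helper names are hypothetical, the base and fusing model slugs are examples, and the plugin's analysis_models field follows the customized-panel example earlier in this post.

```python
def via_slug(messages: list) -> dict:
    """Model slug: Fusion runs on every request with the default panel."""
    return {"model": "openrouter/fusion", "messages": messages}


def via_server_tool(base_model: str, messages: list) -> dict:
    """Server tool: the base model decides when to invoke Fusion."""
    return {"model": base_model, "messages": messages,
            "tools": [{"type": "openrouter:fusion"}]}


def via_plugin(fusing_model: str, panel: list[str], messages: list) -> dict:
    """Plugin: the model named in the call fuses the selected panel's results."""
    return {"model": fusing_model, "messages": messages,
            "plugins": [{"id": "fusion", "analysis_models": list(panel)}]}


msgs = [{"role": "user", "content": "What are the strongest arguments for and against carbon taxes?"}]
panel = ["google/gemini-3-flash-preview", "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro"]

for body in (via_slug(msgs),
             via_server_tool("anthropic/claude-opus-4.8", msgs),
             via_plugin("google/gemini-3-flash-preview", panel, msgs)):
    print({k: v for k, v in body.items() if k != "messages"})
```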

Jun 11, 2026
Date parsed from source:
Jun 11, 2026

First seen by Releasebot:
Jun 19, 2026
OpenRouter

Dinner is Served

OpenRouter adds its analytics API and spotlights routing tools like auto-router and pareto-router, making it easier to track multi-model adoption and optimize cost across providers while keeping inference simple.
A few weeks ago I was with my team in SF at a conference. We originally had dinner plans at a pretty standard American restaurant but I had other plans and quickly pulled an audible to get us a reservation for sushi. Last minute changes aren’t ideal but neither is a mediocre meal after working a conference booth all day. Plus who doesn’t love sushi?

My wife and I have done sushi omakase for every celebration for the past 10 years. I know my way around the menu without even looking at it. Most inexperienced sushi lovers go straight for the O-toro but there’s so much more out there. So naturally I told everyone I would be ordering for the table. I’ve consistently done this with friends, family, and work colleagues. 100% of the time people are willing to outsource their agency in this specific situation to enjoy a meal.

I have a strongly held opinion about how we break bread: family style, every time.

The case for optionality

But why? On one hand it reduces the risk of ordering solo and your Hawaiian ribeye tasting like garbage. On the other hand it magnifies the memory of “omg that caviar wagyu bite was the single best bite I’ve had all year”. That moment stays with us. I always say for a first visit, taste everything. You can always order more later. But that only works if we do family style.

Standardizing on one LLM is the same as everyone ordering their own entree. You might be optimizing for the safe pick instead of the best outcome. I’ve talked to hundreds of companies who started their AI journey by picking one provider, say OpenAI, Anthropic, or Gemini. When standardizing on one provider, you’re making a bet based on what you know today and what you need today. It’s the age-old story of “no one gets fired for buying and implementing Salesforce”, except that’s kind of changing. By the time I talk to these companies they’re ready to “graduate” to utilizing more than just one model family and one modality. New use cases are showing up every day. It often looks like this:

Start off with OpenAI enterprise and deploy licenses to a select few teams.

Monitor usage across the initial cohort which shows up-and-to-the-right trends.

Give access to additional teams.

New use cases like image generation, transcription, and creative writing emerge.

Find out that OpenAI doesn’t have the best models for your use and you need Gemini.

Go to Gemini or any other provider to set up access and now observability, governance, and provisioning are broken.

The pattern I’m seeing today looks like cost pressure, but it’s deeper than that. Companies have blown through their annual budgets and it’s only June. There’s a strong desire to reduce token usage, and I get it. If you accidentally use Opus 4.8, you might run through your daily budget and have no other options left. Close your laptop and go for a walk.

While the list price didn’t change on Opus 4.7, several people wrote about the “tokenizer tax”. This was a silent tokenizer change by Anthropic that increased input token counts by almost 35%. That’s a meaningful change. Newer, more capable models are increasing in price as well. Anthropic released Fable, which is priced at $10/M input tokens and $50/M output tokens. And it goes higher: OpenAI’s GPT-5.5 Pro lands at $30/M input tokens and $180/M output tokens. Use with caution!

Cost pressure is a forcing mechanism which can lead to either better or worse outcomes. From my reference point, I’m optimistic it’s leading to some better outcomes. I’m also lucky enough to enable these outcomes. This was a common theme during my days working in data infra. So many conversations revolved around compute costs and how much teams were spending on their data warehouse. But this focuses too much on the explicit costs versus the implicit costs. The most strategic leaders flip this conversation on its head. I would often hear “I’m already spending $1M a year on compute costs so I don’t care about reducing that by 30%. Instead, what’s more valuable is if my team of 40 analysts which costs me $7M a year is more productive. I need speed and efficiency when it comes to developer tools”.

Routing is a first class citizen

Before we dive into this next section, a quick note on what OpenRouter actually is. OpenRouter is the canonical marketplace for accessing AI. We make inference just work. We remove all the overhead around picking a provider, picking a model, and understanding things like latency, price, TPS, and model benchmarks.

So now you can access hundreds of LLMs in one place through OpenRouter, behind a clean, standardized API spec. How incredible is it that this exists? All this sounds great, almost like you can have your cake AND eat it, but what do people actually do in reality?

Luckily I was able to pull some data around this. It’s also divine timing that today the team released our analytics API!

Multi-model adoption is a hypothesis we have always had, and now we can clearly see growth trends that support it. That makes sense and is expected, but it obscures the fact that most people could just be trying the newest version of each model. For example, Anthropic has released Opus 4.6, Opus 4.7, and Opus 4.8 all within the graph’s timeline. So what would be more interesting is how users adopt across model families.

Now here we can capture the real growth around users actively spreading inference across model families. This paints a more realistic picture of what continuously graduating looks like. Let’s layer in one more data point around model releases as well.

This is a cumulative chart since release schedules aren’t always consistent. But we can see one big outlier from March to April where we had 90 new model releases. That’s huge! So many more options to pick from at an increasing velocity.

That can also be a little stressful. It’s like going to a restaurant and the menu has 225 items (one of my favorite restaurants). Even with family style you can’t try them all. We obviously thought about this and don’t want our users to have to know the difference between every single model. So we built out things like auto-router and pareto-router to make it easier to pick which model to use.

All this ties back to the cost pressure I mentioned earlier. Companies are actually using OpenRouter in an interesting way. They are able to bring their average weighted cost per token down over time. How is this possible? Well, if you route specific workloads to specific providers and models based on your required outcomes, you can take advantage of, say, DeepSeek V4 Flash, which costs only around $0.10/M tokens on input and $0.20/M tokens on output.

Take this one step further by using a model provider like Cerebras, which has some of the best throughput, and now you’re the maestro like Bradley Cooper, except we’re doing the heavy lifting for you.

Or you route some of your traffic to the flex priority tier on Gemini models to take advantage of 50% off. Again, the choice is yours.
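To make “average weighted cost per token” concrete, here is a minimal sketch of the arithmetic. The prices are the ones quoted in this post; the traffic split, the dictionary labels, and the function name are assumptions I made up purely for illustration.

```python
# Prices quoted in this post, in USD per million tokens (input, output).
# The keys are plain labels for this sketch, not catalog model strings.
PRICES = {
    "fable": (10.00, 50.00),
    "deepseek-v4-flash": (0.10, 0.20),
}


def blended_cost_per_million(mix):
    """Return the traffic-weighted (input, output) price per million tokens.

    `mix` maps a label in PRICES to its share of total tokens; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    blended_in = sum(share * PRICES[name][0] for name, share in mix.items())
    blended_out = sum(share * PRICES[name][1] for name, share in mix.items())
    return blended_in, blended_out


# Hypothetical split: 20% of tokens stay on the frontier model, 80% move to the cheap one.
all_frontier = blended_cost_per_million({"fable": 1.0})
routed = blended_cost_per_million({"fable": 0.2, "deepseek-v4-flash": 0.8})
print(all_frontier)
print(routed)
```

The total spend can stay flat or even grow while the blended rate falls, which is the whole point: the same budget buys more work.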

We have some exciting stuff around model intelligence that we will be sharing soon. Overall, the theme is still the same: we make inference just work.

Semper ad meliora

It feels like the tailwinds that were driving AI adoption are moving towards AI optimization. We understand that the amount of money we are spending on AI inference isn’t going to decrease in any meaningful way, so what’s the next best alternative? It’s bringing the average weighted cost per token down. Organizations are becoming more intentional about democratizing usage across teams while reducing governance risk. This really only happens if you’re not locked into a single provider. When you choose to use different models for different use cases, you gain leverage at each turn. You finally get to sit at the dinner table and eat as you please, family style.
Original source
Jun 10, 2026
Date parsed from source:
Jun 10, 2026

First seen by Releasebot:
Jun 11, 2026
OpenRouter

Advisor: Give Any Model a Lifeline to a Smarter One

OpenRouter adds the openrouter:advisor tool, letting a model call a stronger model mid-generation for help on hard decisions, sanity checks, and complex reasoning. It supports any executor and advisor pairing, named specialists, advisor tools, streaming advice, and cross-request memory.
Add openrouter:advisor to your tools array and your model can ask a stronger model for help mid-generation.

When the executor hits a hard decision, gets stuck, or wants a sanity check before finishing, it calls the advisor with a prompt. The advisor thinks, returns guidance as the tool result, and the executor keeps going with better information.

Both roles are open: any model on OpenRouter can be the executor, and any model from any provider can be the advisor. Run a Gemini executor that consults Claude, or a GPT executor that consults DeepSeek. You pick the pairing.

Try it in the chatroom or read the docs for the full API reference.

{
  "model": "openai/gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Design a rate limiter for a distributed API gateway." }
  ],
  "tools": [
    { "type": "openrouter:advisor", "parameters": { "model": "anthropic/claude-fable-5" } }
  ]
}
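For readers who build requests in code, the same body can be assembled with a small helper. This is a minimal sketch: the function name is hypothetical, and actually sending the body (to the chat completions endpoint, with your API key as a Bearer token) is left out.

```python
import json


def build_advisor_request(prompt, executor, advisor):
    """Build a Chat Completions body that attaches one openrouter:advisor tool."""
    return {
        "model": executor,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {"type": "openrouter:advisor", "parameters": {"model": advisor}}
        ],
    }


body = build_advisor_request(
    "Design a rate limiter for a distributed API gateway.",
    executor="openai/gpt-4o-mini",
    advisor="anthropic/claude-fable-5",
)
print(json.dumps(body, indent=2))
```

Swapping the pairing is then a matter of changing the two model strings.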

67x price gap, selective consultation

Claude Fable 5 costs $10 per million input tokens. GPT-4o Mini costs $0.15 per million. That’s a 67x spread.

Most requests don’t need frontier-level reasoning. A mid-tier model handles the bulk of a workload without issue. But the 10-20% that involves architectural decisions, ambiguous edge cases, or multi-step reasoning chains is where cheaper models stumble.

The advisor tool covers that gap selectively. Your fast model runs the show. When it hits something genuinely hard, it calls for help. You pay frontier prices only for the moments that need frontier thinking.

In an agentic coding session with 50 tool calls, maybe 2-3 are advisor consultations. The rest run at mini prices. You’ve sanded down your per-session cost while keeping the quality ceiling high.

Server-side execution, one tool call

The advisor runs server-side during generation. Your model calls it like any other tool: pass a prompt describing what it needs help with, get back the advisor’s text as the tool result. The model then writes the final answer itself, informed by the advice. The advisor is a consultant, not a ghostwriter.

Four things worth knowing:

Any model, from any provider, can be the advisor. Pin it in the tool config with parameters.model (anything in the model catalog works), or let the executor pick per-call. Use ~anthropic/claude-fable-latest to always resolve to the newest Fable.

The advisor gets its own tools. Give it openrouter:web_search and it’ll ground its advice in fresh sources before responding. It runs as a sub-agent with its own tool loop, then returns just the final guidance.

Recursion is blocked. The advisor can’t call itself. A depth header and self-reference check prevent unbounded nesting, and consultations are capped per request to bound cost.

The advisor remembers. Replay the conversation transcript in a follow-up request (with the advisor tool calls and results included) and each advisor reconstructs its prior consultations, so a follow-up question builds on what the advisor already said. Memory is per advisor (your security reviewer and your architect each keep their own thread) and works across Chat Completions, Responses, and Anthropic Messages. Full details.

Named advisors

For complex workflows, you can configure a roster of specialists. Add one openrouter:advisor entry per advisor, each with its own name, model, instructions, and tool set:

{
  "tools": [
    {
      "type": "openrouter:advisor",
      "parameters": {
        "name": "security-reviewer",
        "model": "anthropic/claude-fable-5",
        "instructions": "You are a security engineer. Find vulnerabilities."
      }
    },
    {
      "type": "openrouter:advisor",
      "parameters": {
        "name": "architect",
        "model": "openai/gpt-5.5",
        "instructions": "You are a systems architect. Prioritize simplicity and scalability."
      }
    }
  ]
}

The executor sees a distinct tool for each advisor and calls whichever fits the task with just a prompt. An auth flow review routes to Claude Fable with the security persona; architecture questions go to GPT-5.5. Names can use letters, digits, spaces, underscores, and dashes (“Lead Architect” works), and must be unique across entries. One entry can omit name to act as the default advisor.
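If you generate the roster programmatically, it can help to check the naming rules client-side before sending. The sketch below is one way to do that. The helper name is hypothetical, and it assumes “letters” means ASCII letters, which the text above does not spell out; the server remains the authority on what is accepted.

```python
import re

# Letters, digits, spaces, underscores, and dashes (ASCII letters assumed).
NAME_PATTERN = re.compile(r"[A-Za-z0-9 _-]+")


def build_advisor_roster(advisors):
    """Turn a list of advisor parameter dicts into openrouter:advisor tool entries.

    Raises ValueError when a name uses other characters, when two entries
    share a name, or when more than one entry omits `name`.
    """
    seen = set()
    unnamed = 0
    tools = []
    for params in advisors:
        name = params.get("name")
        if name is None:
            unnamed += 1
            if unnamed > 1:
                raise ValueError("only one entry may omit name")
        else:
            if not NAME_PATTERN.fullmatch(name):
                raise ValueError(f"invalid advisor name: {name!r}")
            if name in seen:
                raise ValueError(f"duplicate advisor name: {name!r}")
            seen.add(name)
        tools.append({"type": "openrouter:advisor", "parameters": dict(params)})
    return tools


roster = build_advisor_roster([
    {"name": "Lead Architect", "model": "openai/gpt-5.5"},
    {"model": "anthropic/claude-fable-5"},  # no name: acts as the default advisor
])
print(len(roster))
```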

Advice can also stream. Set "stream": true on an advisor entry and you get the advice incrementally as the advisor writes it. In the Responses API that means response.output_text.delta events while the advice is in flight; the completed output item still carries the full text, so consumers that ignore deltas see no difference. (Chat Completions ignores the flag, and Messages-API streaming is a fast-follow.)
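On the consuming side, handling the deltas can be as simple as joining them in order. The sketch below assumes the stream has already been decoded into dicts carrying a `type` and a `delta` key, as in the OpenAI Responses streaming format; that shape is an assumption here, and the sketch does not try to tell advisor advice apart from the executor’s own text.

```python
def collect_text_deltas(events):
    """Join the text carried by response.output_text.delta events, in order."""
    parts = []
    for event in events:
        if event.get("type") == "response.output_text.delta":
            parts.append(event.get("delta", ""))
    return "".join(parts)


# Hand-written sample events, not captured from a real response.
sample_events = [
    {"type": "response.output_text.delta", "delta": "Use a token "},
    {"type": "response.output_text.delta", "delta": "bucket."},
    {"type": "response.completed"},
]
print(collect_text_deltas(sample_events))
```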

How this compares to other advisor tools

Some providers ship a similar advisor concept in their own APIs, but it stays inside their model family: the executor and the advisor both have to come from the same vendor, often from a fixed pairing matrix, and sometimes behind a beta gate. OpenRouter’s advisor removes those constraints and adds a few things on top:

Any model, any provider, on both sides. Both the executor and the advisor can be any of the hundreds of models in the catalog: a cheap open-weights executor consulting a frontier model, a Gemini executor consulting Claude, or a Claude executor getting a second opinion from GPT-5.5 outside its own model family.

A roster of named advisors. Configure multiple specialists with their own models, instructions, and tool sets in a single request, and let the executor route each question to the right one. Single-vendor versions give you one unnamed advisor.

Advisors with their own tools. Hand an advisor openrouter:web_search and it grounds its advice in fresh sources before responding.

Works across API formats, no beta gate. The same tool works through Chat Completions, Responses, and Anthropic Messages (with cross-request memory in all three), and it’s generally available. No beta header, no account-team access request.

If you’re already using a provider-native advisor through one of our compatible API skins, swapping to openrouter:advisor opens up the full catalog without changing the rest of your request.

Billing

Advisor tokens bill at the advisor model’s rates, separate from the executor. If your executor is GPT-4o Mini ($0.15/$0.60 per M tokens) and the advisor is Claude Fable 5 ($10/$50 per M tokens), each model’s tokens bill at their own price. Both show up on your activity page.
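As a rough illustration of the split, the sketch below prices each model’s tokens at its own rates. The rates are the ones quoted above; the token totals are invented for the example and will look different for your workload.

```python
# USD per million tokens (input, output), as quoted above.
EXECUTOR_RATES = (0.15, 0.60)   # GPT-4o Mini
ADVISOR_RATES = (10.00, 50.00)  # Claude Fable 5


def cost_usd(input_tokens, output_tokens, rates):
    """Cost of one model's usage at its own per-million-token rates."""
    in_rate, out_rate = rates
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical session totals: the executor does most of the work,
# the advisor is consulted a handful of times.
executor_cost = cost_usd(400_000, 60_000, EXECUTOR_RATES)
advisor_cost = cost_usd(30_000, 6_000, ADVISOR_RATES)
print(f"executor ${executor_cost:.3f}, advisor ${advisor_cost:.3f}")
```

With these made-up totals the advisor is the larger line item despite handling far fewer tokens, so it is worth watching how often the executor consults it.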

Get started

One line in your tools array:

{ "type" : "openrouter:advisor", "parameters" : { "model" : "anthropic/claude-fable-5" } }

The model decides when to use it. Most requests won’t trigger a consultation; the ones that do will be better for it.

Read the full docs for parameters, named advisors, sub-agent tools, and more.
Original source
Jun 9, 2026
Date parsed from source:
Jun 9, 2026

First seen by Releasebot:
Jun 11, 2026
OpenRouter

Gemini 2.5 Flash API - Pricing, Quickstart & Provider Comparison

OpenRouter adds support for Gemini 2.5 Flash, bringing Google’s reasoning-focused Flash model with built-in thinking, multimodal inputs, provider failover, and one-dashboard billing. The update highlights configurable thinking budgets, pricing details, and production-ready routing controls.
Gemini 2.5 Flash

Gemini 2.5 Flash is Google’s primary model for high-volume, latency-sensitive tasks that require reasoning. It’s the first Flash-class model with built-in thinking, a hybrid reasoning mode you can toggle on or off at will. That distinction makes it meaningfully different from 2.0 Flash and worth evaluating against models that cost significantly more.

Key Capabilities

Gemini 2.5 Flash supports the following input types: text, code, images, audio, video, and documents. For document inputs, two constraints apply in production. The maximum file size is 50MB per document; larger files must be split into sub-50MB chunks before submission. Supported document MIME types are limited to application/pdf and text/plain.
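A pre-flight check along these lines can catch both document constraints before a request is sent. This is a minimal sketch: the function name is hypothetical, the MIME type is guessed from the file extension only, and “50MB” is assumed to mean 50 × 1024 × 1024 bytes.

```python
import mimetypes

MAX_DOCUMENT_BYTES = 50 * 1024 * 1024  # assumption: 50MB read as 50 MiB
ALLOWED_DOCUMENT_TYPES = {"application/pdf", "text/plain"}


def check_document(filename, size_bytes):
    """Return a list of problems that would block submitting this document."""
    problems = []
    mime_type, _ = mimetypes.guess_type(filename)  # extension-based guess
    if mime_type not in ALLOWED_DOCUMENT_TYPES:
        problems.append(f"unsupported MIME type: {mime_type}")
    if size_bytes > MAX_DOCUMENT_BYTES:
        problems.append("file exceeds 50MB and must be split")
    return problems


print(check_document("report.pdf", 12 * 1024 * 1024))
print(check_document("slides.docx", 60 * 1024 * 1024))
```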

What it does not support: audio generation, image generation, and the Live API. If you need image generation, use Gemini 2.5 Flash Image, which is a separate model.

What “Thinking” Means in Practice

The thinking budget is a parameter that controls how much internal reasoning the model performs before generating a response. The reasoning is built into the model and runs at inference time. Setting the budget to 0 disables it entirely, producing the fastest and cheapest output. Setting it to -1 enables dynamic mode, where the model adjusts reasoning depth based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (see Configuring via OpenRouter below). Higher fixed budgets increase output quality on complex tasks at the cost of additional latency and token spend; thinking tokens are billed at the output rate.

Gemini 2.5 Flash API Pricing

The table below shows verified per-million-token rates across the three access methods. All pricing data sourced from ai.google.dev/gemini-api/docs/pricing and openrouter.ai/google/gemini-2.5-flash. Verify OpenRouter and Vertex AI numbers against their live pages on the day of writing; rates update without notice.

Verification date: May 2026

Google AI Studio (paid): Input $0.30 / 1M, Output $2.50 / 1M (incl. thinking), Cache Read $0.03, Cache Storage $1.00/M/hr, Audio Input $1.00

Vertex AI: See Vertex AI pricing

OpenRouter: Input $0.30 / 1M, Output $2.50 / 1M (incl. thinking), Cache Read $0.03, Cache Storage: verify on live page, Audio Input $1.00

Google AI Studio’s paid tier and OpenRouter carry the same per-token rates for text input and output as of May 2026. Where they differ is in what’s wrapped around the API call.

OpenRouter sits between your code and 3 Google providers (AI Studio, Vertex Global, Vertex). If one goes down, your requests reroute to a healthy one. No code changes.

Your integration isn’t welded to Gemini. Change the model string and you’re calling Claude, GPT-4o, Llama, or any of 300+ models. Same base URL, same SDK, same API key. Swap models in seconds without rewriting your client.

Billing collapses into one dashboard: one invoice, one API key, across every model and provider. No juggling separate accounts with Google, Anthropic, and OpenAI.

For teams shipping to production, OpenRouter layers on enterprise controls (provisioning, per-key spend limits, usage analytics, team management). Guardrails and content filtering are configurable per request, so you can enforce safety policies without building your own moderation stack. Prompt logging and observability come baked into the dashboard for debugging production traffic.

OpenRouter charges a 5.5% platform fee on pay-as-you-go (PAYG) credit purchases. That covers the failover, routing, billing, and tooling above. Google AI Studio is the direct path with no intermediary fee, but you’re on your own for failover, model portability, and cross-provider billing. Vertex AI pricing differs; check the Vertex AI pricing page for current rates before plugging them into production cost estimates.

For real-time Gemini 2.5 Flash pricing and uptime across providers, including live cache rates and effective pricing by provider, see the OpenRouter model page. For caching strategies that reduce repeated context costs, see cache pricing details.

Thinking Token Billing

Thinking tokens are billed at the same rate as output tokens. At budget 0, there is no thinking cost. At the maximum budget (24,576 tokens), thinking overhead can exceed the cost of the visible response itself. To estimate the cost for a given workload, multiply your expected thinking tokens by the output rate and add them to your standard output token cost.
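The estimate described above can be sketched as a short function. The rates are the OpenRouter and AI Studio figures from the pricing list in this guide; the token counts in the example are illustrative, and the function name is hypothetical.

```python
INPUT_RATE = 0.30   # USD per 1M input tokens
OUTPUT_RATE = 2.50  # USD per 1M output tokens; thinking tokens use this rate too


def estimate_request_cost(input_tokens, visible_output_tokens, thinking_tokens):
    """Estimate one request's cost, counting thinking tokens at the output rate."""
    input_cost = input_tokens * INPUT_RATE / 1_000_000
    output_cost = (visible_output_tokens + thinking_tokens) * OUTPUT_RATE / 1_000_000
    return input_cost + output_cost


# Same prompt and visible answer, with thinking off and at the maximum budget.
no_thinking = estimate_request_cost(2_000, 500, 0)
max_thinking = estimate_request_cost(2_000, 500, 24_576)
print(f"{no_thinking:.5f} vs {max_thinking:.5f}")
```

In this example the thinking tokens, not the visible answer, account for nearly all of the cost at the maximum budget.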

Free Access Options

Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service for the full data usage policy. If your use case involves user data or requires data not to be used for model training, you must use the paid tier.

OpenRouter does not include Gemini 2.5 Flash in its free tier. A minimum $5 credit balance is required.

Vertex AI provides $300 in trial credits for new Google Cloud accounts, which can be applied toward Gemini 2.5 Flash usage during the evaluation.

API Quickstart: First Request in Under 5 Minutes

The OpenRouter path requires no Google Cloud account and works with any OpenAI-compatible SDK. The Google direct path requires a Google account and the google-genai SDK. For additional SDK examples and configuration options, see the OpenRouter quickstart.

Step 1: Get Your API Key

OpenRouter path: get your OpenRouter API key. No Google Cloud account required.

Google direct path: Get a key at aistudio.google.com/apikey.

Step 2: Set the Base URL (OpenRouter Path)

The OpenRouter base URL is https://openrouter.ai/api/v1. All three code examples below use this endpoint.

Step 3: Make Your First Request

Code examples given for cURL, Python (OpenAI SDK), TypeScript (OpenAI SDK), and Google Direct Path (Python with google-genai SDK).
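As a rough illustration of the OpenRouter path, the sketch below builds a first request using only the Python standard library rather than the OpenAI SDK. The environment variable name and the placeholder key are assumptions, and the network call is left commented out so the snippet runs without credentials.

```python
import json
import os
import urllib.request

BASE_URL = "https://openrouter.ai/api/v1"


def build_chat_request(api_key, prompt, model="google/gemini-2.5-flash"):
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# OPENROUTER_API_KEY is an assumed variable name; the fallback is a placeholder.
api_key = os.environ.get("OPENROUTER_API_KEY", "sk-placeholder")
request = build_chat_request(api_key, "Say hello.")
print(request.full_url)

# To actually send it (needs a real key and network access):
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```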

The direct path uses the google-genai SDK, which is not OpenAI-compatible. Switching from OpenRouter to the direct path requires changing both your client library and request structure. There is no provider failover on the direct path.

Thinking Budget: Control Reasoning Quality and Cost

The thinking budget is the most important configuration decision you’ll make with this model. Set it wrong and you either overpay for reasoning you don’t need or leave accuracy on the table for tasks that require it. For the full parameter reference, see configure the thinking budget.

Budget Levels and Trade-offs

Set the thinkingBudget parameter in your request config. The range is 0 to 24,576 tokens.

Budget 0: Thinking disabled. Fastest response, lowest cost, no reasoning overhead. Use for high-volume classification, extraction, and summarization where structured reasoning is unnecessary.

Budget -1 (dynamic): The model auto-selects its reasoning depth based on prompt complexity. This is the default on Google’s direct API. Via OpenRouter, you must explicitly set reasoning.max_tokens to -1 to get dynamic mode; omitting the reasoning config disables thinking. Recommended for most workloads that need reasoning, because it avoids paying for heavy reasoning on simple prompts while still engaging it when the task requires it.

Budget 1,024 to 8,192: Moderate to heavy reasoning. Use for multi-step analysis, structured coding tasks, and research-style questions.

Budget 24,576 (maximum): Maximum reasoning depth, maximum cost. Use for complex math, scientific problems, and hard coding challenges where accuracy justifies the overhead.

Critical Constraints

Two constraints will produce errors in production if you aren’t aware of them before writing your first request:

thinkingBudget and thinkingLevel cannot be used in the same request. thinkingBudget is for Gemini 2.5 series models. thinkingLevel is for Gemini 3 series models. Using both returns a 400 error.

Structured JSON output and Search Grounding are mutually exclusive. You cannot enable both in the same request.
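Both constraints are easy to check before a request leaves your code. The sketch below is one way to do it, under stated assumptions: thinkingBudget and thinkingLevel are the parameter names from the text above, while `structured_output` and `search_grounding` are hypothetical flag names standing in for however your client enables those features.

```python
def find_config_conflicts(config):
    """Return human-readable conflicts for the two constraints described above."""
    conflicts = []
    if "thinkingBudget" in config and "thinkingLevel" in config:
        conflicts.append("thinkingBudget and thinkingLevel cannot be combined")
    # `structured_output` and `search_grounding` are hypothetical flag names.
    if config.get("structured_output") and config.get("search_grounding"):
        conflicts.append(
            "structured JSON output and Search Grounding are mutually exclusive"
        )
    return conflicts


# "high" is a placeholder value for thinkingLevel.
print(find_config_conflicts({"thinkingBudget": 1024, "thinkingLevel": "high"}))
print(find_config_conflicts({"thinkingBudget": 0}))
```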

Configuring via OpenRouter

With the OpenAI Python SDK, use the extra_body parameter with the reasoning key to set the thinking budget through OpenRouter’s API. The budget itself goes in reasoning.max_tokens.

To disable thinking entirely, set reasoning.max_tokens to 0. To use dynamic mode, set it to -1.
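A small helper can keep the budget inside the documented range before it reaches the API. This is a minimal sketch with a hypothetical function name; it builds only the extra_body dict and makes no request.

```python
MAX_THINKING_BUDGET = 24_576
DYNAMIC = -1


def reasoning_extra_body(budget):
    """Return an extra_body dict carrying the thinking budget under `reasoning`."""
    if budget != DYNAMIC and not 0 <= budget <= MAX_THINKING_BUDGET:
        raise ValueError(
            f"budget must be -1 or between 0 and {MAX_THINKING_BUDGET}"
        )
    return {"reasoning": {"max_tokens": budget}}


print(reasoning_extra_body(DYNAMIC))  # dynamic mode
print(reasoning_extra_body(0))        # thinking disabled
print(reasoning_extra_body(4_096))    # fixed budget
```

The returned dict is what you would pass as extra_body= on the chat completions call.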

Cross-Provider Performance

OpenRouter routes Gemini 2.5 Flash through three Google providers and tracks real-time throughput, Time to First Token (TTFT), end-to-end latency, and uptime for each. The differences between providers are significant enough to affect the choice of provider for latency-sensitive workloads.

All numbers below require live verification against openrouter.ai/google/gemini-2.5-flash.

Performance by Provider

Source: OpenRouter live model page.

Google Vertex (Global): Avg Throughput ~75 tok/s; Avg TTFT ~0.63s; Avg E2E Latency and Uptime: Verify on live page

Google AI Studio: Verify on live page

Google Vertex: Verify on live page

The Vertex Global provider shows the highest throughput in recent data. AI Studio historically shows the best uptime. Standard Vertex shows the highest latency of the three. When you route through OpenRouter without specifying a provider, it automatically distributes traffic to the healthiest option based on real-time signals.
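If you would rather pin a provider than rely on automatic routing, the request body can carry a provider routing object. The sketch below is illustrative only: the `order` and `allow_fallbacks` field names are recalled from OpenRouter’s provider routing documentation and should be verified there, and "google-vertex" is a placeholder slug.

```python
def with_provider_preference(body, order, allow_fallbacks=True):
    """Return a copy of `body` with a provider routing object attached."""
    routed = dict(body)
    routed["provider"] = {
        "order": list(order),
        "allow_fallbacks": allow_fallbacks,
    }
    return routed


base = {
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "ping"}],
}
# "google-vertex" is a placeholder slug; check the model page for the real ones.
pinned = with_provider_preference(base, ["google-vertex"], allow_fallbacks=False)
print(pinned["provider"])
```

Turning fallbacks off trades the automatic failover described above for predictable latency from a single provider.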

For real-time Gemini 2.5 Flash pricing and uptime, see the OpenRouter model page.

Gemini 2.5 Flash vs Flash Lite vs Pro

Choose based on your workload requirements:

Use Gemini 2.5 Flash for most agentic and reasoning workloads. It’s the default recommendation when you need thinking capability without incurring Pro-level costs.

Use Gemini 2.5 Flash Lite for high-volume classification, extraction, or translation tasks where thinking isn’t required and cost per request is the primary constraint. Thinking is disabled by default on Flash Lite.

Use Gemini 2.5 Pro for complex reasoning tasks where accuracy justifies a 5 to 10x cost premium over Flash: frontier mathematics, hard coding challenges, and multi-step scientific analysis.

Technical Specifications

The table below is the canonical reference for Gemini 2.5 Flash. For the authoritative version, see the Google AI for Developers model page (updated 2026-04-01) and the Vertex AI docs (updated 2026-04-03).

Model ID: gemini-2.5-flash

OpenRouter model string: google/gemini-2.5-flash

Context window: 1,048,576 tokens

Max output: 65,536 tokens

Input types: Text, images, video, audio, code, documents (PDF and text/plain only, 50MB max)

Output types: Text

Thinking budget range: 0 to 24,576 tokens (default on Google’s direct API: dynamic / -1; off by default via OpenRouter)

Knowledge cutoff: January 2025

GA release: June 17, 2025

Discontinuation: October 16, 2026

Supported capabilities: Function calling, structured outputs, code execution, Search Grounding, Batch API, context caching (implicit and explicit), file search, URL context

Not supported: Audio generation, image generation, Live API, thinkingLevel parameter

Deprecation notice:

Gemini 2.5 Flash is scheduled for discontinuation on October 16, 2026, on Vertex AI. If you’re building for production use cases that extend beyond that date, plan a migration to a successor model and monitor ai.google.dev/gemini-api/docs/models for updates.

Frequently Asked Questions

Is Gemini 2.5 Flash free to use?

Google AI Studio provides a free tier with rate limits. On the free tier, your prompts and responses are used to improve Google’s products; see the terms of service before using it with user data. OpenRouter does not include Gemini 2.5 Flash in its free tier; a minimum $5 credit balance is required. Vertex AI provides $300 in trial credits for new Google Cloud accounts.

What is the thinking budget in Gemini 2.5 Flash?

The thinkingBudget parameter (range: 0 to 24,576 tokens, or -1 for dynamic) controls how much internal reasoning the model performs before responding. Budget 0 disables thinking: fastest and cheapest. Budget -1 enables dynamic mode: the model auto-adjusts based on prompt complexity. On Google’s direct API, -1 is the default. Via OpenRouter, thinking is off unless you explicitly request it (e.g. extra_body={"reasoning": {"max_tokens": -1}} for dynamic, or any positive budget). Higher fixed budgets improve output quality on complex tasks but increase latency and cost, billed at the output token rate.

How does Gemini 2.5 Flash compare to GPT-4o?

Flash supports a 1M-token context window, versus 128K for GPT-4o, and includes configurable thinking not available in GPT-4o. Flash’s per-token pricing is lower. GPT-4o has broader third-party ecosystem support and a longer production track record. This guide doesn’t publish head-to-head benchmark results for the two models on the same evaluations; use the OpenRouter rankings for current third-party evaluation data.

Can I use Gemini 2.5 Flash for image generation?

No. Gemini 2.5 Flash outputs text only. Image input is supported; the model can process and reason about images. For image generation, use Gemini 2.5 Flash Image, a separate model with its own pricing.

What providers serve Gemini 2.5 Flash on OpenRouter?

Three: Google AI Studio, Google Vertex Global, and Google Vertex. OpenRouter routes to the healthiest provider automatically based on real-time throughput and uptime data. You can pin to a specific provider using OpenRouter’s provider routing controls.
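Pinning might look like the sketch below, which attaches a provider routing object to an existing payload. The helper name pin_provider and the slug "google-vertex" are assumptions for illustration. Check the provider routing docs for the exact provider identifiers and field names before relying on them.

```python
def pin_provider(payload, provider_slug, allow_fallbacks=False):
    """Return a copy of payload that prefers a single provider.

    provider_slug is a placeholder; the real identifier for each
    provider should be taken from OpenRouter's documentation.
    """
    pinned = dict(payload)  # shallow copy so the caller's dict is untouched
    pinned["provider"] = {
        "order": [provider_slug],            # try this provider first
        "allow_fallbacks": allow_fallbacks,  # False: do not fall back to others
    }
    return pinned


base = {
    "model": "google/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Hello"}],
}
pinned = pin_provider(base, "google-vertex")  # slug is an assumption
print(pinned["provider"])
```

Leaving allow_fallbacks at False trades automatic failover for predictability, so it is worth enabling fallbacks unless a specific provider is a hard requirement.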

What is the difference between Gemini 2.5 Flash and Flash Lite?

Flash includes configurable thinking (budget 0 to 24,576) and higher-quality output. Flash Lite is optimized for ultra-low latency and cost, with thinking disabled by default (though it can be enabled). Use Flash when reasoning capability matters; use Lite for high-volume tasks where cost per request is the primary constraint.
Original source
Jun 4, 2026
Date parsed from source:
Jun 4, 2026

First seen by Releasebot:
Jun 5, 2026
OpenRouter

June 4, 2026

OpenRouter adds a beta advisor server tool, expands video generation to accept audio and video input references, fixes an Activity chart usage bug, and adds NVIDIA Nemotron 3.5 Content Safety.

Product changes

Advisor server tool (beta) -- Added openrouter:advisor, a new server tool that lets a model consult a higher-intelligence advisor model mid-inference and use the response to inform its own answer. It supports multiple named advisor profiles. Share your feedback on Discord.

Audio and video input references for video generation -- The video generation input_references array now accepts audio_url and video_url types alongside image_url, opening up video editing and richer reference-to-video workflows. Docs

Fixed: Activity chart understating total usage for accounts with many API keys -- The "Others" aggregation bucket in Activity charts was being overwritten instead of accumulated, causing accounts with many API keys to see understated totals.
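The overwrite-versus-accumulate bug in the Activity chart fix above is easy to reproduce in miniature. The sketch below is not OpenRouter's code; the function name, the sample data, and the cutoff of five named entries are all assumptions chosen for illustration.

```python
def bucket_usage(usage_by_key, top_n=5):
    """Keep the top_n keys by token count and fold the rest into "Others"."""
    ranked = sorted(usage_by_key.items(), key=lambda kv: kv[1], reverse=True)
    buckets = dict(ranked[:top_n])
    for _key, tokens in ranked[top_n:]:
        # Accumulate into the bucket. The buggy variant would be
        # buckets["Others"] = tokens, which keeps only the last value.
        buckets["Others"] = buckets.get("Others", 0) + tokens
    return buckets


usage = {
    "key-a": 500, "key-b": 400, "key-c": 300, "key-d": 200,
    "key-e": 100, "key-f": 50, "key-g": 25,
}
buckets = bucket_usage(usage)
print(buckets["Others"])                               # 75
print(sum(buckets.values()) == sum(usage.values()))    # True
```

Checking that the bucketed total equals the raw total is a cheap invariant that would have caught the understated chart.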

New models

NVIDIA: Nemotron 3.5 Content Safety (free)
Original source
Jun 3, 2026
Date parsed from source:
Jun 3, 2026

First seen by Releasebot:
Jun 5, 2026
OpenRouter

June 3, 2026

OpenRouter adds guardrails picker shortcuts, a Cursor IDE endpoint, clearer router activity charts, and a fix for Anthropic prompt caching, while also adding Qwen3.7 Plus support.

Product changes

Guardrails: "Select all" and collapsible member lists -- The member and API key pickers in the guardrails editor now support "Select all / Deselect all" with one click, and long lists of selected items collapse behind a "+N" toggle. Docs

Cursor IDE endpoint -- Added a Cursor-compatible /api/v1/cursor endpoint, enabling Cursor IDE users to connect to OpenRouter directly.

"Others" bucket in router activity charts -- The token usage breakdown chart on router pages now aggregates models beyond the top 5 into an "Others" category, so every bar segment in the chart is accounted for in the legend.

Fixed: Anthropic prompt caching broken when using server tools -- Requests with cache_control and server tools (such as datetime or web_search) now correctly forward the caching directive, restoring prompt caching for Anthropic models in the Responses API.

New models

Qwen: Qwen3.7 Plus
Original source
Jun 2, 2026
Date parsed from source:
Jun 2, 2026

First seen by Releasebot:
Jun 4, 2026

Modified by Releasebot:
Jun 5, 2026
OpenRouter

June 2, 2026

OpenRouter adds rankings chart percentage and raw token volume toggles, deep links to individual Activity messages, and a fix for Gemini media_resolution handling. It also brings new Microsoft models: MAI-Voice-2, MAI-Transcribe 1.5, and MAI-Image-2.5.

Product changes

Rankings percentage toggle -- Bar charts on the rankings page now have a menu button that lets you switch between percentage-normalized and raw token-volume views.

Activity deep links to individual messages -- You can now deep-link to a specific message inside the Activity prompt overlay by appending ?message=<n> to the URL, and the selected row stays in sync as you navigate with prev/next buttons. Docs

Fixed: media_resolution parameter silently dropped for Gemini models -- Sending media_resolution (e.g. MEDIA_RESOLUTION_MEDIUM) to Gemini models now correctly forwards the value to Gemini's generationConfig, changing image token counts as expected.
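For the Activity deep links described above, a small helper can set the message query parameter without disturbing any other parameters already on the URL. The function name and the example.com URL are placeholders, since the exact Activity overlay path is not given here.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_message(url, n):
    """Return url with ?message=<n> set, replacing any existing value."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keeps the original parameter order
    query["message"] = str(int(n))
    return urlunparse(parts._replace(query=urlencode(query)))


# Placeholder URL; substitute the real Activity overlay link.
print(with_message("https://example.com/activity?tab=logs", 3))
# https://example.com/activity?tab=logs&message=3
```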

New models

Microsoft: MAI-Voice-2

Microsoft: MAI-Transcribe 1.5

Microsoft: MAI-Image-2.5
Original source
Jun 1, 2026
Date parsed from source:
Jun 1, 2026

First seen by Releasebot:
Jun 11, 2026
OpenRouter

May Release Spotlight

OpenRouter ships a major May update with Workspace Guardrails, new Speech and Transcription APIs, Model Fusion and Comparison, private models, stronger enterprise controls, preset and routing improvements, better logs and budgets, and 20 new models across text, speech, image, video, and coding.

We closed our $113M Series B, and we’re now routing 100 trillion tokens a month. Here’s everything else that shipped in May.

Workspace Guardrails

Centralized security and governance for every request routed through your workspace. Set per-member and per-key spend limits, lock traffic to a model and provider allowlist, enforce zero data retention, block prompt injection against 30+ OWASP-derived patterns, and redact PII before it reaches a provider. Layer the rules into one guardrail, or scope them to specific API keys and members, with no code changes.

Speech and Transcription APIs

Add voice to any application through the same API key you already use. Speech-to-text is live with Whisper, GPT-4o Mini Transcribe, and Voxtral; text-to-speech exposes supported_voices in the models API. Provider failover and upstream error passthrough are built into both.

Model Fusion

Route your prompt to multiple models in parallel and synthesize their responses into a single, higher-quality answer. Model Fusion is now available as an API plugin, a server tool, and in the chatroom composer. You get an ensemble of experts in a single call instead of relying on one model.
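The fan-out-then-synthesize pattern behind this feature can be sketched locally. This is not how Model Fusion is implemented. The model callables and the synthesize function below are stand-ins that a real client would replace with API calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(prompt, models, synthesize):
    """Send prompt to every model in parallel, then merge the drafts.

    models:     dict of name -> callable(prompt) returning a string
    synthesize: callable(prompt, drafts) returning the final answer
    """
    if not models:
        raise ValueError("at least one model is required")
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        drafts = {name: fut.result() for name, fut in futures.items()}
    return synthesize(prompt, drafts)


# Stand-in models and synthesizer, for illustration only.
stub_models = {
    "model-a": lambda p: p.upper(),
    "model-b": lambda p: p[::-1],
}


def join_drafts(prompt, drafts):
    return " | ".join(f"{name}: {text}" for name, text in sorted(drafts.items()))


print(fuse("abc", stub_models, join_drafts))  # model-a: ABC | model-b: cba
```

In practice the synthesizer would itself be a model call that reads the drafts, which is where the quality gain would come from.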

Model Comparison

Compare up to five models side by side on pricing, context length, and benchmark scores. The rebuilt comparison page includes a “Highlight best” toggle, provider-coded benchmark charts for Intelligence, Coding, and Agentic metrics, and interactive slot cards to quickly add models.

Private Models (Enterprise)

Route to your own custom, fine-tuned, or dedicated model endpoints through the standard completions and responses API. Your private models get the same guardrails, observability, and billing as any public model on the platform. Available exclusively on the Enterprise plan.

Pareto Code Router

Set min_coding_score and route to the cheapest code-capable model that clears your quality bar. Your coding agents stop overpaying for good-enough code. Configurable defaults per workspace in plugin settings.
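The selection rule described here (cheapest model that clears a quality bar) can be illustrated with a few lines of local code. The catalog entries, scores, and prices below are invented, and the router's real scoring and tie-breaking are not specified in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    coding_score: float    # hypothetical benchmark score
    price_per_mtok: float  # hypothetical USD per million tokens


def cheapest_above(catalog, min_coding_score):
    """Return the cheapest model whose score meets the bar, or None."""
    eligible = [m for m in catalog if m.coding_score >= min_coding_score]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.price_per_mtok)


# Invented catalog, for illustration only.
catalog = [
    Model("frontier-coder", 92.0, 15.00),
    Model("mid-coder", 81.0, 3.00),
    Model("budget-coder", 64.0, 0.40),
]

print(cheapest_above(catalog, 80).name)  # mid-coder
print(cheapest_above(catalog, 60).name)  # budget-coder
print(cheapest_above(catalog, 99))       # None
```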

Enterprise & Workspace Controls

A set of releases for teams running OpenRouter at scale:

IP allowlist enforcement. API keys with an IP allowlist now actively block requests from unauthorized IPs with a 403, upgraded from observe-only mode.

BYOK management API. Programmatically list, create, update, and delete bring-your-own-key credentials across workspaces. Keys are now grouped by priority with drag-and-drop reordering and a one-click “Test Key” for failed requests.

Observability destinations API. CRUD endpoints for managing Datadog, Langfuse, LangSmith, and other observability integrations via management key.

Per-provider ZDR controls. Separate Zero Data Retention toggles for non-frontier, Anthropic, OpenAI, and Google providers, so you can meet compliance requirements per provider without restricting your entire model catalog.

Copy guardrails across workspaces. Standardize safety policies across all workspaces in a few clicks via the “Copy to…” menu.
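The IP allowlist enforcement item above amounts to a membership check that returns a 403 on a miss. A minimal sketch using Python's standard ipaddress module follows. The function name is hypothetical, and treating an empty allowlist as "no restriction" is an assumption of this sketch.

```python
import ipaddress


def status_for(client_ip, allowlist):
    """Return 200 if client_ip matches an allowlist entry, else 403.

    allowlist entries may be single addresses or CIDR ranges.
    """
    if not allowlist:
        return 200  # assumption: no allowlist configured means no restriction
    addr = ipaddress.ip_address(client_ip)
    for entry in allowlist:
        # strict=False accepts entries with host bits set, e.g. 10.0.0.5/24
        if addr in ipaddress.ip_network(entry, strict=False):
            return 200
    return 403


allow = ["203.0.113.0/24", "198.51.100.7"]
print(status_for("203.0.113.42", allow))  # 200
print(status_for("198.51.100.8", allow))  # 403
```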

Also shipped this month

Presets API. Create or version a preset directly from an inference request body, now with Anthropic Messages and Responses skins, plus TypeScript and Python SDK support.

Human-in-the-loop tools. A new SDK tool type that pauses execution and waits for human input before returning results, for agents that need human judgment mid-task.

Session-id provider stickiness. Requests sharing a session_id now route to the same provider and pin to the same concrete model across turns, improving cache hit rates for multi-turn agentic workflows.

Auto router cost_quality_tradeoff. A 0 to 10 integer replacing the old binary toggle for finer control over cost versus quality when using the auto router.

Redesigned model pages. New model page header, step-by-step API tab with /responses and /messages endpoints, full-screen model selector, and playground side panel for inline testing.

Requests tab in logs. Full request-level drill-down alongside generation logs, with request ID filtering and time picker shorthand (15min, 1h, 3d).

Improved coding agent attribution. Cursor, GitHub Copilot, Cline, RooCode, Kilo Code, Zed, and OpenCode are now properly identified in activity logs so you can see which tools drive your usage.

Usage & Budgets on API keys. Spend charts and budget progress by guardrail layer, directly on each API key.

Rankings daily dataset. GET /api/v1/datasets/rankings-daily returns top-50 models by daily token volume for programmatic analysis.
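A request for the rankings dataset above could be built as in the sketch below, using only the standard library. The path comes from the note. Whether the endpoint requires a Bearer token is an assumption, and the response schema is not described here, so the sketch returns the parsed JSON without interpreting it.

```python
import json
import urllib.request

BASE_URL = "https://openrouter.ai"


def build_rankings_request(api_key):
    """Build (but do not send) the GET request for the daily rankings."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/datasets/rankings-daily",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: auth is required
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_rankings(api_key, timeout=30):
    """Send the request and return the decoded JSON body as-is."""
    request = build_rankings_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_rankings_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Calling fetch_rankings with a real key performs the network request; the builder is kept separate so the URL and headers can be inspected or tested offline.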

New models

20 models launched in May, spanning text, speech, image, video, and coding:

Anthropic Claude Opus 4.8: Anthropic’s latest Opus with mid-session system support, plus a fast variant

Google Gemini 3.5 Flash: Google’s newest Flash model

xAI Grok 4.3: xAI’s latest frontier model

xAI Grok Imagine Video: Video generation from xAI

xAI Grok Build 0.1: xAI’s code generation model

Qwen Qwen3.7 Max: Qwen’s latest max-tier model

Recraft V3, V4, V4 Pro: Three new image generation models

Mistral Voxtral Mini Transcribe: Mistral’s speech-to-text model

Plus: Gemini 3.1 Flash Lite, GPT Chat Latest, CoBuddy (free), Ring-2.6-1T (free), Perceptron Mk1, and more.

Everything above is live now.
Original source