AI Voice and Speech Release Notes
Release notes for AI voice synthesis, text-to-speech and audio generation tools
Latest AI Voice and Speech Updates
- Jun 1, 2026
- Date parsed from source:Jun 1, 2026
- First seen by Releasebot:Jun 2, 2026
June 1, 2026
ElevenLabs adds Exotel telephony integration, workflow-aware agent transfers, repeat agent test runs, richer conversation filtering, larger speech to text uploads, hCaptcha support for dubbing, a new workspace API key revoke endpoint, and refreshed SDK releases.
ElevenAgents
Exotel telephony integration: Added Exotel as a first-class telephony provider alongside Twilio and SIP trunking. New Exotel outbound call endpoint (POST /v1/convai/exotel/outbound-call) initiates outbound calls via the Exotel Connect API with agent_id, agent_phone_number_id, to_number, optional conversation_initiation_client_data and optional telephony_call_config. Exotel phone number create, list, get and update schemas are available, and exotel is added to TelephonyProvider, conversation initiation source and transfer result enums.
Workflow-aware agent transfers: AgentTransfer now allows a nullable agent_id so transfers can stay on the current agent, adds optional node_id to start at a specific node in the destination workflow, and only requires condition. Successful transfer_to_agent tool results can include nullable to_node, aligning transfers with workflow-node targets.
Agent test repeat runs: Run tests on the agent accepts optional repeat_count (integer, 1–20, default 1) to run each test multiple times. When repeat_count is greater than 1, responses include bucketing_status and result_groups for grouped summaries. Unit test create, update and summary schemas add optional conversation_initiation_source so tests can simulate a specific channel.
Conversation filtering: List conversations adds optional workflow_node_entered_id (string) to filter conversations that entered a given workflow node. Text search conversation messages adds optional topic_ids (array of strings) to filter by topic IDs assigned during topic discovery.
Agent versioning query parameters deprecated: enable_versioning on create agent and enable_versioning_if_not_enabled on update agent are marked deprecated. All agents are versioned and these parameters are ignored.
Procedure compiler mode: Removed append from ProcedureCompilerMode.
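As a rough illustration of the Exotel outbound call endpoint described in this section, the request could be assembled as below. The path and body field names come from the note above; the base URL, the xi-api-key header and the helper name build_exotel_outbound_call are assumptions for this sketch, and the request is built but never sent.

```python
import json
import urllib.request

# Assumption: the base URL and auth header are not stated in the release note.
BASE_URL = "https://api.elevenlabs.io"


def build_exotel_outbound_call(api_key, agent_id, agent_phone_number_id, to_number,
                               conversation_initiation_client_data=None,
                               telephony_call_config=None):
    """Build (but do not send) a POST request for the Exotel outbound call endpoint."""
    body = {
        "agent_id": agent_id,
        "agent_phone_number_id": agent_phone_number_id,
        "to_number": to_number,
    }
    # The two optional fields are only included when the caller provides them.
    if conversation_initiation_client_data is not None:
        body["conversation_initiation_client_data"] = conversation_initiation_client_data
    if telephony_call_config is not None:
        body["telephony_call_config"] = telephony_call_config
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/convai/exotel/outbound-call",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would issue the call; the response shape is not described in the note, so it is left out here.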
Workspaces
Revoke API key: Added revoke API key endpoint (DELETE /v1/workspaces/api-keys/revoke) with required api_key_name query parameter to revoke a workspace API key by name. This endpoint is destructive and requires additional account permissions, even though the documentation is public. Contact your ElevenLabs representative to enable it for your account.
Resource access disclosure: Resource access response models add optional access_source (creator, explicit, workspace_default, workspace_admin) so clients can disclose why a user has access to a shared resource.
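A minimal sketch of how a client might form the revoke API key request noted above. The DELETE path and the api_key_name query parameter are from the note; the base URL and the helper name revoke_api_key_request are assumptions.

```python
from urllib.parse import urlencode

# Assumption: the base URL is not stated in the release note.
BASE_URL = "https://api.elevenlabs.io"


def revoke_api_key_request(api_key_name):
    """Return the (method, url) pair for revoking a workspace API key by name."""
    if not api_key_name:
        raise ValueError("api_key_name is required")
    # urlencode escapes spaces and other reserved characters in the key name.
    query = urlencode({"api_key_name": api_key_name})
    return "DELETE", f"{BASE_URL}/v1/workspaces/api-keys/revoke?{query}"
```

Because the endpoint is destructive, a caller would probably want a confirmation step before sending the request.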
Speech to Text
Larger upload limit: Convert speech to text now accepts audio and video files up to 5.0GB, increased from 3.0GB.
Dubbing
hCaptcha support: Dub a video or audio file request body adds optional hcaptcha_token (string) for bot protection on dubbing project creation.
Zip render format: RenderType adds zip alongside existing output formats such as mp4, aac, mp3, wav, aaf, tracks_zip and clips_zip.
SDK Releases
Packages
@elevenlabs/[email protected] - Added configurable microphone input chunk duration via inputChunkDurationMs (default 25ms).
@elevenlabs/[email protected] - Exposed inputChunkDurationMs for microphone input chunk duration and updated the client dependency to @elevenlabs/[email protected].
@elevenlabs/[email protected] - Updated dependencies to @elevenlabs/[email protected] and @elevenlabs/[email protected].
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Updated widget dependencies to @elevenlabs/[email protected].
Flutter SDK v0.6.1 - Hardened LiveKit session teardown against uncaught stream errors.
API
- May 25, 2026
- Date parsed from source:May 25, 2026
- First seen by Releasebot:May 27, 2026
May 25, 2026
ElevenLabs introduces Speech Engine for real-time voice on custom chat agents and adds text behavior overrides, new integrations, OTLP trace output, music generation modes, and SDK updates.
Introducing Speech Engine
ElevenLabs Speech Engine adds real-time voice to your own chat agent or LLM. ElevenLabs handles speech-to-text, turn-taking, text-to-speech and browser playback while your server owns the agent logic and streams response text over a Speech Engine WebSocket. Use it when you want voice on a custom runtime rather than a fully hosted ElevenAgents configuration.
ElevenAgents
- Text behavior overrides: Added text_behavior_overrides, a per-ConversationInitiationSource map of BehaviorOverride objects with optional verbosity, output_format and interaction_budget fields for channel-specific agent behavior.
- Integration sources: Added Intercom, Telegram and Freshdesk.
- OTLP conversation traces: Get conversation details now accepts an optional format query parameter. Set format=otlp_traces to return OTLP-compatible trace data alongside the standard conversation payload.
- ASR keyword overrides: Added ASRConversationalConfigOverride and ASRConversationalConfigOverrideConfig schemas with optional keywords arrays, wired into conversation config client override models.
- Webhook auth metadata: Webhook tool configuration schemas now expose optional auth_resolved_params (string array) documenting URL placeholders resolved from the auth connection.
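The OTLP conversation traces option listed above could be requested roughly as follows. Only the format=otlp_traces query parameter is taken from the note; the base URL, the conversation details path and the helper name are assumptions.

```python
from urllib.parse import quote, urlencode

# Assumptions: neither the base URL nor the conversation details path
# appears in the release note; both are illustrative.
BASE_URL = "https://api.elevenlabs.io"
CONVERSATION_PATH = "/v1/convai/conversations"


def conversation_details_url(conversation_id, otlp_traces=False):
    """Build the Get conversation details URL, optionally asking for OTLP traces."""
    url = f"{BASE_URL}{CONVERSATION_PATH}/{quote(conversation_id, safe='')}"
    if otlp_traces:
        # format=otlp_traces adds OTLP-compatible trace data to the payload.
        url += "?" + urlencode({"format": "otlp_traces"})
    return url
```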
Music
- Generation mode: Added MusicGenerationMode (track, loop, ambience) and an optional generation_mode field on music prompt request bodies.
- Video to music model: Video to music (POST /v1/music/video-to-music) now accepts optional model_id (string, default music_v1).
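One way a client could model the generation mode values listed above is a small enum plus a body builder. The three mode values and the generation_mode field name are from the note; the "prompt" field name and the helper music_prompt_body are assumptions.

```python
from enum import Enum


class MusicGenerationMode(str, Enum):
    TRACK = "track"
    LOOP = "loop"
    AMBIENCE = "ambience"


def music_prompt_body(prompt, generation_mode=None):
    """Build a music prompt request body with an optional generation_mode."""
    body = {"prompt": prompt}  # Assumption: the prompt field name is illustrative.
    if generation_mode is not None:
        # Enum lookup raises ValueError for anything other than the three modes.
        body["generation_mode"] = MusicGenerationMode(generation_mode).value
    return body
```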
ElevenCreative Studio
- Conversion credits: Chapter and voice conversion statistics response models now include optional credits_needed_to_convert (integer) indicating credits required before conversion.
Workspaces
- Resource types: Added studio_projects to WorkspaceResourceType.
SDK Releases
Python SDK
- v2.50.0 - Regenerated the SDK for the May 25, 2026 API schema.
- v2.49.1 - Updated Speech Engine API calls to return the full response object.
- v2.49.0 - Regenerated the SDK for the May 18–25, 2026 API schema, including ElevenAgents text behavior overrides, music generation_mode, and workspace studio_projects.
JavaScript SDK
- v2.50.0 - Added missing Music API methods and tests, and regenerated the SDK for the May 25, 2026 API schema.
- v2.49.1 - Updated Speech Engine API calls to return the full response object.
- v2.49.0 - Regenerated the SDK for the May 18–25, 2026 API schema, including ElevenAgents text behavior overrides, music generation_mode, and workspace studio_projects.
Packages
- @elevenlabs/[email protected] - Fixed iOS Safari dropping the first agent message on WebSocket voice sessions by unlocking an AudioContext on the first user gesture and priming the playback graph after the audio worklet loads.
- @elevenlabs/[email protected] and @elevenlabs/[email protected] - Style emotion and audio tags in voice transcripts when strip_audio_tags is off, and treat null top-level terms_html or terms_text as a kill switch for the terms and conditions modal.
- @elevenlabs/[email protected] and @elevenlabs/[email protected] - Updated widget dependencies to @elevenlabs/[email protected].
API
- May 18, 2026
- Date parsed from source:May 18, 2026
- First seen by Releasebot:May 20, 2026
May 18, 2026
ElevenLabs adds new agent version metadata, text-only conversation filtering, and broader workspace auth and configuration updates, while expanding SDK and iOS support for the latest API changes. The release also improves LLM options, voice isolator history, and WebRTC reliability.
ElevenAgents
Agent version metadata: Added an API endpoint to retrieve metadata for a specific agent version. The response includes the version ID, agent ID, branch ID, description, sequence number, commit timestamp, parents and access information.
Conversation filtering: Added a text_only filter to conversation list and message text search endpoints so developers can narrow results to text-only conversations.
Procedure loading: Added load_procedure as a system tool configuration with ProcedureAtVersion, and added skills to ProcedureCompilerMode. ProcedureSettings.compiler_mode now defaults to skills.
Agent configuration updates: Added gemini-3.1-flash-lite and qwen35-397b-a17b as LLM options, added genesys_bot_connector as a conversation initiation source, added auto_resolve_after_inactive_minutes to alerting settings, and added attributes_to_headers to SIP trunk configuration.
Custom LLM temperature: Agent prompt temperature can now be set to null to omit the temperature field from downstream LLM requests.
Workspaces
Auth connection updates: Added an API endpoint to update workspace auth connections. The endpoint supports updating basic auth, OAuth2 client credentials and OAuth2 JWT connections. OAuth2 client credentials now support custom_headers for token requests.
Auth connection types: Added bearer auth create request support and Slack bot auth response schemas.
Workspace resource types: Added convai_templates and transcription_tasks to workspace resource types.
Voice Isolator
Video processing status: Audio isolation history items now include the video_processing_failed boolean field.
SDK Releases
Python SDK
v2.48.0 - Regenerated the SDK for the May 18, 2026 API schema, including typed support for workspace auth connection updates, agent version metadata, conversation text_only filters, new ElevenAgents configuration fields, updated LLM options and voice isolator history updates. The release also excludes the speech_engine_custom.py example from Fern packaging.
JavaScript SDK
v2.48.0 - Added environment configuration to conversation initiation data and events, and regenerated the SDK for the May 18, 2026 API schema, including workspace auth connection updates, agent version metadata, conversation text_only filters, new ElevenAgents configuration fields, updated LLM options and voice isolator history updates.
Packages
@elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected] and @elevenlabs/[email protected] - Updated the client dependency. @elevenlabs/[email protected] pins livekit-client to 2.16.1 and forces the dual peer connection path to fix WebRTC connection failures with newer LiveKit join protocols.
@elevenlabs/[email protected] - Removed manually maintained types from @elevenlabs/types, including Role, Mode, Status, Callbacks, CALLBACK_KEYS, DisconnectionDetails, MessagePayload and AudioAlignmentEvent. Import those types from @elevenlabs/client instead. The types package now contains only generated code.
iOS SDK
v3.2.0 - Added support for text-only conversations over WebSockets and fixed message handling.
v3.1.5 - Improved ConnectionManager and DataChannelReceiver cleanup on deallocation, skipped audio hardware initialization in text-only mode, removed unused code and duplicate agent ID extraction, and bumped visionOS support.
API
- May 2026
- No date parsed from source.
- First seen by Releasebot:May 16, 2026
TTS API bug fixes
Hume fixes a bug that caused duplicate interleaved TTS audio and distorted output.
Fixed a bug where duplicate interleaved audio was included in TTS audio output.
This resolves an issue where audio chunks could be duplicated and interleaved, resulting in distorted output.
- May 15, 2026
- Date parsed from source:May 15, 2026
- First seen by Releasebot:May 16, 2026
May 15, 2026
Hume adds an experimental temperature parameter to its TTS API for more varied or consistent speech generation.
TTS API additions
Added an experimental temperature parameter to TTS endpoints. Controls sampling temperature for speech generation. Higher values increase variation; lower values increase consistency.
- May 13, 2026
- Date parsed from source:May 13, 2026
- First seen by Releasebot:May 13, 2026
DramaBox TTS: Saving drama for the performance, not the security review
Resemble releases DramaBox, a prompt-driven expressive TTS model that turns plain-language scene directions into more human speech with emotion, laughter, breaths, and pacing. It also adds optional voice cloning and embedded watermarking for provenance.
Directable speech
At some point, everyone who has built with TTS hits the same wall. The voice sounds fine, the words are correct, and it still sounds wrong in a way that is hard to explain until you have heard it enough times to name it: flat. Technically accurate, completely unconvincing.
The problem was that we were using machine language to describe something fundamentally human. Tags, parameters, style tokens, all of it was a translation layer between what you actually wanted and what the model could understand. The translation is where the performance struggled to sound…human.
A real director does not say "increase emotional intensity by 20%." They set a scene, they describe a character, and the performance follows from that direction. DramaBox is the first TTS model that works the same way. You describe what you want in plain language, the way you would describe it to a person, and the model understands it that way.
The second problem was quieter but just as serious. Every piece of synthetic audio left your hands with no way to prove it was yours. No signal in the file, no chain of custody, nothing that holds up when legal gets involved.
Today we are releasing DramaBox, and it addresses both.
Huge congratulations to Resemble AI on the release of DramaBox, their latest expressive voice model. It's also amazing to see the Resemble TTS family surpass 10M downloads on Hugging Face, a testament to the strength of the open model community.
- Joshua Lochner, Open Source Machine Learning Engineer
The format works like a screenplay. Speaker description and stage directions go outside the quotes. Dialogue goes inside. The model speaks the dialogue and interprets everything outside the quotes as performance direction, never as words to say.
Read this prompt:
A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?!" She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
That is the exact prompt we passed to the model, one generation with no post-processing and no stitching of takes together.
🎧 Sample:
Regal Queen: Cold Fury to Venomous Whisper
How it works
The prompt format distinguishes between two zones. Inside double quotes, the model speaks literally. Outside double quotes, it performs. Write She sighs deeply outside the quotes and the model produces a sigh. Write "Sigh" inside the quotes and the model says the word out loud, which is not what you want. That distinction is what makes the whole thing work, and it is also what makes the prompting feel genuinely different from anything else in TTS.
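To make the two zones concrete, here is a small sketch that separates a prompt into spoken dialogue and performance direction. It is not Resemble's parser, only an illustration of the rule described above; the function name split_prompt is hypothetical, and it assumes straight double quotes that are always balanced.

```python
import re


def split_prompt(prompt):
    """Split a screenplay-style prompt into (spoken lines, stage directions)."""
    spoken, directions = [], []
    # Splitting on a capturing group keeps the quoted spans at odd indexes
    # and the text outside the quotes at even indexes.
    parts = re.split(r'"([^"]*)"', prompt)
    for index, part in enumerate(parts):
        text = part.strip(" ,")
        if not text:
            continue
        if index % 2:
            spoken.append(text)      # inside quotes: spoken literally
        else:
            directions.append(text)  # outside quotes: performed, never spoken
    return spoken, directions
```

A check like this can catch a common prompting mistake before generation: a direction such as "Sigh" that ended up in the spoken list.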
Here is what that looks like when you give the model a character who cannot stop laughing:
A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay, I will stop, I promise." She leans in and whispers, "But seriously though, between you and me," then immediately loses it again, "Haha, I just cannot! You are way too funny!"
🎧 Sample:
Catgirl: Uncontrollable Giggling
The gasps, the failed attempts to compose herself, the snort at the end: all of it came from the prompt, not from post-processing.
What's under the hood
DramaBox is a 3.3B-parameter audio-only diffusion transformer, LoRA-merged, with a Gemma 3 12B text encoder running at 4-bit quantization. It outputs 48kHz stereo audio in AAC or WAV. On a warm H100, generation takes around 2.5 seconds.
Voice cloning is optional. Pass a 10-second reference clip and the model clones the target timbre while still following the prompt for everything else: emotion, delivery, the laugh in the middle of a sentence, the breath before the hard line. The reference sets the voice and the prompt directs the performance. Without a reference, the model invents a voice that fits the scene description.
And it’s a step towards compliant TTS for coming regulations like the EU AI Act. Every output is watermarked with Resemble Watermarker. The watermark is embedded in the signal itself, not in file metadata, and constrained to speech-relevant frequencies so it is inaudible to listeners. It survives MP3 and AAC compression, re-encoding, and common edits at ~100% detection accuracy. Pass that output through our deepfake detection API and you get back a binary decode, not a probability score you have to interpret, but a watermark-present-or-not answer that is more defensible when legal gets involved. Generation and protection are not separate decisions here. They are the same decision, made at the moment of creation.
The range
The model handles tonal range across the full dramatic spectrum. A villain whose menace never tips into parody:
🎧 Sample:
Villain: Sinister Laugh
A talk show host who loses it completely and cannot find his way back:
🎧 Sample:
Talk Show Host: Wheezing Laughter
Five voices building a late-90s pop harmony from soft synchronized layers to a full-group chorus:
Backstreet Boys, a polished late-90s boy band with five smooth, harmonizing male voices. "Step by step… out the door… new day… ready for more…" they sing in soft, synchronized harmony. One voice steps forward. "Keys in my hand… got my plan…" The others swell behind him. Their voices rise together. "Tell me why… every morning feels the same…" then "I'm ready to go…" The full group returns in a bright, unified chorus. "We'll make it our way… through the rush, through the noise, we keep moving strong, yeah!"
🎧 Sample:
Backstreet Boys: Pop Harmony
And a football commentator calling a fridge opening with the full gravitas of a Champions League final:
🎧 Sample:
Football Commentator: Martin Tyler
That last one is worth listening to carefully because it demonstrates something more specific than expressiveness: precise tonal control across a long, slow build with crowd audio layered underneath, which is a harder thing to get right than a single emotional peak. The model is not just generating speech. It is generating a scene with pacing.
What and who this is for
DramaBox is an English-only release, and that is intentional. Getting directable speech right in one language is harder than getting flat TTS right in thirty, and we were not willing to trade quality for coverage on this one.
The use cases are anywhere flat TTS has always been the bottleneck: game dialogue that players do not skip, audiobook narration with real character differentiation, voice agents that do not sound like they are reading from a script, dubbing work where the performance has to match the scene and not just the words. Every file that comes out carries a provenance signal embedded at the moment of creation, so wherever it ends up, you can prove it was yours.
Run it self-hosted. The model card and quick-start code are on Hugging Face. If you’re interested in running DramaBox at scale, we suggest you reach out to Cerebrium or GMI.
What's next
DramaBox is the first of four open-source TTS models we are shipping this month, and each one is a different answer to a different problem.
- Chatterbox Nano: 110M parameters. 10x realtime on GPU, 3x realtime on CPU. Runs at the edge. Full paralinguistic tags, voice cloning from 5 seconds. The smallest serious TTS model we've ever shipped, and arguably the fastest open-source TTS.
- Chatterbox Flash: Chatterbox, rebuilt on a diffusion-LLM architecture. 2x faster than our AR baseline on vLLM. Ships with a novel prior-subtraction technique we believe generalizes to any dLLM TTS, one of the first production TTS models on this architecture.
- Chatterbox Multilingual V3: Better speaker similarity, fewer hallucinations, more natural delivery across languages. Plus dedicated single-language models for Mandarin, LATAM Spanish, Brazilian Portuguese, Spain Spanish, Portugal Portuguese, and Hindi.
We will cover each one when it's available.
- May 13, 2026
- Date parsed from source:May 13, 2026
- First seen by Releasebot:May 13, 2026
May 13, 2026
ElevenLabs adds SIP signaling logs, in-place document editing and RAG chunk listing, SMS conversation metadata, workspace API analytics, API key IP allowlisting, new agent config fields and LLM options, plus updated voice preview, WAV output formats and SDK support.
ElevenAgents
SIP logs: SIP signaling logs are now available for SIP trunk calls in conversation history and phone number settings. Logs include the SIP call ID, phone numbers, addresses, transport, message direction, raw messages and errors to help debug call setup and routing issues.
Knowledge base document updates: New API to list indexed (RAG) chunks for a document. Documents can also be edited in place: rename the item and, for text-based documents, update the body without re-uploading the file. (File-based documents still use the existing replace-file flow where applicable.)
SMS conversation metadata: Added SMS support to conversation metadata with SMSConversationInfo, sms as an authorization method and twilio_sms as a conversation initiation source.
Agent configuration updates: Added background music configuration schemas, alerting monitor configuration, 2D layout fields, widget file upload configuration, and new LLM options gpt-5.4-mini, gpt-5.4-nano, gpt-5.4-mini-2026-03-17 and gpt-5.4-nano-2026-03-17.
Workspaces
API request analytics: Added a workspace analytics endpoint for querying API requests with time range, filtering, search, sort and limit controls.
API key IP allowlisting: Added allowed_ips to service account API key create, edit and response schemas. Create requests accept an array of IP addresses or CIDR ranges, or null to allow all IPs. Edit requests also accept clear and no_update.
Workspace permissions: Added conversational_ai_read and voice_design to workspace group permissions.
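For the API key IP allowlisting entry above, a client might validate allowed_ips locally before sending a create request. This is a sketch under assumptions: the helper name is hypothetical, and rejecting CIDR ranges with host bits set is a client-side choice made here, not documented API behavior.

```python
import ipaddress


def validate_allowed_ips(entries):
    """Check an allowed_ips value: None allows all IPs, otherwise IPs or CIDR ranges."""
    if entries is None:
        return None  # null means no IP restriction
    for entry in entries:
        # ip_network accepts both single addresses and CIDR ranges and raises
        # ValueError for malformed input or ranges like 10.0.0.1/24.
        ipaddress.ip_network(entry, strict=True)
    return list(entries)
```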
Voice Design
Voice preview migration: Deprecated POST /v1/text-to-voice/create-previews. Use Design a voice to create previews, then use the returned generated_voice_id when creating a voice.
Voice Changer
WAV output formats: Added wav_8000, wav_16000, wav_22050, wav_24000, wav_32000, wav_44100 and wav_48000 to the output_format query parameter on speech-to-speech conversion.
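The new WAV values follow a wav_<number> pattern. Assuming the number is the sample rate in Hz (the note does not say so explicitly), a client could validate the output_format value and derive the rate like this; the helper name is hypothetical.

```python
# Values copied from the release note above.
WAV_OUTPUT_FORMATS = (
    "wav_8000", "wav_16000", "wav_22050", "wav_24000",
    "wav_32000", "wav_44100", "wav_48000",
)


def wav_sample_rate(output_format):
    """Return the numeric suffix of a supported WAV output_format value."""
    if output_format not in WAV_OUTPUT_FORMATS:
        raise ValueError(f"unsupported WAV output_format: {output_format!r}")
    # Assumption: the suffix is the sample rate in Hz.
    return int(output_format.split("_", 1)[1])
```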
SDK Releases
Python SDK
v2.47.0 - Regenerated the SDK for the May 12, 2026 API schema. The release adds typed support for RAG chunk listing, file document updates, workspace API request analytics, service account API key IP allowlisting, phone number agent configuration fields, webhook request headers, voice metadata moderation, new LLM options and updated speech-to-speech output formats.
JavaScript SDK
v2.47.0 - Regenerated the SDK for the May 12, 2026 API schema. The release adds typed support for RAG chunk listing, file document updates, workspace API request analytics, service account API key IP allowlisting, phone number agent configuration fields, webhook request headers, voice metadata moderation, new LLM options and updated speech-to-speech output formats.
Packages
@elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected] and @elevenlabs/[email protected] - Added full tool result payload support to onAgentToolResponse. The callback now receives agent_tool_response_full_payload events with the raw full_tool_result string, capped at 64 KB, when enabled in the agent configuration.
@elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected] and @elevenlabs/[email protected] - Added native mute and unmute support to Scribe realtime STT. RealtimeConnection now exposes mute(), unmute() and isMuted, and useScribe exposes matching state and callbacks. The releases also add onAgentResponseCorrection for agent response correction events.
@elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected] and @elevenlabs/[email protected] - Added optional contextId to sendContextualUpdate for deduplicating contextual updates and added llm to the typed agent prompt override for conversation sessions. The React package also allows useScribe microphone deviceId to use full ConstrainDOMString constraints.
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Added file upload support to the embedded ElevenAgents widget.
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Fixed text input submission for IME users by ignoring Enter keydowns while composition is active.
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Updated the widget packages to @elevenlabs/[email protected], including full tool result payload support.
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Fixed transcript ordering when user and agent messages share the same event_id, so voice and DTMF turns render the user transcript before the agent message.
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Fixed voice widget transcripts so streamed agent response parts are ignored for voice sessions and late user transcripts are inserted before their matching agent response.
API
- May 9, 2026
- Date parsed from source:May 9, 2026
- First seen by Releasebot:May 9, 2026
API: New `simba-3.0` streaming model
Speechify adds simba-3.0 for audio speech and streaming, bringing a streaming-native voice model with lower TTFB, richer expressivity, per-voice speaking-rate, and ADV emotion controls. It currently supports English voices only, with multilingual support coming soon.
simba-3.0 is now available on POST /v1/audio/speech and POST /v1/audio/stream via the model field. It's the new streaming-native voice model with lower TTFB and richer expressivity, including direct support for per-voice speaking-rate and ADV (Arousal, Dominance, Valence) emotion controls inherited from the voice catalog.
{ "input": "Hello, world!", "voice_id": "george", "model": "simba-3.0" }
Currently English only; multilingual coming soon
simba-3.0 currently supports English voices only. Requests with a non-English voice return 400 with the rejected locale called out in the message. Multilingual support is coming soon; the model name stays simba-3.0 across that change, so no migration is required.
For non-English voices today, continue to use simba-multilingual.
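The model choice described above can be sketched as a small helper that falls back to simba-multilingual for non-English voices. The body fields match the example request; the voice_locale argument and the helper name are hypothetical, since the note does not say how a client learns a voice's locale.

```python
def speech_request_body(text, voice_id, voice_locale):
    """Pick simba-3.0 for English voices, simba-multilingual otherwise."""
    # Assumption: locales look like "en-US" or "fr-FR"; anything starting
    # with "en" is treated as English.
    is_english = voice_locale.lower().startswith("en")
    model = "simba-3.0" if is_english else "simba-multilingual"
    return {"input": text, "voice_id": voice_id, "model": model}
```

Choosing the model up front avoids the 400 response that simba-3.0 returns for non-English voices.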
- May 5, 2026
- Date parsed from source:May 5, 2026
- First seen by Releasebot:May 5, 2026
Realtime TTS-2
Text To Speech launches Realtime TTS-2, its most expressive TTS model, with natural language steering, stronger multilingual synthesis across 15 languages, cross-lingual voice reuse, voice localization, a new deliveryMode control, and an updated Voice Design with improved generations.
Launched Realtime TTS-2 (inworld-tts-2), our most powerful and expressive TTS model:
- Natural Language Steering: Direct any voice with bracketed instructions like [say excitedly], [whisper in a hushed style], or free-form directions like [speak as if barely holding back rage]. Covers articulation, intonation, volume, pitch, range, speed, vocal style, and non-verbals ([laugh], [sigh], etc.). See the Steering guide.
- Stronger Multilingual Support: Production-quality synthesis across 15 languages, plus experimental support for 90+ additional languages. See Languages.
- Cross-Lingual Voice Synthesis: Reuse the same voice across multiple languages. For best results, specify the language field.
- Voice Localization: Localize your voice for the most consistent, native-sounding speech in a target language. See Voice Localization.
- Delivery Mode: New deliveryMode field (STABLE, BALANCED, EXPRESSIVE) controls the trade-off between consistency and emotional range.
- Updated Voice Design: Released an updated version of Voice Design with improved generations. See Voice Design.
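As a sketch of the two request fields named in the list above, the helper below validates deliveryMode and optionally sets language. Only those two field names and the three mode values come from the note; the helper name is hypothetical, and the rest of the request body (text, voice, model) is deliberately left to the caller because the note does not describe it.

```python
DELIVERY_MODES = ("STABLE", "BALANCED", "EXPRESSIVE")


def tts2_options(delivery_mode, language=None):
    """Return the deliveryMode and optional language fields for a TTS-2 request."""
    mode = delivery_mode.upper()
    if mode not in DELIVERY_MODES:
        raise ValueError(f"deliveryMode must be one of {DELIVERY_MODES}")
    options = {"deliveryMode": mode}
    if language is not None:
        # Setting language is recommended when reusing a voice across languages.
        options["language"] = language
    return options
```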
- May 4, 2026
- Date parsed from source:May 4, 2026
- First seen by Releasebot:May 7, 2026
May 4, 2026
ElevenLabs adds conversation tags, richer conversation filtering, new LLM options, contextual update metadata, and longer MCP response timeouts, while also expanding SDKs and packages with Scribe realtime enhancements, Android mute audio detection, and other reliability improvements.
ElevenAgents
Conversation tags: Added first-class conversation tags for organizing and filtering conversations. You can now create, update, list and delete tags, assign tags to conversations, and filter conversation history by tag_ids.
Conversation list filters: Added exclude_statuses to list conversations, allowing clients to hide conversations with statuses such as initiated, in-progress, processing, done, or failed.
Model options: Added claude-opus-4-7, gpt-5.4, gpt-5.5, gpt-5.4-2026-03-05, gpt-5.5-2026-04-23, and qwen36-35b-a3b to the LLM enum.
Contextual update metadata: Added contextual_update_info to conversation transcript response items. The field references ContextualUpdateInfo, which includes context_id and is_superseded, so clients can identify contextual updates and whether they have been replaced by later context.
Batch calling filter: Added an optional agent_id query parameter to list workspace batch calling jobs, allowing clients to return jobs for a single agent.
Test invocation listing: Made the agent_id query parameter optional and nullable on list test invocations, allowing clients to list test invocations without filtering by one agent.
MCP response timeout: Increased the maximum response_timeout_secs value from 120 to 300 seconds on MCP server configuration and MCP tool configuration overrides.
Workspaces
Subscription overage: Added current_overage to SubscriptionResponseModel and ExtendedSubscriptionResponseModel.
Resource collections: Added resource_collection to the WorkspaceResourceType enum.
ElevenCreative Studio
Project image status: Added error and pending_task to ProjectImageResponseModel, and made signed image URLs nullable when an image is still processing or has failed.
Pending media fields: Removed legacy pending_block_ids and pending_external_audio_ids fields from project external audio and video response models.
SDK Releases
Python SDK
v2.46.0 - Added keyterms and no_verbatim support to the Scribe realtime API, refactored WebSocket URL construction, and regenerated the SDK for the May 7, 2026 API schema, including conversation tags, conversation list filters, batch calling agent_id filtering, optional test invocation agent_id, contextual update metadata, and MCP response timeout updates.
JavaScript SDK
v2.46.0 - Fern regeneration for the May 7, 2026 API schema, including conversation tags, conversation list filters, batch calling agent_id filtering, optional test invocation agent_id, contextual update metadata, and MCP response timeout updates.
Packages
@elevenlabs/[email protected] - Added keyterms (string[]) and noVerbatim (boolean) options to the Scribe realtime API. keyterms are sent as repeated WebSocket query parameters and can include up to 50 terms of up to 20 characters each. noVerbatim removes filler words, false starts, and disfluencies from transcripts. Also fixed a case where interruptions could cut off agent audio that arrived less than 2 seconds after the interruption.
@elevenlabs/[email protected] - Added shared types for Scribe realtime keyterms and noVerbatim.
@elevenlabs/[email protected], @elevenlabs/[email protected], @elevenlabs/[email protected], and @elevenlabs/[email protected] - Updated to the latest client package, including Scribe realtime options and the agent audio interruption fix.
@elevenlabs/[email protected] - Patch release that includes the Scribe realtime options, the agent audio interruption fix, and an automatic random user ID when none is set.
Android SDK
v0.9.0 - Added support for detecting user audio while the user is muted.
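The Scribe realtime keyterms limits noted under Packages above (up to 50 terms of up to 20 characters each, sent as repeated query parameters) could be enforced client-side as sketched below. The query parameter names and the "true" encoding for no_verbatim are assumptions, as is the helper name.

```python
from urllib.parse import urlencode

MAX_KEYTERMS = 50
MAX_KEYTERM_LENGTH = 20


def scribe_query_string(keyterms=(), no_verbatim=False):
    """Encode Scribe realtime options as a WebSocket query string."""
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
    for term in keyterms:
        if len(term) > MAX_KEYTERM_LENGTH:
            raise ValueError(f"keyterm {term!r} exceeds {MAX_KEYTERM_LENGTH} characters")
    # Each term becomes its own keyterms=... pair (repeated query parameters).
    params = [("keyterms", term) for term in keyterms]
    if no_verbatim:
        params.append(("no_verbatim", "true"))  # Assumption: wire name and value.
    return urlencode(params)
```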
API