AI Voice and Speech Release Notes
Release notes for AI voice synthesis, text-to-speech and audio generation tools
Latest AI Voice and Speech Updates
- Jun 1, 2026
- Date parsed from source: Jun 1, 2026
- First seen by Releasebot: Jun 2, 2026
June 1, 2026
ElevenLabs adds Exotel telephony integration, workflow-aware agent transfers, repeat agent test runs, richer conversation filtering, larger speech to text uploads, hCaptcha support for dubbing, a new workspace API key revoke endpoint, and refreshed SDK releases.
ElevenAgents
Exotel telephony integration: Added Exotel as a first-class telephony provider alongside Twilio and SIP trunking. New Exotel outbound call endpoint (POST /v1/convai/exotel/outbound-call) initiates outbound calls via the Exotel Connect API with agent_id, agent_phone_number_id, to_number, optional conversation_initiation_client_data and optional telephony_call_config. Exotel phone number create, list, get and update schemas are available, and exotel is added to TelephonyProvider, conversation initiation source and transfer result enums.
Workflow-aware agent transfers: AgentTransfer now allows a nullable agent_id so transfers can stay on the current agent, adds optional node_id to start at a specific node in the destination workflow, and only requires condition. Successful transfer_to_agent tool results can include nullable to_node, aligning transfers with workflow-node targets.
Agent test repeat runs: Run tests on the agent accepts optional repeat_count (integer, 1–20, default 1) to run each test multiple times. When repeat_count is greater than 1, responses include bucketing_status and result_groups for grouped summaries. Unit test create, update and summary schemas add optional conversation_initiation_source so tests can simulate a specific channel.
Conversation filtering: List conversations adds optional workflow_node_entered_id (string) to filter conversations that entered a given workflow node. Text search conversation messages adds optional topic_ids (array of strings) to filter by topic IDs assigned during topic discovery.
Agent versioning query parameters deprecated: enable_versioning on create agent and enable_versioning_if_not_enabled on update agent are marked deprecated. All agents are versioned and these parameters are ignored.
Procedure compiler mode: Removed append from ProcedureCompilerMode.
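As a rough illustration of the Exotel outbound call endpoint described in the ElevenAgents notes above, the sketch below builds the POST request without sending it. The path and field names come from the note; the API host, the `xi-api-key` header, the helper name `build_exotel_outbound_call` and all ID values are assumptions or placeholders, and the payload shapes of the two optional fields are not specified here.

```python
import json
import urllib.request

# Assumption: the public ElevenLabs API host and the xi-api-key auth header.
BASE_URL = "https://api.elevenlabs.io"


def build_exotel_outbound_call(api_key, agent_id, agent_phone_number_id, to_number,
                               conversation_initiation_client_data=None,
                               telephony_call_config=None):
    """Build (but do not send) the outbound-call request (hypothetical helper)."""
    body = {
        "agent_id": agent_id,
        "agent_phone_number_id": agent_phone_number_id,
        "to_number": to_number,
    }
    # Optional fields are included only when the caller supplies them.
    if conversation_initiation_client_data is not None:
        body["conversation_initiation_client_data"] = conversation_initiation_client_data
    if telephony_call_config is not None:
        body["telephony_call_config"] = telephony_call_config
    return urllib.request.Request(
        url=BASE_URL + "/v1/convai/exotel/outbound-call",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_exotel_outbound_call(
    api_key="YOUR_API_KEY",             # placeholder
    agent_id="agent_123",               # placeholder ID
    agent_phone_number_id="phnum_456",  # placeholder ID
    to_number="+911234567890",          # placeholder number
)
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`, once real credentials and IDs are in place.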
Workspaces
Revoke API key: Added revoke API key endpoint (DELETE /v1/workspaces/api-keys/revoke) with required api_key_name query parameter to revoke a workspace API key by name. This endpoint is destructive and requires additional account permissions, even though the documentation is public. Contact your ElevenLabs representative to enable it for your account.
Resource access disclosure: Resource access response models add optional access_source (creator, explicit, workspace_default, workspace_admin) so clients can disclose why a user has access to a shared resource.
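For the revoke endpoint in the Workspaces notes above, a minimal sketch of how the request might be assembled. The path, HTTP method and `api_key_name` query parameter come from the note; the host, the `xi-api-key` header and the helper name are assumptions. The request is only built, not sent, since the operation is destructive.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: the public ElevenLabs API host.
BASE_URL = "https://api.elevenlabs.io"


def build_revoke_api_key_request(admin_api_key, api_key_name):
    """Build (but do not send) a DELETE request that revokes a key by name."""
    if not api_key_name:
        raise ValueError("api_key_name is required")
    # urlencode escapes spaces and special characters in the key name.
    query = urlencode({"api_key_name": api_key_name})
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/workspaces/api-keys/revoke?{query}",
        headers={"xi-api-key": admin_api_key},
        method="DELETE",
    )


revoke = build_revoke_api_key_request("YOUR_ADMIN_KEY", "ci deploy key")
print(revoke.get_method(), revoke.full_url)
```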
Speech to Text
Larger upload limit: Convert speech to text now accepts audio and video files up to 5.0GB, increased from 3.0GB.
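A client could check file size before uploading to avoid a failed transfer. The sketch below assumes "5.0GB" means 5 × 10^9 bytes; the API may count in binary gigabytes instead, so treat the constant as a placeholder. The function names are hypothetical.

```python
import os

# Assumption: decimal gigabytes. Swap in 5 * 2**30 if the API uses binary units.
MAX_UPLOAD_BYTES = 5 * 10**9


def fits_upload_limit(size_bytes, limit_bytes=MAX_UPLOAD_BYTES):
    """Return True when a non-empty file is within the upload limit."""
    return 0 < size_bytes <= limit_bytes


def file_fits_upload_limit(path):
    """Check a file on disk against the limit before starting an upload."""
    return fits_upload_limit(os.path.getsize(path))


print(fits_upload_limit(4 * 10**9))
```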
Dubbing
hCaptcha support: Dub a video or audio file request body adds optional hcaptcha_token (string) for bot protection on dubbing project creation.
Zip render format: RenderType adds zip alongside existing output formats such as mp4, aac, mp3, wav, aaf, tracks_zip and clips_zip.
SDK Releases
Packages
@elevenlabs/[email protected] - Added configurable microphone input chunk duration via inputChunkDurationMs (default 25ms).
@elevenlabs/[email protected] - Exposed inputChunkDurationMs for microphone input chunk duration and updated the client dependency to @elevenlabs/[email protected].
@elevenlabs/[email protected] - Updated dependencies to @elevenlabs/[email protected] and @elevenlabs/[email protected].
@elevenlabs/[email protected] and @elevenlabs/[email protected] - Updated widget dependencies to @elevenlabs/[email protected].
Flutter SDK v0.6.1 - Hardened LiveKit session teardown against uncaught stream errors.
API
Original source
- May 1, 2026
- Date parsed from source: May 1, 2026
- First seen by Releasebot: Jun 2, 2026
- Modified by Releasebot: Jun 4, 2026
May 2026
Cartesia launches Ink-2, a streaming STT model for responsive real-time voice with turn detection and noisy-environment transcription.
Ink-2, our state-of-the-art streaming STT model — Build responsive real-time voice experiences with built-in turn detection and accurate transcription even in noisy environments. It currently supports only English, with additional languages coming later.
- Try it on the Cartesia Playground.
- Integrate via API, Python, TypeScript/JavaScript, LiveKit, and Pipecat.
- Switching from Deepgram Flux? See the migration guide.
- Aug 19, 2025
- Date parsed from source: Aug 19, 2025
- First seen by Releasebot: May 28, 2026
Introducing Line: The Modern Voice Agent Development Platform
Cartesia launches Line, a code-first voice agent development platform that helps developers build, test, deploy, and monitor voice agents with low-latency conversations, logging, evals, and direct integration with Sonic and Ink. Line is now available to all developers.
Over the past 12 months, tens of thousands of developers have built and productionized industry-leading voice agents using our frontier speech models, Sonic and Ink. In serving these developers, we found that even the most sophisticated teams struggled to build great voice agents with the tools available today.
Voice agents are uniquely hard to bring to production: they require maintaining natural, low-latency conversations with advanced reasoning across millions of calls. A few challenges come up repeatedly: more reasoning introduces higher latency, small changes to one component can cause unpredictable quality swings, and fragile infrastructure across multiple providers makes scaling unpredictable.
That’s why we built Line. Line is the modern voice agent development platform, making it easy for developers and businesses everywhere to build best-in-class voice agents.
Line is code-first
The best products are built with code. We realized this early as we reflected on the challenges developers faced. Great businesses transform industries by putting code at the center – Stripe reimagined payments, Twilio transformed communications, and Vercel redefined web.
Intelligent agents and great conversations are impossible to express in rigid conversational builders, which put limits on reasoning ability. Instead, the best voice agents should be like the best humans, using fluid context, advanced logic, and background reasoning.
Line uses code to give you the flexibility to perfect every part of your voice agent. The Line SDK makes it intuitive to build advanced voice agents with background reasoning, and is designed to cover any use case. Edge cases are easy to handle directly in code, and you can integrate your favorite packages alongside our SDK, no compromises necessary.
With AI, code is the fastest way to build and more accessible than ever. Developers are writing more lines of code than ever, using AI to build complex applications quickly. Our code-first approach and SDK roadmap are designed with this future in mind.
Line helps you iterate fast
Building great voice agents is an iterative process. Progress comes from making changes, testing them in live conversations, learning what works, and adjusting based on feedback. Line helps you move through that cycle quickly.
Line gets you to your first agent in minutes, starting from a text prompt or template. With our CLI and GitHub integrations, you can develop agents locally, deploy with one command to our platform, and talk to your agents in seconds. You can even share your agent with others.
Every iteration requires comprehensive evals and manual testing to get right. Developers constantly refine prompts, business logic, and background reasoning. They need to measure progress on their agents over time.
Line records all calls to deployed agents, saves the audio and transcripts, reports system metrics like latency, and provides comprehensive logs for auditing and debugging. Developers can use custom LLM-as-a-judge metrics to rate calls along any axis, including user satisfaction and call success.
Line brings you our frontier models and infrastructure
Best-in-class voice agents require frontier speech models. At Cartesia, our research team is pioneering the world’s best speech models using breakthrough advances in AI architectures.
Sonic is the world’s lowest latency text-to-speech model, with ultra-realistic conversational voices, and reliable transcript following. Ink is the fastest streaming speech-to-text model for speech understanding. Collectively, Sonic and Ink have powered millions of calls for businesses, from the fastest growing companies in voice AI to the Fortune 500.
Deeply integrating models and infrastructure creates new opportunities to optimize the end-to-end voice agent experience. Line voice agents are deployed alongside our Sonic and Ink models for the fastest latencies, and they’ll be the first to access our latest research advances in speech modeling.
For enterprises, this integration extends further: Line can be deployed entirely on-prem, from voice agents to models, and models can be customized with fine-tuning.
Line, Sonic, and Ink run on Cartesia’s reliable, globally distributed infrastructure, allowing every deployed voice agent to scale to thousands of calls while maintaining low latency.
Line is now available
Line is available today to all developers. As part of our launch, all subscription tiers will get their equivalent monthly plan dollars prepaid towards agents—at no extra cost.
Our north star is simple. No matter how complex the code you write, when you deploy to the Line platform, every call should be a flawless experience.
With Line, we hope to make frontier voice AI accessible to everyone, and in turn, to make AI more accessible to the world.
We’re excited to see what you build. Sign up and check out the docs now to create your first voice agent. Contact us to start enterprise plan discussions.
Original source
- Jul 11, 2025
- Date parsed from source: Jul 11, 2025
- First seen by Releasebot: May 28, 2026
Hierarchical modeling
Cartesia launches H-Net research on hierarchical networks that learn directly from raw data, aiming to improve scaling, robustness, and long-context reasoning. It also releases H-Net checkpoints on Hugging Face, including 2-stage XL, 1-stage XL, and 1-stage L.
The best AI architectures in use today treat all inputs equally. They process each input with the same amount of compute, without explicitly grouping related inputs into higher level concepts. While these architectures have achieved impressive results across domains, this lack of hierarchy has some fundamental limitations.
- Models have difficulty learning from high resolution, raw data, requiring inputs to be pre-processed into meaningful tokens for strong performance.
- The use of hand-crafted pre-processing steps (e.g. tokenization) can cause models to fail unexpectedly with small perturbations in the input data.
- Models waste compute on tokens that are easy to predict and not informative.
More importantly, information is fundamentally hierarchical. In language, ideas are chunked into characters, words, sentences, and paragraphs; in images, pixels are chunked into edges, shapes, and objects; in audio, raw waveforms are grouped into phonemes, sentences, and conversation turns. As humans, we consume raw information and group it in meaningful ways that allow us to reason and make connections at different levels of abstraction, from low-level units to high-level ideas. This is core to intelligence. We believe hierarchical models will address several of the fundamental limitations and shortcomings of today’s architectures.
We’re excited to announce our latest research collaboration on hierarchical networks (H-Nets), a new architecture that natively models hierarchy from raw data. The core of the H-Net architecture is a dynamic chunking mechanism that learns to segment and compress raw data into meaningful concepts for modeling. It has three components: an encoder network, the main network, and a decoder network. The core of the encoder network is a routing module, which uses a similarity score to predict groups of meaningful chunks that should be grouped together and compressed for the main network. The main network can be any sequence to sequence model, and is responsible for next token prediction over these higher level chunks. Finally, the decoder network learns to decode chunks back into raw data, with a smoothing module for stabilizing learning.
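To make the routing idea concrete, here is a toy sketch of similarity-based chunking. It scores each position by how dissimilar its vector is from the previous one and starts a new chunk when the score crosses a threshold. The real H-Net router is learned end to end; the fixed input vectors, the mapping from cosine similarity to a boundary score, and the 0.5 threshold used here are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def boundary_scores(vectors):
    """Score each position: low similarity to the previous vector gives a high score."""
    scores = [1.0]  # the first position always starts a chunk
    for prev, cur in zip(vectors, vectors[1:]):
        scores.append(0.5 * (1.0 - cosine_similarity(prev, cur)))
    return scores


def chunk(vectors, threshold=0.5):
    """Group consecutive positions, opening a new chunk where score >= threshold."""
    chunks = []
    for index, score in enumerate(boundary_scores(vectors)):
        if score >= threshold or not chunks:
            chunks.append([index])
        else:
            chunks[-1].append(index)
    return chunks


# Hypothetical 2-D "encoder outputs": positions 0-1 and 2-3 point in similar directions.
toy = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [-0.9, 0.1], [0.1, -1.0]]
print(chunk(toy))  # [[0, 1], [2, 3], [4]]
```

In the architecture described above, each resulting chunk would then be compressed into a single representation for the main network, which is where the compute savings come from.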
H-Net demonstrates three important results on language modeling:
- H-Nets scale better with data than state-of-the-art Transformers with BPE tokenization, while learning directly from raw bytes. This improved scaling is even more pronounced on domains without natural tokenization boundaries, like Chinese, code, and DNA.
- H-Nets can be stacked together to learn from deeper hierarchies, which further improves performance.
- H-Nets are significantly more robust to small perturbations in input data like casing, showing an avenue for creating models that are more robust and aligned with human reasoning.
Our investment in this research is part of our larger push to build the next-generation of AI models that are multimodal, highly efficient, and reason and improve over long horizons. State space models represented our first research advancement, enabling stateful models that can compress information over long contexts. We believe H-Nets, and hierarchical modeling, are the key next step to addressing fundamental challenges in AI:
- Multimodal understanding and generation: A key challenge in multimodal modeling is fusing multiple streams of data. This is difficult today, since different modalities are tokenized at different rates. For example, language is tokenized into subwords, while audio is tokenized as raw waveforms or downsampled codecs. This makes them difficult to model jointly. Hierarchical models like H-Net provide a promising path to fuse these multimodal streams at a higher abstraction level, enabling better transfer, reasoning, and understanding across modalities.
- Long-context reasoning: H-Nets unlock long context reasoning by chunking information into semantically meaningful units at higher levels of abstraction. This compression makes it easier for models to understand and reason across large inputs, particularly with deeper and deeper hierarchies. Hierarchical architectures will enable models that understand their environment from raw data and reason at appropriate levels of abstraction over long horizons.
- Efficient training and inference: Today’s architectures use the same amount of compute for every token, even though some tokens are less informative and easier to predict than others. Inference time optimizations, like speculative decoding, exploit this property to speed up computation on easier to predict tokens. With H-Nets, this is built directly into the architecture, by handling tokens that are easier to predict with lightweight encoder and decoder modules.
For more, read our full preprint on arXiv. We’ve also released checkpoints for H-Net 2-stage XL, H-Net 1-stage XL, and H-Net 1-stage L on HuggingFace.
If you’re excited about the future of architecture research and building systems and infrastructure to deliver these new models at scale, please reach out!
Original source
- Jun 10, 2025
- Date parsed from source: Jun 10, 2025
- First seen by Releasebot: May 28, 2026
Introducing Ink: speech-to-text models for real-time conversation
Cartesia introduces Ink, a new streaming speech-to-text family for real-time voice apps, led by Ink-Whisper. It brings low-latency conversational transcription with strong accuracy, faster transcript completion, and affordable enterprise-grade pricing.
Sonic to Ink: from voice-out to voice-in
Today we’re introducing Ink, a new family of streaming speech-to-text (STT) models for developers building real-time voice applications. Our debut model is Ink-Whisper, a variant of OpenAI’s Whisper, specifically optimized for low-latency transcription in conversational settings. Available today, Ink-Whisper is the fastest, most affordable STT model–designed for enterprise-grade voice agents.
With Sonic, we’ve become the preferred text-to-speech (TTS) provider for builders who prioritize speed, quality, and reliability. This comes as no surprise, given Sonic’s market leadership in ultra-low latency, enabling customers to create the most realistic interactive voice experiences. Now, we’re turning our attention to the other side of the conversation (STT) with Ink.
Reimagining Whisper for real-time
For our STT release, we looked at what developers are already using and what might be broken. That led us to OpenAI’s whisper-large-v3-turbo. It is widely used for good reasons: Whisper performs comparably in conversational transcription accuracy to proprietary speech-to-text providers, it is open source, and it can be run efficiently at inference time.
Most of the innovation around Whisper has focused on improving throughput, which is how quickly we can transcribe huge datasets (measured by real-time factor, or RTF). That’s great for post-processing long audio files, but standard Whisper falls short on speed and accuracy when it comes to powering real-time voice agents where transcription quality needs to be high on every call, not just in aggregate. Plus, standard Whisper wasn’t designed for challenging real-world conditions.
Ultimately, Whisper was fundamentally made for bulk processing, not live dialogue. So, we rearchitected it into Ink-Whisper, purpose-building it for real-time voice AI, with speed and real-world context at its core.
Ink-Whisper is built for real-world conversations
In enterprise use cases, voice AI agents need to transcribe speech as it happens–and do it reliably across a wide range of variable real-world environments. We built Ink-Whisper with those challenges in mind, focusing on accuracy in the types of conditions that typically trip up standard speech-to-text systems:
- Telephony artifacts: Low-bandwidth, compressed audio adds distortion
- Proper nouns and domain terms: Names of products, drugs, or financial instruments require clarity
- Background noise: Traffic, restaurant chatter, crying babies, and static make clean transcription difficult
- Disfluencies and silence: Fillers like “um” and pauses confuse standard Whisper implementations
- Accents and variation: Voices come in all kinds, and STT models need to adapt
One of our core improvements on standard Whisper is dynamic chunking. Standard Whisper performs best on full 30-second chunks. But conversational AI deals in much smaller, more fragmented audio segments. We’ve modified Whisper to handle variable-length chunks that end at semantically meaningful points. That means fewer errors and less hallucination, especially during silence or audio gaps.
To ensure Ink-Whisper actually works better in the wild, we created a suite of evaluation datasets that reflect those common challenges in voice AI:
- Background Noise Dataset: Conversations recorded in noisy environments like traffic, cafes, or offices
- Proper Noun Dataset: 100 samples from SPGISpeech with dense financial terms and brand names
- Speech Accent Dataset: Transcripts featuring a range of English accents, to test robustness across demographics
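Transcription accuracy on datasets like these is commonly scored with word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below is a generic implementation; the lowercasing and whitespace tokenization are assumptions, and the text normalization Cartesia actually applied is not stated here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] is the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("two" -> "to") and one deletion ("daily") over 4 reference words.
print(word_error_rate("take two tablets daily", "take to tablets"))  # 0.5
```

Lower is better: a WER of 0.065 means roughly 6.5 word errors per 100 reference words.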
Across these datasets, Ink-Whisper outperforms baseline whisper-large-v3-turbo in accuracy based on word error rate (WER). The WER for Ink-Whisper is also competitive with other streaming speech-to-text models–and critically, it’s optimized for production-grade, real-time performance:
Word error rate across relevant datasets
Dataset | Dataset details | Cartesia Streaming whisper-natural | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | Assembly Streaming
Phone calls | Natural, conversations over the phone | 0.19 | 0.18 | 0.28 | 0.23
Proper Nouns | Jargon-heavy speech | 0.065 | 0.045 | 0.071 | 0.044
Background Noises | Background noise | 0.033 | 0.038 | 0.099 | 0.027
Disfluencies | Fillers and noise | 0.064 | 0.055 | 0.156 | 0.137
Speech Accent Archive Subset | Diverse accent | 0.015 | 0.024 | 0.014 | 0.016
Ink-Whisper is the fastest streaming model
Beyond accuracy, streaming transcription must deliver ultra-speed to achieve realistic conversation. With Ink-Whisper, we emphasize a new metric: time-to-complete-transcript (TTCT). This is how quickly the full transcript is ready once the user stops talking. TTCT determines how fast the entire system can respond, in a way that mimics a live, attentive listener.
A dead giveaway of an unnatural bot is the lag in its reply. Those lags break the rhythm of natural conversation. They lead to dropped calls, frustrated users, and lost revenue. Having the absolute lowest TTCT is about speed, yes, but ultimately it’s about making the interaction feel human.
We’re proud to share that Ink-Whisper outperforms the baseline whisper-large-v3-turbo on TTCT. In fact, Ink-Whisper delivers the fastest TTCT of any streaming speech-to-text model we’ve tested:
Time to complete transcription after last audio sent
Metric | Cartesia Streaming Ink-Whisper | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | AssemblyAI Universal Streaming
Median (ms) | 66 | 74 | 70 | 737
P90 (ms) | 98 | 109 | 189 | 829
Standard Whisper remains one of the most versatile open STT models, but it wasn’t made for real-time. Ink-Whisper changes that with optimizations for conversational accuracy and ultra-low latency. Ink-Whisper delivers the fastest TTCT we’ve seen, with strong performance across noisy, accented, and dynamic speech. We evaluated Ink-Whisper in the real-world conditions one encounters with voice agents–not the controlled environment of a lab or studio.
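For readers who want to measure TTCT on their own traffic, the sketch below shows one way to turn per-utterance timestamps into the median and P90 figures reported above. The sample values are made up for illustration and are not Cartesia's measurements; the function names are hypothetical, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def ttct_ms(last_audio_sent_s, final_transcript_s):
    """Time from sending the last audio to receiving the full transcript, in ms."""
    return (final_transcript_s - last_audio_sent_s) * 1000.0


def latency_summary(samples_ms):
    """Return the median and 90th percentile of TTCT samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=10) returns the 9 cut points between deciles; the last is P90.
    deciles = statistics.quantiles(samples_ms, n=10, method="inclusive")
    return {"median_ms": statistics.median(samples_ms), "p90_ms": deciles[-1]}


# Hypothetical TTCT samples in milliseconds.
samples = [60, 62, 64, 66, 66, 68, 70, 75, 90, 120]
summary = latency_summary(samples)
print(summary)
```

Reporting P90 alongside the median matters for voice agents because occasional slow responses are what callers notice.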
Ink-Whisper is the most affordable streaming model
Voice is the future–and we’re committed to this belief by making Ink-Whisper accessible to builders of voice solutions. Ink-Whisper is both the fastest and most affordable streaming STT model available. At just 1 credit per second (or $0.13/hr on our Scale plan), Ink-Whisper delivers top-tier real-time transcription at the lowest price.
Getting started with Ink-Whisper is as seamless as the experiences it powers:
- Ink-Whisper easily integrates with Vapi, Pipecat, and LiveKit, so you can start streaming voice interactions in minutes
- With 99.9% uptime and enterprise-grade compliance (SOC 2 Type II, HIPAA, PCI), you can deploy at scale with confidence
Start here or explore the docs.
The future of voice AI with Cartesia
We’re seeing surging demand for voice agents. The most effective ones rely on state-of-the-art audio AI like our Sonic text-to-speech model. Now with Ink-Whisper, we’re meeting that demand on the other side, enabling fast, natural conversations. Today’s release is an early glimpse into how we’re reimagining the real-time voice stack. More to come.
Original source
- May 15, 2025
- Date parsed from source: May 15, 2025
- First seen by Releasebot: May 28, 2026
Introducing Organizations and Dashboards
Cartesia adds Organizations and Dashboards to help developers build and scale voice AI with shared API keys, voices, and billing, plus real-time visibility into credits, concurrency, and WebSocket connections.
We’re building Cartesia for developers scaling voice AI. Today, we’re introducing two features to make collaboration and visibility easier: Organizations and Dashboards.
NEW: Organizations on Cartesia
Build together on one unified platform.
The Organizations feature lets teams move faster with shared access to everything Cartesia offers—API keys, custom voices, and billing—all under one account:
- Shared API keys: Ship more lightning-fast voice experiences with shared API keys
- Shared voices: Keep voices consistent across use cases with access to the same custom, localized, and cloned voices
- Centralized billing: Simplify billing under one plan with pooled credits. (Soon, you’ll be able to see concurrency across the Organization, too)
Invite your team to your Organization today. Now available on Startup+ plans.
Good to know:
- Getting started: Log in to Playground to find your auto-generated Organization and start inviting team members.
- Unlimited seats: Add as many teammates as you need; there’s no cap.
NEW: Dashboards on Cartesia
Scale with visibility into usage and concurrency.
Today, we’re making Dashboards available to every Cartesia member for real-time visibility into usage and performance.
- Monitor credits: Track your credit balance, so you can upgrade as necessary
- View concurrency: Understand how many voice generations are running in parallel, and request more to serve your customers (now available, as of May 20)
- Check WebSocket connections: Watch live sessions scale, so you can plan accordingly (now available, as of May 20)
Log in to Playground to find your Dashboard.
Original source
- May 6, 2025
- Date parsed from source: May 6, 2025
- First seen by Releasebot: May 28, 2026
Introducing Professional Voice Cloning
Cartesia adds Professional Voice Clones powered by Sonic, making high-quality voice cloning more affordable and self-serve on Startup plans and above. Users can fine-tune custom voices through the Playground UI or API across 15 languages, with unlimited PVC slots and scalable credits.
Built on Sonic, Now Available on Startup+
Starting today, you can create Professional Voice Clones (PVCs) trained on Sonic, the world’s fastest text-to-speech model. PVCs are voice clones created by fine-tuning Sonic on voice data, enabling perfect replicas of the tone, cadence, style, and environment. Until now, creating PVCs was costly and inaccessible to most businesses. We’re changing that—making it more affordable and scalable. PVCs are available on our Startup plan and above.
Our team has developed new model infrastructure that enables training and serving PVCs at a fraction of the cost without sacrificing quality or latency. As a result, we’ve been able to make PVCs available self-serve without limits for a fixed number of credits. Whether you’re building virtual avatars, AI agents, or assembling a private voice library, PVCs are now easily accessible through our website or API (no Sales contact needed).
Fine-tune One—or Many
You can fine-tune our best-in-class models to create high-grade PVCs using our standard credit system. Training a PVC costs 1M credits, and generating PVC speech costs 1.5 credits per character. There is no cap on the number of PVCs you can make, as long as you have the credits. The easiest way to get started is with the Startup plan ($49/month), which includes 1.25M credits each month—enough to create up to 15 voices per year. With unlimited PVC slots, you can build a Professional Voice Clone for every persona, tone, or market you serve.
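The credit figures above can be checked with a few lines of arithmetic. The constants come from the text; the function names are hypothetical. Note that the "15 voices per year" figure works out only if a year's worth of credits can be pooled and spent entirely on training, which the text does not confirm.

```python
TRAINING_COST_CREDITS = 1_000_000    # per PVC, from the text
SPEECH_COST_PER_CHARACTER = 1.5      # credits per character, from the text
STARTUP_MONTHLY_CREDITS = 1_250_000  # Startup plan, from the text


def max_pvcs_per_year(monthly_credits=STARTUP_MONTHLY_CREDITS):
    """PVCs trainable in a year if all credits are pooled and spent on training."""
    return (monthly_credits * 12) // TRAINING_COST_CREDITS


def characters_after_training(monthly_credits, pvcs_trained):
    """Characters of PVC speech left in one month after training some PVCs."""
    remaining = monthly_credits - pvcs_trained * TRAINING_COST_CREDITS
    if remaining < 0:
        raise ValueError("not enough credits to train that many PVCs")
    return int(remaining / SPEECH_COST_PER_CHARACTER)


print(max_pvcs_per_year())                              # 15
print(characters_after_training(1_250_000, 1))          # 166666
```

In other words, a month in which one PVC is trained on the Startup plan leaves about 166,000 characters of PVC speech.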
PVCs for Custom Libraries and Personal Avatars
We built our PVCs in response to overwhelming demand from businesses that need bespoke AI voices for private voice libraries and virtual personal avatars. Our customers from healthcare to hospitality can now build out custom voices that are exclusively theirs, on-brand, and consistently sound exactly the way they want. Platforms that enable personal AI avatars, or digital twins, also need PVCs so people can create lifelike clones of themselves at scale. Listen to these demos:
Expressive, studio-quality voices:
Ultra-realistic, conversational voices:
Perfectly capturing diverse speakers:
Create Your First PVC
Professional Voice Clones are available through our Playground UI and API, and across 15 languages. Bring your voice data and let’s begin fine-tuning. For more info, check out Docs.
Original source
- Mar 28, 2025
- Date parsed from source: Mar 28, 2025
- First seen by Releasebot: May 28, 2026
Cartesia Python SDK v2.0.0
Cartesia releases v2.0.0 of its Python SDK, polishing the developer experience for AI voice apps with a primary client, async support, streaming, WebSockets, retries, timeouts, and easier error handling.
We are excited to announce the release of v2.0.0 of our Python SDK, polishing the developer experience when using Cartesia’s AI voice capabilities with Python.
Getting started with Cartesia using Python
Install the Cartesia Python SDK in your project with:
    pip install cartesia

Initialize the SDK and authenticate:

    from cartesia import Cartesia
    import os

    client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

Now, you can start making requests. For example, to generate audio with Cartesia’s text-to-speech model:

    client.tts.bytes(
        model_id="sonic-2",
        transcript="Hello, world!",
        voice={
            "mode": "id",
            "id": "694f9389-aac1-45b6-b726-9d9369183238",
        },
        language="en",
        output_format={
            "container": "wav",
            "sample_rate": 44100,
            "encoding": "pcm_f32le",
        },
    )

For more examples, check out our API Explorer to generate Python code snippets for any of our APIs.
The Python SDK at a glance
Building upon industry established SDK patterns, v2 of our Python SDK delivers a great development experience structured around a primary Cartesia client, which is the entry point for accessing the various API endpoints.
Even more features
- Basic Client - Instantiate and use the client with just 24 lines of code.
- Async Client - The SDK exports an async client alongside standard real-time calls, allowing you to make non-blocking API requests.
- Streaming - The SDK supports streaming responses and outputs a generator that you can iterate over.
- WebSocket - Integrate using WebSockets to build realtime, low-latency voice applications.
- Exception Handling - The API gracefully handles non-success status codes (4xx and 5xx responses).
- Retries - The SDK is instrumented with automatic retries with exponential backoff.
- Timeouts - The SDK defaults to a 60 second timeout. You can configure this with a timeout option at the client or request level.
- Custom Client - You can override the httpx client to customize it for your use case. Some common use cases include support for proxies and transports.
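The retry behaviour listed above follows a common pattern that is easy to picture in isolation. The sketch below is a generic retry loop with exponential backoff and jitter, not the SDK's internal implementation; the function name, parameters and default values are all hypothetical, and the failing operation is simulated.

```python
import random
import time


def call_with_retries(operation, max_retries=2, base_delay=0.5, max_delay=8.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles each attempt, is capped at max_delay, and is jittered.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))


# Simulated request that times out twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


# Recording delays instead of sleeping keeps the demo instant.
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, len(delays))
```

Jitter spreads retries from many clients over time so they do not all hit the service at the same moment after an outage.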
What’s next?
We can’t wait to see what you build with the Cartesia Python SDK! Your feedback helps us improve—let us know your thoughts by submitting issues on GitHub.
Original source
- Mar 18, 2025
- Date parsed from source: Mar 18, 2025
- First seen by Releasebot: May 28, 2026
Introducing Narrations: create and edit long-form audio content with precision
Cartesia introduces Narrations, a platform for turning written content into polished audio productions. It offers a voice library, voice design, instant cloning from 3 seconds of audio, editing control, 15+ languages, and easy import from PDFs, Word, EPUB, Substack, and Medium.
Today we’re excited to introduce Narrations, a platform that enables creators to transform written content into polished audio productions with unprecedented control and efficiency. Whether you’re producing audiobooks, podcasts, or narrative content, Narrations puts professional-grade capabilities at your fingertips.
Voice Technology that Adapts to You
Our extensive Voice Library and Voice Design features give you unprecedented control over how your content sounds. Select from hundreds of voices, fine-tune every aspect of delivery, or create your perfect voice with instant cloning from just 3 seconds of audio.
The possibilities are limitless:
- Craft distinct character voices across 15+ languages and a wide variety of accents
- Fine-tune emotional delivery and pacing
- Edit and perfect individual fragments with different voices, until they sound exactly the way you want.
Get started easily
Import any existing document - PDFs, Word documents, EPUBs, or directly from platforms like Substack and Medium. Generate expressive audio content from your favorite writing and publishing platform.
This is more than just text-to-speech - it’s a complete reimagining of how you can create audio content, powered by Sonic 2.0.
Original source
- Mar 11, 2025
- Date parsed from source: Mar 11, 2025
- First seen by Releasebot: May 28, 2026
Series A and the future of voice AI
Cartesia launches Sonic 2.0, its fastest and most controllable voice generation model, with stronger voice cloning and 90ms latency. It also adds Voice changer and Infill endpoints, expanding the Sonic API for more precise voice AI creation.
At Cartesia, we’re building the future of voice AI - ultra-realistic, fast, and controllable. Over the past year, we’ve powered millions of calls and helped tens of thousands of creators make their content more accessible.
We’re thrilled to announce our $64 million Series A led by Kleiner Perkins. The new funding will help us expand our team and invest in research to build the next generation of models, infrastructure, and products for voice, starting with the launch of our latest voice generation model— Sonic 2.0.
Sonic 2.0 is built on our new state space model architecture and is the fastest and most controllable voice model available today. It’s twice as large as Sonic, yet runs faster, at just 90ms latency for the full model and 40ms for turbo. And in blind, head-to-head evaluations on 100 held out voices, 1.5x as many people preferred Sonic 2.0 over the next best provider.
Beyond speed and quality, Sonic 2.0 offers unprecedented control over generations, with best-in-class voice cloning that captures complex accents and rich audio soundscapes. We’ve also introduced two powerful new endpoints:
- Voice changer – Perfect the style and voice of your audio.
- Infill – Seamlessly edit content within your audio.
We’re building the platform for Voice AI with enterprise-grade infrastructure. The Sonic API is purpose built for developers and has the most reliable and fastest serving stack for voice generation, with 99.9% uptime and the fastest P90 latencies globally. We’re SOC-2 and HIPAA compliant and support real-time on-premise and on-device deployments.
Finally, we’re continuing to advance our long-term research agenda. The next generation of audio models will require multiple algorithmic advances in several areas, including streaming architectures, codecs, long context modeling, and on-device inference - and we’re excited to share our progress here.
Learn more about our work on voice AI at cartesia.ai/sonic. If you’re interested in working with us, please reach out.
Original source