Cartesia Release Notes

Last updated: Dec 23, 2025

  • December 2025
    • No date parsed from source.
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Sonic

    Cartesia unveils an AI voice changer with a broad voice library, custom voices, multilingual localization, and ultra-fast real time generation. It enables precise voice transformation across media, gaming, and customer experiences.

    Reimagine your voice with our AI voice changer

    Let our AI re-deliver your words in another voice, exactly the way you want them to sound.

    TRY IT OUT

    TALK TO SALES

    TRUSTED BY 50K+ CUSTOMERS

    Transform your voice today
    Experience precise control over the speech you generate on our platform. Show us how you want something to be said, and our voices will deliver it perfectly, every time.

    EXTENSIVE VOICE LIBRARY

    Discover a diverse collection of unique voices in our voice library to bring your content to life.

    CUSTOM VOICES

Mix multiple voices and customize speed and emotion to design your own custom voice.

    TRANSLATION AND LOCALIZATION

    Our AI voice changer excels at localization, preserving the original voices and emotions.

Instantly change your voice from a 3-second clip. Scale up to hours of content with Fine-Tuning.

Voice demos: Oracle, Brighton, Hero Voice, Robotic Male, Pippa, and Overlord, each paired with a source recording.

    "Cartesia’s Sonic model is a game-changer for our Conversational Video Interface. Its ultra-low latency of 90ms and high-quality voice generation have enabled us to create truly immersive real-time conversations with AI digital twins. The natural voices and voice design capabilities have elevated our product to new heights."
    — Hassaan Raza, Co-Founder and CEO, Tavus

    Make your content accessible to a global audience
    Sonic supports seamless speech in 15 languages, with more added every release.

    15 LANGUAGES

    From Japanese to German—any language you need, we’ve got it.

    LOCALIZATION

    Localize a given voice to any accent or language.

    • German
    • English
    • Spanish
    • French
    • Japanese
    • Portuguese
    • Chinese
    • Italian

    What our customers say
    Join the growing list of companies opting for Sonic.

    TRY IT NOW

    TALK TO SALES

    "Cartesia’s voice API power dynamic and empathetic conversational experiences that are consistently dependable. What really stands out to me is how natural and considerate the responses feel—especially the empathetic tone in statements like ‘I’m sorry, that must be frustrating.’"
    Sami Ghoche, CEO of Forethought

    "In 1999, Salesforce brought software to the cloud. In 2025, 11x is killing software as we know it and unleashing the era of digital workers. To realise this vision, we needed AI voice technology that feels truly human. Cartesia’s technology gives our AI digital workers reps the speed, reliability, and natural expressiveness required to engage customers at scale.
    It's the only solution fit for our relentless drive toward innovation.”
    Keith Fearon, Head of Product & Growth, 11x

    "Before conversational voice models like Cartesia, Thoughtly relied on legacy text-to-speech APIs from major cloud providers. Nearly two years later, the evolution of this technology is staggering—customers can clone their voice and hear it speaking autonomously over the phone in just 60 seconds.”
    Torrey Leonard, CEO, Thoughtly

    Lifelike, expressive voices for every use case
    Support
    Power support experiences that delight your customers.

    Gaming
Bring your storytelling to life with immersive voices.

    Content
    Create content that engages viewers and drives clicks.

    Media
    Narrate content for podcasts, news, and publishing.

    Healthcare
    Empower healthcare with voices that patients trust.

    Sales
    Scale sales with lifelike voices that lead to conversions.

    Voice Agents
    Build responsive AI voice agents for any use case.

    Dubbing
    Go global with localized voices and accents for every language.

    Avatars
    Create expressive, relatable AI avatars for any use case.

    Logistics
    Automate complex logistics with voice-enabled systems.

    Recruiting
    Screen candidates with AI-powered voice interviews.

    Accessibility
    Make your content accessible to anyone, anywhere.

    How to Use Our AI Voice Changer

    STEP ONE

    Try Cartesia's AI voice changer on our website. Simply create a free account and upload your original recording.

    STEP TWO

    Choose your preferred voice and language settings. Transform your voice with our lifelike voice changer.

    STEP THREE

    Apply voice changes to your content to hear it in a different voice, with complete control over the delivery.

    Frequently asked questions

    • How does the free AI voice changer work?
    • What is a realtime voice changer client?
    • Can I use the voice changer for gaming?
    • Is the voice changer free to use?
    • How do I access the voice changer?
    • What makes our voice changer unique?
    Original source Report a problem
  • Aug 19, 2025
    • Parsed from source:
      Aug 19, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Introducing Line: The Modern Voice Agent Development Platform

    Cartesia launches Line, a code-first voice agent platform powered by Sonic and Ink, delivering low latency, on‑prem support, and code-driven tooling. Available now to all developers with prepaid plan credits; sign up to build your first agent.

    Over the past 12 months, tens of thousands of developers have built and productionized industry-leading voice agents using our frontier speech models, Sonic and Ink. In serving these developers, we found that even the most sophisticated teams struggled to build great voice agents with the tools available today.
Voice agents are uniquely hard to bring to production: they require maintaining natural, low-latency conversations with advanced reasoning across millions of calls. A few challenges come up repeatedly: more reasoning introduces higher latency, small changes to one component can cause unpredictable quality swings, and fragile infrastructure spread across multiple providers makes scaling unpredictable.
    That’s why we built Line. Line is the modern voice agent development platform, making it easy for developers and businesses everywhere to build best-in-class voice agents.

    Line is code-first

The best products are built with code. We realized this early as we reflected on the challenges developers faced. Great businesses transform industries by putting code at the center – Stripe reimagined payments, Twilio transformed communications, and Vercel redefined the web.
    Intelligent agents and great conversations are impossible to express in rigid conversational builders, which put limits on reasoning ability. Instead, the best voice agents should be like the best humans, using fluid context, advanced logic, and background reasoning.
    Line uses code to give you the flexibility to perfect every part of your voice agent. The Line SDK makes it intuitive to build advanced voice agents with background reasoning, and is designed to cover any use case. Edge cases are easy to handle directly in code, and you can integrate your favorite packages alongside our SDK, no compromises necessary.
With AI, code is the fastest way to build and more accessible than ever. Developers are writing more lines of code than ever, using AI to build complex applications quickly. Our code-first approach and SDK roadmap are designed with this future in mind.

    Line helps you iterate fast

    Building great voice agents is an iterative process. Progress comes from making changes, testing them in live conversations, learning what works, and adjusting based on feedback. Line helps you move through that cycle quickly.
    Line gets you to your first agent in minutes, starting from a text prompt or template. With our CLI and GitHub integrations, you can develop agents locally, deploy with one command to our platform, and talk to your agents in seconds. You can even share your agent with others.
    Every iteration requires comprehensive evals and manual testing to get right. Developers constantly refine prompts, business logic, and background reasoning. They need to measure progress on their agents over time.
    Line records all calls to deployed agents, saves the audio and transcripts, reports system metrics like latency, and provides comprehensive logs for auditing and debugging. Developers can use custom LLM-as-a-judge metrics to rate calls along any axis, including user satisfaction and call success.
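A minimal sketch of the LLM-as-a-judge pattern described above. This is illustrative only and not the Line SDK API: call_llm is a hypothetical stand-in for whichever LLM client you use, and the rubric axes are examples.

import json

JUDGE_PROMPT = """You are grading a voice-agent call. Rate the transcript below
from 1 to 5 on each axis and return JSON with keys "user_satisfaction",
"call_success", and "justification".

Transcript:
{transcript}
"""

def judge_call(transcript: str, call_llm) -> dict:
    # call_llm is a hypothetical stand-in for any LLM client that takes a prompt
    # string and returns the model's text completion.
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript))
    return json.loads(raw)

# Usage (hypothetical): scores = judge_call(saved_transcript, call_llm=my_llm)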

    Line brings you our frontier models and infrastructure

    Best-in-class voice agents require frontier speech models. At Cartesia, our research team is pioneering the world’s best speech models using breakthrough advances in AI architectures.
    Sonic is the world’s lowest latency text-to-speech model, with ultra-realistic conversational voices, and reliable transcript following. Ink is the fastest streaming speech-to-text model for speech understanding. Collectively, Sonic and Ink have powered millions of calls for businesses, from the fastest growing companies in voice AI to the Fortune 500.
    Deeply integrating models and infrastructure creates new opportunities to optimize the end-to-end voice agent experience. Line voice agents are deployed alongside our Sonic and Ink models for the fastest latencies, and they'll be the first to access our latest research advances in speech modeling.
    For enterprises, this integration extends further: Line can be deployed entirely on-prem, from voice agents to models, and models can be customized with fine-tuning.
    Line, Sonic, and Ink run on Cartesia’s reliable, globally distributed infrastructure, allowing every deployed voice agent to scale to thousands of calls while maintaining low latency.

    Line is now available

    Line is available today to all developers. As part of our launch, all subscription tiers will get their equivalent monthly plan dollars prepaid towards agents—at no extra cost.
    Our north-star is simple. No matter how complex the code you write, when you deploy to the Line platform, every call should be a flawless experience.
    With Line, we hope to make frontier voice AI accessible to everyone, and in turn, to make AI more accessible to the world.
    We’re excited to see what you build. Sign up and check out the docs now to create your first voice agent. Contact us to start enterprise plan discussions.

    Original source Report a problem
  • Jul 11, 2025
    • Parsed from source:
      Jul 11, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Hierarchical modeling

Researchers unveil H-Nets, a hierarchical AI architecture that learns from raw data to form meaningful chunks, boosting scalability, robustness, and long-context reasoning beyond traditional transformers. The team releases an arXiv preprint and HuggingFace checkpoints for several H-Net variants, inviting collaboration.

    The best AI architectures in use today treat all inputs equally. They process each input with the same amount of compute, without explicitly grouping related inputs into higher level concepts. While these architectures have achieved impressive results across domains, this lack of hierarchy has some fundamental limitations.

• Models have difficulty learning from high-resolution raw data, requiring inputs to be pre-processed into meaningful tokens for strong performance.
    • The use of hand-crafted pre-processing steps (e.g. tokenization) can cause models to fail unexpectedly with small perturbations in the input data.
    • Models waste compute on tokens that are easy to predict and not informative.

More importantly, information is fundamentally hierarchical. In language, ideas are chunked into characters, words, sentences, and paragraphs; in images, pixels are grouped into edges, shapes, and objects; in audio, raw waveforms are grouped into phonemes, sentences, and conversation turns. As humans, we consume raw information and group it in meaningful ways that allow us to reason and make connections at different levels of abstraction, from low-level units to high-level ideas. This is core to intelligence. We believe hierarchical models will address several of the fundamental limitations and shortcomings of today’s architectures.

We’re excited to announce our latest research collaboration on hierarchical networks (H-Nets), a new architecture that natively models hierarchy from raw data. The core of the H-Net architecture is a dynamic chunking mechanism that learns to segment and compress raw data into meaningful concepts for modeling. It has three components: an encoder network, the main network, and a decoder network. The core of the encoder network is a routing module, which uses a similarity score to predict which inputs should be grouped together and compressed into chunks for the main network. The main network can be any sequence-to-sequence model, and is responsible for next-token prediction over these higher-level chunks. Finally, the decoder network learns to decode chunks back into raw data, with a smoothing module for stabilizing learning.
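For intuition, here is a toy sketch of the dynamic chunking idea: place a chunk boundary wherever adjacent encoder states look dissimilar. This is a simplification for illustration only, not the actual H-Net routing module.

import numpy as np

def chunk_boundaries(states: np.ndarray, threshold: float = 0.5) -> list:
    """states: (seq_len, dim) encoder outputs, one per raw input (e.g. one per byte).
    Returns the indices where a new chunk starts."""
    normed = states / np.linalg.norm(states, axis=1, keepdims=True)
    # Cosine similarity between each position and the one before it.
    sims = (normed[1:] * normed[:-1]).sum(axis=1)
    # Low similarity to the previous position suggests a concept boundary.
    return [0] + [i + 1 for i, s in enumerate(sims) if s < threshold]

states = np.random.randn(32, 16)   # stand-in for encoder outputs
print(chunk_boundaries(states))    # e.g. [0, 3, 7, ...]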

    H-Net demonstrates three important results on language modeling:

    • H-Nets scale better with data than state-of-the-art Transformers with BPE tokenization, while learning directly from raw bytes. This improved scaling is even more pronounced on domains without natural tokenization boundaries, like Chinese, code, and DNA.
    • H-Nets can be stacked together to learn from deeper hierarchies, which further improves performance.
    • H-Nets are significantly more robust to small perturbations in input data like casing, showing an avenue for creating models that are more robust and aligned with human reasoning.

Our investment in this research is part of our larger push to build the next generation of AI models that are multimodal, highly efficient, and reason and improve over long horizons. State space models represented our first research advancement, enabling stateful models that can compress information over long contexts. We believe H-Nets, and hierarchical modeling, are the key next step to addressing fundamental challenges in AI:

• Multimodal understanding and generation: A key challenge in multimodal modeling is fusing multiple streams of data. This is difficult today, since different modalities are tokenized at different rates. For example, language is tokenized into subwords, while audio is tokenized as raw waveforms or downsampled codecs. This makes them difficult to model jointly. Hierarchical models like H-Net provide a promising path to fuse these multimodal streams at a higher abstraction level, enabling better transfer, reasoning, and understanding across modalities.
    • Long-context reasoning: H-Nets unlock long context reasoning by chunking information into semantically meaningful units at higher levels of abstraction. This compression makes it easier for models to understand and reason across large inputs, particularly with deeper and deeper hierarchies. Hierarchical architectures will enable models that understand their environment from raw data and reason at appropriate levels of abstraction over long horizons.
• Efficient training and inference: Today’s architectures use the same amount of compute for every token, even though some tokens are less informative and easier to predict than others. Inference-time optimizations, like speculative decoding, exploit this property to speed up computation on easier-to-predict tokens. With H-Nets, this is built directly into the architecture by handling easier-to-predict tokens with lightweight encoder and decoder modules.

    For more, read our full preprint on arXiv. We’ve also released checkpoints for H-Net 2-stage XL, H-Net 1-stage XL, and H-Net 1-stage L on HuggingFace.

    If you’re excited about the future of architecture research and building systems and infrastructure to deliver these new models at scale, please reach out!

    Original source Report a problem
  • Jun 10, 2025
    • Parsed from source:
      Jun 10, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Introducing Ink: speech-to-text models for real-time conversation

    Ink-Whisper launches a real-time streaming STT model built for low latency and accuracy in live conversations. It outpaces rivals on time-to-complete-transcript, is highly affordable, enterprise-ready, and easily integrates with popular tools for instant voice interactions.

    Ink, a new family of streaming speech-to-text models

    Today we’re introducing Ink, a new family of streaming speech-to-text (STT) models for developers building real-time voice applications. Our debut model is Ink-Whisper, a variant of OpenAI’s Whisper, specifically optimized for low-latency transcription in conversational settings. Available today, Ink-Whisper is the fastest, most affordable STT model–designed for enterprise-grade voice agents.

    Sonic to Ink: from voice-out to voice-in

    With Sonic, we've become the preferred text-to-speech (TTS) provider for builders who prioritize speed, quality, and reliability. This comes as no surprise, given Sonic's market leadership in ultra-low latency, enabling customers to create the most realistic interactive voice experiences. Now, we’re turning our attention to the other side of the conversation (STT) with Ink.

    Reimagining Whisper for real-time

For our STT release, we looked at what developers are already using and what might be broken. That led us to OpenAI’s whisper-large-v3-turbo. It is widely used for good reasons–Whisper’s conversational transcription accuracy is comparable to proprietary speech-to-text providers, it is open source, and it can be run efficiently for inference.

    Most of the innovation around Whisper has focused on improving throughput, which is how quickly we can transcribe huge datasets (measured by real-time factor, or RTF). That’s great for post-processing long audio files, but standard Whisper falls short on speed and accuracy when it comes to powering real-time voice agents where transcription quality needs to be high on every call, not just in aggregate. Plus, standard Whisper wasn’t designed for challenging real-world conditions.

    Ultimately, Whisper was fundamentally made for bulk processing, not live dialogue. So, we rearchitected it into Ink-Whisper, purpose-building it for real-time voice AI, with speed and real-world context at its core.

    Ink-Whisper is built for real-world conversations

    In enterprise use cases, voice AI agents need to transcribe speech as it happens–and do it reliably across a wide range of variable real-world environments. We built Ink-Whisper with those challenges in mind, focusing on accuracy in the types of conditions that typically trip up standard speech-to-text systems:

    • Telephony artifacts: Low-bandwidth, compressed audio adds distortion
    • Proper nouns and domain terms: Names of products, drugs, or financial instruments require clarity
    • Background noise: Traffic, restaurant chatter, crying babies, and static make clean transcription difficult
    • Disfluencies and silence: Fillers like "um" and pauses confuse standard Whisper implementations
    • Accents and variation: Voices come in all kinds, and STT models need to adapt

    One of our core improvements on standard Whisper is dynamic chunking. Standard Whisper performs best on full 30-second chunks. But conversational AI deals in much smaller, more fragmented audio segments. We've modified Whisper to handle variable-length chunks that end at semantically meaningful points. That means fewer errors and less hallucination, especially during silence or audio gaps.

To ensure Ink-Whisper actually works better in the wild, we created a suite of evaluation datasets that reflect those common challenges in voice AI:

    • Background Noise Dataset: Conversations recorded in noisy environments like traffic, cafes, or offices
    • Proper Noun Dataset: 100 samples from SPGISpeech with dense financial terms and brand names
    • Speech Accent Dataset: Transcripts featuring a range of English accents, to test robustness across demographics

    Across these datasets, Ink-Whisper outperforms baseline whisper-large-v3-turbo in accuracy based on word error rate (WER). The WER for Ink-Whisper is also competitive with other streaming speech-to-text models–and critically, it’s optimized for production-grade, real-time performance:

Word error rate (WER) across relevant datasets:

Dataset | Cartesia Streaming Ink-Whisper | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | Assembly Streaming
Phone calls (natural conversations over the phone) | 0.19 | 0.18 | 0.28 | 0.23
Proper Nouns (jargon-heavy speech) | 0.065 | 0.045 | 0.071 | 0.044
Background Noises (background noise) | 0.033 | 0.038 | 0.099 | 0.027
Disfluencies (fillers and noise) | 0.064 | 0.055 | 0.156 | 0.137
Speech Accent Archive Subset (diverse accents) | 0.015 | 0.024 | 0.014 | 0.016
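For reference, each WER number above is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the model's output, divided by the number of reference words. A minimal sketch:

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("call me at nine", "call me at noon"))  # 0.25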

    Ink-Whisper is the fastest streaming model

Beyond accuracy, streaming transcription must deliver ultra-low latency to achieve realistic conversation. With Ink-Whisper, we emphasize a new metric: time-to-complete-transcript (TTCT). This is how quickly the full transcript is ready once the user stops talking. TTCT determines how fast the entire system can respond, in a way that mimics a live, attentive listener.

A dead giveaway of an unnatural bot is the lag in its reply. Those lags break the rhythm of natural conversation. They lead to dropped calls, frustrated users, and lost revenue. Having the absolute lowest TTCT is about speed, yes, but ultimately it’s about making the interaction feel human.
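A minimal sketch of how TTCT can be measured against any streaming STT provider: stamp the moment the last audio chunk is sent, then the moment the final transcript arrives. The stt_stream object below is a hypothetical streaming client, not any specific SDK.

import time

def measure_ttct(stt_stream, audio_chunks) -> float:
    """Returns time-to-complete-transcript (TTCT) in milliseconds."""
    for chunk in audio_chunks:
        stt_stream.send(chunk)
    last_audio_sent = time.monotonic()
    stt_stream.finalize()                   # hypothetical: signal end of speech
    stt_stream.wait_for_final_transcript()  # hypothetical: block until final text
    return (time.monotonic() - last_audio_sent) * 1000.0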

    We’re proud to share that Ink-Whisper outperforms the baseline whisper-large-v3-turbo on TTCT. In fact, Ink-Whisper delivers the fastest TTCT of any streaming speech-to-text model we’ve tested:

Time to complete transcription after last audio sent:

Metric | Cartesia Streaming Ink-Whisper | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | AssemblyAI Universal Streaming
Median (ms) | 66 | 74 | 70 | 737
P90 (ms) | 98 | 109 | 189 | 829

    Standard Whisper remains one of the most versatile open STT models, but it wasn’t made for real-time. Ink-Whisper changes that with optimizations for conversational accuracy and ultra-low latency. Ink-Whisper delivers the fastest TTCT we’ve seen, with strong performance across noisy, accented, and dynamic speech. We evaluated Ink-Whisper in the real-world conditions one encounters with voice agents–not the controlled environment of a lab or studio.

    Ink-Whisper is the most affordable streaming model

Voice is the future, and we back that belief by making Ink-Whisper accessible to builders of voice solutions. Ink-Whisper is both the fastest and most affordable streaming STT model available. At just 1 credit per second (or $0.13/hr on our Scale plan), Ink-Whisper delivers top-tier real-time transcription at the lowest price.

    Getting started with Ink-Whisper is as seamless as the experiences it powers:

    • Ink-Whisper easily integrates with Vapi, Pipecat, and LiveKit, so you can start streaming voice interactions in minutes
    • With 99.9% uptime and enterprise-grade compliance (SOC 2 Type II, HIPAA, PCI), you can deploy at scale with confidence

    Start here or explore the docs.

    The future of voice AI with Cartesia

    We’re seeing surging demand for voice agents. The most effective ones rely on state-of-the-art audio AI like our Sonic text-to-speech model. Now with Ink-Whisper, we’re meeting that demand on the other side, enabling fast, natural conversations. Today’s release is an early glimpse into how we’re reimagining the real-time voice stack. More to come.

    Original source Report a problem
  • May 15, 2025
    • Parsed from source:
      May 15, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Introducing Organizations and Dashboards

    Cartesia launches two team features: Organizations for shared API keys, voices, and centralized billing with unlimited seats, plus Dashboards for real time usage, credits, concurrency, and live WebSocket insights. Available now on Startup+ and for all members, with May 20 updates.

    We’re building Cartesia for developers scaling voice AI. Today, we’re introducing two features to make collaboration and visibility easier: Organizations and Dashboards.

    NEW: Organizations on Cartesia

    Build together on one unified platform.
    The Organizations feature lets teams move faster with shared access to everything Cartesia offers—API keys, custom voices, and billing—all under one account:

    • Shared API keys: Ship more lightning-fast voice experiences with shared API keys
    • Shared voices: Keep voices consistent across use cases with access to the same custom, localized, and cloned voices
    • Centralized billing: Simplify billing under one plan with pooled credits. (Soon, you'll be able to see concurrency across the Organization, too)

Invite your team to your Organization today. Now available on Startup+ plans.

    Good to know:

• Getting started: Log in to Playground to find your auto-generated Organization and start inviting team members.
    • Unlimited seats: Add as many teammates as you need; there’s no cap.

    NEW: Dashboards on Cartesia

    Scale with visibility into usage and concurrency.
Today, we’re making Dashboards available to every Cartesia member for real-time visibility into usage and performance.

    • Monitor credits: Track your credit balance, so you can upgrade as necessary
    • View concurrency: Understand how many voice generations are running in parallel, and request more to serve your customers (now available, as of May 20)
    • Check WebSocket connections: Watch live sessions scale, so you can plan accordingly (now available, as of May 20)

    Log in to Playground to find your Dashboard.

    Original source Report a problem
  • May 6, 2025
    • Parsed from source:
      May 6, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Introducing Professional Voice Cloning

    Professional Voice Clones built on Sonic are now live on the Startup plan, enabling self-serve fine-tuning and unlimited PVCs on credits. Create voices for avatars and private libraries via Playground UI and API in 15 languages.

    Built on Sonic, Now Available on Startup+

    Starting today, you can create Professional Voice Clones (PVCs) trained on Sonic, the world’s fastest text-to-speech model. PVCs are voice clones created by fine-tuning Sonic on voice data, enabling perfect replicas of the tone, cadence, style, and environment. Until now, creating PVCs was costly and inaccessible to most businesses. We’re changing that—making it more affordable and scalable. PVCs are available on our Startup plan and above.
    Our team has developed new model infrastructure that enables training and serving PVCs at a fraction of the cost without sacrificing quality or latency. As a result, we've been able to make PVCs available self-serve without limits for a fixed number of credits. Whether you're building virtual avatars, AI agents, or assembling a private voice library, PVCs are now easily accessible through our website or API (no Sales contact needed).

    Fine-tune One—or Many

    You can fine-tune our best-in-class models to create high-grade PVCs using our standard credit system. Training a PVC costs 1M credits, and generating PVC speech costs 1.5 credits per character. There is no cap on the number of PVCs you can make, as long as you have the credits. The easiest way to get started is with the Startup plan ($49/month), which includes 1.25M credits each month—enough to create up to 15 voices per year. With unlimited PVC slots, you can build a Professional Voice Clone for every persona, tone, or market you serve.
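A quick back-of-the-envelope check of the math above, using only the figures quoted in this post (a sketch; plan details may change):

MONTHLY_CREDITS = 1_250_000     # Startup plan, $49/month
PVC_TRAINING_COST = 1_000_000   # credits to fine-tune one PVC
SPEECH_COST_PER_CHAR = 1.5      # credits per generated character

yearly_credits = MONTHLY_CREDITS * 12
print(yearly_credits // PVC_TRAINING_COST)   # 15 -> up to 15 PVCs per year, as stated
print(100_000 * SPEECH_COST_PER_CHAR)        # 150000.0 credits to generate 100k characters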

    PVCs for Custom Libraries and Personal Avatars

    We built our PVCs in response to overwhelming demand from businesses that need bespoke AI voices for private voice libraries and virtual personal avatars. Our customers from healthcare to hospitality can now build out custom voices that are exclusively theirs, on-brand, and consistently sound exactly the way they want. Platforms that enable personal AI avatars, or digital twins, also need PVCs so people can create lifelike clones of themselves at scale. Listen to these demos:
• Expressive, studio-quality voices
• Ultra-realistic, conversational voices
• Perfectly capturing diverse speakers

    Create Your First PVC

    Professional Voice Clones are available through our Playground UI and API, and across 15 languages. Bring your voice data and let’s begin fine-tuning. For more info, check out Docs.
    Get Started
    Prepare your voice data to begin creating your first Professional Voice Clone
    Read the docs

    Original source Report a problem
  • Apr 16, 2025
    • Parsed from source:
      Apr 16, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    sonic-2-2025-04-16

    Sonic 2 removes embeddings and experimental controls, boosting stability with similarity cloning. New guidance favors Instant Voice Cloning and older API versions for compatibility while deprecated endpoints and features are sunset to streamline performance and reliability.

    Updates to existing APIs

    Starting with sonic-2-2025-04-16, we’re removing support for:

    • Embeddings
    • stability cloning mode
    • Experimental controls for adjusting speed and emotion.

    In our latest models, the similarity cloning mode is dramatically better than stability cloning, and we’ve solved the stability issues. Since it’s no longer worth using stability mode at all, we’ve removed it.

    The experimental controls for speed and emotion negatively affect model stability in practice. We’re working on adding these controls in future model versions in a way that is more stable and effective.

To control speed and emotion in Sonic 2.0, we recommend using Instant Voice Cloning to create new Voices with the modifications you’re looking for. For example, you can speed up or slow down voices with FFmpeg, use Voice Changer to create emotive versions of voices, and make instant clones of voices generated with embeddings on sonic-2-2025-03-07.
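As a concrete example of the FFmpeg route mentioned above, here is a minimal sketch that re-times a reference clip with FFmpeg's standard atempo filter (which changes speed without affecting pitch) before cloning it; the file names are placeholders.

import subprocess

def change_speed(src: str, dst: str, tempo: float = 1.25) -> None:
    """Re-encode src at tempo-times speed (pitch preserved) using FFmpeg's atempo filter."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst],
        check=True,
    )

change_speed("voice_sample.wav", "voice_sample_fast.wav", tempo=1.25)  # placeholder file names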

    Users who absolutely need embeddings or experimental controls should use API version 2024-11-13 with model ID sonic-2-2025-03-07, both of which are still available. Please be aware that these features could lead to reduced model stability even with that model.

    Specific API changes:

    • sonic-2 and sonic-2-2025-04-16 will ignore experimental controls used with TTS generations
• Voice cloning only supports similarity clones — this technique performs best across the board on the sonic-2 model.
    • Removed embeddings from all endpoints.
    • Voices may only be specified by Voice ID
      • /tts generations cannot be called with voice embeddings
    • Deprecated /voices/create and /voices/mix
    Original source Report a problem
  • Mar 28, 2025
    • Parsed from source:
      Mar 28, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Cartesia Python SDK v2.0.0

Cartesia unveils Python SDK v2.0.0 with a smoother developer experience for AI voice. It adds async and streaming support, WebSocket access, retries with exponential backoff, configurable timeouts, and a streamlined client developers can start using today.

    v2.0.0 Python SDK release

    We are excited to announce the release of v2.0.0 of our Python SDK, polishing the developer experience when using Cartesia's AI voice capabilities with Python.

    Getting started with Cartesia using Python

    Install the Cartesia Python SDK in your project with:

    pip install cartesia
    

    Initialize the SDK and authenticate:

    from cartesia import Cartesia
    import os
    client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))
    

    Now, you can start making requests. For example, to generate audio with Cartesia's text-to-speech model:

    client.tts.bytes(
      model_id="sonic-2",
      transcript="Hello, world!",
      voice={
        "mode": "id",
        "id": "694f9389-aac1-45b6-b726-9d9369183238",
      },
      language="en",
      output_format={
        "container": "wav",
        "sample_rate": 44100,
        "encoding": "pcm_f32le",
      },
    )
    

    For more examples, check out our API Explorer to generate Python code snippets for any of our APIs.

    The Python SDK at a glance

Building on industry-established SDK patterns, v2 of our Python SDK delivers a great development experience structured around a primary Cartesia client, which is the entry point for accessing the various API endpoints.

    Even more features

    • Basic Client - Instantiate and use the client with just 24 lines of code.
• Async Client - The SDK exports an async client alongside the standard synchronous client, allowing you to make non-blocking API requests (see the sketch after this list).
    • Streaming - The SDK supports streaming responses and outputs a generator that you can iterate over.
    • WebSocket - Integrate using WebSockets to build realtime, low-latency voice applications.
    • Exception Handling - The API gracefully handles non-success status codes (4xx and 5xx responses).
    • Retries - The SDK is instrumented with automatic retries with exponential backoff.
    • Timeouts - The SDK defaults to a 60 second timeout. You can configure this with a timeout option at the client or request level.
    • Custom Client — You can override the httpx client to customize it for your use-case. Some common use-cases include support for proxies and transports.
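As referenced in the list above, here is a minimal sketch of the async client with a custom timeout. It assumes the SDK exports the async client as AsyncCartesia and that the constructor accepts a timeout override in seconds; check the SDK docs for the exact option names.

import asyncio
import os

from cartesia import AsyncCartesia  # assumed export name; see the SDK docs

async def main() -> None:
    # Assumed: the constructor accepts a timeout override (the default is 60 seconds).
    client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"), timeout=30)
    # Endpoints mirror the synchronous client, e.g. client.tts.bytes(...),
    # but run without blocking the event loop.
    ...

asyncio.run(main())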

    What’s next?

    We can’t wait to see what you build with the Cartesia Python SDK! Your feedback helps us improve—let us know your thoughts on our Discord or by submitting issues on GitHub.

    Try our Python SDK today

    Learn more about the implementation details

    Integrate now

    Original source Report a problem
  • Mar 25, 2025
    • Parsed from source:
      Mar 25, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Cartesia Named to 7th Annual Enterprise Tech 30 List Presented by Wing Venture Capital

    Cartesia is named to Wing Venture Capital’s 7th Enterprise Tech 30 list, spotlighting its generative audio models among top private enterprise tech. The honor signals growing market validation and potential for future IPOs or multi‑billion exits.


    Karan Goel

    March 25, 2025 - Cartesia, a leading provider of generative audio models, today announced it has been named to the seventh annual Enterprise Tech 30—a definitive list of the most promising, private enterprise tech companies across all stages of maturity.

    The Enterprise Tech 30 reveals the enterprise startups who have the most potential to meaningfully shift how enterprises operate. Companies are inducted into the elite community through a selection-only process based on a successful background and substantial reach in enterprise technology. Nearly 600 venture-backed private enterprise tech companies across various stages were considered in the two-phase process by a record 100+ venture capitalists. Inductees to the ET30 are on a fast track to change how business is done and are expected to be future IPOs and multi-billion dollar exits.

    The companies are categorized by total capital raised. The Giga stage includes companies that have raised $1 billion or more; early-stage includes companies that have raised $35 million or less; mid-stage from $35 million to $150 million; and late-stage includes $150 million or more.

“Wing’s Enterprise Tech 30 list recognizes the most promising private, venture-backed companies. These companies can’t lobby or petition to be on the list; they are named based on a ballot process that taps 100+ leading VCs and technologists,” said Peter Wagner, Founding Partner at Wing Venture Capital and founder of the Enterprise Tech 30. “Making it on the ET30 is a massive accomplishment. Congratulations to all of the listees—they serve as the backbone of how businesses operate and are paving the way for business transformation.”

    For more information on the research methodology, additional insights, and to view the results, visit enterprisetech30.com.

    Original source Report a problem
  • Mar 18, 2025
    • Parsed from source:
      Mar 18, 2025
    • Detected by Releasebot:
      Dec 23, 2025

    Cartesia

    Introducing Narrations: create and edit long-form audio content with precision

Cartesia launches Narrations, a new platform that turns text into polished audio with hundreds of voices, instant 3-second cloning, multilingual delivery, and seamless import from PDFs, Word documents, EPUBs, and platforms like Substack. It reimagines audio creation, powered by Sonic 2.0.

    Chang Chen

    Today we're excited to introduce Narrations, a platform that enables creators to transform written content into polished audio productions with unprecedented control and efficiency. Whether you're producing audiobooks, podcasts, or narrative content, Narrations puts professional-grade capabilities at your fingertips.

    Voice Technology that Adapts to You

Our extensive Voice Library and Voice Design features give you unprecedented control over how your content sounds. Select from hundreds of voices, fine-tune every aspect of delivery, or create your perfect voice with instant cloning from just 3 seconds of audio.

    The possibilities are limitless:

• Craft distinct character voices across 15+ languages and a wide variety of accents
• Fine-tune emotional delivery and pacing
• Edit and perfect individual fragments with different voices until they sound exactly the way you want.

    Get started easily

    Import any existing document - PDFs, Word documents, EPUBs, or directly from platforms like Substack and Medium. Generate expressive audio content from your favorite writing and publishing platform.

    This is more than just text-to-speech - it's a complete reimagining of how you can create audio content, powered by Sonic 2.0.

    Experience the future of audio content creation

    Try Narrations now and bring your words to life with AI-powered voices.

    Try it now

    Original source Report a problem
