Q1 2026 in AI voice: the category bifurcates
The headline number for the quarter: ElevenLabs added over $100M in net new ARR in Q1 2026, raised $500M at an $11B valuation in early February, and rebranded its platform into three named products on February 9. In the same window, Inworld launched TTS 1.5 at half a cent per minute, Hume open-sourced a TTS architecture with zero measured hallucinations, and Resemble shipped almost nothing on the generation side at all.
These are not adjacent stories. They are evidence that "AI voice" has stopped being one category.
ElevenLabs and Cartesia climbed the stack
ElevenLabs renamed Conversational AI as ElevenAgents on February 9, and almost every weekly release in the quarter was an agent feature, not a voice feature: branching and version control for agents (Jan 26), MCP tool support and content guardrails (Feb 16), OAuth for MCP servers and prompt-injection guardrails (Feb 23), Claude Sonnet 4.6 and Gemini 3.1 added as supplemental LLMs (Mar 9), and workspace seat types (Mar 25). Voice quality work was a footnote: Eleven v3 came out of alpha on Feb 2, and the model layer went quiet after that.
Cartesia is running the same play with Line. Q1 added agent History Management, custom user events, uninterruptible messages, multilingual agent configuration, and a Text-to-Agent deprecation that signals where the team thinks Line is going.
When the category leaders stop shipping voice quality and start shipping agent infra, the binding constraint has moved from how the voice sounds to what the agent can do.
Inworld, Hume, and Fish Audio raced the other direction
Inworld TTS 1.5 shipped January 21 at $0.005 per minute for Mini and $0.01 per minute for Max, with the team explicitly claiming 25x lower cost than alternatives. Sub-130ms time-to-first-audio on Mini. Top of the Artificial Analysis TTS leaderboard. The pricing reads like a position rather than a discount: Inworld is betting that infrastructure-grade voice should cost closer to bandwidth than to a SaaS seat.
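To make the pricing concrete, here is a minimal sketch of the arithmetic, using only the two per-minute prices quoted above. The function names and the 100,000-minute workload are illustrative assumptions, and the "implied alternative" figure simply multiplies a tier's price by the claimed 25x; Inworld's post may be comparing against a different baseline or tier.

```python
# Per-minute list prices quoted above for Inworld TTS 1.5, in USD.
PRICE_PER_MINUTE = {"mini": 0.005, "max": 0.01}


def monthly_cost(minutes: float, tier: str) -> float:
    """Cost in USD of synthesizing `minutes` of audio on the given tier."""
    return minutes * PRICE_PER_MINUTE[tier]


def implied_alternative_price(tier: str, multiple: float = 25.0) -> float:
    """Per-minute price implied for an alternative if `tier` is `multiple`x cheaper.

    Assumption: the 25x claim is applied directly to the tier's list price.
    """
    return PRICE_PER_MINUTE[tier] * multiple


if __name__ == "__main__":
    minutes = 100_000  # hypothetical monthly workload
    for tier in PRICE_PER_MINUTE:
        cost = monthly_cost(minutes, tier)
        alt = implied_alternative_price(tier)
        print(f"{tier}: ${cost:,.2f} for {minutes:,} min; implied alternative ${alt:.3f}/min")
```

At these prices a six-figure monthly minute count stays in the hundreds to low thousands of dollars, which is the sense in which the pricing looks closer to bandwidth than to a SaaS seat.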
Fish Audio shipped S2 on March 10, open-source on a Qwen3-4B backbone, with bracket-syntax emotion control and 80+ languages. Hume open-sourced TADA the same day: 1B and 3B Llama-based models, MIT-licensed, with zero hallucinations measured across 1,000+ LibriTTS-R test samples. Note how Hume frames itself in its announcement: as voice AI research infrastructure for frontier labs. That is a deliberate step away from competing with ElevenLabs head-on.
Three open-source TTS releases of consequence inside one quarter, two of them on the same day. The "open vs. closed" split that did not matter in voice last year is starting to matter now.
Resemble bet the other way
Resemble's Q1 recap post is striking for what is missing. Five updates to the detection stack. Watermarking extended from audio to images and video. Zero Retention Mode for government and enterprise. A free deepfake-detection bot on X. The only generation update mentioned is custom vocabulary, and even that gets framed as a compliance feature for healthcare and legal.
While the rest of the category races on voice quality or agent platforms, Resemble is shipping the trust layer. The bet is that as generation commoditizes, provenance and detection become the durable business.
The other notable absence in Q1 was Murf, which shipped just two API changelog entries (a model query parameter on Jan 7 and a multiNativeLocale deprecation on Mar 5). Speechify shipped roughly twelve consumer-facing announcements, but its model release (SIMBA 3.0 on Feb 19) was aimed straight at developer API workloads. Speechify's press release names ElevenLabs, Cartesia, and Deepgram as competitors, which shows which category Speechify now thinks it belongs to.
If you were watching one company as a proxy for AI voice as a whole, you were watching the wrong company. The split is the story.
Track the quarterly cadence at Releasebot: ElevenLabs, Cartesia, Inworld, and Resemble.