AI Voice and Speech Release Notes

Release notes for AI voice synthesis, text-to-speech and audio generation tools

Latest AI Voice and Speech Updates

Jul 15, 2026
Date parsed from source:
Jul 15, 2026

First seen by Releasebot:
Jul 17, 2026
Letterly

July 15, 2026

Letterly adds Dictionary and Text Replacements on Windows, plus smarter note search and faster, longer Dictation. The update helps users spell tricky terms correctly, turn spoken phrases into custom text, and rely on a smoother dictation experience.

Dictionary, Text Replacements, and more

2 new features · 3 improvements

We added Dictionary and Text Replacements on Windows, and made several improvements to Dictation to help Letterly feel faster, smarter, and easier to rely on.

Dictionary

Add names, brands, technical terms, and other words you want Letterly to always spell correctly. You can also add words to your Dictionary while using Dictation: select a word and press the shortcut to save it to your personal Dictionary.

Text Replacements

Turn spoken phrases into custom text you use often — like addresses, links, phone numbers, signatures, and more. For example:

Say “my signature” → get “Best regards, Adam”

Also in this update

🔍

Smarter note search

Type any part of a word, and Letterly will still find matching notes.

⏱️

Longer Dictation for Pro

Pro users can now dictate for up to 15 minutes in one session.

Faster Dictation start

Dictation now starts as soon as the microphone is ready, without the extra delay.
Original source
Jul 15, 2026
Date parsed from source:
Jul 15, 2026

First seen by Releasebot:
Jul 16, 2026
Resemble

The Watermark Can't Be Optional Anymore. So We Made It the Default

Resemble ships PerTh Multimodal, upgrading its audio watermarker into a single API for audio, video, image, and text with stronger robustness and near 99.9% detection accuracy. It also reports major detection gains and releases four new free research models.

Last quarter I wrote that I'd started thinking about how we fight deepfakes differently, and I want to pick that thread back up, because it's the whole reason for what we shipped in Q2.

For years, the job in this space has been detection: something fake gets out and you work to catch it. That work is essential, and it isn't going anywhere, but it's not enough on its own.

That's why we've long treated this as a provenance problem too. Provenance just means proof of origin: marking what's real at the moment it's made, so there's something to check against later instead of guessing after the fact. It's the idea behind PerTh, our audio watermarker, which we released in 2023.

This quarter, we had a hard reason to push that thinking further. Under the EU AI Act, starting August 2, any AI system placed on the market has to mark its output in a machine-readable format, from the very first file it generates, with no grace period for new systems. The fines for getting it wrong reach 15 million euros or 3% of global revenue.

The question we spent Q2 on was simple: how do you make sure everything that leaves your systems carries its own proof, in a way that survives the real world?

The watermarker we open-sourced in 2023 just got a major upgrade

Watermarking has been part of how we work for years. PerTh has been available since 2023, when we open-sourced it as an audio watermarker. The community ran with it: hundreds of GitHub stars, enterprises running it in production, developers building on top of it.

On June 24 we shipped the rebuild. PerTh Multimodal takes that same idea and extends it well past audio, marking audio, video, image, and text through a single API. It's also far more robust. The original had four known weak spots, everyday edits like pitch shifting and filtering that could wear the mark down, and we rebuilt the training approach to close them. The mark now holds up through compression, re-encoding, and the ordinary processing content goes through once it leaves your hands.

A watermark woven into the content survives the trip through a social platform, even after the file-level record on the outside gets stripped on upload. We hear the fair objection often, that watermarks are easy to strip, so why bother. RAND, the nonprofit research institute and think tank, argued in 2025 that provenance schemes leaning on the whole internet to cooperate won't hold up on their own. The answer is a mark that stays put, one we can still detect with close to 99.9% accuracy when we go looking for it.

The trap most teams don't see coming

Here's the part most teams adding watermarking miss, and it's really a detection problem.

Say you mark all synthetic speech but leave real human speech unmarked. A detector learning from that mix can pick up the wrong signal and decide the watermark itself is what "fake" looks like. One of our researchers, Nicolas Müller, showed exactly this in a June 2026 paper with a German research institute. A detector that falls for it gets worse at spotting content it hasn't seen, lets a fake through the moment someone strips the mark, and can even flag a real person's voice as fake.

You only catch a trap like that if you build both the watermark and the detector. The team that designs the mark is the same team that trains the detector, so we knew to train it on marked real and marked fake audio together, forcing it to learn the real signs of manipulation instead of leaning on the mark. That is the advantage of keeping both under one roof.
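
To make the shortcut concrete, here is a toy sketch (not Resemble's training code) that measures how well the rule "marked means fake" fits two hypothetical datasets. The `Clip` type, the dataset sizes, and the function name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clip:
    is_fake: bool    # ground-truth label
    is_marked: bool  # whether a watermark was embedded

def shortcut_accuracy(clips):
    """Accuracy of the shortcut rule "marked means fake" on a dataset."""
    hits = sum(1 for c in clips if c.is_marked == c.is_fake)
    return hits / len(clips)

# Naive setup: only synthetic speech is marked, real speech is left unmarked.
naive = [Clip(is_fake=True, is_marked=True)] * 50 + [Clip(is_fake=False, is_marked=False)] * 50

# Setup described above: real and fake clips are both marked.
both_marked = [Clip(is_fake=True, is_marked=True)] * 50 + [Clip(is_fake=False, is_marked=True)] * 50

print(shortcut_accuracy(naive))        # 1.0: the mark alone separates the classes
print(shortcut_accuracy(both_marked))  # 0.5: the mark carries no label information
```

When the mark says nothing about the label, a detector has to find other evidence of manipulation, which is the point of training on marked real and marked fake audio together.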

One mark isn't enough for the new rule

The August deadline won't be satisfied by a single mark. The EU's guidance, finalized in June, calls for layers working together: the file-level record, an invisible watermark inside the content, and a log to fall back on. Each layer covers for the others when they fail.

PerTh Multimodal is built to be the durable middle layer. It pairs with C2PA, the industry-standard format for the file-level record, so the two reinforce each other, and it handles all four formats from one model instead of a separate tool for each.

If you're mapping your own content before August, the question is narrow. Once your media leaves your systems and hits the open web, is there still something inside it that proves where it came from? If not, that's the gap to close now.

On the detection side our team was very busy

So that's what we rebuilt on the watermarking side. The other half of the story is detection. Our detection model reached 99% accuracy against NVIDIA's Magpie speech model after another round of training, and on an independent audio deepfake detection benchmark with Podonos, our model ranked first against 8 systems, including Aurigin.ai, Hive, and Reality Defender.

We also keep researching how synthetic voice is built, because staying the clear leader in audio deepfake detection means knowing the newest generation methods before they show up in the wild. That research is open, and it produced four new models this quarter, all free to use: DramaBox, Chatterbox Nano, Chatterbox Flash, and Chatterbox Multilingual.

Finally, outside of improving our accuracy, our team was invited to speak at various events this past quarter. Zohaib joined a panel at AI Insiders in front of CISOs, security founders, and investors, and by his account every panel and every investor thesis kept circling back to detection. He also spoke at Deutsche Telekom's leadership summit in Berlin as one of their T-Challenge winners, where a customer's own leadership spent the session on the problem we build for: scaling AI without scaling the risk that rides along with it.

And Will Krispin, our Head of Partnerships, joined an Okta event on operational risk for federal agencies, where they spoke about how voice is collapsing as an identity signal and a ten-second clip is now enough to beat call-center verification or impersonate a senior official.

What's next

With a busy quarter behind us, we are still full steam ahead. The thing I'm most excited about is what we are shipping next week. Our research team is deep into final testing of our new detection model, and the early numbers show accuracy in the high nineties against unseen generators.

If you’d like to learn more about anything I wrote about, you can book a demo with our team to see detection and watermarking running on your own media, or read our latest Deepfake 101 guide, a plain-language, shareable resource on deepfake threat vectors and how to protect against them.
Original source
Jul 13, 2026
Date parsed from source:
Jul 13, 2026

First seen by Releasebot:
Jul 15, 2026
Eleven Labs

July 13, 2026

Eleven Labs adds sentiment analysis, auto-translated transcripts, nested agent transfers, workspace service accounts and member listing, plus MCP environment scoping, SMS routing controls and new ping monitoring callbacks.

ElevenAgents

Per-agent sentiment analysis: Platform settings add optional sentiment_analysis (SentimentAnalysisSettings) for post-call sentiment scoring configuration per agent.

Auto-translate transcripts: Platform settings add optional auto_translate_transcript_to_app_language (boolean) to translate conversation transcripts to the viewer's app language in history.

Nested agent transfers: Workflows add push, pop and replace operations for nested agent transfers, with enable_nesting and return_when_nested controls for returning to a parent workflow.

Backchannel detection: Transcript user-turn models add optional ignored_as_backchannel (boolean) to mark utterances filtered as backchannels.

Auxiliary conversation audio: Get conversation responses add required has_auxiliary_audio (boolean) alongside existing audio flags.

MCP environment scoping: MCP server and tool routes add optional environment query parameter (defaults to production) on list tools, get tool, create tool config and update tool config endpoints.

Tool call sound override: Tool and MCP override schemas allow tool_call_sound: "off" to silence per-tool call sounds while inheriting server defaults when unset.

Twilio SMS routing: Twilio phone-number import requests add optional enable_sms (boolean, default on) to control inbound SMS routing during number setup.

User sentiment sorting: List users adds optional sort_direction and supports sorting by average_sentiment_score.

Knowledge base external sync: External sync schemas rename ExternalSyncType to ExternalSyncProvider, add sync job trigger and type enums with KbExternalSyncJob, and folder responses can include active_sync_job for in-flight or terminal sync status.

TTS latency setting deprecation: optimize_streaming_latency on agent TTS settings is marked deprecated and documented as a no-op.

Workspaces

Create service accounts: Added Create service account (POST /v1/service-accounts) with optional default_sharing_groups (DefaultSharingGroupConfig) for programmatic service-account provisioning.

List workspace members: Added Get workspace members (GET /v1/workspace/members) returning WorkspaceMemberResponseModel entries with user id, email, seat status, owner and locked flags. Service accounts are excluded.

Invite credit caps: Workspace invite payloads (single and bulk) add optional usage_limit for a monthly credit cap on invitees.
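
As a rough sketch of calling the new member-listing route, the snippet below builds the request without sending it. The path comes from the note above. The base URL `https://api.elevenlabs.io` and the `xi-api-key` header are assumptions based on ElevenLabs' public API conventions and should be checked against the current reference. The helper name and placeholder key are hypothetical.

```python
import urllib.request

BASE_URL = "https://api.elevenlabs.io"  # assumed base URL; confirm in the API reference

def build_members_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /v1/workspace/members."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/workspace/members",
        headers={"xi-api-key": api_key},  # assumed auth header
        method="GET",
    )

req = build_members_request("YOUR_API_KEY")  # placeholder key
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call. Per the note above, the response lists workspace members and excludes service accounts.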

SDK Releases

JavaScript SDK

v2.57.0 - Regenerated from the latest OpenAPI schema with run_subagent system tool types, per-agent sentiment analysis settings, nested agent transfer workflow models, knowledge base external sync types, and MCP environment query parameters.

v2.58.0 - Regenerated from the latest OpenAPI schema with service account creation, workspace member listing, invite usage_limit, conversation has_auxiliary_audio, and multi-context text-to-dialogue WebSocket message types.

Python SDK

v2.57.0 - Regenerated from the latest OpenAPI schema with run_subagent system tool models, per-agent sentiment analysis settings, nested agent transfer workflow models, knowledge base external sync types, and MCP environment query parameters.

v2.58.0 - Regenerated from the latest OpenAPI schema with service account creation, workspace member listing, invite usage_limit, conversation has_auxiliary_audio, and multi-context text-to-dialogue WebSocket message types.

Packages

@elevenlabs/[email protected] - Added onPing callback exposing ping events and estimated ping_ms for connection latency monitoring.

@elevenlabs/[email protected] - Exposed the onPing callback from @elevenlabs/[email protected] in React hooks.

@elevenlabs/[email protected], @elevenlabs/[email protected] and @elevenlabs/[email protected] - Dependency alignment to @elevenlabs/[email protected].

API
Original source
Jul 9, 2026
Date parsed from source:
Jul 9, 2026

First seen by Releasebot:
Jul 17, 2026
Letterly

July 9, 2026

Letterly adds Dictionary and new Keyboard languages, making typing more familiar and accurate. The update also expands the interface to Dutch and Japanese on iPhone and Mac and brings smarter note search that finds notes by any part of a word.

Dictionary and Keyboard Update

2 new features · 2 new languages · 1 improvement

Two useful updates to help Letterly understand you better and make Letterly Keyboard feel more familiar and convenient.

Dictionary

Add names, brands, terms, and other words you want Letterly to always spell correctly.

New Keyboard languages

Letterly Keyboard now has layouts for English, German, French, Spanish and Russian. We want Letterly Keyboard to feel as smooth and familiar as your usual keyboard. If there's anything we could improve, let us know at [email protected].

Also in this update

New languages

The Letterly interface is now available in Dutch and Japanese on iPhone and Mac.

Smarter note search

You can now type any part of a word, and Letterly will still find the matching notes.
Original source
Jul 9, 2026
Date parsed from source:
Jul 9, 2026

First seen by Releasebot:
Jul 15, 2026
Wispr Flow

Reliability and accuracy: where things stand

Wispr Flow improves dictation reliability with 99.9% uptime, 30% lower latency, and a fix for overly aggressive Auto Cleanup accuracy issues. It also lets users move the Flow Bar to the left or right edge and adds a Canadian English nudge with one-tap dictionary switching.

Reliability and accuracy are the core of what Flow does. Here's where things stand:

Uptime: Dictation stayed up and running 99.9% of the time over the past few weeks.

Speed: Dictation latency is down 30% since the start of the year, and it's still coming down.

Accuracy: We tracked the biggest driver of recent accuracy issues to an Auto Cleanup setting that was too aggressive for some users. That's fixed now.

Put the Flow Bar wherever works best for you

The Flow Bar used to sit fixed at the bottom of your screen, which meant it could land right on top of something you needed next, like the send button in Gmail. Now you can drag it to the left or right edge instead, so it's out of the way of whatever you're actually doing. On Mac, that also means it's no longer sitting on top of your dock.

It remembers your position too, so it stays put instead of resetting every time you reopen Flow.

Just click and drag the Flow Bar to reposition it.

Canadian English

English-speaking users in Canada now get a friendly, one-time nudge to switch to Canadian English. A single tap swaps in the Canadian dictionary so spellings like "colour" and "centre" come out just right.

We shipped this on Canada Day. Seemed like the right day for it.

Find it: Go to Settings > General > Dictation Languages.
Original source
Jul 9, 2026
Date parsed from source:
Jul 9, 2026

First seen by Releasebot:
Jul 14, 2026
Speechify

API: list available TTS models with GET /v1/audio/models

Speechify adds a GET /v1/audio/models endpoint for listing available text-to-speech models at runtime, including model IDs, names, descriptions, default and recommended flags, and supported languages for easier model picker setup.

GET /v1/audio/models returns the text-to-speech models you can pass as the model parameter, so you can populate a model picker at runtime instead of hardcoding the list.

Each entry carries the model id, a human-readable name and description, a default flag (the model used when a request omits model), a recommended flag (the model we suggest for new integrations - distinct from the default, which stays stable for backwards compatibility), and the languages it can synthesize (BCP-47 locale strings). The catalog is returned in a single response and is not paginated. These values reflect current support and can change over time, so read them at runtime rather than caching them.

Models

{ "id": "simba-english", "name": "Simba English", "default": true, "recommended": false, "description": "English-only synthesis; the model used when a request omits model.", "languages": ["en"] }

{ "id": "simba-multilingual", "name": "Simba Multilingual", "default": false, "recommended": false, "description": "Synthesis across 30+ languages, including mixed-language input.", "languages": ["en", "fr-FR", "de-DE", "es-MX", "…"] }

{ "id": "simba-3.0", "name": "Simba 3.0", "default": false, "recommended": false, "description": "Earlier streaming-native model, English only. Superseded by simba-3.2.", "languages": ["en"] }

{ "id": "simba-3.2", "name": "Simba 3.2", "default": false, "recommended": true, "description": "Streaming-native model with the lowest time-to-first-byte and richest expressivity, English only today.", "languages": ["en"] }
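
A minimal sketch of how a client might choose a model from this catalog at runtime. The entries are copied from the listing above; the helper names, the preference order (recommended, then default), and the rule that a bare `"en"` entry also covers locales such as `"en-US"` are assumptions for illustration, not documented Speechify behavior.

```python
# Catalog entries copied from the listing above; in practice this list would be
# the parsed JSON body of GET /v1/audio/models.
CATALOG = [
    {"id": "simba-english", "default": True, "recommended": False, "languages": ["en"]},
    {"id": "simba-multilingual", "default": False, "recommended": False,
     "languages": ["en", "fr-FR", "de-DE", "es-MX"]},  # the source list is truncated
    {"id": "simba-3.0", "default": False, "recommended": False, "languages": ["en"]},
    {"id": "simba-3.2", "default": False, "recommended": True, "languages": ["en"]},
]

def supports(model, locale):
    """True if the model lists the locale or its base language (assumed matching rule)."""
    base = locale.split("-")[0]
    return any(lang == locale or lang == base for lang in model["languages"])

def pick_model(catalog, locale):
    """Prefer the recommended model, then the default, among models that support the locale."""
    candidates = [m for m in catalog if supports(m, locale)]
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda m: (not m["recommended"], not m["default"]))
    return ranked[0]["id"]

print(pick_model(CATALOG, "en-US"))  # simba-3.2
print(pick_model(CATALOG, "fr-FR"))  # simba-multilingual
```

Because the flags and language lists can change over time, a picker like this would work from a fresh response rather than from a hardcoded copy like `CATALOG`.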

Original source
Jul 9, 2026
Date parsed from source:
Jul 9, 2026

First seen by Releasebot:
Jul 13, 2026
Cartesia

Introducing Ink-2: The #1-ranked STT built for voice agents

Cartesia releases Ink-2, a real-time speech-to-text model for voice agents with leading streaming accuracy, built-in semantic turn detection, and 0.1s transcript latency. It is live via API and on play.cartesia.ai, with English support now and multilingual support on the way.

We’re excited to release Ink-2: a speech-to-text model built for real-time voice agents.

It’s ranked #1 on Artificial Analysis’s streaming leaderboard for lowest word error rate†, with the most accurate built-in turn detection of any provider, so the model knows precisely when to listen and when to respond.

For voice agents, speech-to-text has to get three things right: accuracy, turn detection, and latency. If any one of these falls short, the experience breaks down. The agent may misunderstand the user, interrupt at the wrong time, respond too slowly, or make the conversation feel unnatural. Ink-2 was built to lead on all three.

Accuracy: Getting every word right

We’ve done extensive work on structured entity recognition like phone numbers, email addresses, alphanumerics, and dates. Ink-2 understands when it’s mid-entity and waits for the full sequence before committing, with no special prompting needed.

We built Ink-2 to be robust across a range of accents, which reflects real voice agent calls. On AppTek, a multi-accent benchmark spanning 14 English accents on real call-center dialogue, Ink-2 is the strongest streaming STT provider at 8% WER, vs. 10% for Deepgram Flux and 12% for ElevenLabs Scribe v2.

Accuracy also holds under production conditions, not just clean reference audio. Our internal benchmark samples audio directly from live voice agent calls covering non-native English speakers, background noise, and degraded audio from poor network conditions. Ink-2 achieves 6.5% WER, compared to 9.2% for ElevenLabs Scribe v2 and 9.4% for Deepgram Flux.

Both structured entities and real-world production audio are where accuracy actually gets tested in a live voice agent, not just in a clean benchmark.

Turn detection: Knowing when to listen and when to respond

Accuracy shows up clearly in benchmarks, but turn detection is where most voice agents fall apart in production.

Most voice agents decide a turn is over based on silence. If someone pauses long enough, the turn abruptly ends. This works in a test environment, but on a real call it means cutting a customer off mid-address, or jumping in right after “and my email is…”

We created Ink-2 with built-in semantic endpointing: the model reads meaning, not silence, to decide when a turn is over. It knows when an incomplete address is still being given or when a trailing thought isn’t a stopping point. The turn stays open until the model is confident the speaker is done.

Ink-2 emits three events natively, with no external VAD needed:

turn.start — the user has begun speaking

turn.eager_end — the model predicts the turn is wrapping up; your LLM can start generating early

turn.end — turn confirmed complete
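
One way an agent loop might react to these three events is sketched below. The event names come from the list above. The `{"type": ...}` event shape, the `TurnHandler` class, and the choice to discard an early draft when a new `turn.start` arrives before `turn.end` are assumptions for illustration, not Cartesia's documented API.

```python
class TurnHandler:
    """Tracks turn state and records what the agent would do for each event."""

    def __init__(self):
        self.state = "idle"
        self.actions = []

    def handle(self, event):
        kind = event["type"]  # assumed field name
        if kind == "turn.start":
            if self.state == "drafting":
                # Assumption: the user resumed speaking, so drop the early draft.
                self.actions.append("cancel_draft")
            self.state = "listening"
        elif kind == "turn.eager_end":
            # Start LLM generation early, but do not speak yet.
            self.state = "drafting"
            self.actions.append("start_draft")
        elif kind == "turn.end":
            if self.state != "drafting":
                # No eager signal arrived, so generation starts here.
                self.actions.append("start_draft")
            self.state = "idle"
            self.actions.append("speak")
        return self.state

handler = TurnHandler()
for name in ["turn.start", "turn.eager_end", "turn.end"]:
    handler.handle({"type": name})
print(handler.actions)  # ['start_draft', 'speak']
```

Keeping speech output tied to `turn.end` means the early draft only saves time and never causes an interruption.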

We measure turn detection against a human-labeled reference. Precision is how often the model is right when it calls a turn over (low precision means cutting people off), and recall is the share of real end-of-turns it catches (low recall means awkward dead air while the model keeps listening). F1 balances the two into a single score.

Ink-2 holds both precision and recall high at once, while the alternatives each trade one off for the other.
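
For reference, these metrics follow the standard definitions, sketched here with hypothetical counts. The function name and the numbers are illustrative and are not Ink-2 results.

```python
def turn_detection_scores(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 from end-of-turn counts.

    true_pos:  predicted turn ends that match a human-labeled end
    false_pos: predicted turn ends with no labeled end (cut-offs)
    false_neg: labeled ends the model missed (dead air)
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not Ink-2 results.
p, r, f1 = turn_detection_scores(true_pos=90, false_pos=10, false_neg=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```

F1 is the harmonic mean of the two, so a model that trades recall for precision (or the reverse) scores lower than one that keeps both high.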

Latency: Keeping the conversation flowing

Similar to TTS latency, transcription latency has a direct impact on how natural a conversation feels. The metric we care about is Time-to-Final-Transcript (TTFT): how long it takes to get a final transcript from the moment the user finishes speaking. This is what determines whether your agent feels like it’s paying attention or like it’s buffering.

Ink-2’s latency is lightning fast, with a TTFT of 0.1s. And because turn.eager_end lets your LLM get a head start before the turn is fully confirmed, your agent responds fast enough to feel like it’s actually in the conversation.

Join the teams powered by Ink-2

Ink-2 is live at play.cartesia.ai, available directly via API and across the platforms voice teams build on, including LiveKit, Vapi, and Pipecat.

We’re excited to keep building alongside the best teams pushing voice AI forward and running Ink-2 in production. We’ve already pushed out a world-class English model, and multilingual support is on the way, so Ink-2 fits the way your users speak, wherever they are.

Try Ink-2

Test Ink-2 out for yourself today
Original source
Jul 8, 2026
Date parsed from source:
Jul 8, 2026

First seen by Releasebot:
Jul 14, 2026
Speechify

API: New simba-3.2 streaming model (recommended)

Speechify adds the simba-3.2 streaming TTS model on its audio speech and stream APIs, with lower TTFB, richer expressivity, and a curated English voice allow-list for new integrations.

simba-3.2 is now available on POST /v1/audio/speech and POST /v1/audio/stream via the model field. It is the go-forward Simba 3 model — streaming-native, with lower TTFB and richer expressivity than simba-3.0. We recommend simba-3.2 for new English integrations.

simba-3.2 serves from a curated voice allow-list. Pass one of its registered voice IDs as voice_id: beatrice_32, dominic_32, edmund_32, geffen_32, harper_32, hugh_32, imogen_32, wyatt_32.

Example JSON:

{ "input": "Hello, world!", "voice_id": "geffen_32", "model": "simba-3.2" }

Currently English only — multilingual coming soon

simba-3.2 currently supports English voices only. Requests with a non-English voice return 400 with the rejected locale called out in the message. Multilingual support is coming soon; the model name stays simba-3.2 across that change, so no migration is required.

For non-English voices today, continue to use simba-multilingual. For cloned or personal voices, use simba-english — those are not registered under the Simba 3 voice allow-list.
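
The routing guidance above can be sketched as a small helper that picks the model and builds the request body. The voice IDs and field names come from the notes above. The `is_english` and `is_cloned` flags, the helper names, and the fallback to `simba-english` for English voices outside the allow-list are assumptions for illustration.

```python
import json

# Voice IDs registered for simba-3.2, as listed above.
SIMBA_32_VOICES = {
    "beatrice_32", "dominic_32", "edmund_32", "geffen_32",
    "harper_32", "hugh_32", "imogen_32", "wyatt_32",
}

def choose_model(voice_id, is_english=True, is_cloned=False):
    """Map a voice to a model following the guidance above."""
    if is_cloned:
        return "simba-english"       # cloned or personal voices
    if not is_english:
        return "simba-multilingual"  # non-English voices today
    if voice_id in SIMBA_32_VOICES:
        return "simba-3.2"           # curated allow-list
    return "simba-english"           # assumption: fall back to the default model

def build_speech_body(text, voice_id, **voice_flags):
    """Build the JSON body for POST /v1/audio/speech or POST /v1/audio/stream."""
    model = choose_model(voice_id, **voice_flags)
    return json.dumps({"input": text, "voice_id": voice_id, "model": model})

print(build_speech_body("Hello, world!", "geffen_32"))
```

Since the model name is expected to stay `simba-3.2` once multilingual support lands, only the `is_english` branch of a helper like this would need revisiting.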
Original source
Jul 7, 2026
Date parsed from source:
Jul 7, 2026

First seen by Releasebot:
Jul 17, 2026
Letterly

July 7, 2026

Letterly adds Dictionary, Text Replacements, and smarter note search to help capture names, custom phrases, and notes more accurately.

Dictionary, Text Replacements, and More

2 new features · 1 improvement

This update helps Letterly better understand the way you speak and write — from names and custom phrases to smarter note search.

Dictionary

Add names, brands, technical terms, and other words you want Letterly to always spell correctly.

Text Replacements

Turn spoken phrases into any custom text you choose. For example:

Say “my signature” → get “Best regards, Adam”

Say a short phrase to insert an address, link, phone number, or anything else you use often

Also in this update

Smarter note search

Type any part of a word, and Letterly will still find matching notes.
Original source
