Inworld Products
All Inworld Release Notes
- Nov 20, 2025
- Parsed from source:Nov 20, 2025
- Detected by Releasebot:Dec 23, 2025
Node.js Runtime v0.8.0
Enhanced performance, execution control, and component access for custom nodes.
- 2x faster performance with optimized addon architecture
- Cancel running executions with abort() on GraphOutputStream
- Call LLMs from custom nodes via getLLMInterface() and getEmbedderInterface()
- Build stateful graph loops with DataStreamWithMetadata
Breaking changes
graph.start() is now async, and stopInworldRuntime() is required.
See the Migration Guide for upgrading from v0.6.
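For orientation, here is a minimal sketch of the new lifecycle, assuming a graph built with the Runtime SDK's GraphBuilder; the import path and the way streamed results are consumed are assumptions, so treat the Migration Guide as the source of truth.

```typescript
// A minimal sketch of the v0.8.0 lifecycle, assuming `graph` was built with the
// Runtime SDK. The import path, the shape of the output stream, and how results
// are consumed are assumptions - see the Migration Guide for the real API.
import { stopInworldRuntime } from '@inworld/runtime'; // assumed import path

async function runOnce(graph: any, input: string): Promise<void> {
  // graph.start() is async as of v0.8.0.
  const outputStream = await graph.start(input);

  // Cancel the execution if it runs too long (abort() lives on GraphOutputStream).
  const watchdog = setTimeout(() => outputStream.abort(), 10_000);

  // Consume streamed results (assumed async-iterable; adapt to the SDK docs).
  for await (const result of outputStream) {
    console.log(result);
  }
  clearTimeout(watchdog);

  // Required in v0.8.0: shut the runtime down explicitly when you're done.
  await stopInworldRuntime();
}
```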
- Nov 6, 2025
- Parsed from source:Nov 6, 2025
- Detected by Releasebot:Dec 23, 2025
Introducing Timestamp Alignment, WebSockets and More for Inworld TTS
Inworld TTS debuts a major release with speed boosts, multilingual expansion to Russian, API voice cloning, custom voice tags and pronunciation controls, plus WebSocket streaming for low latency and precise timestamping for lipsync.
Performance improvements - now #1 on Artificial Analysis TTS Leaderboard
Speed and quality are critical for real-time voice. Inworld TTS is now faster, smoother, and more natural across production workloads. Inworld TTS 1 Max just ranked #1 on the Artificial Analysis Text to Speech Leaderboard, which benchmarks the leading TTS models on realism and performance.
Quality improvements
New TTS models deliver clearer, more consistent, and more human-like speech.
- Clearer articulation: Lower word error rate (WER) and better intelligibility on long or complex sentences.
- Improved voice cloning: Higher speaker-similarity scores; voices retain tone, pacing, and emotion even across languages.
- More accurate multilingual output: Fewer accent mismatches and more natural pronunciation across supported languages.
Latency improvements
We’ve reduced latency across multiple layers of our stack:
- Infrastructure migration: New server placements cut internal round-trip time by ~50 ms, especially benefiting users in the US and Europe.
- Optional text normalization: Disable text normalization in the API to save 30–40 ms for English (up to 300 ms on complex text) and up to 1 sec in other languages.
- WebSocket streaming: Persistent connections reduce handshakes, enabling faster starts and smoother real-time dialogue.
- Faster inference: Inworld TTS Max now runs on an optimized hardware stack, enabling responses that are ~15% faster.
WebSocket support
For real-time conversational applications, our new WebSocket API offers persistent connections with comprehensive streaming controls.
HTTP requests work fine for simple TTS, but they add overhead when you're building voice agents, interactive characters, or phone call agents, as each request requires connection setup.
WebSockets keep a persistent connection open. You can stream text as it arrives from your LLM, maintain conversation context, and handle interruptions gracefully.
Three ways WebSockets give you more control:
- Context management: Run multiple independent audio streams over a single connection. Each context maintains its own voice settings, prosody, and buffer state.
- Smart buffering: Configure when synthesis begins with maxBufferDelayMs and bufferCharThreshold. Start generating audio before complete text arrives, or wait for full sentences.
- Dynamic control: Update voice parameters mid-stream, flush contexts manually, or handle user interruptions without dropping the connection.
Perfect for:
- Interactive voice agents that require low latency
- Dynamic conversations where barge-in or interruption support is needed
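As a rough illustration of that flow, the sketch below opens a context with the buffering controls named above and streams LLM text into it. Only maxBufferDelayMs and bufferCharThreshold come from this post; the endpoint URL, auth scheme, and message shapes are assumptions, so check the WebSocket API reference before relying on them.

```typescript
// Sketch of streaming LLM output into a TTS WebSocket context. The endpoint,
// auth header, and message/field names are illustrative assumptions.
import WebSocket from 'ws';

function playAudioChunk(_audio: Buffer): void {
  // Hand the decoded audio to your own player or game engine here.
}

const ws = new WebSocket('wss://api.inworld.ai/tts/v1/stream', {      // assumed URL
  headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` }, // assumed auth scheme
});

ws.on('open', () => {
  // One context per agent turn; tune the buffering thresholds to trade
  // time-to-first-audio against prosody over longer sentences.
  ws.send(JSON.stringify({
    type: 'createContext',              // assumed message type
    contextId: 'turn-1',
    voiceId: 'Ashley',                  // assumed voice name
    maxBufferDelayMs: 250,
    bufferCharThreshold: 60,
  }));

  // Stream text chunks as they arrive from your LLM, then flush the context.
  for (const chunk of ['Sure, ', 'your order shipped ', 'this morning.']) {
    ws.send(JSON.stringify({ type: 'text', contextId: 'turn-1', text: chunk }));
  }
  ws.send(JSON.stringify({ type: 'flush', contextId: 'turn-1' }));    // assumed message type
});

ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.audio) playAudioChunk(Buffer.from(msg.audio, 'base64'));    // assumed response field
});
```

On interruption, the same pattern extends naturally: close the active context and open a fresh one for the post-interruption response, without dropping the connection.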
Timestamp alignment: Sync audio with visuals & actions
Building lipsync for 3D avatars? Highlighting words as they're spoken? Triggering game play actions at specific moments in speech? Handling barge-in and interruptions? You need timestamps.
Timestamp alignment returns precise timing information that matches your generated audio. Choose the granularity that fits your use case:
Use word-level timestamps for:
- Karaoke-style caption highlighting
- Triggering character actions when specific words play
- Tracking where users interrupt the AI
- Syncing UI elements with speech
Character-level timestamps are most common for lipsync animation, where they can be converted to phonemes and visemes.
Timestamps are currently supported for English in both streaming and non-streaming modes; support for other languages is experimental.
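As a small illustration, the sketch below schedules caption highlights from word-level timestamps; the WordTiming shape is an assumed response format, not the documented one.

```typescript
// Illustrative sketch: highlight each word as it plays, driven by word-level
// timestamps returned with the audio. The WordTiming shape is an assumption;
// check the API reference for the actual response fields.
interface WordTiming {
  word: string;
  startMs: number; // offset from the start of the generated audio
  endMs: number;
}

function scheduleHighlights(words: WordTiming[], playbackStartedAt: number): void {
  for (const w of words) {
    const delay = Math.max(0, playbackStartedAt + w.startMs - Date.now());
    setTimeout(() => {
      console.log(`highlight: ${w.word}`); // swap in your caption or game-event hook
    }, delay);
  }
}

// Usage, once audio playback begins:
// scheduleHighlights(response.timestamps, Date.now());
```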
Voice cloning API for programmatic voice creation
Voice cloning is no longer limited to our UI. Now you can create custom voices directly through the API. Available in beta to select customers.
Why this matters:
If you're building a platform where end users need to clone their own voices, you can now integrate that experience directly into your app, without redirecting users to Inworld's interface. You can also create voices in bulk using a simple script.
Use cases:
- Games where players create their own character voices
- Social platforms where users create their own avatars
- Games or call centers where a large number of voices need to be created in bulk from pre-recorded audio samples
Voice cloning APIs enable third-party platforms to offer voice creation as a native feature in their own workflows or create voices in bulk.
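A bulk-creation script might look roughly like the sketch below; the endpoint path, request fields, and auth header are placeholders (the beta documentation defines the real ones), but it shows the shape of cloning many voices from pre-recorded samples.

```typescript
// Hypothetical bulk-cloning script. The endpoint path, request fields, and auth
// header are placeholders, not documented values; the beta API reference
// defines the real ones. The point is the shape: loop over samples, POST each.
import { readFile } from 'node:fs/promises';

const samples = [
  { name: 'npc_blacksmith', file: 'samples/blacksmith.wav' },
  { name: 'npc_innkeeper', file: 'samples/innkeeper.wav' },
];

for (const sample of samples) {
  const audio = await readFile(sample.file);
  const res = await fetch('https://api.inworld.ai/tts/v1/voices:clone', { // placeholder URL
    method: 'POST',
    headers: {
      Authorization: `Basic ${process.env.INWORLD_API_KEY}`, // assumed auth scheme
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      displayName: sample.name,              // placeholder field names
      audioSample: audio.toString('base64'),
      tags: ['game-npc', 'bulk-import'],     // pairs with the voice tags feature below
    }),
  });
  console.log(sample.name, res.status);
}
```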
Custom voice tags
When creating a custom voice in the UI or API, we now allow users to apply tags to their voices for grouping and filtering.
Why this matters:
You can now easily manage a large database of voices and filter for the appropriate voice at runtime, which is highly valuable in games and related applications, where characters are often generated on the fly.
Use cases:
- Gaming platforms where characters are generated on the fly and need to be matched to an appropriate voice
- Enterprise apps where the optimal voice is chosen at runtime based on the user profile
- Applications that are still in development, where managing and iterating on a large number of voices is an essential workflow in the design process
Voice tags are the first step toward a larger voice library and management system.
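Runtime selection can then be a simple filter over tags, as in the sketch below; the Voice shape is an assumption, but the pattern is just fetch the voice list once, then match a character's required tags.

```typescript
// Illustrative runtime voice selection by tag. The Voice shape is an assumed
// simplification of whatever the voice-listing API returns.
interface Voice {
  voiceId: string;
  tags: string[];
}

function pickVoice(voices: Voice[], requiredTags: string[]): Voice | undefined {
  // First voice carrying every tag the generated character needs,
  // e.g. ['villain', 'male', 'en'] for an NPC created on the fly.
  return voices.find((voice) => requiredTags.every((tag) => voice.tags.includes(tag)));
}
```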
Custom pronunciation: Say it your way
Getting AI voices to pronounce words correctly matters. Brand names, character names, technical terms, and regional dialects are often misspoken by standard TTS models because they aren't represented well in the training data.
You can now insert phonetic notation directly into your text for consistent, accurate pronunciation of key words. Not sure what phonemes to use? Ask ChatGPT or your favorite AI assistant for the IPA transcription, or check reference sites like the IPA Pronunciation Guide on Vocabulary.com.
Common use cases:
- Brand names that need to sound perfect every time
- Unique names
- Medical, legal, or technical terminology
- Regional pronunciation variations
- Fictional locations and proper nouns
We support International Phonetic Alphabet (IPA) notation.
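Conceptually, the workflow looks like the sketch below, which wraps known-tricky words in a phoneme annotation before sending text to the API. The <phoneme ipa="..."> wrapper is a placeholder to show the idea, not Inworld's actual markup syntax, which is defined in the TTS documentation.

```typescript
// Illustrative only: the <phoneme ipa="..."> wrapper is a placeholder showing
// the idea of inlining IPA for specific words; Inworld's actual markup syntax
// is defined in the TTS documentation.
const pronunciations: Record<string, string> = {
  Nguyen: 'ŋwɪn',              // names standard TTS often mangles
  Hermione: 'hɜːrˈmaɪ.əni',
};

function annotate(text: string): string {
  return text.replace(/\b([A-Za-z]+)\b/g, (word) =>
    pronunciations[word] ? `<phoneme ipa="${pronunciations[word]}">${word}</phoneme>` : word,
  );
}

console.log(annotate('Dr. Nguyen will see Hermione now.'));
```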
Russian support and multilingual improvements
Inworld TTS now speaks Russian, bringing our total to 12 supported languages: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Dutch, Polish, and Russian.
Clone a voice and label it as Russian, or choose one of our pre-built Russian voices. As with all languages, voices perform best when synthesizing text in their native language, though cross-language synthesis is possible.
We've also made quality improvements across all non-English languages. Better pronunciation accuracy, more natural intonation, and smoother speech patterns.
For multilingual applications, Inworld TTS Max delivers the strongest results with superior pronunciation and more contextually-aware speech across languages.
Try these features today
All features are available now through our API and TTS Playground, at the same accessible pricing.
Get started:
- Try TTS Playground
- Read the docs
- You can also access Inworld voices and text-to-speech models via LiveKit, NLX, Pipecat, and Vapi.
Frequently asked questions
How do I convert timestamps to visemes for lipsync?
The typical pipeline: character timestamps → phonemes (using tools like PocketSphinx) → visemes (using your game engine's mapping). Our timestamps provide the timing foundation.
How do I gracefully handle interruptions with WebSockets?
The WebSocket endpoint supports multiple independent contexts, enabling seamless barge-in handling. When a user interrupts, you can start a new, independent context and send the post-interruption agent response to it. The old context can be closed when the interruption occurs.
What are some techniques to optimize end-to-end latency?
To reduce latency, consider using the TTS streaming API, keeping a persistent WebSocket connection, and disabling text normalization by instructing your LLM to produce speech-ready text via a system prompt.
- Nov 4, 2025
- Parsed from source:Nov 4, 2025
- Detected by Releasebot:Dec 23, 2025
The 3 Engineering Challenges of Realtime Conversational AI
Inworld launches Runtime, a low latency AI backend for real time conversational AI. Build with the SDK, deploy hosted endpoints, and run live A/B experiments with automatic traces to cut development time and improve user experience.
The Vision
Every builder in conversational AI shares a common goal: to create systems that feel natural, responsive, and personalized. But in practice, we spend more time wiring APIs, debugging, and optimizing latency than improving the user experience.
Inworld Reflection
We started at Inworld by building lifelike AI characters that gamers loved—ones that could remember, converse naturally, and feel real.
As our customer base expanded beyond games, they asked for complex customizations—to plug in their own models, connect to proprietary data, define custom emotions, routing, and more.
With each request, our engineering teams spent less time on shipping user features and more time writing integrations and debugging.
This realization led us to a critical analysis of where our development time was truly going. That analysis revealed three recurring engineering pain points in building realtime conversational AI, and we built Inworld Runtime to solve them.
Inworld Runtime
Inworld Runtime is a low-latency AI backend for realtime conversational AI. You build your conversational AI with Inworld Runtime SDK, launch a hosted endpoint using Inworld CLI, and observe and optimize your conversational AI by running A/B experiments in the Inworld Portal.
The 3 Challenges of Realtime Conversational AI
Problem 1: Latency that breaks the realtime feel
Before: High latency under high loads
- Scaling issues: As apps scaled to thousands of users, latency spiked above one second.
- Blocking operations: Many popular programming languages, while excellent for rapid prototyping, have runtime limitations that prevent true parallel execution, leading to blocked operations when we need to run multiple LLM calls, embeddings, and processing tasks concurrently.
With Runtime: True parallel execution at the C++ core
- Parallel execution: Using Runtime, an agent can embed user input, retrieve knowledge, and do a web search all at once, then proceed to the LLM call, dramatically reducing end-to-end latency.
- Pre-optimized backend: The graph executor automatically identifies nodes without dependencies and schedules them in parallel — no manual threading code required.
- Read Streamlabs case study: Built a realtime multimodal streaming assistant with sub-500 millisecond latency
Example Node.js LLM -> TTS pipeline - Low Latency with C++ optimized backend
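For a sense of what that looks like in code, below is a hedged sketch of an LLM -> TTS graph assembled with a fluent GraphBuilder-style API; the import path, builder method names, and node stand-ins are assumptions rather than the documented SDK surface.

```typescript
// Hedged sketch of a Node.js LLM -> TTS pipeline graph. Import path, builder
// method names, and the node stand-ins are assumptions for illustration only.
import { GraphBuilder } from '@inworld/runtime/graphs'; // assumed import path

declare const llmNode: any; // stand-in for the SDK's remote LLM chat node
declare const ttsNode: any; // stand-in for the SDK's remote TTS node

const graph = new GraphBuilder({ id: 'llm-tts-demo', apiKey: process.env.INWORLD_API_KEY! })
  .addNode(llmNode)
  .addNode(ttsNode)
  .addEdge(llmNode, ttsNode)   // LLM text flows straight into TTS
  .setStartNode(llmNode)
  .setEndNode(ttsNode)
  .build();

async function speak(prompt: string): Promise<void> {
  // The executor schedules independent nodes in parallel and streams results
  // as they become available, so audio can start before the LLM finishes.
  const stream = await graph.start(prompt);
  for await (const chunk of stream) {
    console.log(chunk); // chunk carries text deltas or audio frames per node
  }
}
```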
Problem 2: 50% dev time spent in integration and debugging
Before: Repetitive, time-consuming tasks
- Wrote repetitive integration code: For every new feature that required integrating an AI model, we found ourselves writing similar integration code.
- Reconstructed execution paths by hand: When an agent's behavior was incorrect, our primary tool for analysis was traditional logging. We had to sift through disconnected logs from various parts of the codebase to manually reconstruct the sequence of events.
- Coupled orchestration and business logic: The control flow for handling model responses, error retries, and feature-specific logic—like updating how fallback responses were triggered—was deeply embedded within the business logic, making even minor feature updates risky. Bringing new developers up to speed took weeks instead of days.
With Runtime: Less Maintenance, More Iteration
- Build fast with pre-optimized nodes: Developers get a full suite of nodes to construct realtime AI pipelines that can scale to millions of users, including nodes for model I/O (STT, LLM, TTS), data engineering (prompt building, chunking), flow logic (keyword matching, safety), and external tool calls (MCP integrations).
- View end-to-end traces and logs automatically: Instead of reconstructing the execution path manually, developers simply go to Inworld Portal to view the end-to-end trace and logs. Every node execution is automatically instrumented with OpenTelemetry spans capturing the node, inputs, outputs, duration, and success/failure.
- Write modular, easy-to-understand code: Developers define each node’s inputs, outputs, and dependencies in a graph, making the execution path explicit and visible — you can see exactly which nodes connect to which others, making onboarding new team members easy. They can contribute to a single node on day one, then gradually understand the broader graph structure.
- Read Wishroll Status Case Study: Went from prototype to production in 19 days with a 20x cost reduction
Automatic Traces for Each Graph Execution
Problem 3: Slow iteration speed
Before: Customization incurred technical debt
- Bespoke customization: As our customer base grew, so did the need for customization, and the code became brittle and hard to reason about.
- If/else hell: Different clients required slightly different logic, tools, or model choices. In our traditional codebase, this led to a labyrinth of if/else code blocks and feature flags scattered throughout the logic.
With Runtime: Fast user experience iterations
- One-line change for models and prompts: Want to swap an LLM provider or adjust a model parameter? That's a simple configuration change. A/B test variations and deploy customizations without touching production code.
- A/B testing at scale: We define agent behavior declaratively in JSON or through a fluent GraphBuilder API. Different clients get different graph configurations—not different code paths.
Live A/B Test: 50% traffic split to 2 models to observe what your users prefer
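As a purely illustrative shape (the real variant schema is defined by the Graph Registry and the Portal; the field names here are assumptions), a 50/50 model test might be declared along these lines and registered with the inworld graph variant register command covered in the CLI release below.

```typescript
// Illustrative variant declaration for a 50/50 LLM test. Field names
// (trafficShare, llm.provider, llm.model) are assumptions, not the real schema.
const variants = [
  { name: 'baseline',   trafficShare: 0.5, llm: { provider: 'openai',  model: 'gpt-4o-mini' } },
  { name: 'challenger', trafficShare: 0.5, llm: { provider: 'mistral', model: 'mistral-small-latest' } },
];

// Register the variants (e.g. via the CLI's graph variant registration) and the
// runtime splits traffic between configurations without a client-side deploy.
export default variants;
```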
Why We're Sharing Inworld Runtime with You
We built Inworld Runtime to solve our own massive challenges in creating production-grade, scalable realtime conversational AI. But in doing so, we created a solution for a problem every AI developer faces: managing the inherent complexity of the "reason—act" agent cycle.
We believe the future of AI is not just about more powerful models, but better orchestration. It's about giving developers the architectural foundation they need to build robust, maintainable, and observable realtime conversational AI, without reinventing the wheel.
If you're tired of wrestling with tangled logic and want to focus on creating, we invite you to build your next experience on Inworld Runtime. Let us handle the complexity of orchestration, so you can focus on bringing your ideas to life.
Get Started with Inworld Runtime
Inworld Runtime is the best way to build and optimize realtime conversational AI and voice agents.
You can build realtime conversational AI that is fast, easy to debug, and easy to optimize via A/B experiments.
Get started now with Inworld CLI:
- Build a production-ready conversational AI or voice agent
- Deploy it to Inworld Cloud as an endpoint so you can easily integrate it into your app
- Monitor dashboards, traces, and logs in the Inworld Portal
- Improve user experience by running live A/B experiments to identify the best model and prompt settings for your users
Talk to our team
- Oct 22, 2025
- Parsed from source:Oct 22, 2025
- Detected by Releasebot:Dec 23, 2025
Introducing Inworld CLI
Inworld launches the Inworld CLI, a unified toolkit to build, deploy, and optimize realtime conversational AI. Expect faster performance, easier debugging, and live A/B testing from the command line with integrated telemetry and production endpoints.
Challenge
Until now, building realtime conversational AI meant facing:
- Performance Bottlenecks: Unpredictable latency from third-party APIs creates a jarring user experience. This is compounded by core language limitations, like Python's GIL, that block parallel execution and stall critical operations.
- High Development Overhead: Engineering resources are drained by maintenance. Teams spend more time debugging provider failures and integrating a complex patchwork of models than building new features, causing product velocity to stagnate.
- Slow Iteration Speed: Scattered conditional logic for different models and clients makes the entire system fragile. This fragility makes every change high-risk, paralyzing rapid A/B testing and stalling product improvements.
Inworld faced these very challenges as our customer base grew and expanded beyond games into mobile apps, voice agents, AI companions, and more. So we built Inworld Runtime to solve them.
Inworld Runtime
Inworld Runtime is the AI backend for realtime conversational AI. You build your conversational AI with Inworld Runtime SDK, launch a hosted endpoint using Inworld CLI, and observe and optimize your conversational AI by running A/B experiments in the Inworld Portal.
Today, building with Inworld Runtime just became easier with the launch of Inworld CLI.
Inworld CLI
With Inworld CLI, developers can now build realtime conversational AI that is fast, easy to debug, and easy to optimize via A/B experiments.
- Build realtime experiences
- npm install -g @inworld/cli to install the Inworld CLI
- inworld login to log in and generate API keys automatically
- inworld init to initialize conversational AI pipelines such as LLM -> TTS, pre-optimized for latency and flexibility
- inworld run to test locally with instant feedback
- inworld deploy to create persistent, production-ready endpoints
- Monitor with clarity
- Integrated telemetry: Each request is automatically logged in dashboards, traces, and logs in Inworld Portal.
- Optimize continuously
inworld graph variant register to run live A/B tests without client changes
Proven technology
Since launching Inworld Runtime earlier this year, we've seen developers build incredible realtime conversational AI experiences.
- Wishroll went from prototype to 1M users in 19 days with 20x cost reduction.
- Streamlabs built a real-time multimodal streaming assistant with under 500ms latency.
- Bible Chat scaled their AI-native voice features to millions.
Inworld CLI builds on Runtime to help developers build agents more efficiently and reliably.
Get started with Inworld Runtime
Inworld Runtime is the best way to build and optimize realtime conversational AI and voice agents.
Get started now with Inworld CLI:
- Build a production-ready conversational AI or voice agent
- Deploy it to Inworld Cloud as an endpoint so you can easily integrate it into your app
- Monitor dashboards, traces, and logs in the Portal
- Improve user experience by running live A/B experiments to identify the best model and prompt settings for your users
Talk to our team
- Oct 15, 2025
- Parsed from source:Oct 15, 2025
- Detected by Releasebot:Dec 23, 2025
The new AI infrastructure for scaling games, media, and characters
Inworld launches Runtime, a high‑performance AI pipeline that scales voice‑forward, character‑driven experiences to millions. It connects LLM, STT, TTS with remote config, telemetry and multi‑vendor support; Unreal is available now in early access, Unity coming soon.
Built on gaming and media innovation
We began by pushing the frontier of lifelike, interactive characters for games and entertainment, and this remains a core focus area. Today, Inworld powers real‑time, voice‑forward experiences and provides the infrastructure that lets those experiences scale from a prototype to millions of players without sacrificing quality. Partners across the industry, including Xbox, NVIDIA, Ubisoft, Niantic, NBCUniversal, Streamlabs, Unity, and Epic, have built with Inworld to explore new gameplay and audience experiences.
Long before “chat with anything” became a category, we were shipping playable, character‑centric demos and engine integrations that let teams imagine worlds where characters remember, react, and stay in‑world. That early craft in character design is still our foundation, and it is why leading studios and platforms continue to collaborate with us on the next generation of character‑driven interactive media.
From demos to production: Deeper control through a new AI infrastructure
As partners moved from impressive demos to live titles, we hit the same wall every game team hits: keeping voice, timing, and consistency flawless at scale, which is what players actually feel. Text‑only stacks and one‑off integrations were not built for real‑time, multimodal workloads, and stitching providers together left developers without enough control to maintain user‑facing quality as usage spiked or audiences expanded.
That is why we built Runtime: to put developers in control of the entire pipeline, and to make measurement and experimentation first‑class, so quality can be maintained and extended to new geographies and demographics, with personalization where it matters.
What is Inworld Runtime and how does it help you scale?
Inworld Runtime is a high‑performance, C++ graph engine (with SDKs like Node.js and Unreal) that orchestrates LLMs, STT, TTS, memory or knowledge, and tools in a single pipeline. Build a graph in code, ship it, then iterate with remote configuration, A/B variants (via Graph Registry), and built‑in telemetry without redeploying your game. It is the infrastructure we developed to support experiences with millions of concurrent users, now available to all developers.
Why this gives you more control and keeps quality tangible for users
- Provider‑agnostic nodes so you can swap models and services without glue‑code churn or lock‑in.
- Remote config and Graph Registry to change prompts, models, and routing live, safely rolled out to cohorts.
- Targeted experiments to validate interaction quality for new geographies and demographics, including voices, timing, interruptions, prompts, and routing, and enabling personalization by segment.
- Observability for player‑perceived quality with traces, dashboards, and logs that expose latency paths, first‑audio timing, and lip‑sync cadence so you fix what users actually feel.
The approach is simple: one runtime for your entire multimodal pipeline (e.g. STT → LLM → TTS → game or media state), with observability and experimentation to optimize quality, latency, and cost for every audience.
Voice that keeps up with gameplay
Many teams start with TTS, then expand into full pipelines as they localize, personalize, and harden for live ops, testing variations for new geographies and demographics and locking in what works.
Inworld TTS delivers expressive, natural‑sounding speech built for real‑time play. You get low‑latency streaming, instant voice cloning, and timestamp alignment for lip‑sync and captions, plus multi‑language coverage and integrations with LiveKit, NLX, Pipecat, and Vapi for end‑to‑end real‑time agents. Pricing starts at $5 per 1M characters, so you can scale voice across large audiences.
Try the TTS Playground or call the API to integrate quickly.
Proven at scale with industry leaders in games and media
- Xbox × Inworld: Multi‑year co‑development to enrich narrative and character creation for game developers.
- Ubisoft (NEO): Prototype showcased real‑time reasoning, perception, and awareness in characters powered by Inworld tech.
- NVIDIA (Covert Protocol): Social simulation and hybrid on‑device or cloud capabilities using NVIDIA ACE with Inworld.
- Niantic: From Wol to WebAR, teams used Inworld to bring AI characters into spatial experiences.
- Streamlabs: Intelligent streaming assistant jointly powered by Streamlabs, NVIDIA ACE, and Inworld generative AI.
- NBCUniversal and other media leaders: Runtime was opened to all developers after we built infrastructure to meet their scale and quality bars.
Continuing our character leadership at scale
We pioneered character‑first, real‑time interaction years before today’s wave. That DNA is alive and well, and now it is backed by an infrastructure layer that gives developers more control and a better fit for modern production: Runtime for orchestration and TTS for voice that performs under pressure. If you knew us for our previous character stack, you will find this generation faster to ship, safer to iterate, and easier to scale.
Learn more about how to create characters with Runtime.
How do I get started with Inworld Runtime?
- Explore the Runtime Overview for graphs, experimentation, and observability
- Try our Templates for Node.js CLI, Voice Agent, Language Learning, and Companion apps
- Test TTS capabilities in our TTS Playground
- Check integrations with LiveKit, Pipecat, Vapi, and NLX
- Contact our team if you're scaling voice-first or character-driven experiences
Unreal (Runtime) is available now for early access. Unity is coming soon. If you are scaling a voice‑first or character‑driven experience in games or media, we would love to help you map the pipeline and quality targets that matter for your audience. Start with the Runtime Overview and Templates.
Powering the future of interactive media
We are uniting believable AI characters and worlds with the runtime required to run them at multi‑million‑user scale. Build the worlds you want, with characters that truly come alive and stay alive, at scale.
- Oct 6, 2025
- Parsed from source:Oct 6, 2025
- Detected by Releasebot:Dec 23, 2025
Inworld CLI - Hosted Endpoint
npm install -g @inworld/cli
Quickstart
- 3-Minute Setup: Single command installation, browser-based login, and instant API key generation.
- Local Development: Test your graphs instantly with inworld serve.
- Instant Deployment: Deploy to cloud with inworld deploy - no hosting, scaling, or infrastructure required.
- Oct 1, 2025
- Parsed from source:Oct 1, 2025
- Detected by Releasebot:Dec 23, 2025
Inworld + LiveKit: Unlocking studio-quality voice AI for real-time experiences at scale
Inworld and LiveKit announce real-time voice AI integration with TTS, voice cloning, and multilingual support via LiveKit's Agents framework. Developers can build immersive, voice-first experiences with under 200ms latency and affordable pricing. Aimed at democratizing high quality interactive voice apps.
It's time to bring your most ambitious AI applications to life with emotionally intelligent, real-time voice AI from Inworld. Now, you can use Inworld's pre-built voices or clone your own from a few seconds of audio in Inworld's API and via LiveKit's Agents framework. Inworld's multilingual, expressive voices are state-of-the-art quality with real-time latency, at roughly 5% of the cost of alternatives. You can learn more about Inworld's text-to-speech (TTS) models here.
Why Inworld + LiveKit for voice AI
You can now access Inworld voices and text-to-speech models via LiveKit's Agents framework plugin. This makes it easier for developers to create previously unimaginable, real-time voice experiences such as multiplayer games, agentic NPCs, customer-facing avatars, live training simulations, and more at an accessible price.
Experience Inworld TTS in a voice-driven, tabletop RPG game built by the LiveKit team. You can access the GitHub code repository to build your own voice-first, multi-agent game experience.
- Natural, conversational speech: Combine LiveKit's programmable audio pipeline with Inworld's state-of-the-art TTS and temperature controls for emotionally grounded, turn-based dialogue and instant feedback. Control voice switching and audio routing dynamically.
- Accessible pricing: Studio-quality voices for just $5/M characters, which is 5% of the cost of TTS from leading labs. That way you can build engaging experiences that scale with your users.
- Real-time latency: Generate and stream Inworld voices in under 200ms latency to first audio chunk via LiveKit's global edge infrastructure, which is ideal for real-time experiences with proven reliability in high-concurrency environments.
- Zero-shot voice cloning: Leverage Inworld's voice cloning capabilities to bring characters, brands, user-generated content, and more to life with emotion and personality using just 5-15 seconds of audio.
- Multilingual voices: Build agents in 11 of the most common languages for consumers, including English (with its various accents), Chinese, Korean, Dutch, French, Spanish, and more. You can also preserve accents for a specific voice when switching languages.
- Designed for developers: Build consumer applications with LiveKit's SDKs for web, mobile, and Unity. Craft custom voice pipelines using third-party integrations, RAG, and function calling. LiveKit's Agent framework also comes with performance metrics and debugging tools.
Built for builders
Get started in just minutes:
- Use LiveKit's Agent framework to stream audio via Inworld's TTS endpoint with the LLM and STT providers of your choice.
- Configure voice parameters, temperature control, and even language switching using Inworld's API.
- Deploy your agent to LiveKit's global infrastructure, allowing you to speak to your agent with real-time latency from anywhere in the world.
Ready to start building? Explore additional documentation to get started.
Inworld x LiveKit collaboration
Whether you are building immersive games, voice-first apps or agentic tools, Inworld + LiveKit is designed to give you full-stack control with real-world performance.
On June 17, 2025, Inworld and LiveKit hosted a Realtime AI Meetup in San Francisco. Hundreds of voice AI developers, founders, and enthusiasts gathered to explore how text-to-speech and speech-to-text models are built, key considerations for development, and how to maximize their potential in AI agents.
“Our TTS modeling framework allows us to advance voice AI's emotional and contextual understanding and easily add new functionality, while keeping costs affordable. This helps democratize access to building high-quality, real-time voice experiences.”
Jean Wang, Inworld Head of Product.
“Latency has a direct impact on user experience, and developers must consider how to manage it effectively. Fast models help, but efficient data streaming, optimized network communication, and prompting can also be crucial. Metrics like 'first response latency' are key. Developers should consider this not only in the models they use, but also in how they implement their applications.”
Michael Solati, LiveKit Developer Advocate.
- Sep 25, 2025
- Parsed from source:Sep 25, 2025
- Detected by Releasebot:Dec 23, 2025
Inworld meets Pipecat: Raising the bar for realtime voice AI
Inworld TTS now integrates with Pipecat, a vendor neutral open framework for realtime voice agents. The pairing enables low latency, emotive speech across web, mobile, and telephony with modular pipelines and zero‑shot voice cloning. Start building expressive, multilingual voice apps today.
The bar for interactive voice AI continues to rise. Today, we’re excited to announce that Inworld TTS is now fully integrated with Pipecat: the open-source, vendor-neutral framework purpose-built for architecting realtime voice agents and multimodal AI applications. Pipecat helps developers manage the complex orchestration of AI services, conversational features like interruptions and phrase endpointing, telephony and network transport, cross-platform libraries, audio processing, and multimodal interactions, all at ultra-low latencies. This integration makes it easier than ever to deploy fast agents with emotionally expressive speech, using Inworld’s TTS models in your own multimodal AI pipelines.
No matter what you're building - from voice assistants to AI companions, customer support agents, or immersive consumer experiences - Pipecat orchestrates the entire conversational flow while Inworld brings truly natural voice to life at a fraction of the cost of alternatives.
What is Pipecat?
Pipecat is a fully open-source, vendor neutral framework that enables developers to connect STT, LLMs, and TTS into realtime pipelines. It’s designed to power voice-first, multimodal agents with high responsiveness and flexibility.
Pipecat gives developers the most flexibility; it stands out for many reasons:
- Avoid vendor lock-in: Pipecat is not tightly coupled to any vendor's infrastructure. Deploy Pipecat and use the infrastructure you prefer.
- Streaming-first architecture: Pipecat's realtime "frames" model enables TTS to begin speaking before a sentence is even complete, while Inworld TTS delivers rich, emotionally nuanced speech that's virtually indistinguishable from human conversation.
- Modular, pluggable pipelines: Connect Inworld's voices to any STT or LLM using Pipecat's modular pipeline. Run multiple models in parallel, test different configurations, and connect your agent to custom logic and databases with built-in tools and advanced function calling
- Native telephony, transport and cross-library support: Pipecat supports realtime AI client SDKs for JavaScript, React, iOS, Android, C++, and Python, and deploys across telephony (including native Twilio), WebSockets, SIP, and WebRTC. You can easily build for web browsers, mobile apps, or traditional phone systems; Pipecat's transport layer adapts to your needs while Inworld's voices maintain consistent quality across all platforms.
- Smart Turn v2 model: Create accurate turn detection with native audio. Smart Turn v2 is trained on audio data and uses the speaker's audio as input. This lets your agents make decisions using the intonation and pace of the user's speech, while they're using Inworld's rich, expressive voices. The model also is fully open source (weights, training script, data sets).
It’s a full orchestration layer for building rich, interactive, responsive AI experiences.
Realtime expression, delivered naturally
Inworld TTS was built for dynamic, emotionally intelligent speech, designed to handle the unpredictability and expressive needs of realtime interaction. Plus, developers get production-ready voice AI for just $5 per million characters - about 5% the cost of leading alternatives.
With Inworld TTS in a Pipecat pipeline, you get:
- Millisecond audio synthesis that starts streaming before complete sentences
- Custom and pre-built voices across 11 languages, with more coming soon
- Natural vocalizations and emotional intelligence that adapts to conversational context
- Consistent low latency even with complex multi-step reasoning or function calls
- Zero-shot voice cloning from just seconds of audio - available free to all users
Together, Pipecat and Inworld make it possible to carry on a fluid, engaging conversation - in your browser, on the phone, or anywhere voice is used.
Start building today
This integration reflects both companies' commitment to democratizing access to cutting-edge AI technology. Pipecat's open-source approach removes barriers to experimentation and deployment, while Inworld's accessible pricing ensures that high-quality voice AI isn't limited to well-funded enterprises.
The future of voice AI is expressive, accessible, and real-time. With Inworld + Pipecat, that future is available right now.
- Sep 15, 2025
- Parsed from source:Sep 15, 2025
- Detected by Releasebot:Dec 23, 2025
Your AI is boring: Boost engagement and drive immediate performance improvements with Inworld Runtime and custom Mistral AI models
Mistral AI and Inworld unveil a joint Runtime to build, scale, and evolve consumer AI apps with purpose-built models and automated A/B testing. The partnership aims to replace generic LLMs with engaging, evolving experiences and seamless deployment at scale.
The two-layered problem
AI models are one-size-fits-all
General models, generic stories - Most teams plug in general-purpose LLMs trained on everything from source code to cookbooks. The breadth is useful for trivia, but it scrubs away personality. Dialogue sounds neutral and every engagement feels the same.
Safety before suspense - Standard model alignment rewards helpfulness and politeness. Necessary for support bots, fatal for entertainment. Models that are meant to entertain need to be able to break out of the one-dimensional persona that’s intentionally baked into public models.
Feelings, then forgetting - Base models can identify various emotions, but they don’t carry those emotions forward. Users are disappointed by isolated experiences that don’t build or impact future engagement.
AI app development is constrained
Productionization takes too long - While creating an AI demo takes hours, reaching production-readiness typically requires 6+ months of infrastructure and quality improvement work. Teams must handle provider outages, implement fallbacks, manage rate limits, provision and accelerate compute capacity, optimize costs, and ensure consistent quality. In building with category leaders, we saw how most consumer AI projects either make the leap or they stall out and die in the gap between prototype and scalable reality.
Maintenance is a burden - Most engineering teams spend over 60% of their time on maintenance tasks: debugging provider changes, managing model updates, handling scale issues, and optimizing costs. This leaves minimal resources for building new features, causing products to stagnate while competitors advance. We experienced this firsthand, as even innovative teams get trapped in maintenance cycles instead of building what users want next.
User expectations shift fast - Consumer preferences continuously evolve, but traditional deployment cycles of 2–4 weeks cannot match this pace. Teams need to test dozens of variations, measure real user impact, and scale winners - all without the friction of code deployments and app store approvals. Working with partners across the industry showed us that the fastest learner wins, but existing infrastructure makes rapid iteration nearly impossible.
A joint solution
Mistral AI and Inworld are partnering to provide frontier AI models and an intelligent Runtime that are purpose-built for building, scaling, and evolving AI applications. Mistral AI has trained new weights from the ground up, purpose-built to be creative, engaging, and to help shape immersive experiences. They’re designed to write with voice, harnessing distinct tone and rhythm instead of generic prose, and to understand beats such as gamification, goal progression, and long-term evolution.
Great models still need a living stage. That’s where Inworld’s Runtime comes in, turning static scripts into evolving experiences. Developers using Runtime can:
- Build applications from pre-optimized nodes that handle integration and automatically streamline data flows. The same graph scales effortlessly with minimal code changes and managed endpoints.
- Automate infrastructure with built-in telemetry for logs, traces, and metrics. The Portal surfaces bugs, user trends, and optimization opportunities, while Runtime manages failover, capacity, and rate limits. As you scale, it provides cloud resources to train, tune, and host cost-efficient custom models.
- Architect automated A/B testing without redeployments. Define variants, manage them in the Portal, and test models, prompts, and graph setups - deploying changes in seconds with automatic impact measurement.
When developers use Mistral AI’s creative models through Inworld Runtime, they have access to the most powerful combination of state-of-the-art models and developer tools, purpose-built for powering the next generation of consumer applications that deliver the personalized, engaging, and immersive experiences users are craving.
Inworld’s Runtime SDKs work seamlessly out-of-the-box with Mistral Code, the AI-powered coding assistant that bundles powerful models, an in-IDE assistant, and enterprise tooling into one fully supported package from Mistral AI. Building with Inworld Runtime and Mistral Code assures that developers can build rapidly without worrying whether their code will be production-ready and able to scale. Developers can now focus on creating the features and apps that their users will find most engaging: development as it should be, now enabled by AI.
How to get started
Meet with the Mistral AI and Inworld team to discuss your application - goals, strategy, and growth plan. Then, Mistral AI and Inworld will work together to offer the right combination of frontier AI models and runtime tools.
Once your app is in production, developers can harness support from Mistral and Inworld to continuously scale and evolve their applications through model customization, cost-efficient model routing, user evaluations, model retraining, and much more.
- Sep 11, 2025
- Parsed from source:Sep 11, 2025
- Detected by Releasebot:Dec 23, 2025
Node.js Runtime v0.6.0
Breaking changes
Simplified interfaces and improved APIs.
Developers upgrading from v0.5 should review the breaking changes below.
- ExecutionConfig Access: New context.getExecutionConfig() method with automatic property unwrapping
- Graph Execution: Graph.start() now returns ExecutionResult with execution details
- Unwrapped Types: Cleaner GoalAdvancement and LLMChatRequest interfaces
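For illustration, here is a hedged sketch of how these changes might look in code, with the node shape, context type, and ExecutionResult fields assumed rather than taken from the SDK reference.

```typescript
// Illustrative only: the class shape, context type, and ExecutionResult fields
// below are assumptions; consult the v0.6.0 SDK reference for the real API.
class GreetingNode /* extends the Runtime's custom node base class */ {
  process(context: any, input: string): string {
    // v0.6.0: read execution-scoped settings via the new accessor, with
    // properties unwrapped automatically (no manual unwrapping).
    const config = context.getExecutionConfig();
    return `${input} (persona: ${config.personaName ?? 'default'})`; // personaName is hypothetical
  }
}

// v0.6.0: Graph.start() returns an ExecutionResult with execution details
// (it becomes async and stream-based in v0.8.0, per the note above).
// const result = graph.start('hello');
// console.log(result.executionId, result.outputs); // assumed field names
```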