Deepgram Release Notes

Last updated: Jan 16, 2026

  • Jan 16, 2026
    • Parsed from source:
      Jan 16, 2026
    • Detected by Releasebot:
      Jan 16, 2026

    January 16, 2026

    Voice Agent gains multi-provider support with automatic fallback and ordered usage for reliability. New LLM models are now supported including GPT-5.1, Claude Sonnet 4.5, and Gemini 3, configurable in agent settings for seamless upgrades.

    Voice Agent Multiple LLM Models

    We’ve added new functionality that allows you to specify multiple LLM providers for your Voice Agent, ensuring your agent will automatically fall back to another provider should you experience any issues. The think object supports both a single provider and an array of providers; providers are used in the order you specify them.
    For more details, visit our Voice Agent Multiple LLM Models documentation.
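
    For example, an ordered fallback chain could look like the sketch below. The array form is extrapolated from the description above and the single-provider Settings example in the next section, and the "anthropic" provider type string is an assumption; check the documentation for exact field values.

    {
      "type": "Settings",
      "agent": {
        "think": {
          "provider": [
            { "type": "open_ai", "model": "gpt-5.1" },
            { "type": "anthropic", "model": "claude-sonnet-4-5" }
          ]
        }
      }
    }

    With this configuration, the agent would use gpt-5.1 first and fall back to claude-sonnet-4-5 if the primary provider has issues.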

    🤖 New LLM Models Support

    We’ve added support for new LLM models in our Voice Agent API!

    Available Models:

    • OpenAI GPT 5.1 Chat (gpt-5.1-chat-latest)
    • OpenAI GPT 5.1 (gpt-5.1)
    • Anthropic Claude Sonnet 4.5 (claude-sonnet-4-5)
    • Google Gemini 3 (gemini-3-pro-preview)

    Implementation:

    Configure your chosen model in your Voice Agent settings:

    {
      "type": "Settings",
      "agent": {
        "think": {
          "provider": {
            "type": "open_ai",
            "model": "gpt-5.1"
          }
        }
      }
    }
    


    For complete information about supported LLMs including the new models, visit our Voice Agent LLM Models documentation.

  • Jan 15, 2026
    • Parsed from source:
      Jan 15, 2026
    • Detected by Releasebot:
      Jan 16, 2026

    Deepgram Self-Hosted January 2026 Release (260115)

    Deepgram's January 2026 self-hosted release speeds up TTS and introduces a breaking change in how the API and Engine containers communicate. Deploy Engine 3.107.0-1 before API 1.176.0 (License Proxy 1.9.2 optional) to avoid downtime, with blue-green deployment advised. Portuguese "um" now transcribes as a non-filler word.

    Container Images Release

    Container Images (release 260115)

    • quay.io/deepgram/self-hosted-api:release-260115
      • Equivalent image to: quay.io/deepgram/self-hosted-api:1.176.0
    • quay.io/deepgram/self-hosted-engine:release-260115
      • Equivalent image to: quay.io/deepgram/self-hosted-engine:3.107.0-1
    • Minimum required NVIDIA driver version:
      • >=570.172.08

    • quay.io/deepgram/self-hosted-license-proxy:release-260115
      • Equivalent image to: quay.io/deepgram/self-hosted-license-proxy:1.9.2
    • quay.io/deepgram/self-hosted-billing:release-260115
      • Equivalent image to: quay.io/deepgram/self-hosted-billing:1.12.1

    January 2026 Self-Hosted Release: Update Recommendation

    In Deepgram’s January 2026 self-hosted release (release-260115), we added new functionality to improve TTS response times from our API and Engine containers.
    Due to this product change, the January 2026 self-hosted release is not backwards-compatible with previous releases when used to serve TTS traffic: it changes how the API and Engine containers communicate with each other. To avoid downtime in your self-hosted deployment, the updated Engine node (3.107.0-1) must be running before the updated API node (1.176.0) begins serving requests. The new Engine version is compatible with previous versions of the API, so the Engine container can safely be deployed first.
    Blue-green deployment is one strategy that satisfies this requirement, but any approach that deploys the Engine container first will work. This only applies to deployments serving TTS traffic; the breaking change does not affect deployments serving STT traffic.
    The License Proxy node is not impacted by breaking changes, but in the context of a complete Deepgram self-hosted deployment, it is most cohesive to also include the update to the License Proxy node (1.9.2) in the blue-green deployment.
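
    As a minimal sketch of the required ordering, assuming a Docker-based deployment and the release images listed above (actual rollout commands depend on your orchestration tooling):

    # 1. Update the Engine nodes first; Engine 3.107.0-1 remains compatible with older API versions.
    docker pull quay.io/deepgram/self-hosted-engine:release-260115
    # ...replace the Engine containers and verify they are healthy...

    # 2. Only then update the API nodes to 1.176.0.
    docker pull quay.io/deepgram/self-hosted-api:release-260115
    # ...replace the API containers...

    # 3. Optionally include the License Proxy update (1.9.2) in the same blue-green cycle.
    docker pull quay.io/deepgram/self-hosted-license-proxy:release-260115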

    This Release Contains The Following Changes

    • Improves Transcription of “Um” in Portuguese — Monolingual Portuguese STT now transcribes “um” (meaning “one”) as a non-filler word, and “um” is included in Portuguese transcripts, even when the filler_words feature is disabled.
    • General Improvements — Keeps our software up-to-date.
  • Jan 13, 2026
    • Parsed from source:
      Jan 13, 2026
    • Detected by Releasebot:
      Jan 15, 2026
    • Modified by Releasebot:
      Jan 16, 2026

    January 13, 2026

    Flux now supports WebM containers with Opus encoding for seamless WebM audio integration. When using WebM, omit encoding and sample_rate as Flux auto-detects from metadata. This release makes Flux more versatile for web apps and streaming.

    Flux: WebM Container Support Added

    Flux now supports the WebM container format with Opus codec, providing seamless compatibility with audio sources that output WebM-formatted audio streams.

    WebM Container Support

    Flux now accepts WebM containers with Opus codec encoding. When sending WebM containerized audio, omit the encoding and sample_rate parameters—Flux will automatically detect these from the container metadata.

    Why This Matters

    WebM is commonly used in web applications and streaming scenarios. This addition makes it easier to integrate Flux with audio sources that natively output WebM format, eliminating the need for format conversion.

    Implementation

    For WebM containerized audio:
    wss://api.deepgram.com/v2/listen?model=flux-general-en
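
    For comparison, raw audio streams would still pass these parameters explicitly; the encoding and sample rate below are illustrative placeholders, not values taken from this entry:

    wss://api.deepgram.com/v2/listen?model=flux-general-en&encoding=linear16&sample_rate=16000

    With WebM input, connect without these parameters, as in the URL at the start of this section, and let Flux read the values from the container metadata.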
    For detailed information about all supported Flux audio formats, see our Flux documentation.

  • Jan 13, 2026
    • Parsed from source:
      Jan 13, 2026
    • Detected by Releasebot:
      Jan 13, 2026

    Deepgram Raises $130M Series C at $1.3B Valuation to Power the Voice AI Economy

    Deepgram raises $130M in a Series C at a $1.3B valuation to accelerate real-time Voice AI platform growth. The release also announces the OfOne acquisition, an expanded patent portfolio, a Powered by Deepgram expansion, and a new San Francisco Voice AI Collaboration Hub.

    Funding Accelerates Deepgram’s API Platform for Real-Time Voice AI and Supports Launch of ‘Powered by Deepgram,’ Acquisition of OfOne, an Expanded Patent Portfolio, and New Voice AI Collaboration Hub in San Francisco

    SAN FRANCISCO (January 13, 2026) – Deepgram, the real-time API platform underpinning the Voice AI economy, today announced it has raised $130 million in Series C funding at a $1.3 billion valuation. The round was led by AVP, an independent global investment platform dedicated to high-growth technology companies across Europe and North America.

    All major existing investors joined the round, including Alkeon, In-Q-Tel, Madrona, Tiger, Wing, Y Combinator, and funds and accounts managed by BlackRock. Several new investors, including Alumni Ventures and Princeville Capital, invested in the round, in addition to industry leaders such as Twilio, ServiceNow Ventures, SAP, and Citi Ventures. University of Michigan and Columbia University also invested, joining other existing academic investors such as Stanford University.

    With this investment, Deepgram is ideally positioned to deliver the real-time frontier Voice AI models and platform required to reliably power billions of live conversations with the naturalness, latency, and accuracy of human voice. AVP was selected as lead investor for its deep expertise scaling category-defining companies globally and its ability to support Deepgram’s international expansion, including Europe and other key markets.

    “Much like Stripe delivered the API platform underpinning the payments economy, we believe Deepgram is poised to deliver the API platform underpinning the emerging trillion-dollar B2B Voice AI economy,” said Elizabeth de Saint-Aignan, General Partner at AVP. “Deepgram’s success in building real-time, reliable, and massively scalable Voice AI infrastructure, combined with the rapid shift toward voice-first B2B experiences, positions the company to become one of the foundational AI companies of this decade.”

    “As we rapidly approach a world where billions of simultaneous conversations are powered by Voice AI, enterprises and developers need real-time, reliable infrastructure capable of fully duplex, contextual conversations at scale – this is Deepgram,” said Scott Stephenson, CEO and Co-Founder of Deepgram. “From pioneering end-to-end deep learning for voice, to earning multiple patents for our research, to our commitment to pass the Audio Turing Test at scale in 2026, we’ve consistently executed on a single vision: powering a future centered on the original human interface – voice.”

    “We are pleased to welcome AVP and our new strategic investors,” continued Stephenson. “Together with our existing investors, their conviction reflects the emergence of the Voice AI economy – and Deepgram’s role in powering it.”

    Powered by Deepgram

    Today, more than 1,300 organizations build Voice AI functionality powered by Deepgram APIs. Deepgram APIs are a foundational infrastructure layer of a global set of offerings delivering real-time, accurate, and reliable speech understanding, speech generation, analytics, orchestration, and fully autonomous voice agents.

    "Voice is a critical, strategic, and data-rich customer engagement channel," said Andy O'Dower, Vice President of Product Management for Voice and Video at Twilio. "Twilio's flexible orchestration capabilities and global communications infrastructure, combined with speech recognition powered by Deepgram APIs, deliver seamless, low-latency, and human-like AI agent experiences that are powering today's Voice AI renaissance."

    Deepgram’s industry-leading offerings include:

    • Aura-2, the world’s most professional, cost-effective, and enterprise-grade text-to-speech model
    • Nova-3, the world’s most accurate, real-time and reliable speech-to-text model
    • Flux, the world’s first Conversational Speech Recognition model built specifically to solve the biggest problem in voice agents – interruptions
    • Voice Agent API, the world’s only enterprise-ready, real-time, and cost-effective conversational AI API
    • Saga, the Voice OS

    All Deepgram models can be customized to domain-specific terminology and acoustic environments and deployed as cloud APIs or through self-hosted and on-premises options. A full SDK library is available to simplify development and accelerate production timelines.

    See the Powered by Deepgram page to learn more about how the most innovative AI organizations in the world build Voice AI functionality powered by Deepgram.

    Deepgram Acquires OfOne to Expand Real-Time Voice Automation into Restaurants

    Deepgram also announced today the acquisition of OfOne, an AI-native voice platform created for restaurants and the quick-service drive-thru market. OfOne has consistently delivered more than 95% containment, with high employee satisfaction scores and strong operational impact for national QSR brands.

    The OfOne team has joined Deepgram, and its technology now anchors Deepgram for Restaurants, an offering built to help restaurants improve customer experience, increase order accuracy, and support overstretched staff with real-time AI assistance. Additional functionality and expanded integrations will be delivered in the coming months.

    “We are incredibly proud to join the Deepgram team. Deepgram has built the most advanced real-time voice platform in the world, and it is the perfect foundation for scaling what we started at OfOne,” said Will Edwards, GM of Deepgram for Restaurants (formerly CEO at OfOne). “The impact of AI for restaurants and drive-thrus is enormous, and together we can deliver on that opportunity with the accuracy, speed, and reliability operators need at international scale.”

    Expansion of Patent Portfolio

    New funding will also accelerate Deepgram’s expansion of its intellectual property, building on a patent portfolio filed continuously since 2016, with several key U.S. patents granted in 2025.

    US 12,380,880 for End-to-End Automatic Speech Recognition With Transformer establishes a novel method for integrating and training ASR and transformer models as a single system, leading to improvements in accuracy and speed. This is complemented by US 12,334,075 for Hardware-Efficient Automatic Speech Recognition, which utilizes intelligent batching and parallel processing to ensure optimal hardware use, directly reducing latency and cost for customers handling massive volumes of voice data. Most recently, US 12,499,875 for Deep Learning Internal State Index-Based Search and Classification protects techniques for leveraging internal neural representations to enable faster audio search and more accurate classification at scale. These newly granted patents solidify Deepgram’s leadership in core deep learning architecture, representation learning, and deployment efficiency.

    New Voice AI Collaboration Hub in San Francisco

    Deepgram is opening a new Voice AI Collaboration Hub in San Francisco to bring the voice AI community together in person. Designed for meaningful collaboration with customers, partners, and builders, the space will host hands-on working sessions, live demonstrations, executive briefings, community meetups, and developer hackathons – creating a shared environment where ideas turn into products and the future of Voice AI is built together.

    About AVP

    AVP is an independent global investment platform dedicated to high-growth, tech (from deep-tech to tech-enabled) companies across Europe and North America, managing more than €2.5bn of assets across four investment strategies: venture, early growth, growth, and fund of funds. Its multi-stage platform combines global research with local execution to drive investment. Since its establishment in 2016, AVP has invested in more than 60 technology companies and in more than 60 funds with the Fund of Funds investment strategy. Beyond providing equity capital, its expansion team works closely with founders, providing the expertise, connections, and resources needed to unlock growth opportunities and create lasting value through meaningful collaborations. For more information, please visit www.avpcap.com.

    About Deepgram

    Deepgram is the real-time API platform underpinning the Voice AI economy. Its Voice AI platform offers speech-to-text (STT), text-to-speech (TTS), and full speech-to-speech (STS) capabilities–all powered by its enterprise-grade runtime. 200,000+ developers build with Deepgram’s voice-native foundational models – accessed through cloud APIs or as self-hosted / on-premises APIs – due to its unmatched accuracy, low latency, and competitive pricing. Customers include technology ISVs building voice products or platforms, co-sell partners working with large enterprises, and enterprises solving internal use cases. Having processed over 50,000 years of audio and transcribed over 1 trillion words, there is no organization in the world that understands voice better than Deepgram. To learn more, please visit www.deepgram.com, read its developer docs, or follow @DeepgramAI on X and LinkedIn.

  • Jan 12, 2026
    • Parsed from source:
      Jan 12, 2026
    • Detected by Releasebot:
      Jan 13, 2026

    Flux Just Got A Little Smarter

    Flux V0.1 brings a more conservative end‑of‑turn transcriber with a new training paradigm. It improves transcription quality up to 10%, cuts start‑of‑turn false positives by up to 70%, and speeds end‑of‑turn detection by 50–150ms with no user action needed.

    End-of-Turn Based Finalization

    We recently shipped a small improvement to Flux based on a new training paradigm, resulting in better overall accuracy and, notably, fewer false positives when it comes to start-of-turn detection.

    With Flux, one of the more visible changes we made was to how our streaming models handle finalization. Finalization refers to the “locking in” of a word; once the word is finalized, the system will not go back and change its hypothesis for that word. Many streaming STT systems finalize words according to

    • “wall clock time,” i.e., finalizing words some amount of time after they’ve been spoken (e.g., transducer-based models), or
    • “pause time,” i.e., whenever a break or pause, generally as detected by a voice activity detector (VAD), occurs in speech (e.g., Nova-3 streaming, “streaming Whisper” variants).
    By comparison, Flux uses “conversation time,” performing finalization upon the completion of a conversational turn.

    The reason for doing so was to help unlock low-latency end-of-turn detection. Systems that inherently rely on some delay, e.g., until some time has passed or until a pause occurs, to finalize their transcript necessarily introduce additional latency between when a user stops speaking/ends their turn and when the transcript is ready. Instead, Flux constantly provides a view of what the most likely transcript would be if the turn were to end at that moment, such that when the turn does end, a high-quality transcript is ready immediately.

    However, a downside of treating the turn as if it could end at any moment is that Flux would occasionally “take a stab” at transcribing words before the user had finished speaking them, resulting in hypotheses that would need to be revised later. In the worst case, Flux would end up guessing that non-word sounds would become common words such as “Hey” or “I,” resulting infrequently (but still more often than our target goal of “never”) in false positive speech detection.

    In reality, though, a turn is not likely to end at any moment; if the user is mid-word, then their turn is obviously not ending then. In these situations, Flux could afford to wait without sacrificing the ultimate end-of-turn accuracy. Since the initial release of Flux, we have developed a new training approach that better reflects this reality, placing more emphasis on the quality of the transcript at end-of-turn. The resulting model is slightly more conservative when it comes to transcription, which has three notable implications for developers. Relative to the initial version of Flux, this new version exhibits:

    • Improved transcription quality, by up to 10% for certain types of audio data,
    • Reduced false positive rate for start-of-turn detection, by up to 70%, and
    • Faster end-of-turn detection at eot_threshold >= 0.8, by 50-150ms.

    To see these benefits, developers working with Flux will need to do…absolutely nothing! The new version already launched in early December; to start, we applied our new training recipe to achieve a modest fine-tuning of the existing Flux model (“Flux V0.1”) such that the behavior is almost identical to that of the original model, with the notable exceptions listed above.

    Below, we describe the intuition behind the new training approach in more detail, as well as its impact on the model.
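
    For reference, the eot_threshold discussed throughout this post is set per connection. A sketch, assuming it is passed as a query parameter on the Flux endpoint shown elsewhere in these release notes:

    wss://api.deepgram.com/v2/listen?model=flux-general-en&eot_threshold=0.8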

    Flux V0.1: A More Conservative Transcriber

    As astute observers of Flux outputs might have noticed, Flux transcripts are constantly being revised throughout the turn, analogous to how human understanding of speech evolves as we ingest more context. For those more familiar with LLMs, the way this works under the hood is loosely analogous to the concept of “test time compute;” Flux has some transcription “budget” that it spends over the course of the turn. The initial version of Flux (“Flux V0”) was trained to spend this budget aggressively, transcribing everything it had heard up to that point in time. This ensured that all audio would already have been transcribed as soon as a turn ends, but had the downside that Flux V0 might occasionally transcribe something “thinking” that whatever occurred at the end of the audio might be part of a word.
    However, from the standpoint of “high-quality transcription at end-of-turn,” this aggressiveness is not really warranted! Instead, our new training paradigm does the obvious thing, namely optimize for…correctness specifically towards end-of-turn. The result is that the model learns to be more conservative with its budget. For instance, if you were to painstakingly evaluate the “working transcripts” output by Flux V0.1 throughout the course of the turn (note: you should not actually do this since it’s annoying and, anyway, we have done it for you as shall imminently become apparent), you would see a reduction in cases where the model

    • outputs a word and subsequently removes it, by 20%, and
    • changes the last word it output, by 30%.
    Notably, this is not the result of a hard-coded lookahead/delay, nor does the model learn to consistently delay transcription. Indeed, in this training paradigm it could not, since it is still penalized for not having a maximally accurate transcript at end-of-turn. Instead, the model learns when to delay transcription or not, allowing improved usage of budget without sacrificing end-of-turn latency.

    Improved Turn Detection

    The advantage of a more conservative transcriber is that it is less likely to be tricked into thinking it heard speech when it did not, or to have to delay end-of-turn while it fixes an incorrect transcript. Correspondingly, Flux V0.1 exhibits fewer false positive start-of-turn detections, and reduced end-of-turn detection latency, particularly at higher eot_threshold and in the tails.
    To evaluate start-of-turn detection, we compare Flux’s StartOfTurn detection time with the start time of the first word in the turn (for more details on how we evaluate these turn-oriented STT models, see Evaluating End-of-Turn (Turn Detection) Models). In this paradigm, a detection latency of “zero” is not really to be expected, since that would correspond to detecting and outputting the word before it had been fully spoken. We do this so that latency < 0 has a distinct interpretation, namely that such cases correspond to a likely false positive, since we detected a word before any were spoken.
    The plot below shows the full cumulative distribution function (CDF) of StartOfTurn detection latency for Flux V0 and V0.1. Since the difference is hard to see (like we said, Flux V0 only occasionally falsely detected turn start in our benchmarking), we have included in the caption the density (i.e., frequency) of detections with latency < 0, i.e., false positives. Flux V0.1 achieves a 0.4% false positive rate, a more than 70% reduction compared to the 1.5% rate observed for Flux V0.
    As with most things, this improvement is not entirely free (I was disappointed to find, upon moving into ML in industry, that there is “no free lunch,” especially given free lunch is one of the primary motivators for physics PhD students). Flux V0.1 typically detects start-of-turn ~40ms slower than Flux V0. However, as the CDF of first word durations indicates, both versions still detect turn start within 200ms of when the user actually starts speaking, well within the regime of “normal.”
    Just like for start-of-turn, improvements for end-of-turn show up in the tails, i.e., those difficult cases where being a little bit more conservative might lead to a higher quality prediction. The plot below is analogous to the one above, but focused on end-of-turn detection latency, for an eot_threshold = 0.80. At the median, we see a modest reduction of median latency by 40ms, but speedups of closer to 100-150ms at higher percentiles.
    Below, we show the WER and end-of-turn detection F1 score as a function of median detection latency, as controlled by eot_threshold (lower threshold = crossed earlier = faster detection, but more false positives). As we can see, the new model behaves almost identically, apart from being slightly faster (and more accurate!) at higher thresholds.
    Note that we do observe a minor slowdown at the left side of the plot, which corresponds to eot_threshold = 0.6, lower than the values preferred by most developers. But, for this threshold, the latency is actually reduced for percentiles above 80%.

    Improved Transcription Quality

    Since the changes in behavior predominantly impact working transcripts and the implications for turn detection, it is not immediately obvious these changes would necessarily result in significant improvements to finalized accuracy. However, again leveraging the analogy to “test time compute,” it is not unreasonable that this is the case. If you have a finite time (or token budget) to answer a question and you devote some of that precious budget to a fruitless line of reasoning, that reduces the time you have available to find the right answer. And, indeed, we see that these changes in behavior do correspond to meaningful changes in transcription accuracy.
    Some of this is apparent above; Flux V0.1 exhibits a lower WER across most of the parameter space than its predecessor. However, since “two person conversation-oriented” data can be somewhat narrow in terms of acoustic conditions or topics covered, we also compared the “pure transcription” capabilities of the model on a broader data sample.
    When evaluating STT models at Deepgram, we typically use internal test sets that are reflective of the wide range of real world use cases and audio conditions encountered by our customers. We do not prefer open source evaluation sets such as Common Voice due to their more artificial nature, and in fact have found that achieving high performance on Common Voice can typically come at the detriment of performance on customer data. Also, since the test splits are public, one might be tempted to over-fit on Common Voice in order to look more impressive on benchmarking platforms.
    In this case, however, our internal evaluation on a truly held-out Common Voice test set revealed significant improvements from this methodology, potentially due to the model’s ability to “wait” when confronted with challenging or more stilted speech. Specifically, whereas we observed a modest 3-5% improvement on our internal test sets, the improvement on our Common Voice test set was closer to 10%! So, especially since we are focused on relative Flux improvements and not comparing to competitors (who might have their own view of what an appropriate held-out set is), here we also share the results of our internal Common Voice benchmarking.
    The plots above show finalized transcription accuracy on two test sets, comparing the original version of Flux with this latest update. As you can see, the new version of Flux is modestly more accurate! This was a pleasant surprise in the land of English STT, where we find our models are very close to the accuracy ceiling imposed by ground truth noise (either due to inherent ambiguity in transcription or annotator mistakes).

    Conclusions

    Our new training approach results in a more economical version of Flux, resulting in better overall accuracy and, notably, fewer false positives when it comes to start-of-turn detection. Now, if only I could work out how to train my offspring to be more economical with their “TV budget,” we’d really be onto something…

  • Jan 10, 2026
    • Parsed from source:
      Jan 10, 2026
    • Detected by Releasebot:
      Jan 13, 2026

    Deepgram Expands Nova-3 with Italian, Turkish, Norwegian, and Indonesian Support

    Nova-3 expands to Italian, Turkish, Norwegian, and Indonesian, boosting enterprise-grade voice AI with diverse phonetics and real-time accuracy across regions. Enhanced customization via Keyterm Prompting and strong WER gains solidify global multilingual transcription for businesses.

    Building on our global momentum, Nova-3 continues its language expansion with Italian, Turkish, Norwegian, and Indonesian, further extending enterprise-grade voice AI across new regions and linguistic systems.

    These four languages bridge Europe and Asia and introduce a wider range of speech structures, accents, and phonetic patterns than ever before. From vowel-rich Romance tones to agglutinative suffix chains, Nova-3 demonstrates its ability to understand the world’s linguistic diversity with accuracy, speed, and adaptability.

    Beyond Borders: The Next Wave of Linguistic Diversity

    While earlier Nova-3 updates brought coverage to European and global bridge languages such as German, Spanish, French, and Portuguese, this release marks a new stage defined by structural and phonetic diversity.

    • Italian: A rhythmically regular, vowel-heavy language known for fast conversational flow and clear articulation.
    • Turkish: Agglutinative and rule-driven, characterized by long compound words and strict vowel harmony.
    • Norwegian: Tonal pitch accents and regional variation between Bokmål and Nynorsk make pronunciation highly variable.
    • Indonesian: A hybrid language blending Malay roots, English loanwords, and regional inflection, often used in multilingual environments.

    These additions show how Nova-3 continues to evolve, not just expanding in coverage but also in complexity, performing across vastly different grammatical systems and phonetic patterns.

    Enterprise Voice AI for Emerging and Established Markets

    Each of these new languages opens doors for enterprises operating across Europe and Asia.

    • Italian: Banking, insurance, and customer service sectors that demand precise transcription for regulatory compliance.
    • Turkish: Contact centers and fintech applications serving a fast-growing regional market that bridges Europe and the Middle East.
    • Norwegian: Nordic enterprises emphasizing secure, compliant voice data solutions for customer support and analytics.
    • Indonesian: Expanding tech ecosystems across Southeast Asia relying on real-time transcription for marketplaces, logistics, and customer interaction.

    Whether handling a medical transcript in Milan, a financial conversation in Istanbul, or a multilingual call in Jakarta, Nova-3 brings the same enterprise-ready reliability across regions.

    Customization That Adapts to Local Language Use

    Beyond accuracy, Nova-3 includes Keyterm Prompting, Deepgram’s self-service customization feature that allows developers to inject up to 100 domain-specific terms directly into transcription.

    This capability is especially valuable for these languages, where context and morphology can shift meaning dramatically. Turkish’s long compound forms, Indonesian’s mix of English and regional terms, or Norwegian’s evolving loanwords can all be handled without retraining. Even specialized industry vocabulary, such as Italian pharmaceuticals, Turkish fintech products, or Indonesian brand names, can be added instantly to improve precision.
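
    As an illustration, keyterms are supplied per request. The sketch below assumes the repeatable keyterm query parameter from Deepgram’s Keyterm Prompting docs; the Italian terms are invented examples:

    curl --request POST \
      --header "Authorization: Token YOUR_DEEPGRAM_API_KEY" \
      --header "Content-Type: audio/wav" \
      --data-binary @youraudio.wav \
      "https://api.deepgram.com/v1/listen?model=nova-3&language=it&keyterm=Tachipirina&keyterm=posologia"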

    Benchmarking: Accuracy Gains Across Languages

    Nova-3 continues to deliver measurable accuracy gains over Nova-2, reducing Word Error Rate (WER) across both batch and streaming modes.

    The results confirm a consistent pattern: streaming models show stronger relative WER improvements, reinforcing Nova-3’s ability to operate in dynamic, real-time environments where responsiveness and stability are essential.

    Word Error Rate (WER) – Relative Improvement (Italian, Turkish, Norwegian, Indonesian):

    Nova-3 delivers double-digit relative WER reductions across Italian, Turkish, Norwegian, and Indonesian, with streaming showing the largest gains.

    Key highlights:

    • All four languages show double-digit relative WER reductions compared to Nova-2.
    • Streaming models outperform batch across the board, with notable improvements in Turkish and Indonesian, reflecting robustness under variable speech conditions.
    • Accuracy gains extend to dialectal and informal speech, confirming Nova-3’s linguistic adaptability.

    These benchmarks highlight Nova-3’s continued evolution, improving not only across similar language families but across structurally diverse ones as well.

    Why It Matters

    This expansion shows Nova-3’s growing maturity as a global ASR foundation for enterprises building multilingual products and services.

    From phonetic clarity to morphological complexity, Nova-3 adapts to each language’s structure rather than applying a one-size-fits-all approach.

    For developers and enterprises, this means:

    • Broader coverage with consistent model performance across regions.
    • Improved recognition in both formal and conversational contexts.
    • Reduced latency and errors in multilingual voice workflows.
    • Flexible customization through Keyterm Prompting for fast adaptation to new terminology.

    Getting Started

    Switching to Italian, Turkish, Norwegian, or Indonesian in Nova-3 is as simple as adding a language parameter to your request. For example:

    curl --request POST \
      --header "Authorization: Token YOUR_DEEPGRAM_API_KEY" \
      --header "Content-Type: audio/wav" \
      --data-binary @youraudio.wav \
      "https://api.deepgram.com/v1/listen?model=nova-3&language=it"

    Replace language=it with tr, no, or id to transcribe Turkish, Norwegian, or Indonesian audio, respectively.

    Explore more options in the Models & Languages Overview.

    Looking Ahead

    With Italian, Turkish, Norwegian, and Indonesian now available, Nova-3 continues its journey toward global coverage, building a foundation for voice AI that understands every accent, every region, and every language structure.

    This release sets the stage for the next wave of expansions across Asia and Eastern Europe as Deepgram continues to deliver voice AI that is both globally accessible and locally precise.

    Unlock Enterprise-Grade Voice AI Today

    Sign up free and unlock $200 in credits - enough to power over 750 hours of transcription or 200 hours of speech-to-text in Italian, Turkish, Norwegian, or Indonesian. Explore more details on our Models & Language Overview page and dive into Nova-3’s world-class accuracy and adaptability across global markets.

    Join a growing community of innovators: your feedback will directly influence upcoming language support, helping shape the next wave of Deepgram’s global voice AI.

    Stay in the loop: Subscribe now to receive priority announcements on new features and language expansions, as Deepgram continues to scale globally with confidence.

  • Jan 10, 2026
    • Parsed from source:
      Jan 10, 2026
    • Detected by Releasebot:
      Jan 10, 2026

    Deepgram Expands Nova-3 with 11 New Languages Across Europe and Asia

    Nova-3 expands to 11 new languages across Eastern Europe, South Asia, East Asia and Southeast Asia, boosting global ASR with tonal, multi-script and morphologically complex speech. It also brings Keyterm Prompting with no retraining for enterprise multilingual voice AI.

    Built for More Than Just New Words

    Deepgram is continuing its global rollout of Nova-3 by unlocking support for 11 new languages across Eastern Europe, South Asia, East Asia, and Southeast Asia. This expansion brings Nova-3 into markets shaped by tonal speech, multi-script writing systems, high morphological complexity, and rapid code switching, all major challenges for legacy speech-to-text systems.

    Nova-3 now adapts not just to new vocabularies, but to entirely different linguistic structures, from syllable timing in Japanese to vowel harmony in Hungarian to tonal contour in Vietnamese.

    11 New Languages Now Live in Nova-3

    Nova-3’s earlier language support focused on widely spoken global and business languages. This release marks a new stage, expanding into regions with linguistic structures that differ sharply from Western European speech. From agglutinative suffix chains to tonal vowel shifts to script segmentation, these languages represent the next level of diversity in speech AI.

    Eastern Europe and Eurasia

    • Bulgarian (bg): Fast-changing vowel reductions and Cyrillic script make Bulgarian harder for generic ASR models. Nova-3 brings stronger morphological grounding across tense and aspect.
    • Czech (cs): Free word order and complex consonant clusters require strong acoustic modeling. Nova-3 improves both recognition and contextual inflection handling.
    • Hungarian (hu): An agglutinative language with long compound suffix chains. Nova-3 maintains high accuracy even as words stretch across many morphemes.
    • Polish (pl): Seven grammatical cases and nasal vowels often trip up speech systems. Nova-3 improves recognition of endings, plurals, and soft consonants.
    • Russian (ru): Rich inflection and heavy homophones demand strong contextual modeling. Nova-3 resolves word forms with better keyterm recall rate (KRR) performance.
    • Ukrainian (uk): Nova-3 delivers high accuracy across palatalized consonants and open vowels, improving word boundary detection in faster speech.

    Nordics and Baltics

    • Finnish (fi): Long compound words and vowel harmony are traditional STT challenges. Nova-3 improves segmentation, especially in real-time dictation use cases.

    South Asia

    • Hindi (hi): Frequent English code switching plus inflection-heavy verbs require hybrid modeling. Nova-3 improves recognition across Hinglish speech patterns.

    East Asia

    • Japanese (ja): Mixed kana, kanji, and loanword pronunciation make standard ASR brittle. Nova-3 tracks syllabic rhythm and foreign term pronunciation more reliably.
    • Korean (ko, ko-KR): Hangul syllable blocks, rapid conjugation, and spacing ambiguity are core challenges. Nova-3 boosts accuracy in broadcast, support, and agent use cases.

    Southeast Asia

    • Vietnamese (vi): A fully tonal language with six tones and strong regional variation. Nova-3 improves tone resolution and reduces false positives in fast speech.

    These additions show Nova-3’s ability to scale across language families, scripts, and speech behaviors, from Vietnamese tones to Hungarian suffix chains to Japanese and Korean writing systems. Nova-3 grows not by approximation, but by linguistic precision.

    Why Keyterm Prompting Matters for These Languages

    With Nova-3, Keyterm Prompting is now available across all 11 languages, giving developers control over product names, technical vocabulary, and domain terminology.

    It is especially useful for:

    • Korean compound nouns with dynamic spacing
    • Japanese loanwords spoken differently than written
    • Hindi and English mixed speech in customer support
    • Polish and Russian case endings that shift meaning
    • Vietnamese business terms without diacritics in source audio

    Instead of retraining a model, you can steer it with a simple prompt.

    Benchmarking: Accuracy Gains at Global Scale

    Nova-3 once again delivers measurable accuracy improvements over Nova-2, reducing Word Error Rate (WER) across both batch and streaming modes, even in languages with complex morphology, non-Latin scripts, or tonal structure.

    The trend remains consistent: streaming Nova-3 models show the strongest relative WER reductions, reinforcing their suitability for real-time applications such as voice agents, live captioning, and AI telephony systems.

    Word Error Rate (WER) – Relative Improvement (11 New Nova-3 Languages)

    Key highlights

    • Every language improved over Nova-2 in both batch and streaming transcription
    • Korean, Czech, and Hindi show the largest gains, with up to 27 percent WER reduction
    • Streaming consistently outperforms batch, confirming Nova-3’s strength in live audio environments
    • Improvements span multiple language families, including Slavic, Uralic, Indo-Aryan, Japonic, Koreanic, and Austroasiatic
    • Languages with higher baseline complexity such as Korean, Hindi, and Polish see some of the biggest jumps, reflecting Nova-3’s upgraded handling of compound words, inflection, and non-Latin characters

    The takeaway: Nova-3 is no longer just outperforming in English and Western European languages. It is scaling accuracy gains globally, across scripts, dialects, and very different linguistic systems.

    Why It Matters

    This expansion reflects Nova-3’s maturity as a global ASR foundation for enterprises building multilingual products and voice-enabled services.

    Instead of forcing a single speech pattern onto every language, Nova-3 adapts to the structure and behavior of each one, whether it involves tones, inflection, compound words, or non-Latin scripts.

    For developers and enterprise teams, this means:

    • Broader language coverage with consistent model performance
    • Better recognition across both formal and conversational speech
    • Lower latency and fewer transcription errors in multilingual workflows
    • Faster domain adaptation through Keyterm Prompting, without retraining
    • A single model family that scales across markets instead of piecemeal STT engines

    Nova-3 brings enterprise-grade accuracy to regions where legacy ASR systems fail, allowing voice agents, analytics platforms, and real-time applications to operate naturally in any supported language.

    Getting Started

    Switching to any of the newly added Nova-3 languages is as simple as adding a language parameter to your request. For example:

    curl --request POST \
      --header "Authorization: Token YOUR_DEEPGRAM_API_KEY" \
      --header "Content-Type: audio/wav" \
      --data-binary @youraudio.wav \
      "https://api.deepgram.com/v1/listen?model=nova-3&language=ko"
    

    Replace language=ko with any of the supported codes below to transcribe audio in that language:

    bg, cs, fi, hi, hu, ja, ko, pl, ru, uk, vi

    You can use Nova-3 for both streaming and batch transcription, no retraining or configuration required. Explore the full list in the Models & Languages Overview.

    Looking Ahead

    With 11 new languages now live, Nova-3 is continuing its path toward full global coverage, expanding accuracy and real-time reliability far beyond Western European speech. This release strengthens Nova-3’s reach across Slavic, Uralic, Indo-Aryan, and East Asian languages, and it is only the beginning.

    The next wave of expansion will extend deeper into Southern Europe, the Baltics, Southeast Asia, and South Asia, continuing our focus on delivering speech recognition that feels native no matter the language family, alphabet, or acoustic environment.

    As Nova-3 grows, the goal remains the same: voice AI that works everywhere, for everyone, accurate in fast speech, resilient in noisy environments, and adaptable to local dialects and cultural context.

    Unlock Enterprise-Grade Voice AI Today

    Sign up free and unlock $200 in credits, enough to power over 750 hours of transcription or 200 hours of speech-to-text across Nova-3’s growing language suite. Explore details on our Models & Languages Overview page and experience Nova-3’s world-class adaptability for yourself.

  • Dec 29, 2025
    • Parsed from source:
      Dec 29, 2025
    • Detected by Releasebot:
      Dec 23, 2025
    • Modified by Releasebot:
      Jan 16, 2026

    December 29, 2025

    Deepgram self-hosted release expands Aura-2 TTS with Dutch, German, French, Italian, and Japanese and adds Engine flux metrics for smarter scaling. It also introduces PHI redaction, optional blocking on model pre-loading, and general improvements across container images.

    Container Images Release

    Deepgram Self-Hosted December 2025 Release (251229)

    Container Images (release 251229)

    • quay.io/deepgram/self-hosted-api:release-251229
      • Equivalent image to:
      • quay.io/deepgram/self-hosted-api:1.173.4
    • quay.io/deepgram/self-hosted-engine:release-251229
      • Equivalent image to:
      • quay.io/deepgram/self-hosted-engine:3.107.0
    • Minimum required NVIDIA driver version:
      • >=570.172.08

    • quay.io/deepgram/self-hosted-license-proxy:release-251229
      • Equivalent image to:
      • quay.io/deepgram/self-hosted-license-proxy:1.9.2
    • quay.io/deepgram/self-hosted-billing:release-251229
      • Equivalent image to:
      • quay.io/deepgram/self-hosted-billing:1.12.1

    This Release Contains The Following Changes

    • Expands Aura-2 TTS language support - Adds TTS support for Dutch, German, French, Italian, and Japanese. See the relevant changelog entry. Reach out to your Deepgram representative to obtain the new Aura-2 models.
    • Adds Engine metrics for Flux - Adds flux_max_streams, flux_used_streams, flux_fraction_streams, and flux_cursor_latency metrics to the Engine container for Flux monitoring and auto-scaling.
    • Adds PHI redaction category - Enables the use of redact=phi to redact six applicable sub-categories of PHI entities. See the related changelog entry for details.
    • Allows optional blocking on model pre-loading before Engine becomes ready - By default, models pre-load in the background, which can cause a delay on the first request. Setting blocking = true under [preload_models] in engine.toml makes the Engine wait until model pre-loading completes before accepting traffic. The tradeoff is longer startup time (potentially minutes), so orchestration and health checks should allow for a delayed readiness signal (see the sketch after this list).
    • Includes General Improvements — Keeps our software up-to-date.
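
    A hedged illustration of the pre-loading option; the section and key names come from this entry, and the rest of engine.toml is omitted:

    # engine.toml
    [preload_models]
    # Wait for model pre-loading to finish before the Engine reports ready.
    # Tradeoff: startup can take minutes, so relax readiness-probe timeouts accordingly.
    blocking = true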
  • Dec 23, 2025
    • Parsed from source:
      Dec 23, 2025
    • Detected by Releasebot:
      Dec 28, 2025

    A Year of Voice AI at Scale: How Deepgram Turned AI Speech Into Infrastructure

    Deepgram turns voice AI into infrastructure with Nova-3 and Aura-2 delivering real-time, enterprise-grade speech across languages and industries. A unified Voice Agent API, streaming capabilities, and global deployments mark a shift to scalable, in-context voice for business.

    Introduction: A Year Where Voice Became Infrastructure

    Voice AI crossed a threshold this year—and Deepgram was at the center of it. From trillion-word scale and enterprise-grade agents to real-time speech-to-speech and global expansion, this is the story of how voice stopped being a feature and became infrastructure.

    Every year has its own texture. Some years feel like exploration—testing edges, learning limits, asking what’s possible. Others are about consolidation—turning ideas into systems that scale. For us at Deepgram, this year was something else entirely: a year where voice stopped being an experiment and became infrastructure.
    Looking back over the past twelve months, what stands out isn’t any single launch or announcement. It’s the throughline. Again and again, we focused on the same idea: voice AI shouldn’t be stitched together, fragile, or theatrical. It should be fast, accurate, controllable, and ready for real work—across industries, languages, and geographies.
    This year, we shipped relentlessly. We expanded globally. We raised the bar on accuracy, latency, and enterprise readiness. And we watched customers—from solo developers to global enterprises—build systems that simply wouldn’t have been possible a year ago.
    Here’s how it unfolded, month by month.

    January: Momentum, Measured in Scale

    We started the year with clarity—and proof.
    In January, we shared that Deepgram had entered 2025 cash-flow positive, serving more than 400 enterprise customers. Over the past four years, annual usage had grown 3.3×. Our models had processed over 50,000 years of audio and transcribed more than one trillion words.
    Those numbers mattered, not as vanity metrics, but as validation. Voice AI is notoriously hard to scale: accuracy degrades, latency creeps in, costs spiral. The fact that we could grow usage at that pace—while improving performance and unit economics—confirmed something we’d believed for a long time. When voice AI is built as infrastructure, not a demo, it compounds.
    January set the tone: this wasn’t a year of promises. It was a year of delivery.

    February: Experimenting in Public with Vibe Coder

    In February, we did something deliberately small and something deliberately big.
    Here’s the big thing: We introduced Nova-3, setting a new standard for AI-driven speech-to-text across domains. If you’re familiar at all with Deepgram, then you know just how impactful this announcement was. However, if you haven’t heard of Nova-3, perhaps the rest of this recap will clue you in on how big Nova-3 became.
    The small announcement was that we released Vibe Coder, an open-source VS Code extension designed to explore voice-based “vibe coding” inside AI-powered IDEs like Cursor and Windsurf. We were clear from the start: this wasn’t a fully baked product. It was an experiment.
    But it mattered. Vibe Coder represented a belief that voice will increasingly live inside developer workflows—not as dictation, but as a control surface. We wanted to see how speech could shape intent, iteration, and flow. We wanted feedback. We wanted to learn alongside the community.
    Whether Vibe Coder grows into something bigger or simply informs our next move, February reminded us of the value of curiosity—and of shipping early.

    March: Healthcare and the Enterprise, Front and Center

    March was about focus.
    We introduced Nova-3 Medical, our most advanced medical speech-to-text model to date. Built specifically for clinical environments, it delivered unmatched accuracy on medication names, diagnostic terms, and procedure details—while filtering out irrelevant noise that plagues generic models. Just as importantly, it was designed with HIPAA-compliant architecture and enterprise-grade security from day one.
    That same month, we announced a partnership with Genesys, launching the Deepgram Genesys Transcription Connector. Together, we enabled more accurate, real-time voice automation inside one of the world’s leading customer experience platforms.
    We also published the State of Voice AI 2025, offering a data-driven look at how enterprises were actually deploying voice systems.
    Healthcare and contact centers may look different on the surface, but they share the same requirement: voice AI that works under pressure. March was about meeting that standard.

    April: A Breakout Month for the Platform

    April was, simply put, huge.
    We crossed a major technical milestone: the development of a speech-to-speech model that operates without converting speech to text at any stage. This was a pivotal step toward fully contextual, end-to-end voice systems that preserve nuance, intonation, and emotional tone in real time.
    And we introduced Aura-2, our most professional, cost-effective, and enterprise-grade text-to-speech model yet—built not for entertainment, but for real conversations.
    April wasn’t about one launch. It was about the platform coming into view.

    May: Voice in the Real World

    In May, the story shifted from capability to impact.
    We announced a partnership with Think41, a full-stack GenAI consulting firm building secure, enterprise-ready AI agents. Together, we showed what’s possible when low-latency speech recognition meets real-time agent assist: faster resolutions, better customer experiences, and systems that support humans instead of slowing them down.
    It was a reminder that the value of voice AI isn’t theoretical. It shows up in conversations—live ones.

    June: One API, Real Conversations

    June marked a turning point.
    We launched the Deepgram Voice Agent API—the industry’s only enterprise-ready, real-time, cost-effective conversational AI API. For developers, it meant one streaming API instead of stitching together STT, TTS, and orchestration layers. For enterprises, it meant control: no black boxes, no hidden constraints.
    We published the Voice Agent Quality Index (VAQI), offering a new benchmark for conversational performance. Independent validation from Coval confirmed Flux’s performance: 50% lower latency to first token, faster turn detection, and accuracy on par with Nova-3.
    This was the culmination of years of work. Voice agents finally felt cohesive—fast, controllable, and production-ready.
    Finally, we introduced Nova-3 Medical Streaming, bringing clinical-grade accuracy to real-time transcription without sacrificing ultra-low latency.

    July: Recognition and Global Reach

    In July, we introduced Saga, our Voice OS for developers.
    Saga lets developers control their workflows with natural speech—across tools like Cursor, MCP, and Slack—eliminating context switching and friction. It wasn’t about novelty. It was about flow.
    External validation caught up with internal momentum.
    Deepgram received the 2025 Voice AI Technology Excellence Award from CUSTOMER Magazine, recognizing Nova-3 for its accuracy, real-time multilingual transcription, and instant customization.
    We also expanded our infrastructure globally, announcing the general availability of Deepgram Dedicated, a fully managed single-tenant runtime, alongside early access to our EU-hosted API endpoint. For European customers, this unlocked true in-region inference without compromise.
    July reinforced a theme: enterprise voice AI is global—or it’s incomplete.

    August: Raising the Bar Again

    August was relentless.
    We expanded Nova-3 with support for German, Dutch, Swedish, and Danish.
    We saw Aura-2 recognized with the 2025 Contact Center Technology Award, validating its impact on both customer and employee experience.
    We also leveled up the Voice Agent API with GPT-5 and GPT-OSS-20B, giving developers new choices across latency, reasoning depth, and open-source flexibility.
    And we signed a strategic collaboration agreement with AWS, accelerating global deployment of voice AI across STT, TTS, and speech-to-speech.
    August felt like acceleration squared.

    September: Language as a First-Class Feature

    In September, we expanded Nova-3 to support Spanish, French, and Portuguese.
    Each language expansion wasn’t just a checkbox. It represented work on accents, code-switching, morphology, and real-world audio conditions. Voice AI only works when it works everywhere.
    Furthermore, we were featured on Fast Company’s Seventh Annual List of “The 100 Best Workplaces for Innovators”.

    October: Solving the Hardest Problem in Voice Agents

    October was about mitigating interruptions—the bane of conversational systems.
    First we introduced Flux, the first real-time conversational speech recognition model built specifically for voice agents. Flux solved interruptions without trading off latency.
    Then, we announced our Voice Agent API’s Integration with AWS Bedrock. For enterprise users in contact centers, healthcare, and customer experience, Deepgram's Voice Agent API integrated with Bedrock unlocks ultra-accurate, real-time speech AI—backed by AWS’s security, scalability, and compliance.
    Partners like Lindy, whose Gaia assistant runs on Flux, showed what natural phone conversations could finally feel like.
    Then, we expanded Nova-3 again with Italian, Turkish, Norwegian, and Indonesian support, continuing our steady global march.

    November: A Voice OS for Developers

    In November, Nova-3 expanded with numerous new languages, broadening European and Asian coverage. Specifically, we upgraded it to support Bulgarian, Czech, Hungarian, Polish, Russian, Ukrainian, Finnish, Hindi, Japanese, Korean, and Vietnamese.

    December: Closing the Loop

    December brought the year full circle.
    We launched Deepgram’s voice AI integrations with Amazon Connect, Amazon Lex, and Amazon SageMaker, bringing real-time speech intelligence directly into platforms enterprises already trust. Our EU Endpoint became generally available.
    Aura-2 learned to speak Dutch, French, German, Italian, and Japanese. And Nova-3 got upgraded yet again with keyterm prompting and ten new languages: Greek, Romanian, Slovak, Catalan, Lithuanian, Latvian, Estonian, Flemish, Swiss German, and Malay.
    It was a fitting close: deeper integrations, broader reach, and voice AI that’s ready for wherever the conversation happens next.

    Looking Ahead

    If this year proved anything, it’s that voice is no longer the interface of the future. It’s the infrastructure of the present.
    And we’re just getting started.

    Original source Report a problem
  • Dec 16, 2025
    • Parsed from source:
      Dec 16, 2025
    • Detected by Releasebot:
      Dec 23, 2025
    Deepgram logo

    Deepgram

    Deepgram Expands Nova-3 with 10 New Languages and Multilingual Keyterm Prompting

    Nova-3 expands with 10 new monolingual languages and a major Multilingual upgrade, boosting enterprise ASR accuracy and real-time performance across diverse scripts and tones. New Multilingual Keyterm Prompting lets you inject up to 500 tokens to improve domain-specific recognition without retraining.

    Overview

    Deepgram is expanding Nova-3 with support for 10 new monolingual languages and a major upgrade to Nova-3 Multilingual. This release strengthens Nova-3's position as one of the most advanced enterprise ASR models available today, delivering accuracy, adaptability, and linguistic precision across diverse language families, scripts, and speech behaviors.

    Built for Global Speech Diversity

    This update brings Nova-3 into regions that challenge traditional ASR systems: languages with tonal variation, morphological complexity, and multi-script writing systems. Nova-3 handles these differences natively, preserving low latency and enterprise-grade accuracy across both batch and streaming modes.

    Nova-3 now supports 10 new monolingual languages across Southern Europe, the Baltics, and Southeast Asia, along with a major upgrade to multilingual accuracy through Keyterm Prompting.

    10 New Languages Now Live in Nova-3

    Earlier Nova-3 expansions focused on widely spoken European and Asian languages. This update represents the next phase, expanding into languages with distinct phonetic structures, scripts, and grammatical systems.

    Southern and Eastern Europe

    • Greek (el)
      Characterized by inflectional morphology and variable word stress. Nova-3 improves modeling of vowel alternations and compound forms.

    • Romanian (ro)
      A Romance language with Slavic influence and strong case inflection. Nova-3 delivers better handling of endings, stress patterns, and mid-word vowel shifts.

    • Slovak (sk)
      Complex consonant clusters and rich case systems make Slovak challenging for general ASR. Nova-3 improves recognition of grammatical gender and declension patterns.

    • Catalan (ca)
      A Romance language sharing features with both Spanish and French, with vowel reduction and multiple dialects. Nova-3 strengthens recognition in conversational and broadcast speech.

    Northern and Baltic Europe

    • Lithuanian (lt)
      A Baltic language with free stress and pitch accent. Nova-3 improves accuracy for rich morphology and long compounds.

    • Latvian (lv)
      Features vowel length contrast and consonant palatalization. Nova-3 increases clarity and keyword recall at varied speaking speeds.

    • Estonian (et)
      Features a distinctive three-way quantity system in which both vowel and consonant length are contrastive. Nova-3 improves segmentation and prosodic modeling in real-time scenarios.

    • Flemish (nl-BE)
      The Belgian variant of Dutch with regional phonetic shifts. Nova-3 enhances accuracy for colloquial and broadcast environments.

    • Swiss German (de-CH)
      A regional variant with extensive dialectal diversity. Nova-3 adapts more effectively to high-variance speech patterns.

    Southeast Asia

    • Malay (ms)
      Combines Austronesian roots with English and Arabic loanwords. Nova-3 improves accuracy in multilingual settings and conversational audio.

    Benchmarking: Accuracy Gains Across Languages

    Nova-3 continues to deliver measurable accuracy improvements over Nova-2, reducing Word Error Rate (WER) across both batch and streaming transcription. These gains hold across languages that vary widely in morphology, phonetics, and script complexity.

    A clear trend continues to emerge: streaming transcription often achieves the strongest relative WER reductions, reinforcing Nova-3’s suitability for real-time applications such as voice agents, live captioning, and AI telephony systems.
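
    For reference, the percentages cited below follow the standard definition of relative WER reduction (we assume the chart uses the same convention):

    relative WER improvement (%) = 100 × (WER_Nova-2 - WER_Nova-3) / WER_Nova-2

    So a 20 percent improvement means Nova-3 makes one fifth fewer word errors than Nova-2 on the same audio.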

    Word Error Rate (WER) – Relative Improvement (10 New Nova-3 Languages)

    Key Highlights

    • All ten languages show accuracy gains in either batch or streaming modes, with many improving in both.
    • Malay, Romanian, and Slovak show some of the largest relative WER reductions, with improvements exceeding 20 percent in several cases.
    • Streaming models outperform batch in roughly half of the languages, supporting Nova-3’s strength in conversational and low-latency workflows.
    • Languages with complex morphology or less-standardized orthography such as Lithuanian, Latvian, and Slovak show robust gains, indicating improved handling of case systems, inflection, and compound formation.
    • Swiss German and Flemish deliver strong improvements despite dialectal variation, demonstrating Nova-3’s adaptability across regional speech patterns.

    New: Multilingual Keyterm Prompting

    Nova-3 Multilingual now supports Multilingual Keyterm Prompting, allowing developers to pass up to 500 tokens (about 100 words) to improve recognition of brand names, technical terminology, and domain-specific vocabulary across multilingual audio.

    Nova-3 can now prioritize these terms across all supported languages in a single request. This is especially valuable for global enterprises in finance, healthcare, retail, and customer support.

    No retraining is required. Nova-3 adapts instantly when you provide a list of key terms.
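
    As a concrete sketch, a pre-recorded Nova-3 Multilingual request with key terms can look like the following, using Python's requests library against the /v1/listen endpoint. The audio file, API-key placeholder, and example terms are hypothetical; the repeated keyterm query parameter follows the Keyterm Prompting documentation.

    import requests

    DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder; use a real project key

    # Repeated "keyterm" query parameters bias recognition toward these terms.
    params = [
        ("model", "nova-3"),
        ("language", "multi"),          # Nova-3 Multilingual
        ("keyterm", "Deepgram"),
        ("keyterm", "Nova-3"),
        ("keyterm", "latency SLA"),     # hypothetical domain term
    ]

    with open("multilingual_call.wav", "rb") as audio:  # hypothetical file
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params=params,
            headers={
                "Authorization": f"Token {DEEPGRAM_API_KEY}",
                "Content-Type": "audio/wav",
            },
            data=audio,
        )

    print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])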

    Why It Matters

    Nova-3 continues to evolve as a unified speech recognition foundation for global products and workflows. Instead of applying one pattern to every language, Nova-3 adapts to each language's structure, whether that involves tones, inflections, or non-Latin alphabets.

    For developers and enterprise teams, this means:

    • Consistent performance across diverse global markets
    • Improved recognition in both conversational and formal speech
    • Lower latency and fewer transcription errors in multilingual environments
    • Flexible customization with Keyterm Prompting for domain-specific accuracy

    Getting Started

    Switching to any of the newly supported languages is simple. Update your API request with the appropriate language code:

    Supported language codes:

    el, lt, lv, ms, sk, ca, et, nl-BE, de-CH, ro
    

    To use multilingual Keyterm Prompting, pass your list of key terms through the keyterms parameter in your Nova-3 Multilingual request.
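
    For monolingual audio, only the language parameter changes. A minimal sketch in the same style (file name and key are placeholders):

    import requests

    DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder

    with open("interview.wav", "rb") as audio:  # hypothetical Romanian-language audio
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": "nova-3", "language": "ro"},  # any code from the list above
            headers={
                "Authorization": f"Token {DEEPGRAM_API_KEY}",
                "Content-Type": "audio/wav",
            },
            data=audio,
        )

    print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])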

    Looking Ahead

    With 10 new languages and multilingual Keyterm Prompting now live, Nova-3 continues its progress toward full global coverage. Accuracy, adaptability, and real-time reliability continue to improve across language families, scripts, and acoustic environments.

    The goal is clear: voice AI that works everywhere, for everyone. Accurate in fast speech, resilient in noisy environments, and adaptable to local dialects and cultural context.

    Unlock Enterprise-Grade Voice AI Today

    Sign up free and unlock $200 in credits, enough to power over 750 hours of transcription or 200 hours of text-to-speech across Nova-3's growing language suite. Explore details on our Models & Languages Overview page and experience Nova-3's world-class adaptability for yourself.

    Original source Report a problem
