Hume Release Notes

43 release notes curated from 43 sources by the Releasebot Team. Last updated: May 27, 2026

  • Apr 18, 2024
    • Date parsed from source:
      Apr 18, 2024
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Introducing Hume’s Empathic Voice Interface (EVI) API

    Hume releases the EVI API and Configuration API, bringing emotionally intelligent voice experiences to apps with real-time WebSocket conversations, empathic responses, prosody awareness, interruptibility, and customizable prompts, LLMs, tools, and voices.

    Integrate emotionally intelligent voice experiences into any application with our EVI API.

    Last month, we released the demo of our Empathic Voice Interface (EVI) API. The first emotionally intelligent voice AI API is finally here!

    EVI does a lot more than stitch together transcription, LLMs, and text-to-speech. With a new empathic LLM (eLLM) that processes your tone of voice, EVI unlocks new capabilities like knowing when to speak, generating more empathic language, and intelligently modulating its own tune, rhythm, and timbre.

    EVI is the first voice AI that really sounds like it understands you. By adapting its tone of voice, it emulates the way humans convey meaning beyond words, unlocking more efficient, smooth, and satisfying AI interactions.

    Accessing EVI: integrating emotionally intelligent voice AI into your applications

    The main way to work with EVI is through a WebSocket connection that sends audio and receives responses in real time. This enables fluid, bidirectional dialogue where users speak, EVI listens and analyzes their voice, and EVI generates emotionally intelligent responses. You start a conversation by connecting to the WebSocket and streaming the user’s voice input to EVI.

    As the user speaks to EVI, the client can also send EVI text to speak aloud, which is intelligently integrated into the conversation.
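
    The client side of this exchange can be sketched as a pair of message builders: one that wraps streamed audio chunks and one that asks EVI to speak a given text. This is a minimal sketch, not the documented schema. The message type names (`audio_input`, `assistant_input`), the base64 encoding of audio, and the chunk size are all assumptions made for illustration; the EVI documentation is the authority on the real message format.

```python
import base64
import json

# Hypothetical message type names; check the EVI documentation for the real schema.
AUDIO_INPUT = "audio_input"
ASSISTANT_INPUT = "assistant_input"


def audio_message(chunk: bytes) -> str:
    """Wrap a chunk of raw audio bytes in a JSON text frame (base64 is an assumption)."""
    encoded = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"type": AUDIO_INPUT, "data": encoded})


def speak_message(text: str) -> str:
    """Build a frame asking the assistant to say `text` aloud."""
    return json.dumps({"type": ASSISTANT_INPUT, "text": text})


def chunk_audio(audio: bytes, chunk_size: int = 4096):
    """Yield fixed-size slices of an audio buffer so they can be streamed in order."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

    A client would send each `audio_message(...)` frame over the open WebSocket as the user speaks, and interleave a `speak_message(...)` frame whenever it wants EVI to say something specific.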

    See our documentation for more information on how to integrate EVI into your application. A great way to get started is on our platform, which allows developers to interactively configure custom system prompts and voices.

    Empathic AI (eLLM) features

    • Responds at the right time: Uses your tone of voice for state-of-the-art end-of-turn detection — the true bottleneck to responding rapidly without interrupting you.
    • Understands users’ prosody: Provides streaming measurements of the tune, rhythm, and timbre of the user’s speech using Hume’s prosody model, integrated with our eLLM.
    • Forms its own natural tone of voice: Guided by the users’ prosody and language, our model responds with an empathic, naturalistic tone of voice, matching the users’ nuanced “vibe” (calmness, interest, excitement, etc.). It responds to frustration with an apologetic tone, to sadness with sympathy, and more.
    • Responds to expression: Powered by our empathic large language model (eLLM), EVI crafts responses that are not just intelligent but attuned to what the user is expressing with their voice.
    • Always interruptible: Stops rapidly whenever users interject, listens, and responds with the right context based on where it left off.
    • Aligned with well-being: Trained on human reactions to optimize for positive expressions like happiness and satisfaction. EVI continuously learns from users’ reactions.

    Configurability: customizing your voice AI API

    With the general release of EVI, we’re also releasing our Configuration API, which enables developers to customize their EVI: the system prompt, the LLM, the tools that EVI can use, the context to use during the conversation, and more. You can configure EVI through either the API or the UI. The configurable elements are listed below:

    • System prompt: customize EVI’s personality, response style, and the content of speech through prompt engineering. Use our guidelines for prompting EVI to improve the performance, or try out our sample prompts on the voice playground.
    • Inject other LLM responses into our model: Hume’s empathic large language model (eLLM) always generates the first response to a query, but you can configure other LLMs to formulate longer responses.
      • Integrate another LLM API: Currently we support Fireworks Mixtral8x7b, all OpenAI models, and all Anthropic models.
      • Bring your own LLM or generate text another way: Connect our WebSocket to your own server with your own tool or text generation, allowing you to determine all EVI messages in the conversation.
    • Bring your own model: Rather than using our LLMs, connect our WebSocket to your own server to generate your own text, allowing you to determine exactly how EVI responds in the conversation.
    • TTS: Use just EVI’s expressive voice by sending our API text to be spoken aloud.
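
    To make the configurable elements above concrete, here is one way a client might model a configuration before submitting it. This is only a sketch: the class name `EviConfig`, every field name, and the JSON shape are hypothetical and are not taken from the Configuration API reference.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EviConfig:
    # All field names are hypothetical, chosen to mirror the list above.
    name: str
    system_prompt: str
    supplemental_llm: Optional[str] = None  # None means only the eLLM responds
    tools: List[str] = field(default_factory=list)
    context: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the config, dropping optional fields that were left unset."""
        payload = {k: v for k, v in asdict(self).items() if v not in (None, [])}
        return json.dumps(payload, indent=2)
```

    For example, `EviConfig(name="support-bot", system_prompt="Be concise and warm.").to_json()` produces a payload containing only the two required fields.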

    We plan to add more configuration options soon, allowing EVI to use tools, change its speaking style, and more. Join our Discord for product updates and technical support.

    Learn more about emotionally intelligent voice AI

    Build with our conversational AI voice

    Documentation

    Start building

    Speak with our Empathic Voice Interface

    Demo

    Original source
  • Mar 25, 2024
    • Date parsed from source:
      Mar 25, 2024
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Hume Raises $50M Series B and Releases New Empathic Voice Interface

    Hume launches its Empathic Voice Interface, a conversational voice AI that detects when users finish speaking, responds with expressive speech, and learns from reactions to improve satisfaction. The announcement also introduces Hume’s new empathic large language model.

    To build emotionally intelligent voice AI, Hume AI raised a $50m Series B round led by EQT Ventures and joined by Union Square Ventures, Nat Friedman & Daniel Gross, Metaplanet, Northwell Holdings, Comcast Ventures, and LG Technology Ventures.

    The company is unveiling its new flagship product, an Empathic Voice Interface (EVI). The conversational voice-to-voice AI knows when users are finished speaking and learns to generate vocal responses optimized for user satisfaction. EVI is powered by a new form of multimodal generative AI called an empathic large language model (eLLM) developed by Hume.

    Building the first empathic AI

    Hume AI is delighted to announce its $50m Series B! The round was led by EQT Ventures, with participation from Union Square Ventures, Nat Friedman & Daniel Gross, Metaplanet, Northwell Holdings, Comcast Ventures, and LG Technology Ventures.

    In conjunction with today's Series B announcement, Hume is launching its Empathic Voice Interface (EVI), a first-of-its-kind conversational AI with emotional intelligence. EVI is trained on data from millions of human interactions and uses vocal tones to understand when users finish speaking, predict their preferences, and optimize responses for satisfaction over time.

    The future of voice AI

    AI voice products have the potential to revolutionize our interaction with technology. However, their often stilted and mechanical responses act as barriers to truly immersive conversations. The goal with EVI is to provide the basis for engaging voice-first experiences that emulate the natural speech patterns of human conversation.

    Speak with the next generation of emotionally intelligent voice AI

    EVI uses a new form of multimodal generative AI that integrates large language models (LLMs) with expression measures, which Hume refers to as an empathic large language model (eLLM). Our eLLM enables EVI to adjust the words it uses and its tone of voice based on the context and the user’s emotional expressions. Developers will be able to integrate voice-first experiences into any application with a few lines of code. EVI will be publicly available in April. Sign up here for updates: link.hume.ai/evi-waitlist

    Learn about the features that make EVI a human-like conversationalist:

    • A universal voice interface: a single API for transcription, frontier LLMs, and text-to-speech.
    • End-of-turn detection: uses your tone of voice for state-of-the-art end-of-turn detection, eliminating awkward overlaps.
    • Interruptibility: stops speaking when interrupted and starts listening, just like a human.
    • Responds to expression: understands the natural ups and downs in pitch and tone used to convey meaning beyond words.
    • Expressive TTS: generates the right tone of voice to respond with natural, expressive speech.
    • Aligned with your application: learns from users' reactions to self-improve by optimizing for happiness and satisfaction.

    Hume continues its dedication to the development of safe and responsible AI through The Hume Initiative, a nonprofit that brings together AI researchers, ethicists, social scientists, and legal scholars to develop ethical guidelines for empathic AI.

    Alan Cowen, CEO and Chief Scientist, sees empathic AI as essential to aligning AI with human well-being:

    “The main limitation of current AI systems is that they’re guided by superficial human ratings and instructions, which are error-prone and fail to tap into AI’s vast potential to come up with new ways to make people happy. By building AI that learns directly from proxies of human happiness, we’re effectively teaching it to reconstruct human preferences from first principles and then update that knowledge with every new person it talks to and every new application it’s embedded in.”

    Shaping the future of voice AI and empathic technology

    Today’s announcement follows a period of exciting growth for Hume. Over the past year, Hume launched two key products: the Expression Measurement API, an advanced toolkit for measuring human emotional expression, and Custom Models, which uses transfer learning on those measurements to predict human preferences. Additionally, Hume grew its foundational databases to include naturalistic data from over a million diverse participants, doubled its headcount from 15 to 30 employees and published over eight academic articles in top journals.

    Funding will accelerate Hume’s growth into a global player in the generative AI space and cement empathic AI as an industry standard. The capital will be allocated to scale Hume’s team, accelerate its AI research, and continue the development of its Empathic Voice Interface. Interested in helping us build the future of empathic AI? Apply here: hume.ai/careers

    Original source
  • Sep 11, 2024
    • Date parsed from source:
      Sep 11, 2024
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Comparing the world’s first voice-to-voice AI models: EVI 2 and GPT-4o

    Hume releases EVI 2, its voice-to-voice AI model, as an app and API for developers. The update highlights faster natural voice interactions, empathic responses, customizable personalities, tool use, and broad interoperability for building voice-first experiences.

    Comparing EVI 2 and GPT-4o voice

    Imagine if you could speak naturally to any product or app—four times faster than typing—and it could talk back. Imagine if, based not just on what you said but also how you said it, it did what you wanted it to do. That’s what voice-to-voice foundation models, the latest major breakthrough in AI, will enable for many, if not most, products and services in the coming months and years.

    The world’s first working voice-to-voice models are Hume AI's Empathic Voice Interface 2 (EVI 2) and OpenAI's GPT-4o Advanced Voice Mode. EVI 2 was publicly released in September 2024, available as an app and an API that developers can build on. GPT-4o voice was previewed to a small number of ChatGPT users in mid-2024, and released for developers as the Realtime API in October 2024. Here we explore the similarities, differences, and potential applications of these systems.

    What are voice-to-voice AI models?

    Voice-to-voice AI models apply the same principles as large language models (LLMs), but they directly process audio of the human voice instead of text. Whereas large language models are trained on millions of pages of text, voice-to-voice models are trained on millions of hours of recorded voice data. These models enable users to speak with AI through voice alone.

    In many ways, these new voice-to-voice models bring to fruition what legacy technologies like Siri and Alexa had long promised. Siri and Alexa were presented as general-purpose voice understanding systems, with the capability to fulfill arbitrary voice queries. Unfortunately, Siri and Alexa were not actually powered by general-purpose voice AI models, but by traditional computer programs that generated fixed responses to a hardcoded set of keywords.

    As general-purpose systems that can fulfill arbitrary voice queries, voice-to-voice models make possible, for the first time, the things people always wished Siri and Alexa could do. Since these kinds of voice assistants were first launched over a decade ago, many have forgotten what made them so exciting to begin with. Voice is how humans interact with each other, our most natural modality for communication. Consider:

    • The average person speaks at 150 words per minute but types at only 40 wpm. Voice makes interacting with computers - especially for input - much faster.
    • Speech recognition accuracy has improved by over 5x since 2012, now rivaling or exceeding human transcription in accuracy.
    • Voice-to-voice models have the potential to democratize computing for 773 million illiterate adults worldwide.
    • For 2.2 billion people with visual impairments, voice-to-voice models are not just convenient - they can become their primary gateway to digital interaction.
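
    The speed claim in the first bullet is simple arithmetic on the two quoted rates. A short sketch using those figures (the 600-word message length is an arbitrary example, not a number from the source):

```python
SPEAKING_WPM = 150  # average speaking rate quoted above
TYPING_WPM = 40     # average typing rate quoted above


def minutes_to_enter(words: int, wpm: float) -> float:
    """Minutes needed to enter `words` words at a given rate."""
    return words / wpm


# Ratio of speaking to typing speed: 150 / 40 = 3.75, i.e. roughly four times faster.
speedup = SPEAKING_WPM / TYPING_WPM

typed_minutes = minutes_to_enter(600, TYPING_WPM)     # 15.0
spoken_minutes = minutes_to_enter(600, SPEAKING_WPM)  # 4.0
```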

    Voice-to-voice models will allow billions more people to use state-of-the-art technology with seamless communication. Within a decade, our current interfaces may feel as outdated as command-line interfaces in a GUI world.

    Comparing EVI 2 and GPT-4o voice

    Similarities

    EVI 2 and GPT-4o voice have many capabilities in common. Both are multimodal language models that can process both audio and language and output both voice and language. As a result, they can both converse rapidly and fluently with users with sub-second response times, understand a user’s tone of voice, generate any tone of voice, and even respond to some more niche requests like changing their speaking rate or rapping. Voice-to-voice models overcome the inherent limitations of traditional stitched-together systems that rely on separate steps for transcription, language modeling, and text-to-speech.

    Differences

    EVI 2 is optimized for emotional intelligence. EVI 2 excels at anticipating and adapting to your preferences, made possible by its special training for emotional intelligence. EVI 2 leverages Hume's research on human expression to interpret and respond to subtle emotional cues in the user's voice. Then, it can use these cues to make more empathic responses that are more likely to support the user’s well-being. In contrast, while ChatGPT voice is capable of interpreting tone and responding with an emotional tone of voice, it does not have the same depth of focus on emotional intelligence as EVI 2, and does not appear to be trained to promote the user’s well-being. Further, EVI 2 provides accurate emotional expression measures based on a decade of research for all of the user's speech.

    EVI 2 is trained to maintain compelling personalities. Hume’s speech-language model is trained to maintain characters and personalities that are fun and interesting to interact with. On the other hand, GPT-4o voice is currently restricted to a small set of prototypical “AI assistant” personalities.

    EVI 2 is customizable. Where the Realtime API has eight preset voices with relatively static personalities, EVI 2 can emulate an infinite number of personalities, including accents and speaking styles, with flexible prompting and voice modulation tools. We developed a novel voice modulation approach that allows anyone to adjust EVI 2's eight (and counting) base voices along a number of continuous scales, including gender, nasality, pitch, and more. This allows developers to create any custom voice, not just choose from a limited set.

    EVI 2 is designed for developers. Available through our API, EVI 2 is designed for developers with a customizable voice and personality that can be tailored to specific apps and users. It also includes features like tool use, phone calling, custom language models, a wide variety of conversational controls, and comprehensive documentation. In contrast, OpenAI's voice models were designed for ChatGPT, with particular “AI assistant” personalities that can be hard to adjust. EVI has been available as an API since early 2024 and has been tested by thousands of developers and improved with new features since then, while the Realtime API has a limited feature set.

    EVI 2 is designed for scale. The Realtime API costs $0.06 per minute of audio input and $0.24 per minute of audio output, resulting in a cost of about $0.15/min or $9/hour. In contrast, the EVI API costs $0.072/min of conversation or $4.32/hour, about half the cost of OpenAI's voice offering. Further, Hume offers discounts at scale and a grant program for startups. The dramatic price difference and the fact that the Realtime API is still in an early-stage beta make the Empathic Voice Interface API far more practical for real-world applications. For companies looking to deploy voice AI at scale, this cost can be the difference between a profitable product and an unsustainable one.
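
    The per-minute and per-hour figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the $0.15/min blended rate comes from an even split between input and output audio; that split is an assumption (it is the one that matches the quoted number), and real conversations may be weighted differently.

```python
REALTIME_INPUT_PER_MIN = 0.06   # USD per minute of audio input (from the text)
REALTIME_OUTPUT_PER_MIN = 0.24  # USD per minute of audio output (from the text)
EVI_PER_MIN = 0.072             # USD per minute of conversation (from the text)


def blended_rate(input_share: float) -> float:
    """Realtime API cost per conversation minute for a given share of input audio."""
    return (input_share * REALTIME_INPUT_PER_MIN
            + (1 - input_share) * REALTIME_OUTPUT_PER_MIN)


realtime_per_min = blended_rate(0.5)      # assumed even split: about 0.15
realtime_per_hour = realtime_per_min * 60  # about 9.00
evi_per_hour = EVI_PER_MIN * 60            # about 4.32
ratio = realtime_per_min / EVI_PER_MIN     # a little over 2x
```

    Changing `input_share` shows how sensitive the comparison is to who does most of the talking: the more the assistant speaks, the higher the Realtime API's blended rate.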

    EVI 2 can be used with any LLM. While EVI 2 generates its own language, it is designed to be flexible and interoperable with other LLMs. This includes supplemental LLMs from OpenAI, Anthropic, Google, Meta, or any other provider. It also enables custom language models, allowing developers to bring their own LLMs or generate fixed responses. This flexibility enables developers to leverage the strengths of different LLMs while still benefiting from EVI's voice and empathic AI capabilities. In contrast, the Realtime API is tightly integrated with OpenAI's ecosystem. It cannot be used with non-OpenAI models. Thus, it is only well-suited for applications where GPT-4o’s responses are preferred over other LLMs like Gemini or Claude.

    GPT-4o supports more languages. OpenAI’s voice offering supports input and output in a wide range of languages. Similarly, EVI 2’s architecture allows for voice and text generation in any language, but the small model is currently only fluent in English, Spanish, French, German, and Polish. Many more languages will be added in the coming months.

    The use cases for EVI 2

    Voice-to-voice models are set to transform a wide range of products and services over the coming months.

    Customer service. For customer-facing businesses, voice-to-voice models can provide 24/7 support with unprecedented empathy and understanding. This is crucial, as businesses lose an estimated $75 billion annually due to poor customer service (source), and 81% of customers prefer self-service options according to Harvard Business Review (source). Businesses are unable to answer all incoming calls, which results in significant missed revenue: small and medium-sized businesses miss between 22% and 62% of all incoming calls (source). For example, in the automotive industry, a single missed call can result in a $220 lost opportunity, with the average automotive business losing $49,000 in revenue leakage per year due to unanswered calls (source). Using voice AI models can allow businesses to earn millions more in revenue. Empathic voice AI will be even more effective in satisfying customers and resolving their issues.

    A more efficient interface for virtually any application. Voice-to-voice models can significantly boost productivity by allowing hands-free, natural language interactions with complex systems. Since voice is about 4x faster than typing and allows applications to perform any action, not just the ones presented on a specific UI page, this may unlock an order of magnitude increase in productivity. Any application can add a voice interface to accelerate interaction speed and improve accessibility for millions of users.

    Mental health, education, and personal development. The ability of voice-to-voice models to understand context and emotional cues opens up possibilities in specific fields like mental health, education, and personal development. The market for AI-powered mental health apps is forecast to reach $8 billion by 2025 (source), showcasing the immense potential for personalized, empathic AI services at scale.

    These are just a small selection of example use cases for EVI 2. By enabling any application to add a customizable voice interface, EVI 2 enables countless new uses for voice AI.

    Looking forward: the future of EVI

    Currently, EVI 2 is available only in one model size: EVI-2-small. We are still making improvements to this model. In the coming weeks, it will become more reliable, learn more languages, follow more complex instructions, and use a wider range of tools. We’re also fine-tuning a larger, upgraded voice-to-voice model we will be announcing soon.

    While maintaining or exceeding EVI-2-small’s voice capabilities, this larger model will be more responsive to prompts and excel in complex reasoning. For now, if your application makes use of complex reasoning skills such as logical reasoning and tool use, we recommend that you configure the EVI API to use EVI 2 in conjunction with an external LLM.

    EVI 2 represents a critical step forward in our mission to optimize AI for human well-being. We focused on making its voice and personality highly adaptable to give the model more ways to meet users’ preferences and needs. The default personalities of EVI 2 already reflect how the model is optimized for user satisfaction, demonstrating that AI optimized for well-being will have a particularly pleasant and fun personality as a result of its deeper alignment with your goals.

    Our ongoing research is focused on optimizing for individual users’ preferences automatically, with methods to fine-tune the model to generate responses that align with ongoing signs of happiness and satisfaction during everyday use of an application.

    Voice-to-voice AI models represent a transformative leap in human-computer interaction. We can’t wait to try the delightful user experiences developers build with the EVI 2 API.

    Resources

    EVI 2 Documentation

    EVI 2 Pricing

    Developer Platform

    Hume Discord

    Original source
  • Sep 11, 2024
    • Date parsed from source:
      Sep 11, 2024
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Introducing EVI 2, our new foundational AI voice model

    Hume releases EVI 2, a beta voice-to-voice foundation model for human-like voice conversations in its app and API. It brings rapid responses, tone and personality adaptation, multilingual ability, and experimental voice modulation for custom synthetic voices without voice cloning.

    EVI 2 is our new voice-to-voice foundation model. It is one of the first AI models with which you can have remarkably human-like voice conversations. It can converse rapidly and fluently with users with subsecond response times, understand a user’s tone of voice, generate any tone of voice, and even respond to some more niche requests like changing its speaking rate or rapping. It can emulate a wide range of personalities, accents, and speaking styles and possesses emergent multilingual capabilities.

    At a higher level, EVI 2 excels at anticipating and adapting to your preferences, made possible by its special training for emotional intelligence. It’s trained to maintain characters and personalities that are fun and interesting to interact with. Put together, EVI 2 is designed to emulate the ideal AI personality for each application it is built into and each user.

    Getting started with EVI 2

    Today, EVI 2 is available in beta for anyone to use. It is available to talk to via our app and to build into applications via our API (in keeping with our guidelines).

    Importantly, EVI 2 is incapable of cloning voices without modifications to its code. This is by design: we believe voice cloning has unique risks. By controlling its identity-related voice characteristics at the model architecture level, we force the model to adopt one voice identity at a time, maintaining it across sessions.

    But we still wanted to give users and developers the ability to adapt EVI 2’s voice to their unique preferences and requirements. To that end, we developed an experimental voice modulation approach that allows anyone to create synthetic voices and personalities. Developers can adjust EVI 2’s base voices along a number of continuous scales, including gender, nasality, pitch, and more. This first-of-its-kind feature allows you to create tailored voices for specific apps and users without the risks of voice cloning.

    What's next?

    The model that we’re releasing today is EVI-2-small. We are still making improvements to this model—in the coming weeks, it will become more reliable, learn more languages, follow more complex instructions, and use a wider range of tools. We’re also fine-tuning EVI-2-large, which we will be announcing soon.

    EVI 2 represents a critical step forward on our mission to optimize AI for human well-being. We focused on making its voice and personality highly adaptable to give it more affordances to optimize for users’ happiness and satisfaction. After all, personalities are the amalgamation of many subtle, subsecond decisions made during our interactions, and EVI 2 demonstrates that AI optimized for well-being will have a particularly pleasant and fun personality as a result of its deeper alignment with your goals. Our ongoing research is focused on optimizing for each user’s preferences automatically, with methods to fine-tune the model to generate responses that align with signs of happiness and satisfaction during everyday use of an application.

    Resources

    • EVI 2 Documentation
    • EVI 2 Pricing
    • Developer Platform
    • Hume Discord
    Original source
  • Dec 18, 2023
    • Date parsed from source:
      Dec 18, 2023
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Tutorial: Hands-on with Hume's Custom Model API

    Hume introduces its Custom Model API, letting users train multimodal models from labeled audio, video, image, language, voice, and facial expression data to predict outcomes like well-being and satisfaction. The API also includes dataset setup, training, evaluation, and deployment in beta.

    Meet our Custom Model API, a cutting-edge AI tool that integrates language, voice, and/or facial movement to predict human preferences and needs more accurately than any LLM.

    You can now use our Custom Model API to predict well-being, satisfaction, mental health, and more. Using a few labeled examples, our API integrates dynamic patterns of language, vocal expression, and/or facial expression into a custom multimodal model.

    Leveraging Hume’s AI models pretrained on millions of videos and audio files, our API can usually predict your labels accurately after seeing just a few dozen examples. That means that with just a few labeled examples and a few clicks, you can deploy powerful AI models that predict the outcomes your users care about most. Of course, the models you train using our API are yours alone to deploy and share.

    Visit dev.hume.ai/docs/custom-models or log in to beta.hume.ai to get started!

    Step 1: Prepare your dataset

    The cool thing about our Custom Model API is that it learns rapidly from your own data. You just need to choose a dataset of image, video, or audio files for it to learn from—ideally, one that captures the different states, preferences, or outcomes that are important for your application.

    For example, consider a virtual education platform that is building a feature to help users stay focused. This customer could start by putting together a few video snippets in which students reported paying attention or being distracted. This dataset could then be submitted to our Custom Model API to build a model that detects signs of distraction automatically.

    The model creation process will be easier if you sort your image, video, or audio files into subfolders based on their labels. For example, you could create an umbrella folder called ‘Student Focus’ with subfolders called ‘Attentive’ and ‘Distracted’. Please note we currently don't support mixed media datasets.

    The amount of data you’ll need to build an accurate model depends on your goal’s complexity. Generally, it is good practice to have a similar number of samples with each label you want to predict. You also want to consider other forms of imbalance or bias in your dataset. Remember that the length of file, number of speakers, and language spoken will also impact the model's predictive accuracy. You'll essentially want to put together a training dataset very similar to the kinds of files you'll ultimately want to predict on. For more information, see Data Tips: What should your data look like?
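
    One quick way to check the balance advice above before uploading is to count samples per label and compare the largest class to the smallest. This is a local pre-check sketched under our own assumptions; the function names are hypothetical and it is not part of the Custom Model API.

```python
from collections import Counter
from typing import Dict, Iterable


def label_counts(labels: Iterable[str]) -> Dict[str, int]:
    """Count how many samples carry each label."""
    return dict(Counter(labels))


def imbalance_ratio(counts: Dict[str, int]) -> float:
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    if not counts:
        raise ValueError("no labels provided")
    return max(counts.values()) / min(counts.values())
```

    With 30 'Attentive' clips and 10 'Distracted' clips, `imbalance_ratio` returns 3.0, a sign that more 'Distracted' examples would be worth collecting.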

    Step 2: Create your dataset

    Once you’ve assembled your dataset, it’s time to visit beta.hume.ai.

    After logging in, navigate to the ‘Datasets’ page.

    Step 2.1: Find the ‘+ Create Dataset’ button on the top right corner of your screen. Clicking this button will take you to the page where you can add your dataset to our Custom Model platform.

    Step 2.2: Give your dataset a title and enter one of your column names along with the data type (categorical or numerical). Don't worry, you can always go back and make edits later. Your column is where you store your “label names”, which should just be the overall category. In our example, ‘Student Focus’ would be a suitable name.

    Step 2.3: Now you can just drag-and-drop the folder that contains your dataset.

    Remember, the folder should include subfolders for each label containing the corresponding samples. For example, if you drag in a folder with the subfolders ‘Attentive’ and ‘Distracted’, our platform will interpret ‘Attentive’ and ‘Distracted’ as labels belonging to the samples in each respective subfolder.
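
    The folder-to-label convention described above can be previewed locally before you drag the folder in. The sketch below walks a dataset folder and pairs each file with the name of its subfolder; it illustrates the convention only and does not reproduce how the platform itself parses uploads. The function name is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def samples_from_folder(root: Path) -> List[Tuple[str, str]]:
    """Return (relative file path, label) pairs, one per file in each label subfolder."""
    samples = []
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for file in sorted(f for f in label_dir.iterdir() if f.is_file()):
            # The subfolder name is treated as the label for every file inside it.
            samples.append((str(file.relative_to(root)), label_dir.name))
    return samples
```

    Running it on a ‘Student Focus’ folder with ‘Attentive’ and ‘Distracted’ subfolders yields one pair per clip, labeled with the subfolder it came from.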

    Step 2.4: Assign a label name in the pop-up window. Your “label name” should just be the overall category. In our example, ‘Student Focus’ would be a suitable name. Then, hit ‘Save Labels and Continue’ and subsequently approve the uploading process.

    Step 2.5: Verify your uploads. Check the total file count and address any detected issues.

    Step 2.6: Hit the ‘Save’ button on the top right of the page once you’re ready. If you accidentally uploaded a mixed-media dataset, a pop-up window will ask you to select the single file type you would like to keep.
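
    Since mixed-media datasets are not supported, it can save a round trip to check file types before uploading. A minimal sketch, assuming media type can be inferred from the file extension; the extension table below is illustrative and is not the platform's list of accepted formats.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical extension-to-media mapping; the platform's accepted formats may differ.
MEDIA_TYPES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp4": "video", ".mov": "video",
    ".wav": "audio", ".mp3": "audio",
}


def media_breakdown(filenames: Iterable[str]) -> Dict[str, int]:
    """Count files per media type, grouping unrecognized extensions as 'unknown'."""
    kinds = (MEDIA_TYPES.get(Path(name).suffix.lower(), "unknown") for name in filenames)
    return dict(Counter(kinds))


def is_mixed_media(filenames: Iterable[str]) -> bool:
    """True when the files span more than one recognized media type."""
    known = {kind for kind in media_breakdown(filenames) if kind != "unknown"}
    return len(known) > 1
```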

    Step 2.7: Now, you’re ready to create your model! Click ‘Create Custom Model’ on the top right of your screen.

    Step 3: Create your model

    Step 3.1: Select your Training Dataset. If you’re navigating to this page from a specific dataset (as recommended in this guide), this step will already be completed for you. However, if you want to check or change your dataset, you can do this by clicking on the ‘Edit’ button to the right of the heading ‘Select Training Dataset’.

    Step 3.2: Select which labels you want to predict and hit ‘Continue.’

    Step 3.3: Select your Task Type. Based on your data, we'll recommend creating either a Classification or Regression model. Next, select the specific type of model you want to create; you can currently select between binary or multiclass classification or univariate or multivariate regression.
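
    The four model types named in this step can be told apart by two questions: are the labels numeric, and how many classes or columns are there? The heuristic below is our own illustration of that decision, not the logic the platform uses for its recommendation; the function name and the rule for pooling classes across columns are assumptions.

```python
from typing import Sequence, Union

Label = Union[str, int, float]


def recommend_task(columns: Sequence[Sequence[Label]]) -> str:
    """Suggest a task type from one or more label columns (a heuristic sketch)."""
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for column in columns
        for value in column
    )
    if numeric:
        # One numeric column is univariate; several are multivariate.
        return "univariate regression" if len(columns) == 1 else "multivariate regression"
    # Non-numeric labels: count the distinct classes across the given columns.
    classes = {value for column in columns for value in column}
    return "binary classification" if len(classes) == 2 else "multiclass classification"
```

    For the ‘Student Focus’ example, a single column containing only ‘Attentive’ and ‘Distracted’ maps to binary classification.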

    Step 3.4: Fill in your new Custom Model’s name and description.

    Step 3.5: Hit ‘Start Training’ once you’re ready.

    That’s it! You’ve successfully begun training your Custom Model.

    You should be redirected to a page confirming that your model is actively training.

    To check on the status of your model, click ‘View Jobs.’ To see existing, finished models, click ‘View Models’.

    Step 4: View and evaluate your model

    Step 4.1: Find your Custom Model on the ‘Models’ page. Once your model is done training, you’ll see it on the ‘Models’ page. Click to explore how it performed.

    Step 4.2: View your model’s estimated accuracy. The Custom Models page includes statistics on the accuracy of your trained model, obtained by iteratively training on ~90% of your data and testing on the remaining ~10%, across the samples in your dataset.

    Step 4.3: Interpret confidence scores and misclassified samples. Explore the confidence scores assigned to each sample, as well as the samples that were misclassified, by navigating the “Classification confidence” visualization. Note that the confidence scores give you a continuous measure that might have additional utility beyond your original labels, potentially representing gradations between your classes.

    Step 4.4: Use the 'Descriptive statistics' visualization to interpret how different expression modalities relate to your labels. You can explore how your labels are differentiated by the different modalities of emotional expression measured by our core models (Facial Expression, Prosody, Vocal Bursts, and Language). Toggle between 'Emotions' and 'Classes'.
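
    The accuracy estimate in Step 4.2 (train on ~90% of the data, test on the remaining ~10%, repeated across the dataset) resembles 10-fold cross-validation. The source does not spell out the exact procedure, so the following is a generic sketch of that technique. `k_fold_indices` and `cross_validated_accuracy` are hypothetical names, and the majority-class "model" is a stand-in for a real classifier.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_accuracy(labels, k=10):
    """Estimate accuracy of a majority-class baseline with k-fold CV.

    A real pipeline would shuffle the data first and fit an actual model on
    each training split; the baseline keeps the sketch self-contained.
    """
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test)
    return correct / len(labels)
```

    With k=10, each round trains on roughly 90% of the samples and tests on the held-out 10%, and the per-round results are pooled into a single accuracy figure.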

    If you’re happy with your model’s performance, it’s time to put it to use.

    Step 5: Test your Custom Model on new files

    Step 5.1: Navigate to the ‘Playground’ page.

    Step 5.2: Locate the dropdown menu to select your model of interest. Select the custom model you successfully created from the dropdown.

    Step 5.3: To select a file to analyze, locate the 'Upload files' button. You can upload files stored locally on your computer or select files you have already uploaded to Hume. You can also use example files for the custom models.

    Step 5.4: Click ‘Analyze’ to evaluate your selected file with your custom model.

    That’s it! You’ve successfully applied your Custom Model to a new file. For more information on how to interpret your results, see Evaluating your model.

    Your model is automatically deployed on our API so that you can build it into your application. For further instruction, see Start Custom Model Inference Job.

    In closing

    The Hume AI Platform strives to be the only toolkit developers need to measure verbal and nonverbal cues in audio, video, or images, based on rigorous scientific studies of human expressive behavior. Our Custom Model API is the most powerful way to apply our models to your specialized use case.

    In this post, we walked through the basics of how to use our new Custom Model API. For more details and the latest documentation, be sure to bookmark our tutorials. And if you plan to develop an application using our API, note that it will need to adhere to the ethical guidelines of The Hume Initiative.

    We’re excited to expand our private beta over the next few months, and look forward to what you’ll build with it! If you have any questions or want more direct support, please join our Discord channel.

    Connect with us

    Follow us on Twitter @hume_ai or join our Discord channel. If you’re interested in beta access, you can sign up.

    Original source
  • Dec 14, 2023
    • Date parsed from source:
      Dec 14, 2023
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Announcing our Custom Model API

    Hume introduces its Custom Model API, letting users train multimodal AI models on a few labeled examples to predict outcomes like well-being, satisfaction, and mental health using language, voice, and facial signals. The beta training process is free, with paid inference for deployment.

    Meet our Custom Model API, a cutting-edge AI tool that integrates language, voice, and/or facial movement to predict human preferences and needs more accurately than any LLM.

    You can now use our Custom Model API to predict well-being, satisfaction, mental health, and more. Using a few labeled examples, our API integrates dynamic patterns of language, vocal expression, and/or facial expression into a custom multimodal model.

    Leveraging Hume’s AI models pretrained on millions of videos and audio files, our API can usually predict your labels accurately after seeing just a few dozen examples. That means that with just a few labeled examples and a few clicks, you can deploy powerful AI models that predict the outcomes your users care about most. Of course, the models you train using our API are yours alone to deploy and share.

    Visit dev.hume.ai/docs/custom-models or log in to beta.hume.ai to get started!

    Our new API translates nuanced multimodal measures into personalized insights more accurately than any LLM

    For instance, we partnered with Lawyer.com to predict the quality of customer support calls. Using just 73 calls, we were able to train a model that predicted expert ratings of whether a call went well or poorly with 97.3% accuracy. By contrast, using language models alone—including one of the world’s most capable language models along with our in-house language emotion model—resulted in a 3x higher error rate.

    Leveraging dynamic patterns of language, vocal expression, and/or facial movement

    Our Custom Model API works by integrating complex patterns of language, vocal expression, and/or facial movement captured using Hume’s expression AI models.

    To combine these signals with language, we inserted the expression measures extracted by each of our expression models along with transcribed language into a novel empathic large language model (eLLM). We then pretrained our eLLM on millions of human interactions.

    When you train a custom model using our Custom Model API, you are leveraging the joint language-expression embeddings extracted by our eLLM to predict your own labels.

    Pricing

    There are two steps to using our Custom Model API: (1) Training and (2) Inference.

    1. Training: During our beta release, the model training process is completely free. This includes uploading data, training, evaluating results, and retraining.

    2. Inference: When deploying your custom model in your application, a fee is charged for each file processed by your model. Detailed pricing can be found on our pricing page.

    Hope you enjoy using our Custom Model API! If you have any questions, you can post them on our Discord channel. We look forward to seeing what you build.

    Original source
  • Dec 2, 2024
    • Date parsed from source:
      Dec 2, 2024
    • First seen by Releasebot:
      May 27, 2026

    Hume

    Introducing Voice Control

    Hume introduces Voice Control in beta for EVI 2, giving developers interpretable, slider-based control over custom AI voices. The feature offers continuous adjustments across 10 voice dimensions, real-time preview, and reproducible voice changes without voice cloning risks.

    Why voice control matters

    • We’re introducing Voice Control, a novel interpretability-based method that brings precise control to AI voice customization without the risks of voice cloning.
    • Our tool gives developers control over 10 voice dimensions, labeled “masculine/feminine,” “assertiveness,” “buoyancy,” “confidence,” “enthusiasm,” “nasality,” “relaxedness,” “smoothness,” “tepidity,” and “tightness.”
    • Unlike prompt-based approaches, Voice Control enables continuous adjustments along these dimensions, allowing for precise control and making voice modifications reproducible across sessions.
    • We’re releasing Voice Control in beta so that developers can create one-of-a-kind voices for any application, but we’re still working on making voice quality 100% reliable for extreme parameter combinations.
    • Through an intuitive no-code interface, you can easily tinker with this frontier technology to craft the perfect voice for your brand or application.

    Faced with an increasingly recognizable set of preset voices from AI providers, creators still struggle to find voices that match their product, brand, or application without compromising on quality.

    Today, we're introducing Voice Control, our experimental feature for Empathic Voice Interface 2 (EVI 2) that transforms how custom AI voices are created through interpretable, continuous controls.

    Interpretable control for voice AI

    As scientists working at the intersection of emotion science and AI, our research goal was to develop interpretability tools for speech-language models. What makes this particularly challenging is that people’s perceptions of voices are far more granular than they can articulate in words. Consider how parents can instantly distinguish their child's voice in a playground full of young, squeaky, enthusiastic voices, or how you'd struggle to describe your best friend's voice to a stranger—despite immediately recognizing it yourself. Nuanced, ineffable voice characteristics are not just highly recognizable to humans, but extremely psychologically salient.

    Given these constraints, we decided to develop a slider-based approach to voice interpretability and control that reflects the nuances of human voice perception without forcing them through the bottleneck of language.

    Modifiable voice attributes

    The following attributes can be modified to personalize any of the base voices:

    • Masculine/Feminine
    • Assertiveness
    • Buoyancy
    • Confidence
    • Enthusiasm
    • Nasality
    • Relaxedness
    • Smoothness
    • Tepidity
    • Tightness

    Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values will keep the base voice unchanged.

    These sliders represent perceptual qualities that listeners tend to associate with specific voice characteristics – for instance, what people commonly interpret as a voice that sounds 'confident' or 'feminine' – rather than making claims about someone’s underlying gender or confidence level (after all, these are synthetic voices that don’t correspond to any real person).
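
    The value range described above (-100 to 100 per attribute, with 0 leaving the base voice unchanged) can be expressed as a small validation helper. This is an illustrative sketch only: the snake_case attribute keys and the `make_voice_settings` function are hypothetical and are not taken from Hume's API.

```python
# Hypothetical identifiers for the ten attributes listed in the post.
VOICE_ATTRIBUTES = (
    "masculine_feminine", "assertiveness", "buoyancy", "confidence",
    "enthusiasm", "nasality", "relaxedness", "smoothness",
    "tepidity", "tightness",
)
MIN_VALUE, MAX_VALUE, DEFAULT_VALUE = -100, 100, 0


def make_voice_settings(**overrides):
    """Return all ten attributes, defaulting to 0 (base voice unchanged)."""
    settings = {name: DEFAULT_VALUE for name in VOICE_ATTRIBUTES}
    for name, value in overrides.items():
        if name not in settings:
            raise KeyError(f"unknown voice attribute: {name}")
        if not MIN_VALUE <= value <= MAX_VALUE:
            raise ValueError(f"{name} must be between -100 and 100")
        settings[name] = value
    return settings
```

    Calling `make_voice_settings()` with no arguments yields the unmodified base voice, while `make_voice_settings(confidence=40, nasality=-20)` adjusts two dimensions and leaves the rest at their defaults.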

    Disentangling voice characteristics

    One of our core technical achievements is ensuring that, in general, modifications to one voice characteristic don't influence others. This is particularly challenging as many voice attributes are highly correlated across real speakers, so we decided to develop a new, unsupervised approach that preserves most characteristics of each base voice when specific parameters are varied.

    Implementation and integration

    Voice Control is immediately available through our platform. The creation process is straightforward:

    1. Select a base voice as your starting point
    2. Adjust the voice attributes using intuitive sliders
    3. Preview your changes in real time
    4. Deploy your custom voice through the EVI configuration

    The system ensures that voice customizations are:

    • Reproducible across sessions
    • Stable across different utterances
    • Computationally efficient for real-time applications

    What's next

    This release marks just the beginning of our vision for voice customization. We're actively working on:

    • Expanding our range of base voices
    • Introducing additional interpretable dimensions
    • Enhancing preservation of voice characteristics under extreme modifications
    • Developing advanced tools for analyzing and visualizing voice characteristics

    Learn More: Transform AI interactions with EVI.

    Create customizable, emotionally intelligent voice AI for any industry to build AI applications that better understand and respond to human emotional behavior. Start building more engaging AI apps today.

    Original source
  • May 15, 2026
    • Date parsed from source:
      May 15, 2026
    • First seen by Releasebot:
      May 16, 2026

    Hume

    May 15, 2026

    Hume adds an experimental temperature parameter to its TTS API for more varied or consistent speech generation.

    TTS API additions

    Added an experimental temperature parameter to TTS endpoints. Controls sampling temperature for speech generation. Higher values increase variation; lower values increase consistency.
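
    As background on what a sampling temperature does in general, the sketch below applies temperature scaling to a softmax over candidate scores. It shows the generic technique only; the release note does not say how Hume applies temperature internally or which values the endpoint accepts, and `softmax_with_temperature` is a hypothetical name.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

    Lower temperatures concentrate probability on the top-scoring candidate (more consistent output), while higher temperatures spread probability across candidates (more varied output).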

    Original source
  • May 2026
    • No date parsed from source.
    • First seen by Releasebot:
      May 16, 2026

    Hume

    TTS API bug fixes

    Hume fixes a bug that caused duplicate interleaved TTS audio and distorted output.

    Fixed a bug where duplicate interleaved audio was included in TTS audio output.

    This resolves an issue where audio chunks could be duplicated and interleaved, resulting in distorted output.

    Original source
  • Apr 10, 2026
    • Date parsed from source:
      Apr 10, 2026
    • First seen by Releasebot:
      Apr 11, 2026

    Hume

    April 10, 2026

    Hume adds configurable turn detection and interruption settings to EVI configs, giving users finer control over turn-taking, speech detection, and interruption behavior on a per-config basis.

    EVI API additions

    Added configurable turn detection and interruption settings to EVI configs. You can now control how EVI handles turn-taking and interruptions on a per-config basis.

    • turn_detection.end_of_turn_silence_ms: How long EVI waits after speech ends before committing a turn (500-3000ms, default 800ms).
    • turn_detection.speech_detection_threshold: Sensitivity of voice activity detection (0.0-1.0, default 0.5).
    • turn_detection.prefix_padding_ms: Audio padding before detected speech (default 300ms).
    • interruption.min_interruption_ms: Minimum speech duration before EVI can be interrupted (50-2000ms, default 800ms).
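
    The settings above can be collected and range-checked before being sent with a config. The sketch below uses the field names, defaults, and ranges from this note. The nested dictionary layout is an assumption inferred from the dotted names, `build_turn_settings` is a hypothetical helper, and no range check is applied to `prefix_padding_ms` because the note does not state one.

```python
import copy

# Defaults as listed in the release note.
DEFAULTS = {
    "turn_detection": {
        "end_of_turn_silence_ms": 800,
        "speech_detection_threshold": 0.5,
        "prefix_padding_ms": 300,
    },
    "interruption": {"min_interruption_ms": 800},
}

# Inclusive ranges as listed in the release note.
RANGES = {
    ("turn_detection", "end_of_turn_silence_ms"): (500, 3000),
    ("turn_detection", "speech_detection_threshold"): (0.0, 1.0),
    ("interruption", "min_interruption_ms"): (50, 2000),
}


def build_turn_settings(overrides=None):
    """Merge overrides into the defaults, rejecting unknown or out-of-range values."""
    settings = copy.deepcopy(DEFAULTS)
    for section, values in (overrides or {}).items():
        if section not in settings:
            raise KeyError(f"unknown section: {section}")
        for key, value in values.items():
            if key not in settings[section]:
                raise KeyError(f"unknown setting: {section}.{key}")
            bounds = RANGES.get((section, key))
            if bounds is not None and not bounds[0] <= value <= bounds[1]:
                raise ValueError(f"{section}.{key} must be within {bounds}")
            settings[section][key] = value
    return settings
```

    For example, `build_turn_settings({"turn_detection": {"end_of_turn_silence_ms": 1200}})` lengthens the pause EVI waits for before committing a turn and keeps every other setting at its default.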
    Original source
  • Mar 10, 2026
    • Date parsed from source:
      Mar 10, 2026
    • First seen by Releasebot:
      Mar 10, 2026

    Hume

    Opensourcing TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization

    Hume releases TADA (Text-Acoustic Dual Alignment), a tokenization scheme for fast, low-hallucination LLM-based TTS. It is open-sourced now with 1B and 3B models and the full audio tokenizer, enabling on-device deployment and real-time voice with a one-to-one text-audio mapping. Available on HuggingFace and GitHub for research and development.

    Approach

    The future of voice AI hinges on sounding natural, fast, expressive, and free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.

    TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization schema that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.

    Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.

    For input audio, an encoder paired with an aligner extracts acoustic features from the audio segment corresponding to each text token. For output audio, the LLM's final hidden state serves as a conditioning vector for a flow-matching head, which generates acoustic features that are then decoded into audio and fed back into the model.

    Since each LLM step corresponds to exactly one text token and one acoustic representation, TADA generates speech faster and with less computational effort. And because the architecture enforces a strict one-to-one mapping between text and audio, the model cannot skip or hallucinate content by construction.

    Evaluation

    Hallucination Rate

    Samples with CER > 0.15 (signaling skipped words, inserted content, or unintelligible speech), out of 1,088 samples:

    • FireRedTTS-2: 41
    • Higgs Audio v2: 24
    • VibeVoice 1.5B: 17
    • TADA-3B: 0
    • TADA-1B: 0

    Speed

    TADA generates speech at a real-time factor (RTF) of 0.09, more than 5x faster than LLM-based TTS systems of similar grade. This is possible because TADA operates at just 2–3 frames (tokens) per second of audio, compared to 12.5–75 tokens per second in other approaches.
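
    To make the RTF figure concrete: real-time factor is commonly defined as generation time divided by the duration of the audio produced, so values below 1 mean faster than real time. The helpers below are a minimal sketch under that standard definition, with hypothetical function names.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, rtf):
    """Seconds of compute needed to produce `audio_seconds` of audio at a given RTF."""
    return audio_seconds * rtf
```

    At the reported RTF of 0.09, one minute of audio would take about 5.4 seconds to generate.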

    Hallucination

    Our model was trained on large-scale, in-the-wild data, without post-training, and achieves the same reliability as models trained on smaller curated datasets. We measured hallucination rate by flagging any sample with a character error rate (CER) above 0.15, a threshold that captures unintelligible speech, skipped text, and inserted content. In the 1,000+ test samples from LibriTTS-R, TADA produced zero hallucinations.
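
    The CER threshold described above can be illustrated with a character-level edit distance. This is a generic sketch, not the evaluation code used for TADA: in practice the hypothesis text would come from transcribing the generated audio with a speech recognizer, and any text normalization applied beforehand is not described in the source. The function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def is_hallucination(reference, hypothesis, threshold=0.15):
    """Flag a sample whose CER exceeds the threshold (0.15 in the post)."""
    return cer(reference, hypothesis) > threshold
```

    A transcript that drops a whole word from a short sentence typically exceeds the threshold, while an exact match has a CER of 0 and is not flagged.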

    Voice Quality

    Results on SEED-TTS-EVAL and LIBRITTSR-EVAL show that TADA achieves reliability comparable to Index-TTS — one of the few systems with similarly low hallucination rates — while being trained on a larger, less-curated dataset.

    In human evaluation on expressive, long-form speech (EARS dataset), TADA scored 4.18/5.0 on speaker similarity and 3.78/5.0 on naturalness, placing second overall — ahead of several systems trained on significantly more data.

    Potential Applications

    On-device deployment. TADA is lightweight enough to run on mobile phones and edge devices without requiring cloud inference. For device manufacturers and app developers building voice interfaces, this means lower latency, better privacy, and no API dependency.

    Long-form and conversational speech. TADA's synchronous tokenization is dramatically more context-efficient than existing approaches. Where a conventional system exhausts a 2048-token context window in about 70 seconds of audio, TADA can accommodate roughly 700 seconds in the same budget. This opens the door to long-form narration, extended dialogue, and multi-turn voice interactions.
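
    The context-window comparison above implies a token rate for each approach, which can be checked with simple arithmetic. The sketch below assumes the whole 2048-token window is spent on audio frames, which ignores any prompt or other overhead; the function names are hypothetical.

```python
CONTEXT_TOKENS = 2048


def implied_tokens_per_second(context_tokens, audio_seconds):
    """Token rate implied by filling a context window with a given audio length."""
    return context_tokens / audio_seconds


def audio_seconds_in_context(context_tokens, tokens_per_second):
    """How much audio fits in a context window at a given token rate."""
    return context_tokens / tokens_per_second


conventional_rate = implied_tokens_per_second(CONTEXT_TOKENS, 70)   # about 29.3
tada_rate = implied_tokens_per_second(CONTEXT_TOKENS, 700)          # about 2.9
```

    Both implied rates are consistent with the figures quoted earlier in the post: roughly 29 tokens per second falls within the 12.5–75 range given for other approaches, and roughly 2.9 falls within TADA's 2–3 frames per second.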

    Production reliability. Zero hallucinations in our tests suggest fewer edge cases to catch, fewer customer complaints, and less post-processing overhead in the product. This makes TADA well-suited for deploying voice in regulated or sensitive environments like healthcare, finance, and education.

    Limitations and Future Work

    Long-form degradation. While the model supports more than 10 minutes of context, we noticed occasional cases of speaker drift during long generations. Our online rejection sampling strategy reduces this significantly, but it's not fully resolved. We suggest resetting the context as an interim workaround.

    The modality gap. When the model generates text alongside speech, language quality drops relative to text-only mode. We introduce Speech Free Guidance (SFG), a technique that blends logits from text-only and text-speech inference modes to help close this gap, but more work is required.
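
    The post says only that SFG "blends logits" from the two inference modes and does not give the formula. Purely as an illustration of what such a blend could look like, the sketch below uses a linear combination in the style of classifier-free guidance. The formula, the `weight` parameter, and the name `blend_logits` are all assumptions, not TADA's actual method.

```python
def blend_logits(text_speech_logits, text_only_logits, weight):
    """Linearly blend two logit vectors (assumed form, not the published one).

    weight = 0 returns the text-speech logits unchanged, weight = 1 returns
    the text-only logits, and values in between interpolate.
    """
    if len(text_speech_logits) != len(text_only_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        ts + weight * (to - ts)
        for ts, to in zip(text_speech_logits, text_only_logits)
    ]
```

    Under this assumed form, a larger weight pulls next-token predictions toward the text-only mode, which is where the post reports higher language quality.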

    Use-cases. The model is only pre-trained on speech continuation; further fine-tuning is required for assistant scenarios. Get in touch to inquire about Hume's extensive library of fine-tuning data.

    Scale. The current release covers English and seven additional languages, so there's clear room to expand. We're training larger models with broader language coverage using Hume AI data.

    We're releasing TADA because we believe this architecture opens a productive direction for the field, and we want to accelerate progress. We invite researchers and developers to build on this work — whether that means extending the tokenizer to new modalities, solving the long-context problem, or adapting the framework for new applications.

    Get Started

    TADA is available now under an open-source license. We're releasing 1B and 3B parameter Llama-based models and the full audio tokenizer and decoder.

    1B (English):

    huggingface.co/HumeAI/tada-1b

    3B (multilingual):

    huggingface.co/HumeAI/tada-3b-ml

    Demo:

    huggingface.co/spaces/HumeAI/tada

    GitHub:

    github.com/HumeAI/tada

    TADA was developed by Trung Dang, Sharath Rao, Ananya Gupta, Christopher Gagne, Panagiotis Tzirakis, Alice Baird, Jakub Piotr Cłapa, Peter Chin, and Alan Cowen at Hume AI.

    Hume builds voice AI research infrastructure for frontier labs and AI-first enterprises. If you're working on voice models and need high-quality training data, evaluation systems, or reinforcement learning infrastructure, get in touch at

    [email protected]

    Original source
  • Feb 27, 2026
    • Date parsed from source:
      Feb 27, 2026
    • First seen by Releasebot:
      Feb 28, 2026
    • Modified by Releasebot:
      May 16, 2026

    Hume

    February 27, 2026

    Hume adds EVI API support for new LLM models and zero prompt expansion control.

    EVI API additions

    Added support for new supplemental LLM models: claude-opus-4-6, gpt-5.1, gpt-5.1-priority, gpt-5.2, gpt-5.2-priority.

    Added support for zero prompt expansion. You can now set prompt_expansion to ZERO when configuring an external LLM, disabling automatic prompt expansion and giving you full control over the system prompt.

    Original source
  • Oct 14, 2025
    • Date parsed from source:
      Oct 14, 2025
    • First seen by Releasebot:
      Feb 18, 2026

    Hume

    Revelum × Hume: Detecting Voice Fraud in Real-Time

    Revelum launches an AI-native security platform that stops deepfake fraud in real time with call risk analysis, live deepfake detection, and precise timestamps. A strategic partnership with Hume AI speeds up resilient detection and responsible AI use, demonstrated by a real-time fraud scenario.

    Revelum

    Revelum is an AI-native security platform that protects institutions from deepfake impersonations and fraud in real time. Founded by Enrique Barco in 2025, Revelum provides turn-key solutions and developer-friendly APIs to safeguard institutions from emerging threats in the era of AI-driven fraud.

    Revelum’s strategic partnership with Hume AI ensures their detection systems can identify even the most advanced synthetic voices, including Hume's own Empathic Voice Interface (EVI).

    The Demo: AI-Powered Fraud in Action

    In a recent demo, Revelum showcased how attackers use AI voice agents to attempt account takeovers. The scenario: "Jake" calls customer support, claiming an AI assistant accidentally changed and deleted his password. The EVI-powered voice sounds natural, but Revelum's technology instantly flags the deepfake.

    In real time, Revelum’s platform provides:

    • A call risk assessment analysis
    • Real-time deepfake detection counts
    • Precise timestamps of synthetic voice segments

    The customer service agent receives an alert and initiates callback verification—stopping the attack immediately.

    Hume’s Partnership with Revelum

    Our collaboration with Revelum creates a critical feedback loop for responsible AI development:

    • Early Access to EVI: Revelum trains their models on Hume's cutting-edge voice technology, ensuring detection capabilities stay ahead of emerging threats before they reach malicious actors.
    • Continuous Refinement: As Hume's emotionally intelligent voices become more sophisticated, Revelum's detection algorithms evolve in parallel.

    Revelum founder Enrique notes:

    "By partnering with Hume, we’re taking a vital step toward building technology that anticipates — not just reacts to — the evolving tactics of bad actors seeking to misuse powerful models. Together, we’re staying one step ahead in ensuring generative AI is used responsibly."

    For more information on how empathic AI can enhance your digital solutions, contact Hume AI.

    Original source
  • Oct 21, 2025
    • Date parsed from source:
      Oct 21, 2025
    • First seen by Releasebot:
      Feb 18, 2026

    Hume

    Creating immersive avatar experiences with Render Foundry

    Render Foundry unveils an immersive Babe Ruth simulator built with Hume AI voice cloning, blending Unreal Engine storytelling with authentic, warm dialogue. The experience lets visitors talk with Babe Ruth, syncing likeness with audio for a lifelike museum interaction.

    About Render Foundry

    Render Foundry specializes in creating immersive experiences, from interactive museum installations to digital twins of entire campuses. Led by Shane Boyce and Josh Harwell, their team combines Unreal Engine expertise with cutting-edge storytelling to blur the line between reality and simulation.

    When they set out to create an interactive Babe Ruth experience, they needed a voice that could do the impossible: bring him back to life with authenticity, warmth, and the personality that made him a legend.

    Watch the Experience

    Using Hume's custom voice cloning technology, Render Foundry created a Babe Ruth simulator that feels emotionally authentic. Hume captured the tonal qualities, cadence, and personality of the baseball icon, allowing visitors to have natural, engaging conversations with one of sports' most beloved figures.

    The result is an experience that transcends typical museum exhibits. Visitors not only learn about Babe Ruth, but also connect with him. Render Foundry created something truly special by simulating Babe’s likeness and syncing the audio and the visual, making this experience one-of-a-kind.

    Josh Harwell, the Creative Director at Render Foundry, says,

    “We’re excited to offer these curated experiences. It’s fun to watch clients interact with our characters as they are brought to life. Whether it’s a historical figure, a mascot, or a brand ambassador, Hume helps us deliver a solution that humanizes the responses of AI.”

    For more information on how empathic AI can enhance your digital solutions, contact Hume AI.

    Original source
  • Oct 24, 2025
    • Date parsed from source:
      Oct 24, 2025
    • First seen by Releasebot:
      Feb 18, 2026

    Hume

    AudioStack × Hume: Professional Audio for Creatives

    AudioStack expands its AI audio production suite with Hume’s expressive voices, boosting speed, consistency, and emotional depth for ads, podcasts, and branded content. The integration enables scalable, natural sounding output across markets and languages while reducing costs and raising creative quality.

    About AudioStack

    AudioStack is an enterprise AI audio production platform trusted by global creative teams at Publicis, Omnicom, iHeartMedia, Dentsu, and more. Their AI-driven production suite empowers agencies, publishers, AdTech platforms, and brands to create broadcast-ready audio content 10 times faster at a fraction of traditional costs—reducing production expenses by up to 80% while scaling effortlessly across markets and languages.

    Expanding AudioStack’s Voice Library with Hume

    AudioStack offers a comprehensive voice library for audio advertisements, podcasts, and branded content. As they continue to grow their voice offerings, they're integrating Hume's emotionally intelligent voices to meet two core demands of creative teams:

    1. Consistent Stability
      Enterprise content generation requires voices that perform reliably across thousands of productions. Hume's voices deliver consistent quality and pronunciation, ensuring brand messaging remains clear and professional, whether creating one ad or thousands of dynamic variations.

    2. Natural Expressiveness
      Generic TTS voices often sound flat or robotic—a dealbreaker for agencies creating audio that needs to engage audiences. Hume's voices bring genuine emotional depth, helping audio content feel authentic, engaging, and human.

    By adding Hume’s expressive voices to their platform, AudioStack enables creative teams and advertisers to produce high-quality, emotionally resonant audio at scale.

    For more information on how empathic AI can enhance your digital solutions, contact Hume AI.

    Original source
Releasebot

Curated by the Releasebot team

Releasebot is an aggregator of official release notes from hundreds of software vendors and thousands of sources.

Our editorial process involves the manual review and audit of release notes procured with the help of automated systems.

Similar to Hume with recent updates: