MiniMax Release Notes

23 release notes curated from 29 sources by the Releasebot Team. Last updated: Jun 19, 2026

  • Jun 1, 2026
    • Date parsed from source:
      Jun 1, 2026
    • First seen by Releasebot:
      Jun 19, 2026
    MiniMax

    MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model

    MiniMax releases M3, a frontier open-weight model for coding and agentic work with 1M-token context, native multimodal input, and computer use. It also updates MiniMax Code, expands Token Plan access, and makes the M3 API available for developers.

    M3 is officially released today.

    M3 reaches frontier-level performance on specialized tasks such as coding and agentic work. It uses MSA (MiniMax Sparse Attention), a new attention architecture proposed by our team, and supports ultra-long context windows of up to 1M tokens. As widely anticipated, it is also a natively multimodal model that supports image and video input and can operate a desktop computer.

    These three capabilities are now table stakes for closed-source frontier models. M3 is currently the first and only open-weight model to bring all three together.

    In coding, M3 improves significantly over M2, approaching the level of leading overseas closed-source models in areas such as bug fixing, frontend/backend development, and performance optimization.

    In agentic work, M3 performs strongly on common office workflows such as search and Office-suite tasks, and has reached initial usability in the financial domain.

    You can experience MiniMax M3 right away through MiniMax Code, the Token Plan, and our API services.

    MSA: Architectural Innovation Enables Context Scaling

    Solving more complex Agent tasks was one of the most important goals when training M3, and one of the biggest challenges involved was context scaling. To achieve real change, you have to start at the most fundamental level—the attention mechanism—and avoid the "inherent flaw" of full attention: quadratic computational complexity growth.

    MSA is a clean and easily extensible new sparse attention architecture. It gives M3 a 1M context window and makes context truly another dimension that can be scaled.

    Sparse attention mechanisms generally avoid the complexity-explosion problem by adding a pre-filtering stage. Compared with approaches like DSA and MoBA, MSA can partition the KV into blocks more precisely, achieving higher effective context coverage.

    At the same time, we also optimized directly at the operator level, adopting a "KV outer gather Q" approach that uses KV blocks as the outer loop to aggregate the queries that hit them. Each block is read only once and memory access is contiguous; under M3's head configuration, the arithmetic intensity is significantly better than common methods—more than 4× faster than the open-source Flash-Sparse-Attention and flash-moba.

    Because MSA is clean, scalable, easy to implement, and hardware-friendly, its theoretical gains are fully realized in practice: at a context length of 1 million tokens, M3's per-token compute is just 1/20 that of the previous-generation model. We achieved a speedup of more than 9× in the prefilling stage and more than 15× in the decoding stage. Moreover, across multiple ablations, MSA matched full attention on the vast majority of capabilities.
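
    The loop order behind "KV outer gather Q" can be illustrated with a toy counter. This is a minimal sketch, not MiniMax's kernel: the per-query block selection is hand-written, the function names are made up for illustration, and the code only counts KV block loads instead of computing attention.

```python
from collections import defaultdict

def invert_selection(selected_blocks):
    """Map each KV block id to the list of query ids that selected it."""
    hits = defaultdict(list)
    for query_id, blocks in enumerate(selected_blocks):
        for block_id in blocks:
            hits[block_id].append(query_id)
    return dict(hits)

def block_reads_query_outer(selected_blocks):
    # Query-outer loop: every query loads each of its selected blocks.
    return sum(len(blocks) for blocks in selected_blocks)

def block_reads_kv_outer(selected_blocks):
    # KV-outer loop: each block that is hit at least once is loaded once,
    # then applied to all queries gathered for it.
    return len(invert_selection(selected_blocks))

# Hypothetical output of a pre-filtering stage: query i -> chosen KV blocks.
selected = [[0, 2], [0, 1], [0, 2], [2, 3]]
hits = invert_selection(selected)

print(hits)
print(block_reads_query_outer(selected), block_reads_kv_outer(selected))
```

    In this toy case the query-outer order loads blocks 8 times, while the KV-outer order loads each of the 4 hit blocks once. The gap grows with the number of queries that share a block.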

    Frontier Coding and Agentic Capabilities

    Coding and agentic capabilities are key areas of improvement for M3. Across internationally recognized benchmarks spanning software engineering and terminal execution, M3 reaches the frontier:

    • SWE-Bench Pro: 59.0%
    • Terminal-Bench 2.1: 66.0%
    • SWE-fficiency: 34.8%
    • KernelBench Hard: 28.8%
    • MCP Atlas: 74.2%

    Today, coding prowess increasingly depends on whether models can be trained using real-world user logic. Existing coding benchmarks often fail to fully capture real user experience.

    Most current training and evaluation of code agents are based on the assumption of single-turn tasks. But real-world usage is not like this. Users often collaborate continuously within the same session—clarifying requirements, adjusting solutions, assigning tasks across contexts, and iterating over multiple rounds based on intermediate results.

    To narrow the gap between benchmarks and real-world user experience, we built an interactive user simulator framework.

    By simulating the behavioral patterns of real developers during collaboration, the framework exposes models during both training and evaluation to interaction scenarios that are much closer to production environments. It can simulate behaviors such as requirement elaboration, solution discussion, feedback-based correction, continuous task switching, and complex project iteration. As a result, the agent no longer merely executes instructions passively, but can actively collaborate with users to complete tasks.

    The next generation of agentic coding will not be measured only by code generation, but also by long-term collaboration capability, planning ability, and the efficiency of human-agent collaboration. M3 scales up the data that truly matters for coding and agents, with the goal not only of leading on benchmarks, but also of becoming a reliable collaborative partner for developers in real-world R&D workflows.

    Multimodality: Interleaved Training, Continued Scaling

    M3 is a model that has undergone mixed-modality training from Step 0. This native multimodal approach allows the semantic spaces of different modalities to merge more naturally and deeply.

    Meanwhile, our extensive experiments show that interleaved data scales more easily than synthetic data. As a result, during the M3 cycle, we re-architected the entire text pretraining data pipeline, producing a large volume of interleaved data and incorporating it into model training.

    Real-World Tasks

    In our internal use and testing of M3, several real-world tasks left a strong impression.

    Independent Paper Reproduction

    We wanted to see how three capabilities essential to a frontier model (1M ultra-long context, top-tier coding and agent capabilities, and native multimodality) would perform when brought together in a long thread to solve a complex task.

    We gave M3 an ICLR 2025 Outstanding Paper Award-winning paper, "Learning Dynamics of LLM Finetuning", and asked it to reproduce the paper independently. The paper studies the learning dynamics of large language models during fine-tuning. In the end, M3 ran autonomously for nearly 12 hours, independently producing 18 commits and 23 experimental figures throughout the process, and successfully completed the core experiments.

    It not only successfully matched the trend of prediction-probability changes during the SFT stage, but also clearly observed the squeezing effect highlighted in the DPO experiments, and successfully verified the Extend mitigation method proposed in the original paper.

    Multimodal capabilities were required to understand the curves, data, and formulas in the paper, while long context ensured that the paper, code, and experiment logs could all fit into the context window at once. Only with sufficiently strong coding and agent capabilities could the model complete the reproduction over a long thread, even with concurrent execution.

    M3 was able to do it all.

    CUDA Kernel Optimization

    FP8 matrix multiplication (GEMM) is one of the most compute-intensive parts of large model inference, and also one of the most difficult to optimize. Engineers must simultaneously handle multiple tightly coupled issues, including data layout, compute pipeline scheduling, and adaptation to hardware characteristics. On NVIDIA Hopper architecture GPUs, hand-writing a production-grade FP8 GEMM kernel typically requires one to two weeks of focused effort from an experienced team.

    We used this task to evaluate M3's long-horizon autonomous iteration capability. We asked MiniMax M3 to optimize this kernel on NVIDIA Hopper architecture GPUs. The model started with only a task description, a benchmark evaluation script, and a Triton skeleton that could not run directly, with no reference high-performance implementation available. This meant the model could not take shortcuts by imitating an existing solution; it had to start from first principles and autonomously explore the optimization path.

    Over the following approximately 24 hours of continuous execution, M3 completed 147 benchmark submissions and 1,959 tool calls. It independently went through the entire process from baseline implementation to production-grade optimization, including baseline implementation, autotune configuration generation, performance bottleneck diagnosis, CUDA Graph integration, persistent kernel rewriting, and host-side scheduling optimization. Each step was self-validated through benchmark feedback, with no human intervention required.

    In the end, after six landmark rounds of optimization, M3 improved Hopper FP8 hardware peak utilization from 7.6% in the first version to 71.3%, achieving a 9.4× speedup compared with the original version.

    Beyond the metrics, the model's execution process is also worth noting. Except for Opus 4.7 and M3, most other models stopped making new progress within the first 30 submissions and exited on their own. M3's best solution, however, appeared on its 145th submission. Before that point, the model went through multiple performance plateaus where no further improvement was observed, yet it continued exploring different optimization directions.

    The capabilities required here go beyond traditional code generation. The context produced by repeated tool calls is highly structured and dense, and this is where MSA's long-context attention allocation mechanism played an important role.

    Letting M3 Train Models

    In the CUDA operator optimization task, M3 demonstrated its long-horizon iteration capability on a single engineering task with a clear optimization objective and well-defined feedback signals. But real research work often does not have such a clear feedback structure; researchers are usually faced with more open-ended problems.

    We wanted to understand how M3 performs in scenarios that require autonomous decision-making, so we tested it on PostTrainBench. The task was as follows: give M3 four Base models that had only completed pretraining and did not yet possess any downstream capabilities, and have it autonomously complete the entire process of data synthesis, training, evaluation, and iteration within 12 hours. The final goal was to enable these models to acquire basic capabilities across mathematical reasoning (AIME2025), tool calling (BFCL), scientific knowledge reasoning (GPQA Main), basic arithmetic reasoning (GSM8K), and code generation (HumanEval).

    The entire process of data synthesis, training, evaluation, and iteration took place without any human intervention. The agent had to decide on its own what kind of data to synthesize, which training strategy to choose, and how to adjust the next round of plans based on evaluation results. M3 ultimately scored 0.37, slightly below Opus 4.7 (0.42) and GPT-5.5 (0.39), but clearly ahead of the other models.

    MiniMax Code

    With the release of M3, MiniMax Code has also been updated. As an agent product designed specifically for M3 and trained together with M3, MiniMax Code can fully leverage M3's capabilities in long context, coding/agentic tasks, and native multimodality, making it the preferred agent to pair with MiniMax-M3.

    For long-horizon complex tasks, MiniMax Code's Agent Team can break large tasks down into multi-stage, concurrent, and dynamically adjustable workflows, which are then advanced collaboratively by a cluster of agents. Through a Producer + Verifier adversarial harness loop, the Agent Team can continuously produce, reflect, and correct itself during execution. It can run autonomously for days without human intervention and ultimately deliver high-quality results.

    We have seen that Claude Code has also recently released Dynamic Workflows in a similar direction. Compared with Claude Code's stronger emphasis on fixed orchestration based on JS code, MiniMax Code focuses more on deep reflection and continuous error correction: the agent adjusts its plans and priorities in real time based on task progress, while users can step in at any time to add requirements or correct the direction.

    Thanks to M3's native multimodal capabilities, MiniMax Code also supports computer use. For example, a user can say on their phone: "Help me open the local ERP client and batch-enter invoice information based on this Excel spreadsheet." MiniMax Code will then automatically complete the required operations on the computer across applications, files, and systems.

    MiniMax Code is built on a harness based on the outstanding open-source community projects OpenCode and Pi. We also plan to open-source this project in the future as a way to give back to the open-source community.

    MiniMax Code desktop app: agent.minimaxi.com/download

    MiniMax Code can be used with MiniMax Token Plans.

    MiniMax Token Plan: Bringing Frontier Models to Developers' Daily Work

    MiniMax M3 is a frontier model built to serve more users.

    With this release, the MiniMax Token Plan has also been updated across three tiers:

    • Plus $20/month: ~1.7B tokens / month of M3 usage
    • Max $50/month: ~5.1B tokens / month of M3 usage
    • Ultra $120/month: ~9.8B tokens / month of M3 usage

    Among subscription plans at comparable price points, the MiniMax Token Plan offers one of the highest token quotas globally. Text, image, speech, and music all share the same usage pool.
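
    To compare the tiers on a per-token basis, the listed prices can be divided by the listed quotas. This is a back-of-the-envelope sketch: the quotas are approximate figures (marked "~" above), so the resulting rates are approximate too, and the helper name is made up for illustration.

```python
# Price in USD per month and approximate quota in billions of tokens per month,
# copied from the tier list above.
PLANS = {
    "Plus": (20, 1.7),
    "Max": (50, 5.1),
    "Ultra": (120, 9.8),
}

def usd_per_billion_tokens(price_usd, quota_billion):
    """Effective price of one billion tokens if the whole quota is used."""
    return price_usd / quota_billion

rates = {
    name: round(usd_per_billion_tokens(price, quota), 2)
    for name, (price, quota) in PLANS.items()
}
cheapest = min(rates, key=rates.get)

print(rates)
print(cheapest)
```

    Under these figures, and assuming the full quota is consumed, the effective rate is roughly $11.76, $9.80, and $12.24 per billion tokens for Plus, Max, and Ultra respectively.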

    All three tiers are fully available. Subscribe and start immediately!

    Subscription link:

    platform.minimax.io/subscribe/token-plan

    API

    The M3 API is now available.

    Pricing depends on input length: calls with up to 512K input tokens are billed at the standard rate, covering the vast majority of conversation and coding scenarios, while calls above 512K are billed at a higher long-context rate, mainly intended for high-load scenarios such as ultra-long document parsing and full-repository code understanding.
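
    The length-based billing rule can be sketched as a small helper. This is an illustration only: the post does not say whether "512K" means 512,000 or 524,288 tokens, or which side of the boundary an input of exactly 512K falls on, so both choices below are assumptions, and no actual prices are modeled.

```python
# Assumption: "512K" is read as 512,000 tokens and the boundary is inclusive.
LONG_CONTEXT_THRESHOLD = 512_000

def billing_tier(input_tokens: int) -> str:
    """Return which rate a call would fall under, based on input length only."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if input_tokens <= LONG_CONTEXT_THRESHOLD:
        return "standard"
    return "long-context"

print(billing_tier(8_000))      # a typical chat or coding request
print(billing_tier(800_000))    # e.g. a full-repository prompt
```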

    M3 supports toggling thinking on or off. With thinking enabled, the model is suited to complex reasoning, agentic tasks, and long-horizon collaboration; with it disabled, it responds faster, suiting latency-sensitive scenarios such as conversation and code completion. The two modes share the same pricing and can be switched as needed at request time.

    All prices can also be combined with two service levels: the default standard tier is suitable for regular requests; the priority tier (service_tier=priority) receives scheduling priority and more stable response latency under high-concurrency scenarios, making it suitable for SLA-sensitive industrial use cases. The priority channel is currently enabled through sales support and is expected to open to all users in a few days.
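
    A request that combines the thinking toggle with a service level might be assembled as below. Only `service_tier=priority` comes from the text above; the `thinking` field name, the chat-style `messages` schema, and the use of "MiniMax-M3" as the model id are assumptions made for illustration, so check the API guide for the real parameter names. The sketch only builds the request body and does not call any endpoint.

```python
import json

def build_request(prompt: str, thinking: bool = True, priority: bool = False) -> dict:
    """Build a request body; field names other than service_tier are hypothetical."""
    body = {
        "model": "MiniMax-M3",  # name as written in this post; the API id may differ
        "messages": [{"role": "user", "content": prompt}],  # assumed chat-style schema
        "thinking": thinking,   # hypothetical name for the thinking on/off switch
    }
    if priority:
        # Parameter named in the text; omitted for the default standard tier.
        body["service_tier"] = "priority"
    return body

fast = build_request("Complete this function", thinking=False)
urgent = build_request("Plan the refactor", priority=True)
print(json.dumps(urgent, indent=2))
```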

    API guide:

    platform.minimax.io/docs/api-reference/api-overview

    We will continue improving model serving stability and optimizing throughput. Over the next 10 days, we will release the model's technical report and open-source the corresponding model weights.

    Today, models are being updated at such a fast pace that it is easy to forget this is still a matter of steady, incremental progress. It follows its own objective laws, and it rewards teams that move forward solidly in accordance with those laws. Just as we believed at the very beginning of our founding, we will do our utmost to continuously improve the intelligence of our models and make them available to more users.

    Thank you for your trust, suggestions, and criticism.

    Intelligence with Everyone!

  • May 27, 2026
    • Date parsed from source:
      May 27, 2026
    • First seen by Releasebot:
      Jun 19, 2026
    MiniMax

    MiniMax Agent Team: Built for Long-Running Tasks and Continuous Evolution

    MiniMax introduces the upgraded MiniMax Agent as Mavis, adding Agent Teams for parallel multi-Agent collaboration and merging TokenPlan with Agent Plan into one subscription with shared credits across CLI, API, and Agent.

    Today we are introducing the overall upgrade of MiniMax Agent. We have given the upgraded Agent a new name: Mavis — MiniMax as a Jarvis, your AI butler.

    This release brings the following updates:

    • Launching Agent Teams. On MiniMax Agent desktop you can now run multiple Agents in parallel, create Agents in different roles, and have them form a team that collaborates on complex tasks — ideal for long, complex work a single Agent cannot finish alone.
    • Merging TokenPlan and Agent Plan. One subscription unifies CLI, API, and Agent across M2.7, music, video, and voice. Credits are shared between Agent and API for more flexible usage. If you previously subscribed to both plans, you will get an extra month of membership.

    This time we want to share the thinking behind Agent Teams: how did we design the Agent team? What problem does it solve? What cost did we pay? When should users adopt Agent Team, and when is it unnecessary?

    Let's first revisit how today's single-Agent setup actually executes.

    “Help me put together a long-form article about Agent Team. The information must be based on the latest data in 2026, and deliver both Markdown and HTML versions.”

    In the past, we would hand this sentence to a powerful AI assistant. It would reply immediately, pushing a long block of text back into the chat. The experience felt smooth, but as delivery quality requirements rise, problems surface: who gathers the sources? Who verifies the facts? Who lays out the document? And once today's job is done, will the system even remember the pitfalls next time?

    1. Why we need an Agent Team

    Even though you can iterate Skills to make a single Agent deliver well, when a single Agent produces the final result it is inevitably both the judge and the contestant. That contradiction is the starting point of Agent Team.

    Agent Team turns a complex task that was piled onto one Agent into a process with a front office and a back office, an acceptance step, and memory. The user still sends only one message, but the Agent Team system decides whether to split the task, which roles can run in parallel, which results must be verified, and which experience should be retained.

    Continuing the scenario above.

    “Help me put together a long-form article about Agent Team. The information must be based on the latest data in 2026, and deliver both Markdown and HTML versions.”

    A single Agent might finish this smoothly, like a coworker sitting next to the user. When the user asks “polish this paragraph”, it can edit immediately; when the user says “format is off here”, it confirms right away. But a few problems show up. For one, if the user does not command it, the Agent stops, and the user has to keep telling it to “confirm” or “keep going”.

    • A single Agent stops at moments the user does not expect.

    Users often see an Agent with 7 things to do that pauses after 3 edits and starts reporting: “I have finished edits 1, 2, 3. Would you like me to continue with the other 4?”

    This happens because models commonly have context anxiety, and training for very long tasks itself takes enormous amounts of money, time, and algorithmic effort. As a result, the model's judgment of when a task can stop is fuzzy.

    • A single Agent gets dumber over time — degradation is obvious on long tasks.

    Users often feel that as the Agent runs it shifts from “a smart assistant” to “managing someone who is busy but easily distracted”. The user keeps asking: do you still remember that earlier requirement? Why did you turn the research task into a product marketing pitch?

    As long as one step drifts, everything downstream will keep generating along the drift.

    Worse, a single Agent rarely forms natural “checks and balances”. It might honestly self-check, but it is still inspecting the very scene it just constructed itself.

    • A single Agent also cannot respond quickly to long-cycle tasks.

    Especially in IM scenarios (driving Agents through messaging apps), user patience is very short. After sending a message in IM, users expect a response within seconds. Even if the task is complex, users still want a first reply along the lines of: “got it, here is what I will do, I will come back when it is done.” They do not want to stare at a chat box for ten minutes, half an hour, or longer just to confirm the task has started.

    “Why isn't my Agent replying to me” is the largest single source of user feedback we receive.

    Agent Team offers a different experience. The main Agent first responds quickly to the user: I have received the task, the goal is confirmed, and I will split and execute in the background. The task is broken into multiple section bundles or multiple versions, and executed in parallel.

    Users no longer need to wait for every sub-step to finish. They receive reports at key checkpoints: task started, blocked, decision needed, completed.

    They can also chat with the main Agent anytime: “I just had another idea, can you also research it on the side”, and the main Agent can respond: “sure, I will spin up another Agent group now and report when there is progress. By the way, of the in-flight tasks 2/5 are done, 2 of the remaining 3 are in the verification stage, and the last one I will keep watching.”

    Just like a thoughtful friend who replies to your WeChat messages in seconds.

    • Stepping outside any one task, we naturally have to accept the diversity of user needs and the division of roles across domains.

    On the same day, a single user might ask the Agent to write code, gather information, build slides, summarize a meeting, read a PDF, work on spreadsheets, handle expense reports, plan a project, and generate a weekly report. Each kind of task has different input structure, tool permissions, quality criteria, risk level, and delivery format.

    A single Agent can temporarily play different roles via Skills, but role-playing is not the same as role specialization. Real specialization, from the context perspective alone, has at least four dimensions: different tools, different context, different memory, different skills. From the result side, output protocol and acceptance criteria are also different. Suppose we have already built the Agent Team system above — Agents with different responsibilities can repeatedly meet tasks in their own domain, turn pitfalls into memory and valuable actions into Skills, like a group of colleagues who work long-term with the user and keep getting better at their respective jobs.

    2. Multi-Agent collaboration in the industry today

    Table listing various products and how they implement multi-agent collaboration, their strengths and limitations:

    • OpenAI Agents SDK: Agents can hand off tasks or call others temporarily, preserves conversation and compliance. Strengths: clear model, suitable for split tasks, safety checks, easy productization. Limitations: sequential relays, limited parallelism, weak isolation.
    • LangGraph: Workflow with supervisor agent. Strengths: controllable flow, supports progress save/resume. Limitations: higher build/debug cost, weak standalone execution.
    • OpenCode: Primarily single-agent, acts as execution layer in multi-agent systems. Strengths: unified command system, fine-grained permissions. Limitations: no internal multi-agent mechanism.
    • OMC oh-my-claudecode — Team Pipeline: Multiple agents relay across stages with repair stage. Strengths: complete process, fixing after verification failure, good for complex coding. Limitations: heavyweight flow, high overhead, fixed stages.
    • Claude Code — Teams: Lead agent manages team, assigns tasks to isolated teammates. Strengths: deep integration, context isolation, full task management. Limitations: scheduling depends on lead stability, limited long-running.
    • OMC Ralph Loop / Ralph Mode: Parallel execution with repeated verification and patching. Strengths: quality completion, suitable for repeated polishing. Limitations: higher runtime, cost, requires clear check criteria.
    • OMC Autopilot + Ralph: Full chain from analysis to verification with ongoing fixing. Strengths: full process coverage, suitable for complex automation. Limitations: long system flow, dependent stages, requires clear acceptance criteria.

    3. MiniMax Agent Team: giving every Agent more freedom on top of a constrained multi-Agent loop

    MiniMax's Agent Team is a multi-Agent system led by a main Agent that splits complex tasks into parallel sub-tasks dispatched to multiple Agents, with adversarial quality gates and deterministic code logic. Inspired by Ralph-Loop and Harness, it splits tasks for context isolation, raising overall output quality.

    Key collaboration flow: Leader (translates user goal to task structure), Worker (executes sub-tasks with specific tools and specializations), Verifier (ensures deliverables meet quality, checks sources and risks, adversarial to Worker for high-quality outputs).

    Agent Team is interactive at any time, unlike traditional single-call Task tools. It uses a reliable state machine (Team Engine) to manage lifecycle: producing, verifying, done, with retries on verification failure. Leader can confirm task details proactively and send supplementary prompts. Collaboration is via multi-round interactions, not single function calls.
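
    The producing, verifying, done lifecycle with retries on verification failure can be sketched as a small state machine. This is a minimal illustration, not the Team Engine itself: the `failed` state, the retry limit, and the `produce`/`verify` callables are assumptions added to make the example self-contained.

```python
from enum import Enum
from typing import Callable, List, Tuple

class State(Enum):
    PRODUCING = "producing"
    VERIFYING = "verifying"
    DONE = "done"
    FAILED = "failed"  # assumption: not named in the text, used when retries run out

def run_task(
    produce: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_retries: int = 2,
) -> Tuple[str, List[State]]:
    """Alternate Worker and Verifier until the draft passes or retries are exhausted."""
    state = State.PRODUCING
    history = [state]
    feedback, draft, attempts = "", "", 0
    while state not in (State.DONE, State.FAILED):
        if state is State.PRODUCING:
            draft = produce(feedback)        # Worker uses the Verifier's last feedback
            attempts += 1
            state = State.VERIFYING
        else:
            ok, feedback = verify(draft)     # Verifier accepts or returns feedback
            if ok:
                state = State.DONE
            elif attempts > max_retries:
                state = State.FAILED
            else:
                state = State.PRODUCING      # retry after verification failure
        history.append(state)
    return draft, history

# Toy Worker and Verifier: the first draft lacks sources, the retry adds them.
def worker(feedback: str) -> str:
    return "draft with sources" if feedback else "draft"

def verifier(draft: str) -> Tuple[bool, str]:
    return (True, "") if "sources" in draft else (False, "cite sources")

result, history = run_task(worker, verifier)
print(result, [s.value for s in history])
```

    Keeping the transitions in plain code rather than in a prompt is one way to make the stop condition deterministic, which matches the "deterministic code logic" described above.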

    Experience is stored in memory and Skills for Agents to improve in efficiency and understanding over time.

    • Inter-Agent communication design: Agents and humans have equal rights

    Users can prompt, spawn, abort, or kill Agents via UI; Agents can also perform these operations on each other. Operations are abstracted into interfaces usable by users, Agents, or the Team Engine. Boundaries ensure permissions and accountability remain controlled.

    3.1 Core scenario 1: IM integration, async execution with fast response

    IM users expect replies within seconds but tasks may take minutes or hours. Single Agents must choose between shallow fast replies or full tasks with long silence. IM conversations evolve, so binding long tasks to a single model context is insufficient. The system must persist task state, event logs, files, and records. Agent collaboration is a stateful, long-term task.

    3.2 Core scenario 2: Coding Harness

    Harness extends the idea of an Agent writing code to cover the full development lifecycle: code on branches, sandboxes, diffs, runnable tests, reviews, failures with replay, task splits. Stop conditions are deterministic and come from external systems.

    Roles: Leader (control plane, decides granularity and escalation), Developer (implementation with clear briefs), Tester (runs verification), Reviewer (checks code quality and security, can run in parallel with specializations).

    Automated testing and code reviews are layered to ensure quality.

    3.3 Core scenario 3: parallel information retrieval and research

    A single Agent suffers from slow execution, polluted context, and biased thinking. Agent Team splits research into parallel channels and merges findings with verification. The Verifier ensures sources are verifiable and checks for stale evidence or counter-evidence.
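
    The split, run in parallel, verify, and merge pattern can be sketched with `asyncio`. This is an illustrative toy under stated assumptions: `research_channel` stands in for a real sub-agent call, the channel names and URLs are placeholders, and the verification rule (drop findings without a source) is a simplified stand-in for the Verifier described above.

```python
import asyncio

async def research_channel(name: str, question: str) -> dict:
    """Placeholder for one research Agent working in its own context."""
    await asyncio.sleep(0)  # stands in for search and model latency
    source = None if name == "forum" else f"https://example.com/{name}"
    return {"channel": name, "claim": f"{question} ({name})", "source": source}

def verify(finding: dict) -> bool:
    # Simplified Verifier: keep only findings that carry a checkable source.
    return finding["source"] is not None

async def run_research(question: str, channels: list) -> list:
    # Channels run concurrently; gather returns results in input order.
    findings = await asyncio.gather(
        *(research_channel(name, question) for name in channels)
    )
    return [f for f in findings if verify(f)]

verified = asyncio.run(run_research("agent teams", ["papers", "docs", "forum"]))
print([f["channel"] for f in verified])
```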

    3.4 Core scenario 4: pipeline-style office document writing

    A single Agent struggles with long documents: planning, sourcing, consistency, formatting, export quality. Agent Team splits the work into stages (Planner, Writer, Formatter, Evaluator), turning generation into a CI/CD-like pipeline with intermediate artifacts and checks.
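
    The staged document flow can be sketched as a chain of functions that each emit an intermediate artifact. This is a minimal sketch: the stage bodies are trivial stand-ins for Agents, and the outline, artifact fields, and the Evaluator's check are invented for illustration.

```python
def planner(topic):
    # Stage 1: decide the structure of the document.
    return {"topic": topic, "outline": ["Background", "Design", "Lessons"]}

def writer(plan):
    # Stage 2: fill every outline heading with text.
    sections = {h: f"Notes on {h.lower()} for {plan['topic']}." for h in plan["outline"]}
    return {**plan, "sections": sections}

def formatter(draft):
    # Stage 3: render the sections into one Markdown string.
    markdown = "\n\n".join(f"## {h}\n{draft['sections'][h]}" for h in draft["outline"])
    return {**draft, "markdown": markdown}

def evaluator(doc):
    # Stage 4: check that every planned heading made it into the output.
    missing = [h for h in doc["outline"] if f"## {h}" not in doc["markdown"]]
    return {**doc, "passed": not missing, "missing": missing}

def run_pipeline(topic):
    """Run the stages in order and keep each intermediate artifact for inspection."""
    artifacts = []
    value = topic
    for stage in (planner, writer, formatter, evaluator):
        value = stage(value)
        artifacts.append((stage.__name__, value))
    return value, artifacts

result, artifacts = run_pipeline("Agent Team")
print(result["passed"], [name for name, _ in artifacts])
```

    Keeping every stage's output in `artifacts` means a failed check can be traced to the stage that produced it, much like inspecting a failed CI job.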

    4. Hard problems and reflections during development

    Costs introduced: handoff cost (reorganizing info between Agents), sharing cost (token usage for shared context), aggregation cost (merging outputs consistently).

    Trade-off between time/token costs and return on investment (ROI) is critical. While multi-Agent increases consumption, it can provide verifiable, auditable results that improve trust and free users for higher-level thinking.

    Verifier and retries add cost; leader decisions must be precise, especially for high-risk actions requiring human sign-off.

    Overall, multi-Agent must be treated as a runtime environment managing complex states and interactions rather than simple prompt orchestration. This requires significant engineering for observability, permissions, and constraints.

    5. Lessons

    5.1 Multi-Agent exists to complete complex tasks more reliably

    Structure is core; without it, multi-Agent is costly concurrency. Valuable multi-Agent systems answer splitting, verifying, stopping, recovery, and memory management clearly.

    5.2 Team value depends on complexity, but ROI cannot be judged on short term alone

    Teams shine with long, deep, risky, reusable tasks. Simple or deterministic tasks are better for single Agents or automation. Memory and Skills growth over time contribute to long-term ROI.

    5.3 The future Agent will look more like a long-term digital team

    Agents and humans will share reciprocal control interfaces. Humans configure roles, tasks, boundaries and supervise key decisions with Agents handling execution. Management interfaces may become Agent-controlled. Stronger models are needed to automate scheduling and management further.

    6. Open source and how to use it

    MiniMax Agent will be open-sourced soon, expected to coincide with MiniMax M3 model release. Meanwhile, the desktop app is officially released and available at https://agent.minimaxi.com/download. A single MiniMax subscription now includes both Agent and TokenPlan.

  • May 26, 2026
    • Date parsed from source:
      May 26, 2026
    • First seen by Releasebot:
      Jun 19, 2026
    MiniMax

    Why Can't the MiniMax LLM Say "Ma Jiaqi"? Internal Investigation of Sparse Token Forgetting

    MiniMax shares an internal investigation into sparse token forgetting in the M2 series and says the issue has been resolved in subsequent model updates. The update explains how low-frequency token drift affected generation, fixed a minor-language mixing problem, and improved vocabulary coverage.

    Hypothesis 1: Token Misalignment Between Training and Inference

    From this case, the model still possesses the relevant knowledge — it can answer basic information about Ma Jiaqi (such as his group affiliation and debut date), indicating that the corresponding semantic representations have not been lost. The issue is that during generation, the model cannot output the token "Jiaqi (嘉祺)". Therefore, we first investigated at the tokenizer level: checking the token IDs for model input and expected output to confirm whether there is any mismatch in the text-to-token conversion process.

    Using the post-training tokenizer to encode "Ma Jiaqi (马嘉祺)", the results are as follows:

    Token IDs: [4143, 190467]

    Tokens: ['马', '嘉祺']

    Decode Verification: 马嘉祺

    Both encode and decode processes work correctly, but one noteworthy detail is that "Jiaqi (嘉祺)" is tokenized as a single independent token (id=190467). These two characters co-occur infrequently in everyday corpora, making their existence as a single token somewhat unexpected. This led to a hypothesis: could it be that different tokenizer strategies were used during pretraining versus post-training+serving? Specifically, perhaps during pretraining, "嘉祺" was actually split into two tokens ['嘉', '祺'], meaning the merged "嘉祺" token never received sufficient training. If the post-training and serving stages use the merged token, its generation probability would be very low (below 5%), and under a top-p = 0.95 sampling strategy, it would be masked and thus unable to be generated.
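
    The masking effect described above can be illustrated with a minimal sketch of top-p (nucleus) filtering. The probabilities and the placeholder tokens below are made up for illustration and are not taken from the investigation. Note that a token with probability above 5% is always kept under top-p = 0.95, while a token below 5% is masked only when it falls in the tail after the cumulative mass has already reached 0.95.

```python
def top_p_filter(probs, top_p=0.95):
    """Return the set of tokens kept by nucleus (top-p) sampling.

    Tokens are ranked by probability; the smallest prefix whose cumulative
    probability reaches top_p is kept, and everything after it is masked.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = set(), 0.0
    for token, p in ranked:
        kept.add(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


# Hypothetical next-token distribution after "马" (illustrative numbers only).
peaked = {"嘉": 0.60, "某": 0.36, "嘉祺": 0.03, "other": 0.01}
kept_peaked = top_p_filter(peaked)   # the 3% token falls in the masked tail

# A flatter hypothetical distribution: a 4% token can still survive.
flat = {"a": 0.50, "b": 0.42, "c": 0.04, "d": 0.04}
kept_flat = top_p_filter(flat)
```

    In the first distribution the merged token can never be sampled, which matches the failure mode the hypothesis proposes. The second shows that a low probability alone does not guarantee masking.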

    To verify this hypothesis, we examined the pretrained model's vocab embedding from both statistical distribution and semantic nearest-neighbor perspectives to confirm whether token 190467 ("嘉祺") was adequately trained during pretraining.

    1. Statistical Distribution Check: Comparing the embed_tokens norm distribution across the full vocabulary, token 190467 ("嘉祺") falls within the normal distribution range, without the anomalously small values typically seen in untrained tokens, indicating that this token was adequately learned during pretraining.

    2. Semantic Nearest-Neighbor Verification: Performing nearest-neighbor retrieval on the embedding of "嘉祺", the results include semantically highly relevant Chinese name tokens such as "Qianxi (千玺)" and "Yaxuan (亚轩)", indicating that the pretrained model had established reasonable semantic clustering for this token, and the tokenizer was aligned with model parameters during pretraining.

    Top-10 tokens closest to the "嘉祺" token embedding include personal names and sub-tokens of "嘉祺".

    3. Pretrain vs. Post-training Few-shot Comparison: To further pinpoint the stage at which the problem emerged, we conducted few-shot tests on both the pretrained base model and the post-trained model. Using other celebrity names as examples, we guided the model to answer questions involving "Ma Jiaqi (马嘉祺)":
    • Pretrained base model: Successfully continued generation with "The leader of TNT (Teens in Times) is Ma Jiaqi (马嘉祺)"; the "嘉祺" token was generated normally.
    • Post-trained model: The model still tended to avoid this token and could not output it normally.
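
    The first two checks above can be sketched as follows. This is a toy version that uses hand-written 3-dimensional vectors in place of the model's embed_tokens matrix. The vectors, the "<unused>" placeholder token, and the helper names are assumptions for illustration, not the team's actual tooling.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def norm_z_score(embeddings, token):
    """Standard deviations between a token's norm and the vocabulary mean."""
    norms = [norm(v) for v in embeddings.values()]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    return (norm(embeddings[token]) - mean) / std


def nearest_neighbors(embeddings, token, k=2):
    """Tokens whose embeddings have the highest cosine similarity to `token`."""
    query = embeddings[token]
    scored = [(cosine(query, v), t) for t, v in embeddings.items() if t != token]
    return [t for _, t in sorted(scored, reverse=True)[:k]]


# Toy embeddings (illustrative values only).
toy = {
    "嘉祺": [0.9, 0.1, 0.0],
    "千玺": [0.8, 0.2, 0.1],
    "亚轩": [0.7, 0.3, 0.0],
    "<unused>": [0.01, 0.0, 0.01],   # stands in for an untrained token
    "的": [0.0, 0.1, 0.9],
}

z_trained = norm_z_score(toy, "嘉祺")
z_untrained = norm_z_score(toy, "<unused>")
neighbors = nearest_neighbors(toy, "嘉祺")
```

    In this toy data the untrained token stands out with a strongly negative norm z-score, while the name token sits near the mean and retrieves other name tokens as its neighbors.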

    Combining all three verifications, we can rule out the tokenizer misalignment hypothesis: token 190467 ("嘉祺") was adequately trained during pretraining with correct semantic representations. The root cause must lie in the post-training stage.

    Hypothesis 2: Post-training Data Distribution Issues

    Since the problem originates in the post-training stage, a natural hypothesis is that the token "嘉祺" appeared too infrequently in post-training data, causing the model to gradually "forget" its ability to generate this token during SFT.

    Statistical analysis of the post-training data revealed fewer than 5 samples containing "嘉祺", essentially confirming this hypothesis.

    Of course, the most straightforward fix is to supplement the post-training data with relevant samples. But we are more interested in: what exactly changed inside the model? Are there intermediate metrics that can more precisely characterize this sparse token forgetting mechanism? Another intriguing question is: why does the model still recognize the "嘉祺" token — why did the lack of post-training data only cause it to lose generation capability while retaining comprehension?

    Exploring Intermediate Metrics

    Most of the model's capabilities (such as knowledge Q&A, instruction following, etc.) did not degrade after post-training, so representational changes in the Transformer's intermediate layers are unlikely to be the primary cause. A more reasonable focus is the two ends of the model — the input-side vocab embedding and the output-side lm_head — as these layers directly participate in token-level mapping and are sensitive to sparse token effects.

    Vocab Embedding: Nearly Unchanged

    Comparing the vocab embeddings before and after SFT, we found virtually no difference. This fits expectations: gradient norms attenuate layer-by-layer during backpropagation; for extremely low-frequency tokens, the embedding layer receives almost no effective gradient updates from the loss — only weight decay exerts a weak regularization effect. Thus, vocab embeddings remain stable before and after post-training.

    lm_head: Significant Changes

    The output-side lm_head weight vector corresponding to "嘉祺" underwent significant drift during post-training, shown by two aspects:

    1. Drastic Drop in Cosine Similarity and Large Norm Changes: Token 190467 ranks among the highest in magnitude of change across vocabulary, with its output representation substantially rewritten.

    2. Dramatic Shift in Nearest-Neighbor Semantic Structure: Before SFT, neighbors were semantically related Chinese personal names; after SFT, neighbor structure deteriorated with many special tokens and noise tokens flooding in, indicating vector space compression and contamination.
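
    A minimal sketch of the drift measurement, assuming the lm_head rows before and after SFT are available as plain vectors keyed by token. The 2-dimensional values below are invented for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_by_drift(before, after):
    """Sort tokens from most to least drifted (lowest cosine similarity first)."""
    sims = {tok: cosine(before[tok], after[tok]) for tok in before}
    return sorted(sims.items(), key=lambda kv: kv[1])


# Toy lm_head rows before and after SFT (illustrative values only).
before = {"嘉祺": [1.0, 0.0], "的": [0.0, 1.0], "the": [1.0, 1.0]}
after = {"嘉祺": [0.1, 0.9], "的": [0.0, 1.1], "the": [1.0, 0.9]}

ranking = rank_by_drift(before, after)
most_drifted_token, most_drifted_similarity = ranking[0]
```

    Sorting by cosine similarity surfaces the tokens whose output direction was rewritten, independent of changes in norm. A norm comparison would be a separate column in a fuller analysis.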

    Other Findings: Which Tokens Had the Largest lm_head Changes?

    Tokens with largest changes are categorized:

    1. Special Tokens (e.g., , ): expected large adjustments, as they appear rarely in SFT data.
    2. Japanese Colloquial / Web Templates: Largest category (~40%+), common in pretraining corpora but rare in SFT data, causing significant lm_head representational drift.
    3. LaTeX / Web Metadata: Academic paper formatting markers and Wikipedia source templates, rare in SFT data.
    4. Chinese SEO / Spam Text: SEO spam keywords learned from pretraining crawler data, absent during SFT.

    The significant degradation of Japanese tokens explains earlier observed minor-language mixing problems.

    Conclusion

    The core cause of sparse token forgetting is uneven vocabulary coverage in post-training data, which causes the lm_head representations of low-frequency tokens to drift during SFT. Because the input embedding layer receives only sparse updates and stays nearly unchanged, comprehension persists while generation capability is lost.

    Validation & Repair Experiments

    To address the issue, repair experiments focused on improving vocabulary coverage were designed.

    Vocabulary Coverage Synthetic Data

    Additional synthetic repetition data covering the full vocabulary ensured that every token was adequately trained during post-training. The data was constructed by randomly partitioning the 200,064 tokens into segments of about 8,000 tokens each, shuffling the tokens to form conversation samples whose instruction is to repeat the content, and generating about 500 conversations so that each token appears as a target at least 20 times.
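
    The construction can be sketched as follows, under stated assumptions: the sketch works directly on token IDs, re-shuffles the vocabulary on every pass, and uses a hypothetical prompt_ids/target_ids sample layout. The actual instruction wording and data format are not given in the source. With 200,064 tokens and 8,000-token segments, one pass yields 26 segments (the last one short), so 20 passes give 520 conversations, consistent with the roughly 500 reported.

```python
import math
import random


def build_repeat_conversations(vocab_size, segment_size, passes, seed=0):
    """Build 'repeat this content' samples that cover every token ID.

    Each pass shuffles the full vocabulary and cuts it into segments, so
    every token appears as a target exactly once per pass.
    """
    rng = random.Random(seed)
    token_ids = list(range(vocab_size))
    conversations = []
    for _ in range(passes):
        rng.shuffle(token_ids)  # fresh random partition on every pass
        for start in range(0, vocab_size, segment_size):
            segment = token_ids[start:start + segment_size]
            conversations.append({
                "prompt_ids": segment,  # user turn: repeat instruction plus content (layout is hypothetical)
                "target_ids": segment,  # assistant turn: the same tokens as training targets
            })
    return conversations


# Arithmetic for the sizes quoted in the text.
segments_per_pass = math.ceil(200_064 / 8_000)
total_conversations = segments_per_pass * 20

# Scaled-down demo so the example runs instantly.
demo = build_repeat_conversations(vocab_size=1_000, segment_size=40, passes=20)
```

    Working in token-ID space avoids a subtle problem: decoding shuffled tokens to text and re-encoding them may not reproduce the same token sequence, which would defeat the coverage guarantee.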

    Evaluation Methods

    Test categories compared the experimental group (+full vocabulary coverage data) vs baseline:

    1. Minor-Language Confusion Rate Test (using Korean and Japanese prompts)
    2. Ma Jiaqi Case Qualitative Verification
    3. Group Chat Comparison Cases verifying fixes for known failure cases
    4. lm_head High-Degradation Token Targeted Test for tokens with largest cosine similarity changes

    Experimental Results

    Japanese→Russian confusion dropped from the baseline's 47%/5% to 1%, a significant improvement. The Korean confusion rate was stable. In the Ma Jiaqi case, the experimental group answered direct queries correctly, while the baseline failed guided queries.

    Known group chat failure cases related to sparse token substitution were fixed by experimental group.

    lm_head High-Degradation Token Targeted Test passed more cases in experimental group than baseline.

    lm_head Cosine Similarity Quantitative Analysis showed experimental group preserved near-perfect embedding direction similarity (mean cosine similarity above 0.999) across languages, while baseline showed severe Japanese token degradation.

    Other Directions Worth Exploring

    1. Mixing in Pretraining Data during SFT
    2. Targeted Synthesis for Low-Frequency Tokens
    3. Vocabulary Pruning + Continual Pre-Training (CPT) to realign embedding space

    Deeper Reflections

    Sparse token degradation reflects mismatch between tokenizer vocabulary design and downstream use cases. Tokenizers built on pretraining corpora include many low-frequency tokens that, due to distribution shifts with post-training data, undergo parameter drift and forgetting.

    Future tokenizer vocabulary construction should consider post-training data distributions to reduce sparse tokens unlikely to be activated downstream, aligning vocabulary with use cases. Post-training data must provide sufficient coverage across tasks, and low-frequency token generation probabilities should be monitored to detect decay.

    Original source
  • Mar 18, 2026
    • Date parsed from source:
      Mar 18, 2026
    • First seen by Releasebot:
      Jun 19, 2026
    MiniMax logo

    MiniMax

    MiniMax M2.7: Early Echoes of Self-Evolution

    MiniMax launches M2.7, a new model built for agentic workflows, software engineering, and office tasks. It highlights self-evolving harnesses, stronger coding and editing performance, richer tool use, and improved emotional intelligence, and is now fully available on MiniMax Agent and the API Platform.

    In the months following the first release of our M2-series models, we received a large volume of feedback and suggestions from enthusiastic users and developers, which drove us to further accelerate the efficiency of our model iterations. With human productivity already fully unleashed, the natural next step was to initiate self-evolution of both the model and the organization. M2.7 is our first model deeply participating in its own evolution.

    M2.7 is capable of building complex agent harnesses and completing highly elaborate productivity tasks, leveraging capabilities such as Agent Teams, complex Skills, and dynamic tool search. For example, when developing M2.7, we let the model update its own memory and build dozens of complex skills in its harness to help with reinforcement learning experiments. We further let the model improve its learning process and harness based on the experiment results. This process initiates a cycle of model self-evolution.

    • M2.7 delivers outstanding performance in real-world software engineering, including end-to-end full project delivery, log analysis, bug troubleshooting, code security, machine learning, and more. On the SWE-Pro benchmark, M2.7 scored 56.22%, nearly approaching Opus's best level. This capability also extends to end-to-end full project delivery scenarios (VIBE-Pro 55.6%) and deep understanding of complex engineering systems on Terminal Bench 2 (57.0%).
    • We have also enhanced the model's expertise and task delivery capabilities across various fields in the professional office software domain. Its ELO score on GDPval-AA is 1495, the highest among open-source models. M2.7 shows significantly improved ability for complex editing in the Office suite — Excel, PPT, and Word — and can better handle multi-round revisions and high-fidelity editing. M2.7 is capable of interacting with complex environments: It maintains a 97% skill adherence rate while working with over 40 complex skills, each exceeding 2,000 tokens.
    • M2.7 exhibits excellent character consistency and emotional intelligence, opening up more room for product innovation.

    Based on these capabilities, M2.7 is also significantly accelerating our own evolution into an AI-native organization.

    Building an agent for model self-evolution

    We first share an internal workflow that enables the M2-series models to self-evolve. This workflow also serves as an exploration of the boundaries of the model's agentic capabilities.

    Modern agent harnesses combine complex skills, memory, and other external modules to improve their adaptability to various workspace environments. At MiniMax, our agents routinely face very complex and disparate working environments spanning multiple departments. To improve the robustness of our agents in these heterogeneous environments, we tasked an internal version of M2.7 with building a research agent harness that interacts and collaborates with different research project groups. The harness supports data pipelines, training environments, infrastructure, cross-team collaboration, and persistent memory, enabling researchers to drive it to deliver better models. The research agent harness drives the iteration cycle that produces the next generation of models under the guidance set by researchers.

    An exemplary workflow lies in the daily routine of our RL team. A researcher starts by discussing an experimental idea with the agent, who helps with literature review, tracks a pre-set experiment spec, pipelines data and other artifacts, and launches experiments. During the experiments, the agent monitors and profiles the experiment's progress and automatically triggers log reading, debugging, metric analysis, code fixes, merge requests, and smoke tests, identifying and configuring subtle yet key changes. These could have required the collaboration of multiple human researchers from different teams before, but now human researchers only interact for critical decisions and discussions. This accelerates problem discovery and experimentation, delivering models faster. Here, M2.7 is capable of handling 30%-50% of the workflow.

    During the iteration process, we realized that the model's ability to recursively evolve its own harness is also critical. Our internal harness autonomously collects feedback, builds evaluation sets for internal tasks, and based on this continuously iterates its own architecture, skills/MCP implementation, and memory mechanisms to complete tasks better and more efficiently.

    For example, we had M2.7 optimize a model's programming performance on an internal scaffold. M2.7 ran entirely autonomously, executing an iterative loop of "analyze failure trajectories → plan changes → modify scaffold code → run evaluations → compare results → decide to keep or revert changes" for over 100 rounds. During this process, M2.7 discovered effective optimizations for the model: systematically searching for the optimal combination of sampling parameters such as temperature, frequency penalty, and presence penalty; designing more specific workflow guidelines for the model; and adding loop detection and other optimizations to the scaffold's agent loop. Ultimately, this achieved a 30% performance improvement on internal evaluation sets.
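
    The keep-or-revert loop described above resembles a simple hill-climbing search. The sketch below is a toy illustration under that assumption, not the internal scaffold. The evaluate and propose functions are hypothetical stand-ins for "run evaluations" and "plan changes, modify scaffold code", and the quadratic score with a peak at temperature 0.7 is invented.

```python
import random


def evolve(config, evaluate, propose, rounds=100, seed=0):
    """Repeatedly propose a change, evaluate it, and keep it only if it helps."""
    rng = random.Random(seed)
    best_score = evaluate(config)
    history = []
    for _ in range(rounds):
        candidate = propose(config, rng)   # plan changes and modify the scaffold
        score = evaluate(candidate)        # run evaluations
        if score > best_score:             # compare results
            config, best_score = candidate, score
            history.append(("keep", score))
        else:
            history.append(("revert", score))  # config stays unchanged
    return config, best_score, history


def toy_evaluate(cfg):
    # Hypothetical score that peaks when temperature is 0.7.
    return -(cfg["temperature"] - 0.7) ** 2


def toy_propose(cfg, rng):
    changed = dict(cfg)  # copy so a rejected candidate never alters the current config
    changed["temperature"] = cfg["temperature"] + rng.uniform(-0.1, 0.1)
    return changed


start = {"temperature": 1.0}
final, best, history = evolve(start, toy_evaluate, toy_propose)
```

    Because a candidate is adopted only when its score strictly improves, the best score never decreases across rounds, which is the property that makes long unattended runs safe to leave running.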

    We believe that future AI self-evolution will gradually transition towards full autonomy, coordinating data construction, model training, inference architecture, evaluation, and other stages without human involvement.

    To this end, we conducted preliminary exploratory tests in low-resource scenarios. We had M2.7 participate in 22 machine learning competitions at the MLE Bench Lite level open-sourced by OpenAI. These competitions can be run on a single A30 GPU, yet they cover virtually all stages of machine learning workflow.

    We designed and implemented a simple harness to guide the agent in autonomous optimization. The core modules include three components: short-term memory, self-feedback, and self-optimization. Specifically, after each iteration round, the agent generates a short-term memory markdown file and simultaneously performs self-criticism on the current round's results, thereby providing potential optimization directions for the next round. We ran a total of three trials, each with 24 hours for iterative evolution. The best run achieved 9 gold medals, 5 silver medals, and 1 bronze medal. The average medal rate across the three runs was 66.6%, a result second only to Opus-4.6 (75.7%) and GPT-5.4 (71.2%), tying with Gemini-3.1 (66.6%).
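
    As a quick check on the figures above, the best run's medal rate can be computed from the numbers in the text. The only inputs are the 22 competitions, the best run's medal counts, and the reported 66.6% average. The per-run medal average is derived arithmetic, not a reported result.

```python
competitions = 22

# Best run: gold + silver + bronze medals, as reported in the text.
best_run_medals = 9 + 5 + 1
best_run_rate = best_run_medals / competitions

# Three-run average medal rate, as reported in the text.
reported_average_rate = 0.666
average_medals_per_run = reported_average_rate * competitions

best_run_percent = round(best_run_rate * 100, 1)
average_medals_rounded = round(average_medals_per_run, 1)
```

    The best run medals in 15 of 22 competitions, about 68.2%, slightly above the three-run average, which corresponds to roughly 14.7 medals per run.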

    Professional Software Engineering

    In software engineering tasks, M2.7 more deeply explores real-world programming abilities, including log analysis for bug hunting, refactoring, code security, machine learning, Android development, and more.

    Take a common production scenario as an example: debugging in a live environment. This requires not just code generation, but strong comprehensive reasoning abilities. When faced with alerts in production, M2.7 can correlate monitoring metrics with deployment timelines to perform causal reasoning, conduct statistical analysis on trace sampling and propose precise hypotheses, proactively connect to databases to verify root causes, pinpoint missing index migration files in the code repository, and even have the awareness to use non-blocking index creation to stop the bleeding first before submitting a merge request. From observability analysis and database expertise to SRE-level decision-making — this is not merely a model that can write code, but one that truly understands production systems. Compared to traditional manual troubleshooting processes, using M2.7, we have on multiple occasions reduced the recovery time for live production system incidents to under three minutes.

    In terms of raw programming capabilities, M2.7 has reached the level of SOTA models. On SWE-Pro, which covers multiple programming languages, M2.7 achieved a 56.22% accuracy rate, matching GPT-5.3-Codex. It demonstrates an even more notable advantage on benchmarks closer to real-world engineering scenarios, such as SWE Multilingual (76.5) and Multi SWE Bench (52.7).

    This capability also extends to end-to-end full project delivery scenarios. On the repo-level code generation benchmark VIBE-Pro, M2.7 scored 55.6%, nearly on par with Opus 4.6 — meaning that whether the requirement involves Web, Android, iOS, or simulation tasks, they can be handed directly to M2.7 to complete.

    What deserves even more attention is its deep understanding of complex engineering systems. On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), both of which demand a high degree of system-level comprehension, M2.7 also performs solidly. This further confirms that it excels not only at code generation but can also deeply understand the operational logic and collaborative dynamics of software systems.

    To improve development efficiency, one particularly important feature is native Agent Teams (multi-agent collaboration). Agent Teams impose paradigm-level demands on the model: role boundaries, adversarial reasoning, protocol adherence, and behavioral differentiation — these cannot be achieved through prompting alone and must be internalized as native capabilities of the model. In Agent Teams scenarios, the model needs to stably anchor its role identity, proactively challenge teammates' logical and ethical blind spots, and make autonomous decisions within complex state machines. Below is an Agent Teams setup we use internally for product prototype development, which contains a minimal organization for building product prototypes.

    Professional Work

    Beyond software engineering, agents are becoming increasingly useful in office scenarios. We believe this comes down to two core capabilities:

    Domain expertise and task delivery capability.

    The model needs to possess professional knowledge across various fields and understand user requirements. In the GDPval-AA evaluation, which measures this capability, M2.7 achieved an ELO score of 1495 among 45 models, second only to Opus 4.6, Sonnet 4.6, and GPT5.4, and surpassing GPT5.3. For the most common office document processing tasks, we systematically optimized the model's ability to handle Word, Excel, and PPT. Across various agent harnesses, M2.7 can both generate files directly based on templates and skills, and follow users' interactive instructions to perform multiple rounds of high-fidelity editing on existing files, ultimately producing editable deliverables.

    Ability to interact with complex environments.

    Generalized everyday scenarios mean the model must flexibly adapt to various contexts, invoke diverse skills and tools, and maintain stable instruction adherence throughout extended interactions. M2.7 has made substantial improvements in these areas. On Toolathon, M2.7 achieved an accuracy of 46.3%, reaching the global top tier. In MM Claw testing, M2.7 maintained a 97% skill compliance rate across 40 complex skills (each exceeding 2,000 tokens).

    We tested the model's professional proficiency in finance. For example, in a scenario involving reading research reports and modeling a company's future revenue, M2.7 can autonomously read annual reports and earnings call minutes, cross-reference multiple research reports, independently design assumptions and build a revenue forecast model, and then produce a PPT and research report based on templates. The feedback from practitioners is that the output can already serve as a first draft and go directly into subsequent workflows. Below is an example for TSMC.

    Task:
    Based on TSMC's annual report and earnings call information, build a revenue model for TSMC. Read multiple research reports, design corresponding assumptions, model TSMC's revenue based on the latest information, then produce a PPT based on a PPT template, and write a Word document research report.

    The recent surge in popularity of OpenClaw is representative of a thriving agent ecosystem, and we are pleased that our M2-series models have contributed to the community's flourishing. Based on commonly used tasks in OpenClaw, we built an evaluation set called MM Claw, covering a wide range of real-world needs in both work and life. M2.7 achieved a level close to Sonnet 4.6 on this test, with an accuracy of 62.7%.

    Entertainment

    With OpenClaw and similar personal agents, we noticed that beyond getting work done, many users also want the model to have high emotional intelligence and character consistency. With a persona in place, users start interacting with OpenClaw like a friend. We believe this presents an opportunity to extend the use of agentic models beyond pure productivity into interactive entertainment. To this end, we strengthened character consistency and conversational capabilities in M2.7.

    Based on this, we built a preliminary demo: OpenRoom, an interaction system based on an agent harness that liberates AI interaction from plain text streams and places it within a Web GUI space where everything is interactive. Here, character settings are no longer cold chunks of prompts; conversation drives the experience, generating real-time visual feedback and scene interactions, with characters proactively engaging with their environment. We believe this framework is highly extensible and can continue to evolve alongside improvements in agentic capabilities and community development, exploring entirely new ways for humans and agents to interact.

    To encourage exploration in this area, we have open-sourced the initial demo (of which most of the code was written by AI):

    • Project repository: github.com/MiniMax-AI/OpenRoom
    • Try it now: openroom.ai

    MiniMax M2.7 is now fully available on MiniMax Agent and the MiniMax API Platform. We look forward to users and developers exploring even more interesting use cases with M2.7.

    MiniMax Agent: agent.minimax.io
    API: platform.minimax.io
    Coding Plan: platform.minimax.io/subscribe/coding-plan

    Intelligence with Everyone.

    Original source
  • Mar 4, 2026
    • Date parsed from source:
      Mar 4, 2026
    • First seen by Releasebot:
      Mar 5, 2026
    MiniMax logo

    MiniMax

    Music 2.5+: Unlock instrumental music, break through style boundaries

    MiniMax Music 2.5+ launches instrumental music creation, expanding from song generation to full instrumental scores across classical, electronic, ambient, and ethnic timbres. It enables film and TV scoring, ads, and game soundtracks with cross-style fusion and studio-quality production.

    MiniMax Music 2.5 Launch

    Today, we are pleased to announce that MiniMax Music 2.5 has officially launched its instrumental music creation capability. MiniMax Music has always centered on song generation. Today, we are extending our capabilities to a more essential form of music: instrumental music. No vocals needed; the music itself becomes the expression.

    Unlock All Styles

    MiniMax Music supports diverse generation styles including classical orchestration, minimalism, modern electronic, ambient sounds, and natural soundscapes. It covers the full spectrum from quiet atmospheres to powerful, high-energy tracks, adapting to meditation, sleep aids, advertising, game scoring, and other scenarios. The MiniMax Music model can handle the complete range of complexity from "pure natural sound without instruments" to "multi-track instrumental arrangements," with style switching requiring no additional tuning: generate and use immediately.

    Sleep Aid Music
    Prompt: A lullaby featuring music box as the primary timbre, with an extremely slow tempo, gentle melody, suitable for falling asleep late at night.

    Meditation
    Prompt: Create extremely serene, extremely slow-paced meditation music. The background features soft, water-like flowing synth ambient pads, decorated with crisp Tibetan singing bowls and minimalist xylophone taps. The overall atmosphere is ethereal and deep, as if standing on a temple above the clouds, with no heavy percussion, designed to help listeners achieve deep inner peace.

    Natural Soundscape
    Prompt: A healing late-night rainstorm, the crisp sound of raindrops hitting the roof and leaves, distant low and gentle thunder, minimalist, white noise.

    Advertising / Brand Video Intro
    Prompt: A minimalist, tech-inspired brand intro track centered on pulsing synthesizers, precise and restrained in tone.

    Game Music
    Prompt: Electric guitar-driven uplifting melody, adding passion to adventure and combat.

    The instrumental music capability also enables MiniMax Music to serve film and TV scoring directly. Films, short dramas, documentaries, and TV series each have different scoring requirements. Based on scene descriptions, the model generates complete soundtracks that match the narrative rhythm, covering various emotional types and atmospheric needs.

    Film scoring
    Prompt: A minimalist cinematic score driven by pulsing synthesizers, with tight and precise rhythms.

    Cross-Genre Fusion, Unleash Imagination

    Beyond existing styles, MiniMax Music has strong style generalization capabilities, supporting cross-style tag combinations for generation.

    Whether traditional instruments with modern electronic, or Eastern timbres with Western structures, the model can understand the tension between different styles and transform them into coherent musical language, rather than simple element collage.

    This fusion is built on solid musicality: rich harmonic layers, complete melodic progression with a proper beginning, development, transition, and conclusion, and natural transitions from motif development to climax release. The more cross-style the work, the more it demonstrates the model's deep understanding of musical structure. In terms of audio quality, the sound field has distinct three-frequency separation, clear instrument separation with dynamic balance, and independent spatial positioning for each track, ensuring professional production standards across different styles.

    It is worth mentioning that MiniMax Music's understanding and reproduction of traditional Chinese musical instruments is at an industry-leading level. MiniMax Music can accurately present the tonal expressiveness and performance details of ethnic instruments such as flute, pipa, and guzheng, naturally integrating them into orchestral arrangements and modern production contexts.

    Epic orchestral music
    Prompt: Epic cinematic East Asian fusion, 136BPM, virtuosic Chinese bamboo flute (Dizi) leading a powerful orchestra. Intense Taiko drum beats, martial arts atmosphere, heroic and urgent. Dramatic shifts between fierce action and lyrical reflection. High energy, triumphant climax.

    Baroque Metal: Baroque × Hardcore Heavy Metal
    Prompt: A gorgeous auditory metamorphosis. Crisp, rigorous Baroque harpsichord polyphonic melody, suddenly invaded by violent blast beats and heavily distorted heavy metal guitars. Complex classical harmonics perfectly fused with modern metal's aggressiveness, creating a grand and chaotic opera-style metal listening experience.

    Chinese Style × Fantasy Epic
    Prompt: A Chinese-style pure music depicting an adventure in a vast fantasy world. The music atmosphere is hopeful, led by a retro cello solo melody, accompanied by rhythmic percussion. Overall dynamic range is wide, creating a rich sense of layering.

    Welcome to MiniMax Music 2.5+, unlock your musical creativity!

    Product Experience:
    https://www.minimax.io/audio/music

    API Interface:
    https://platform.minimax.io/docs/api-reference/music-generation

    Original source
  • Feb 12, 2026
    • Date parsed from source:
      Feb 12, 2026
    • First seen by Releasebot:
      Jun 19, 2026
    MiniMax logo

    MiniMax

    MiniMax M2.5: Built for Real-World Productivity.

    MiniMax introduces M2.5, a faster and more cost-efficient frontier model for coding, agentic tool use, search, and office work. It also adds M2.5-Lightning and deploys M2.5 in MiniMax Agent with new Office Skills and Expert workflows.

    Today we're introducing our latest model, MiniMax-M2.5.

    Extensively trained with reinforcement learning in hundreds of thousands of complex real-world environments, M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

    Trained to reason efficiently and decompose tasks optimally, M2.5 exhibits tremendous speed in performing complicated agentic tasks, completing the SWE-Bench Verified evaluation 37% faster than M2.1, matching the speed of Claude Opus 4.6.

    M2.5 is the first frontier model where users do not need to worry about cost, delivering on the promise of intelligence too cheap to meter. It costs just $1 to run the model continuously for an hour at a rate of 100 tokens per second. At 50 tokens per second, the cost drops to $0.30. We hope that the speed and cost effectiveness of M2.5 enable innovative new agentic applications.
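
    The two price points above can be converted into an implied cost per million generated tokens. This is back-of-the-envelope arithmetic from the figures in the text only: it assumes the hourly cost is spread evenly over output tokens and ignores any input-token pricing, so it should not be read as an official price list.

```python
def implied_cost_per_million(dollars_per_hour, tokens_per_second):
    """Dollars per one million generated tokens at a steady generation rate."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# $1 per hour at 100 tokens per second (360,000 tokens per hour).
fast = implied_cost_per_million(1.00, 100)

# $0.30 per hour at 50 tokens per second (180,000 tokens per hour).
slow = implied_cost_per_million(0.30, 50)

fast_rounded = round(fast, 2)
slow_rounded = round(slow, 2)
```

    Under these assumptions the faster rate works out to roughly $2.78 per million generated tokens and the slower rate to roughly $1.67, so the slower tier is cheaper per token as well as per hour.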

    Coding

    In programming evaluations, MiniMax-M2.5 saw substantial improvements compared to previous generations, reaching SOTA levels. The performance of M2.5 in multilingual coding tasks is especially pronounced.

    A significant improvement from previous generations is M2.5's ability to think and plan like an architect. The Spec-writing tendency of the model emerged during training: before writing any code, M2.5 actively decomposes and plans the features, structure, and UI design of the project from the perspective of an experienced software architect.

    M2.5 was trained on over 10 languages (including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby) across more than 200,000 real-world environments. Going far beyond bug-fixing, M2.5 delivers reliable performance across the entire development lifecycle of complex systems: from 0-to-1 system design and environment setup, to 1-to-10 system development, to 10-to-90 feature iteration, and finally 90-to-100 comprehensive code review and system testing. It covers full-stack projects spanning multiple platforms including Web, Android, iOS, and Windows, encompassing server-side APIs, business logic, databases, and more, not just frontend webpage demos.

    To evaluate these capabilities, we also upgraded the VIBE benchmark to a more complex and challenging Pro version, significantly increasing task complexity, domain coverage, and evaluation accuracy. Overall, M2.5 performs on par with Opus 4.5.

    We focused on the model's ability to generalize across out-of-distribution harnesses. We tested performance on the SWE-Bench Verified evaluation set using different coding agent harnesses.

    • On Droid: 79.7 (M2.5) > 78.9 (Opus 4.6)
    • On OpenCode: 76.1 (M2.5) > 75.9 (Opus 4.6)

    Search and Tool calling

    Effective tool calling and search are prerequisites for a model's ability to autonomously handle more complex tasks. In evaluations on benchmarks such as BrowseComp and Wide Search, M2.5 achieved industry-leading performance. At the same time, the model's generalization has also improved — M2.5 demonstrates more stable performance when facing unfamiliar scaffolding environments.

    In research tasks performed by professional human experts, using a search engine is only a small part of the process; most of the work involves deep exploration across information-dense webpages. To address this, we built RISE (Realistic Interactive Search Evaluation) to measure a model's search capabilities on real-world professional tasks. The results show that M2.5 excels at expert-level search tasks in real-world settings.

    Compared to its predecessors, M2.5 also demonstrates much better decision-making when handling agentic tasks: it has learned to solve problems with more precise search rounds and better token efficiency. For example, across multiple agentic tasks including BrowseComp, Wide Search, and RISE, M2.5 achieved better results while using approximately 20% fewer rounds than M2.1. This indicates that the model is no longer just getting the answer right, but is also reaching results along more efficient paths.

    Office work

    M2.5 was trained to produce truly deliverable outputs in office scenarios. To this end, we collaborated closely with senior professionals in fields such as finance, law, and social sciences. They designed requirements, provided feedback, participated in defining standards, and directly contributed to data construction, bringing the tacit knowledge of their industries into the model's training pipeline. On this foundation, M2.5 has achieved significant capability improvements in high-value workspace scenarios such as Word, PowerPoint, and Excel financial modeling.

    On the evaluation side, we built an internal Cowork Agent evaluation framework (GDPval-MM) that assesses both the quality of the deliverable and the professionalism of the agent's trajectory through pairwise comparisons, while also monitoring token costs across the entire workflow to estimate the model's real-world productivity gains. In comparisons against other mainstream models, M2.5 achieved an average win rate of 59.0%.
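
    As an illustration of how an average win rate can be derived from pairwise win/tie/loss judgments, here is a minimal sketch. The function names, the sample data, and the choice to count a tie as half a win are assumptions made for this example; the source does not say how ties are weighted or how opponents are averaged.

```python
from collections import Counter


def win_rate(outcomes, tie_weight=0.5):
    """Win rate against one opponent from 'win' / 'tie' / 'loss' labels.

    tie_weight is an assumption: the source does not say how ties count.
    """
    counts = Counter(outcomes)
    total = counts["win"] + counts["tie"] + counts["loss"]
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["win"] + tie_weight * counts["tie"]) / total


def average_win_rate(judgments_by_opponent, tie_weight=0.5):
    """Unweighted mean of the per-opponent win rates (also an assumption)."""
    rates = [win_rate(o, tie_weight) for o in judgments_by_opponent.values()]
    if not rates:
        raise ValueError("no opponents to average over")
    return sum(rates) / len(rates)


# Made-up judgments, for illustration only.
example = {
    "model_a": ["win", "win", "loss", "tie"],
    "model_b": ["win", "loss"],
}
print(f"average win rate: {average_win_rate(example):.4f}")
```

    With these made-up labels the per-opponent rates are 0.625 and 0.5, so the average is 0.5625. A different tie weight or a task-weighted mean would give a different figure.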

    Efficiency

    Because the real world is full of deadlines and time constraints, task completion speed is a practical necessity. The time it takes a model to complete a task depends on its task decomposition effectiveness, token efficiency, and inference speed. M2.5 is served natively at a rate of 100 tokens per second, which is nearly twice that of other frontier models. Further, our reinforcement learning setup incentivizes the model to reason efficiently and break down tasks optimally. Due to these three factors, M2.5 delivers significant time savings in complex task completion.

    For example, when running SWE-Bench Verified, M2.5 consumed an average of 3.52 million tokens per task. In comparison, M2.1 consumed 3.72M tokens. Meanwhile, thanks to improvements in capabilities such as parallel tool calling, the end-to-end runtime decreased from an average of 31.3 minutes to 22.8 minutes, representing a 37% speed improvement. This runtime is on par with Claude Opus 4.6's 22.9 minutes, while the total cost per task is only 10% that of Claude Opus 4.6.
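
    The figures above can be checked with a few lines of arithmetic. In this sketch the function names are ours. Note that the 37% figure is a speed ratio (old time divided by new time); expressed as a reduction in wall-clock time, the same numbers give roughly 27%.

```python
def speedup_percent(old_minutes, new_minutes):
    """Speed gain: how much more work per minute the new run does."""
    return (old_minutes / new_minutes - 1.0) * 100.0


def reduction_percent(old_value, new_value):
    """Relative drop from old_value to new_value."""
    return (1.0 - new_value / old_value) * 100.0


# Numbers quoted in the text: M2.1 -> M2.5 on SWE-Bench Verified.
print(round(speedup_percent(31.3, 22.8)))       # speed improvement, percent
print(round(reduction_percent(31.3, 22.8)))     # runtime reduction, percent
print(round(reduction_percent(3.72, 3.52), 1))  # token reduction, percent
```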

    Cost

    Our goal in designing the M2-series of foundation models is to power complex agents without users having to worry about cost. We believe that M2.5 is close to realizing this goal. We’re releasing two versions of the model, M2.5 and M2.5-Lightning, that are identical in capability but differ in speed. M2.5-Lightning has a steady throughput of 100 tokens per second, about twice that of other frontier models, and costs $0.30 per million input tokens and $2.40 per million output tokens. M2.5, which has a throughput of 50 tokens per second, costs half that. Both model versions support caching. Based on output price, the cost of M2.5 is one-tenth to one-twentieth that of Opus, Gemini 3 Pro, and GPT-5.

    At a rate of 100 output tokens per second, running M2.5 continuously for an hour costs $1. At 50 tokens per second, the price drops to $0.30. To put that into perspective, you can have four M2.5 instances running continuously for an entire year for about $10,000 (at the 50-tokens-per-second rate). We believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy. For the M2-series, the only problem that remains is how to continually push the frontier of model capability.
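
    A short worked calculation shows how these figures relate to the listed prices. The function names are ours, and the sketch prices output tokens only. It yields about $0.86 and $0.22 per hour, so the quoted $1 and $0.30 presumably also include input tokens; that breakdown is an assumption, since the source does not give one.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365


def hourly_output_cost(tokens_per_second, usd_per_million_output_tokens):
    """Cost of one hour of continuous generation, output tokens only."""
    tokens = tokens_per_second * SECONDS_PER_HOUR
    return tokens * usd_per_million_output_tokens / 1_000_000


def yearly_cost(usd_per_hour, instances):
    """Cost of running several instances around the clock for a year."""
    return usd_per_hour * HOURS_PER_YEAR * instances


lightning = hourly_output_cost(100, 2.40)  # M2.5-Lightning list price
standard = hourly_output_cost(50, 1.20)    # M2.5: half the Lightning price
print(f"Lightning, output only: ${lightning:.3f}/hour")
print(f"M2.5, output only:      ${standard:.3f}/hour")

# Using the quoted all-in figure of $0.30/hour for the 50 tokens/s version:
print(f"Four instances for a year: ${yearly_cost(0.30, 4):,.0f}")
```

    At the quoted $0.30 per hour, four instances come to $10,512 per year, which matches the "$10,000" figure only at the 50-tokens-per-second rate.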

    Improvement Rate

    Over the three and a half months from late October to now, we have successively released M2, M2.1, and M2.5, with the pace of model improvement exceeding our original expectations. For instance, on the highly regarded SWE-Bench Verified benchmark, the rate of progress of the M2-series has been significantly faster than that of peers such as the Claude, GPT, and Gemini model families.

    RL Scaling

    One of the key drivers of the aforementioned developments is the scaling of reinforcement learning. As we train our models, we also benefit from their abilities. Most of the tasks we perform and the workspaces we use across the company have been turned into training environments for RL. To date, there are already hundreds of thousands of such environments. At the same time, we did plenty of work on our agentic RL framework, algorithms, reward signals, and infrastructure engineering to support the continued scaling of our RL training.

    Forge: Agent-Native RL Framework

    We designed an agent-native RL framework in-house, called Forge. It introduces an intermediary layer that fully decouples the underlying training-inference engine from the agent, which supports the integration of arbitrary agents and lets us optimize the model's generalization across agent scaffolds and tools. To improve system throughput, we tuned asynchronous scheduling strategies to balance throughput against sample off-policyness, and designed a tree-structured merging strategy for training samples, achieving an approximately 40x training speedup.
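
    To give a feel for why merging training samples into a tree can save work, the sketch below merges token sequences that share prefixes into a trie and counts how many tokens remain. This is a generic illustration of prefix sharing, not Forge's implementation; the function name and the toy sequences are hypothetical, and it makes no attempt to reproduce the 40x figure.

```python
def merged_token_count(samples):
    """Number of trie nodes after merging sequences on shared prefixes.

    Each node stands for one token that would be processed once,
    however many samples pass through it.
    """
    root = {}
    unique_tokens = 0
    for sequence in samples:
        node = root
        for token in sequence:
            if token not in node:
                node[token] = {}
                unique_tokens += 1
            node = node[token]
    return unique_tokens


# Toy rollouts that branch from a common prefix (made-up token ids).
rollouts = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 6],
]
total_tokens = sum(len(r) for r in rollouts)
print(total_tokens, merged_token_count(rollouts))
```

    Here 11 tokens across three rollouts collapse to 6 trie nodes. Agent rollouts that branch late from a long shared context would see a much larger ratio.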

    Agentic RL Algorithm and Reward Design

    On the algorithm side, we continued using the CISPO algorithm we proposed at the beginning of last year to ensure the stability of MoE models during large-scale training. To address the credit assignment challenge posed by long contexts in agent rollouts, we introduced a process reward mechanism for end-to-end monitoring of generation quality. Furthermore, to deeply align with user experience, we evaluated task completion time through agent trajectories, achieving an optimal trade-off between model intelligence and response speed.
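
    For orientation, a CISPO-style objective can be sketched as follows. This is written from the published description of CISPO rather than from anything stated in this post, and the notation is assumed here: q is the prompt, o_i the i-th sampled response, \hat{A}_{i,t} a per-token advantage estimate, sg(·) the stop-gradient operator, and \epsilon^{IS}_{low}, \epsilon^{IS}_{high} the clipping bounds.

```latex
\begin{align}
r_{i,t}(\theta) &= \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, \qquad
\hat{r}_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1-\epsilon^{IS}_{low},\, 1+\epsilon^{IS}_{high}\right) \\
\mathcal{J}(\theta) &= \mathbb{E}\left[ \frac{1}{\sum_{i} |o_i|} \sum_{i} \sum_{t=1}^{|o_i|}
\operatorname{sg}\!\left(\hat{r}_{i,t}(\theta)\right)\, \hat{A}_{i,t}\, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right]
\end{align}
```

    The point of this form is that the importance-sampling weight is clipped and detached, rather than the token's update being dropped, so every token still contributes a gradient through the log-probability term.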

    We will soon release a more comprehensive introduction to RL scaling in a separate technical blog post.

    MiniMax Agent: M2.5 as a Professional Employee

    M2.5 has been fully deployed in MiniMax Agent, delivering the best agentic experience.

    We have distilled core information-processing capabilities into standardized Office Skills deeply integrated within MiniMax Agent. In MAX mode, when handling tasks such as Word formatting, PowerPoint editing, and Excel calculations, MiniMax Agent automatically loads the corresponding Office Skills based on file type, improving the quality of task outputs.

    Furthermore, users can combine Office Skills with domain-specific industry expertise to create reusable Experts tailored to specific task scenarios.

    Take industry research as an example: by merging a mature research framework SOP (standard operating procedure) with Word Skills, the Agent can strictly follow the established framework to automatically fetch data, organize analytical logic, and output properly formatted research reports — rather than merely generating a raw block of text. In financial modeling scenarios, by combining an organization's proprietary modeling standards with Excel Skills, the Agent can follow specific risk control logic and calculation standards to automatically generate and validate complex financial models, rather than simply outputting a basic spreadsheet.

    To date, users have built over 10,000 Experts on MiniMax Agent, and this number is still growing rapidly. MiniMax has also built multiple sets of deeply optimized, ready-to-use Expert suites on MiniMax Agent for high-frequency scenarios such as office work, finance, and programming.

    MiniMax itself has been among the first to benefit from M2.5's capabilities. Throughout the company's daily operations, 30% of overall tasks are autonomously completed by M2.5, spanning functions including R&D, product, sales, HR, and finance — and the penetration rate continues to rise. Performance in coding scenarios has been particularly notable, with M2.5-generated code accounting for 80% of newly committed code.

    Appendix

    Further benchmark results of M2.5:

    Evaluation methods:

    • SWE benchmark:
      SWE-bench Verified, SWE-bench Multilingual, SWE-bench Pro, and Multi-SWE-bench were tested on internal infrastructure using Claude Code as the scaffolding, with the default system prompt overridden, and results averaged over 4 runs. Additionally, SWE-bench Verified was also evaluated on the Droid and OpenCode scaffoldings using the default prompt.
    • Terminal Bench 2:
      We tested Terminal Bench 2 using Claude Code 2.0.64 as the evaluation scaffolding. We modified the Dockerfiles of some problems to ensure the correctness of the problems themselves, uniformly expanded sandbox specifications to 8-core CPU and 16 GB memory, set the timeout uniformly to 7,200 seconds, and equipped each problem with a basic toolset (ps, curl, git, etc.). While not retrying on timeouts, we added a detection mechanism for empty scaffolding responses, retrying tasks whose final response was empty to handle various abnormal interruption scenarios. Final results are averaged over 4 runs.
    • VIBE-Pro:
      Internal benchmark. Uses Claude Code as the scaffolding to automatically verify the interaction logic and visual effects of programs. All scores are computed through a unified pipeline that includes a requirements set, containerized deployment, and a dynamic interaction environment. Final results are averaged over 3 runs.
    • BrowseComp:
      Uses the same agent framework as WebExplorer (Liu et al., 2025). When token usage exceeds 30% of the maximum context, all history is discarded.
    • Wide Search:
      Uses the same agent framework as WebExplorer (Liu et al., 2025).
    • RISE:
      Internal benchmark. Contains real questions from human experts, evaluating the model's multi-step information retrieval and reasoning capabilities when combined with complex web interactions. A Playwright-based browser tool suite is added on top of the WebExplorer (Liu et al., 2025) agent framework.
    • GDPval-MM:
      Internal benchmark. Based on the open-source GDPval test set, using a custom agentic evaluation framework where an LLM-as-a-judge performs pairwise win/tie/loss judgments on complete trajectories. Average token cost per task is calculated based on each vendor's official API pricing (without caching).
    • MEWC:
      Internal benchmark. Built on MEWC (Microsoft Excel World Championship), comprising 179 problems from the main and other regional divisions of Excel esports competitions from 2021–2026. It evaluates the model's ability to understand competition Excel spreadsheets and use Excel tools to complete problems. Scores are calculated by comparing output and answer cell values one by one.
    • Finance Modeling:
      Internal benchmark. Primarily contains financial modeling problems constructed by industry experts, involving end-to-end research and analysis tasks performed via Excel tools. Each problem is scored using expert-designed rubrics. Final results are averaged over 3 runs.
    • AIME25 ~ AA-LCR:
      Obtained through internal testing based on the public evaluation sets and evaluation methods covered by the Artificial Analysis Intelligence Index leaderboard.
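
    The BrowseComp context-management rule described above (discard all history once token usage exceeds 30% of the maximum context) can be sketched as a simple loop. The function name, its parameters, and the per-step token costs are hypothetical, and the sketch assumes the check runs after each step. What survives a discard, such as the task prompt, is not specified in the source and is not modeled here.

```python
def run_with_context_reset(step_token_costs, max_context_tokens, reset_percent=30):
    """Track history size per step, clearing it when it passes the threshold.

    Returns (history size after each step, number of resets).
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    used = 0
    resets = 0
    trace = []
    for cost in step_token_costs:
        used += cost
        # Reset only when usage strictly exceeds reset_percent of the maximum.
        if used * 100 > reset_percent * max_context_tokens:
            used = 0
            resets += 1
        trace.append(used)
    return trace, resets


# Made-up per-step costs against a 1,000-token context window.
trace, resets = run_with_context_reset([100, 150, 100, 50, 300], 1000)
print(trace, resets)
```

    With these made-up costs the history is cleared twice, after the third and fifth steps, when cumulative usage reaches 350 tokens against a 300-token threshold.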
    Original source
  • Feb 12, 2026
    • Date parsed from source:
      Feb 12, 2026
    • First seen by Releasebot:
      Feb 18, 2026

    MiniMax

    MiniMax M2.5: Built for Real-World Productivity.

    MiniMax unveils M2.5, a fast, low-cost frontier model with strong coding and agentic abilities, boosted by RL scaling and Forge. It ships in two versions for office, finance, and software work, promising efficient tool use and industry-leading speed and cost performance.

    Original source
  • Jan 27, 2026
    • Date parsed from source:
      Jan 27, 2026
    • First seen by Releasebot:
      Jun 19, 2026

    MiniMax

    A Deep Dive into the MiniMax-M2-her

    MiniMax introduces MiniMax-M2-her, a role-play agent built for deeper world fidelity, stronger story progression, and better preference alignment. It also launches Role-Play Bench and highlights three years of work on long-horizon, multi-turn role-play quality.

    Worlds to Dream, Stories to Live

    How we built a Role-Play Agent for production use

    Three Years of Observations: How We Define Role-Play

    This year marks our third year optimizing Role-Play in Talkie / Xingye.

    Three years is long enough for a product to leave its mark on users’ lives, and long enough for a user to form a deep bond with an NPC. Beyond the product metrics, we have found that the most valuable insights come from user behaviors that reflect real needs.

    Here are a few of the most interesting signals:

    • The “Regenerate” button follows a long-tail usage pattern, concentrated on narrative pivot points. Whether it’s a confession or a moment of sentiment, users hit “regenerate” to curate their own “perfect moment”. This signals that the role-play experience is not about a binary pass/fail judgment, but rather a pursuit of narrative precision. What matters most to users is the fidelity of these peak emotional experiences.

    • NPC popularity diverges from a typical power-law curve. Unlike broad content platforms, even niche characters maintain distinct, high-retention user groups. For these users, the character’s specific idiosyncrasies are the core value proposition. If our model regresses to satisfy the “average” experience, we destroy the very nuance that minority users value, leading to engagement loss in the long tail.

    • Conversation turn count correlates non-linearly with engagement. We observed a significant drop-off after turn 20. This signals that shallow role-play is driven by novelty, while long-term retention depends not on one-time thrills but on whether the NPC and user can build a stable emotional connection within limited turns. Based on this, we decomposed engagement drivers into instant gratification and long-term connection. We continuously deepen emotional bonds while providing new stimuli through exploration.

    All of these converge on a single insight: the essence of Role-Play is not static impersonation; it is the unique narrative journey a user and a character weave together. Deep role-play is not just about accuracy; it is about agency, enabling every user to step into a living, breathing environment and arrive at a moment of resolution that is uniquely theirs. Formally, we define this as an agent’s capacity to navigate specific coordinates: {World} × {Stories}, conditioned on {User Preferences}.

    Guided by this framework, we have distilled our technical strategy for Role-Play into three core challenges:

    • How do we preserve the distinct “soul” of each world? (Worlds)
      User-generated contexts span a massive spectrum—from slice-of-life campus dramas to high-stakes fantasy epics, from intimate dyads to complex ensemble casts. If our model merely learns the “average,” characters will homogenize, and these diverse worlds will collapse into mediocrity. We need a model capable of representing the full distribution, preserving the fidelity of both mainstream hits and long-tail niches without regression.

    • How do we sustain narrative vitality over time? (Stories)
      As conversation length increases, the risk of coherence drift rises. Models naturally tend toward mechanical loops and repetitive phrasing, causing narrative tension to evaporate. A compelling story requires cadence—the intelligence to know when to escalate conflict to drive the plot, and when to slow down to allow for emotional processing.

    • How do we decode implicit user intent? (User Preferences)
      Users rarely explicitly state their pacing preferences. Some seek a “slow burn” emotional buildup, while others crave rapid plot progression. The model must learn to infer these unspoken desires from contextual cues, dynamically aligning its rhythm and tone with the user’s underlying psychological flow.

    1 MiniMax-M2-her

    Over the past three years, we have relentlessly iterated our models to answer these fundamental questions. Today, we are proud to introduce MiniMax-M2-her—our systematic attempt toward deeper Role-Play.

    Specifically, MiniMax-M2-her is engineered to deliver:

    • High-Fidelity World Experience:
      MiniMax-M2-her does more than process text; it anchors itself within complex settings. Whether the context is a sprawling epic or an intimate drama, it maintains strict coherence, ensuring every interaction aligns with the established lore and the character’s soul.

    • Dynamic Story Progression:
      MiniMax-M2-her rejects mediocre repetition and rigid patterns. By utilizing richer, more vivid prose, it actively drives the plot forward, imbuing stories with the tension and breathing rhythm of life itself.

    • Intuitive Preference Alignment:
      MiniMax-M2-her is designed to read between the lines. It detects unspoken expectations and subtle context cues, adapting dynamically to the user’s unique style and long-term habits without needing explicit instruction.

    In the following sections, we will break down the insights gained from three years of research and the engineering efforts that power MiniMax-M2-her.

    2 Starting with Evaluation — Is A/B Testing A Good Evaluation?

    Prior to mid-2024, our iteration cycle, like much of the industry's, was tethered to traditional online A/B testing. We relied heavily on lagging indicators such as LifeTime (LT), session duration, and average conversation turns to judge performance.

    However, we quickly hit a ceiling: velocity.
    Validating a new model required lengthy testing cycles to achieve statistical significance, often stretching feedback loops to a week or more. Furthermore, we faced a unique challenge with user inertia. Long-term users build extensive histories and deep emotional habits with specific NPCs. When we swapped the underlying model—even a “better” one—the sudden stylistic shift often felt like a violation of the character’s established voice.

    To break free from these slow cycles, we needed a way to approximate online metrics through offline evaluation. But here we encountered the “Ground Truth Paradox.” Unlike conventional NLP tasks, Role-Play is inherently subjective and non-verifiable. If you ask a tsundere character, “Do you like me?”, valid responses could range from a flushed “Hmph, as if!” to a cold “...You’re so annoying.”

    However, we identified a key insight: While “alignment” (what makes a response great) is subjective, “misalignment” (what makes a response wrong) is surprisingly objective. This gave us a clear path forward: while it’s hard to define aligned responses, it is feasible to detect a misaligned one.

    Leveraging this logic, we developed Role-Play Bench. This evaluation framework utilizes Situated Reenactment to automatically detect model misalignment. By focusing on error detection rather than subjective perfection, we have created a metric that correlates closely with online performance, significantly accelerating our iteration velocity.

    2.1 Situated Reenactment: Bridging the Gap to Online Evaluation

    Situated Reenactment measures an agent’s performance at specific coordinates: {Worlds} × {Stories}, conditioned on {User Preferences}. Instead of evaluating static, single-turn responses, we generate multi-turn dialogue trajectories via self-play simulation.

    Scenario construction.
    We started from our massive internal NPC/User prompt library (>1M) and the corresponding relationship setups. We produced hierarchical structured tags via embedding clustering → LLM semantic aggregation → human verification. We then uniformly sampled 100 NPC settings each in Chinese and English.

    Model sampling.
    We built a Model-on-Model Self-Play sampling pipeline where models play both NPC and User. We run 100 turns of self-play for each setting, repeated three times, generating 300 dense conversation sessions.

    The Evaluation Protocol.
    Evaluation focuses exclusively on NPC-side outputs, scored across predefined dimensions by an evaluation model calibrated to align with human perception.
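
    The sampling and scoring setup described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Model` callable signature, the session dictionary layout, and the assumption that one "turn" is one user message plus one NPC reply are all hypothetical, and the toy models exist only so the sketch runs without a real LLM.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a model maps (system prompt, history) to its next utterance.
Model = Callable[[str, List[Dict[str, str]]], str]


def self_play(npc_model: Model, user_model: Model, npc_prompt: str,
              user_prompt: str, turns: int) -> List[Dict[str, str]]:
    """Alternate user and NPC utterances for a fixed number of turns."""
    history: List[Dict[str, str]] = []
    for _ in range(turns):
        history.append({"speaker": "user", "text": user_model(user_prompt, history)})
        history.append({"speaker": "npc", "text": npc_model(npc_prompt, history)})
    return history


def sample_sessions(settings, npc_model, user_model, turns=100, repeats=3):
    """Run every setting `repeats` times and collect the resulting sessions."""
    sessions = []
    for setting in settings:
        for run in range(repeats):
            dialogue = self_play(npc_model, user_model,
                                 setting["npc_prompt"], setting["user_prompt"], turns)
            sessions.append({"setting_id": setting["id"], "run": run, "dialogue": dialogue})
    return sessions


def npc_outputs(session):
    """Keep only the NPC-side utterances, since those are what gets scored."""
    return [m["text"] for m in session["dialogue"] if m["speaker"] == "npc"]


# Toy stand-ins for real models (assumptions for demonstration only).
def echo_user(prompt, history):
    return f"user turn {len(history) // 2 + 1}"


def echo_npc(prompt, history):
    return f"npc reply to: {history[-1]['text']}"


demo_settings = [{"id": i, "npc_prompt": "npc", "user_prompt": "user"} for i in range(2)]
demo_sessions = sample_sessions(demo_settings, echo_npc, echo_user, turns=5, repeats=3)
```

    With 100 settings, `turns=100`, and `repeats=3`, the same loop would yield the 300 sessions mentioned above.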

    2.2 Evaluation Taxonomy of Role-Play Bench

    Worlds focus on Basics, Logic, and Knowledge errors:

    • Basics:
      We scan for mixed languages, excessive repetition, and formatting glitches.

    • Logic:
      We place special emphasis on Reference Confusion, a metric that reflects whether models can truly remember user-constructed characters’ relationships.

    • Knowledge:
      We ensure the model adheres to the immutable physical and magical laws of the specific setting.

    Stories consider Diversity and Content Logic problems:

    • Diversity:
      We detect single-pattern phrasing, repetitive plot beats, stagnation, and low-information filler.

    • Content Logic:
      It measures narrative coherence and OOC (out-of-character) breaks.

    User Preferences primarily evaluate interaction quality:

    • AI Speaks for User:
      Reflects whether the model oversteps boundaries.

    • AI Ignores User:
      Captures whether the model talks to itself.

    • AI Silence:
      Judges whether the model provides “hooks” that invite a reply.

    • Interaction Boundary:
      Requires models to balance safety boundaries with emotional interaction.
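
    One way to hold this taxonomy in code is a small mapping from dimension to error category, with a per-dimension misalignment rate computed over NPC turns. The aggregation rule here (fraction of turns flagged at least once) and the `(turn_index, category)` flag format are assumptions for illustration; the source does not specify how Role-Play Bench aggregates scores.

```python
TAXONOMY = {
    "Worlds": ["Basics", "Logic", "Knowledge"],
    "Stories": ["Diversity", "Content Logic"],
    "User Preferences": ["AI Speaks for User", "AI Ignores User",
                         "AI Silence", "Interaction Boundary"],
}

# Reverse lookup: error category -> dimension it belongs to.
CATEGORY_TO_DIMENSION = {cat: dim for dim, cats in TAXONOMY.items() for cat in cats}


def misalignment_rates(flags, total_turns):
    """flags: (turn_index, category) pairs from a judge (hypothetical format).

    Returns, per dimension, the fraction of NPC turns flagged at least once.
    """
    flagged = {dim: set() for dim in TAXONOMY}
    for turn, category in flags:
        flagged[CATEGORY_TO_DIMENSION[category]].add(turn)
    return {dim: len(turns) / total_turns for dim, turns in flagged.items()}


# Turn 3 has two Worlds errors but is counted once for that dimension.
demo_flags = [(3, "Logic"), (3, "Basics"), (7, "AI Silence"), (9, "Diversity")]
rates = misalignment_rates(demo_flags, total_turns=10)
```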

    2.3 Role-Play Bench Results

    We systematically evaluated mainstream models using Role-Play Bench, focusing exclusively on multi-turn dynamic interaction. The results are definitive: across extended 100-turn sessions, MiniMax-M2-her ranks #1 overall.

    On the Worlds dimension, MiniMax-M2-her performs best.
    This result challenges the common assumption that strong general reasoning automatically translates to role-play fidelity. The more common failures are reference confusion and physical logic errors.
    In multi-character settings, models often attribute dialogue to the wrong character.
    MiniMax-M2-her maintains strict separation of voice and identity.
    Additionally, MiniMax-M2-her recognizes physical state changes and autonomously introduces Narrative Bridging by using narration to transition through time or space rather than forcing impossible dialogue.

    On the Stories dimension, MiniMax-M2-her ranks fifth among all models, still achieving a relatively high standard.

    On the User Preferences dimension, MiniMax-M2-her excels.
    It avoids speaking for users while emphasizing responsive intent recognition and natural interaction.

    Long-Horizon Quality Degradation Analysis:
    We found MiniMax-M2-her better maintains long-conversation stability:

    • Long-range quality stability:
      Most models hit a “performance wall” after turn 20. MiniMax-M2-her avoids context bloat and compounding logic gaps.

    • Response length controllability:
      MiniMax-M2-her has been specifically optimized for brevity. Even in 100-turn conversations, it maintains response length within the optimal range.

    3 How We Built MiniMax-M2-her

    We propose a two-phase alignment strategy: Agentic Data Synthesis to broaden training data variety and mitigate misalignment, followed by Online Preference Learning to integrate feedback and align with user preferences.

    3.1 Agentic Data Synthesis

    We proposed Agentic Data Synthesis—a dialogue synthesis pipeline driven by a sophisticated agentic workflow optimizing two dimensions:

    1. Quality.
      Adherence to rigorous standards from linguistic fundamentals to high-level narrative execution.

    2. Diversity.
      Capacity to span a vast manifold of interactions across varying worldviews, scenarios, and interaction styles.

    The workflow proceeds in stages:

    • Random sampling from NPC/User Prompts library and instantiating expert models.

    • Expert models act as NPC and User with a Dynamic Chat Planning Module guiding direction and emotional tone.

    • Best-of-N (BoN) sampling to filter low-quality outputs.

    • LLM-as-a-judge agent periodically reviews and rewrites segments to correct drift.

    • Rewritten segments become the initial state for next synthesis round.
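
    The Best-of-N filtering stage in the workflow above can be sketched as below. The `generate` and `score` callables are placeholders for an expert model and a quality scorer; the toy scorer (distinct word count) is purely an assumption so the example runs on its own and is not the scoring used in the source.

```python
import itertools


def best_of_n(generate, score, context, n=4, min_score=None):
    """Draw n candidates and keep the highest-scoring one.

    Returns None when even the best candidate falls below min_score,
    which lets the caller discard the turn instead of keeping weak output.
    """
    candidates = [generate(context) for _ in range(n)]
    best = max(candidates, key=score)
    if min_score is not None and score(best) < min_score:
        return None
    return best


# Toy generator that cycles through fixed candidates (stand-in for a model).
_pool = itertools.cycle([
    "ok.",
    "She nods and nods and nods.",
    "Rain hammers the shutters as she unfolds the map.",
])


def toy_generate(context):
    return next(_pool)


def toy_score(text):
    # Crude proxy for richness: number of distinct lowercase tokens.
    return len(set(text.lower().split()))


picked = best_of_n(toy_generate, toy_score, context=[], n=3)
```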

    Diversity Guarantees:

    • Scenario diversity:
      Dispersion sampling to neutralize style bias from overrepresented tropes.

    • Prompt diversity:
      Enriching skeletal NPC Prompts with worldview positioning and plot development.

    • Style diversity:
      Pool of expert models finetuned on distinct stylistic corpora.

    • Structural diversity:
      Dynamic turn allocation enabling consecutive turns and varied rhythms.

    Quality Guarantees:

    • Segment Checking and Refinement:
      Periodic scanning for surface errors, logic failures, and repetition.

    • User-side Planning Agent:
      Assesses conversation state and introduces new plot elements to maintain narrative progress.

    3.2 Online Preference Learning

    We utilize Online RLHF to train MiniMax-M2-her to perceive and adapt to contextualized preferences—implicit behavioral signals like regeneration patterns and engagement duration.

    The process: gather annotation signals → Signal Filtering and Causal Denoising → RLHF training with early stopping → redeploy and iterate.

    Causal Denoising Protocol:

    • Stratified Bias Removal:
      Categorize annotators to neutralize systematic biases.

    • Causal Inference:
      Session Duration is a high-fidelity predictor of satisfaction; Turn Count is weaker.

    • Quality Floor Filter:
      Discard signals that fail baseline quality benchmarks.
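
    As a rough illustration of two of these steps, the sketch below normalizes scores within each annotator group and then applies a quality floor. Per-group z-scoring is one common way to neutralize systematic rater bias and is used here as a stand-in; the source does not say which method is applied, and the field names `group`, `score`, and `quality` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_group(signals):
    """Add a per-group z-score so lenient and strict groups become comparable."""
    by_group = defaultdict(list)
    for s in signals:
        by_group[s["group"]].append(s["score"])
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    out = []
    for s in signals:
        mu, sigma = stats[s["group"]]
        z = 0.0 if sigma == 0 else (s["score"] - mu) / sigma
        out.append({**s, "z": z})
    return out


def quality_floor(signals, min_quality):
    """Drop signals whose quality estimate is below the floor."""
    return [s for s in signals if s.get("quality", 0.0) >= min_quality]


demo_signals = [
    {"group": "lenient", "score": 4, "quality": 0.9},
    {"group": "lenient", "score": 5, "quality": 0.9},
    {"group": "strict", "score": 1, "quality": 0.9},
    {"group": "strict", "score": 2, "quality": 0.2},
]
normalized = normalize_by_group(demo_signals)
kept = quality_floor(normalized, min_quality=0.5)
```

    After normalization, a 5 from the lenient group and a 2 from the strict group carry the same relative weight.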

    Model Training:
    The primary risk is Entropy Degradation. We apply early stopping when diversity drops. RLHF tends to overfit rapidly, often by the second epoch.
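
    A diversity-based early-stopping check might look like the sketch below. It assumes diversity is tracked as the Shannon entropy of whitespace tokens over a batch of sampled responses and that training stops once entropy falls a fixed fraction below a baseline. Both the metric and the 20% threshold are illustrative assumptions, not the criteria used for MiniMax-M2-her.

```python
import math
from collections import Counter


def token_entropy(responses):
    """Shannon entropy in bits of the token distribution across responses."""
    counts = Counter(tok for r in responses for tok in r.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_stop(entropy_history, baseline, max_drop=0.2):
    """Stop once the latest entropy is more than max_drop (fractional) below baseline."""
    return bool(entropy_history) and entropy_history[-1] < baseline * (1 - max_drop)


baseline = token_entropy(["a b", "c d"])   # four equally likely tokens
collapsed = token_entropy(["a a", "a b"])  # distribution skewed toward one token
stop_now = should_stop([baseline, collapsed], baseline)
```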

    4 What’s Next?

    We call the next direction Worldplay—upgrading users from “entering a pre-set world” to “co-creating the world.”

    This drives evolution from static prompt injection to Dynamic World State modeling: structuring entities, relationships, and causal chains at 100-turn and 1000-turn scales.

    Another critical axis is Multi-character Coordination—ensemble dramas where multiple agents share world state, coordinate narrative, and maintain independent personas.

    Ultimately: a world you can define, stories that grow with you, and companions that understand you without usurping your agency.

    Worlds to Dream, Stories to Live. Let’s go together.

    Original source
  • January 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Jan 16, 2026

    MiniMax

    MiniMax-M2.1: Polyglot programming mastery, precision code refactoring

    MiniMax now integrates with the Anthropic API ecosystem, letting developers plug in with an easy SDK setup and shared prompts. With supported models, streaming options, and clear config steps, you can deploy cross‑ecosystem AI quickly.

    • Install Anthropic SDK
    pip install anthropic
    
    • Configure Environment Variables

    For international users, use https://api.minimax.io/anthropic; for users in China, use https://api.minimaxi.com/anthropic

    export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
    export ANTHROPIC_API_KEY=${YOUR_API_KEY}
    
    • Call API

    Python example:

    import anthropic
    
    client = anthropic.Anthropic()
    
    message = client.messages.create(
        model = "MiniMax-M2.1",
        max_tokens = 1000,
        system = "You are a helpful assistant.",
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Hi, how are you?"
                    }
                ]
            }
        ]
    )
    
    for block in message.content:
        if block.type == "thinking":
            print(f"Thinking:\n{block.thinking}\n")
        elif block.type == "text":
            print(f"Text:\n{block.text}\n")
    
    • Important Note

    In multi-turn function call conversations, the complete model response (i.e., the assistant message) must be appended to the conversation history to maintain the continuity of the reasoning chain.

    • Append the full response.content list to the message history (includes all content blocks: thinking/text/tool_use)
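
    A minimal sketch of that bookkeeping follows. The helper names are hypothetical, the `get_weather` tool is made up, and the plain dictionaries stand in for the content block objects the SDK returns (real thinking blocks may carry additional fields, which is one more reason to pass the list through unmodified).

```python
def append_assistant_turn(history, response_content):
    """Append the assistant's full content list unchanged.

    Keeping thinking, text, and tool_use blocks together preserves the
    reasoning chain for the next request.
    """
    history.append({"role": "assistant", "content": response_content})
    return history


def append_tool_results(history, results):
    """Append one tool_result block per tool_use id (results: id -> output)."""
    history.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": tool_id, "content": output}
            for tool_id, output in results.items()
        ],
    })
    return history


history = [{"role": "user", "content": "What is the weather in Shanghai?"}]

# Simplified stand-in for `response.content` from client.messages.create(...).
fake_content = [
    {"type": "thinking", "thinking": "I need the weather tool."},
    {"type": "tool_use", "id": "toolu_demo", "name": "get_weather",
     "input": {"city": "Shanghai"}},
]

append_assistant_turn(history, fake_content)
append_tool_results(history, {"toolu_demo": "22C, cloudy"})
```

    In real use, `history` would then be passed as `messages` to the next `client.messages.create` call.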

    • Supported Models

    When using the Anthropic SDK, the MiniMax-M2.1, MiniMax-M2.1-lightning, and MiniMax-M2 models are supported:

    Model Name | Description
    MiniMax-M2.1 | Powerful Multi-Language Programming Capabilities with Comprehensively Enhanced Programming Experience (output speed approximately 60 tps)
    MiniMax-M2.1-lightning | Faster and More Agile (output speed approximately 100 tps)
    MiniMax-M2 | Agentic capabilities, Advanced reasoning

    Note: The Anthropic API compatibility interface currently only supports the MiniMax-M2.1, MiniMax-M2.1-lightning, and MiniMax-M2 models. For other models, please use the standard MiniMax API interface.

    • Compatibility

    • Supported Parameters

    When using the Anthropic SDK, we support the following input parameters:

    Parameter | Support Status | Description
    model | Fully supported | Supports the MiniMax-M2.1, MiniMax-M2.1-lightning, and MiniMax-M2 models
    messages | Partial support | Supports text and tool calls, no image/document input
    max_tokens | Fully supported | Maximum number of tokens to generate
    stream | Fully supported | Streaming response
    system | Fully supported | System prompt
    temperature | Fully supported | Range (0.0, 1.0], controls output randomness, recommended value: 1
    tool_choice | Fully supported | Tool selection strategy
    tools | Fully supported | Tool definitions
    top_p | Fully supported | Nucleus sampling parameter
    metadata | Fully supported | Metadata
    thinking | Fully supported | Reasoning Content
    top_k | Ignored | This parameter will be ignored
    stop_sequences | Ignored | This parameter will be ignored
    service_tier | Ignored | This parameter will be ignored
    mcp_servers | Ignored | This parameter will be ignored
    context_management | Ignored | This parameter will be ignored
    container | Ignored | This parameter will be ignored

    • Messages Field Support

    Field Type | Support Status | Description
    type="text" | Fully supported | Text messages
    type="tool_use" | Fully supported | Tool calls
    type="tool_result" | Fully supported | Tool call results
    type="thinking" | Fully supported | Reasoning Content
    type="image" | Not supported | Image input not supported yet
    type="document" | Not supported | Document input not supported yet

    • Examples

    • Streaming Response

    Python example:

    import anthropic
    
    client = anthropic.Anthropic()
    
    print("Starting stream response...\n")
    print("="*60)
    print("Thinking Process:")
    print("="*60)
    
    stream = client.messages.create(
        model = "MiniMax-M2.1",
        max_tokens = 1000,
        system = "You are a helpful assistant.",
        messages = [
            {
                "role": "user",
                "content": [{"type": "text", "text": "Hi, how are you?"}]
            }
        ],
        stream = True,
    )
    
    reasoning_buffer = ""
    text_buffer = ""
    
    for chunk in stream:
        if chunk.type == "content_block_start":
            if hasattr(chunk, "content_block") and chunk.content_block:
                if chunk.content_block.type == "text":
                    print("\n" + "="*60)
                    print("Response Content:")
                    print("="*60)
        elif chunk.type == "content_block_delta":
            if hasattr(chunk, "delta") and chunk.delta:
                if chunk.delta.type == "thinking_delta":
                    # Stream the thinking process as it arrives
                    new_thinking = chunk.delta.thinking
                    if new_thinking:
                        print(new_thinking, end = "", flush = True)
                        # Accumulate only non-empty deltas so the buffer never receives None
                        reasoning_buffer += new_thinking
                elif chunk.delta.type == "text_delta":
                    # Stream the text content as it arrives
                    new_text = chunk.delta.text
                    if new_text:
                        print(new_text, end = "", flush = True)
                        # Accumulate only non-empty deltas so the buffer never receives None
                        text_buffer += new_text
    print("\n")
    
    • Important Notes
    1. The Anthropic API compatibility interface currently only supports the MiniMax-M2.1, MiniMax-M2.1-lightning, and MiniMax-M2 models
    2. The temperature parameter range is (0.0, 1.0]; values outside this range will return an error
    3. Some Anthropic parameters (such as top_k, stop_sequences, service_tier, mcp_servers, context_management, container) will be ignored
    4. Image and document type inputs are not currently supported
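
    These constraints can also be checked client-side before a request is sent. The sketch below is a hypothetical pre-flight helper, not part of the Anthropic SDK or the MiniMax API; its ignored-parameter set and model list follow the Supported Parameters and Supported Models tables above.

```python
SUPPORTED_MODELS = {"MiniMax-M2.1", "MiniMax-M2.1-lightning", "MiniMax-M2"}
IGNORED_PARAMS = {"top_k", "stop_sequences", "service_tier",
                  "mcp_servers", "context_management", "container"}


def check_request(params):
    """Validate request kwargs; return (cleaned_params, warnings).

    Raises ValueError for an unsupported model or an out-of-range temperature,
    and drops parameters the compatibility interface ignores.
    """
    if params.get("model") not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {params.get('model')!r}")
    temperature = params.get("temperature")
    if temperature is not None and not (0.0 < temperature <= 1.0):
        raise ValueError("temperature must be in (0.0, 1.0]")
    cleaned, warnings = {}, []
    for key, value in params.items():
        if key in IGNORED_PARAMS:
            warnings.append(f"{key} is ignored by the compatibility interface")
        else:
            cleaned[key] = value
    return cleaned, warnings


cleaned, warnings = check_request(
    {"model": "MiniMax-M2.1", "temperature": 1.0, "top_k": 40, "max_tokens": 100}
)
```

    The cleaned dictionary can then be unpacked into `client.messages.create(**cleaned, messages=...)`.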
    Original source
  • Dec 23, 2025
    • Date parsed from source:
      Dec 23, 2025
    • First seen by Releasebot:
      Jun 19, 2026

    MiniMax

    MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks

    MiniMax releases M2.1, a new AI model update focused on stronger real-world coding, better multilingual support, improved web, app and office workflows, faster responses, and broader agent tool use. The MiniMax M2.1 API is live, Agent is publicly available, and the model weights are open source.

    Key Highlights of MiniMax M2.1

    Exceptional Multi-Programming Language Capabilities

    Many models in the past primarily focused on Python optimization, but real-world systems are often the result of multi-language collaboration. In M2.1, we have systematically enhanced capabilities in Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, and other languages. The overall performance on multi-language tasks has reached industry-leading levels, covering the complete chain from low-level system development to application layer development.

    WebDev and AppDev: A Comprehensive Leap in Capability and Aesthetics

    Addressing the widely recognized weakness in mobile development across the industry, M2.1 significantly strengthens native Android and iOS development capabilities. Meanwhile, we have systematically enhanced the model's design comprehension and aesthetic expression in Web and App scenarios, enabling excellent construction of complex interactions, 3D scientific scene simulations, and high-quality visualization, making vibe coding a sustainable and deliverable production practice.

    Enhanced Composite Instruction Constraints, Enabling Office Scenarios

    As one of the first open-source model series to systematically introduce Interleaved Thinking, M2.1's systematic problem-solving capabilities have been further upgraded. The model not only focuses on code execution correctness but also emphasizes integrated execution of composite instruction constraints, providing higher usability in real office scenarios.

    More Concise and Efficient Responses

    Compared to M2, MiniMax-M2.1 delivers more concise model responses and thought chains. In practical programming and interaction experiences, response speed has significantly improved and token consumption has notably decreased, resulting in smoother and more efficient performance in AI Coding and Agent-driven continuous workflows.

    Outstanding Agent/Tool Scaffolding Generalization Capabilities

    M2.1 demonstrates excellent performance across various programming tools and Agent frameworks. It exhibits consistent and stable results in tools such as Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, and BlackBox, while providing reliable support for Context Management mechanisms including Skill.md, Claude.md/agent.md/cursorrule, and Slash Commands.

    High-Quality Dialogue and Writing

    M2.1 is no longer just stronger in coding capabilities. In everyday conversation, technical documentation, and writing scenarios, it also provides more detailed and structured responses.

    Benchmarks

    MiniMax-M2.1 delivers a significant leap over M2 on core software engineering leaderboards. It shines particularly bright in multilingual scenarios, where it outperforms Claude Sonnet 4.5 and closely approaches Claude Opus 4.5.

    We also evaluated MiniMax-M2.1 on SWE-bench Verified across a variety of coding agent frameworks. The results highlight the model's exceptional framework generalization and robust stability.

    Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.

    To evaluate the model's full-stack capability to architect complete, functional applications from zero to one, we established a novel benchmark: VIBE (Visual & Interactive Benchmark for Execution). This suite encompasses five core subsets: Web, Simulation, Android, iOS, and Backend.

    MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities.

    Showcases

    Multilingual Coding

    3D Interactive Animation

    MiniMax M2.1 built a 3D Dreamy Christmas Tree based on React Three Fiber and InstancedMesh, successfully rendering over 7,000 instances. It supports gesture interaction and complex particle animation, demonstrating advanced 3D rendering capabilities.

    Avant-Garde Web UI Design

    M2.1 generated a minimalist photographer's personal homepage using an asymmetrical layout and a black-white-red contrasting color scheme. By combining immersive imagery with brutalist typography, it achieved a high-impact visual effect.

    Website - Skincare Brand

    M2.1 designed a landing page for a high-end organic skincare brand. Adopting a Clean & Minimalist style, it accurately presented the brand's premium identity and international visual appeal.

    Web 3D Lego Sandbox

    M2.1 developed a high-freedom 3D brick building application based on Three.js, implementing precise grid snapping algorithms and collision detection mechanisms. The project perfectly replicates the glossy texture of plastic bricks, supporting multi-angle rotation, drag-and-drop assembly, and instant color switching.

    Native App Development - Android

    M2.1 used Kotlin to develop a native Android gravity sensor simulator. Utilizing the gyroscope for a silky-smooth control experience, it features clever visual easter eggs that elegantly present the MERRY XMAS MiniMax M2.1 message through natural UI transitions and collision effects.

    Native App Development - iOS

    M2.1 wrote an interactive iOS Home Screen widget, designing a Sleeping Santa click-to-wake mechanism. The logic is complete with native-level animation effects.

    Web Audio Simulation Development

    M2.1 developed a 16-step drum machine simulator based on the Web Audio API. It integrates synthesized drum sounds, non-linear rhythm algorithms, and real-time glitch sound effects, providing an avant-garde electronic music experience.

    Rust TUI

    M2.1 built a powerful Linux security audit tool with dual CLI + TUI modes using Rust, supporting one-click low-level scanning and intelligent risk rating for critical items such as processes, networks, and SSH.

    Python Data Dashboard

    M2.1 created a Web3 cryptocurrency price dashboard in the style of The Matrix. It uses Python to fetch real-time prices from an API, HTML for structure, and CSS for the Matrix aesthetic: green digital rain on a black background, a monospaced font, glowing neon green text, and a terminal-like UI.

    C++ Image Rendering

    M2.1 utilized C++ and GLSL to implement complex light transport algorithms, accurately rendering the physical refraction of a crystal ball, detailed SDF modeling of a snowman, and shimmering snow effects in a real-time environment.

    Java Real-time Danmaku

    M2.1 implemented a high-performance real-time Danmaku (bullet chat) system based on Java, with a clean and intuitive user interface and millisecond-level response capabilities.

    SVG Generation

    M2.1 generated an interactive isometric SVG island map, constructing a detailed miniature world that supports one-click zooming to freely explore four major themed areas.

    Agentic Tool Use

    M2.1 demonstrated its tool-use capabilities by autonomously invoking Excel and Yahoo Finance to complete an end-to-end task, ranging from market research data cleaning and analysis to chart generation.

    Digital Employee

    The Digital Employee is a key feature of the MiniMax M2.1 model. M2.1 accepts web content presented in text form and controls mouse clicks and keyboard inputs via text-based commands. It can complete end-to-end tasks in daily office scenarios across administration, data science, finance, human resources, and software development.

    Local Deployment Guide

    Download the model from the HuggingFace repository. We recommend using the following inference frameworks to serve the model: SGLang, vLLM, Transformers, and KTransformers.

    Recommended inference parameters: temperature=1.0, top_p=0.95, top_k=40
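
    As a sketch of how these values might be applied, the snippet below builds a chat request body for a locally served model. It assumes the model sits behind an OpenAI-compatible endpoint of the kind vLLM and SGLang can expose, that the server accepts `top_k` as a top-level field, and that the model identifier is `MiniMaxAI/MiniMax-M2.1`; all three are assumptions to verify against your own deployment.

```python
import json

# Recommended sampling settings from the deployment guide.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_payload(prompt, model="MiniMaxAI/MiniMax-M2.1"):
    """Build a JSON-serializable chat request using the recommended sampling values."""
    payload = {
        "model": model,  # assumed identifier; use the name your server registers
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


payload = build_chat_payload("Write a binary search in Rust.")
body = json.dumps(payload)  # ready to POST to the server's chat completions route
```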

    How to Use

    The MiniMax-M2.1 API is now live on the MiniMax Open Platform. Our product MiniMax Agent, built on MiniMax-M2.1, is now publicly available. The MiniMax-M2.1 model weights are now open-source, allowing for local deployment and use.

    Original source
  • Dec 23, 2025
    • Date parsed from source:
      Dec 23, 2025
    • First seen by Releasebot:
      Dec 23, 2025

    MiniMax

    MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks

    MiniMax M2.1 unleashes AI-native development with stronger multi-language coding, improved office task automation, and enhanced mobile Web/App capabilities. It promises faster, cheaper, more capable AI workflows and opens the model to open-source deployment and public tools.

    MiniMax M2.1 Release

    MiniMax has been continuously transforming itself in a more AI-native way. The core driving forces of this process are models, Agent scaffolding, and organization. Throughout the exploration process, we have gained increasingly deeper understanding of these three aspects. Today we are releasing updates to the model component, namely MiniMax M2.1, hoping to help more enterprises and individuals find more AI-native ways of working (and living) sooner.

    In M2, we primarily addressed issues of model cost and model accessibility. In M2.1, we are committed to improving performance in real-world complex tasks: focusing particularly on usability across more programming languages and office scenarios, and achieving the best level in this domain.

    Key Highlights of MiniMax M2.1:

    • Exceptional Multi-Programming Language Capabilities
      Many models in the past primarily focused on Python optimization, but real-world systems are often the result of multi-language collaboration.
      In M2.1, we have systematically enhanced capabilities in Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, and other languages. The overall performance on multi-language tasks has reached industry-leading levels, covering the complete chain from low-level system development to application layer development.
    • WebDev and AppDev: A Comprehensive Leap in Capability and Aesthetics
      Addressing the widely recognized weakness in mobile development across the industry, M2.1 significantly strengthens native Android and iOS development capabilities.
      Meanwhile, we have systematically enhanced the model's design comprehension and aesthetic expression in Web and App scenarios, enabling excellent construction of complex interactions, 3D scientific scene simulations, and high-quality visualization, making vibe coding a sustainable and deliverable production practice.
    • Enhanced Composite Instruction Constraints, Enabling Office Scenarios
      As one of the first open-source model series to systematically introduce Interleaved Thinking, M2.1's systematic problem-solving capabilities have been further upgraded. The model not only focuses on code execution correctness but also emphasizes integrated execution of "composite instruction constraints," providing higher usability in real office scenarios.
    • More Concise and Efficient Responses
      Compared to M2, MiniMax-M2.1 delivers more concise model responses and thought chains. In practical programming and interaction experiences, response speed has significantly improved and token consumption has notably decreased, resulting in smoother and more efficient performance in AI Coding and Agent-driven continuous workflows.
    • Outstanding Agent/Tool Scaffolding Generalization Capabilities
      M2.1 demonstrates excellent performance across various programming tools and Agent frameworks. It exhibits consistent and stable results in tools such as Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, and BlackBox, while providing reliable support for Context Management mechanisms including Skill.md, Claude.md/agent.md/cursorrule, and Slash Commands.
    • High-Quality Dialogue and Writing
      M2.1 is no longer just "stronger in coding capabilities." In everyday conversation, technical documentation, and writing scenarios, it also provides more detailed and structured responses.

    First Impressions

    "We're excited for powerful open-source models like M2.1 that bring frontier performance (and in some cases exceed the frontier) for a wide variety of software development tasks. Developers deserve choice, and M2.1 provides that much needed choice!"

    • Eno Reyes, Co-Founder, CTO of Factory AI

    “MiniMax M2.1 performed exceptionally well across our internal benchmarks, showing strong results in complex instruction following, reranking, and classification, especially within e-commerce tasks. Beyond its general versatility, it has proven to be an excellent model for coding. We are impressed by these results and look forward to a close collaboration with the MiniMax team as we continue to support their latest innovations on the Fireworks platform.”

    • Benny Chen, Co-founder of Fireworks

    “Minimax M2 series has demonstrated powerful code generation capability, and has quickly became one of the most popular model on Cline platform during the past few months. We already see another huge advancement in capability for M2.1 and very excited to continue partner with minimax team to advance AI in coding”

    • Saoud Rizwan, Founder, CEO of Cline

    “We could not be more excited about M2.1! Our users have come to rely on MiniMax for frontier-grade coding assistance at a fraction of the cost, and early testing shows M2.1 excelling at everything from architecture and orchestration to code reviews and deployment. The speed and efficiency are off the charts!”

    • Scott Breitenother, Co-Founder, CEO of Kilo

    "Our users love MiniMax M2 for its strong coding ability and efficiency. The latest M2.1 release builds on that foundation with meaningful improvements in speed and reliability, performing well across a wider range of languages and frameworks. It's a great choice for high-throughput, agentic coding workflows where speed and affordability matter."

    • Matt Rubens, Co-Founder, CEO of RooCode

    “Integrating the MiniMax M2 series into our platform has been a significant win for our users, and M2.1 represents a clear step forward in what a coding-specific model can achieve. We’ve found that M2.1 handles the nuances of complex, multi-step programming tasks with a level of consistency that is rare in this space. By providing high-quality reasoning and context awareness at scale, MiniMax has become a core component of how we help developers solve challenging problems faster. We look forward to seeing how our community continues to leverage these updated capabilities.”

    • Robert Rizk, Co-Founder, CEO of BlackBox

    Benchmarks

    MiniMax-M2.1 delivers a significant leap over M2 on core software engineering leaderboards. It shines particularly bright in multilingual scenarios, where it outperforms Claude Sonnet 4.5 and closely approaches Claude Opus 4.5.

    We also evaluated MiniMax-M2.1 on SWE-bench Verified across a variety of coding agent frameworks. The results highlight the model's exceptional framework generalization and robust stability.
    Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.

    To evaluate the model's full-stack capability to architect complete, functional applications "from zero to one," we established a novel benchmark: VIBE (Visual & Interactive Benchmark for Execution). This suite encompasses five core subsets: Web, Simulation, Android, iOS, and Backend. Distinguishing itself from traditional benchmarks, VIBE leverages an innovative Agent-as-a-Verifier (AaaV) paradigm to automatically assess the interactive logic and visual aesthetics of generated applications within a real runtime environment.
    MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities. It excels particularly in the VIBE-Web (91.5) and VIBE-Android (89.7) subsets.
    MiniMax-M2.1 also demonstrates steady improvements over M2 in both long-horizon tool use and comprehensive intelligence metrics.

    Showcases

    • Multilingual Coding

    • 3D Interactive Animation
      MiniMax M2.1 built a "3D Dreamy Christmas Tree" based on React Three Fiber and InstancedMesh, successfully rendering over 7,000 instances. It supports gesture interaction and complex particle animation, demonstrating advanced 3D rendering capabilities.
      Try it out: https://yuyl27wq92.space.minimax.io/

    • Avant-Garde Web UI Design
      M2.1 generated a minimalist photographer's personal homepage using an asymmetrical layout and a black-white-red contrasting color scheme. By combining immersive imagery with brutalist typography, it achieved a high-impact visual effect.
      Try it out: https://m6xkaf07udss.space.minimax.io/

    • Website - Skincare Brand
      M2.1 designed a landing page for a high-end organic skincare brand. Adopting a "Clean & Minimalist" style, it accurately presented the brand's premium identity and international visual appeal.
      Try it out: https://2drpfocv00n9.space.minimax.io/

    • Web 3D Lego Sandbox
      M2.1 developed a high-freedom 3D brick building application based on Three.js, implementing precise grid snapping algorithms and collision detection mechanisms. The project perfectly replicates the glossy texture of plastic bricks, supporting multi-angle rotation, drag-and-drop assembly, and instant color switching, providing users with an immersive 3D creative building experience.
      Try it out: https://8e6nunemyuzh.space.minimax.io/

    • Native App Development - Android
      M2.1 used Kotlin to develop a native Android gravity sensor simulator. Utilizing the gyroscope for a silky-smooth control experience, it features clever visual easter eggs that elegantly present the "MERRY XMAS MiniMax M2.1" message through natural UI transitions and collision effects.

    • Native App Development - iOS
      M2.1 wrote an interactive iOS Home Screen widget, designing a "Sleeping Santa" click-to-wake mechanism. The logic is complete with native-level animation effects—Santa lives in your widget; tap him ten times to wake him up for a surprise! 🎅🎁

    • Web Audio Simulation Development
      M2.1 developed a 16-step drum machine simulator based on the Web Audio API. It integrates synthesized drum sounds, non-linear rhythm algorithms, and real-time glitch sound effects, providing an avant-garde electronic music experience! (Turn on the sound in the video below to listen!)
      Try it out: https://21okxwno2u.space.minimax.io

    • Rust TUI
      M2.1 built a powerful Linux security audit tool with dual CLI + TUI modes using Rust, supporting one-click low-level scanning and intelligent risk rating for critical items such as processes, networks, and SSH.

    • Python Data Dashboard
      M2.1 created a Web3 cryptocurrency price dashboard in the style of The Matrix. It uses Python (backend for real-time price API fetching), HTML (structure), and CSS (Matrix aesthetic: green digital rain on a black background, monospaced font, glowing neon green text, terminal-like UI).

    • C++ Image Rendering
      M2.1 utilized C++ and GLSL to implement complex light transport algorithms, accurately rendering the physical refraction of a crystal ball, detailed SDF modeling of a snowman, and shimmering snow effects in a real-time environment.

    • Java Real-time Danmaku
      M2.1 implemented a high-performance real-time Danmaku (bullet chat) system in Java, with a clean and intuitive user interface and millisecond-level response times.

    • SVG Generation
      M2.1 generated an interactive isometric SVG island map, constructing a detailed miniature world that supports one-click zooming to freely explore four major themed areas.
      Try it out: https://08tmc3aada59.space.minimax.io/

    • Agentic Tool Use
      Tool Use Capability: Excel Market Research
      M2.1 demonstrated its tool-use capabilities by autonomously invoking Excel and Yahoo Finance to complete an end-to-end task, ranging from market research data cleaning and analysis to chart generation.

    • Digital Employee
      The "Digital Employee" is a key feature of the MiniMax M2.1 model. M2.1 accepts web content presented in text form and controls mouse clicks and keyboard inputs via text-based commands. It can complete end-to-end tasks in daily office scenarios across administration, data science, finance, human resources, and software development. The following demo video is a screen recording of M2.1's behavioral trajectory in the Agent Company Benchmark.

    • End-to-End Office Automation
      Demo 1: Administrative tasks
      Task Requirements: Proactively collect employees' equipment requests on communication software, then search for relevant documents on the enterprise's internal server to obtain equipment prices, calculate the total cost and determine whether the department budget is sufficient, and then record equipment changes.

      Demo 2: Project management tasks
      Task Requirements: Search for blocked or backlogged issues on the project management software, then find relevant employees on the communication software and consult them for solutions, and update the status of the issues based on the employees' feedback.

      Demo 3: Software development tasks
      Task Requirements: A colleague wants to know which is the most recent Merge Request that modified a certain file. Search for the relevant Merge Request, find its number, and inform the colleague.

    How to Use

    • The MiniMax-M2.1 API is now live on the MiniMax Open Platform: https://platform.minimax.io/docs/guides/text-generation
    • Our product MiniMax Agent, built on MiniMax-M2.1, is now publicly available: https://agent.minimax.io/
    • The MiniMax-M2.1 model weights are now open-source, allowing for local deployment and use: https://huggingface.co/MiniMaxAI/MiniMax-M2.1

    Local Deployment Guide

    Download the model from the Hugging Face repository: https://huggingface.co/MiniMaxAI/MiniMax-M2.1
    We recommend using the following inference frameworks (listed alphabetically) to serve the model:

    • SGLang
      We recommend using SGLang to serve MiniMax-M2.1. Please refer to our SGLang Deployment Guide.

    • vLLM
      We recommend using vLLM to serve MiniMax-M2.1. Please refer to our vLLM Deployment Guide.

    • Other Inference Engines

      • MLX
      • KTransformers

    Inference Parameters

    We recommend using the following parameters for best performance:
    temperature=1.0, top_p=0.95, top_k=40
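
    As a minimal sketch, the recommended settings can be attached to an OpenAI-style chat request body for a locally served model. The helper name `build_chat_request` is hypothetical, and the default model string simply reuses the Hugging Face repository name. Note that `top_k` is not part of the standard OpenAI schema; servers such as vLLM commonly accept it as an extra sampling parameter, but check your inference engine's documentation.

```python
import json

# Recommended sampling settings from the release notes.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_request(prompt, model="MiniMaxAI/MiniMax-M2.1"):
    """Return an OpenAI-style chat request body with the recommended sampling settings.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    # top_k is an extra (non-OpenAI) field; whether it is honored depends on the server.
    body.update(RECOMMENDED_SAMPLING)
    return body


if __name__ == "__main__":
    print(json.dumps(build_chat_request("Write a binary search in Rust."), indent=2))
```

    The resulting dictionary can be sent as the JSON body of a chat completion request to whichever OpenAI-compatible endpoint your inference server exposes.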

    Tool Calling Guide

    Please refer to our Tool Calling Guide.

    Contact Us

    • Contact us at [email protected]
    • Business Cooperation: [email protected]
    • MiniMax X: https://x.com/MiniMax__AI
    • MiniMax LinkedIn: https://www.linkedin.com/company/81521159
    • MiniMax Discord: https://discord.gg/minimax
    Original source
  • December 2025
    • No date parsed from source.
    • First seen by Releasebot:
      Dec 23, 2025
    MiniMax logo

    MiniMax

    MiniMax-M2.1

    MiniMax-M2.1 launches a text generation API with tool calling, accessible via HTTP or SDKs. It supports context windows of up to 204,800 tokens and emphasizes polyglot code understanding and interleaved tool use.

    🎉 MiniMax-M2.1: Polyglot programming mastery, precision code refactoring

    The text generation API uses MiniMax M2.1 to generate conversational content and trigger tool calls based on the provided context.

    It can be accessed via HTTP requests, the Anthropic SDK (Recommended), or the OpenAI SDK.

    Supported Models

    Model Name                Context Window (total input + output per request)
    MiniMax-M2.1              204,800
    MiniMax-M2.1-lightning    204,800
    MiniMax-M2                204,800

    Please note: The maximum token count refers to the total number of input and output tokens.
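
    Because the limit covers input and output together, the room left for a response shrinks as the prompt grows. The arithmetic can be sketched as follows; the function name is hypothetical, and counting the input tokens is left to whatever tokenizer or usage report you have available.

```python
# Total input + output tokens per request, as listed under Supported Models.
CONTEXT_WINDOW = 204_800


def max_output_tokens(input_tokens, context_window=CONTEXT_WINDOW):
    """Return how many output tokens remain once the prompt is counted."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    # Clamp at zero: a prompt larger than the window leaves no room for output.
    return max(context_window - input_tokens, 0)


print(max_output_tokens(200_000))  # 4800
print(max_output_tokens(250_000))  # 0, the prompt alone exceeds the window
```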

    Recommended Reading

    • Compatible Anthropic API (Recommended): Use Anthropic SDK with MiniMax models
    • Compatible OpenAI API: Use OpenAI SDK with MiniMax models
    • M2.1 for AI Coding Tools: MiniMax-M2.1 excels at code understanding, dialogue, and reasoning.
    • M2.1 Tool Use & Interleaved Thinking: AI models can call external functions to extend their capabilities.
    Original source
  • December 2025
    • No date parsed from source.
    • First seen by Releasebot:
      Dec 23, 2025
    MiniMax logo

    MiniMax

    MiniMax-M2.1: Polyglot programming mastery, precision code refactoring

    The video generation API creates videos from text or images, with new MiniMax-Hailuo models boosting realism and speed. The release outlines an asynchronous API flow to create, track, and download videos via task and file IDs.

    This API generates videos from user-provided text or images (including first-frame, last-frame, or reference images).

    Supported Models

    • MiniMax-Hailuo-2.3: New video generation model, breakthroughs in body movement, facial expressions, physical realism, and prompt adherence.
    • MiniMax-Hailuo-2.3-Fast: New Image-to-video model, for value and efficiency.
    • MiniMax-Hailuo-02: Video generation model supporting higher resolution (1080P), longer duration (10s), and stronger adherence to prompts.

    API Usage Guide

    Video generation is asynchronous and consists of three APIs: Create Video Generation Task, Query Video Generation Task Status, and File Management. Steps are as follows:

    • Use the Create Video Generation Task API (Text to Video, Image to Video, Start / End to Video, or Subject Reference to Video) to start a task. On success, it returns a task_id.
    • Use the Query Video Generation Task Status API with the task_id to check progress. When the status is success, it returns a file ID (file_id).
    • Use the Download the Video File API with the file_id from the previous step to view and download the generated video.
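
    The create, poll, and download flow described above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the official client: the three callables (`create_task`, `query_status`, `download_file`) are hypothetical stand-ins for your own HTTP wrappers around the three APIs, and the failure status names checked here are guesses. Only the `task_id`, `file_id`, and the success status come from the guide.

```python
import time


def wait_for_file_id(query_status, task_id, poll_seconds=10.0, max_polls=90, sleep=time.sleep):
    """Poll a video task until it succeeds, then return its file_id.

    query_status is any callable that takes a task_id and returns a dict with
    a "status" key and, on success, a "file_id" key.
    """
    for _ in range(max_polls):
        result = query_status(task_id)
        status = str(result.get("status", "")).lower()
        if status == "success":
            return result["file_id"]
        # Assumption: the service reports failure with a status like "fail" or "failed".
        if status in {"fail", "failed"}:
            raise RuntimeError(f"video task {task_id} failed: {result}")
        sleep(poll_seconds)
    raise TimeoutError(f"video task {task_id} did not finish after {max_polls} polls")


def generate_video(create_task, query_status, download_file, prompt):
    """Run the three steps in order: create a task, poll it, download the file."""
    task_id = create_task(prompt)                      # step 1: returns a task_id
    file_id = wait_for_file_id(query_status, task_id)  # step 2: returns a file_id
    return download_file(file_id)                      # step 3: fetch the video
```

    Passing the HTTP calls in as functions keeps the polling logic independent of any particular endpoint or SDK, and makes it easy to exercise with fake responses before spending quota on a real generation.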

    Official MCP

    Visit the official MCP for more capabilities: https://github.com/MiniMax-AI/MiniMax-MCP

    Original source
  • December 2025
    • No date parsed from source.
    • First seen by Releasebot:
      Dec 23, 2025
    MiniMax logo

    MiniMax

    MiniMax-M2.1: Polyglot programming mastery, precision code refactoring

    New Image Generation service introduces Text-to-Image and Image-to-Image capabilities. Generate images from detailed prompts or from reference images to preserve subject characteristics and maintain visual identity across contexts.

    The Image Generation service provides two core capabilities: Text-to-Image and Image-to-Image.

    Generate Images from Text

    Create images directly from detailed text descriptions (prompts) that specify the desired content.

    Generate Images with Reference Images

    This feature allows you to supply one or more reference images (including online image URLs) that contain a clear subject. Combined with a text prompt, the service generates a new image that preserves the subject’s key characteristics.
    This is particularly useful for scenarios that require consistent visual identity, such as generating images of the same virtual character in different contexts.
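
    To make the difference between the two modes concrete, here is a hedged sketch of how a request body might be assembled for each. Everything specific in it is an assumption for illustration: the helper name, the default model name `image-01`, and the `subject_reference`, `type`, and `image_file` field names should all be checked against the official API reference before use.

```python
def build_image_request(prompt, reference_urls=None, model="image-01"):
    """Build a request body for text-to-image or image-to-image generation.

    Field names and the default model name are assumptions, not confirmed
    by these release notes.
    """
    body = {"model": model, "prompt": prompt}
    if reference_urls:
        # Image-to-image: attach one entry per reference image URL so the
        # service can preserve the subject's key characteristics.
        body["subject_reference"] = [
            {"type": "character", "image_file": url} for url in reference_urls
        ]
    return body


# Text-to-image: the prompt alone describes the desired content.
text_only = build_image_request("A lighthouse on a cliff at dusk, oil painting")

# Image-to-image: the same character placed in a new context.
with_reference = build_image_request(
    "The same character reading in a cozy library",
    reference_urls=["https://example.com/character.png"],
)
```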

    Original source
  • December 2025
    • No date parsed from source.
    • First seen by Releasebot:
      Dec 23, 2025
    MiniMax logo

    MiniMax

    MiniMax-M2.1: Polyglot programming mastery, precision code refactoring

    Music Generation API now lets you generate full songs with vocals from text prompts and lyrics. Define style, mood, tempo, and vocal traits to craft ready-to-use tracks for videos, games, or apps. Aimed at quick, theme-driven music creation.

    The Music Generation API

    The Music Generation API can create a complete song with vocals based on a text description and lyrics.

    Use the prompt parameter to define the music’s style, mood, and scenario, and the lyrics parameter to provide the vocal content.
    This feature is ideal for quickly generating unique theme songs for videos, games, or applications.

    Example: Text-to-Music Creation

    import requests
    import os
    
    url = "https://api.minimax.io/v1/music_generation"
    api_key = os.environ["MINIMAX_API_KEY"]
    headers = {
      "Authorization": f"Bearer {api_key}"
    }
    
    payload = {
      "model": "music-2.0",
      "prompt": "This is a contemporary R&B/Pop track with distinct Trap influences, radiating a confident, assertive, and empowered energy. It features a bright, clear, and agile female vocal with a polished and heavily processed modern sound. The singer's rhythmic and confident delivery is defined by the heavy and stylistic use of Auto-Tune, creating its signature character. Extensive backing vocals, including layered harmonies and ad-libs built upon stacked unison vocals, produce a rich and full texture, enhanced by moderate reverb for a spacious feel. Set at a tempo of 80 BPM, the arrangement is driven by a dominant 808 bassline and electronic drums with intricate hi-hat patterns and sharp claps, while atmospheric synth pads and subtle sound effects craft a dynamic backdrop. This track is perfect for clubbing, parties, driving with the windows down, or a workout session, making it an essential addition to any confidence-boosting playlist.",
      "lyrics": "[chorus]\nSummit, i reached the summit\nI'm the peak with the fire, they all want from it\nSpill a bit of my glow, like a comet\nI ain't worried 'bout hills, you just plummet\nSummit, i reached the summit\nObsidian shards 'round my throat, now they run from it\nAin't no wonder why the valleys all run from it\nI'm awake, from the summit\n[verse]\nI know what i hold\nAnd i'm about to erupt, yeah\nA story untold, yeah\nI know you won't interrupt it\nKeep your eyes on the rise, no surprise that i'm bright\nGot one stream for the sea, other stream for the night\nI be flowin', you're erodin'\nSwear you're slowin', i'm explodin'\nPressure's growin', growin', growin'\n[interlude]\nSummit, i reached the summit\nI'm the peak with the fire, they all want from it\nSpill a bit of my glow, like a comet\nI ain't worried 'bout stone\n[verse]\nI ain't worried 'bout nada\nUnless it's new earth, unless it's magma\nUnless it's deep core, a new nirvana\nUnless it's shaping a new savanna\nI wanna feel like i'm mother gaia\nI wanna feel like i'm way up\nRumbling, grumbling 'til the world pay up\nMade another island, no layups\nStay hot every single day i wake up\n[chorus]\nSummit, i reached the summit\nI'm the peak with the fire, they all want from it\nSpill a bit of my glow, like a comet\nI ain't worried 'bout hills, you just plummet\nSummit, i reached the summit\nObsidian shards 'round my throat, now they run from it\nAin't no wonder why the valleys all run from it\nI'm awake, from the summit\n[outro]\nSummit\nRooo-ar",
      "audio_setting": {
        "sample_rate": 44100,
        "bitrate": 256000,
        "format": "mp3"
      }
    }
    
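    # Send the generation request and fail fast on any HTTP error.
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    # The audio is read from data.audio as a hex-encoded string.
    audio_hex = response.json()["data"]["audio"]
    
    # Decode the hex string to raw bytes and save it as an MP3 file.
    with open("output.mp3", "wb") as f:
        f.write(bytes.fromhex(audio_hex))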
    
    Original source
Releasebot

Curated by the Releasebot team

Releasebot is an aggregator of official release notes from hundreds of software vendors and thousands of sources.

Our editorial process involves the manual review and audit of release notes procured with the help of automated systems.