Minimaxi Release Notes
Last updated: Feb 14, 2026
- Feb 13, 2026
- Date parsed from source:Feb 13, 2026
- First seen by Releasebot:Feb 14, 2026
Forge: A Large-Scale Native Agent RL System
With the release of MiniMax M2.5, Forge introduces an asynchronous, natively agentic RL system that decouples the Agent from training and inference, reaching million-sample throughput with continuously rising reward. Optimizations such as unified white-box/black-box support, prefix-tree merging, and Windowed FIFO scheduling significantly improve stability and generalization on long-context tasks.
1. Problem Formulation
Before diving into the architecture, we first formalize the optimization objective of an Agent RL system as maximizing the effective training gain J:
Here, Throughput is the number of raw tokens processed per second, governed mainly by four parts of the RL system: Rollout, Training, Data Processing, and I/O. Sample Efficiency is the average performance gain contributed by each sample, determined by data distribution, data quality, algorithmic efficiency, and the degree of off-policyness. Stability and convergence are judged from metrics monitored during training.
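One minimal way to write this objective, under the assumption that the effective training gain factors into throughput and sample efficiency with stability as a constraint (the exact form used internally may differ):

```latex
\max \; J \;=\;
  \underbrace{\text{Throughput}}_{\text{raw tokens / s}}
  \;\times\;
  \underbrace{\text{SampleEfficiency}}_{\text{avg. gain per sample}}
\quad \text{s.t. training remains stable and convergent}
```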
To maximize J, we need to overcome three classes of challenges:
1.1 Agent Scalability and Framework Flexibility
Common RL frameworks and paradigms place heavy restrictions on Agent complexity, mainly in two ways:
- Limited Agent freedom: Treating the Agent as a white box requires sharing and passing state between the Agent and the RL framework. This design struggles to model complex Agent architectures (dynamic context management, multi-agent RL, and so on), so the model's capabilities fail to generalize to complex black-box Agents.
- Token consistency: The prevailing TITO (Token-In-Token-Out) pattern forces the Agent to be deeply coupled with the underlying tokenizer logic. Under complex context-management mechanisms, maintaining strict consistency between the Agent and RL carries a very high engineering cost.
1.2 System Efficiency and Computational Redundancy
Rollout completion times vary enormously, from a few seconds to several hours, which creates an asynchronous scheduling problem:
- Asynchronous scheduling between training and rollout: Anyone who has run asynchronous RL knows that the trade-off between MFU and RL algorithmic stability is hard to get right. Strict FIFO (First In First Out) or synchronous scheduling gets blocked by long-tail samples, while Greedy/FFFO (First Finish First Out) maximizes throughput but introduces an uncontrolled distribution shift that can easily crash RL mid-run.
- Prefix redundancy: Across multi-turn Agent requests and group-level rollouts, tokenizer encode-decode inconsistencies and context-management mechanisms lead to large shared prefixes between requests, and this redundancy wastes a great deal of compute during training.
1.3 Credit Assignment and Optimization Stability
- Sparse rewards: Trajectories of complex Agent tasks often span thousands of steps, making credit assignment from sparse rewards mathematically unstable. The sparsity yields a very low signal-to-noise ratio in the return computation, causing high gradient variance and undermining the stability of large-scale model training.
- Negative effects of long CoT: Since R1, much of the RL community has focused on growing response length. In real Agent scenarios, however, users care a lot about execution time; without constraints, training can produce a model that looks strong on leaderboards but delivers a poor user experience.
2. System Architecture and the Agent RL Paradigm
2.1 RL System Design
To achieve a truly scalable architecture, we no longer tie ourselves to any specific Agent; instead we design a generic abstraction layer that fully decouples the Agent's execution logic from the underlying training and inference engines. Our RL system consists of three core modules:
- Agent: This layer abstracts general Agents (covering both white-box and black-box architectures) and their runtime environments. It coordinates environment interaction and turns the Agent into a pure trajectory producer. By decoupling environment interaction from LLM generation, the Agent can focus on its core business logic (context management, complex environment interaction, and so on) without worrying about training and inference details.
- Middleware abstraction layer: As a bridge, this layer physically isolates the Agent side from the training/inference engines.
- Gateway Server: A standardized communication gateway that handles interaction requests between Agents and the LLM. Through a common standard protocol, it isolates the complexity of the underlying model from the Agent's high-level behavioral logic.
- Data Pool: A distributed data store that asynchronously collects trajectories and process signals. It acts as a buffer that decouples generation from training and allows flexible data processing and batching strategies.
- Training and inference engines:
- Rollout Engine: Dedicated to high-throughput token generation, serving the Agents' generation requests.
- Train Engine: Fetches data from the Data Pool via the scheduler, updates the Agent model, and stays synchronized with the sampling engine so that Agents always explore with the latest policy distribution.
In offline evaluation we found that different Agent scaffolds cause significant performance deviations. With this modular design, we trained with a large number of Agent frameworks without modifying any Agent-internal code. This complete decoupling of engine and Agent ensures that the model generalizes across environments; we have already integrated hundreds of frameworks and thousands of different tool-call formats.
2.2 White-Box Agent RL: Context Management as an Example
For white-box Agents, we can directly observe and optimize the model's behavior on specific Agent types through sufficient scaffold design and augmentation. In M2.5 we specifically addressed problems earlier models showed on long-horizon tasks with context management (such as DeepSearch):
- Performance degradation under context management: As the number of interaction turns grows, accumulated intermediate reasoning and redundant observations produce "attention dilution". This noise causes the model to lose focus on key information within its absolute context window.
- Train-inference mismatch: Context management can extend the interaction horizon and improve the Agent's performance on long-context tasks, but applying it only at inference time deviates from the data distribution seen during RL training; the model is forced to accept context transitions and handle unfamiliar long contexts at inference time, which hurts performance.
To solve these problems, we integrate context management (CM) directly into the RL interaction loop and treat it as a functional action that drives state transitions:
- CM-driven state transitions: We model CM as an agent action, with context transitions encoded in the environment dynamics. The transition from S_t to S_{t+1} implicitly includes the context-switch logic, so context adaptation becomes part of the model's training objective.
- Adaptive reasoning mode: By optimizing the policy π within this framework, the model learns to internalize the distribution shift and develops a robust reasoning mode that prioritizes state-critical tokens.
- Context-management-aware policy: Under this policy, the model learns during RL generation to anticipate possible context management and changes. By proactively retaining information relevant to the target task and reducing irrelevant context, it substantially improves performance under context-management Agents.
2.3 Black-Box Agent RL: Robustness Across Frameworks
Many of the Agents users actually run are closed source, and we have no visibility into their internal agent-loop logic. To ensure the model can still be optimized for these opaque scaffolds, we adopt the following approach:
- Non-intrusive integration: Forge is unaware of the Agent's internal implementation; the Agent only needs to send its requests to the RL service's Gateway, and the framework collects data and trains internally. Actual RL training is therefore compatible with arbitrary context operations (memory compression, history rewriting) and arbitrary internal agent loops (Deep Think, multi-agent setups, and so on).
- Multi-framework generalization: By decoupling the training loop from the Agent's internal state, MiniMax-M2.5 adapts to a wide range of black-box Agents, whether code Agents built around sandbox + MCP environments (for example, we train the Opencode Agent directly as a black box) or Agents that use aggressive context-reduction strategies (such as Truncate BC). Experiments show that the method still delivers stable gains on fully opaque black-box systems.
3. Engineering Optimizations
3.1 Hybrid Scheduling: Windowed FIFO
To resolve the conflict between throughput and data-distribution consistency, we propose the Windowed FIFO scheduling strategy. It sits between FIFO and Greedy: it preserves system throughput while bounding sample off-policyness (a code sketch follows the list below).
Suppose the maximum generation concurrency has been reached (e.g. N = 8192), the generation queue is Q, and the current head sits at index H. The training scheduler is restricted to a visible window of size W (e.g. W = 4096):
- Restricted visibility: The scheduler may only fetch completed trajectories from the range [H, H+W].
- Local greediness (inside the window): Within the active window, the scheduler can immediately pull any completed trajectory, avoiding head-of-line (HoL) blocking; fast tasks never wait for the head task to finish.
- Strict global blocking (outside the window): Even if the task at index H+W+k has finished, the scheduler is forbidden from fetching it.
- Constrained advancement: The window only slides forward (H ← H + 1) once the head task has been consumed. This forces the scheduler to wait for the long-running stragglers inside the current window and prevents the training distribution from drifting heavily toward "fast and easy" samples.
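A minimal Python sketch of this policy, assuming trajectories are indexed by submission order; the class and method names are illustrative rather than Forge's actual API:

```python
class WindowedFifoScheduler:
    """Hedged sketch of Windowed FIFO: greedy inside [H, H+W), strictly blocked outside."""

    def __init__(self, window: int = 4096):
        self.window = window
        self.head = 0                 # H: oldest not-yet-consumed index
        self.next_index = 0           # index assigned to the next submitted rollout
        self.finished = {}            # index -> trajectory, finished but not yet consumed
        self.consumed = set()         # indices already handed to the trainer

    def submit(self) -> int:
        """Register a new rollout request; returns its queue index."""
        idx = self.next_index
        self.next_index += 1
        return idx

    def mark_finished(self, idx: int, trajectory) -> None:
        self.finished[idx] = trajectory

    def fetch(self):
        """Return any finished trajectory inside the visible window, or None."""
        upper = min(self.head + self.window, self.next_index)
        for idx in range(self.head, upper):
            if idx in self.finished:
                traj = self.finished.pop(idx)
                self.consumed.add(idx)
                # the window only advances once the head itself has been consumed,
                # then skips over indices that were already consumed greedily
                while self.head in self.consumed:
                    self.consumed.discard(self.head)
                    self.head += 1
                return idx, traj
        return None   # everything visible is still running: wait for the stragglers
```

Setting `window=1` recovers strict FIFO, while a very large window degenerates into greedy FFFO behavior; the window size is exactly the knob that trades throughput against off-policyness.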
3.2 Prefix Tree Merging
Multi-turn Agent requests share highly overlapping context prefixes. Traditional approaches treat each request as an independent sample and recompute the common prefix, wasting a large amount of training compute.
We propose a Prefix Tree Merging scheme that restructures training samples from linear sequences into a tree. The concrete data processing and training strategy is as follows (a sketch of the tree construction appears after the list):
- As long as completions share a base prefix, they can be merged at the sample level into a single prefix tree (even if subsequent responses or sampling branches differ).
- Using attention-mask primitives (such as Magi Attention) to express the dependencies between branches, the forward pass is guaranteed to be mathematically identical to the naive scheme. When computing the loss, we unmerge the prefix tree back into sequence format, so downstream loss computation and metric collection are unaffected.
- The scheme eliminates redundant prefixes, achieving roughly a 40x training speedup over the naive approach while significantly reducing memory usage.
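A simplified sketch of the merging step, assuming completions are plain lists of token ids; the attention-mask construction and the unmerge-for-loss step described above are omitted:

```python
class PrefixTreeNode:
    """One shared token segment; children branch where completions diverge."""
    def __init__(self, tokens):
        self.tokens = tokens          # token ids unique to this node
        self.children = []

def merge_into_prefix_tree(sequences):
    """Merge token sequences that share prefixes into a radix-style tree (sketch only)."""
    root = PrefixTreeNode([])
    for seq in sequences:
        node, pos = root, 0
        while pos < len(seq):
            for child in node.children:
                common = 0
                while (common < len(child.tokens) and pos + common < len(seq)
                       and child.tokens[common] == seq[pos + common]):
                    common += 1
                if common == 0:
                    continue
                if common < len(child.tokens):
                    # split the child so the shared prefix becomes its own node
                    rest = PrefixTreeNode(child.tokens[common:])
                    rest.children = child.children
                    child.tokens = child.tokens[:common]
                    child.children = [rest]
                node, pos = child, pos + common
                break
            else:
                # no child shares a prefix: the remainder becomes a new leaf
                leaf = PrefixTreeNode(list(seq[pos:]))
                node.children.append(leaf)
                node, pos = leaf, len(seq)
    return root
```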
3.3 Inference Acceleration
After introducing asynchronous RL, the rollout stage's share of compute drops to around 60%, but inference itself still has plenty of headroom. We accelerate LLM inference with the following optimizations:
- Dynamic MTP: We first introduce MTP for inference acceleration. To keep the draft model's acceptance rate high during training, we continuously train a detached MTP head with a Top-K KL loss during RL so that it stays aligned with the RL policy (a sketch of this loss follows the list).
- Rollout-side PD disaggregation: Separating prefill and decode removes PD interference in MoE scheduling and gives each instance its own parallelism and generation strategy. It maximizes throughput while optimizing the latency of long-tail samples, preventing extreme samples from blocking the FIFO scheduler and driving up off-policyness.
- Global L3 KV Cache Pool: In multi-turn, ultra-long-context Agent scenarios, requests share a very high fraction of their prefixes, but local KV caches are capacity-limited and cannot reach a satisfactory prefix-cache hit rate; with very large RL batch sizes, evictions even trigger massive recomputation. A global L3 KV cache is therefore required. Forge also uses a cost-aware scheduling mechanism that weighs queueing delay against cache-transfer time to route requests dynamically, maximizing cache locality without overloading any instance.
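As an illustration of the MTP alignment step, here is a hedged PyTorch sketch of a Top-K KL loss between the policy and a detached draft head; the value of K and the exact formulation are assumptions, not the values used in Forge:

```python
import torch
import torch.nn.functional as F

def topk_kl_loss(policy_logits, draft_logits, k=20):
    """Distill the detached MTP (draft) head toward the RL policy on the
    policy's top-k tokens only. Shapes: [batch, seq, vocab]."""
    with torch.no_grad():
        topk_vals, topk_idx = policy_logits.topk(k, dim=-1)   # teacher restricted to top-k
        teacher = F.softmax(topk_vals, dim=-1)                 # renormalized over the k tokens
    student_logits = draft_logits.gather(-1, topk_idx)         # same k tokens from the draft head
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student) over the k tokens, averaged over batch and positions
    kl = teacher * (teacher.clamp_min(1e-9).log() - student_logprobs)
    return kl.sum(dim=-1).mean()
```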
4. Scalable Agent RL Algorithms
In the M2 series we largely kept the CISPO algorithm proposed in the M1 era. Even though the setting changed dramatically, from long CoT of a few tens of thousands of tokens to 200k-context Agent scenarios, CISPO still provides a very strong baseline. On top of it, we made dedicated adaptations for the characteristics of long-horizon Agents. On top of Windowed FIFO, we also adopt a multi-domain mixed training strategy, training Reasoning, General QA, Code Agent, General Agent, and other domains together in one mix. This alleviates the forgetting seen in staged training and noticeably strengthens generalization.
4.1 Dense & Process Reward
To solve credit assignment over ultra-long trajectories and keep training stable, we designed a composite reward with three components (a sketch follows the list):
- Process reward: Supervises the Agent's intermediate behavior (for example, penalizing language mixing or specific tool-call errors), providing dense feedback instead of relying on the final outcome alone.
- Task completion-time reward: Uses relative completion time as a reward signal. Because real latency depends not only on token generation but also on tool execution and sub-agent calls, this incentivizes the Agent to exploit parallel strategies and choose the shortest execution path to finish tasks faster.
- Reward-to-go for variance reduction: Sparse rewards on long-horizon tasks easily cause high gradient variance. We use reward-to-go to normalize returns, which greatly improves the precision of credit assignment and stabilizes optimization.
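A hedged sketch of how these three components might be combined into per-step returns; the weights, the shaping, and the use of the group median for relative completion time are illustrative assumptions:

```python
import numpy as np

def composite_returns(process_rewards, completion_time, group_times, outcome,
                      gamma=1.0, time_weight=0.1):
    """process_rewards : per-step dense signals (e.g. tool-call or language penalties)
    completion_time : wall-clock time of this trajectory
    group_times     : completion times of the other rollouts in the group
    outcome         : terminal task reward
    Returns a reward-to-go value per step."""
    r = np.asarray(process_rewards, dtype=np.float64)
    # relative completion-time bonus: faster than the group median -> positive
    med = float(np.median(group_times))
    rel_time = (med - completion_time) / (med + 1e-8)
    r[-1] += outcome + time_weight * rel_time

    # reward-to-go: discounted sum of future rewards at every step,
    # used as the per-step return to reduce gradient variance
    rtg = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):
        running = r[t] + gamma * running
        rtg[t] = running
    return rtg
```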
During MiniMax M2.5 training, facing hundreds of thousands of real Agent scaffolds and environments and a 200k context length, our RL system sustained a throughput of millions of samples per day while delivering continuously and stably rising reward and genuine gains in model capability.
- February 2026
- No date parsed from source.
- First seen by Releasebot:Feb 11, 2026
MiniMax to Announce 2025 Full Year Results on March 2, 2026
MiniMax to unveil 2025 full year results with an investor call on March 2, 2026; registration, dial-in details, and live/recorded access provided.
HONG KONG, Feb. 11, 2026 /PRNewswire/ -- MiniMax Global Inc. ("MiniMax" or the "Company"; HKEX stock code: 00100), a leading global artificial general intelligence technology company, announced today that it will release its full year 2025 results for the year ended December 31, 2025, after the close of the Hong Kong market on Monday, March 2, 2026.
The Company's management will hold a conference call at 8:00 PM Beijing Time on Monday, March 2, 2026 (7:00 AM U.S. Eastern Time) to discuss the results.
Participants must pre-register for the conference call via the following links:
Chinese line (Mandarin):
https://s1.c-conf.com/diamondpass/10053116-eg81mx.html
English simultaneous interpretation line (listen-only):
https://s1.c-conf.com/diamondpass/10053115-hu76t5.html
Participants may choose either the Chinese line or the English simultaneous interpretation line when pre-registering. Please note that the English interpretation line is listen-only; to ask questions, please register for the Chinese line. Upon registration, participants will receive an email containing dial-in details, a conference passcode, and a unique registration number, which they can use to join the call directly. Pre-registration remains open before and throughout the conference call.
In addition, a live and archived webcast of the conference call (both the Chinese line and the English simultaneous interpretation line) will be available on the Company's investor relations website at:
https://ir-tool.minimaxi.com/calendar/index.html?lang=zh-hk
A telephone replay of the conference call will be available until March 9, 2026 by dialing the numbers below.
Dial-in numbers
Mainland China: 4000 483 168
Hong Kong SAR, China: 800 906 986
US/Canada: 1 855 336 4664
Chinese conference ID: 6883691
English simultaneous interpretation conference ID: 3663475
About MiniMax
MiniMax is a leading global artificial general intelligence technology company. With the mission of "Intelligence with Everyone", the Company is committed to advancing the frontier of artificial intelligence and achieving artificial general intelligence (AGI). MiniMax has independently developed a series of multimodal foundation models with strong coding and Agent capabilities and ultra-long-context processing, capable of understanding, generating, and integrating text, audio, images, video, and music. Building on these in-house models, MiniMax has launched a series of AI-native products worldwide, as well as an open platform for enterprises and developers, delivering a first-class intelligent experience to users around the world, raising overall productivity, and enriching everyday life. For more information about the Company, please visit https://ir.minimaxi.com/zh-HK.
For investor and media inquiries, please contact:
MiniMax
Investor Relations
Email: [email protected]
Media Inquiries
Email: [email protected]
Piacente Financial Communications
Email: [email protected]
- Feb 4, 2026
- Date parsed from source:Feb 4, 2026
- First seen by Releasebot:Feb 4, 2026
MiniMax and Hyperbond Studio: Bringing AI Companions to Life with Speech 2.8
Hyperbond Studio partners with MiniMax to voice Call Me Sensei’s AI senseis, making language learning more immersive. MiniMax Speech 2.8 delivers ultra-fast, expressive, multilingual voice for real time chats. Open alpha on iOS/Android now; beta slated for Feb 2026.
We are excited to announce that Hyperbond Studio has partnered with MiniMax to power the voices of all AI companions in Call Me Sensei, their innovative language-learning dating simulator app. With MiniMax Speech 2.8, every conversation with an AI sensei now sounds and feels strikingly real.
Hyperbond Studio builds emotionally intelligent AI products combining education, companionship, and interactive entertainment. Their flagship app, Call Me Sensei, lets users learn languages by dating AI characters called "senseis" in their target language. Instead of drills and flashcards, learners practice by living out immersive scenarios like first dates at cafés, navigating train stations, or handling travel emergencies. Each sensei acts as romantic interest, conversation partner, and language coach, all in one.
Now, with MiniMax Speech 2.8, these characters come to life through voice. Senseis can shift smoothly from teasing to serious, encouraging to disappointed — carrying emotional texture across entire story arcs rather than just individual lines.
Why MiniMax Speech 2.8
MiniMax Speech 2.8 is our latest speech model designed for voice agent scenarios — ultra-fast, ultra-human, ultra-smart:
Ultra-Fast
- End-to-end latency under 250 milliseconds for real-time conversational interactions. Audio generation is no longer a bottleneck — conversations flow naturally.
Ultra-Human
- Full Voice Clone + Fluent LoRA technology delivers natural, expressive dialogue. Hyperbond works with voice actors to establish each sensei's core sound, then our models generate emotionally rich speech across countless situations.
Ultra-Smart
- Smart Text Normalization handles URLs, emails, dates, numbers and more. 40+ languages supported with inline code switching — perfect for language learning scenarios.
Voices You Can Develop Feelings For
"Hyperbond Studio is building a new category at the intersection of language learning, companionship and interactive storytelling," said [TT], [Global Business Manager] at MiniMax. "We're excited our speech models are helping them explore what happens when emotionally rich characters become the primary way people learn."
"Call Me Sensei depends on how human the senseis feel. We evaluated several TTS providers, and MiniMax stood out for emotional range and multilingual consistency, in particular for Asian languages which are often overlooked by other models," said Shawn Tan, Co-founder and CEO of Hyperbond Studio. "MiniMax gives those characters voices you can actually develop feelings for."
For learners, this translates to fluid, responsive dialogue where AI companions remain consistent, expressive, and believable over time — making it easier to focus on language practice rather than the technology beneath it.
What's Next
The MiniMax integration currently underpins all spoken interactions in Call Me Sensei and will support upcoming features including:
- Deeper relationship routes
- Simulated phone calls
- Additional language pairs
Call Me Sensei is now in open alpha on App Store (iOS) and Google Play (Android), with open beta launching February 2026.
Bringing Characters to Life
This partnership exemplifies what's possible when advanced speech technology meets creative storytelling. MiniMax Speech 2.8's combination of ultra-low latency, emotional expressiveness, and multilingual support makes it the ideal foundation for applications where voice isn't just a feature — it's the heart of the experience.
We look forward to seeing how Hyperbond Studio continues to push the boundaries of what AI companions can be.
- Jan 30, 2026
- Date parsed from source:Jan 30, 2026
- First seen by Releasebot:Feb 13, 2026
Beyond Office, Onto Your Desktop: MiniMax Agent Has Always Handled PPT, Excel, Word, and PDF for You
MiniMax Agent adds a desktop client and Expert Agents, bringing multi-scenario capabilities directly into your working environment: organizing files, cross-source retrieval, image sorting, personalized document writing, and more. Expert Agents launch complex tasks from a single sentence, covering deep analysis and report generation for legal, finance, study-abroad applications, and other domains.
Best practices for MiniMax Agent in office scenarios
Today we share some of MiniMax Agent's best practices in office scenarios.
Since launch, MiniMax Agent has been able to handle PPT, Excel, Word, PDF, and other files. We believe a good Agent needs a rich enough skill tree to help people in different roles and functions complete tasks across different scenarios.
In the most recent update we shipped the desktop client and Expert Agents: the desktop client brings the Agent into your working environment to organize files and distill information for you directly, while an Expert Agent, once injected with domain knowledge and behavior templates and trained on concrete SOPs, does a much better job on its class of tasks.
Next, we walk through a few typical scenarios to show how MiniMax Agent works.
01 Long-Horizon Complex Tasks: Cross-Source Search + Complex Logic + Multi-Condition Matching
In large-scale cross-source information retrieval, the Agent can traverse websites, browse in parallel, extract details, and organize them into structured output.
In HR work, campus recruiting plans must be drawn up months before each graduation season, which means collecting campus job-fair information (dates, venues, fees, registration process, and so on) for every key school. HR staff have to visit each school's official site one by one (each with a different site structure), manually copy the information, and paste it into Excel, a tedious and time-consuming process.
Take consolidating campus job-fair information for 20 top North American schools as an example: the Agent can automatically draw up the school list, visit each school's official site in turn, extract details such as event dates for the past and coming six months, how companies register, and participation fees, and compile everything into an Excel sheet.
Query:
I'm planning next year's North American campus recruiting for the company. First list the strongest CS and AI schools for me, Stanford, CMU, MIT and so on, around 20 in total. Then check each one's Career Fair situation: which fairs they held in the past six months, which are coming up in the next six months, how companies sign up, and roughly how much it costs. When you're done, put it all into an Excel file for me.
Beyond structured retrieval and organization, the Agent can also understand complex logic and do conditional matching.
March is the registration period for provincial civil-service exams each year. Unlike the national exam, which publishes all positions at once, the announcements of the 31 provinces (municipalities and autonomous regions) are scattered across their own personnel-examination websites, and candidates have to visit them one by one.
What makes it harder is the complexity of condition matching: when a position requires "bachelor's degree or above", does a master's qualify? When it requires "journalism and communication (major code 0503)", does the professional master's in journalism and communication with code 055200 count? When it requires "fresh graduates", one year of work experience disqualifies you. There are also residency restrictions such as "local household registration or local origin". Candidates must check every position against their own age, degree, major, household registration, and political status to decide whether they qualify.
The Agent understands these rules ("if a position requires bachelor's or above, a master's degree also qualifies", "055200 and 0503 both belong to journalism and communication"), and based on the candidate's age, degree, major, and political status it crawls the announcements across pages, filters out every position the candidate is eligible for, and compiles a personalized shortlist.
02 Onto the Desktop: Handling Large Batches of Local Files and Cross-Modal Tasks
In a real task, the Agent organized and archived 500 e-commerce product images automatically.
In day-to-day e-commerce operations, product images need to be filed by category, for example one folder for "menswear - coats - business" and another for "womenswear - T-shirts - casual", so images can be found quickly.
Facing 500 newly arrived product images, an operator has to keep deciding: is this menswear or womenswear? Business or casual? Which directory does it go in? These judgments used to be possible only for humans: Excel cannot read images, and the file manager has no idea what an image contains.
MiniMax Agent can recognize the images automatically and then handle everything that follows: classification, folder creation, moving, and renaming. Image understanding and action execution live in the same system, so you can hand this kind of asset-sorting work to the AI with confidence.
Query:
Identify the gender, garment type, and usage scenario of every image in these four brand folders, create folders with the hierarchy "brand/gender/garment type/scenario" and move the images accordingly, naming files as "garment-type_scenario_index.extension" (e.g. Coat_Casual_01.png).
The folder layout after organizing:
Besides organizing files, the Agent can also read local files in many formats and write based on their content.
In study-abroad applications, the Personal Statement is a key document. Students must write a persuasive narrative that matches the target program's requirements, drawing on their portfolio, transcripts, and resume.
The Agent can read local files scattered across folders and formats (a PDF portfolio, transcripts as images, a Word resume), understand the key information in them (GPA, project experience, portfolio highlights), and combine it with the target program's specific requirements to produce a first draft of the personal statement.
Query:
Based on all the personal information in this folder, write a Personal Statement in English of no more than 650 words for Li Jiying to apply to universities in North America.
The PS the Agent produced reads fluently:
03 Professional Scenarios: From Sprawling Content to Structured Insight
In legal work, the Agent can run the full pipeline of version-by-version clause comparison, change categorization, and logic distillation.
The legal department of a pharmaceutical company has to continuously track changes to the Regulations for the Implementation of the Drug Administration Law. Since its first release in 2002 the regulation has been revised multiple times, with the latest version published in 2026, and each revision can affect production, sales, quality management, and other operations.
The legal staff's job is to open every version of the regulation and compare it clause by clause. A clause read one way in V1; what did V2 change it to? Was content added, were certain phrasings removed, or was the wording adjusted? A clause suddenly disappears in V3; was it deleted or merged into another clause?
With multiple versions and hundreds of clauses, manual comparison is not only slow but also prone to missing key changes.
The Agent can compare clauses version by version, interpret the key clauses, distill the shift in regulatory logic behind each revision, analyze the practical impact on the company, and output a complete in-depth analysis report.
Query:
Please run a pairwise comparison of adjacent versions of the Regulations for the Implementation of the Drug Administration Law (V1-V2, V2-V3, V3-V4, and so on) and present it visually: for each version pair, start with a summary comparison table (clause number | before revision | after revision | change type | revision date) with colors or markers distinguishing additions/modifications/deletions; then give a categorized interpretation of key clauses with original-text screenshots or citations from authoritative sources; finally summarize the regulatory logic and practical impact of that revision in 2-3 lines.
The Agent delivers the comparative analysis report accurately and efficiently:
In financial-report analysis, the Agent can run the full pipeline from data extraction and metric calculation to visualization.
When an investment analyst writes an in-depth research report on Tencent, they need to download multiple financial reports, extract core financial data, compute metrics such as gross margin, net margin, ROE, debt-to-asset ratio, and growth rates, break revenue down by business segment (games, advertising, fintech, cloud services), project future growth, and finally turn the data into charts.
The Agent can automatically gather the required financial figures, compute the ratios, split revenue by business line, lay out the cash-flow picture, produce multi-dimensional visualizations, and present everything neatly across separate sheets.
Query:
Build a complete Excel financial analysis of Tencent, including: 1) a summary of core financial data for recent years; 2) a revenue comparison across the games, advertising, fintech, and cloud-services segments; 3) profitability, solvency, and growth metrics; 4) an analysis of cash flow; each dimension in its own sheet, and finally chart these dimensions and insert the charts into a last sheet.
04 Expert Agents: One Sentence Launches a Complex Task
In the cases above, users had to write fairly detailed instructions, telling the Agent which fields to extract, what format to produce, and how many sheets to create. That is easy for professionals who know their workflow, but newcomers may not know how to describe the requirement.
Expert Agents solve exactly this problem: we pre-inject domain knowledge, workflows, output standards, and relevant tools into the Agent.
The user only needs one simple sentence, and the Agent automatically unfolds a full professional-grade workflow.
Take the "trend tracking" expert as an example. With a generic Agent, tracking industry news and hot topics would require a prompt like:
"Search for news about Clawdbot from the last 7 days, pick out 3-5 topics with real discussion, search each topic in depth for background and the various viewpoints, then write an 800-1500 word article with 2-3 images and save it as a docx file."
That whole workflow is now built into the Expert Agent. The user just types "Clawdbot latest news", and the Agent automatically interprets the tracking topic and decomposes search keywords, searches the latest authoritative sources, mines recent hot topics from the results (prioritizing events from the last 1, 3, and 5 days), digs into background, data, and viewpoints for each topic, writes an 800-1500 word article with images, saves it as a docx, and finally fact-checks the result for accuracy.
Combined with scheduled tasks, you can have the Expert Agent rerun once a day at a fixed time and receive fresh coverage of your custom topic every day.
The user said a single sentence; the Agent refined and completed everything else on its own.
Query:
Latest news on Clawdbot.
The cases we've shown are only part of what the Agent can do. Come explore more practical Agent tips with us.
Web access:
agent.minimaxi.com
Desktop download:
agent.minimaxi.com/download
Business cooperation:
[email protected]
Your work scenarios may be more complex and your needs more particular, and we want to hear your needs and ideas. If you'd like the Agent to complete work for you in a specific scenario, join the wish-pool activity below and submit the Expert feature or suggestion you want most. If it is adopted, we will build that Expert for you free of charge and add 2,000 bonus credits as a thank-you.
Intelligence with Everyone.
- Jan 23, 2026
- Date parsed from source:Jan 23, 2026
- First seen by Releasebot:Feb 14, 2026
MiniMax M2.1: Post-Training Techniques and Practice for Agent Scenarios
M2.1 builds on the previous generation with an MoE architecture of roughly 230B parameters, focused on usability and inference efficiency in Agent scenarios. We also share new progress on three data-synthesis and evaluation tracks, SWE, AppDev, and WebExplorer, covering multiple languages and heavy context management to improve robustness and performance in real-world settings.
M2.1 is a further post-trained iteration of the M2 model and was open-sourced last month as our latest flagship model. It uses an MoE architecture with roughly 230B total parameters and about 10B activated parameters.
M2.1 shows excellent usability in Agent scenarios: even with a relatively small number of activated parameters, it maintains fast inference and strong performance, providing stable, production-grade engineering capability for a wide range of applications.
Agentic Data Synthesis
First, data synthesis, which falls roughly into three parts:
- Real-data-driven synthesis: SWE Scaling
- Expert-driven synthesis: AppDev
- Synthetic long-horizon tasks: WebExplorer
The first two target coding scenarios; the last one targets more general search scenarios.
Real-Data-Driven Synthesis: SWE Scaling
The core here is scaling data for the software-engineering scenario: making good use of GitHub, an enormous and well-structured data source, to synthesize all kinds of verifiable tasks. With these tasks in hand, whether we do rejection sampling, build SFT data, or run RL, we have a very solid data foundation.
Here is the overall pipeline. The rawest data source is GitHub PRs (Pull Requests) and commits. We apply quality filtering; there are many ways to do this, a simple one being to keep only PRs that were eventually merged, plus other rules such as whether relevant test cases are included.
With these PRs, the next and rather central step is to build a runnable Docker image environment for each PR.
Building the image environment is not easy in itself. The common approach today is to give an Agent some tools inside a code sandbox and let it keep building, iterating on the build results in a self-correcting loop, until it manages to construct the environment.
Ideally this process would be fully automated, but it is not there yet. For particular languages or particular library versions the environment may not build well, and experienced expert knowledge is needed to optimize the Agent's execution flow; this kind of workflow injection can be seen as injecting a skill.
With that, we have a runnable virtual Docker image environment for each PR.
After this step, we tag and route the PRs themselves. PRs contain many data types: bug fixes, new features, performance optimization, test construction and refactoring, and so on, several dozen categories in total.
We route them because different kinds of PRs or commits are used differently downstream.
Take the most common bug-fix scenario as the simplest example: we extract its F2P (fail-to-pass) and P2P (pass-to-pass) test cases. With these test cases we can verify; for instance, if the golden patch passes, we consider the data item valid, and then we have the model act as an Agent in the sandbox to fix the bug and use the F2P and P2P test cases to check whether it really passes.
Here, P2P mainly guards against introducing additional bugs while fixing the original one. In the new-feature scenario the approach above may no longer hold: for a new feature, the test cases may depend on the code of the new functionality itself (such as function signatures).
In that case the build may already fail, F2P or P2P test cases may not exist, and the test-extraction logic changes. For example, we need to focus on extracting the test points newly added in the PR and likewise verify that the original golden patch passes at least those new test points.
For performance optimization, there is no bug-fixing step, so F2P tests may not exist; there we extract P2P tests, including locating the test points where the performance improvement actually happens and verifying a stable, significant performance difference before and after the change.
These are just basic examples; different kinds of PRs are handled differently.
After that comes the third step: model-based validation of the data. Validation is needed because even in the basic bug-fix scenario, the raw GitHub PR data is sometimes not well-formed, or the test cases may not accurately cover the issue as originally described. That can lead to bugs that can never be solved if you follow the problem description.
In that case we have the model check the consistency of the original test cases and the problem statement. If key information really is missing, the model can augment the original problem description so that it becomes a self-contained, complete problem.
After this step, the same PR or issue can be exploited in many different ways. For example, we can inject additional bugs, or merge several adjacent commits or PRs to raise the difficulty, similar to what SWE-smith does.
Another option: the bug-fix scenario and the SWE-Test scenario are fully interconvertible. Originally, bug fix means the tests fail before applying the golden patch and pass afterwards.
If we turn it into a test-writing task, the task is inverted: the model is asked to write a test case that fails on the codebase state before the patch and passes after the fix. This requires strong test-writing ability from the model, and the task comes from the same source as the original bug fix while remaining verifiable.
Furthermore, we can build code-review-style tasks. Code review differs a bit from the above: it can be wired directly from the GitHub filtering step to SWE Review, because fundamentally we can construct review tasks that do not need a fully runnable environment.
For instance, when developing a project locally, the local environment often cannot run the full test suite or even the whole codebase. In that situation the model must read the codebase, statically analyze the dependencies between files, and raise issues. Because it does not depend on environment construction, it achieves better diversity.
The principle is similar: a PR contains the files before and after the change, and if the model can review out the bug already present before the change, and another LLM checks consistency, this too can serve as an approximately verifiable task. In short, there are many ways to transform and augment the problems.
In the end, we obtain SWE-style data together with its runtime environment: the original problem description, a fully verifiable reward based on test cases, and the Docker runtime environment.
With the data in hand, the uses include SFT and RL. For SFT, we do rejection sampling with multiple scaffolds (multi-scaffold) to improve the model's generalization.
For RL, the reason for multiple scaffolds is that today's scaffolds often contain complex context-management logic. If we only generate data inside a simple ReAct Agent framework, the model struggles to generalize to the behavior of other scaffolds.
For example, Claude Code contains many System Reminders, skill Claude MD files, and similar content; if the model has never seen them, it cannot generalize to a space the training data never touched. So we maintain a complete engineering foundation to run rejection sampling for the model across multiple scaffolds and produce better trajectories.
A summary of the core idea behind the SWE data:
Build an Agent-driven automated data pipeline on top of raw GitHub data to produce diverse, verifiable SWE-style data and environments.
This area really rewards creativity, for instance in task-synthesis methods and in constructing verifiable tasks. There is a lot of research here, and many new open-source efforts have appeared recently that are worth following.
As of M2.1, our SWE data scaling has reached a good state. It covers more than 10 mainstream programming languages and a wide range of code task types and programming scenarios. The number of directly usable repos (ready for SFT) exceeds 10,000 PRs, and the number of variable tasks exceeds 140,000.
Finally, the numbers on several core SWE leaderboards. Comparing M2.1 with M2, there are significant gains, especially on multilingual settings such as Multi-SWE and SWE-bench, thanks to more thorough scaling.
Evaluating across different scaffolds, the model shows good performance stability. Different scaffolds have different ceilings: Claude Code is well designed, whereas Mini-SWE-agent is essentially a pure Bash scaffold, and reading and writing files through Bash consumes more context, giving the model a lower ceiling. This matches expectations, but the model adapts well across the different scaffolds.
Expert-Driven Data Synthesis: AppDev
We split the coding direction into two classes: SWE and AppDev.
We define AppDev as full-stack software development from zero to one. We separate it from SWE because AppDev cannot predefine a fixed set of test cases. The SWE scenario is grounded in GitHub, with a constrained action space inside mature repos; AppDev writes from scratch, test cases cannot be pinned down in advance, and its reward pipeline is completely different from SWE's.
In the AppDev scenario we use experts in the loop, drawing on data experts' experience to iterate and generate data.
For M2.1 we focus on frontend, backend, Android, and iOS. Professional engineers from the relevant teams inside the company join as data experts to help us improve AppDev data synthesis.
At the start, for example, the experts write prompts or meta queries. Combined with scenario-specific random seeds, these meta queries synthesize diverse user queries. With queries in hand we sample, but before sampling we need to build rubric-based rewards, which relies heavily on expert experience: different experts apply different validation criteria to different tasks, and this cannot be done fully automatically.
Beyond rubrics, experts can inject experience into the pipeline. For example, when writing web frontends, the previous-generation model M2 had some bad habits, such as producing ugly gradients, and experts can design prior prompts that guide page design. If the model follows system prompts well, it can sample better trajectories under that guidance. With the trajectories in hand, we can use something like prompt distillation: sample with the system information present but train with it removed, so the experts' best practices become the model's default behavior.
Finally, we run rejection sampling and RL across multiple scaffolds. We also use Agent-as-a-Verifier for reward validation, because in frontend-like scenarios it is hard to judge from static code against a rubric alone. We have the model fully deploy the project in a sandbox, interact with the interface through tools such as Playwright, and score against the rubric based on the state changes after interaction (a sketch of this idea follows below).
The difference from LLM-as-a-judge is that it has to use tools over multiple rounds of interaction before it can make a judgment.
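To make the Agent-as-a-Verifier idea concrete, here is a minimal Playwright sketch that, given a running app URL, drives the UI and scores rubric items from the observed page state; `rubric_items` and the scoring rule are hypothetical simplifications of the real agentic verifier, which chooses its own interactions:

```python
from playwright.sync_api import sync_playwright

def score_rubric(url, rubric_items):
    """rubric_items: list of (click_selector, expected_text) pairs (illustrative only).
    Returns the fraction of rubric items satisfied after interacting with the UI."""
    passed = 0
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for selector, expected_text in rubric_items:
            page.click(selector)              # drive the UI the way a user would
            page.wait_for_timeout(500)        # crude wait for the state to settle
            if expected_text in page.content():
                passed += 1                   # rubric condition observed after interaction
        browser.close()
    return passed / max(len(rubric_items), 1)
```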
On the AppDev side, the open leaderboard currently drawing the most attention is Hot Arena, where we rank first among open-source models. For apps we also built our own leaderboard, VIBE Arena, which we expand on later.
Synthetic Long-Horizon Tasks: WebExplorer
Besides coding Agent scenarios, we are also investing in general-purpose Agent scenarios, and search is the foundation of the general setting.
We previously published WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents, available on arXiv. The core idea has two parts: first, construct information-rich seed questions through free agentic exploration; second, iteratively evolve the queries to increase their complexity.
To expand on an example: WebExplorer starts from nothing but a random seed, say "Brazil national football team". Through search, the model finds the 1950 World Cup and the Maracanazo. The match records mention that the referee was George Reader, who years later chaired an English club, which beat Manchester United to win the 1976 FA Cup, with the winning goal scored by Bobby Stokes.
The model finally synthesizes the clues along the search chain into an information-rich initial question. The question carries a lot of information but still has an obvious search entry point. The evolution strategies include removal, obfuscation, and substitution. For example:
- Blur the specific match details into "a World Cup with a unique format and no knockout stage";
- Replace salient information like "beat Manchester United in the FA Cup" with "led a second-division club to victory over a first-division powerhouse", increasing search difficulty;
- Remove facts that are easy to look up on Wikipedia, such as the player's age at death.
The final evolved query is far more complex than the initial question and has no obvious search entry point; the model must explore step by step from the clues.
The yardstick for this kind of long-horizon task synthesis is the average number of turns needed to solve the problem. The initial questions take about 7.9 turns to solve, and the evolved ones reach 9.9 turns. This strategy has shipped in M2.1. On the BrowseComp leaderboard, M2.1 ends up close to GPT-4.5's SOTA numbers, especially with Context Management enabled. Context management is the mainstream approach for search Agents today: the context is repeatedly cleared during evaluation to keep it clean, so the model can keep scaling at test time.
Agentic RL Framework and Algorithms
Next, the RL framework and algorithms.
Forge
We use our in-house framework Forge, which was designed for Agent scenarios from the very start of M2 development. An important feature of Forge is that it supports running RL with arbitrary Agent scaffolds. Integrating with Forge only requires implementing four interfaces (a sketch of the adapter shape follows below):
- agent_reprocess: preprocessing (initialization)
- agent_run: execution
- agent_postprocess: postprocessing
- calculate_reward: reward computation
For example, even a black-box Agent that exists only as a binary can be integrated. When the Agent runs, its base URL is pointed at Forge's internal inference-engine service; the engine currently supports our internal inference framework as well as SGLang. After the switch, all logs from the Agent's run loop are persisted on the inference-server side. A Data Coordinator post-processes the logs and extracts sub-agent trajectories. To improve training efficiency, the framework also performs intelligent prefix merging over trajectories.
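A sketch of what implementing those four hooks might look like in Python; the signatures are assumptions based on the description above, not Forge's published API:

```python
from abc import ABC, abstractmethod

class ForgeAgentAdapter(ABC):
    """Illustrative adapter for plugging an arbitrary (even black-box) agent into Forge."""

    @abstractmethod
    def agent_reprocess(self, task):
        """Preprocessing / initialization: set up the environment and scaffold state for one task."""

    @abstractmethod
    def agent_run(self, task, base_url):
        """Run the agent loop; all LLM calls go through `base_url`, which points at
        the RL service's inference gateway so trajectories can be logged server-side."""

    @abstractmethod
    def agent_postprocess(self, run_artifacts):
        """Postprocessing: extract trajectories (including sub-agent ones) from the logs."""

    @abstractmethod
    def calculate_reward(self, trajectory) -> float:
        """Compute the reward for one extracted trajectory."""
```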
CISPO
On the algorithm side, up through the M2.1 era the core of our RL algorithm still follows CISPO, which we proposed in the MiniMax M1 paper. M1 highlighted two key points: the importance-sampling truncation design of CISPO itself, and the FP32 precision fix we made at the time.
The first part is CISPO's objective function. You can think of it as very close to the REINFORCE objective: essentially REINFORCE with an importance-sampling correction for the off-policy setting. What CISPO changes is that it clips the importance-sampling weight, the scalar weighting factor. In practice the clip usually means capping its upper bound, which keeps gradients from blowing up; fundamentally it controls the magnitude of the gradient updates.
CISPO was designed back when everyone was reproducing the R1-Zero recipe (including ByteDance's DAPO). During our reproduction we made a key observation: throughout the RL run, certain tokens kept getting filtered out by PPO's clipping mechanism, and once clipped, those tokens lost their gradients for good. Statistically, they were often transition words like "wait". This means PPO's clipping prevents many tokens from ever emerging through training.
DAPO's answer was to raise the upper bound of the PPO clip. Our idea in CISPO is to let every token receive a gradient while controlling the importance-sampling weighting coefficient of the gradient. This introduces some bias but reduces the overall variance of the optimization.
When we finished the small-model experiments and moved RL to the larger MiniMax M1 model, we found the reward barely increased. After printing the training-time and inference-time probabilities we saw clear piecewise horizontal segments, and their correlation was much lower than for dense models. The inference team then debugged layer by layer and found that the precision of the prediction layer, the LLM head, was critical. After restoring that layer to FP32, training-inference consistency improved markedly and training gains became stable.
Moving to the M2 generation, the main change is the shift to agentic RL with multi-turn tool calls. These tool calls inject noise from the external environment, making trajectories more extreme, more off-policy, or statistically anomalous.
In this setting we adopted several methods proposed by the community, including MIS (importance-sampling correction) and PPO-based trajectory filtering. The core idea is to filter out trajectories whose statistics are anomalous and sit in the long tail, preventing huge gradient swings and keeping RL training stable overall.
This figure is taken from Meta's paper "The Art of Scaling Reinforcement Learning Compute for LLMs". They ran a fairly systematic experimental comparison, and their conclusions are broadly consistent with ours.
The experiments show that CISPO performs very well throughout scaling, in both convergence speed and the convergence ceiling. The left part of the figure shows the effect of CISPO's importance-sampling trick, and the right part the gains from the FP32 precision fix.
Overall, the experiments in that paper are very thorough; if you are interested in RL, I recommend taking a look.
Agent Evaluation
Next, the three evaluations we released alongside M2.1:
- VIBE: Visual & Interactive Benchmark for Execution in Application Development
- SWE-Review
- OctoCodingBench
The VIBE evaluation data is already open-sourced on Hugging Face, but its infrastructure is not fully ready yet; we are pushing hard to complete it.
VIBE
First, VIBE, which targets the AppDev (application development) scenario. Because there was no leaderboard on the market that measures this kind of capability, we built one ourselves, covering frontend, simulated Android, iOS, and backend. M2.1 is a big step up from M2 here.
For the verification logic we use an Agent-as-a-Verifier scheme, running an agent in a real environment. The reward has three dimensions:
- Execution level: verify that the code compiles;
- Interaction level: interact with the interface through tools and judge whether the business logic is correct;
- Visual level: score against aesthetic criteria; this is subjective, but we focus on criteria with high inter-rater consistency.
Compared with traditional LLM-as-a-judge evaluation based only on static screenshots, agent verification interacts dynamically and reflects bugs and implementation flaws more comprehensively.
SWE-Review & OctoBench
The other two are SWE-Review and OctoBench. SWE-Review corresponds to the review scenario in the pipeline; it is an evaluation set covering multiple languages and scenarios, with metrics that consider both recall and hallucination rate. OctoBench evaluates instruction following in Agent scenarios.
Unlike traditional IF leaderboards, instructions in Agent scenarios come not only from the user or system prompt but also from system reminders, Claude.md, or tool schemas. OctoBench is built in-house and scores with checklist-based rubrics.
Finally, the numbers: because of the optimizations we made here, M2.1 improves quite noticeably over M2, including on SWE-Review and on OctoBench's instruction-following results.
- Jan 23, 2026
- Date parsed from source:Jan 23, 2026
- First seen by Releasebot:Feb 13, 2026
The Post-95 Generation Is Trying a Very New Way of Working
MiniMax rolls out an AI-native Workspace upgrade: the Agent intern embeds into real workflows, automating collaboration across DevOps, sales, and other scenarios with cross-system access to local documents, code, and alerts. A limited-time free trial is now available, and the desktop client and Expert Agents entrances are live.
The Birth: To Solve Our Own Pain Points, We Built a New Colleague
Over the past few months we have tried to weave AI into the organization, trusting it with real tasks the way we trust human colleagues.
Today we want to show you how MiniMax actually works internally.
Along the way, the Agent stopped being just a tool inside a chat box: it is now active in the flow of our working documents, in the iteration of our code repositories, and even in the 3 a.m. alert group chats.
In late October 2025, with the release of MiniMax's text model M2, we found that the model's abilities in complex-task understanding, tool calling, and long-chain execution had, for the first time, stably crossed a key threshold.
Meanwhile, two of our engineers, driven by their own day-to-day pain points, hacked together a set of Agent tools adapted to our internal office systems and opened them to the whole company, meaning anyone could DIY one or more "interns" as needed. We named it the "Agent intern".
The Agent intern plugs directly into real workflows:
- Access (with authorization) to local documents, email, calendars, GitLab, and logging systems
- The ability to read code, modify code, open MRs, and monitor alerts
- The ability to understand business context, not just single instructions
Colleagues across R&D, sales, HR, operations, and other functions quickly picked it up.
Over the past several weeks, close to 100% of MiniMax employees have used the Agent intern. We tallied the top ten internal usage scenarios for the Agent intern, with the following result: 41.9% of queries were questions about AI tool capabilities (a sign that colleagues are actively exploring how to use the Agent).
Note: a task may belong to multiple scenarios, so the percentages sum to more than 100%.
Field Notes on an Evolving Organization: The Agent Seeps into the Capillaries of the Business
Compute platform: in a high-pressure role, the Agent frees people from draining work
With unpredictable traffic swings and around-the-clock on-call duty, compute platform and operations (DevOps) may be one of the most stressful jobs in the internet and AI industries.
For an operations engineer, being on call is a state of constant tension: whether eating, sleeping, or queuing at Disneyland, you carry a laptop. The moment an alert fires, you have to jump onto the intranet and figure out whether the system is collapsing under a traffic spike or it is just a false alarm.
At the same time, in the fast-iterating architectures of the AI industry, systems are extremely complex: new hires must master on-call handbooks hundreds of pages long, and a large share of energy is burned on deciding whether alerts are false positives. The work is fragmented, high-pressure, and has very low marginal value for personal growth.
We brought in the Agent intern to rebuild this workflow.
The Agent intern responds to alerts automatically in real time, analyzing the call chain and judging them in context. If it is a false alarm, the intern filters it out; if it is a real problem, it gives an initial diagnosis. Our operations colleagues report that the Agent intern takes on roughly 80% of the bug-hunting workload.
We also rebuilt the way knowledge is passed on for the Agent intern. Our engineers began writing an Operations Guide (Agent Edition): rules, code signatures, and handling logic, formats that are hard for humans to read, can be written directly for the Agent. The Agent has unlimited patience and memory; it learns once and executes precisely in every subsequent incident.
Freed from the constant tension of reactive firefighting, engineers now have more focused time for more creative, longer-term architectural improvements.
International sales: the Agent gives salespeople a pair of trusted assistants
In expanding international business, facing prospects with diverse backgrounds and cross-language communication, salespeople struggle to deliver both quality and volume.
For frontline sales, building connections through overseas social media is mentally exhausting work: every DM that hopes for a reply cannot be a copy-paste blast. Salespeople spend large amounts of time reading strangers' profiles, understanding their industry background and recent activity, and then working out how to say hello naturally.
We brought in the Agent intern as the salesperson's "super gear"; each salesperson can equip two Agent interns:
- Agent A reads social-media pages, distills the prospect's background, and works out an angle of approach based on their recent activity;
- Agent B polishes the DM copy so the language reads natively.
With the Agent interns collaborating, salespeople can focus their energy on higher-value business judgment: is this prospect worth deep follow-up? What core message should this stage convey? How should the cadence of follow-ups be designed?
The Agent intern executes; the human focuses on judgment and strategy.
Senior engineers: "when I have an idea, I just toss it to the Agent"
While rolling out the Agent, we noticed an interesting phenomenon: it is often the people who were best under the old paradigm (such as senior programmers) who have the hardest time making the cognitive shift to Agents.
Perhaps it is a human weakness: the stronger your skills, the harder it is to let go.
Their muscle memory is too strong; they are used to staring at the screen and typing out code line by line.
What the Agent brings is the possibility of asynchronous collaboration: you can have the Agent run multiple tasks at once, which yields an enormous efficiency gain.
One of our senior engineers shared his experience:
"In the first half of last year I spent most of my time coding, writing the code myself (and I genuinely love writing code, I find it more fun than gaming), but most of that time went into implementing features and implementing ideas.
Now I rarely open an IDE. What I do is keep throwing ideas out: I can hand the Agent 5 to 10 tasks at once and then just review what it sends back.
The efficiency gain is overwhelming. I can generate ideas nonstop, while walking, while driving, even right after waking up, and once I have an idea, I just toss it to the Agent."
Perhaps this is not merely a tool upgrade but an upgrade in how we think.
Our Long-Term Thinking on Agents
In MiniMax's internal practice, PMs, designers, and engineers are no longer upstream and downstream stations on an assembly line; the traditional boundaries between product, design, and engineering are dissolving, and everyone is becoming an Agent designer.
We keep asking: what should the end state of the Agent actually be?
In MiniMax's vision, the ultimate Agent is not a machine making decisions in a vacuum but a long-term partner deeply embedded in the work environment with full professional context, able to:
- Hold long-term memory that spans cycles and deeply understand a domain expert's working preferences;
- Organize the knowledge scattered across daily work and apply the various standards and SOPs to the workflow;
- Stay keenly aware of the environment across business systems such as CRM or ERP, catching key signals and responding proactively through triggers, without being told.
This evolution from passively waiting to proactively sensing, and from single-task execution to living in a dynamic environment, is the true form of AI stepping out of the lab and into the real world.
If we define that end state, with full context and environmental awareness, as 100%, then looking back at the Agent 1.0 era of the past year, we evolved from chatbots stitched together with routing chains to early Agents that wait for user instructions before orchestrating tools; we have only just completed the first 30%.
But that 30% matters enormously. In this year we left lab mode, faced the complexity of the real world, and lived through a re-evaluation of what AI participation in work is worth. It is precisely this first 30% that validated the direction and clarified the path toward the remaining 70%. On that basis, we launched the AI-native Workspace.
The AI-native Workspace has two core updates:
- The desktop client works in your local environment: by designating a local workspace as its working space and context, it builds an intelligent work environment that is yours alone and truly reaches into the core workflows of every role.
- Expert Agents let users build domain experts that reach 95 or even 100 points in a specific field, not just a prompt tweak but a deep injection of knowledge and capability.
We have opened a limited-time free trial. You are welcome to try Expert Agents on the web, or get the desktop installer through the trial link. Try it here:
https://agent.minimaxi.com
One last thing. At MiniMax, we have always believed:
Everyone's most precious assets are creativity and passion.
We are still exploring what an AI-native organization looks like; we will keep up a high iteration cadence and keep folding the newest internally validated paradigms into the product.
This is a process of continuous evolution. We invite you to define it, explore it, and witness it with us.
This is also MiniMax's long-term vision:
Intelligence with Everyone.
Finally, let our intern take you on a tour of MiniMax.
- Jan 22, 2026
- Date parsed from source:Jan 22, 2026
- First seen by Releasebot:Jan 23, 2026
MiniMax M2.1: Post-Training Experience and Insights for Agent Models
M2.1 debuts as a flagship open-source model with Mixture-of-Experts, ~230B total and ~10B active params, delivering production-ready agentic performance. It introduces SWE scaling data pipelines, AppDev and WebExplorer syntheses, and the new benchmarks VIBE, SWE-Review, and OctoBench.
M2.1 overview
M2.1 is the latest flagship open-source model released last month, built upon further post-training optimization over the previous M2 generation. It adopts a Mixture-of-Experts (MoE) architecture, with approximately 230B total parameters and ~10B activated parameters.
In Agent scenarios, M2.1 demonstrates excellent usability. Even with a relatively small number of activated parameters, it maintains high inference efficiency and strong performance, offering stable and production-ready engineering capabilities for a wide range of real-world applications.
Agentic Data Synthesis
We begin with data synthesis, which can be roughly divided into three categories:
- Real-data-driven synthesis: SWE Scaling
- Expert-driven synthesis: AppDev
- Virtual long-horizon task synthesis: WebExplorer
The first two primarily target coding scenarios, while the last focuses on more general-purpose search tasks.
Real-Data-Driven Synthesis: SWE Scaling
At the core of SWE Scaling is data scaling for software engineering scenarios. We leverage GitHub, an enormous and well-structured data source, to synthesize a wide variety of verifiable tasks. With such tasks, we can effectively perform rejection sampling, construct SFT datasets, or conduct RL, all on a solid data foundation.
Data Pipeline Overview
The raw data source consists of GitHub Pull Requests (PRs) and Commits. We first apply quality filtering—simple rules include selecting PRs that were eventually merged, along with additional criteria such as the presence of relevant test cases.
PR Tagging and Task Diversification
The next and most critical step is constructing a runnable Docker environment for each PR.
Environment construction is non-trivial. The common approach today is to let an Agent iteratively build the environment in a code sandbox, equipped with tools that allow it to repeatedly attempt builds and self-correct based on build results.
Ideally, this process would be fully automated, but in practice, it is not yet perfect. For certain languages or library versions, environments may fail to build reliably. In such cases, expert knowledge is required to optimize the Agent's execution flow, this can be seen as injecting skills into the Agent.
Once completed, we obtain a runnable virtual Docker environment for each PR.
Next, we perform tagging and routing on PRs. PRs contain diverse data types—bug fixes, feature additions, performance optimizations, test construction or refactoring, and dozens of other categories.
Bug Fix Example
Routing is necessary because different PR types require different downstream treatment. For standard bug-fix scenarios, we extract F2P (Fail-to-Pass) and P2P (Pass-to-Pass) test cases. If a golden patch passes these tests, the data is considered valid. We then let the model act as an Agent to fix the bug in a sandbox and verify correctness using both F2P and P2P tests.
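A hedged sketch of that verification step, assuming a pytest-style test runner inside the PR's environment; `apply_patch` and `revert_patch` are hypothetical helpers standing in for the real harness:

```python
import subprocess

def run_tests(test_ids, workdir):
    """Run the given test ids with pytest and return {test_id: passed}.
    Real environments use whatever test framework the repo ships inside its Docker image."""
    results = {}
    for tid in test_ids:
        proc = subprocess.run(["python", "-m", "pytest", "-x", tid],
                              cwd=workdir, capture_output=True)
        results[tid] = proc.returncode == 0
    return results

def verify_golden_patch(apply_patch, revert_patch, f2p, p2p, workdir):
    """F2P tests must fail before the patch and pass after it;
    P2P tests must pass in both states (no regressions)."""
    revert_patch(workdir)
    pre = run_tests(f2p + p2p, workdir)
    apply_patch(workdir)
    post = run_tests(f2p + p2p, workdir)
    f2p_ok = all((not pre[t]) and post[t] for t in f2p)
    p2p_ok = all(pre[t] and post[t] for t in p2p)
    return f2p_ok and p2p_ok
```

The same harness can be reused as the reward function when the model, rather than the golden patch, produces the fix.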
Feature Addition and Performance Optimization
P2P tests are particularly important to ensure that no new bugs are introduced during the fix.
For feature additions, traditional F2P/P2P logic may not apply, since tests often depend on newly introduced code (e.g., function signatures). Instead, we focus on extracting newly added test points and ensuring the golden patch passes them.
Model-Based Validation
For performance optimization, there is no bug-fixing process. In such cases, we extract P2P tests that can verify stable and significant performance differences before and after the optimization.
Different PR types naturally require different handling strategies.
Even in basic bug-fix scenarios, raw GitHub PRs are not always well-structured, and test cases may not fully cover the described issue. This can lead to situations where a bug is impossible to fix purely based on the problem description.
Task Transformations and Augmentation
To address this, we use the model itself to validate consistency between test cases and problem descriptions. If key information is missing, the model augments the original description to make it a self-contained and solvable problem.
For the same PR or issue, there are many ways to reuse the data:
- Inject additional bugs
- Merge adjacent commits or PRs to increase difficulty (similar to SWE-smith)
- Convert BugFix tasks into SWE-Test tasks
In the SWE-Test formulation, the task is inverted: the model must write a test case that fails before applying the patch and passes after, requiring strong test-writing capability while remaining verifiable and task-equivalent.
We can also construct code review tasks, which do not necessarily require a runnable environment. The model performs static analysis, reviews code changes, and identifies issues. Consistency can be verified using another LLM, making such tasks approximately verifiable while offering greater diversity.
Ultimately, we obtain SWE-style datasets that include:
- Original problem descriptions
- Fully verifiable rewards based on test cases
- Runnable Docker environments
These datasets are used for both SFT and RL. For SFT, we apply multi-scaffold rejection sampling to improve generalization. For RL, multi-scaffold training is essential because different scaffolds introduce different context management and execution logic. Training on a single scaffold (e.g., a simple ReAct loop) severely limits generalization.
A summary of the core idea behind the SWE data: building agent-driven automated data pipelines based on raw GitHub data to produce diverse, verifiable SWE-style datasets and environments.
As of M2.1, SWE scaling covers:
- 10+ major programming languages
- A wide variety of coding tasks and scenarios
- 10,000+ runnable PRs
- 140,000+ variable tasks
On benchmarks such as Multi-SWE and SWE-bench, M2.1 significantly outperforms M2, especially in multilingual settings. Performance remains stable across different scaffolds, demonstrating strong adaptability.
Expert-Driven Data Synthesis: AppDev
We divide coding into two categories:
- SWE: tasks within existing repositories with fixed verification
- AppDev: full-stack application development from scratch
AppDev differs fundamentally from SWE because test cases cannot be fully predefined. As a result, its reward structure is entirely different.
AppDev heavily relies on experts-in-the-loop. Internal specialists in frontend, backend, Android, and iOS development help design prompts, meta-queries, and rubric-based rewards, which cannot be fully automated.
Experts also inject best practices through system prompts. During training, the system prompts can be omitted in the training data, thereby distilling expert heuristics into the model's default behavior.
Verification uses Agent-as-a-Verifier: the Agent deploys the app in a sandbox, interacts with it via tools such as Playwright, and scores performance against rubrics. Unlike LLM-as-a-judge, this requires multi-step tool-based interaction.
M2.1 currently ranks #1 among open-source models on the Hot Arena leaderboard. We also built an internal benchmark called VIBE Arena, which will be introduced in more detail later.
Synthetic Long-Horizon Task Generation: WebExplorer
Beyond coding agents, we also focus on general-purpose agent scenarios, with search as a foundational capability.
Our work "Explore and Evolve for Training Long-Horizon Web Agents" (available on arXiv) proposes a two-step approach:- Exploration: Agents freely explore the web to construct information-rich seed questions.
- Evolution: Queries are iteratively evolved to increase complexity.
Starting from a seed like "Brazil national football team", the Agent discovers the 1950 World Cup, the Maracanazo, referee George Reader, his later role as an English club chairman, and the 1976 FA Cup final, which was decided by Bobby Stokes' winning goal.
In the final stage, the model synthesizes clues gathered along the search trajectory to generate an information-rich initial query. Although this query contains a large amount of information, it still has clear entry points for retrieval. The evolution strategies include removal, obfuscation, and substitution. For example:
- Converting specific match details into a vague description such as "a World Cup with a unique format and no knockout stage"
- Replacing salient information like "defeating Manchester United in the FA Cup" with "leading a second-division club to victory over a top-tier powerhouse", thereby increasing retrieval difficulty
- Removing easily searchable facts, such as a player's age at the time of death, which can be readily found on Wiki.
The evolved query is ultimately far more complex than the original question and lacks obvious retrieval entry points, requiring the model to explore step by step based on the available clues.
The evaluation criterion for this type of long-horizon task synthesis is the average number of reasoning turns required to solve a problem. The original questions require approximately 7.9 turns on average, while the evolved versions increase this to 9.9 turns. This strategy has been deployed in M2.1.
On the BrowseComp leaderboard, M2.1 achieves performance close to GPT-4.5's SOTA metrics, particularly when Context Management is enabled. Context Management is currently the dominant approach for Search Agents: during evaluation, the context is continuously cleared to keep it concise and uncluttered, enabling the model to sustain effective Test-time Scaling.
Agentic RL Framework and Algorithms
Forge
We use an internally developed framework, Forge, which was designed for Agent-centric scenarios from the very beginning of M2 development. One of Forge's key features is its support for running reinforcement learning over arbitrary Agent scaffolds.
Integrating with Forge only requires implementing four interfaces:
- agent_reprocess: preprocessing (initialization)
- agent_run: execution
- agent_postprocess: postprocessing
- calculate_reward: reward computation
For example, even black-box Agents that are available only as binary executables can be integrated. During Agent execution, the Agent's base URL is redirected to Forge's internal inference engine service. This engine currently supports both the internal inference framework and SGLang. After redirection, all logs generated during the Agent's execution loop are persisted on the inference server.
A Data Coordinator then post-processes these logs to extract Sub-agent trajectories. To further improve training efficiency, the framework performs intelligent prefix merging over trajectories.
CISPO
From an algorithmic perspective, up through the M2.1 stage, the core reinforcement learning (RL) algorithm has largely continued to rely on CISPO, which was originally proposed in the MiniMax M1 paper. In M1, two key aspects were emphasized: first, the importance sampling truncation design inherent to CISPO itself; and second, a fix targeting FP32 precision issues identified at the time.
The first part concerns the objective function of CISPO. Conceptually, it can be understood as being very similar to the REINFORCE objective. In essence, it is a REINFORCE-style objective augmented with importance sampling corrections for the off-policy setting. The key modification introduced by CISPO is that it applies clipping to the importance sampling weights—that is, the scalar weighting factors. In practice, this clipping typically imposes an upper bound, ensuring that gradients do not become excessively large. Fundamentally, this mechanism serves to control the magnitude of gradient updates.
CISPO was originally designed during a period when many teams were reproducing the R1-ZERO pipeline (including ByteDance's DAPO). During our reproduction efforts, we made a critical observation: throughout RL training, certain tokens were consistently filtered out by PPO's clipping mechanism. Once clipped, these tokens effectively lost their gradients permanently. Empirical analysis showed that such tokens were often discourse or transition words, such as "wait". This implies that PPO-style clipping can prevent a large number of tokens from ever emerging through training.
In this context, DAPO's approach was to increase the upper bound of PPO clipping. By contrast, the CISPO approach aims to allow all tokens to receive gradients, while controlling the importance-sampling weighting coefficients instead. Although this introduces some bias, it significantly reduces the overall variance of the optimization process.
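A minimal PyTorch sketch of a CISPO-style loss consistent with this description (clip the importance-sampling weight, stop its gradient, keep every token's REINFORCE term); the epsilon values are placeholders, not the ones used for M1/M2:

```python
import torch

def cispo_loss(logprobs, old_logprobs, advantages, eps_high=2.0, eps_low=None):
    """logprobs, old_logprobs: per-token log-probs under the current / behavior policy.
    advantages: per-token (or broadcast per-sequence) advantage estimates."""
    ratio = torch.exp(logprobs - old_logprobs)            # importance-sampling weight
    upper = 1.0 + eps_high
    lower = 0.0 if eps_low is None else 1.0 - eps_low     # in practice mainly an upper clip
    # the clipped IS weight is treated as a constant coefficient (stop-gradient),
    # so every token still contributes a REINFORCE-style gradient
    weight = ratio.clamp(lower, upper).detach()
    return -(weight * advantages * logprobs).mean()
```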
When we completed early-stage experiments on smaller models and migrated RL training to the larger MiniMax M1 model, we observed that the overall reward barely increased. After logging and comparing the training-time and inference-time probabilities, we found clear piecewise horizontal segments, and the correlation between them was much lower than that observed in dense models. Further layer-by-layer debugging by the inference team revealed that the numerical precision of the prediction layer, namely the LLM head, was critical. After restoring this layer to FP32 precision, training–inference consistency improved substantially, enabling stable and sustained training gains.
Moving into the M2 generation, the primary change was the transition to an agentic RL setting involving multi-turn tool usage. These tool calls inherently introduce noise from the external environment, making execution trajectories more extreme, more off-policy, or prone to anomalous statistics.
Under these conditions, we incorporated several major techniques proposed by the broader community, including multiple importance sampling (MIS) and PPO-based trajectory filtering. The core idea is to filter out trajectories with anomalous statistics that lie in the long-tail distribution, thereby preventing excessive gradient fluctuations and ensuring the overall stability of RL training.
This figure is adapted from a figure in Meta's paper "The Art of Scaling Reinforcement Learning Compute for LLMs." In that work, the authors conducted a fairly systematic set of experimental comparisons, and their conclusions are largely consistent with ours.
The experiments show that the CISPO algorithm performs exceptionally well throughout the scaling process, both in terms of convergence speed and the final convergence ceiling. The left portion of the figure illustrates the effect of CISPO's importance sampling trick, while the right portion highlights the gains brought by fixing FP32 precision issues.
Overall, the experimental evaluation in this paper is very thorough. For those interested in reinforcement learning (RL), I strongly recommend taking a look at it.
Agent Evaluation
Next, I'd like to introduce three benchmarks that we released alongside M2.1:
- VIBE: Visual & Interactive Benchmark for Execution in Application Development
- SWE-Review
- OctoCodingBench
At present, the VIBE dataset has been open-sourced on Hugging Face. However, its infrastructure is not yet fully ready, and we are actively working to complete it.
VIBE
Let's start with VIBE, which targets the AppDev (application development) scenario. Due to the lack of existing leaderboards that effectively measure performance in this domain, we built this benchmark ourselves. It covers a wide range of settings, including frontend development, simulated Android, iOS, and backend tasks. Compared to M2, M2.1 shows substantial improvements on this benchmark.
For verification, we adopt an Agent-as-a-Verifier approach, in which an agent executes tasks directly in a real environment. The reward signal consists of three dimensions:
- Execution level: verifying whether the code compiles and runs successfully
- Interaction level: interacting with tools and user interfaces to assess whether the business logic is correct
- Visual level: scoring based on aesthetic criteria. Although this dimension is inherently subjective, we focus on standards with high inter-rater consistency.
Compared to traditional LLM-as-a-Judge methods that rely solely on static screenshots for evaluation, Agent-based verification enables dynamic interaction, providing a more comprehensive view of bugs and implementation flaws.
SWE-Review & OctoBench
The other two benchmarks are SWE-Review and OctoBench.
SWE-Review targets review scenarios within the development pipeline. It is designed as a benchmark suite covering multiple programming languages and diverse use cases, with metrics that jointly consider recall and hallucination rate.
OctoBench evaluates an agent's instruction-following capability in agentic settings. Unlike traditional instruction-following (IF) leaderboards, where instructions typically originate only from the user or system prompts, instructions in agent-based scenarios may also come from system reminders, Claude.md files, or tool schemas.
OctoBench is developed in-house and employs checklist-based, rubric-driven scoring to perform the evaluation.
Finally, regarding this metric: compared to M2, M2.1 shows a notable overall improvement as a result of several targeted optimizations, including gains on SWE-Review and OctoBench in terms of instruction-following performance.
- Jan 19, 2026
- Date parsed from source:Jan 19, 2026
- First seen by Releasebot:Feb 13, 2026
Kilo Partners with MiniMax to Launch Kilo for Slack, a New Coding Product
Kilo launches an AI coding assistant for Slack and switches its default model to MiniMax M2.1, calling it a leading open-source model on par with frontier models. This formal product update marks a new stage for Kilo for Slack and boosts coding efficiency in developer workflows.
Kilo launches its Slack product and switches its default model
With the launch of its newest coding product for Slack, Silicon Valley standout Kilo announced that it is switching its default model to the MiniMax M2.1 model. "The MiniMax M2 series is among the world's leading open-source models," and M2.1's results in third-party evaluations are highly competitive: "On LMArena, the community-facing open AI benchmarking platform, M2.1 ranks fourth, just behind OpenAI, Anthropic, and Google."
"This shows that in real coding workflows judged directly by developers, M2.1 can stand alongside frontier models," noted Breitenother, co-founder and CEO of Kilo Code.
Kilo, currently one of the most popular open-source coding platforms worldwide, processes up to 6.1 trillion tokens per month. Over the past year it has attracted significant attention and capital, closing an $8 million seed round in December 2025 from investors including Breakers, Cota Capital, General Catalyst, Quiet Capital, and Tokyo Black.
The launch of Kilo for Slack comes amid white-hot competition in the AI-assisted coding market, with multi-billion-dollar acquisitions and funding rounds arriving one after another. Data shows that in 2025 open-source models began catching up with closed-source ones, narrowing the performance gap on several key benchmarks from 8% to just 1.7%.
Kilo Blog: https://blog.kilo.ai/p/announcing-kilo-for-slack
- Jan 4, 2026
- Date parsed from source:Jan 4, 2026
- First seen by Releasebot:Jan 4, 2026
M2.1: Multilingual and Multi-Task Coding with Strong Generalization
MiniMax-M2.1 delivers a multi‑language, multi‑task coding agent with strong scaffold generalization and top‑tier benchmarks, signaling a practical leap toward enterprise coding, testing, and collaboration. The release outlines scalable RL training, broader problem coverage, and a bold roadmap for future efficiency and scope.
The Gap Between SWE-Bench and Real-World Coding
In 2025, SWE-Bench has become the most authoritative evaluation standard for code generation scenarios. In this evaluation, LLMs must face bugs from real GitHub repositories and fix them through multiple rounds of code reading and testing. The core value of SWE-Bench lies in the fact that the tasks it evaluates are highly close to a programmer's daily work, and the results can be objectively verified via test cases — a feature particularly crucial for reinforcement learning training. We can directly use the test pass rate as a reward signal, continuously optimizing the model in a real code environment without relying on the noise introduced by human labeling or model evaluation.
However, like all evaluation standards, SWE-Bench is not perfect. For a coding agent to be usable in real-world scenarios, there are more capability dimensions beyond SWE-Bench that need attention:
- Limited Language Coverage: SWE-Bench currently only covers Python. In real development scenarios, developers need to handle multiple languages such as Java, Go, TypeScript, Rust, and C++, often collaborating across multiple languages within the same project.
- Restricted Task Types: SWE-Bench only involves bug-fixing tasks. Other real-world capabilities, such as implementing new features, generating test cases, project refactoring, code review, performance optimization, and CI/CD configuration can't be evaluated.
- Scaffold Binding: SWE-Bench usually only evaluates the model's performance on a specific scaffold, so the model's generalization on other scaffolds cannot be accurately observed. Meanwhile, different agent scaffolds design various context management strategies, and the model needs to be able to adapt to these differences.
How to Fill These Gaps
Environment Scaling
We often see developers complaining that current coding agents perform well on languages like Python/JavaScript but show lackluster results in more serious enterprise-level development scenarios. If the task involves complex project understanding, the performance degrades further.
To solve this problem, during the training cycle of MiniMax-M2.1, we built a comprehensive data pipeline covering Top 10+ mainstream programming languages. We retrieved a massive number of Issues, PRs, and corresponding test cases from GitHub, and conducted strict filtering, cleaning, and rewriting based on this raw data to ensure the quality of Post Training data. A coding agent is naturally suited for mass-producing this kind of training environment. During this process, we found that for both the M2 model and other frontier models, the success rate of constructing multi-language environments was lower than that of Python. There are several distinct situations here:
- Environmental Complexity of Compiled Languages: Python, as an interpreted language, has relatively simple configuration. However, for compiled languages like Java, Go, Rust, and C++, we need to handle complex compilation toolchains, version compatibility, and cross-compilation issues. A Java project might depend on a specific version of JDK, Maven/Gradle, and numerous third-party libraries; an error in any link can lead to build failure.
- Diverse Test Frameworks: In the Python ecosystem, pytest dominates, but test frameworks in other languages are more fragmented. Java has JUnit and TestNG; JavaScript has Jest, Mocha, and Vitest; Go has the built-in testing package but also extensions like testify; Rust has built-in tests and criterion, etc. We need to design specialized test execution and result parsing logic for each framework.
- Dependency Management & Project Structure: Package managers for different languages differ vastly in dependency resolution, version locking, and private repository support. The nested structure of npm's node_modules, Maven's central repository mechanism, and Cargo's semantic versioning all require targeted handling. Simultaneously, project structure standards vary: Python structures are flexible, but Java projects usually follow strict Maven/Gradle directory standards; Go projects have GOPATH and Go Modules modes; Rust projects have the concept of a workspace. Understanding these dependency management mechanisms and project structures is crucial for correctly locating code and running tests.
- Difficulty in Parsing Error Messages: Error message formats produced by different languages and toolchains vary widely; compile errors, link errors, and runtime errors also manifest differently. We need to train the model to understand these diverse error messages and extract useful debugging clues from them.
Ultimately, we built a multi-language training system covering over ten languages including JS, TS, HTML, CSS, Python, Java, Go, C++, Kotlin, C, and Rust. We obtained over 100,000 environments usable for training and evaluation from real GitHub repositories, with each environment containing complete Issues, code, and test cases.
To support such massive Environment Scaling and RL training, we built a high-concurrency sandbox infrastructure capable of launching over 5,000 isolated execution environments within 10 seconds, while supporting the concurrent operation of tens of thousands of environments.
This infrastructure allows us to efficiently conduct large-scale multi-language coding agent training.
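The sandbox infrastructure itself is not public; as a rough illustration of the concurrency pattern it implies (bounded fan-out over container launches), here is an asyncio sketch in which `env_spec` and `launch_sandbox` are hypothetical and plain Docker stands in for the real isolation layer:

```python
import asyncio

async def launch_sandbox(env_spec):
    """Hypothetical helper: start one isolated execution environment
    (here a detached container built from the task's image) and return its id."""
    proc = await asyncio.create_subprocess_exec(
        "docker", "run", "-d", "--rm", env_spec["image"],
        stdout=asyncio.subprocess.PIPE)
    container_id, _ = await proc.communicate()
    return container_id.decode().strip()

async def launch_batch(env_specs, max_concurrency=512):
    """Launch many sandboxes concurrently, bounding in-flight launches
    with a semaphore so the host is never overwhelmed."""
    sem = asyncio.Semaphore(max_concurrency)

    async def _one(spec):
        async with sem:
            return await launch_sandbox(spec)

    return await asyncio.gather(*(_one(s) for s in env_specs))

# usage sketch: handles = asyncio.run(launch_batch(specs))
```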
Beyond Bug Fix: Multi-Task Capabilities
Real software development is far more than just fixing bugs. A programmer's daily routine includes writing tests, code reviews, performance optimization, and other tasks. In the training of MiniMax-M2.1, we also conducted targeted optimization for these scenarios, including acquiring high-quality problems and designing corresponding Reward signals:
- Test Generation Capability: Early in the R&D of M1, we discovered that the ability to write tests was a major bottleneck restricting the accuracy of code generated by language models. In the agentless framework, the model generates multiple fix solutions in parallel and then uses its own generated test code to select the final solution. However, due to unreasonable reward design in the RL process for M1, it consistently wrote overly simple test code, causing a large number of incorrect fix solutions to be selected. Generating high-quality test cases requires the model to deeply understand code logic, boundary conditions, and potential failure scenarios. MiniMax-M2.1 synthesized a large volume of training samples to enhance testing ability based on GitHub PRs and self-generated Code Patches, eventually tying with Claude Sonnet 4.5 on SWT-bench, which evaluates testing capabilities.
- Code Performance Optimization: Besides implementation correctness, execution efficiency is also critical in actual development. The model needs to understand low-level knowledge like algorithm complexity, memory usage, and concurrency handling, while also mastering best practices for specific APIs in software development.
During training, MiniMax-M2.1 was encouraged to write more efficient code, subsequently achieving significant progress on SWE-Perf, with an average performance boost of 3.1%.
In the future, we will apply corresponding optimization methods to other performance-sensitive scenarios like kernel optimization and database query optimization.
- Code Review Capability: Based on the SWE framework, we built an internal benchmark called SWE-Review, covering multiple languages and scenarios to evaluate the recall rate and hallucination rate of code defects.
A review is judged as correct only if it accurately identifies the target defect without producing any false positives, imposing high requirements on the model's precision.
Generalization on OOD Scaffolds
Generalization on OOD scaffolds is vital for a coding agent. Developers use different scaffolds — some use Claude Code, some use Cursor, and others use proprietary agent frameworks. If a model is optimized only for a specific scaffold, its performance will be severely discounted in other environments, strictly limiting its capability in real development scenarios. In MiniMax-M2.1, we believe scaffold generalization primarily tests the model's long-range instruction following ability and adaptability to context management strategies:
- Long-Range Instruction Following: Complex development scenarios require the model to integrate and execute "composite instruction constraints" from multiple sources, including System Prompt, User Query, Memory, Tool Schema, and various specification files (such as Agents.md, Claude.md, Skill.md, etc.). Developers strictly constrain the model's expected behavior by designing these specifications. Once the agent fails to meet a requirement at any step during inference, it may lead to a severe degradation in end-to-end results.
- Adaptability to Context Management: During the early release of M2, the community did not fully understand the design of Interleaved Thinking. When used in many scaffolds, the results were inconsistent with the model's inherent capabilities. At that time, we found that some popular scaffold designs would discard some historical thinking content in multi-turn conversations; this design caused M2's performance to drop by varying degrees across different evaluation sets. In MiniMax-M2.1, on one hand, we still recommend developers use the Interleaved Thinking feature to unleash the full potential of M2.1; on the other hand, we designed corresponding training methods to ensure the model's "IQ" remains online even when users employ all sorts of imaginative context management strategies.
To verify MiniMax-M2.1's scaffold generalization, we directly tested SWE-Bench performance on different scaffolds and also constructed a test set closer to real-world usage to observe whether the model meets various scaffold instruction constraints. Ultimately, we found that MiniMax-M2.1 maintained an SWE-Bench score above 67 in mini-swe-agent, Droid, and Claude Code. Compared to M2, MiniMax-M2.1 shows significant improvement across different OOD scaffolds. On OctoCodingbench, M2.1 improved from M2's 13.3 to 26.1, demonstrating strong compliance with scaffold instruction constraints.
2026 TODOs
We believe the development of coding agents still has a long way to go. Therefore, this year we will explore several interesting directions:
- Defining the Reward Signal for Developer Experience: Beyond the optimization directions mentioned above, we hope to further quantify and optimize developer experience. Current evaluation standards mainly focus on whether the task is ultimately completed and ignore the user experience along the way. We plan to explore richer reward dimensions: for code quality, readability, modularity, and comment completeness; for interaction experience, response latency, information transparency, and interpretability of intermediate states; for engineering standards, commit message quality, PR description completeness, and code style consistency. Although these metrics are difficult to evaluate fully automatically, we are exploring hybrid solutions that combine static analysis tools, Agent-as-a-Verifier, and human preference learning (a minimal sketch of such a composite reward follows after this list), so that the coding agent not only completes tasks but also delivers high-quality code like an excellent human engineer.
- Improving Problem-Solving Efficiency: MiniMax-M2.1 still has some issues with over-exploration, such as repeatedly reading the same file or executing redundant tests. We plan to optimize efficiency from multiple angles: reducing trial-and-error through better planning capabilities; reducing unnecessary file reads through more precise code localization; avoiding repetitive exploration through better memory mechanisms; and responding quickly to simple tasks through adaptive thinking depth.
- RL Scaling: The Scaling Law of reinforcement learning still holds huge potential for coding agents. We have verified the positive correlation between environment count, training steps, and model capability, but we are far from reaching convergence. We plan to continue exploring in three dimensions: Compute dimension, increasing concurrent environment count and training iterations; Data dimension, building a larger-scale and more diverse training task pool; Algorithm dimension, exploring more efficient exploration strategies, more stable training objectives, and better reward shaping methods. Simultaneously, we are researching how to make the RL training process itself more efficient, including better curriculum learning designs, smarter sample reuse strategies, and cross-task knowledge transfer.
- Coding World Model & User Simulator: As mentioned earlier, the training of this generation of coding agents (M2.1) relies heavily on execution in real environments, which brings massive computational overhead and environment construction costs. We are exploring building a World Model capable of predicting code execution results: given a piece of code and environment state, the model can predict whether tests pass, what error messages will be produced, and how the program will behave. This will enable us to perform large-scale rollout and policy optimization without actually executing code. Meanwhile, we are also building a user behavior simulator to model the patterns of interaction between real developers and the agent—including vague requirement descriptions, mid-stream requirement changes, and feedback on intermediate results—allowing the model to adapt to various user behavior patterns in real scenarios during the training phase.
- Extremely Efficient Data Pipeline: Building a data pipeline capable of automatically discovering, filtering, and generating harder, longer-range tasks to continuously raise the model's ceiling. High-quality training data is a key bottleneck for coding agent progress. We are building an automated data flywheel: automatically discovering high-quality Issues and PRs from GitHub; using models to assess task difficulty and perform stratification; automatically augmenting tasks that the current model can easily solve to make them more challenging; and analyzing failure causes for failed cases to generate targeted training data. The ideal state here is to build an "inexhaustible" source of high-quality tasks, keeping training data difficulty slightly above the model's current capability to maintain optimal learning efficiency. We are also exploring how to automatically generate ultra-long-range tasks that require hours or even days to complete, pushing the model's capability boundaries in complex project understanding and long-term planning.
- More Scenario Coverage: Expanding to more specialized fields such as GPU Kernel development, compiler development, smart contracts, and machine learning. Each field has its unique knowledge system, toolchain, and best practices, while possessing real application scenarios and commercial value. We plan to gradually build training environments and evaluation systems for these professional fields, enabling the coding agent to handle more specialized and high-value development tasks. Looking further ahead, we believe the paradigm of "Define Problem - Define Reward - Environment Construction - Model Training" demonstrated in coding agent training can be transferred to more scenarios requiring complex reasoning and execution feedback.
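As referenced in the first TODO above, here is a minimal sketch of what a hybrid developer-experience reward could look like, assuming three hypothetical scorers (static analysis, an agent verifier, and a learned preference model). The weights, type names, and function names are illustrative assumptions, not a description of our production reward.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    """Hypothetical container for one completed coding-agent rollout."""
    diff: str            # final patch produced by the agent
    transcript: str      # full interaction log (messages, tool calls)
    tests_passed: bool   # outcome of the execution-based check

# Each scorer maps a trajectory to a value in [0, 1]; implementations are stand-ins.
Scorer = Callable[[Trajectory], float]

def composite_reward(
    traj: Trajectory,
    static_analysis: Scorer,     # e.g. lint/complexity score of the diff
    agent_verifier: Scorer,      # a judge agent rating process quality
    preference_model: Scorer,    # learned from human preference data
    weights: tuple[float, float, float, float] = (0.5, 0.2, 0.15, 0.15),
) -> float:
    """Blend task completion with developer-experience signals."""
    w_task, w_static, w_verify, w_pref = weights
    return (
        w_task * float(traj.tests_passed)
        + w_static * static_analysis(traj)
        + w_verify * agent_verifier(traj)
        + w_pref * preference_model(traj)
    )
```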
- Dec 17, 2025
- Date parsed from source:Dec 17, 2025
- First seen by Releasebot:Dec 18, 2025
MiniMax x Retell AI: Your smarter TTS for real-time conversations
MiniMax Speech now integrates with Retell AI, delivering ultra-human real-time TTS across 40+ languages and 20+ voices inside Retell AI. Expect sub 250 ms latency, smart text normalization, and seamless use for videos, podcasts, and interactive agents.
MiniMax Speech: Ultra-Fast. Ultra-Human. Ultra-Smart.
We’re excited to announce that MiniMax Speech is now integrated with Retell AI, bringing state-of-the-art text-to-speech directly to creators and developers.
With this integration, you can generate ultra-human, ultra-fast, and ultra-smart speech across more languages and voices, seamlessly within Retell AI's all-in-one platform.
MiniMax Speech helps power videos, presentations, podcasts, and real-time interactive agents with professional-grade audio. Built for both real-time and production use, it delivers speed and realism at scale.
Ultra-Fast Performance
- < 250 ms latency, enabling real-time conversational and interactive use cases.
Smart Text Normalization
- Automatically handles URLs, emails, dates, numbers, and other structured text for natural pronunciation.
Multilingual Support
- Supports 40+ languages with seamless inline code switching within a single utterance.
More Choices, More Authentic Voices
- Access 20+ high-quality voices across different languages, genders, and accents, with continuous updates over time.
How to Use MiniMax Speech in Retell AI
Open Your Retell AI Project
- Log in and create a new project or open an existing one.
Select MiniMax as Your Voice Engine
- Go to Global Settings → Voice & Language, then choose MiniMax.
- Pick from 20+ authentic voices across different languages, accents, and styles.
Import or Write Your Prompts
- Type directly or import prompts via the Global Prompt dialog.
Generate or Preview Audio
- Save your prompt and test your agent: MiniMax delivers lifelike speech in under 250 ms.
Step into the Future of Audio Creation
No more juggling tools or fragmented workflows. With MiniMax Speech fully integrated into Retell AI, everything you need to turn text into studio-quality speech is now in one place.
Powered by MiniMax and Retell AI.