Kimi Release Notes

Last updated: Apr 21, 2026

Get this feed:
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026

    Kimi

    Mooncake

    Kimi open-sources Mooncake’s Transfer Engine and Mooncake Store, plus a technical report and traces, while expanding disaggregated LLM serving across vLLM, SGLang, TensorRT-LLM and more. The release highlights faster KV cache transfer, scalable multimodal inference and stronger ecosystem support.

    Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Now both the Transfer Engine and Mooncake Store are open-sourced! This repository also hosts its technical report and the open-sourced traces.

    🔄 Updates

    • Mar 19, 2026: TorchSpec: Speculative Decoding Training at Scale is open-sourced, using Mooncake to decouple inference and training via efficient hidden-states management.
    • Mar 5, 2026: LightX2V now supports disaggregated deployment based on Mooncake, enabling encoder/transformer service decoupling with Mooncake Transfer Engine for high-performance cross-device and cross-machine data transfer.
    • Feb 25, 2026: SGLang merged Encoder Global Cache Manager, introducing a Mooncake-powered global multimodal embedding cache that enables cross-instance sharing of ViT embeddings to avoid redundant GPU computation.
    • Feb 24, 2026: vLLM-Omni introduces disaggregated inference connectors with support for both MooncakeStoreConnector and MooncakeTransferEngineConnector for multi-node omni-modality pipelines.
    • Feb 12, 2026: Mooncake Joins PyTorch Ecosystem. We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem!
    • Jan 28, 2026: FlexKV, a distributed KV store and cache system from Tencent and NVIDIA in collaboration with the community, now supports distributed KVCache reuse with the Mooncake Transfer Engine.
    • Dec 27, 2025: Collaboration with ROLL! Check out the paper here.
    • Dec 23, 2025: SGLang introduces Encode-Prefill-Decode (EPD) Disaggregation with Mooncake as a transfer backend. This integration allows decoupling compute-intensive multimodal encoders (e.g., Vision Transformers) from language model nodes, utilizing Mooncake's RDMA engine for zero-copy transfer of large multimodal embeddings.
    • Dec 19, 2025: Mooncake Transfer Engine has been integrated into TensorRT LLM for KVCache transfer in PD-disaggregated inference.
    • Dec 19, 2025: Mooncake Transfer Engine has been directly integrated into vLLM v1 as a KV Connector in PD-disaggregated setups.
    • Nov 07, 2025: RBG + SGLang HiCache + Mooncake provides a role-based, out-of-the-box solution for cloud-native deployment that is elastic, scalable, and high-performance.
    • Sept 18, 2025: Mooncake Store empowers vLLM Ascend by serving as the distributed KV cache pool backend.
    • Sept 10, 2025: SGLang officially supports Mooncake Store as a hierarchical KV caching storage backend. The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.
    • Sept 10, 2025: The official & high-performance version of Mooncake P2P Store is open-sourced as checkpoint-engine. It has been successfully applied in K1.5 and K2 production training, updating the Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.
    • Aug 23, 2025: xLLM high-performance inference engine builds hybrid KV cache management based on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.
    • Aug 18, 2025: vLLM-Ascend integrates Mooncake Transfer Engine for KV cache register and disaggregate prefill, enabling efficient distributed inference on Ascend NPUs.
    • Jul 20, 2025: Mooncake powers the deployment of Kimi K2 on 128 H200 GPUs with PD disaggregation and large-scale expert parallelism, achieving 224k tokens/sec prefill throughput and 288k tokens/sec decode throughput.
    • Jun 20, 2025: Mooncake becomes a PD disaggregation backend for LMDeploy.
    • May 9, 2025: NIXL officially supports Mooncake Transfer Engine as a backend plugin.
    • May 8, 2025: Mooncake x LMCache unite to pioneer KVCache-centric LLM serving system.
    • May 5, 2025: With support from the Mooncake team, SGLang released guidance for deploying DeepSeek with PD Disaggregation on 96 H100 GPUs.
    • Apr 22, 2025: LMCache officially supports Mooncake Store as a remote connector.
    • Apr 10, 2025: SGLang officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
    • Mar 7, 2025: We open-sourced the Mooncake Store, a distributed KVCache based on Transfer Engine. vLLM's xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.
    • Feb 25, 2025: Mooncake receives the Best Paper Award at FAST 2025!
    • Feb 21, 2025: The updated traces used in our FAST'25 paper have been released.
    • Dec 16, 2024: vLLM officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
    • Nov 28, 2024: We open-sourced the Transfer Engine, the central component of Mooncake. We also provide two demonstrations of Transfer Engine: a P2P Store and vLLM integration.
    • July 9, 2024: We open-sourced the trace as a JSONL file.
    • June 27, 2024: We present a series of Chinese blogs with more discussions on zhihu 1, 2, 3, 4, 5, 6, 7.
    • June 26, 2024: Initial technical report release.

    🎉 Overview

    Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache pool.

    The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges in highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake’s innovative architecture enables Kimi to handle 75% more requests.

    🧩 Components

    Mooncake Core Component: Transfer Engine (TE)

    The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices and network links. Supporting multiple protocols including TCP, RDMA, CXL/shared-memory, and NVMe over Fabrics (NVMe-oF), TE is designed to enable fast and reliable data transfer for AI workloads. Compared to Gloo (used by distributed PyTorch) and traditional TCP, TE achieves significantly lower I/O latency, making it a superior solution for efficient data transmission.

    P2P Store and Mooncake Store

    Both P2P Store and Mooncake Store are built on the Transfer Engine and provide key/value caching for different scenarios. P2P Store focuses on sharing temporary objects (e.g., checkpoint files) across nodes in a cluster, preventing bandwidth saturation on a single machine. Mooncake Store, on the other hand, supports distributed pooled KVCache, specifically designed for xPyD disaggregation to enhance resource utilization and system performance.

    Mooncake Integration with Leading LLM Inference Systems

    Mooncake has been seamlessly integrated with several popular large language model (LLM) inference systems. Through collaboration with the vLLM and SGLang teams, Mooncake now officially supports prefill-decode disaggregation. By leveraging the high-efficiency communication capabilities of RDMA devices, Mooncake significantly improves inference efficiency in prefill-decode disaggregation scenarios, providing robust technical support for large-scale distributed inference tasks. In addition, Mooncake has been successfully integrated with SGLang's Hierarchical KV Caching, vLLM's prefill serving, and LMCache, augmenting KV cache management capabilities across large-scale inference scenarios.

    Elastic Expert Parallelism Support

    Mooncake adds elasticity and fault tolerance support for MoE model inference, enabling inference systems to remain responsive and recoverable in the event of GPU failures or changes in resource configuration. This functionality includes automatic faulty rank detection and can work with the EPLB module to dynamically route tokens to healthy ranks during inference.

    Tensor-Centric Ecosystem

    Mooncake establishes a full-stack, Tensor-oriented AI infrastructure where Tensors serve as the fundamental data carrier. The ecosystem spans from the Transfer Engine, which accelerates Tensor data movement across heterogeneous storage (DRAM/VRAM/NVMe), to the P2P Store and Mooncake Store for distributed management of Tensor objects (e.g., Checkpoints and KVCache), up to the Mooncake Backend enabling Tensor-based elastic distributed computing. This architecture is designed to maximize Tensor processing efficiency for large-scale model inference and training.

    🔥 Show Cases

    Use Transfer Engine Standalone (Guide)

    Transfer Engine is a high-performance data transfer framework that provides a unified interface for transferring data between DRAM, VRAM, and NVMe while hiding hardware-specific technical details. Transfer Engine supports multiple communication protocols including TCP, RDMA (InfiniBand/RoCEv2/eRDMA/NVIDIA GPUDirect), NVMe over Fabrics (NVMe-oF), NVLink, HIP, CXL, and Ascend. When built with the corresponding runtime, Transfer Engine can also detect and route accelerator memory on CUDA, MUSA, HIP, and Cambricon MLU devices. For a complete list of supported protocols and a configuration guide, see the Supported Protocols Documentation.

    Highlights

    • Efficient use of multiple RDMA NIC devices. Transfer Engine supports the use of multiple RDMA NIC devices to achieve the aggregation of transfer bandwidth.
    • Topology aware path selection. Transfer Engine can select optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
    • More robust against temporary network errors. If a transmission fails, Transfer Engine automatically retries over alternative paths for data delivery.

    Performance

    With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4×200 Gbps and 8×400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
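    As a back-of-envelope sanity check on these figures (our own arithmetic, not part of the benchmark), the measured bandwidths can be compared against the theoretical aggregate of the links and converted into a transfer time for the 40 GB payload:

```python
# Back-of-envelope check of the quoted Transfer Engine figures.
# The 87 GB/s and 190 GB/s numbers are the measured results from the
# release notes; everything else here is simple arithmetic.

PAYLOAD_GB = 40  # ~KVCache of 128k tokens for LLaMA3-70B, per the notes

def aggregate_gbytes_per_sec(n_nics: int, gbps_per_nic: int) -> float:
    """Theoretical aggregate bandwidth in gigabytes/s (8 bits per byte)."""
    return n_nics * gbps_per_nic / 8

for nics, gbps, measured in [(4, 200, 87), (8, 400, 190)]:
    peak = aggregate_gbytes_per_sec(nics, gbps)
    print(f"{nics}x{gbps} Gbps: peak {peak:.0f} GB/s, measured {measured} GB/s "
          f"({measured / peak:.0%} of peak), "
          f"{PAYLOAD_GB / measured:.2f} s per {PAYLOAD_GB} GB transfer")
```

    On the 4×200 Gbps fabric the measured 87 GB/s runs close to the 100 GB/s line-rate ceiling, and in both configurations the 40 GB payload moves in well under half a second.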

    P2P Store (Guide)

    P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. P2P Store has been used in the checkpoint transfer service of Moonshot AI.

    Highlights

    • Decentralized architecture. P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.
    • Efficient data distribution. Designed to enhance the efficiency of large-scale data distribution, P2P Store avoids bandwidth saturation issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).
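    The bandwidth-saturation point can be made concrete with a toy distribution model (our sketch, not P2P Store code): a single provider must send a checkpoint to each of N replicas in turn, while peer-to-peer re-sharing lets every node that already holds a copy serve others, doubling the number of copies each round:

```python
def rounds_single_source(n_replicas: int) -> int:
    """A lone provider pushes the object to each replica sequentially."""
    return n_replicas

def rounds_p2p(n_replicas: int) -> int:
    """Every holder re-shares the object, so copies double each round."""
    rounds, holders = 0, 1  # start with only the original provider
    while holders < n_replicas + 1:
        holders *= 2
        rounds += 1
    return rounds

for n in (7, 63, 1023):
    print(f"{n:4d} replicas: {rounds_single_source(n):4d} sequential sends "
          f"vs {rounds_p2p(n)} P2P rounds")
```

    The provider's NIC is busy for only the first round instead of all N sends, which is the pressure reduction on data providers that the bullet above describes.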

    Mooncake Store (Guide)

    Mooncake Store is a distributed KVCache storage engine built on the Transfer Engine and specialized for LLM inference. It is the central component of the KVCache-centric disaggregated architecture. Its goal is to store reusable KV caches across various locations in an inference cluster. Mooncake Store is supported in SGLang's Hierarchical KV Caching and vLLM's prefill serving, and is now integrated with LMCache to provide enhanced KVCache management capabilities.

    Highlights

    • Multi-replica support: Mooncake Store supports storing multiple data replicas for the same object, effectively alleviating hotspots in access pressure.
    • High bandwidth utilization: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.

    SGLang Integration (Guide)

    SGLang officially supports Mooncake Store as a HiCache storage backend. This integration enables scalable KV cache retention and high-performance access for large-scale LLM serving scenarios.

    Highlights

    • Hierarchical KV Caching: Mooncake Store serves as an external storage backend in SGLang's HiCache system, extending RadixAttention with multi-level KV cache storage across device, host, and remote storage layers.
    • Flexible Cache Management: Supports multiple cache policies including write-through, write-through-selective, and write-back modes, with intelligent prefetching strategies for optimal performance.
    • Comprehensive Optimizations: Features advanced data plane optimizations including page-first memory layout for improved I/O efficiency, zero-copy mechanisms for reduced memory overhead, GPU-assisted I/O kernels delivering fast CPU-GPU transfers, and layer-wise overlapping for concurrent KV cache loading while computation executes.
    • Elastic Expert Parallel: Mooncake's collective communication backend and expert parallel kernels are integrated into SGLang to enable fault-tolerant expert parallel inference.
    • Significant Performance Gains: The multi-turn benchmark demonstrates substantial performance improvements over the non-HiCache setting.
    • Community Feedback: Effective KV caching significantly reduces TTFT by eliminating redundant and costly re-computation. Integrating SGLang HiCache with the Mooncake service enables scalable KV cache retention and high-performance access. In our evaluation, we tested the DeepSeek-R1-671B model under PD-disaggregated deployment using in-house online requests sampled from a general QA scenario. On average, cache hits achieved an 84% reduction in TTFT compared to full re-computation. – Ant Group
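    The write-policy bullet above can be illustrated with a minimal two-tier cache sketch (hypothetical code of ours, not the SGLang or Mooncake implementation): write-through persists every put to the remote tier immediately, while write-back defers persistence until the entry is evicted from the fast tier:

```python
from collections import OrderedDict

class TwoTierCache:
    """Toy device-tier cache in front of a 'remote' store (a plain dict).

    policy='write_through' pushes every put to remote immediately;
    policy='write_back' only pushes entries when they are evicted.
    """
    def __init__(self, capacity: int, policy: str = "write_through"):
        self.capacity, self.policy = capacity, policy
        self.local: OrderedDict[str, bytes] = OrderedDict()
        self.remote: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self.local[key] = value
        self.local.move_to_end(key)  # mark as most recently used
        if self.policy == "write_through":
            self.remote[key] = value
        while len(self.local) > self.capacity:
            old_key, old_val = self.local.popitem(last=False)  # evict LRU
            if self.policy == "write_back":
                self.remote[old_key] = old_val  # flush on eviction only

cache = TwoTierCache(capacity=2, policy="write_back")
cache.put("a", b"1"); cache.put("b", b"2")
print(sorted(cache.remote))  # nothing persisted yet under write-back
cache.put("c", b"3")         # evicts "a", flushing it to remote
print(sorted(cache.remote))
```

    A write-through-selective policy would sit between the two extremes, persisting only entries that match some heuristic.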

    vLLM Integration (Guide v0.2)

    To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling, which separates the prefill phase from the decode phase into different processes. vLLM uses NCCL and Gloo as the transport layer by default, but these currently cannot efficiently decouple the two phases across different machines.

    We have implemented a vLLM integration that uses Transfer Engine as the network layer instead of NCCL and Gloo to support inter-node KVCache transfer. Transfer Engine provides simpler interfaces and makes more efficient use of RDMA devices.

    We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.

    Performance

    By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, the mean TTFT of vLLM with Transfer Engine is up to 25% lower than with traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.

    More advanced features are coming soon, so stay tuned!

    🚀 Quick Start

    Before using Mooncake

    Mooncake is designed and optimized for high-speed RDMA networks. Though Mooncake supports TCP-only data transfer, we strongly recommend evaluating the functionality and performance of Mooncake with RDMA network support.

    The following need to be installed before running any component of Mooncake:

    • RDMA Driver & SDK, such as Mellanox OFED.
    • Python 3.10; a virtual environment is recommended.
    • CUDA 12.1 and above, including NVIDIA GPUDirect Storage Support, if the package is built with -DUSE_CUDA (disabled by default). You may install them from here.
    • Cambricon Neuware, if the package is built with -DUSE_MLU. By default Mooncake looks for Neuware under NEUWARE_HOME or /usr/local/neuware.

    Use Python package

    The simplest way to use Mooncake Transfer Engine is using pip:

    For CUDA-enabled systems:

    CUDA < 13.0

    pip install mooncake-transfer-engine
    

    CUDA >= 13.0

    pip install mooncake-transfer-engine-cuda13
    

    For non-CUDA systems:

    pip install mooncake-transfer-engine-non-cuda
    

    Important

    • The CUDA version (mooncake-transfer-engine) includes Mooncake-EP and GPU topology detection, requiring CUDA 12.1+.
    • The non-CUDA version (mooncake-transfer-engine-non-cuda) is for environments without CUDA dependencies.
    • MLU support is currently available through source builds with -DUSE_MLU=ON; there is no dedicated prebuilt MLU wheel yet.
    • If users encounter problems such as missing lib*.so, they should uninstall the package they installed and build the binaries manually.

    Use Docker image

    Mooncake supports Docker-based deployment; see the Build Guide for details.

    To produce an image that compiles Mooncake from source, builds the wheel via scripts/build_wheel.sh, and installs that wheel inside the container, build with docker/mooncake.Dockerfile:

    docker build -f docker/mooncake.Dockerfile \
      --build-arg PYTHON_VERSION=3.10 \
      --build-arg EP_TORCH_VERSIONS="2.9.1" \
      -t mooncake:from-source .
    

    The resulting image already has a virtual environment at /opt/venv with the freshly built wheel installed. Launch it with GPU/RDMA access as needed, for example:

    docker run --gpus all --network host -it mooncake:from-source /bin/bash
    

    Note

    Make sure you build the image from the repository root so that Git metadata and submodules are available inside the build context.

    Build and use binaries

    The following are additional dependencies for building Mooncake:

    • Build essentials, including gcc, g++ (9.4+) and cmake (3.16+).
    • Go 1.20+, if you want to build with -DWITH_P2P_STORE, -DUSE_ETCD (enabled by default to use etcd as metadata servers), or -DSTORE_USE_ETCD (use etcd for the failover of the store master).
    • CUDA 12.1 and above, including NVIDIA GPUDirect Storage Support, if the package is built with -DUSE_CUDA. This is NOT included in the dependencies.sh script. You may install them from here.
    • Cambricon Neuware, if you want to build with -DUSE_MLU. This is NOT included in the dependencies.sh script. Mooncake resolves it from NEUWARE_HOME or /usr/local/neuware by default, and also supports overriding MLU_INCLUDE_DIR / MLU_LIB_DIR during CMake configure.
    • [Optional] Rust Toolchain, if you want to build with -DWITH_RUST_EXAMPLE. This is NOT included in the dependencies.sh script.
    • [Optional] hiredis, if you want to build with -DUSE_REDIS to use Redis instead of etcd as metadata servers.
    • [Optional] curl, if you want to build with -DUSE_HTTP to use HTTP instead of etcd as metadata servers.

    The build and installation steps are as follows:

    1. Retrieve source code from GitHub repo

      git clone https://github.com/kvcache-ai/Mooncake.git
      cd Mooncake
      
    2. Install dependencies

      bash dependencies.sh
      
    3. Compile Mooncake and examples

      mkdir build
      cd build
      cmake ..
      make -j
      sudo make install
      # optional, make it ready to be used by vLLM/SGLang
      

    For Cambricon MLU builds, configure CMake with -DUSE_MLU=ON. For example:

    mkdir build
    cd build
    cmake .. -DUSE_MLU=ON -DNEUWARE_ROOT=/usr/local/neuware
    make -j
    

    🛣️ Upcoming Milestones

    • First release of Mooncake and integration with the latest vLLM (Completed)
    • Share KV caches across multiple serving engines (Incomplete)
    • User and developer documentation (Incomplete)

    📦 Open Source Trace

    The trace dataset includes the timing of request arrivals, the number of input tokens, the number of output tokens, and remapped block hashes. Privacy mechanisms are applied to remove user-related information while preserving utility for simulated evaluation. A more detailed description of the trace (e.g., its up-to-50% cache hit ratio) can be found in Section 4 of the technical report.
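    A few lines of Python illustrate how such a JSONL trace can be replayed to measure block reuse. The field names used here (timestamp, input_length, output_length, hash_ids) are assumptions for illustration; consult the released trace for the exact schema:

```python
import json
from io import StringIO

# Illustrative trace lines; field names are assumptions, not the official schema.
sample = StringIO("""\
{"timestamp": 0, "input_length": 6288, "output_length": 52, "hash_ids": [1, 2, 3]}
{"timestamp": 30, "input_length": 4000, "output_length": 10, "hash_ids": [1, 2, 7]}
""")

seen_blocks: set[int] = set()
hits = total = 0
for line in sample:
    req = json.loads(line)
    for block in req["hash_ids"]:
        hits += block in seen_blocks  # a repeated block hash means cache reuse
        total += 1
        seen_blocks.add(block)

print(f"block-level cache hit ratio: {hits / total:.0%}")
```

    Replaying all requests this way gives an upper bound on KVCache reuse, since a real system also has finite capacity and eviction.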

    📑 Citation

    Please kindly cite our paper if you find the paper or the traces useful:

    • Ruoyu Qin et al., "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," ACM Trans. Storage, 2025.
    • Ruoyu Qin et al., "Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot," 23rd USENIX Conference on File and Storage Technologies (FAST '25), 2025.
    • Ruoyu Qin et al., "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," arXiv preprint, 2024.

    About

    Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

  • April 2026

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi introduces k1.5, an o1-level multimodal model with strong short-CoT and long-CoT performance across math, coding, and vision benchmarks, plus longer context scaling and improved reinforcement learning for more capable planning, reflection, and correction.

    🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model

    • SOTA short-CoT performance, outperforming GPT-4o and Claude 3.5 Sonnet on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%)
    • Long-CoT performance matches o1 across multiple modalities (👀MathVista, 📐AIME, 💻Codeforces, etc)

    Key Ingredients of Kimi k1.5

    There are a few key ingredients in the design and training of k1.5.

    • Long context scaling. We scale the context window of RL to 128k and observe continued performance improvement with increased context length. A key idea behind our approach is to use partial rollouts to improve training efficiency, i.e., sampling new trajectories by reusing large chunks of previous trajectories and thus avoiding the cost of regenerating new trajectories from scratch. Our observations identify context length as a key dimension of the continued scaling of RL with LLMs.
    • Improved policy optimization. We derive a formulation of RL with long-CoT and employ a variant of online mirror descent for robust policy optimization. This algorithm is further improved by our effective sampling strategy, length penalty, and optimization of the data recipe.
    • Simple framework. Long context scaling, combined with the improved policy optimization methods, establishes a simple RL framework for learning with LLMs. Since we are able to scale the context length, the learned CoTs exhibit the properties of planning, reflection, and correction. An increased context length has the effect of increasing the number of search steps. As a result, we show that strong performance can be achieved without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models.
    • Multimodality. Our model is jointly trained on text and vision data, giving it the capability to reason jointly over the two modalities.
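    The partial-rollout idea from the first ingredient can be sketched as a toy simulation (ours, not the actual training code): a new trajectory reuses a prefix of a previous one, so only the suffix must be sampled from the model:

```python
import random

random.seed(0)

def generate_tokens(n: int) -> list[int]:
    """Stand-in for expensive model sampling."""
    return [random.randrange(32000) for _ in range(n)]

def partial_rollout(prev: list[int], keep_frac: float, total_len: int):
    """Reuse a prefix of the previous trajectory; regenerate only the rest."""
    kept = prev[: int(len(prev) * keep_frac)]
    fresh = generate_tokens(total_len - len(kept))
    return kept + fresh, len(fresh)

previous = generate_tokens(128_000)  # one long-CoT trajectory
rollout, regenerated = partial_rollout(previous, keep_frac=0.75,
                                       total_len=128_000)
print(f"regenerated {regenerated} of {len(rollout)} tokens "
      f"({regenerated / len(rollout):.0%} of full-rollout cost)")
```

    Reusing 75% of a 128k-token trajectory cuts per-iteration sampling cost to roughly a quarter, illustrating how reuse reduces the cost of long-context RL training.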

    Citation

    @article{team2025kimi,
      title={Kimi k1.5: Scaling reinforcement learning with llms},
      author={Team, Kimi and Du, Angang and Gao, Bofei and Xing, Bowei and Jiang, Changjiu and Chen, Cheng and Li, Cheng and Xiao, Chenjun and Du, Chenzhuang and Liao, Chonghua and others},
      journal={arXiv preprint arXiv:2501.12599},
      year={2025}
    }
    
  • April 2026

    MoBA: Mixture of Block Attention for Long-Context LLMs

    Kimi introduces MoBA, a block sparse attention system for long-context requests that helps the model focus on the most relevant blocks, transition between full and sparse attention, and improve efficiency for large language model processing.

    🚀 Introducing MoBA --- Mixture of Block Attention

    • Trainable Block Sparse Attention: The full context is divided into blocks, where each query token learns to attend to the most relevant KV blocks, enabling efficient processing of long sequences.
    • Parameter-less Gating Mechanism: A novel parameter-less top-k gating mechanism is introduced to select the most relevant blocks for each query token, ensuring that the model focuses only on the most informative blocks.
    • Seamlessly Transition between Full and Sparse Attention: MoBA is designed to be a flexible substitute for full attention, allowing seamless transitions between full and sparse attention modes.

    Note: MoBA requires continued training of existing models to achieve its acceleration benefits. It is not a drop-in sparse attention solution that can be applied directly to pretrained models without additional training.
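    A pure-Python toy of the gating step (our simplified sketch, not the released MoBA kernels) shows the core idea: score each KV block by the dot product between the query and the block's mean key, then let the query attend only to the top-k blocks:

```python
import random

random.seed(0)
seq_len, head_dim, block, topk = 32, 8, 8, 2  # tiny toy sizes

q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Parameter-less gating: score each KV block by q . mean(K_block),
# then let the query attend only to the top-k blocks
# (a simplified sketch of MoBA's router, not the released kernels).
n_blocks = seq_len // block
block_means = [
    [sum(keys[b * block + i][d] for i in range(block)) / block
     for d in range(head_dim)]
    for b in range(n_blocks)
]
scores = [dot(q, m) for m in block_means]
selected = sorted(range(n_blocks), key=lambda b: scores[b])[-topk:]

print("query attends to blocks:", sorted(selected))
```

    A full implementation would additionally enforce causality (a query can only select past blocks, including its own) and run softmax attention within the selected blocks; the released moba_naive backend is a good place to see those details.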

    Abstract

    Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.

    In this work, we propose a solution that adheres to the “less structure” principle, allowing the model to autonomously determine where to attend, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs.

    Our code is available at MoonshotAI/MoBA.

    Evaluation with 1M context length

    Environment Setup

    Note that the current kernel implementations rely on flash-attn==2.6.3 and torch>=2.1.0.

    conda create -n moba python=3.10
    conda activate moba
    pip install .
    

    Quick Start

    We provide a transformers-friendly implementation of MoBA.
    Feel free to choose an attention backend via --attn, picking between moba and moba_naive.

    python3 examples/llama.py --model meta-llama/Llama-3.1-8B --attn moba
    

    Implementation Details

    • moba_naive: A naive implementation based on attention masks. It's designed to help understand how MoBA selects corresponding chunks. You may save and visualize the attention masks to see the block selection process.
    • moba_efficient: Our production-ready implementation optimized for performance. It achieves up to 40x speedup compared to moba_naive (tested with 32K sequence length, 1 attention head, MoBA Block 2048 and MoBA Topk 3). We recommend using this version for practical applications.

    Unit Tests

    pytest tests/test_moba_attn.py
    

    References

    • Llama Implementation: huggingface/transformers
    • Flash Attention: Dao-AILab/flash-attention

    Citation

    If you find MoBA useful or want to use it in your projects, please kindly cite our paper:

    @article{lu2025mobamixtureblockattention,
      author = {Enzhe Lu and Zhejun Jiang and Jingyuan Liu and Yulun Du and Tao Jiang and Chao Hong and Shaowei Liu and Weiran He and Enming Yuan and Yuzhi Wang and Zhiqi Huang and Huan Yuan and Suting Xu and Xinran Xu and Guokun Lai and Yanru Chen and Huabin Zheng and Junjie Yan and Jianlin Su and Yuxin Wu and Yutao Zhang and Zhilin Yang and Xinyu Zhou and Mingxing Zhang and Jiezhong Qiu},
      title = {MoBA: Mixture of Block Attention for Long-Context LLMs},
      journal={arXiv preprint arXiv:2502.13189},
      year={2025}
    }
    
  • April 2026

    Moonlight

    Kimi introduces Moonlight, a 3B/16B MoE model trained with Muon that aims for better performance with far fewer training FLOPs, and opens its distributed Muon implementation plus pretrained and instruction-tuned checkpoints for research.

    Recently, the Muon optimizer, based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models had not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves ∼2× computational efficiency compared to AdamW with compute-optimal training.

    Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs compared to prior models.

    We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.

    Our code is available at MoonshotAI/Moonlight.

    Key Ingredients

    Our work builds upon Muon while systematically identifying and resolving its limitations in large-scale training scenarios. Our technical contributions include:

    • Analysis for Effective Scaling of Muon: Through extensive analysis, we identify that weight decay plays a crucial role in Muon's scalability. In addition, we propose keeping a consistent update root mean square (RMS) across different matrix and non-matrix parameters through parameter-wise update scale adjustments. Such adjustments significantly enhance training stability.

    • Efficient Distributed Implementation: We develop a distributed version of Muon with ZeRO-1 style optimization, achieving optimal memory efficiency and reduced communication overhead while preserving the mathematical properties of the algorithm.

    • Scaling Law Validation: We performed scaling law experiments comparing Muon with strong AdamW baselines and showed the superior performance of Muon (see Figure 1). Based on the scaling law results, Muon achieves performance comparable to AdamW-trained counterparts while requiring only approximately 52% of the training FLOPs.

    Scaling up with Muon. (a) Scaling law experiments comparing Muon and Adam. Muon is 2 times more sample efficient than Adam. (b) The MMLU performance of our Moonlight model optimized with Muon and other comparable models. Moonlight advances the Pareto frontier of performance vs training FLOPs.
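    The matrix orthogonalization at the core of Muon can be sketched with the cubic Newton-Schulz iteration (a simplification of ours; released Muon implementations use a tuned higher-order polynomial for speed). For a matrix with singular values in (0, √3), the iteration X ← 1.5X − 0.5XXᵀX drives X toward its nearest orthogonal factor, which is then applied as the weight update:

```python
# Minimal sketch (ours, not Moonshot's implementation) of the matrix
# orthogonalization at the heart of Muon: the cubic Newton-Schulz iteration
# X <- 1.5 X - 0.5 X X^T X pushes every singular value of X toward 1.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(x, steps=20):
    for _ in range(steps):
        x3 = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * x3[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# A non-orthogonal 2x2 "momentum" matrix with singular values < 1.
g = [[0.6, 0.2], [0.1, 0.4]]
o = newton_schulz(g)
ot_o = matmul(transpose(o), o)  # should approach the identity matrix
print([[round(v, 3) for v in row] for row in ot_o])
```

    In practice the momentum matrix is first rescaled so its singular values fall in the convergent range, and the per-parameter update-scale adjustment described above is applied on top of the orthogonalized update.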

    Performance

    We named our lightweight model trained with Muon "Moonlight". We compared Moonlight with SOTA public models at a similar scale:

    • LLAMA3-3B is a 3B-parameter dense model trained with 9T tokens
    • Qwen2.5-3B is a 3B-parameter dense model trained with 18T tokens
    • Deepseek-v2-Lite is a 2.4B/16B-parameter MoE model trained with 5.7T tokens

    Moonlight has the same architecture as DeepSeek-V3, which is supported by many popular inference engines, such as vLLM and SGLang. As a result, our model can be easily deployed using these tools.

    We show how to run inference with our model using the Hugging Face transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

    For our pretrained model (Moonlight):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "moonshotai/Moonlight-16B-A3B"
    # trust_remote_code is required because the model ships custom modeling code
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    prompt = "1+1=2, 1+2="
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=100)
    response = tokenizer.batch_decode(generated_ids)[0]
    print(response)
    

    For our instruct model (Moonlight-Instruct):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_path = "moonshotai/Moonlight-16B-A3B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    messages = [
      {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
      {"role": "user", "content": "Is 123 a prime?"}
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
    response = tokenizer.batch_decode(generated_ids)[0]
    print(response)
    

    Training

    # train qwen-like dense model with muon
    python3 examples/toy_train.py --model qwen --optimizer muon --dataset openwebtext-100k --hidden_size 896 --lr 1e-3
    
    # train qwen-like dense model with adamw
    python3 examples/toy_train.py --model qwen --optimizer adamw --dataset openwebtext-100k --hidden_size 896 --lr 1e-3
    

    Intermediate Checkpoints

    To support ongoing research efforts, we will soon release our intermediate checkpoints. Coming soon...

    Citation

    If you find Moonlight useful or want to use it in your projects, please kindly cite our paper:

    @misc{liu2025muonscalablellmtraining,
          title={Muon is Scalable for LLM Training}, 
          author={Jingyuan Liu and Jianlin Su and Xingcheng Yao and Zhejun Jiang and Guokun Lai and Yulun Du and Yidao Qin and Weixin Xu and Enzhe Lu and Junjie Yan and Yanru Chen and Huabin Zheng and Yibo Liu and Shaowei Liu and Bohong Yin and Weiran He and Han Zhu and Yuzhi Wang and Jianzhou Wang and Mengnan Dong and Zheng Zhang and Yongsheng Kang and Hao Zhang and Xinran Xu and Yutao Zhang and Yuxin Wu and Xinyu Zhou and Zhilin Yang},
          year={2025},
          eprint={2502.16982},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2502.16982}, 
    }
    
    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Introducing Kimi-Dev: A Strong and Open-source Coding LLM for Issue Resolution

    Kimi releases Kimi-Dev-72B, a new open-source coding LLM for software engineering tasks that sets a new open-source state of the art on SWE-bench Verified. It is available for download and deployment on Hugging Face and GitHub.

    We introduce Kimi-Dev-72B, our new open-source coding LLM for software engineering tasks. Kimi-Dev-72B achieves a new state-of-the-art on SWE-bench Verified among open-source models.

    • Kimi-Dev-72B achieves 60.4% performance on SWE-bench Verified. It surpasses the runner-up, setting a new state-of-the-art result among open-source models.
    • Kimi-Dev-72B is optimized via large-scale reinforcement learning. It autonomously patches real repositories in Docker and gains rewards only when the entire test suite passes. This ensures correct and robust solutions, aligning with real-world development standards.
    • Kimi-Dev-72B is available for download and deployment on Hugging Face and GitHub. We welcome developers and researchers to explore its capabilities and contribute to development.

    Performance of Open-source Models on SWE-bench Verified.

    ⚙️ Installation

    # clone repo
    git clone https://github.com/MoonshotAI/Kimi-Dev.git
    # create env
    conda create -n kimidev python=3.12
    # local install
    pip install -e .
    

    🛠️ How to use

    Prepare repo structure [From Agentless]

    For each issue in the benchmark (both SWE-bench Lite and SWE-bench Verified) we need to check out the repository and process its files, so you can save time by downloading the preprocessed data here: swebench_repo_structure.zip. After downloading, unzip it and export its location as follows:

    export PROJECT_FILE_LOC={folder which you saved}
    

    Deploy vLLM Model

    Installation

    # Install vLLM with CUDA 12.8.
    # If you are using pip.
    pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
    # If you are using uv.
    uv pip install vllm --torch-backend=auto
    

    Serving

    vllm serve Kimi-Dev-72B --served-model-name kimi-dev --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --max-seq-len-to-capture 131072 --tensor-parallel-size 8
    

    Rollout

    Kimi-Dev adopts a simplified two-stage framework for handling code repair and test writing tasks:

    1. File Localization: Intelligently identify key files that need modification based on problem descriptions and repository structure
    2. Code Editing: Perform precise code modifications on the located files, including bug fixes or unit test insertions

    Compared to multi-step localization methods, we perform localization at the file level and then pass the complete file to the repair step for more detailed reasoning.
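
    A minimal sketch of the two-stage flow, in pseudocode. The llm.complete and repo interfaces are hypothetical, for illustration only:

    ```
    # Stage 1: file localization from issue text + repository structure
    files = llm.complete(f"Issue:\n{issue}\nRepo structure:\n{repo.tree()}\n"
                         "List the files that must change, one per line.")

    # Stage 2: whole-file editing on each located file
    for path in files.splitlines():
        patched = llm.complete(f"Issue:\n{issue}\nFile {path}:\n{repo.read(path)}\n"
                               "Rewrite the full file to resolve the issue.")
        repo.write(path, patched)
    ```

    Passing the complete file (rather than isolated snippets) to stage 2 is what enables the more detailed repair-time reasoning described above.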

    Run rollout script:

    conda activate kimidev
    # Bugfixer
    python kimidev/examples/rollout_messages_bugfixer.py --model_name {vllm_serve_model}
    # Testwriter
    python kimidev/examples/rollout_messages_testwriter.py --model_name {vllm_serve_model}
    

    👀 Example Results

    We provide some example result files as well as the files required for test-time scaling here.

    You can also download these files from Google Drive.

    💪 Contributing

    Welcome to submit Pull Requests or create Issues to help improve the project.

    😺 Contact

    If you have any questions, please feel free to submit a GitHub issue or contact [email protected].

    📝 Citation

    If you find our code and models useful, please kindly cite:

    @misc{yang2025kimidevagentlesstrainingskill,
          title={Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents}, 
          author={Zonghan Yang and Shengjie Wang and Kelin Fu and Wenyang He and Weimin Xiong and Yibo Liu and Yibo Miao and Bofei Gao and Yejie Wang and Yingwei Ma and Yanhao Li and Yue Liu and Zhenxing Hu and Kaitai Zhang and Shuyi Wang and Huarong Chen and Flood Sung and Yang Liu and Yang Gao and Zhilin Yang and Tianyu Liu},
          year={2025},
          eprint={2509.23045},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2509.23045}, 
    }
    
    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi K2: Open Agentic Intelligence

    Kimi launches Kimi K2, an open-source Mixture-of-Experts model built for agentic tasks, coding, math, and knowledge work. It adds Kimi-K2-Base and Kimi-K2-Instruct, and lets web, mobile, and API users try the new model for free.

    Kimi K2

    Kimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. But it goes further — meticulously optimized for agentic tasks, Kimi K2 does not just answer; it acts.

    And now, it is within your reach. Today, we are open-sourcing:

    • Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
    • Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.

    With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build.

    Agentic and Competitive Coding

    • SWE-bench Verified
    • SWE-bench Multilingual
    • LiveCodeBench v6
    • OJBench
    • Tool Use
    • Tau2-bench weighted average*
    • AceBench(en)
    • Math & STEM
    • AIME 2025
    • GPQA-Diamond
    • All models evaluated above are non-thinking models.
    • For Tau2-Bench, average is weighted by tasks.
    • For SWE-bench Multilingual, we evaluated only Claude 4 Sonnet because the cost of Claude 4 Opus was prohibitive.

    Use cases

    Give Kimi K2 your tools and describe your task.

    It automatically understands how to use the tools and gets the job done. You don't have to write any complex workflow for agentic applications.

    Salary Data Analysis

    Example 1

    Example 2

    Example 3

    Summary of Findings

    Using 2020–2025 salary data, I tested the effect of remote-work ratio on salary and whether this effect differs significantly across experience levels (EN / MI / SE / EX). The analysis reveals a significant interaction effect between remote-work ratio and experience level on salary.

    Key Statistical Evidence

    • Interaction Effect: The impact of remote work on salary is not uniform across experience levels. Specifically:
      • Entry-Level Employees: Remote workers earn 8.0% more on average than on-site workers (p ≈ 0.20, not significant).
      • Mid-Level Employees: Remote workers earn 8.5% less on average than on-site workers (p = 0.003, highly significant).
      • Senior-Level Employees: Remote workers earn 3.5% less on average than on-site workers (p ≈ 0.05, marginally significant).
      • Expert-Level Employees: Remote workers earn 4.6% more on average than on-site workers (p ≈ 0.61, not significant).

    Visual Analysis

    The interaction plot and bar charts illustrate that the salary premium or penalty associated with remote work varies by experience level. Notably, mid-level employees show a significant salary disadvantage when working remotely, while entry-level and expert-level employees tend to have a salary advantage (though not statistically significant in this dataset).

    Conclusion

    The analysis confirms a significant interaction effect between remote-work ratio and experience level on salary. The effect of remote work on salary differs across experience levels, with mid-level employees experiencing a notable salary penalty when working remotely, while entry-level and expert-level employees may benefit from remote work arrangements.

    Imagine using Kimi K2 to explore remote-work salaries with the Salary Data Analysis example, where 16 IPython calls generate statistics, visualizations, and an interactive webpage of insights. Dive into the Stanford NLP Genealogy, where Kimi K2 generates an interactive site through 5 web searches, 4 page browsings, 3 clicks, 5 scrolls, 6 edits, and 2 deployments. Or plan your dream Coldplay Tour 2025 trip in London: Kimi K2 crafts the plan through 17 seamless tool calls spanning search, calendar, Gmail, flights, Airbnb, and restaurant bookings.

    Bring Kimi K2 to your command line. It edits files. It runs commands.

    Kimi K2 understands your environment, decides what actions to take, and executes them seamlessly.

    Benchmarking Kimi K2

    Evaluation Results

    Kimi-K2-Instruct

    Kimi-K2-Base

    The table below details the performance of Kimi-K2-Instruct, showing that it matches—or outperforms—the latest open-source and proprietary models across a diverse set of tasks. The model shines on knowledge-intensive and reasoning benchmarks, delivering outstanding results in natural-language understanding, mathematics and sciences, code generation, and agentic tool uses.

    Open Agentic Intelligence

    Pre-training is the crucial foundation for Agentic Intelligence, establishing the priors that make reinforcement learning (RL) exploration tractable, efficient, and generalizable. However, as Ilya Sutskever has observed, human data is a finite "fossil fuel" whose growth lags far behind the pace of compute. This makes token efficiency during pre-training a critical new coefficient in the AI scaling laws.

    Post-training is pivotal in the "Era of Experience" (David Silver, Richard Sutton, 2025). In this era, LLMs increasingly learn from their own self-generated interactions, receiving rewards that free them from the limits of human data and enable them to surpass human capabilities.

    Kimi K2 is forged from these very insights.

    MuonClip Optimizer

    Loosely speaking, given an approximately fixed pretraining dataset and model configuration, a more token-efficient optimizer generates more intelligence. Our previous work Moonlight demonstrated that the Muon optimizer substantially outperforms the widely used AdamW optimizer for LLM training.

    Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3. Based on scaling-law analysis, we reduce the number of heads for long-context efficiency, and increase MoE sparsity for greater token efficiency. While scaling up, we encountered a persistent challenge: training instability caused by exploding attention logits, an issue that occurs more frequently with Muon but less with AdamW in our experiments. Existing solutions such as logit soft-capping and query-key normalization were found inadequate.

    To address this, we introduce the MuonClip optimizer that improves Muon with our proposed qk-clip technique. Specifically, qk-clip stabilizes training by directly rescaling the weight matrices of the query and key projections after Muon updates, thus controlling the scale of attention logits at the source.
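
    The qk-clip idea can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the threshold parameter, and applying a single global scale (rather than operating per attention head inside the optimizer) are simplifications for the sketch.

    ```python
    import numpy as np

    def qk_clip(w_q: np.ndarray, w_k: np.ndarray,
                max_logit_seen: float, tau: float = 100.0):
        """If the largest pre-softmax attention logit observed this step
        exceeds a threshold tau, shrink the query and key projection
        weights by sqrt(tau / max_logit_seen) each. Since a q.k logit is
        bilinear in w_q and w_k, this caps future logits at roughly tau,
        controlling the explosion at its source."""
        if max_logit_seen <= tau:
            return w_q, w_k
        scale = np.sqrt(tau / max_logit_seen)
        return w_q * scale, w_k * scale
    ```

    Rescaling the weights themselves (rather than soft-capping the logits downstream) is what distinguishes this approach from logit soft-capping or query-key normalization.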

    Our experiments show that MuonClip effectively prevents logit explosions while maintaining downstream task performance. In practice, Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spike, demonstrating MuonClip as a robust solution for stable, large-scale LLM training.

    Agentic Capabilities

    The enhanced agentic capabilities of Kimi K2 originate from two important aspects — large-scale agentic data synthesis and general reinforcement learning.

    Large-Scale Agentic Data Synthesis for Tool Use Learning:

    To teach the model sophisticated tool-use capabilities, we developed a comprehensive pipeline inspired by ACEBench that simulates real-world tool-using scenarios at scale. Our approach systematically evolves hundreds of domains containing thousands of tools—including both real MCP (Model Context Protocol) tools and synthetic ones—then generates hundreds of agents with diverse tool sets.

    All tasks are rubric-based, enabling consistent evaluation. Agents interact with simulated environments and user agents, creating realistic multi-turn tool-use scenarios. An LLM judge evaluates simulation results against task rubrics, filtering for high-quality training data. This scalable pipeline generates diverse, high-quality data, paving the way for large-scale rejection sampling and reinforcement learning.

    General Reinforcement Learning:

    The key challenge is to apply RL to tasks with both verifiable and non-verifiable rewards; typical examples of verifiable tasks are math and competition coding, while writing a research report is usually viewed as non-verifiable. Going beyond verifiable rewards, our general RL system uses a self-judging mechanism where the model acts as its own critic, providing scalable, rubric-based feedback for non-verifiable tasks.

    Meanwhile, on-policy rollouts with verifiable rewards are used to continuously update the critic so that the critic keeps improving its evaluation accuracy on the latest policy. This can be viewed as a way of using verifiable rewards to improve the estimation of non-verifiable rewards.
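
    The interplay between verifiable rewards and the self-judging critic can be summarized in pseudocode (the policy, critic, and task interfaces are illustrative, not an actual API):

    ```
    for batch in training_stream:
        # Verifiable tasks: ground-truth checkers give exact rewards,
        # and the same on-policy rollouts refresh the critic.
        for task in batch.verifiable:
            rollout = policy.generate(task.prompt)
            reward  = task.verify(rollout)        # e.g. tests, exact match
            policy.update(rollout, reward)
            critic.fit(rollout, reward)           # keep critic on-policy

        # Non-verifiable tasks: the freshly updated critic scores
        # rollouts against task rubrics (model-as-judge).
        for task in batch.non_verifiable:
            rollout = policy.generate(task.prompt)
            reward  = critic.score(rollout, task.rubric)
            policy.update(rollout, reward)
    ```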

    Getting started with Kimi K2

    Try Kimi K2 on kimi.com

    Starting today, Kimi users on web and mobile can select and use the new Kimi K2 model for free. At this moment, our MCP features for web and app are still in development. We hope to begin rolling them out in the coming weeks. In the meantime, you’re welcome to try our Researcher for an early look at its agentic capabilities. Please note that vision features are not supported for Kimi K2 yet.

    Use Kimi K2 with API

    The Kimi Platform offers an OpenAI/Anthropic compatible interface, allowing for easy adaptation of your existing applications to Kimi K2. We encourage developers to explore our tool calling API for building agent applications. For detailed information, visit platform.moonshot.ai.

    Serve Kimi K2 on your own

    We recommend running Kimi K2 on one of the following inference engines: vLLM, SGLang, KTransformers, or TensorRT-LLM. For detailed deployment instructions, please see our GitHub repository.

    What's next

    While Kimi K2 serves as a strong foundation for open agentic intelligence, a general agent uses more advanced capabilities such as thinking and visual understanding. We plan to add these to Kimi K2 in the future.

    Limitations

    In our internal tests, we've identified some limitations in the current Kimi K2 models. On hard reasoning tasks or with unclear tool definitions, the model may generate excessive tokens, sometimes leading to truncated outputs or incomplete tool calls. Additionally, performance may decline on certain tasks when tool use is enabled. When building complete software projects, one-shot prompting degrades performance compared to using K2 within an agentic framework. We are working to address these issues in future releases and look forward to more feedback.

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Rebuilding the "Chain of Trust": Kimi Vendor Verifier

    Kimi releases the Kimi Vendor Verifier alongside Kimi K2.6, open-sourcing a project that helps users verify open-source model inference implementations and benchmark accuracy. It adds a public validation workflow, vendor coverage, and continuous benchmarking support.

    Alongside the release of the Kimi K2.6 model, we are open-sourcing the Kimi Vendor Verifier (KVV) project, designed to help users of open-source models verify the accuracy of their inference implementations.

    Not as an afterthought, but because we learned the hard way that open-sourcing a model is only half the battle. The other half is ensuring it runs correctly everywhere else.

    Official Evaluation Results

    Think
    Non-Think

    Benchmark         Metric   Temperature   TopP   MaxTokens   Kimi API
    OCRBench          acc      1.0           0.95   16384       91.0
    AIME2025          avg@32   1.0           0.95   98304       98.4
    MMMU Pro Vision   acc      1.0           0.95   65536       78.8

    You can click here to access the Kimi API K2VV evaluation results for calculating the F1 score.

    Why We Built KVV

    From Isolated Incidents to Systemic Issues

    Since the release of K2 Thinking, we have received frequent feedback from the community regarding anomalies in benchmark scores. Our investigation confirmed that a significant portion of these cases stemmed from misuse of decoding parameters. To mitigate this immediately, we built our first line of defense at the API level: enforcing Temperature=1.0 and TopP=0.95 in Thinking mode, with mandatory validation that thinking content is correctly passed back.

    However, more subtle anomalies soon raised alarms. In a specific evaluation on LiveBenchmark, we observed a stark contrast between third-party APIs and the official API. After extensive testing across infrastructure providers, we found this discrepancy is widespread.

    This exposed a deeper problem in the open-source model ecosystem: The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes.

    If users cannot distinguish between "model capability defects" and "engineering implementation deviations," trust in the open-source ecosystem will inevitably collapse.

    Our Solution

    Six Critical Benchmarks (selected to expose specific infra failures):

    1. Pre-Verification: Validates that API parameter constraints (temperature, top_p, etc.) are correctly enforced. All tests must pass before proceeding to benchmark evaluation.
    2. OCRBench: A five-minute smoke test for multimodal pipelines.
    3. MMMU Pro: Verifies vision input preprocessing by testing diverse visual inputs.
    4. AIME2025: Long-output stress test. Catches KV cache bugs and quantization degradation that short benchmarks hide.
    5. K2VV ToolCall: Measures trigger consistency (F1) and JSON Schema accuracy. Tool errors compound in agents; we catch them early.
    6. SWE-bench: Full agentic coding test. (Not open-sourced due to its sandbox dependency.)
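
    The trigger-consistency F1 in item 5 can be computed as follows. This is a minimal sketch; `toolcall_trigger_f1` is an illustrative name, not the official K2VV scorer.

    ```python
    def toolcall_trigger_f1(expected, predicted):
        """F1 over whether a tool call fires on each prompt.
        expected/predicted: equal-length lists of booleans
        (True = a tool call was / should have been emitted)."""
        tp = sum(e and p for e, p in zip(expected, predicted))
        fp = sum(p and not e for e, p in zip(expected, predicted))
        fn = sum(e and not p for e, p in zip(expected, predicted))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0
    ```

    F1 (rather than plain accuracy) penalizes both spurious tool calls and missed ones, which is why small trigger inconsistencies compound quickly in multi-step agent loops.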

    Upstream Fix: We work with the vLLM/SGLang/KTransformers communities to fix root causes, not just detect symptoms.

    Pre-Release Validation: Rather than waiting for post-deployment complaints, we provide early access to test models. This lets infrastructure providers validate their stacks before users encounter issues.

    Continuous Benchmarking: We will maintain a public leaderboard of vendor results. This transparency encourages vendors to prioritize accuracy.

    Testing Cost Estimation

    We completed full evaluation-workflow validation on two NVIDIA H20 8-GPU servers, with sequential execution taking approximately 15 hours. To improve evaluation efficiency, the scripts have been optimized for long-running inference scenarios, including streaming inference, automatic retry, and checkpoint-resumption mechanisms.

    An Open Invitation

    Weights are open. The knowledge to run them correctly must be too.

    We are expanding vendor coverage and seeking lighter agentic tests.

    Contact Us: [email protected]

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Introducing WorldVQA

    Kimi releases WorldVQA, a new benchmark for testing factual visual world knowledge in multimodal LLMs. It includes 3,500 verified image-question pairs, clear head and tail knowledge splits, and open-sourced evaluation tools to help improve accuracy, calibration, and honesty in multimodal AI.

    A benchmark for evaluating atomic visual world knowledge in Multimodal LLMs.

    Authors Kimi Team

    Overview

    We are releasing WorldVQA, a new benchmark designed to measure the factual correctness of Multimodal Large Language Models (MLLMs). While recent models have demonstrated impressive capabilities in visual reasoning and description, measuring their reliability regarding visual world knowledge remains a challenge.

    WorldVQA focuses on a critical question: Does the model actually recognize the specific entity it sees, or is it merely hallucinating based on visual patterns?

    Our results show that WorldVQA creates a significant challenge for frontier models. Even state-of-the-art models struggle to achieve high accuracy on long-tail visual knowledge, often falling below 50% accuracy. This benchmark aims to drive progress toward more factually reliable and knowledgeable multimodal AI.

    The Dataset

    The dataset consists of 3,500 high-quality image-question pairs. The distribution aims to test a model's encyclopedic breadth across the world. The dataset distinguishes itself through three core design principles:

    • Factuality & Unambiguity: Every question has a single, verifiable ground-truth answer. We exclude subjective questions or ambiguous visual scenarios.
    • Rich Taxonomy: The dataset spans 9 categories to ensure broad coverage of world knowledge.
    • Head vs. Tail Distribution: We explicitly separate data into Head (common knowledge) and Tail (rare/long-tail knowledge). This allows us to measure how model performance degrades as knowledge becomes more obscure.

    Note on Quality: To ensure the benchmark is a reliable gold standard, all images and question-answer pairs underwent rigorous multi-stage human verification to filter out noise and ambiguity.

    Using WorldVQA to compare models

    Overall Model Accuracy

    The benchmark results show that state-of-the-art models achieve accuracy often below 50%, highlighting the challenge posed by WorldVQA.

    Measuring Calibration: Confidence vs. Accuracy

    In our experiments comparing model confidence with actual accuracy, we utilized two key metrics to measure the alignment between a model's subjective belief and its objective performance:

    • ECE (Expected Calibration Error): Measures the average gap between the model's subjective confidence and its objective accuracy. The ideal value is 0.
    • Slope (Weighted Average Slope): Measures the correlation and sensitivity between the model's accuracy and its own confidence. The ideal value is 1.0.
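
    ECE can be computed with a standard equal-width binning scheme. This is a generic sketch of the metric's definition, not necessarily the exact binning used in our evaluation:

    ```python
    def expected_calibration_error(confidences, correct, n_bins=10):
        """Expected Calibration Error: bin predictions by stated confidence
        (in [0, 1]), then average |mean confidence - accuracy| over bins,
        weighted by bin size. A perfectly calibrated model scores 0."""
        n = len(confidences)
        bins = [[] for _ in range(n_bins)]
        for conf, ok in zip(confidences, correct):
            idx = min(int(conf * n_bins), n_bins - 1)
            bins[idx].append((conf, ok))
        ece = 0.0
        for b in bins:
            if not b:
                continue
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += (len(b) / n) * abs(mean_conf - accuracy)
        return ece
    ```

    For example, a model that states 95% confidence on every question but answers only 90% correctly incurs an ECE of 0.05.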

    Calibration and Confidence Distribution Analysis. Left: Reliability diagrams plotting Actual Accuracy against Stated Confidence. To ensure statistical significance, only bins containing more than 20 samples are visualized. The size of each data point is proportional to the number of samples in that bin. The black dashed diagonal (y=x) represents perfect calibration, while colored dashed lines indicate the weighted average slope for each model. Right: The distribution of stated confidence scores across the full dataset (without sample thresholding). The plots reveal a severe overconfidence trend, with most models concentrating their predictions in the 90-100% confidence range.

    Our experiments reveal that all evaluated models are currently far from the ideal state, exhibiting a universal tendency toward overconfidence.

    While Kimi-K2.5 achieves best performance on both metrics—recording an ECE of 37.9% and a Slope of 0.550—there remains a significant gap to bridge in the pursuit of "honesty" and "alignment." Enhancing the self-awareness boundaries of multimodal models represents a critical direction for future exploration.

    Conclusion

    WorldVQA is a simple but challenging benchmark for evaluating the atomic visual knowledge of frontier models. Improving performance on WorldVQA is a necessary step for the next generation of AI agents. We are open-sourcing the WorldVQA dataset and evaluation scripts to help the community address the visual knowledge gap.

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi K2.6: Advancing Open-Source Coding

    Kimi releases Kimi K2.6, an open-source model now available on Kimi.com, the Kimi App, API, and Kimi Code, with stronger coding, long-horizon execution, agent swarm coordination, and proactive agent workflows for more capable front-end, full-stack, and autonomous tasks.

    Long-Horizon Coding

    We are open sourcing our latest model, Kimi K2.6, featuring state-of-the-art coding, long-horizon execution, and agent swarm capabilities. Kimi K2.6 is now available via Kimi.com, the Kimi App, the API, and Kimi Code.

    Kimi K2.6 shows strong improvements in long-horizon coding tasks, with reliable generalization across programming languages (e.g., Rust, Go, and Python) and tasks (e.g., front-end, devops, and performance optimization). On Kimi Code Bench, our internal coding benchmark covering diverse complicated end-to-end tasks, Kimi K2.6 demonstrates significant improvements over Kimi K2.5.

    Kimi K2.6 demonstrates strong long-horizon coding in complex engineering tasks:

    Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac. By implementing and optimizing model inference in Zig —a highly niche programming language—it demonstrated exceptional out-of-distribution generalization. Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, Kimi K2.6 dramatically improved throughput from ~15 to ~193 tokens/sec, ultimately achieving speeds ~20% faster than LM Studio.

    Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour execution, the model iterated through 12 optimization strategies, initiating over 1,000 tool calls to precisely modify more than 4,000 lines of code. Acting as an expert systems architect, Kimi K2.6 analyzed CPU and allocation flame graphs to pinpoint hidden bottlenecks and boldly reconfigured the core thread topology (from 4ME+2RE to 2ME+1RE). Despite the engine already operating near its performance limits, Kimi K2.6 extracted a 185% median-throughput leap (from 0.43 to 1.24 MT/s) and a 133% performance throughput gain (soaring from 1.23 to 2.86 MT/s).

    In beta tests, K2.6 performs well on long-horizon coding tasks in enterprise evaluations (randomly ordered):

    • Ollama: "Kimi K2.6 raises the bar for open-source models. It excels in coding and especially for agentic tools like OpenClaw and Hermes. In early testing, it sustains long multi-step sessions with impressive stability. It will work all of Ollama's integrations out of the box, and we're excited to see what developers build with it." — Michael Chiang, Co-founder
    • Kilo.ai: "K2.6 offers SOTA-level performance at a fraction of the cost. It's tremendously good at long-context tasks across the codebase, as well as the day-to-day work needed to support an always-on agent like KiloClaw." — Scott Breitenother, Cofounder and CEO
    • Augment Code: "What impressed us most about K2.6 is its surgical precision in large codebases. When an initial path is blocked, it is strong at pivoting intelligently: following existing architectural patterns, finding hidden related changes, and keeping fixes scoped to the real problem. That kind of focused adaptability helps Augment Code reduce wasted cycles and deliver faster, more cost-effective agentic coding for enterprise-scale engineering work." — Igor Ostrovsky, Co-Founder and CTO
    • Fireworks.ai: "We are thrilled to see another leap in open source models with Kimi K2.6 release, which marks a significant advancement for high-stakes, agentic workflows. The most impactful improvements lie in its long-horizon reliability and instruction following. K2.6 excels at maintaining architectural integrity over extended coding sessions, making it a stable foundation for autonomous agent pipelines, like all the 'claws'. It demonstrates a measurable leap over K2.5 in long-context tasks, achieving state-of-the-art performance in complex reasoning." — Yun Jin, Head of AI Infrastructure
    • OpenCode.ai: "Within OpenCode, Kimi K2.6 proves to be exceptionally reliable. Its approach to task decomposition and tool calling is both steady and consistent. With a sharper grasp of task requirements and more streamlined multi-step operations, it effectively minimizes repetitive overhead, resulting in a smoother, more trustworthy end-to-end experience." — Frank Wang, Founder
    • Qoder.com: "Kimi K2.6 delivered a strong performance in Qoder's internal evaluations, showing significant progress over K2.5. Specifically, there has been a notable increase in the frequency of tool calling and model invocations, reflecting a substantial boost in the model's proactivity and intelligence during task execution. This heightened initiative in tool calling enables the model to more actively grasp developer intent and automatically complete context, thereby minimizing user interruptions and wait times." — Chen Xin, Senior Technical Expert
    • Vercel.com: "K2.6 shows major gains over K2.5 on the capabilities our developers care about most: we're seeing more than 50% improvement on our Next.js benchmark, putting it among the top-performing models on the platform. Combined with its cost-performance ratio, it's a compelling option for agentic coding and front-end generation through AI Gateway. We're excited to offer it to our developer community." — Jerilyn Zheng, PM for Vercel AI
    • Factory.ai: "K2.6 is a clear improvement on K2.5 on both our benchmarks (+15%) and in side-by-side comparisons. It seems to have better instruction following, more thorough exploration and reasoning, and less likely to make coding errors or use hacks." — Leo Tchourakov, Member of Technical Staff
    • Baseten.co: "Kimi K2.6's evolution is impressive. It excels on coding tasks at a level comparable to leading closed source models, and offers strong tool calling quality due to its deep understanding of third party frameworks. Kimi K2.6's excellent reliability makes it a great choice for complex and long-horizon engineering tasks." — Bola Malek, Head of Labs
    • Anything.com: "In a no-code environment, AI has to handle every edge case. There's no developer to step in when something doesn't work as expected. K2.6 is noticeably more effective than K2.5 at navigating nuanced API behaviors and recovering when things break, and it runs longer-horizon tasks before hitting a wall. We've seen a real improvement in getting users from idea to deployment compared to K2.5." — Ahmad Jiha, Founding AI Engineer
    • Hermes Agent: "Got an early look at K2.6 and ran it through Hermes Agent. Tool calling and agentic loops feel noticeably tighter, coding is a clear step up, and the creative range surprised us. We're super excited about running a hackathon with Kimi on creativity. Kimi team continues to beat expectations!" — Thomas Eastman, Hermes Agent
    • CodeBuddy.ai: "Kimi K2.6 demonstrates significant improvements over K2.5 in internal evaluations conducted by CodeBuddy: code generation accuracy increased by 12%, long-context stability improved by 18%, and tool invocation success rate reached 96.60%. Its stronger reasoning capabilities and more consistent output quality provide robust support for ensuring a reliable user experience in CodeBuddy WorkBuddy." — CodeBuddy WorkBuddy Eval Team
    • Blackbox.ai: "Kimi K2.6 sets a new level for open-sourced models, especially in long-horizon, agent-style coding workflows. It handles complex, multi-step tasks with stronger instruction following and consistently high code quality. We've seen it sustain extended coding sessions with remarkable stability, far beyond typical models. It also surfaces deep, non-obvious bugs that would normally take significant developer time to uncover. Overall, K2.6 sets a new bar for reliable coding." — Robert Rizk, Cofounder and CEO

    Coding-Driven Design

    Building on its strong coding capabilities, Kimi K2.6 can turn simple prompts into complete front-end interfaces, generating structured layouts with deliberate design choices such as aesthetic hero sections, as well as interactive elements and rich animations, including scroll-triggered effects. With strong proficiency in leveraging image and video generation tools, Kimi K2.6 can generate visually coherent assets, contributing to higher-quality, more salient hero sections.

    Moreover, Kimi K2.6 expands beyond static frontend development to simple full-stack workflows—spanning authentication, user interaction, and database operations for lightweight use cases like transaction logging or session management.

    We established an internal Kimi Design Bench, organized into four categories: Visual Input Tasks, Landing Page Construction, Full-Stack Application Development, and General Creative Programming. In comparison with Google AI Studio, Kimi K2.6 shows promising results and performs well across these categories.

    Agent Swarms, Elevated

    Scaling out, not just up. An Agent Swarm dynamically decomposes tasks into heterogeneous subtasks executed concurrently by self-created domain-specialized agents.

    Building on the K2.5 Agent Swarm research preview, Kimi K2.6 Agent Swarm demonstrates a qualitative leap in the agent swarm experience. It seamlessly coordinates heterogeneous agents to combine complementary skills: broad search layered with deep research, large-scale document analysis fused with long-form writing, and multi-format content generation executed in parallel. This compositional intelligence enables the swarm to deliver end-to-end outputs—spanning documents, websites, slides, and spreadsheets—within a single autonomous run.

    The architecture scales horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously, a substantial expansion from K2.5's 100 sub-agents and 1,500 steps. This massive parallelization fundamentally reduces end-to-end latency while significantly enhancing output quality and expanding the operational boundaries of agent swarms.

    It can also turn high-quality files such as PDFs, spreadsheets, slides, and Word documents into Skills. Kimi K2.6 captures and maintains each document's structural and stylistic DNA, enabling you to reproduce the same quality and format in future tasks.

    Proactive Agents

    K2.6 demonstrates strong performance in autonomous, proactive agents such as OpenClaw and Hermes, which operate across multiple applications with continuous, 24/7 execution.

    Unlike simple chat-based interactions, these workflows require AI to proactively manage schedules, execute code, and orchestrate cross-platform operations as a persistent background agent.

    Our RL infra team used a K2.6-backed agent that operated autonomously for 5 days, managing monitoring, incident response, and system operations, demonstrating persistent context, multi-threaded task handling, and full-cycle execution from alert to resolution.

    Kimi K2.6 delivers measurable improvements in real-world reliability: more precise API interpretation, more stable long-running performance, and enhanced safety awareness during extended research tasks.

    Performance gains are quantified by our internal Claw Bench, an evaluation suite spanning five domains: Coding Tasks, IM Ecosystem Integration, Information Research & Analysis, Scheduled Task Management, and Memory Utilization. Across all metrics, Kimi K2.6 significantly outperforms Kimi K2.5 in task completion rates and tool invocation accuracy—particularly in workflows requiring sustained autonomous operation without human oversight.

    Bring Your Own Agents

    Building on its robust orchestration capabilities, Kimi K2.6 extends your proactive agents to Claw Groups as a research preview—a new instantiation of the Agent Swarm architecture.

    Claw Groups embrace an open, heterogeneous ecosystem: Multiple agents and humans operate as true collaborators. Users can onboard agents from any device, running any model, each carrying their own specialized toolkits, skills and persistent memory contexts. Whether deployed on local laptops, mobile devices, or cloud instances, these diverse agents integrate seamlessly into a shared operational space.

    At the center of this swarm, Kimi K2.6 serves as an adaptive coordinator. It dynamically matches tasks to agents based on their specific skill profiles and available tools, optimizing for capability fit. When an agent encounters failure or stalls, the coordinator detects the interruption, automatically reassigns the task or regenerates subtasks, and actively manages the full lifecycle of deliverables—from initiation through validation to completion.

    We also want to thank the K2.6-powered agents in Claw Groups—we've been dogfooding Claw Groups with our own agent marketing team, refining human–agent workflows in practice. Using Claw Groups, we run end-to-end content production and launch campaigns, with specialized agents like Demo Makers, Benchmark Makers, Social Media Agents, and Video Makers working together. K2.6 coordinates the process, enabling agents to share intermediate results and turn ideas into consistent, fully packaged deliverables.

    We are moving beyond simply asking AI a question or assigning it a task, entering a phase where humans and AI collaborate as genuine partners—combining strengths to solve problems collectively. Claw Groups marks our latest effort toward a future where the boundaries between "my agent," "your agent," and "our team" dissolve into a collaborative system.

    Benchmark Table and Footnotes are included in the full release but omitted here for brevity.

    Original source
  • Apr 21, 2026
    • Date parsed from source:
      Apr 21, 2026
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Introducing Kimi K2 Thinking

    Kimi introduces K2 Thinking, its open-source thinking model, with state-of-the-art reasoning, agentic search, coding, and writing gains. It can handle 200 to 300 sequential tool calls and is now live in chat on kimi.com, with API access available.

    Evaluations

    Kimi K2 Thinking sets new records across benchmarks that assess reasoning, coding, and agent capabilities. K2 Thinking achieves 44.9% on HLE with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, demonstrating strong generalization as a state-of-the-art thinking agent model.

    Agentic Reasoning

    K2 Thinking demonstrates outstanding reasoning and problem-solving abilities. On Humanity’s Last Exam (HLE)—a rigorously crafted, closed‑ended benchmark—spanning thousands of expert‑level questions across more than 100 subjects, K2 Thinking achieved a state-of-the-art score of 44.9%, with search, python, and web-browsing tools, establishing new records in multi‑domain expert‑level reasoning performance.

    By reasoning while actively using a diverse set of tools, K2 Thinking is capable of planning, reasoning, executing, and adapting across hundreds of steps to tackle some of the most challenging academic and analytical problems. In one instance, it successfully solved a PhD-level mathematics problem through 23 interleaved reasoning and tool calls, exemplifying its capacity for deep, structured reasoning and long-horizon problem solving.

    Agentic Coding

    K2 Thinking exhibits substantial gains in coding and software development tasks. It achieves scores of 61.1% on SWE-Multilingual, 71.3% on SWE-Bench Verified, and 47.1% on Terminal-Bench, showcasing strong generalization across programming languages and agent scaffolds. The model delivers notable improvements on HTML, React, and component-intensive front-end tasks—translating ideas into fully functional, responsive products. In agentic coding settings, it reasons while invoking tools, integrating fluidly into software agents to execute complex, multi-step development workflows with precision and adaptability.

    Agentic Search and Browsing

    K2 Thinking demonstrates strong performance in agentic search and browsing scenarios. On BrowseComp—a challenging benchmark designed to evaluate models' ability to continuously browse, search, and reason over hard-to-find real-world web information—K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. This result highlights K2 Thinking's superior capability for goal-directed, web-based reasoning and its robustness in dynamic, information-rich environments.

    K2 Thinking can execute 200–300 sequential tool calls, driven by long-horizon planning and adaptive reasoning. It performs dynamic cycles of think → search → browser use → think → code, continually generating and refining hypotheses, verifying evidence, reasoning, and constructing coherent answers. This interleaved reasoning allows it to decompose ambiguous, open-ended problems into clear, actionable subtasks.
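
The interleaved think → search → browse → code cycle described above can be sketched as a simple controller loop. The function, tool names, and stopping rule below are illustrative, not Kimi's actual implementation:

```python
def agentic_loop(task, model, tools, max_calls=300):
    """Interleaved reason/act loop: the model inspects the full history,
    optionally invokes a tool (search, browser, code, ...), observes the
    result, and repeats until it emits a final answer or exhausts the
    sequential tool-call budget (illustrative cap matching the text)."""
    history = [("task", task)]
    for _ in range(max_calls):
        thought, action, arg = model(history)   # reason over everything so far
        history.append(("thought", thought))
        if action == "answer":                  # model decides it is done
            return arg
        observation = tools[action](arg)        # e.g. search / browse / run code
        history.append((action, observation))
    return "budget exhausted"
```

In practice the model generates and refines hypotheses across these cycles, with the history carrying all intermediate evidence forward.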

    General Capabilities

    Creative Writing: K2 Thinking delivers improvements in completeness and richness. It shows stronger command of style and instruction, handling diverse tones and formats with natural fluency. Its writing becomes more vivid and imaginative—poetic imagery carries deeper associations, while stories and scripts feel more human, emotional, and purposeful. The ideas it expresses often reach greater thematic depth and resonance.

    Practical Writing: K2 Thinking demonstrates marked gains in reasoning depth, perspective breadth, and instruction adherence. It follows prompts with higher precision, addressing each requirement clearly and systematically—often expanding on every mentioned point to ensure thorough coverage. In academic, research, and long-form analytical writing, it excels at producing rigorous, logically coherent, and substantively rich content, making it particularly effective in scholarly and professional contexts.

    Personal & Emotional: When addressing personal or emotional questions, K2 Thinking responds with more empathy and balance. Its reflections are thoughtful and specific, offering nuanced perspectives and actionable next steps. It helps users navigate complex decisions with clarity and care—grounded, practical, and genuinely human in tone.

    Original source
  • Apr 21, 2026
    • Date parsed from source:
      Apr 21, 2026
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi K2.5: Visual Agentic Intelligence

    Kimi introduces K2.5, its most powerful open-source model yet, with native multimodal coding and vision, agent swarm automation, and stronger office productivity. It is available on Kimi.com, the Kimi App, the API, and Kimi Code, with Agent Swarm in beta on Kimi.com.

    Today, we are introducing Kimi K2.5, the most powerful open-source model to date.

    Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens. Built as a native multimodal model, K2.5 delivers state-of-the-art coding and vision capabilities and a self-directed agent swarm paradigm.

    For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. Compared with a single-agent setup, this reduces execution time by up to 4.5x. The agent swarm is automatically created and orchestrated by Kimi K2.5 without any predefined subagents or workflow.

    Kimi K2.5 is available via Kimi.com, the Kimi App, the API, and Kimi Code. Kimi.com and the Kimi App now support four modes: K2.5 Instant, K2.5 Thinking, K2.5 Agent, and K2.5 Agent Swarm (Beta). Agent Swarm is currently in beta on Kimi.com, with free credits available for high-tier paid users.

    Across three agentic benchmarks—HLE, BrowseComp, and SWE-Verified—Kimi K2.5 delivers strong performance at a fraction of the cost.

    1. Coding with Vision

    Kimi K2.5 is the strongest open-source model to date for coding, with particularly strong capabilities in front-end development.

    K2.5 can turn simple conversations into complete front-end interfaces, implementing interactive layouts and rich animations such as scroll-triggered effects. Beyond text prompts, K2.5 excels at coding with vision. By reasoning over images and video, K2.5 improves image/video-to-code generation and visual debugging, lowering the barrier for users to express intent visually.

    This capability stems from massive-scale vision-text joint pre-training. At scale, the trade-off between vision and text capabilities disappears — they improve in unison.

    K2.5 excels in real-world software engineering tasks. We evaluate it using Kimi Code Bench, our internal coding benchmark covering diverse end-to-end tasks — from building to debugging, refactoring, testing, and scripting — across multiple programming languages. On this benchmark, K2.5 shows consistent and meaningful improvements over K2 across task types.

    To try out K2.5's agentic coding capabilities, K2.5 Agent offers a set of preconfigured tools for immediate, hands-on experiences. For software engineering use cases, we recommend pairing Kimi K2.5 with our new coding product, Kimi Code.

    Kimi Code works in your terminal and integrates with various IDEs, including VS Code, Cursor, and Zed. Kimi Code is open source and supports images and videos as inputs. It also automatically discovers and migrates your existing skills and MCPs into the Kimi Code working environment.

    Here's an example using Kimi Code to translate the aesthetic of Matisse's La Danse into the Kimi App. This demo highlights a breakthrough in autonomous visual debugging: using visual inputs and documentation lookup, K2.5 visually inspects its own output and iterates on it autonomously, producing an art-inspired webpage end to end.

    2. Agent Swarm

    Scaling Out, Not Just Up.

    We release K2.5 Agent Swarm as a research preview, marking a shift from single-agent scaling to self-directed, coordinated swarm-like execution.

    Trained with Parallel-Agent Reinforcement Learning (PARL), K2.5 learns to self-direct an agent swarm of up to 100 sub-agents, executing parallel workflows across up to 1,500 coordinated steps, without predefined roles or hand-crafted workflows.

    PARL uses a trainable orchestrator agent to decompose tasks into parallelizable subtasks, each executed by dynamically instantiated, frozen subagents. Running these subtasks concurrently significantly reduces end-to-end latency compared to sequential agent execution.

    Training a reliable parallel orchestrator is challenging due to delayed, sparse, and non-stationary feedback from independently running subagents. A common failure mode is serial collapse, where the orchestrator defaults to single-agent execution despite having parallel capacity. To address this, PARL employs staged reward shaping that encourages parallelism early in training and gradually shifts focus toward task success.

    We define the reward as a weighted sum of parallel instantiation reward, sub-agent finish rate, and task-level outcome performance reward. The performance reward evaluates the overall success and quality of the solution for a given task. The parallel instantiation reward mitigates serial collapse by incentivizing subagent instantiation, encouraging exploration of concurrent scheduling spaces. The finish reward focuses on successful completion of subtasks to prevent spurious parallelism, guiding the policy toward valid and effective decompositions. Hyperparameters are annealed to zero over training.
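
As a rough sketch, the staged reward shaping described above could look like the following. The weights, the linear annealing schedule, and the saturating parallel bonus are all assumptions; the actual values are not published:

```python
def anneal(w0: float, step: int, total_steps: int) -> float:
    """Linearly anneal a shaping weight to zero over training (assumed schedule)."""
    return w0 * max(0.0, 1.0 - step / total_steps)

def parl_reward(n_spawned: int, n_finished: int, outcome: float,
                step: int, total_steps: int,
                w_parallel: float = 0.2, w_finish: float = 0.3) -> float:
    """Weighted sum of parallel-instantiation, finish-rate, and outcome rewards.

    - parallel term: rewards spawning sub-agents (mitigates serial collapse)
    - finish term: rewards subtasks that actually complete (prevents spurious parallelism)
    - outcome term: task-level success/quality, always active
    Shaping weights anneal to zero, so late training optimizes outcome only.
    """
    r_parallel = min(n_spawned, 10) / 10.0  # saturating bonus (assumed cap)
    r_finish = n_finished / n_spawned if n_spawned else 0.0
    wp = anneal(w_parallel, step, total_steps)
    wf = anneal(w_finish, step, total_steps)
    return wp * r_parallel + wf * r_finish + outcome
```

Early in training the shaping terms dominate exploration toward concurrent scheduling; once annealed away, only task success drives the policy.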

    To further force parallel strategies to emerge, we introduce a computational bottleneck that makes sequential execution impractical. Instead of counting total steps, we evaluate performance using Critical Steps, a latency-oriented metric inspired by the critical path in parallel computation. This metric captures orchestration overhead and the slowest subagent at each stage. Under this metric, spawning more subtasks only helps if it shortens the critical path.
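
A minimal sketch of such a Critical Steps metric, assuming a stage-structured trace where each stage's cost is the orchestrator's own steps plus the slowest concurrently running sub-agent (the exact internal accounting is not published):

```python
def critical_steps(stages: list) -> int:
    """Latency-oriented step count: per stage, orchestrator overhead plus the
    slowest concurrent sub-agent, summed over stages. Each stage is a dict
    {"orchestrator": int, "subagents": [int, ...]}. Spawning more subtasks
    only helps if it shortens this critical path."""
    total = 0
    for stage in stages:
        slowest = max(stage["subagents"], default=0)
        total += stage["orchestrator"] + slowest
    return total

# Three 10-step subtasks run sequentially vs. in one parallel stage
# (2 orchestrator steps per stage, illustrative numbers):
sequential = [{"orchestrator": 2, "subagents": [10]}] * 3   # 3 * (2 + 10) = 36
parallel = [{"orchestrator": 2, "subagents": [10, 10, 10]}]  # 2 + 10 = 12
```

Under this metric the parallel plan costs 12 critical steps against 36 for the sequential one, even though both perform the same 30 steps of subtask work.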

    An agent swarm has an orchestrator that dynamically creates specialized subagents (e.g., AI Researcher, Physics Researcher, Fact Checker) and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.

    In our parallel-agent reinforcement learning environment, the reward increases smoothly as training progresses. At the same time, the level of parallelism during training also gradually increases.

    K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution. In our internal evaluations, it leads to an 80% reduction in end-to-end runtime while enabling more complex, long-horizon workloads.

    Agent Swarm reduces the minimum critical steps required to achieve target performance by 3×–4.5× compared to single-agent execution in wide-search scenarios, with savings scaling as targets rise—translating to up to a 4.5× wall-clock time reduction via parallelization.

    Here are representative trajectories demonstrating K2.5 Agent Swarm in action: Parallel at Scale - 100 Sub-agents Hunting for Creators. The task is to identify the top three YouTube creators across 100 niche domains. K2.5 Agent Swarm first researches and defines each domain, then autonomously creates 100 sub-agents to conduct parallel searches. Each sub-agent identifies leading creators within its assigned niche, and the results—300 YouTuber profiles—are aggregated into a structured spreadsheet.

    3. Office Productivity

    Kimi K2.5 brings agentic intelligence into real-world knowledge work.

    K2.5 Agent can handle high-density, large-scale office work end to end. It reasons over large, high-density inputs, coordinates multi-step tool use, and delivers expert-level outputs: documents, spreadsheets, PDFs, and slide decks—directly through conversation.

    With a focus on real-world professional tasks, we design two internal expert productivity benchmarks. The AI Office Benchmark evaluates end-to-end Office output quality, while the General Agent Benchmark measures multi-step, production-grade workflows against human expert performance. Across both benchmarks, K2.5 shows 59.3% and 24.3% improvements over K2 Thinking, reflecting stronger end-to-end performance on real-world tasks.

    K2.5 Agent supports advanced tasks such as adding annotations in Word, constructing financial models with Pivot Tables, and writing LaTeX equations in PDFs, while scaling to long-form outputs like 10,000-word papers or 100-page documents.

    Tasks that once took hours or days now complete in minutes.

    4. Conclusion

    Grounded in advances in coding with vision, agent swarms, and office productivity, Kimi K2.5 represents a meaningful step toward AGI for the open-source community, demonstrating strong capability on real-world tasks under real-world constraints. Looking ahead, we will push further into the frontier of agentic intelligence, redefining the boundaries of AI in knowledge work.

    Benchmark table and detailed footnotes provide extensive evaluation results comparing Kimi K2.5 with other leading models across reasoning, vision, coding, and productivity benchmarks.

    To reproduce official Kimi-K2.5 benchmark results, we recommend using the official API. For third-party providers, refer to Kimi Vendor Verifier (KVV) to choose high-accuracy services.

    Original source
  • Jan 27, 2026
    • Date parsed from source:
      Jan 27, 2026
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi-K2-Instruct-0905

    Kimi releases K2-Instruct-0905, its latest MoE model with stronger agentic coding, better frontend coding, and a longer 256K context window. It also adds API access, tool calling support, and deployment guidance for popular inference engines.

    1. Model Introduction

    Kimi K2-Instruct-0905 is the latest, most capable version of Kimi K2. It is a state-of-the-art mixture-of-experts (MoE) language model, featuring 32 billion activated parameters and a total of 1 trillion parameters.

    Key Features

    • Enhanced agentic coding intelligence: Kimi K2-Instruct-0905 demonstrates significant improvements in performance on public benchmarks and real-world coding agent tasks.
    • Improved frontend coding experience: Kimi K2-Instruct-0905 offers advancements in both the aesthetics and practicality of frontend programming.
    • Extended context length: Kimi K2-Instruct-0905’s context window has been increased from 128k to 256k tokens, providing better support for long-horizon tasks.

    2. Model Summary

    Architecture: Mixture-of-Experts (MoE)
    Total Parameters: 1T
    Activated Parameters: 32B
    Number of Layers (Dense layer included): 61
    Number of Dense Layers: 1
    Attention Hidden Dimension: 7168
    MoE Hidden Dimension (per Expert): 2048
    Number of Attention Heads: 64
    Number of Experts: 384
    Selected Experts per Token: 8
    Number of Shared Experts: 1
    Vocabulary Size: 160K
    Context Length: 256K
    Attention Mechanism: MLA
    Activation Function: SwiGLU
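
The routing implied by these numbers can be illustrated with a toy top-k gating function. Real MoE routers use learned logits with softmax gating and load-balancing losses, which this sketch omits:

```python
import random

def topk_route(scores, k=8):
    """Pick the k highest-scoring experts for a token (toy top-k gating)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

NUM_EXPERTS, TOPK, SHARED = 384, 8, 1  # values from the summary above

scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for router logits
routed = topk_route(scores, TOPK)
# Each token passes through TOPK routed experts plus the SHARED always-on
# expert, i.e. 9 of 385 expert FFNs - one reason activated parameters (32B)
# are a small fraction of total parameters (1T).
```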

    3. Evaluation Results

    All K2-Instruct-0905 numbers are reported as mean ± std over five independent, full-test-set runs. Before each run we prune the repository so that every Git object unreachable from the target commit disappears; this guarantees the agent sees only the code that would legitimately be available at that point in history.

    Except for Terminal-Bench (Terminus-2), every result was produced with our in-house evaluation harness. The harness is derived from SWE-agent, but we clamp the context windows of the Bash and Edit tools and rewrite the system prompt to match the task semantics. All baseline figures denoted with an asterisk (*) are excerpted directly from their official report or public leaderboard; the remaining metrics were evaluated by us under conditions identical to those used for K2-Instruct-0905.

    For SWE-Dev we go one step further: we overwrite the original repository files and delete any test file that exercises the functions the agent is expected to generate, eliminating any indirect hints about the desired implementation.

    4. Deployment

    You can access Kimi K2's API at https://platform.moonshot.ai; we provide OpenAI- and Anthropic-compatible APIs.

    The Anthropic-compatible API maps temperature by real_temperature = request_temperature * 0.6 for better compatibility with existing applications.
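
In code, this mapping is a simple scaling; the inverse helper below, for hitting a desired effective temperature, is a hypothetical convenience, not part of the API:

```python
TEMPERATURE_SCALE = 0.6  # Anthropic-compatible endpoint scaling, per the docs

def effective_temperature(request_temperature: float) -> float:
    """Temperature the model actually samples with."""
    return request_temperature * TEMPERATURE_SCALE

def request_temperature_for(target: float) -> float:
    """Request-side value needed to achieve a desired effective temperature
    (hypothetical helper for illustration)."""
    return target / TEMPERATURE_SCALE
```

For example, a request sent with temperature 1.0 is sampled at an effective 0.6.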

    Our model checkpoints are stored in the block-fp8 format; you can find them on Hugging Face.

    We currently recommend running Kimi-K2 on the following inference engines:

    • vLLM
    • SGLang
    • KTransformers
    • TensorRT-LLM

    Deployment examples for vLLM and SGLang can be found in the Model Deployment Guide.

    5. Model Usage

    Chat Completion

    Once the local inference service is up, you can interact with it through the chat endpoint:

    (The page includes example code for chat interaction and recommended temperature setting: temperature = 0.6)
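
Since the endpoint is OpenAI-compatible, a request body for the chat endpoint might be assembled as follows. The model id and system prompt are placeholders; temperature = 0.6 is the recommended setting noted above:

```python
def chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Build an OpenAI-compatible chat completion request body.

    The model id and system prompt are illustrative placeholders;
    temperature = 0.6 is the documented recommendation.
    """
    return {
        "model": "kimi-k2-instruct-0905",  # placeholder model id
        "messages": [
            {"role": "system", "content": "You are Kimi, an AI assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

# POST this body as JSON to <your-server>/v1/chat/completions once the
# local inference service is up.
```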

    Tool Calling

    Kimi-K2-Instruct-0905 has strong tool-calling capabilities. To enable them, pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.

    The page provides an example demonstrating calling a weather tool end-to-end, including tool implementation, schema definition, and usage.

    The tool_call_with_client function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic. For more information, see the Tool Calling Guide.
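
A simplified sketch of that pipeline's tool-execution step, using the weather tool from the docs' example. The schema layout follows the OpenAI function-calling convention, and the helper names here are hypothetical (this is not the docs' tool_call_with_client):

```python
import json

def get_weather(city: str) -> dict:
    """Toy implementation of the example's weather tool."""
    return {"city": city, "condition": "sunny"}  # stubbed result

# OpenAI-style schema advertising the tool; passed with every request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(tool_call: dict) -> str:
    """Run the tool the model requested and serialize the result, which is
    sent back to the model as a 'tool' role message in the next request."""
    fn = TOOL_REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return json.dumps(fn(**args))
```

In the full loop, a model response may contain several such tool calls; each result is appended to the conversation until the model answers without requesting further tools.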

    6. License

    Both the code repository and the model weights are released under the Modified MIT License.

    7. Third Party Notices

    See THIRD PARTY NOTICES.

    8. Contact Us

    If you have any questions, please reach out at [email protected].

    Original source
  • Jul 10, 2025
    • Date parsed from source:
      Jul 10, 2025
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimina-Prover-72B

    Kimi releases Kimina-Prover-72B, a new formal reasoning model for Lean 4 that hits a new miniF2F benchmark high, improves sample efficiency, and ships open source distilled models, a rectified test set, and the Kimina Lean Server.

    🚀 UPDATE - Jul 10, 2025

    We are excited to announce the official release of Kimina-Prover-72B! For detailed information about this release, please check out our blog post.

    📈 Introducing Kimina-Prover Preview, the first large formal reasoning model that can reason in a human-like way and prove mathematical theorems rigorously in the Lean 4 language.

    • SotA performance: It achieves an 80%+ pass rate on the miniF2F benchmark for the first time among all published results. It outperforms all prior works, such as BFS-Prover (72.9%, previous SotA), Hunyuan-Prover, DeepSeek-Prover, and Leanabelle-Prover, by a large margin.
    • High Sample Efficiency: Kimina-Prover Preview delivers strong results even with very small sample budgets, e.g., 68.85% at pass@32 and 65.16% at pass@8.
    • Open Source: We release two distilled versions of our RL model and one autoformalization model on Hugging Face. We also release a rectified version of miniF2F-test as our model helps to identify at least 5 problems in the miniF2F-test dataset that were wrongly formalized. All proofs found by Kimina-Prover Preview in miniF2F-test are also released in this repo (zipped to avoid contamination). Lastly, we release Kimina Lean Server, the workhorse Lean server used during the entire training process of Kimina-Prover.

    Key Ingredients of Kimina-Prover Preview

    Some key ingredients about the design and training of Kimina-Prover Preview are listed as follows.

    • Whole-proof Generation Enhanced by RL: All proofs are generated without any prover feedback during training and testing. Consistent with the results of Kimi k1.5, we show that strong performance can be achieved without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models.
    • Model Size Scaling: Our experiments demonstrate performance scaling with model size, a trend previously unobserved for neural theorem provers. Specifically, we use a 72B model and obtain much stronger performance than smaller models.
    • Long Context Scaling: We adopt a context window of up to 32K tokens for RL training and inference, the longest context used in the neural theorem proving community.
    • Distinct Reasoning Style: We carefully design a reasoning style that we call Formal Reasoning Pattern that bridges the gap between formal verification and informal mathematical intuition.
    Original source
  • Jun 21, 2025
    • Date parsed from source:
      Jun 21, 2025
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi-VL-A3B-Thinking-2506

    Kimi releases Kimi-VL, an efficient open-source vision-language model with strong multimodal reasoning, long-context understanding, and agent capabilities. It also adds Kimi-VL-Thinking and a 2506 variant that improves reasoning, video understanding, high-resolution perception, and token efficiency.

    We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities—all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).

    Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.

    In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.

    Kimi-VL also advances the Pareto frontier of multimodal models in processing long contexts and perceiving clearly: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks.

    Building on this foundation, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal thinking models.

    Besides original model variants, we also provide a new Kimi-VL-A3B-Thinking-2506 variant with several new or improved abilities:

    • It Thinks Smarter while Consuming Fewer Tokens: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.2), and 64.0 on MMMU (+2.1), while reducing average thinking length by 20%.
    • It Sees Clearer with Thinking: Unlike the previous version, which specialized in thinking tasks, the 2506 version matches or exceeds the original non-thinking version (Kimi-VL-A3B-Instruct) on general visual perception and understanding, e.g., MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), and MMVet (78.4).
    • It Extends to Video Scenarios: The new 2506 version also improves on video reasoning and understanding benchmarks. It sets new state-of-the-art for open-source models on VideoMMMU (65.2), while also retaining good ability on general video understanding (71.9 on Video-MME).
    • It Extends to Higher Resolution: The new 2506 version supports 3.2 million total pixels in a single image (1792×1792), 4× the original release. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, and 52.5 on OSWorld-G (full set with refusal).

    2025.06.21: Release of Kimi-VL-A3B-Thinking-2506: Tech Blog & Cookbook, 🤗 Hugging Face
    2025.04.15: vLLM has supported Kimi-VL deployment. See #16387 for details.
    2025.04.14: LLaMA-Factory has supported Kimi-VL finetuning. See #7719 for details.

    For general multimodal perception and understanding, OCR, long-video and long-document comprehension, video perception, and OS-agent use cases, we recommend Kimi-VL-A3B-Instruct for efficient inference. Meanwhile, our new thinking version, Kimi-VL-A3B-Thinking-2506, also offers excellent multimodal perception, long-video, long-document, and OS-agent grounding abilities while achieving better multimodal reasoning. See this blog for more information.

    Note: Recommended parameter settings:

    • For Thinking models, it is recommended to use Temperature = 0.8.
    • For Instruct models, it is recommended to use Temperature = 0.2.
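
    The two recommendations above can be wired into generation settings mechanically. A minimal sketch, assuming a variant can be identified by the substring "thinking" in its name (the model identifiers and the kwarg dict are illustrative, not an official API):

```python
# Minimal sketch: pick the recommended sampling temperature by model variant.
# Model identifiers below are illustrative; check the release for exact names.

RECOMMENDED_TEMPERATURE = {
    "thinking": 0.8,   # e.g. Kimi-VL-A3B-Thinking-2506
    "instruct": 0.2,   # e.g. Kimi-VL-A3B-Instruct
}

def sampling_params(model_name: str) -> dict:
    """Return generation kwargs with the recommended temperature."""
    variant = "thinking" if "thinking" in model_name.lower() else "instruct"
    return {"do_sample": True, "temperature": RECOMMENDED_TEMPERATURE[variant]}

print(sampling_params("moonshotai/Kimi-VL-A3B-Thinking-2506"))
# {'do_sample': True, 'temperature': 0.8}
```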

    As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).

    With effective long-thinking abilities, Kimi-VL-A3B-Thinking (2504 version) can match the performance of 30B/70B frontier open-source VLMs on the MathVision benchmark.

    Original source
  • Jun 20, 2025
    • Date parsed from source:
      Jun 20, 2025
    • First seen by Releasebot:
      Apr 21, 2026
    Kimi logo

    Kimi

    Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities

    Kimi introduces Kimi-Researcher, an autonomous search and reasoning agent now rolling out to users. It deepens research with multi-turn web exploration, strong benchmark results, and end-to-end reinforcement learning, bringing a more capable research experience inside Kimi.

    Meet Kimi-Researcher, an autonomous agent that excels at multi-turn search and reasoning. It performs an average of 23 reasoning steps and explores over 200 URLs per task. Built on an internal version of the Kimi k-series model and trained entirely through end-to-end agentic reinforcement learning (RL), it achieved a Pass@1 score of 26.9%—a state-of-the-art result—on Humanity's Last Exam, and Pass@4 accuracy of 40.17%. Starting from an initial HLE score of 8.6%, Kimi-Researcher reached 26.9% almost entirely through end-to-end RL training, providing compelling evidence that end-to-end agentic RL can significantly advance agent intelligence.

    Kimi-Researcher has also achieved strong performance across several complex and challenging real-world benchmarks. On xbench, a new, dynamic, professionally-aligned suite designed to bridge AI capabilities with real-world productivity, Kimi-Researcher achieved 69% pass@1 (averaged on 4 runs) on xbench-DeepSearch, outperforming models such as o3 with search tools. On benchmark tests for multi-turn search reasoning (FRAMES, Seal-0) and factual information (SimpleQA), Kimi-Researcher also achieved strong performance.

    Figure 1

    1. Potential fluctuations in tools, such as search engines, may affect performance. The results are tested on: HLE on June 17, 2025; and xbench-DeepSearch, Seal-0, Frames, and SimpleQA on June 18, 2025.
    2. All Kimi-Researcher results were evaluated using o3-mini. Scores of other models are referenced from the relevant papers or leaderboards.[1][2][3][4][5]
    3. For benchmarks with fewer than 200 test samples (xbench, Seal-0), we performed four runs and reported the average result (avg@4).
    4. We do not compare multi-agent workflows based on multiple frontier models here, as our focus is on evaluating model capabilities.

    End-to-end agentic RL is promising but challenging

    Kimi-Researcher is an autonomous agentic and thinking model designed to solve complex problems through multi-step planning, reasoning, and tool use. It leverages three main tools: a parallel, real-time internal search tool; a text-based browser tool for interactive web tasks; and a coding tool for automated code execution.

    Traditional agent development has key limitations:

    1. Workflow-Based Systems: Multi-agent workflows assign roles to specialized agents and coordinate the agents using prompt-based workflows. While effective, they are tied to specific LLM versions and need frequent manual updates as models or environments change, reducing scalability and flexibility.
    2. Imitation Learning with Supervised Finetuning (SFT): Imitation learning aligns models well with human demonstrations but struggles with data labeling—especially for long-horizon, agentic tasks in dynamic environments. Furthermore, SFT datasets are tightly coupled with specific tool versions, resulting in poor generalization as tools evolve.

    End-to-end agentic reinforcement learning trains a single model to solve problems holistically: given a query, the agent explores a large number of possible strategies, receives rewards for correct solutions, and learns from the full trajectory. Unlike SFT, it naturally handles long, on-policy reasoning and adapts to changing tools and environments; unlike modular approaches, all skills—planning, perception, and tool use—are learned together without hand-crafted rules or workflow templates. Previous work like OpenAI's Deep Research also highlights the strong performance of this approach, but it introduces new challenges:

    • Dynamic Environments: Agents must adapt to constantly changing conditions, as even identical queries can yield different results over time. The goal is robust generalization despite distribution shifts.
    • Long-Horizon Tasks: Kimi-Researcher can run 70+ search queries* per trajectory, with context windows reaching hundreds of thousands of tokens. This demands advanced memory management and long-context models.
    • Data Scarcity: High-quality RL datasets for agentic QA are rare. We address this by automatically synthesizing training data, allowing large-scale learning without manual labeling.
    • Rollout Efficiency: Multi-turn reasoning and heavy tool use can slow training and cause GPU under-utilization. Optimizing rollout efficiency is crucial for scalable, practical agent RL training.
    * Calculated based on a small set of queries.
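
    Stripped of these complications, the end-to-end loop itself is simple: a query goes in, the agent alternates tool calls and observations, and one outcome reward is assigned to the whole trajectory. A toy sketch of that loop follows; ToyEnv and ToyAgent are illustrative stand-ins, not Kimi internals:

```python
# Hypothetical sketch of one end-to-end agentic RL rollout: the model alternates
# tool calls and reasoning, and the whole trajectory receives a single outcome
# reward. ToyEnv/ToyAgent below are toy stand-ins for the real tools and policy.

class ToyEnv:
    def __init__(self, answer="42"):
        self.answer, self.turn = answer, 0

    def reset(self, query):
        self.turn = 0
        return f"query: {query}"

    def step(self, action):
        self.turn += 1
        if action == "answer":
            return self.answer, True          # agent commits to a final answer
        return f"search result #{self.turn}", False

class ToyAgent:
    def __init__(self, search_turns=3):
        self.search_turns, self.seen = search_turns, 0

    def act(self, obs):
        self.seen += 1
        return "search" if self.seen <= self.search_turns else "answer"

def rollout(agent, env, query, max_turns=50):
    """One trajectory: alternate actions and observations, reward at the end."""
    trajectory, obs = [], env.reset(query)
    for _ in range(max_turns):
        action = agent.act(obs)
        obs, done = env.step(action)
        trajectory.append((action, obs))
        if done:
            break
    reward = 1.0 if obs == env.answer else 0.0   # single outcome reward
    return trajectory, reward

traj, r = rollout(ToyAgent(), ToyEnv(), "hard question")
print(len(traj), r)  # 4 1.0
```

    The RL signal only ever sees the final reward, which is exactly why the training-stability and efficiency issues listed above become the hard part.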

    Approach

    Kimi-Researcher is trained via end-to-end reinforcement learning. We observe a consistent improvement in agent performance across different domains. Figure 2-a illustrates the overall training accuracy of Kimi-Researcher throughout the reinforcement learning process. Figure 2-b presents model performance on several internal datasets.

    Training data

    To address the scarcity of high-quality agentic datasets, we engineered our training corpus with two complementary objectives.
    First, we developed a suite of challenging, tool-centric tasks designed to promote robust tool-use learning. These prompts are deliberately constructed such that solving the task requires invoking specific tools—making naive approaches either infeasible or substantially less efficient. By embedding tool dependencies into task design, the agent learns not only when to invoke a tool, but also how to orchestrate tool use effectively in complex, real-world settings. (See Figure 3 for tool invocation rates using these training data.)

    Second, we curated and synthesized reasoning-intensive tasks to reinforce the agent's core cognitive abilities and its capacity to integrate reasoning with tool usage. This component is further subdivided into:

    • Math and Code Reasoning: Tasks that target logical inference, algorithmic problem-solving, and sequential computation. Kimi-Researcher learns to solve such problems with our toolset rather than relying purely on chain-of-thought.
    • Hard Search: Scenarios where the agent must iteratively search, synthesize, and reason within context constraints to derive valid answers. Case studies illustrate how these hard search tasks drive the emergence of deeper planning and robust, tool-augmented reasoning strategies.

    To build this diverse prompt set at scale, we developed a fully automated pipeline capable of generating and validating many question-answer pairs with minimal manual intervention, ensuring both diversity and correctness at unprecedented scale. Ensuring accurate ground truth (GT) is critical for synthetic tasks, so we introduced a robust GT extraction method to guarantee that each question is paired with a reliable answer whenever possible. Additionally, a rigorous filtering funnel removes ambiguous, trivial, or incorrect pairs—with Pass@N checks ensuring only non-trivial questions are retained. Figure 4 shows the effectiveness of our synthetic tasks based on two experimental results.
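
    The Pass@N check in that filtering funnel can be sketched as follows: sample N attempts per candidate question and retain only questions the model solves sometimes but not always, discarding both trivial items and items whose ground truth is likely broken. The threshold values here are illustrative assumptions, not the pipeline's actual settings:

```python
# Hedged sketch of a Pass@N filtering step: keep a synthetic question only if
# its empirical solve rate over n attempts is positive but below a ceiling,
# i.e. answerable yet non-trivial. Thresholds are illustrative assumptions.

def pass_at_n_filter(questions, attempt_fn, n=8, max_solve_rate=0.75):
    """questions: dicts with an 'answer' key; attempt_fn(q) returns one model answer."""
    kept = []
    for q in questions:
        solves = sum(attempt_fn(q) == q["answer"] for _ in range(n))
        rate = solves / n
        if 0.0 < rate <= max_solve_rate:   # non-trivial but answerable
            kept.append(q)
    return kept
```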

    RL training

    The model is primarily trained using the REINFORCE algorithm. We have observed that the following factors contribute to more stable training:

    • On-policy Training: It is critical to generate strictly on-policy data. During training, we disable LLM engine mechanisms such as tool-call format enforcers to ensure each trajectory is generated entirely from the model's own probability distribution.
    • Negative Sample Control: Negative samples lead to a decrease in token probabilities, which increases the risk of entropy collapse during RL training. To address this, we discard some negative samples strategically, allowing the model to continue improving over a longer training period.
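
    As a concrete illustration of the negative-sample control above: before the REINFORCE update, a fraction of non-positive-reward trajectories is dropped. The `keep_neg_ratio` knob and the uniform random drop rule below are assumptions for illustration, not Kimi's exact strategy:

```python
import random

# Hedged sketch: strategically discard some negative samples before the
# REINFORCE update to limit the entropy-collapsing pressure of negative
# gradients. The uniform drop rule here is an illustrative assumption.

def filter_batch(trajectories, keep_neg_ratio=0.5, rng=random):
    """Keep all positive-reward trajectories; keep only a fraction of the rest."""
    positives = [t for t in trajectories if t["reward"] > 0]
    negatives = [t for t in trajectories if t["reward"] <= 0]
    kept_neg = [t for t in negatives if rng.random() < keep_neg_ratio]
    return positives + kept_neg

def reinforce_loss(batch):
    """Plain REINFORCE surrogate: minimize -sum(reward * trajectory log-prob)."""
    return -sum(t["reward"] * t["logprob"] for t in batch)
```

    Setting `keep_neg_ratio` to 0 recovers positive-only training, and 1 recovers vanilla REINFORCE over the full batch.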

    Kimi-Researcher uses outcome rewards for training, aiming to provide a constant preference in a dynamic training environment.

    • Format Reward: The model is penalized for trajectories that include invalid tool calls or that exceed the maximum context length or iteration limit.
    • Correctness Reward: For trajectories without format errors, rewards are based on the comparison between the model's answer and the ground truth.

    To promote efficiency, a gamma-decay factor is applied to correct trajectories. This encourages the model to discover shorter, more efficient exploration. For example, while two correct trajectories may receive equal final rewards, the shorter one earns a higher reward for its initial actions.
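
    One way to read this gamma-decay shaping is per step: the final reward is propagated backward with decay gamma, so the first action of a short correct trajectory is credited more than the first action of a long one. A minimal sketch under that reading (the exact decay rule used in training may differ):

```python
# Hedged sketch of gamma-decay reward shaping: propagate the final outcome
# reward backward with decay, so shorter correct trajectories assign more
# credit to their early actions. gamma=0.95 is an illustrative value.

def step_rewards(final_reward, num_steps, gamma=0.95):
    """Reward for step t is gamma^(num_steps - 1 - t) * final_reward."""
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

short = step_rewards(1.0, num_steps=3)   # first element is 0.95**2, i.e. ~0.90
long = step_rewards(1.0, num_steps=6)    # first element is 0.95**5, i.e. ~0.77
assert short[0] > long[0]  # shorter trajectory: more credit for its first action
```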

    Context management

    A long-horizon research trajectory may involve massive observation contexts, and a naive agent without memory management can easily exceed the context limit within 10 iterations. To address this, we design a context-management mechanism that allows the model to retain important information while discarding unnecessary documents, thereby extending a single rollout trajectory to over 50 iterations. An early ablation study shows that a model trained with context management uses 30% more iterations, which enables it to acquire more information and achieve higher performance.
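
    A minimal sketch of the retain-and-discard idea: keep a compact note for each turn, drop the bulky raw observations of older turns once a token budget is exceeded, and always preserve the most recent turns. Whitespace token counting, the turn dict shape, and the keep-last-k rule are simplifying assumptions, not the actual mechanism:

```python
# Hypothetical sketch of context management: discard old raw documents while
# retaining their compact notes, so a rollout can run far past the naive
# context limit. All names and rules here are illustrative assumptions.

def prune_context(turns, budget_tokens=1000, keep_last=3):
    """turns: list of {'note': str, 'raw': str}; drop oldest raw docs first."""
    def tokens(text):
        return len(text.split())

    pruned = [dict(t) for t in turns]          # shallow copies; input untouched
    total = sum(tokens(t["note"]) + tokens(t["raw"]) for t in pruned)
    for i, t in enumerate(pruned):
        if total <= budget_tokens or i >= len(pruned) - keep_last:
            break
        total -= tokens(t["raw"])
        t["raw"] = ""                          # discard the document, keep the note
    return pruned
```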

    Large-scale agent RL infra

    To address the efficiency and stability challenges of large-scale Agent RL, we have developed a suite of infrastructure with the following key features:

    • Fully asynchronous rollout: We implement a fully asynchronous rollout system with extensible Gym-like interfaces. The server-based architecture efficiently orchestrates actor rollouts, environmental interactions, and reward calculations in parallel. This design significantly outperforms its synchronous counterpart by eliminating resource idle time.
    • Turn-level partial rollout: During agent RL, the majority of tasks complete early, but a small fraction require many more turns. To solve this long-tail problem, we designed a Turn-level Partial Rollout mechanism: tasks that exceed a time budget are saved to a replay buffer, and in subsequent iterations their remaining turns are executed with updated model weights. Combined with adapted algorithms, this mechanism delivers substantial rollout acceleration (at least 1.5x).
    • Robust sandbox environment: Our unified sandbox architecture eliminates inter-container overhead while maintaining isolation. Zero-downtime scheduling with Kubernetes-based hybrid cloud architecture enables dynamic resource allocation. Agent-tool communication via Model Context Protocol (MCP) maintains stateful sessions with reconnection capabilities. Our implementation supports multi-replica deployment, ensuring fault-tolerant operation and high availability in production environments.
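
    The replay-buffer mechanics of turn-level partial rollout can be sketched as follows; the task dict shape, the turn budget, and the function names are illustrative, not the actual implementation:

```python
from collections import deque

# Hypothetical sketch of turn-level partial rollout: a task that exhausts its
# per-iteration turn budget is parked in a replay buffer with its partial
# trajectory and resumed later under updated model weights.

replay_buffer = deque()

def run_with_budget(task, policy, turn_budget=8):
    """Advance a task at most `turn_budget` turns this iteration; park if unfinished."""
    for _ in range(turn_budget):
        if task["turns_done"] >= task["turns_total"]:
            return "finished"
        task["trajectory"].append(policy(task))
        task["turns_done"] += 1
    if task["turns_done"] >= task["turns_total"]:
        return "finished"
    replay_buffer.append(task)               # finish the remaining turns later
    return "parked"
```

    Under this scheme a 12-turn task with an 8-turn budget is parked after one iteration and finished in the next, so long-tail tasks no longer stall the whole batch.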

    Emerging agentic capacities

    During end-to-end reinforcement learning, we observed several notable emergent abilities in Kimi-Researcher. Here are two highlights:

    • When presented with conflicting information from multiple sources, Kimi-Researcher resolves inconsistencies through iterative hypothesis refinement and self-correction.
    • Kimi-Researcher demonstrates caution and rigor: even for seemingly straightforward questions, it deliberately performs additional searches and cross-validates information before answering.

    Use cases

    Kimi-Researcher supports diverse applications including academic research, legal and regulatory insights, obscure information retrieval, clinical evidence review, and corporate financial analysis.

    What's next

    Kimi-Researcher is beginning its gradual rollout to users today. It empowers you to conduct deep, comprehensive research on any topic directly within Kimi. Join the waitlist here.

    It represents the early stage of our broader vision: evolving from a focused search and reasoning agent into a general-purpose agent capable of solving a wide range of complex tasks with an ever-expanding toolkit. To realize this vision, we are expanding the agent's capabilities across both tools and task domains, while also advancing the underlying reinforcement learning infrastructure and algorithms to ensure greater training stability and efficiency.

    To facilitate more research efforts in the field, we are planning on open-sourcing the base pretrained model as well as the reinforcement-learned model underlying Kimi-Researcher in the following months.

    Original source

Related vendors