Hugging Face Release Notes

Last updated: Mar 20, 2026

Hugging Face Products

All Hugging Face Release Notes (25)

  • Mar 18, 2026
    • Date parsed from source:
      Mar 18, 2026
    • First seen by Releasebot:
      Mar 20, 2026

    Hugging Face


    Hugging Face adds Markdown Papers pages and a new AI agent skill for paper search and Hub discovery.

    When AI agents such as Cursor or Claude Code fetch a Hugging Face Papers page, Markdown versions are served automatically, saving tokens and improving efficiency — e.g. huggingface.co/papers/2601.15621.md.

    A new hugging-face-paper-pages skill for AI agents lets agents search papers by title, author, or semantic similarity, read their content, and discover linked models, datasets, and Spaces on the Hub.
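
    For agents that already know a paper's arXiv-style ID, the Markdown variant is just the Papers page URL with a `.md` suffix. A minimal sketch of that convention (the helper name is ours; only the URL pattern comes from the announcement):

    ```python
    # Sketch: derive the token-efficient Markdown variant of a Hugging Face
    # Papers page from a paper ID, per the .md URL convention described above.

    def papers_markdown_url(paper_id: str) -> str:
        """Return the Markdown version of a Hugging Face Papers page URL."""
        return f"https://huggingface.co/papers/{paper_id}.md"

    print(papers_markdown_url("2601.15621"))
    # An agent can then fetch this URL with any HTTP client, e.g.:
    #   import requests
    #   text = requests.get(papers_markdown_url("2601.15621")).text
    ```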

  • Mar 18, 2026

    Hugging Face


    Hugging Face adds a Repositories page in settings to visualize storage use and help users manage repository resources.

    User and organization settings now include a Repositories page to visualize repository storage consumption.

    This update makes it easier to monitor usage, understand how storage is distributed across repositories, and manage resources more effectively.


  • Mar 10, 2026

    Hugging Face


    Hugging Face adds mutable storage Buckets on the Hub for fast, deduplicated uploads and downloads with CLI, API and CDN support.

Buckets bring mutable, non-versioned object storage to the Hub, available to users and organizations under their existing storage plans. Upload training checkpoints, intermediate artifacts, logs, and processed data shards without version control overhead. Manage them from the hf CLI, Python, JavaScript, or the Hub API directly.

    Built on Xet, uploads and downloads are deduplicated and fast. Buckets also support CDN pre-warming to place your data close to compute (AWS and GCP are supported at launch).
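
    Conceptually, deduplicated uploads work by content-addressing chunks so repeated bytes are stored and transferred only once. A toy sketch of that idea (fixed-size chunks and an in-memory store; Xet's real content-defined chunking is more sophisticated):

    ```python
    # Conceptual illustration only: content-addressed chunk deduplication of
    # the kind that makes Bucket uploads fast. This toy uses fixed-size chunks
    # and SHA-256; it is not Xet's actual algorithm.
    import hashlib

    def dedup_chunks(data: bytes, store: dict, chunk_size: int = 4) -> list:
        """Split data into chunks, store each unique chunk once, return hash refs."""
        refs = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # "upload" only if unseen
            refs.append(digest)
        return refs

    store = {}
    refs_a = dedup_chunks(b"checkpoint-epoch-1", store)
    refs_b = dedup_chunks(b"checkpoint-epoch-2", store)  # mostly identical bytes
    # Shared chunks are stored once; only the differing chunk adds storage.
    print(len(store), "unique chunks for", len(refs_a) + len(refs_b), "refs")
    ```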

    Read the blog post and the Storage Buckets Documentation to get started.

  • Mar 5, 2026

    diffusers by Hugging Face

    Diffusers 0.37.0: Modular Diffusers, New image and video pipelines, multiple core library improvements, and more 🔥

diffusers 0.37.0 introduces Modular Diffusers for building pipelines from reusable blocks, and expands image, video, and audio generation with new models such as Z-Image, Flux2 Klein, Qwen Image Layered, LTX-2, and Helios. It also adds new caching methods, context-parallelism backends, and broad bug fixes to the core library.

    Modular Diffusers

    Modular Diffusers introduces a new way to build diffusion pipelines by composing reusable blocks. Instead of writing entire pipelines from scratch, you can now mix and match building blocks to create custom workflows tailored to your specific needs! This complements the existing DiffusionPipeline class, providing a more flexible way to create custom diffusion pipelines.

    Find more details on how to get started with Modular Diffusers here, and also check out the announcement post.
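
    The block-composition idea can be sketched generically: each block transforms a shared state, and a pipeline is just a chain of blocks. This is a conceptual illustration, not the actual Modular Diffusers API; see the linked docs for the real interface:

    ```python
    # Conceptual sketch of composing a pipeline from reusable blocks
    # (not the real ModularPipeline API). Each block is a callable that
    # reads and updates a shared state dict.

    def text_encode_block(state):
        state["embeds"] = f"embeds({state['prompt']})"
        return state

    def denoise_block(state):
        state["latents"] = f"denoised({state['embeds']})"
        return state

    def decode_block(state):
        state["image"] = f"image({state['latents']})"
        return state

    def compose(*blocks):
        """Chain reusable blocks into a custom pipeline."""
        def pipeline(state):
            for block in blocks:
                state = block(state)
            return state
        return pipeline

    # Mix and match blocks to build a custom workflow.
    pipe = compose(text_encode_block, denoise_block, decode_block)
    out = pipe({"prompt": "a photo of a cat"})
    print(out["image"])
    ```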

    New Pipelines and Models

    Image 🌆

• Z Image Omni Base: Z-Image is the foundation model of the Z-Image family, engineered for good quality, robust generative diversity, broad stylistic coverage, and precise prompt adherence. While Z-Image-Turbo is built for speed, Z-Image is a full-capacity, undistilled transformer designed to be the backbone for creators, researchers, and developers who require the highest level of creative freedom. Thanks to @RuoyiDu for contributing this in #12857.
• Flux2 Klein: FLUX.2 [Klein] unifies generation and editing in a single compact architecture, delivering state-of-the-art quality with end-to-end inference in under a second. It is built for applications that require real-time image generation without sacrificing quality, and runs on consumer hardware with as little as 13 GB of VRAM.
    • Qwen Image Layered: Qwen-Image-Layered is a model capable of decomposing an image into multiple RGBA layers. This layered representation unlocks inherent editability: each layer can be independently manipulated without affecting other content. Thanks to @naykun for contributing this in #12853.
• FIBO Edit: Fibo Edit is an 8B parameter image-to-image model that introduces a new paradigm of structured control, operating on JSON inputs paired with source images to enable deterministic and repeatable editing workflows. Featuring native masking for granular precision, it moves beyond simple prompt-based diffusion to offer explicit, interpretable control optimized for production environments. Its lightweight architecture is designed for deep customization, empowering researchers to build specialized "Edit" models for domain-specific tasks while delivering top-tier aesthetic quality. Thanks to @galbria for contributing it in #12930.
• Cosmos Predict2.5: Cosmos-Predict2.5 is the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the future state of the world. Thanks to @miguelmartin75 for contributing it in #12852.
• Cosmos Transfer2.5: Cosmos-Transfer2.5 is a conditional world generation model with adaptive multimodal control that produces high-quality world simulations conditioned on multiple control inputs. These inputs can span different modalities, including edges, blurred video, segmentation maps, and depth maps. Thanks to @miguelmartin75 for contributing it in #13066.
• GLM-Image: GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained detail. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios. Thanks to @zRzRzRzRzRzRzR for contributing it in #12973.
• RAE: Representation Autoencoders (RAEs) are an exciting alternative to the traditional VAEs typically used in latent-space diffusion models for image generation. RAEs leverage pre-trained vision encoders and train lightweight decoders for reconstruction.
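
    The frozen-encoder/trainable-decoder shape of an RAE can be sketched with a fixed random linear encoder and a least-squares decoder. This is a toy stand-in for the pre-trained vision encoders and learned decoders RAEs actually use:

    ```python
    # Toy illustration of the RAE idea: keep a pre-trained (here: fixed
    # random) encoder frozen and fit only a lightweight decoder to
    # reconstruct inputs. Not the actual diffusers implementation.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 8))   # "images" as 8-dim vectors
    E = rng.normal(size=(8, 4))     # frozen encoder: 8 -> 4 latents
    Z = X @ E                       # encode (no encoder training)

    # Train only the decoder: least-squares fit of a 4 -> 8 linear map.
    D, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ D                   # lightweight decoder reconstructs X

    err = float(np.mean((X - X_hat) ** 2))
    print(f"reconstruction MSE: {err:.4f}")
    ```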

    Video + audio 🎥 🎼

    • LTX-2: LTX-2 is an audio-conditioned text-to-video generation model that can generate videos with synced audio. Full and distilled model inference, as well as two-stage inference with spatial sampling, is supported. We also support a conditioning pipeline that allows for passing different conditions (such as images, series of images, etc.). Check out the docs to learn more!
    • Helios: Helios is a 14B video generation model that runs at 17 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching a strong baseline in quality. Thanks to @SHYuanBest for contributing this in #13208.

    Improvements to Core Library

    New caching methods

    • MagCache — thanks to @AlanPonnachan!
    • TaylorSeer — thanks to @toilaluan!

    New context-parallelism (CP) backends

    • Unified Sequence Parallel attention — thanks to @Bissmella!
    • Ulysses Anything Attention — thanks to @DefTruth!

    Misc

    • Mambo-G Guidance: New guider implementation (#12862)
    • Laplace Scheduler for DDPM (#11320)
    • Custom Sigmas in UniPCMultistepScheduler (#12109)
    • MultiControlNet support for SD3 Inpainting (#11251)
    • Context parallel in native flash attention (#12829)
    • NPU Ulysses Attention Support (#12919)
    • Fix Wan 2.1 I2V Context Parallel Inference (#12909)
    • Fix Qwen-Image Context Parallel Inference (#12970)
• Introduction of the apply_lora_scale decorator for simplifying model definitions (#12994)
    • Introduction of pipeline-level “cpu” device_map (#12811)
    • Enable CP for kernels-based attention backends (#12812)
    • Diffusers is fully functional with Transformers V5 (#12976)
    • A lot of the above features/improvements came as part of the MVP program we have been running. Immense thanks to the contributors!

    Bug Fixes

    • Fix QwenImageEditPlus on NPU (#13017)
    • Fix MT5Tokenizer → use T5Tokenizer for Transformers v5.0+ compatibility (#12877)
    • Fix Wan/WanI2V patchification (#13038)
    • Fix LTX-2 inference with num_videos_per_prompt > 1 and CFG (#13121)
    • Fix Flux2 img2img prediction (#12855)
    • Fix QwenImage txt_seq_lens handling (#12702)
    • Fix prefix_token_len bug (#12845)
    • Fix ftfy imports in Wan and SkyReels-V2 (#12314, #13113)
    • Fix is_fsdp determination (#12960)
    • Fix GLM-Image get_image_features API (#13052)
    • Fix Wan 2.2 when either transformer isn't present (#13055)
    • Fix guider issue (#13147)
    • Fix torchao quantizer for new versions (#12901)
    • Fix GGUF for unquantized types with unquantize kernels (#12498)
    • Make Qwen hidden states contiguous for torchao (#13081)
    • Make Flux hidden states contiguous (#13068)
    • Fix Kandinsky 5 hardcoded CUDA autocast (#12814)
    • Fix aiter availability check (#13059)
    • Fix attention mask check for unsupported backends (#12892)
    • Allow prompt and prior_token_ids simultaneously in GlmImagePipeline (#13092)
    • GLM-Image batch support (#13007)
    • Cosmos 2.5 Video2World frame extraction fix (#13018)
    • ResNet: only use contiguous in training mode (#12977)

    All commits

    • [PRX] Improve model compilation by @WaterKnight1998 in #12787
    • Improve docstrings and type hints in scheduling_dpmsolver_singlestep.py by @delmalih in #12798
    • [Modular]z-image by @yiyixuxu in #12808
    • Fix Qwen Edit Plus modular for multi-image input by @sayakpaul in #12601
    • [WIP] Add Flux2 modular by @DN6 in #12763
    • [docs] improve distributed inference cp docs. by @sayakpaul in #12810
    • post release 0.36.0 by @sayakpaul in #12804
    • Update distributed_inference.md to correct syntax by @sayakpaul in #12827
    • [lora] Remove lora docs unneeded and add " # Copied from ..." by @sayakpaul in #12824
    • support CP in native flash attention by @sywangyi in #12829
    • [qwen-image] edit 2511 support by @naykun in #12839
    • fix pytest tests/pipelines/pixart_sigma/test_pixart.py::PixArtSigmaPi… by @sywangyi in #12842
    • Support for control-lora by @lavinal712 in #10686
    • Add support for LongCat-Image by @junqiangwu in #12828
    • fix the prefix_token_len bug by @junqiangwu in #12845
    • extend TorchAoTest::test_model_memory_usage to other platform by @sywangyi in #12768
    • Qwen Image Layered Support by @naykun in #12853
    • Z-Image-Turbo ControlNet by @hlky in #12792
    • Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion by @miguelmartin75 in #12852
    • more update in modular by @yiyixuxu in #12560
    • Feature: Add Mambo-G Guidance as Guider by @MatrixTeam-AI in #12862
    • Add OvisImagePipeline in AUTO_TEXT2IMAGE_PIPELINES_MAPPING by @alvarobartt in #12876
    • Cosmos Predict2.5 14b Conversion by @miguelmartin75 in #12863
    • Use T5Tokenizer instead of MT5Tokenizer (removed in Transformers v5.0+) by @alvarobartt in #12877
    • Add z-image-omni-base implementation by @RuoyiDu in #12857
    • fix torchao quantizer for new torchao versions by @vkuzo in #12901
    • fix Qwen Image Transformer single file loading mapping function to be consistent with other loader APIs by @mbalabanski in #12894
    • Z-Image-Turbo from_single_file fix by @hlky in #12888
    • chore: fix dev version in setup.py by @DefTruth in #12904
    • Community Pipeline: Add z-image differential img2img by @r4inm4ker in #12882
    • Fix typo in src/diffusers/pipelines/cosmos/pipeline_cosmos2_5_predict.py by @miguelmartin75 in #12914
    • Fix wan 2.1 i2v context parallel by @DefTruth in #12909
    • fix the use of device_map in CP docs by @sayakpaul in #12902
    • [core] remove unneeded autoencoder methods when subclassing from AutoencoderMixin by @sayakpaul in #12873
    • Detect 2.0 vs 2.1 ZImageControlNetModel by @hlky in #12861
    • Refactor environment variable assignments in workflow by @paulinebm in #12916
    • Add codeQL workflow by @paulinebm in #12917
    • Delete .github/workflows/codeql.yml by @paulinebm (direct commit on v0.37.0-release)
    • CodeQL workflow for security analysis by @paulinebm (direct commit on v0.37.0-release)
    • Check for attention mask in backends that don't support it by @dxqb in #12892
    • [Flux.1] improve pos embed for ascend npu by computing on npu by @zhangtao0408 in #12897
    • LTX Video 0.9.8 long multi prompt by @yaoqih in #12614
    • Add FSDP option for Flux2 by @leisuzz in #12860
    • Add transformer cache context for SkyReels-V2 pipelines & Update docs by @tolgacangoz in #12837
    • [docs] fix torchao typo. by @sayakpaul in #12883
    • Update wan.md to remove unneeded hfoptions by @sayakpaul in #12890
    • Improve docstrings and type hints in scheduling_edm_euler.py by @delmalih in #12871
    • [Modular] Video for Mellon by @asomoza in #12924
    • Add LTX 2.0 Video Pipelines by @dg845 in #12915
    • Add environment variables to checkout step by @paulinebm in #12927
    • Improve docstrings and type hints in scheduling_consistency_decoder.py by @delmalih in #12928
    • Fix: Remove hardcoded CUDA autocast in Kandinsky 5 to fix import warning by @adi776borate in #12814
    • Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #12865
    • fix the warning torch_dtype is deprecated by @msdsm in #12841
    • [NPU] npu attention enable ulysses by @TmacAaron in #12919
    • Torchao floatx version guard by @howardzhang-cv in #12923
    • Bugfix for dreambooth flux2 img2img2 by @leisuzz in #12825
    • [Modular] qwen refactor by @yiyixuxu in #12872
    • [modular] Tests for custom blocks in modular diffusers by @sayakpaul in #12557
    • [chore] remove controlnet implementations outside controlnet module. by @sayakpaul in #12152
    • [core] Handle progress bar and logging in distributed environments by @sayakpaul in #12806
    • Improve docstrings and type hints in scheduling_consistency_models.py by @delmalih in #12931
    • [Feature] MultiControlNet support for SD3Impainting by @ishan-modi in #11251
    • Laplace Scheduler for DDPM by @gapatron in #11320
    • Store vae.config.scaling_factor to prevent missing attr reference (sdxl advanced dreambooth training script) by @Teriks in #12346
    • Add thread-safe wrappers for components in pipeline (examples/server-async/utils/requestscopedpipeline.py) by @FredyRivera-dev in #12515
    • [Research] Latent Perceptual Loss (LPL) for Stable Diffusion XL by @kashif in #11573
    • Change timestep device to cpu for xla by @bhavya01 in #11501
    • [LoRA] add lora_alpha to sana README by @linoytsaban in #11780
    • Fix wrong param types, docs, and handles noise=None in scale_noise of FlowMatching schedulers by @Promisery in #11669
    • [docs] Remote inference by @stevhliu in #12372
    • Align HunyuanVideoConditionEmbedding with CombinedTimestepGuidanceTextProjEmbeddings by @samutamm in #12316
    • [Fix] syntax in QwenImageEditPlusPipeline by @SahilCarterr in #12371
    • Fix ftfy name error in Wan pipeline by @dsocek in #12314
    • [modular] error early in enable_auto_cpu_offload by @sayakpaul in #12578
    • [ChronoEdit] support multiple loras by @zhangjiewu in #12679
    • fix how is_fsdp is determined by @sayakpaul in #12960
    • [LoRA] add LoRA support to LTX-2 by @sayakpaul in #12933
    • Fix: typo in autoencoder_dc.py by @tvelovraf in #12687
    • [Modular] better docstring by @yiyixuxu in #12932
    • [docs] polish caching docs. by @sayakpaul in #12684
    • Fix typos by @omahs in #12705
    • Fix link to diffedit implementation reference by @JuanFKurucz in #12708
    • Fix QwenImage txt_seq_lens handling by @kashif in #12702
    • Bugfix for flux2 img2img2 prediction by @leisuzz in #12855
    • Add Flag to PeftLoraLoaderMixinTests to Enable/Disable Text Encoder LoRA Tests by @dg845 in #12962
    • Add Unified Sequence Parallel attention by @Bissmella in #12693
    • [Modular] Changes for using WAN I2V by @asomoza in #12959
    • Z rz rz rz rz rz rz r cogview by @sayakpaul in #12973
    • Update distributed_inference.md to reposition sections by @sayakpaul in #12971
    • [chore] make transformers version check stricter for glm image. by @sayakpaul in #12974
    • Remove 8bit device restriction by @SunMarc in #12972
    • disable_mmap in pipeline from_pretrained by @hlky in #12854
    • [Modular] mellon utils by @yiyixuxu in #12978
    • LongCat Image pipeline: Allow offloading/quantization of text_encoder component by @Yahweasel in #12963
    • Add ChromaInpaintPipeline by @hameerabbasi in #12848
    • fix Qwen-Image series context parallel by @DefTruth in #12970
    • Flux2 klein by @yiyixuxu in #12982
    • [modular] fix a bug in mellon param & improve docstrings by @yiyixuxu in #12980
    • add klein docs. by @sayakpaul in #12984
    • LTX 2 Single File Support by @dg845 in #12983
    • [core] gracefully error out when attn-backend x cp combo isn't supported. by @sayakpaul in #12832
    • Improve docstrings and type hints in scheduling_cosine_dpmsolver_multistep.py by @delmalih in #12936
    • [Docs] Replace root CONTRIBUTING.md with symlink to source docs by @delmalih in #12986
    • make style && make quality by @sayakpaul (direct commit on v0.37.0-release)
    • Revert "make style && make quality" by @sayakpaul (direct commit on v0.37.0-release)
    • [chore] make style to push new changes. by @sayakpaul in #12998
    • Fibo edit pipeline by @galbria in #12930
    • Fix variable name in docstring for PeftAdapterMixin.set_adapters by @geekuillaume in #13003
    • Improve docstrings and type hints in scheduling_ddim_cogvideox.py by @delmalih in #12992
    • [scheduler] Support custom sigmas in UniPCMultistepScheduler by @a-r-r-o-w in #12109
    • feat: accelerate longcat-image with regional compile by @lgyStoic in #13019
    • Improve docstrings and type hints in scheduling_ddim_flax.py by @delmalih in #13010
    • Improve docstrings and type hints in scheduling_ddim_inverse.py by @delmalih in #13020
    • fix Dockerfiles for cuda and xformers. by @sayakpaul in #13022
    • Resnet only use contiguous in training mode. by @jiqing-feng in #12977
    • feat: add qkv projection fuse for longcat transformers by @lgyStoic in #13021
    • Improve docstrings and type hints in scheduling_ddim_parallel.py by @delmalih in #13023
    • Improve docstrings and type hints in scheduling_ddpm_flax.py by @delmalih in #13024
    • Improve docstrings and type hints in scheduling_ddpm_parallel.py by @delmalih in #13027
    • Remove pooled_ mentions from Chroma inpaint by @hameerabbasi in #13026
    • Flag Flax schedulers as deprecated by @delmalih in #13031
    • [modular] add auto_docstring & more doc related refactors by @yiyixuxu in #12958
    • Upgrade GitHub Actions to latest versions by @salmanmkc in #12866
    • [From Single File] support from_single_file method for WanAnimateTransformer3DModel by @samadwar in #12691
    • Fix: Cosmos2.5 Video2World frame extraction and add default negative prompt by @adi776borate in #13018
    • [GLM-Image] Add batch support for GlmImagePipeline by @JaredforReal in #13007
    • [Qwen] avoid creating attention masks when there is no padding by @kashif in #12987
    • [modular]support klein by @yiyixuxu in #13002
    • [QwenImage] fix prompt isolation tests by @sayakpaul in #13042
    • fast tok update by @itazap in #13036
    • change to CUDA 12.9. by @sayakpaul in #13045
    • remove torchao autoquant from diffusers docs by @vkuzo in #13048
    • docs: improve docstring scheduling_dpm_cogvideox.py by @delmalih in #13044
    • Fix Wan/WanI2V patchification by @Jayce-Ping in #13038
    • LTX2 distilled checkpoint support by @rootonchair in #12934
    • [wan] fix layerwise upcasting tests on CPU by @sayakpaul in #13039
    • [ci] uniform run times and wheels for pytorch cuda. by @sayakpaul in #13047
    • docs: fix grammar in fp16_safetensors CLI warning by @Olexandr88 in #13040
    • [wan] fix wan 2.2 when either of the transformers isn't present. by @sayakpaul in #13055
    • [bug fix] GLM-Image fit new get_image_features API by @JaredforReal in #13052
    • Fix aiter availability check by @lauri9 in #13059
    • [Modular]add a real quick start guide by @yiyixuxu in #13029
    • feat: support Ulysses Anything Attention by @DefTruth in #12996
    • Refactor Model Tests by @DN6 in #12822
    • [Flux2] Fix LoRA loading for Flux2 Klein by adaptively enumerating transformer blocks by @songkey in #13030
    • [Modular] loader related by @yiyixuxu in #13025
    • [Modular] mellon doc etc by @yiyixuxu in #13051
    • [modular] change the template modular pipeline card by @sayakpaul in #13072
    • Add support for Magcache by @AlanPonnachan in #12744
    • [docs] Fix syntax error in quantization configuration by @sayakpaul in #13076
    • docs: improve docstring scheduling_dpmsolver_multistep_inverse.py by @delmalih in #13083
    • [core] make flux hidden states contiguous by @sayakpaul in #13068
    • [core] make qwen hidden states contiguous to make torchao happy. by @sayakpaul in #13081
    • Feature/zimage inpaint pipeline by @CalamitousFelicitousness in #13006
    • GGUF fix for unquantized types when using unquantize kernels by @dxqb in #12498
    • docs: improve docstring scheduling_dpmsolver_multistep_inverse.py by @delmalih in #13085
    • [modular]simplify components manager doc by @yiyixuxu in #13088
    • ZImageControlNet cfg by @hlky in #13080
    • [Modular] refactor Wan: modular pipelines by task etc by @yiyixuxu in #13063
    • [Modular] guard ModularPipeline.blocks attribute by @yiyixuxu in #13014
    • LTX 2 Improve encode_video by Accepting More Input Types by @dg845 in #13057
    • Z image lora training by @linoytsaban in #13056
    • [modular] add modular tests for Z-Image and Wan by @sayakpaul in #13078
    • [Docs] Add guide for AutoModel with custom code by @DN6 in #13099
    • [SkyReelsV2] Fix ftfy import by @asomoza in #13113
    • [lora] fix non-diffusers lora key handling for flux2 by @sayakpaul in #13119
    • [CI] Refactor Wan Model Tests by @DN6 in #13082
    • docs: improve docstring scheduling_edm_dpmsolver_multistep.py by @delmalih in #13122
    • [Fix]Allow prompt and prior_token_ids to be provided simultaneously in GlmImagePipeline by @JaredforReal in #13092
    • docs: improve docstring scheduling_flow_match_euler_discrete.py by @delmalih in #13127
    • Cosmos Transfer2.5 inference pipeline: general/{seg, depth, blur, edge} by @miguelmartin75 in #13066
    • [modular] add tests for robust model loading. by @sayakpaul in #13120
    • Fix LTX-2 Inference when num_videos_per_prompt > 1 and CFG is Enabled by @dg845 in #13121
    • [CI] Fix setuptools pkg_resources Errors by @dg845 in #13129
    • docs: improve docstring scheduling_flow_match_heun_discrete.py by @delmalih in #13130
    • [CI] Fix setuptools pkg_resources Bug for PR GPU Tests by @dg845 in #13132
    • fix cosmos transformer typing. by @sayakpaul in #13134
    • Sunset Python 3.8 & get rid of explicit typing exports where possible by @sayakpaul in #12524
    • feat: implement apply_lora_scale to remove boilerplate. by @sayakpaul in #12994
    • [docs] fix ltx2 i2v docstring. by @sayakpaul in #13135
    • [Modular] add different pipeine blocks to init by @yiyixuxu in #13145
    • fix MT5Tokenizer by @yiyixuxu in #13146
    • fix guider by @yiyixuxu in #13147
    • [Modular] update doc for ModularPipeline by @yiyixuxu in #13100
    • [Modular] add explicit workflow support by @yiyixuxu in #13028
    • [LTX2] Fix wrong lora mixin by @asomoza in #13144
    • [Pipelines] Remove k-diffusion by @DN6 in #13152
    • [tests] accept recompile_limit from the user in tests by @sayakpaul in #13150
    • [core] support device type device_maps to work with offloading. by @sayakpaul in #12811
    • [Bug] Fix QwenImageEditPlus Series on NPU by @zhangtao0408 in #13017
    • [CI] Add ftfy as a test dependency by @DN6 in #13155
    • docs: improve docstring scheduling_flow_match_lcm.py by @delmalih in #13160
    • [docs] add docs for qwenimagelayered by @stevhliu in #13158
    • Flux2: Tensor tuples can cause issues for checkpointing by @dxqb in #12777
    • [CI] Revert setuptools CI Fix as the Failing Pipelines are Deprecated by @dg845 in #13149
    • Fix ftfy import for PRX Pipeline by @dg845 in #13154
    • [core] Enable CP for kernels-based attention backends by @sayakpaul in #12812
    • remove deps related to test from ci by @sayakpaul in #13164
    • [CI] Fix new LoRAHotswap tests by @DN6 in #13163
    • [gguf][torch.compile time] Convert to plain tensor earlier in dequantize_gguf_tensor by @anijain2305 in #13166
    • Support Flux Klein peft (fal) lora format by @asomoza in #13169
    • Fix T5GemmaEncoder loading for transformers 5.x composite T5GemmaConfig by @DavidBert in #13143
    • Allow Automodel to use from_config with custom code. by @DN6 in #13123
    • Fix AutoModel typing Import Error by @dg845 in #13178
    • migrate to transformers v5 by @sayakpaul in #12976
    • fix: graceful fallback when attention backends fail to import by @sym-bot in #13060
    • [docs] Fix torchrun command argument order in docs by @sayakpaul in #13181
    • [attention backends] use dedicated wrappers from fa3 for cp. by @sayakpaul in #13165
    • Cosmos Transfer2.5 Auto-Regressive Inference Pipeline by @miguelmartin75 in #13114
    • Fix wrong do_classifier_free_guidance threshold in ZImagePipeline by @kirillsst in #13183
    • Fix Flash Attention 3 interface for new FA3 return format by @veeceey in #13173
    • Fix LTX-2 image-to-video generation failure in two stages generation by @Songrui625 in #13187
    • Fixing Kohya loras loading: Flux.1-dev loras with TE ("lora_te1_" prefix) by @christopher5106 in #13188
    • [Modular] update the auto pipeline blocks doc by @yiyixuxu in #13148
    • [tests] consistency tests for modular index by @sayakpaul in #13192
    • [modular] fallback to default_blocks_name when loading base block classes in ModularPipeline by @yiyixuxu in #13193
    • [chore] updates in the pypi publication workflow. by @sayakpaul in #12805
    • [tests] enable cpu offload test in torchao without compilation. by @sayakpaul in #12704
    • remove db utils from benchmarking by @sayakpaul in #13199
    • [AutoModel] Fix bug with subfolders and local model paths when loading custom code by @DN6 in #13197
    • [AutoModel] Allow registering auto_map to model config by @DN6 in #13186
    • [Modular] Save Modular Pipeline weights to Hub by @DN6 in #13168
    • docs: improve docstring scheduling_ipndm.py by @delmalih in #13198
    • Clean up accidental files by @DN6 in #13202
    • [modular]Update model card to include workflow by @yiyixuxu in #13195
    • [modular] not pass trust_remote_code to external repos by @yiyixuxu in #13204
    • [Modular] implement requirements validation for custom blocks by @sayakpaul in #12196
    • cogvideo example: Distribute VAE video encoding across processes in CogVideoX LoRA training by @jiqing-feng in #13207
    • Fix group-offloading bug by @SHYuanBest in #13211
    • Add Helios-14B Video Generation Pipelines by @dg845 in #13208
    • [Z-Image] Fix more do_classifier_free_guidance thresholds by @asomoza in #13212
    • [lora] fix zimage lora conversion to support for more lora. by @sayakpaul in #13209
    • adding lora support to z-image controlnet pipelines by @christopher5106 in #13200
    • Add LTX2 Condition Pipeline by @dg845 in #13058
    • Fix Helios paper link in documentation by @SHYuanBest in #13213
    • [attention backends] change to updated repo and version. by @sayakpaul in #13161
    • feat: implement rae autoencoder. by @Ando233 in #13046
    • Release: v0.37.0-release by @sayakpaul (direct commit on v0.37.0-release)

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    @delmalih

    • Improve docstrings and type hints in scheduling_dpmsolver_singlestep.py (#12798)
    • Improve docstrings and type hints in scheduling_edm_euler.py (#12871)
    • Improve docstrings and type hints in scheduling_consistency_decoder.py (#12928)
    • Improve docstrings and type hints in scheduling_consistency_models.py (#12931)
    • Improve docstrings and type hints in scheduling_cosine_dpmsolver_multistep.py (#12936)
    • [Docs] Replace root CONTRIBUTING.md with symlink to source docs (#12986)
    • Improve docstrings and type hints in scheduling_ddim_cogvideox.py (#12992)
    • Improve docstrings and type hints in scheduling_ddim_flax.py (#13010)
    • Improve docstrings and type hints in scheduling_ddim_inverse.py (#13020)
    • Improve docstrings and type hints in scheduling_ddim_parallel.py (#13023)
    • Improve docstrings and type hints in scheduling_ddpm_flax.py (#13024)
    • Improve docstrings and type hints in scheduling_ddpm_parallel.py (#13027)
    • Flag Flax schedulers as deprecated (#13031)
    • docs: improve docstring scheduling_dpm_cogvideox.py (#13044)
    • docs: improve docstring scheduling_dpmsolver_multistep_inverse.py (#13083)
    • docs: improve docstring scheduling_dpmsolver_multistep_inverse.py (#13085)
    • docs: improve docstring scheduling_edm_dpmsolver_multistep.py (#13122)
    • docs: improve docstring scheduling_flow_match_euler_discrete.py (#13127)
    • docs: improve docstring scheduling_flow_match_heun_discrete.py (#13130)
    • docs: improve docstring scheduling_flow_match_lcm.py (#13160)
    • docs: improve docstring scheduling_ipndm.py (#13198)

    @yiyixuxu

    • [Modular]z-image (#12808)
    • more update in modular (#12560)
    • [Modular] qwen refactor (#12872)
    • [Modular] better docstring (#12932)
    • [Modular] mellon utils (#12978)
    • Flux2 klein (#12982)
    • [modular] fix a bug in mellon param & improve docstrings (#12980)
    • [modular] add auto_docstring & more doc related refactors (#12958)
    • [modular]support klein (#13002)
    • [Modular]add a real quick start guide (#13029)
    • [Modular] loader related (#13025)
    • [Modular] mellon doc etc (#13051)
    • [modular]simplify components manager doc (#13088)
    • [Modular] refactor Wan: modular pipelines by task etc (#13063)
    • [Modular] guard ModularPipeline.blocks attribute (#13014)
    • [Modular] add different pipeine blocks to init (#13145)
    • fix MT5Tokenizer (#13146)
    • fix guider (#13147)
    • [Modular] update doc for ModularPipeline (#13100)
    • [Modular] add explicit workflow support (#13028)
    • [Modular] update the auto pipeline blocks doc (#13148)
    • [modular] fallback to default_blocks_name when loading base block classes in ModularPipeline (#13193)
    • [modular]Update model card to include workflow (#13195)
    • [modular] not pass trust_remote_code to external repos (#13204)

    @sayakpaul

    • Fix Qwen Edit Plus modular for multi-image input (#12601)
    • [docs] improve distributed inference cp docs. (#12810)
    • post release 0.36.0 (#12804)
    • Update distributed_inference.md to correct syntax (#12827)
    • [lora] Remove lora docs unneeded and add " # Copied from ..." (#12824)
    • fix the use of device_map in CP docs (#12902)
    • [core] remove unneeded autoencoder methods when subclassing from AutoencoderMixin (#12873)
    • [docs] fix torchao typo. (#12883)
    • Update wan.md to remove unneeded hfoptions (#12890)
    • [modular] Tests for custom blocks in modular diffusers (#12557)
    • [chore] remove controlnet implementations outside controlnet module. (#12152)
    • [core] Handle progress bar and logging in distributed environments (#12806)
    • [modular] error early in enable_auto_cpu_offload (#12578)
    • fix how is_fsdp is determined (#12960)
    • [LoRA] add LoRA support to LTX-2 (#12933)
    • [docs] polish caching docs. (#12684)
    • Z rz rz rz rz rz rz r cogview (#12973)
    • Update distributed_inference.md to reposition sections (#12971)
    • [chore] make transformers version check stricter for glm image. (#12974)
    • add klein docs. (#12984)
    • [core] gracefully error out when attn-backend x cp combo isn't supported. (#12832)
    • make style && make quality
    • Revert "make style && make quality"
    • [chore] make style to push new changes. (#12998)
    • fix Dockerfiles for cuda and xformers. (#13022)
    • [QwenImage] fix prompt isolation tests (#13042)
    • change to CUDA 12.9. (#13045)
    • [wan] fix layerwise upcasting tests on CPU (#13039)
    • [ci] uniform run times and wheels for pytorch cuda. (#13047)
    • [wan] fix wan 2.2 when either of the transformers isn't present. (#13055)
    • [modular] change the template modular pipeline card (#13072)
    • [docs] Fix syntax error in quantization configuration (#13076)
    • [core] make flux hidden states contiguous (#13068)
    • [core] make qwen hidden states contiguous to make torchao happy. (#13081)
    • [modular] add modular tests for Z-Image and Wan (#13078)
    • [lora] fix non-diffusers lora key handling for flux2 (#13119)
    • [modular] add tests for robust model loading. (#13120)
    • fix cosmos transformer typing. (#13134)
    • Sunset Python 3.8 & get rid of explicit typing exports where possible (#12524)
    • feat: implement apply_lora_scale to remove boilerplate. (#12994)
    • [docs] fix ltx2 i2v docstring. (#13135)
    • [tests] accept recompile_limit from the user in tests (#13150)
    • [core] support device type device_maps to work with offloading. (#12811)
    • [core] Enable CP for kernels-based attention backends (#12812)
    • remove deps related to test from ci (#13164)
    • migrate to transformers v5 (#12976)
    • [docs] Fix torchrun command argument order in docs (#13181)
    • [attention backends] use dedicated wrappers from fa3 for cp. (#13165)
    • [tests] consistency tests for modular index (#13192)
    • [chore] updates in the pypi publication workflow. (#12805)
    • [tests] enable cpu offload test in torchao without compilation. (#12704)
    • remove db utils from benchmarking (#13199)
    • [Modular] implement requirements validation for custom blocks (#12196)
    • [lora] fix zimage lora conversion to support for more lora. (#13209)
    • [attention backends] change to updated repo and version. (#13161)

    Release: v0.37.0-release

    @DN6

    • [WIP] Add Flux2 modular (#12763)
    • Refactor Model Tests (#12822)
    • [Docs] Add guide for AutoModel with custom code (#13099)
    • [CI] Refactor Wan Model Tests (#13082)
    • [Pipelines] Remove k-diffusion (#13152)
    • [CI] Add ftfy as a test dependency (#13155)
    • [CI] Fix new LoRAHotswap tests (#13163)
    • Allow Automodel to use from_config with custom code. (#13123)
    • [AutoModel] Fix bug with subfolders and local model paths when loading custom code (#13197)
    • [AutoModel] Allow registering auto_map to model config (#13186)
    • [Modular] Save Modular Pipeline weights to Hub (#13168)
    • Clean up accidental files (#13202)

    @naykun

    • [qwen-image] edit 2511 support (#12839)
    • Qwen Image Layered Support (#12853)

    @junqiangwu

    • Add support for LongCat-Image (#12828)
    • fix the prefix_token_len bug (#12845)

    @hlky

    • Z-Image-Turbo ControlNet (#12792)
    • Z-Image-Turbo from_single_file fix (#12888)
    • Detect 2.0 vs 2.1 ZImageControlNetModel (#12861)
    • disable_mmap in pipeline from_pretrained (#12854)
    • ZImageControlNet cfg (#13080)

    @miguelmartin75

    • Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion (#12852)
    • Cosmos Predict2.5 14b Conversion (#12863)
    • Fix typo in src/diffusers/pipelines/cosmos/pipeline_cosmos2_5_predict.py (#12914)
    • Cosmos Transfer2.5 inference pipeline: general/{seg, depth, blur, edge} (#13066)
    • Cosmos Transfer2.5 Auto-Regressive Inference Pipeline (#13114)

    @RuoyiDu

    • Add z-image-omni-base implementation (#12857)

    @r4inm4ker

    • Community Pipeline: Add z-image differential img2img (#12882)

    @yaoqih

    • LTX Video 0.9.8 long multi prompt (#12614)

    @dg845

    • Add LTX 2.0 Video Pipelines (#12915)
    • Add Flag to PeftLoraLoaderMixinTests to Enable/Disable Text Encoder LoRA Tests (#12962)
    • LTX 2 Single File Support (#12983)
    • LTX 2 Improve encode_video by Accepting More Input Types (#13057)
    • Fix LTX-2 Inference when num_videos_per_prompt > 1 and CFG is Enabled (#13121)
    • [CI] Fix setuptools pkg_resources Errors (#13129)
    • [CI] Fix setuptools pkg_resources Bug for PR GPU Tests (#13132)
    • [CI] Revert setuptools CI Fix as the Failing Pipelines are Deprecated (#13149)
    • Fix ftfy import for PRX Pipeline (#13154)
    • Fix AutoModel typing Import Error (#13178)
    • Add Helios-14B Video Generation Pipelines (#13208)
    • Add LTX2 Condition Pipeline (#13058)

    @kashif

    • [Research] Latent Perceptual Loss (LPL) for Stable Diffusion XL (#11573)
    • Fix QwenImage txt_seq_lens handling (#12702)
    • [Qwen] avoid creating attention masks when there is no padding (#12987)

    @bhavya01

    • Change timestep device to cpu for xla (#11501)

    @linoytsaban

    • [LoRA] add lora_alpha to sana README (#11780)
    • Z image lora training (#13056)

    @stevhliu

    • [docs] Remote inference (#12372)
    • [docs] add docs for qwenimagelayered (#13158)

    @hameerabbasi

    • Add ChromaInpaintPipeline (#12848)
    • Remove pooled_ mentions from Chroma inpaint (#13026)

    @galbria

    • Fibo edit pipeline (#12930)

    @JaredforReal

    • [GLM-Image] Add batch support for GlmImagePipeline (#13007)
    • [bug fix] GLM-Image fit new get_image_features API (#13052)
    • [Fix]Allow prompt and prior_token_ids to be provided simultaneously in GlmImagePipeline (#13092)

    @rootonchair

    • LTX2 distilled checkpoint support (#12934)

    @AlanPonnachan

    • Add support for Magcache (#12744)

    @CalamitousFelicitousness

    • Feature/zimage inpaint pipeline (#13006)

    @Ando233

    • feat: implement rae autoencoder. (#13046)
  • Mar 4, 2026

    transformers by Hugging Face

    v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2

    transformers adds new multilingual, audio, time-series, and document models, including EuroBERT, VibeVoice ASR, TimesFM 2.5, PP-DocLayoutV2, OLMo Hybrid, ModernVBERT, and Higgs Audio V2, alongside breaking changes, quantization updates, and broad fixes.

    New Model additions

    EuroBERT

    EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.

    Links: Documentation | Paper | Blog Post

    Add eurobert (#39455) by @ArthurZucker in #39455
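The "Llama-like but bidirectional" distinction above comes down to the attention mask: a decoder only looks left, while an encoder like EuroBERT lets every token see the whole sequence. A minimal illustrative sketch (not EuroBERT's actual implementation):

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Decoder-style mask: position i attends only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[bool]]:
    """Encoder-style mask (EuroBERT-like): every position attends to every
    position, so context flows both left-to-right and right-to-left."""
    return [[True] * n for _ in range(n)]

visible_causal = sum(sum(row) for row in causal_mask(4))       # lower triangle: 10 pairs
visible_bidi = sum(sum(row) for row in bidirectional_mask(4))  # full grid: 16 pairs
```

The extra visibility is what makes bidirectional encoders a better fit for retrieval and classification than left-to-right decoders.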

    VibeVoice ASR

    VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.

    Links: Documentation | Paper

    Add VibeVoice ASR (#43625) by @ebezzam in #43625

    TimesFM2.5

    TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.

    Links: Documentation | Paper

    Timesfm 2.5 (#41763) by @kashif in #41763
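Input patching, mentioned above, turns a raw series into fixed-length patches that play the role of tokens. A toy sketch of the idea (the patch length and zero-padding scheme here are illustrative choices, not the model's actual configuration):

```python
def patch_series(series, patch_len):
    """Split a 1-D series into non-overlapping patches, left-padding the
    front with zeros so the length divides evenly (a common convention;
    the real model's padding scheme may differ)."""
    pad = (-len(series)) % patch_len
    padded = [0.0] * pad + list(series)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]

patches = patch_series([1, 2, 3, 4, 5, 6, 7], patch_len=4)
# 7 values -> left-padded to 8 -> 2 patches of length 4
```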

    PP-DocLayoutV2

    PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.

    Links: Documentation

    [Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018
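The pointer network learns the reading order of detected layout elements. As a point of reference for what it replaces, here is a naive top-to-bottom, left-to-right baseline (a hand-written heuristic, not the learned pointer network):

```python
def naive_reading_order(boxes):
    """boxes: list of (x0, y0, x1, y1) layout elements. Returns indices
    sorted top-to-bottom, then left-to-right -- a simple stand-in for
    the learned reading-order prediction described above."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))

boxes = [(50, 300, 200, 340),   # body paragraph
         (50, 10, 400, 60),     # title
         (220, 100, 400, 260),  # right column
         (50, 100, 200, 260)]   # left column
order = naive_reading_order(boxes)
# title first, then left column, right column, body: [1, 3, 2, 0]
```

A learned model is needed precisely because multi-column and nested layouts break simple geometric heuristics like this one.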

    OlmoHybrid

    OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.

    Links: Documentation

    Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358
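The custom cache described above has to hold two different kinds of state: a growing KV cache for full-attention layers and a constant-size recurrent state for linear-attention layers. A schematic sketch of that split (not the actual transformers implementation):

```python
class HybridLayerCache:
    """Toy per-layer cache: attention layers append key/value entries,
    linear-attention layers overwrite a single recurrent state."""
    def __init__(self, layer_types):
        self.layer_types = layer_types  # e.g. ["attn", "linear", "attn"]
        self.kv = {i: [] for i, t in enumerate(layer_types) if t == "attn"}
        self.state = {i: None for i, t in enumerate(layer_types) if t == "linear"}

    def update(self, layer_idx, value):
        if self.layer_types[layer_idx] == "attn":
            self.kv[layer_idx].append(value)  # KV cache grows with the sequence
        else:
            self.state[layer_idx] = value     # recurrent state stays constant-size

cache = HybridLayerCache(["attn", "linear", "attn"])
for step in range(3):
    cache.update(0, ("kv", step))
    cache.update(1, ("state", step))
```

The memory win of the hybrid design is visible here: the attention layer's cache grows with every step, while the linear layer keeps a single state regardless of sequence length.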

    ModernVBert

    ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.

    Links: Documentation | Paper

    Add ModernVBERT models (#42504) by @paultltc in #42504

    ColModernVBert

    ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.

    Links: Documentation | Paper

    Add ModernVBERT models (#42504) by @paultltc in #42504
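The ColPali-style "multi-vector embeddings" scoring mentioned above is late interaction: each query vector takes its best match over all document vectors, and the maxima are summed. A minimal sketch of that scoring rule (toy 2-D vectors for illustration):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColPali-style late interaction: for each query vector, take its
    best dot product against all document vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]  # strong match for the first query vector
doc_b = [[0.0, 1.0], [0.0, 0.9]]  # strong match for the second query vector
score_a = maxsim_score(query, doc_a)  # 1.0 + 0.5 = 1.5
score_b = maxsim_score(query, doc_b)  # 0.0 + 1.0 = 1.0
```

Because documents are pre-encoded into vector sets, only this cheap max-and-sum runs at query time, which is what makes the retrieval efficient.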

    Higgs Audio V2

    Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.

    Links: Documentation

    Add Higgs Audio V2 Model (#40294) by @szhengac in #40294

    Higgs Audio V2 Tokenizer

    The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.

    Links: Documentation

    Add Higgs Audio V2 Model (#40294) by @szhengac in #40294
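The 25 fps / 24 kHz figures above imply a fixed hop of 960 samples per audio token, and halving the frame rate of a 50 fps baseline doubles that hop. A quick sanity check of the arithmetic:

```python
SAMPLE_RATE = 24_000  # Hz, per the description above
FRAME_RATE = 25       # audio tokens per second

hop = SAMPLE_RATE // FRAME_RATE     # samples represented by one token
tokens_per_minute = FRAME_RATE * 60

# hop == 960 samples; a 50 fps baseline at the same sample rate would use
# a 480-sample hop, i.e. twice as many tokens for the same audio.
```

Fewer tokens per second of audio directly shortens the sequences an audio language model has to process.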

    Breaking changes

    Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.

    🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille

    The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.

    🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu

    Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.

    🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1

    3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.

    🚨 Unify 3D position ids (#43972) by @zucchini-nlp

    🚨 Tokenizer x vLLM fixes 🚨 :

    Unigram tokenizers were missing support for the SentencePiece (spm) precompiled charsmap. We ran an overall v4 vs v5 regression test and fixed what we had missed.

    This was done in:

    [vllm + v5 fix] handle TokenizersBackend fallback properly for v5 (#44255) by @itazap

    Generation

    Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.

    [higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]

    fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]

    Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]

    [generate] Always pass full input_ids in prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]

    fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]

    [generate] Completely stop relying on cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]

    Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]

    Tokenization

    Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.

    [tiny] Add olmo_hybrid to tokenizer auto-mapping (#44416) by @tyler-romero in [#44416]

    fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor (#44362) by @harshaljanjani in [#44362]

    update fuyu tokenizer class (#44235) by @itazap in [#44235]

    fix(testing): Fix LayoutXLM tokenization test and LightOnOCR SDPA flash test failures on main CI (#43988) by @harshaljanjani in [#43988]

    [docs] tokenizer summary (#43965) by @stevhliu in [#43965]

    [docs] refactor tokenizer docs (#43900) by @stevhliu in [#43900]

    Kernels

    Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.

    Fix kernels security issue (#44395) by @Cyrilvallez in [#44395]

    Enable Liger Kernel when doing hyperparameter search. (#44329) by @linfeng-du in [#44329]

    [Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]

    [Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]

    Fix percentage formatting in help messages for gradient checkpointing, Liger Kernel, and empty cache steps (#44100) by @qgallouedec in [#44100]

    Quantization

    This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.

    [Quantization] Fixing mxfp4 saving using reverse_op (#43148) by @MekkCyber in [#43148]

    [Quantization] Add metal quantization for MPS devices! (#43934) by @MekkCyber in [#43934]

    Enable mxfp4 model on CPU (#43512) by @jiqing-feng in [#43512]

    Add Four Over Six quantization integration (#43970) by @jackcook in [#43970]

    Vision

    Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.

    [AMD CI] Add missing timm dependency to ROCm Docker image (#44389) by @Abdennacer-Badaoui in [#44389]

    update glm image model expected out for tests (#43907) by @kaixuanliu in [#43907]

    Fix image processors from_dict backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]

    Bugfixes and improvements

    Update PR template (#44415) by @SunMarc in [#44415]

    Add Qwen3.5 support for sequence classification (#44406) by @medhakimbedhief in [#44406]

    update the expected output for qwen2_5_vl w/ pytorch 2.10 XPU (#44426) by @kaixuanliu in [#44426]

    add support for nemotron_3 (#44390) by @liding-nv in [#44390]

    [ Dynamic weight loader] fix remote code when format matches (#44396) by @ArthurZucker in [#44396]

    [timesfm2_5] fix timesfm2.5 loss (#44331) by @kashif in [#44331]

    Fix peft conversion mappings (#44413) by @Cyrilvallez in [#44413]

    Reduce tqdm verbosity during model loading (#44414) by @Cyrilvallez in [#44414]

    docs: Add NeMo Automodel community integration docs (#44304) by @adil-a in [#44304]

    [CB] Small fixes (#44227) by @remi-or in [#44227]

    Support non-gated experts (#44319) by @IlyasMoutawwakil in [#44319]

    [Bugfix] fix qwen3.5 no split module (#44382) by @JJJYmmm in [#44382]

    Fix mutable default arguments and resource leaks (#44287) by @jashshah999 in [#44287]

    skip 2 invalid test cases for voxtral_realtime model (#44321) by @kaixuanliu in [#44321]

    Mamba-1/-2 init weights in mixer class (#43778) by @kevinli573 in [#43778]

    add expectations for xpu for olmo_hybrid model (#44353) by @kaixuanliu in [#44353]

    [VITS] Add speaking_rate as an optionl forward argument (#43283) by @gau-nernst in [#43283]

    Strict export cleanup (#44293) by @IlyasMoutawwakil in [#44293]

    [docs] kernelconfig fix (#44337) by @stevhliu in [#44337]

    Add ProcessingKwargs ImagesKwargs etc. to docs (#44269) by @yonigozlan in [#44269]

    Fix typos in comments and docstrings (#44332) by @tysoncung in [#44332]

    Add testing guide for agents for trainer tests (#44328) by @SunMarc in [#44328]

    Update common tests Trainer (#44260) by @SunMarc in [#44260]

    [timesfm2_5] fix timesfm mlp bias (#44325) by @kashif in [#44325]

    fix zero3 init config (#44236) by @SunMarc in [#44236]

    Update expected output for Jais2 model tests (#43910) by @kaixuanliu in [#43910]

    Improve has_similar_generate_outputs assertions (#44166) by @tarekziade in [#44166]

    Fix failed test case for exaone_moe model (#43938) by @kaixuanliu in [#43938]

    fix(modeling_attn_mask_utils): remove FutureWarning from logger.warning_once() (#44307) by @imstevenpmwork in [#44307]

    Remove remaining vestiges of the TranslationPipeline (#43869) by @Rocketknight1 in [#43869]

    XPU now supports backward for the FA2 fixed path (#43905) by @YangKai0616 in [#43905]

    Fix: use TokenizersBackend for Olmo3 to preserve custom pre_tokenizer (#44294) by @mario-sanz in [#44294]

    Fix special token maps BC (#44281) by @ArthurZucker in [#44281]

    [Modular] Fix file type regression (#44283) by @vasqu in [#44283]

    [auto_docstring] Improve typing parsing and add tests (#43748) by @yonigozlan in [#43748]

    Restore response_schema saving-loading (#44282) by @Rocketknight1 in [#44282]

    Use associative scan HOP mamba recurrentgemma (#43737) by @riccardofelluga in [#43737]

    chore: fixes in Trainer class docs (compute_loss & hyperparameter_search) (#44268) by @ethanknights in [#44268]

    fix(trainer): pass optim_args to SGD, Adagrad, and RMSprop optimizers (#44203) by @nightcityblade in [#44203]

    fix(utils): Make torch_compilable_check compatible with torch.export strict mode (#44266) by @harshaljanjani in [#44266]

    Fix TypeError in convert_rope_params_to_dict when ignore_keys is a list (#44272) by @hangjun-ezra in [#44272]

    [docs] callbacks and collators (#44239) by @stevhliu in [#44239]

    [docs] trainer part 1 (#44185) by @stevhliu in [#44185]

    Remove refs to grouped_entities (#44182) by @Rocketknight1 in [#44182]

    [mimi] nit (#44237) by @eustlb in [#44237]

    Fix local dataset loading priority in run_image_classification_no_tra… (#44199) by @gowthamr-tech in [#44199]

    chore: added CLAUDE.md alias (#44232) by @tarekziade in [#44232]

    fix: add missing return type annotations to type-checking utilities in generic.py (#44241) by @yushiran in [#44241]

    Fix return value - fixes #44238 (#44240) by @tarekziade in [#44240]

    fix regression report_to "all" (#44250) by @SunMarc in [#44250]

    [fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen in [#44078]

    Add processing tests for phi4 multimodal (#44234) by @yonigozlan in [#44234]

    fix: VersionComparison.from_string return type mismatch (#43709) by @tarekziade in [#43709]

    refactor _inner_training_loop to smaller methods (#44041) by @winglian in [#44041]

    [docs] fix broken chat_templating links in tasks docs (#44115) by @Deep-unlearning in [#44115]

    Add missing backtick in AnyToAnyPipeline.call docstring (#44229) by @alvarobartt in [#44229]

    Docs(it): fix typo in sentencepiece install command (#44218) by @matisgagneux21 in [#44218]

    Docs(it): fix typo in docstring wording (#44219) by @matisgagneux21 in [#44219]

    fix bug with position_ids on qwen3-vl models, such that position_ids include text position (#44158) by @leopold-tzafon in [#44158]

    Update 404ing BillSum dataset URL on Summarization Task guide (#44212) by @alexandercarruthers in [#44212]

    fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding (#44187) by @harshaljanjani in [#44187]

    [CB] [Major] Asynchronous batching (#43960) by @remi-or in [#43960]

    Fix LASR feature extractor regression from invalid center argument (#44207) by @ainergiz in [#44207]

    Models with incorrect tokenizer_class in tokenization_config.json tha… (#44179) by @itazap in [#44179]

    chore(typing): initial ty integration (#44167) by @tarekziade in [#44167]

    fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173) by @tarekziade in [#44173]

    [docs] Add Chinese translations for common NLP task tutorials (#44144) by @TinderZ in [#44144]

    [Mimi] Calibrate to ensure encoder streaming performs correctly (#43971) by @caffeinism in [#43971]

    ESM2 attention_mask and token_dropout fix (#44163) by @lhallee in [#44163]

    bring back our demons: clean_up_tokenization_spaces (#44035) by @ArthurZucker in [#44035]

    Fix Seq2SeqTrainingArguments documentation (#35258) by @qgallouedec in [#35258]

    AutoGrad support for grouped_mm fallback (#44152) by @IlyasMoutawwakil in [#44152]

    Patch setitem on ModelOutput even if the parameter was previously None (#44080) by @tomaarsen in [#44080]

    [simple] Fix up repr whitespace/brackets (#44048) by @tomaarsen in [#44048]

    [chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen in [#44051]

    Raise informative error when loading video processors (#44125) by @zucchini-nlp in [#44125]

    fix(flaky): Different approach to make sure loss exists (#43804) by @tarekziade in [#43804]

    [voxtral] fix voxtral proc (#44132) by @eustlb in [#44132]

    [docs] Fix typos in GenerationConfig docstring (#44143) by @nightcityblade in [#44143]

    Fix gemma3n get_audio_features (#44040) by @zucchini-nlp in [#44040]

    Fix UMT5EncoderModel embedding weights not being tied after loading (#43880) by @jiqing-feng in [#43880]

    fix(testing): Update stale device override test in GraniteSpeech (#44113) by @harshaljanjani in [#44113]

    [Misc][vlms] Use text_config when initializing the fine-grained FP8Expert (#44032) by @JJJYmmm in [#44032]

    docs: fix typo 'AuoQuant' → 'AutoQuant' and clarify FINEGRAINED_FP8 library column (#44131) by @cluster2600 in [#44131]

    Update post proc (#44090) by @itazap in [#44090]

    Fix: flaky Kosmos2ModelTest test (#44061) by @tarekziade in [#44061]

    AutoTokenizer ignores config when model_type is None (#44127) by @itazap in [#44127]

    Migrate GPT2 to standardized output capture decorators (#43983) by @Aki-07 in [#43983]

    grouped_mm fallback (#44043) by @IlyasMoutawwakil in [#44043]

    Bump dev version (#44099) by @qgallouedec in [#44099]

    Fix loading logic issue (#44095) by @Cyrilvallez in [#44095]

    [docs] customizing tokenizers (#43929) by @stevhliu in [#43929]

    Merge test_keep_in_fp32_modules and test_keep_in_fp32_modules_strict (#44097) by @Rocketknight1 in [#44097]

    [voxtral-realtime] update runner expected values (#44096) by @eustlb in [#44096]

    Use torch.isfinite (#44069) by @cyyever in [#44069]

    add default flash impl (#44081) by @ArthurZucker in [#44081]

    Remove unused dependencies (#43904) by @cyyever in [#43904]

    Fix patchtsmixer call to post_init (#44082) by @Cyrilvallez in [#44082]

    Fix false positive right-padding warning for decoder-only models in pipeline (#44021) by @ in [#44021]

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    @ArthurZucker

    Add eurobert (#39455)

    [ Dynamic weight loader] fix remote code when format matches (#44396)

    Fix special token maps BC (#44281)

    bring back our demons: clean_up_tokenization_spaces (#44035)

    add default flash impl (#44081)

    @liding-nv

    add support for nemotron_3 (#44390)

    @kashif

    [timesfm2_5] fix timesfm2.5 loss (#44331)

    [timesfm2_5] fix timesfm mlp bias (#44325)

    Timesfm 2.5 (#41763)

    @remi-or

    [CB] Small fixes (#44227)

    [CB] [Major] Asynchronous batching (#43960)

    @ebezzam

    [VibeVoice ASR] Use updated padding cache for ASR model. (#44392)

    Add VibeVoice ASR (#43625)

    @MekkCyber

    [Quantization] Fixing mxfp4 saving using reverse_op (#43148)

    [Quantization] Add metal quantization for MPS devices! (#43934)

    @tarekziade

    perf: Optimize SynthID logits processor batch index construction (#44172)

    Improve has_similar_generate_outputs assertions (#44166)

    fix(flaky): idefics generate cache flake (#44180)

    chore: added CLAUDE.md alias (#44232)

    Fix return value - fixes #44238 (#44240)

    fix: VersionComparison.from_string return type mismatch (#43709)

    fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201)

    chore(typing): initial ty integration (#44167)

    fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173)

    fix(flaky): Different approach to make sure loss exists (#43804)

    Fix: flaky Kosmos2ModelTest test (#44061)

    @zhang-prog

    [Model] Add PP-DocLayoutV2 Model Support (#43018)

    @yanhong-lbh

    Add OLMo Hybrid model (#43358)

    @vasqu

    🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)

    [Modular] Fix file type regression (#44283)

    [Mamba] Fix kernel loading (#44176)

    [Flash Attn] Enable compatible implementations (#44177)

    @jackcook

    Add Four Over Six quantization integration (#43970)

    @winglian

    refactor _inner_training_loop to smaller methods (#44041)

    @paultltc

    Add ModernVBERT models (#42504)

    @TinderZ

    [docs] Add Chinese translations for common NLP task tutorials (#44144)

    @szhengac

    Add Higgs Audio V2 Model (#40294)

  • Feb 26, 2026

    Hugging Face

    Hugging Face adds Public Storage add-ons with Xet deduplication and flexible billing for 1 TB to 50 TB plans.

    Public Storage add-ons are available starting at $12 per TB / month. Storage is powered by Xet deduplication to optimize uploads, downloads, and space usage.

    You can purchase, upgrade, or cancel storage plans (1 TB to 50 TB) from your billing settings.
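At $12 per TB per month, plan cost scales linearly across the 1-50 TB range. A trivial sketch of the arithmetic (illustrative only; actual billing, proration, and plan granularity are defined by the billing settings page):

```python
PRICE_PER_TB_MONTH = 12  # USD, per the announcement above

def monthly_cost(tb: int) -> int:
    """Monthly cost in USD for a public-storage plan of `tb` terabytes."""
    if not 1 <= tb <= 50:
        raise ValueError("plans range from 1 TB to 50 TB")
    return PRICE_PER_TB_MONTH * tb

# monthly_cost(1) == 12; monthly_cost(50) == 600
```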

  • Feb 17, 2026

    transformers by Hugging Face

    v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer

    transformers releases VoxtralRealtime, GLM-5, Qwen3.5 and VibeVoice support, bringing new streaming speech, multimodal and large-scale model additions plus a breaking new attention mask interface and broad bug fixes.

    New Model additions

    VoxtralRealtime

    VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model, which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription, processing audio in chunks as they arrive.

    The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.

    Add Voxtral Realtime (#43769) by @eustlb
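The "causal convolutions with padding caches" mentioned above are what let chunked streaming reproduce offline results: each chunk carries over the last few inputs as left-padding for the next. A schematic 1-D sketch of that mechanism (not the model's actual convolution stack):

```python
def causal_conv(xs, kernel, state=None):
    """1-D causal convolution with an explicit left-padding cache.
    `state` holds the last len(kernel)-1 inputs from the previous chunk,
    so chunked processing matches full-sequence processing exactly."""
    k = len(kernel)
    if state is None:
        state = [0.0] * (k - 1)          # zero-pad the very first chunk
    buf = list(state) + list(xs)
    out = [sum(kernel[j] * buf[i + j] for j in range(k)) for i in range(len(xs))]
    return out, buf[-(k - 1):]           # new cache: trailing inputs for next chunk

kernel = [0.5, 0.3, 0.2]
audio = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

full, _ = causal_conv(audio, kernel)          # offline: whole sequence at once
a, st = causal_conv(audio[:3], kernel)        # streaming chunk 1
b, _ = causal_conv(audio[3:], kernel, st)     # streaming chunk 2 reuses the cache
# a + b matches full: chunked output equals offline output
```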

    GLM-5 - GlmMoeDsa

    The Z.ai team launches GLM-5 and introduces it as follows:

    GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

    Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

    Add GlmMoeDsa (#43858) by @Cyrilvallez
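The cost reduction from sparse attention comes from each query attending to only a small subset of keys instead of the full context. A toy top-k selection sketch of that idea (schematic; the actual DSA design involves a learned indexer and considerably more machinery):

```python
def sparse_topk_attention_indices(scores, k):
    """For each query row of relevance scores, keep only the k
    highest-scoring key positions -- the basic cost-reduction idea
    behind sparse attention on long contexts."""
    return [sorted(sorted(range(len(row)), key=lambda j: -row[j])[:k])
            for row in scores]

scores = [[0.9, 0.1, 0.4, 0.7],   # query 0's relevance to 4 keys
          [0.2, 0.8, 0.6, 0.1]]   # query 1's relevance to 4 keys
kept = sparse_topk_attention_indices(scores, k=2)
# kept == [[0, 3], [1, 2]]: each query computes attention over 2 keys, not 4
```

With k fixed, per-query attention cost stops growing with context length, which is how long-context capacity is preserved at lower deployment cost.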

    Qwen3.5, Qwen3.5 Moe

    The Qwen team launches Qwen3.5 and introduces it as follows:

    We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.

    Adding Support for Qwen3.5 (#43830) by @bozheng-hit

    VibeVoice Acoustic Tokenizer

    VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.

    One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.

    Add VibeVoice Acoustic Tokenizer (#43400) by @ebezzam

    Breaking changes

    🚨 [Attn] New attn mask interface everywhere (#42848)

    🚨 Modify ModernBERT's default attention implementation to stop using FA (#43764)

    🚨 This one is quite breaking for super super super old models: 🚨 🚨

    fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791)

    If the config does not have a model_type field, we no longer infer the model type from the folder name, as was previously done for checkpoints like https://huggingface.co/prajjwal1/bert-tiny/blob/main/config.json
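The behavior can be illustrated with a plain sketch. The helper below is hypothetical, not the actual AutoTokenizer code: after this fix, only an explicit model_type field in the config determines the model type, with no fallback to folder-name matching.

```python
import json

def resolve_model_type(config_json, folder_name):
    """Hypothetical helper, not the actual AutoTokenizer code: only an
    explicit model_type field in the config determines the model type."""
    config = json.loads(config_json)
    # Previously, a folder name such as "bert-tiny" could be matched by
    # substring against known model types; after this fix it is ignored.
    return config.get("model_type")

# Config that names its architecture explicitly:
print(resolve_model_type('{"model_type": "bert"}', "bert-tiny"))  # bert
# Config without model_type: no fallback to the folder name anymore:
print(resolve_model_type('{"hidden_size": 128}', "bert-tiny"))    # None
```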

    Bugfixes and improvements

    • [docs] deploying (#43241) by @stevhliu
    • [Trainer] Move NEFTune impl to standalone functions (#43714) by @SunMarc
    • Fix convert_rope_params_to_dict so it uses rope_theta from the config (#43766) by @hmellor
    • Bump dev version (#43777) by @qgallouedec
    • Improved AGENTS.md (#43763) by @tarekziade
    • Fix-release-ubild (#43773) by @ArthurZucker
    • unpin torch for CircleCI (#43790) by @ydshieh
    • [Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
    • fix(testing): Fix BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests (#43798) by @harshaljanjani
    • Remove unconditional train_batch_size assignment (#43770) by @lordaarush
    • [Repo Consistency] Fix rms norm (#43803) by @vasqu
    • fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791) by @tarekziade
    • Refactor trainer data_collator and callbacks tests (#43776) by @SunMarc
    • [core] Faster and thread-safe check_model_inputs implementation (#43765) by @Cyrilvallez
    • [Trainer] use deepspeed SP process group when Accelerate doesn’t build a mesh (#43799) by @kashif
    • fix(flaky): enforce manual seed to reduce flakiness (#43794) by @tarekziade
    • Add TRL CI bot workflow to trigger tests on PR comments (#43809) by @qgallouedec
    • Fix DeepSpeed model preparation logic in Trainer class (#43780) by @qgallouedec
    • [docs] reveal more in toctree (#43808) by @stevhliu
    • Fix markdown documentation (#43076) by @cyyever
    • Fix slack-report workflow file (#43851) by @ydshieh
    • add do_sample=False to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
    • Fix incorrect timestamp calculation in Qwen3VL Processor (#43659) by @jonathan-fulton
    • Remove GPU tracking from TrackioCallback and remove env var support (#43371) by @qgallouedec
    • Add id and resume support to SwanLab integration (#43719) by @i-pj
    • fix gptoss crash in tp (#43853) by @sywangyi
    • Delete batch_split from EncoderDecoderCache (#43814) by @cyyever
    • delete unnecessary code to make moe compatible to full graph compile (#43855) by @kaixuanliu
    • Update ModelType for Unigram tokenizer (#43860) by @pavel-esir
    • [docs] Remove pipeline() examples from summarization/translation tasks (#43831) by @Mr-Neutr0n
    • Fix video interpolation in pe_audio_video (#43811) by @Rocketknight1
    • Look for the pad_token_id in the right place for Llama4 (#43539) by @Rocketknight1
    • Fix cardinality error for DETR models without explicit background class (#43513) by @heathdutton
    • docs: Add Switch Transformers docstring notes and update spectrogram comment (#43336) by @harshaljanjani
    • [xLSTM] Fix bugs preventing small model training (#43209) by @Anri-Lombard
    • docs: correct typo 'neccessary' to 'necessary' (#43868) by @thecaptain789
    • Improve PR comment CI feedback (#43852) by @ydshieh
    • Fix init weights in remote code (#43768) by @zucchini-nlp
    • Fix GlmMoeDsaConfig default mlp_layer_types in modular conversion (#43876) by @OiPunk
    • [MistralCommonBackend] fix loading proc (#43887) by @eustlb
    • [Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
    • Fix SwanLab callback to forward resume init args (#43848) by @OiPunk
    • Fix old tech stack in doc (#43879) by @cyyever
    • Update TrainingArguments (#43806) by @SunMarc
    • Remove unnecessary code or checks for PT 2.4+ (#43787) by @cyyever
    • Make it possible to evaluate when using sequence parallel in HF Trainer (#43517) by @jp1924
    • [Trainer] Move optimizer cls init to trainer_optimizer.py (#43738) by @SunMarc
    • fix the error of tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py::Fb… (#43547) by @sywangyi
    • fix fbgemm fp8 multi-device load failure. (#43581) by @sywangyi
    • Refactor trainer init (#43807) by @SunMarc
    • [fix] Use last_hidden_state key from get_image_features for llama4 (#43882) by @tomaarsen
    • [Docs] Add docs for GLM-OCR and fix EomT-DINOv3 (#43710) by @NielsRogge
    • Update hub metadata (#43892) by @zucchini-nlp
    • [fix] DAC model: Apply STE in Dac.from_latents to match the forward pass (#43820) by @harshaljanjani
    • Separate check_model_inputs into capture_outputs and merge_with_config_defaults + ensure correctness (#43862) by @Cyrilvallez
    • Remove mask slicing in all eager attentions (#42186) by @Cyrilvallez
    • Fix expected DAC outputs due to (old) change in CI settings. (#43896) by @ebezzam
    • Minor changes trainer (#43744) by @SunMarc
    • adding BC for custom toks accessing slow tok attrs deprecated in v5 (#43898) by @itazap
    • Fix typo in quantization_operations in PEFT integrations (#43821) by @redpanda1995
    • Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753) by @cyyever
    • Decorate cache updates with no_grad, just in case (#43897) by @Rocketknight1
    • revert place_model_on_device to property (#43895) by @SunMarc
    • Train sampler unification (#43138) by @jiosephlee
    • fix(moe): Handle dtype mismatch in torch._grouped_mm with autocast (#43839) by @Mr-Neutr0n
    • Fix missing fast image patch counter in Glm46V (#43877) by @OiPunk
    • Fix old tech stack in doc (#43902) by @cyyever
    • Move _keys_to_ignore_on_load_missing for now (#43893) by @ArthurZucker
    • Changes to cache_utils should trigger all tests all the time (#43920) by @Cyrilvallez
    • Ernie4 5 vl moe (#43755) by @kaixuanliu
    • Harmonize input_embeds to inputs_embeds everywhere (#43916) by @Cyrilvallez
    • fix: TextClassificationPipeline docs mentioning deprecated return_all_scores (#43903) by @math-hiyoko
    • Revert #43897 (#43923) by @Rocketknight1
    • Fix AttributeError in OwlViT conversion script for Python 3.10+ (#43922) by @DimiChatzipavlis
    • add openAI style image_url content support in apply_chat_template (#43786) by @kaixuanliu
    • Prepare and keep track of position ids in generate (#43734) by @zucchini-nlp
    • Fix lifted_tensor in Gemma3n export which dynamo can't reason about (#43801) by @robell
    • Fix bark test (#43942) by @Cyrilvallez
    • Fix docker files (#43946) by @ydshieh
    • Fix flaky test for multimodal LLMs (#43944) by @Rocketknight1
    • Add explicit utf-8 encoding to CircleCI scripts for Windows compatibility (#43925) by @
    • Modernize string formatting (f-strings) in conversion scripts (#43943) by @
    • Fix weight decay exclusions in run_*_no-trainer.py examples (#42769) by @casinca
    • fix: Better weight decay exclusion in run_*_no-trainer.py examples (#43947) by @casinca
    • Timm backbone saves and loads out_features (#43886) by @zucchini-nlp
    • Fix qwen-vl position ids when generating several times (#43952) by @zucchini-nlp
    • Fix get_number_of_image_tokens (#43948) by @zucchini-nlp
    • Fix typos in docstrings, comments, and error messages (#43949) by @
    • Fix LASR test layerdrop issue (#43954) by @Rocketknight1
    • [kernels] fix kernel versions (#43955) by @MekkCyber
    • [Doc tests] Fix bug (#43729) by @NielsRogge
    • fix(models): Preserve custom token IDs through DiaConfig save and load (#43928) by @harshaljanjani
    • update somes audio models (#43865) by @Deep-unlearning
    • Improve memory allocator during loading (#43945) by @Cyrilvallez
    • Inclusion of process_group in the gather_full_tensor function in tensor_parallel.py (#43932) by @quic-meetkuma
    • Fix sync gradient (#43919) by @SunMarc
    • Reorder Trainer methods (#43914) by @SunMarc
    • Fix TypeError in dot_natural_key when state_dict keys have mixed types at same position (#43966) by @shtse8
    • Enhance JSON schema generation to support instance, static, and class methods (#43968) by @qgallouedec
    • Remove unused squeeze from VJEPA2 embeddings rotation (#43984) by @materight
    • Improve new failing test analysis for PR comment CI (#44033) by @ydshieh
    • Remove other_workflow_run_ids for issue_comment in utils/notification_service.py (#44036) by @ydshieh
    • stable grouped_mm API (#43977) by @IlyasMoutawwakil
    • create .git-blame-ignore-revs file (#43982) by @SunMarc
    • docs: fix typos across documentation files (#43993) by @saurav0369
    • update python requirement to 3.10+ to match codebase (#44009) by @mariam851
    • Improve use of torch.is_autocast_enabled (#43930) by @cyyever
    • Use torch.xlogy (#44006) by @cyyever
    • [Deespeed] fix WeightConverter.convert() use (#43926) by @kashif
    • Reduce reduce CUDA sync (#44005) by @cyyever
    • split out accelerator args builder method (#43987) by @winglian
    • SINQ quantization strategy integration (adapted for Transformers V5) (#43112) by @ChiaraBoretti
    • fix(models): Unpack BitNet packed weights to fix CI failure (#43721) by @harshaljanjani

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    • @ChiaraBoretti
      • SINQ quantization strategy integration (adapted for Transformers V5) (#43112)
    • @cyyever
      • Reduce reduce CUDA sync (#44005)
      • Use torch.xlogy (#44006)
      • Improve use of torch.is_autocast_enabled (#43930)
      • Fix old tech stack in doc (#43902)
      • Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753)
      • Remove unnecessary code or checks for PT 2.4+ (#43787)
      • Fix old tech stack in doc (#43879)
      • Delete batch_split from EncoderDecoderCache (#43814)
      • Fix markdown documentation (#43076)
    • @eustlb
      • Add Voxtral Realtime (#43769)
      • [MistralCommonBackend] fix loading proc (#43887)
    • @ebezzam
      • Fix expected DAC outputs due to (old) change in CI settings. (#43896)
      • Add VibeVoice Acoustic Tokenizer (#43400)
    • @vasqu
      • [Jamba] Fallback to slow path and warn instead of error out (#43889)
      • 🚨 [Attn] New attn mask interface everywhere (#42848)
      • [Repo Consistency] Fix rms norm (#43803)
      • [Modular Dependencies] Fixup qwen rms norms (#43772)
    • @bozheng-hit
      • Adding Support for Qwen3.5 (#43830)
    Original source Report a problem
  • Feb 10, 2026
    • Date parsed from source:
      Feb 10, 2026
    • First seen by Releasebot:
      Mar 20, 2026
    Hugging Face logo

    Hugging Face

    Feb 10, 26

    Hugging Face expands full-text search with organization, user, and repository filters in the UI and API.

    Full-text search now supports filtering by organizations, users, or specific repositories, available both in the UI and through the API. Combine multiple filters using OR logic to refine your results, and share searches easily with filters persisted directly in the URL.

    Original source Report a problem
  • Feb 5, 2026
    • Date parsed from source:
      Feb 5, 2026
    • First seen by Releasebot:
      Mar 20, 2026
    Hugging Face logo

    transformers by Hugging Face

    v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR

    transformers adds new model support for EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, and GLM-OCR, while also shipping broad bug fixes, generation cache updates, MoE and XPU improvements, and multiple breaking refactors to keep the library evolving.

    New Model additions

    EXAONE-MoE

    K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.

    Add EXAONE-MoE implementations (#43080) by @nuxlear

    PP-DocLayoutV3

    PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.

    [Model] Add PP-DocLayoutV3 Model Support (#43098) by @zhang-prog

    Youtu-LLM

    Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in terms of Commonsense, STEM, Coding and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end2end agent tasks.

    Add Youtu-LLM model (#43166) by @LuJunru

    GlmOcr

    GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

    [GLM-OCR] GLM-OCR Support (#43391) by @zRzRzRzRzRzRzR

    Breaking changes

    🚨 T5Gemma2 model structure (#43633) - Ensures the attention implementation is set on all sub-configs. The config.encoder.text_config was not getting its attention implementation set because it isn't passed to PreTrainedModel.__init__. Since the model structure can't change without breaking, a call to self.adjust_attn_implementation was manually re-added in the modeling code.

    🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation so that sliding window configurations are properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.
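The effect of respecting a sliding window can be sketched independently of the library's actual cache classes (the class below is illustrative only, not the transformers Cache API): a window-aware cache keeps at most `sliding_window` past positions, while a cache built without the model config would silently keep everything.

```python
from collections import deque

class SlidingWindowCache:
    """Illustrative sketch, not the transformers Cache API: keep only
    the last `sliding_window` key/value positions."""
    def __init__(self, sliding_window: int):
        self.window = deque(maxlen=sliding_window)

    def update(self, position: int) -> None:
        self.window.append(position)

    def cached_positions(self) -> list:
        return list(self.window)

cache = SlidingWindowCache(sliding_window=4)
for pos in range(10):
    cache.update(pos)

# Only the last 4 positions remain; a cache created without the model
# config would have kept all 10, ignoring the window limit.
print(cache.cached_positions())  # [6, 7, 8, 9]
```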

    🚨 Delete duplicate code in backbone utils (#43323) - This PR cleans up backbone utilities. Specifically, we currently have 5 different config attributes to decide which backbone to load, most of which are redundant and can be merged into one.

    After this PR, we'll have only one config.backbone_config as a single source of truth. The models will load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.

    🚨 Refactor DETR to updated standards (#41549) - standardizes the DETR model to be closer to other vision models in the library.

    🚨 Fix floating-point precision in JanusImageProcessor resize (#43187) - replaces an int() with round(); expect slight numerical differences.
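The difference is easy to see in isolation: int() truncates toward zero, while round() goes to the nearest integer, so a resized dimension computed from a scale factor can shift by one pixel. The numbers below are illustrative, not taken from JanusImageProcessor.

```python
# Truncation vs rounding when computing a resized dimension
# (illustrative values, not the processor's actual defaults):
scale = 384 / 1000
height = 999

print(int(height * scale))    # truncates: 383
print(round(height * scale))  # rounds to nearest: 384
```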

    🚨 Remove deprecated AnnotionFormat (#42983) - removes a misnamed class in favour of AnnotationFormat.

    Bugfixes and improvements

    fix(models): Migrate legacy segmentation_indices to out_indices in BeitConfig (#43505) by @harshaljanjani

    [docs] Update torch version (#42135) by @stevhliu

    Remove SDPA workarounds for torch 2.4+ (#43754) by @cyyever

    add use_deterministic to guarantee the consistency for youtu-llm model (#43759) by @kaixuanliu

    fix: add compatible_model_types to suppress model type mismatch warnings (#43495) by @leoneperdigao

    Fix T5 v1.1 detection (#43681) by @githubnemo

    Add moonshine streaming (#43702) by @eustlb

    Allow bi-directional attention for all models (#43705) by @Cyrilvallez

    Docs: fix Training step by removing tokenizer from trainer initialization (#43733) by @nesjett

    Fix scheduler initialization order (#43711) by @SunMarc

    Fix accelerate integration import (#43732) by @SunMarc

    Update torch minimum version to 2.4 (#41307) by @cyyever

    Fix dtype in image-text-to-text pipe (#43731) by @zucchini-nlp

    Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 (#43574) by @jp1924

    fix: AttributeError for Qwen3_omni_moe (#43593) by @Vallabh-1504

    Improve typing/explanations for general model properties (#43712) by @Cyrilvallez

    [Kernels] kernel migration updates for activation kernels (#43518) by @ariG23498

    [feat] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen

    Added S110 - try-except-pass rule (#43687) by @tarekziade

    [docs] benchmarks (#43694) by @stevhliu

    fix norm_eps dtype (#43669) by @fschlatt

    Llava onevision: output align for tests and add image_sizes input param (#43678) by @kaixuanliu

    Fix CLIPOutput attentions not being returned (#43657) by @jonathan-fulton

    [Attn] Fixup interface usage after refactor (#43706) by @vasqu

    Fix model/processor mismatch in SigLIP2 quantization example (#43652) by @jonathan-fulton

    Fix crash of custom models in Notebook or Repl (#43690) by @Cyrilvallez

    Simplify TrainingArguments docstring (#43568) by @SunMarc

    Composite model inherit automatically all important properties from their children (#43691) by @Cyrilvallez

    Update configuration_qwen3.py (#43703) by @francesco-bertolotti

    fix gptoss tp crash (#43695) by @sywangyi

    [CB] Keep order of incoming requests (#43626) by @remi-or

    Fix Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!) (#43473) by @xenova

    Remove num_frames in ASR pipeline (#43546) by @jiqing-feng

    remove ipex and ccl for xpu and cpu (#42852) by @yao-matrix

    update guide with new attr name for toks (#43689) by @itazap

    Docs: fix typos in Get started (index, quicktour) (#43666) by @CodeByKodi

    the cache class is deprecated by @vasqu (direct commit on main)

    custom tok init fix (#43591) by @itazap

    More export friendly rewrites and skipping the failing ones (#43436) by @IlyasMoutawwakil

    Cast byte_count to int in caching_allocator_warmup for MPS compatibility (#43608) by @tobyliu2004

    [Docs] Complete missing Llama4 configuration docs (#43460) by @udaymehta

    Fix t5 failures (#43374) by @Abdennacer-Badaoui

    Add EoMT with DINOv3 backbone (#41212) by @NielsRogge

    Update DBRX docs to reference re-uploaded checkpoint (#43196) by @qgallouedec

    [loading] Fix forced upcasting to fp32 (#43683) by @Cyrilvallez

    Fix FP8Expert for Qwen (#43670) by @yiliu30

    Simplify loading structure (#43589) by @Cyrilvallez

    [CB] Refactor logic for inputs and outputs outside of the main API (#43569) by @remi-or

    Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675) by @tarekziade

    Fix FP8Expert for DeepSeek R1 (#43616) by @yiliu30

    Use correct sampling rate in chat template (#43674) by @zucchini-nlp

    [HunYuan] Fix RoPE init (#43411) by @vasqu

    XPU now supports MoE kernel(MegaBlocks) implementation (#43435) by @YangKai0616

    [Sam] Fixup training flags (#43567) by @vasqu

    remove torchao.autoquant from transformers (#43561) by @vkuzo

    [DeepSpeed] properly handle MoE weight conversion (#43524) by @kashif

    Tie zamba weights correctly (#43623) by @zucchini-nlp

    [kernels] Centralize kernels tests (#42819) by @MekkCyber

    Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662) by @ydshieh

    Fix KeyError in check_bad_commit.py (#43655) by @ydshieh

    [Benchmark] Minor fix for benchmark: kernel is not correctly called (#43428) by @sywangyi

    Add explicit commit info to PR comment CI feedback (#43635) by @ydshieh

    Better new failures reporting for PR comment CI (#43629) by @ydshieh

    [docs] serving (#42853) by @stevhliu

    add XPU expected output for MixedInt8GPT2Test (#43615) by @kaixuanliu

    Don't modify mappings in tests (#43634) by @Rocketknight1

    Allow Attention and Experts to be used as standalone modules (#43622) by @Cyrilvallez

    Don't modify tied_weight_keys in-place (#43619) by @zucchini-nlp

    [Rope] Revert #43410 and make inheritance implicit again (#43620) by @vasqu

    [vllm compat] Separate renaming from conversion ops (#43621) by @Cyrilvallez

    refactor + robusts tests for Tensor Parallel (#42809) by @3outeille

    add contiguous operation for diffllama model for xpu to enable compile mode. (#43614) by @kaixuanliu

    add xpu expectation for lw_detr model (#43339) by @kaixuanliu

    minimax_m2: fix failed test case for XPU (#43324) by @kaixuanliu

    Improve new failures reporting (#43628) by @ydshieh

    Fix extras on all supported Python versions (#43490) by @tarekziade

    fix(models): Fix suno/bark-small CPU offload device mismatch causing CI failures (#43607) by @harshaljanjani

    [CB] [Serve] Fix broken serve tests (#43594) by @remi-or

    Docs: fix typo in weight converter guide (#43610) by @KOKOSde

    [MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583) by @YangKai0616

    Fixes configuration default values (#43592) by @zucchini-nlp

    Fix make_batched_video with 5D arrays (#43486) by @zucchini-nlp

    Operation Green CI II (#43537) by @Rocketknight1

    enable cpu paged cache (#42869) by @jiqing-feng

    Qwen3 omni - fix get video features (#43588) by @zucchini-nlp

    [GLM-Image] Add batch > 1 support and fix configuration defaults (#43342) by @JaredforReal

    [Model] Refactor modernbert with the attention interface (#43030) by @YangKai0616

    Regex post processing in loading (#43585) by @Cyrilvallez

    simplify extra tokens logic in base (#43230) by @itazap

    Add XPU support to the tests for solar_open (#43579) by @YangKai0616

    remove FbgemmFp8LinearTest (#43545) by @sywangyi

    Increase default ReadTimeout in tests (#43586) by @Wauplin

    Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584) by @ydshieh

    [CI][AMD] Fix Pipeline CI (#43178) by @Abdennacer-Badaoui

    fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557) by @tarekziade

    Improve GPU monitoring: switch to multiprocessing and use amdsmi for AMD GPUs (#43552) by @Abdennacer-Badaoui

    Update test of Youtu-LLM to pr-aligned repos (#43578) by @LuJunru

    Rework dependencies and extras + Remove outdated templates folder (#43536) by @Cyrilvallez

    Fix repo. consistency bot (push permission issue) (#43570) by @ydshieh

    Fix Wav2vec and a few others (#43566) by @Cyrilvallez

    [Modular] Allow to add new bases that are not present in the inherited class (#43556) by @vasqu

    add an option to disable Sam3VideoModel progress bar (#43564) by @ndeybach

    check/fix repo. check bot workflow (#43565) by @ydshieh

    Increase timeout when preparing CI (#43560) by @Rocketknight1

    43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults (#43101) by @vaibhav-research

    check PR bot permission - part 3 (try content attribute) (#43555) by @ydshieh

    check PR bot permission - part 2 (style only) (#43554) by @ydshieh

    check PR bot permission - part 1 (#43553) by @ydshieh

    Fix failing tests due to no attribute pad_token_id (#43453) by @Sai-Suraj-27

    fix: GPT OSS Conversion Script Enhancements (#42901) by @KyleMylonakisProtopia

    [Quantization] Fix triton_kernels name after being renamed to gpt-oss-triton-kernels (#43528) by @MekkCyber

    [Quantization] Add cutlass kernel for FP8 (#43304) by @MekkCyber

    [CB] Minor perf improvements and ty compatibility (#43521) by @remi-or

    Fix tiles mixing for batched input, add tie_word_embeddings to LFM2VL config (#43379) by @ankke

    fix: return labels instead of label in reduce_label method in BeitImageProcessorFast (#43527) by @sbucaille

    [RoPE] Make explicit inheritance (#43410) by @vasqu

    Fix for #43530 (#43535) by @Rocketknight1

    Operation Green CI (#43530) by @Rocketknight1

    Tie the weights even if initializing from a config on meta device (#43523) by @Cyrilvallez

    [kernels] Update cv_utils name (#43529) by @MekkCyber

    add trackio to training notebooks (#43442) by @merveenoyan

    Mark test_prompt_lookup_decoding as flaky (#42184) by @Rocketknight1

    Fix some MoE routers (#43445) by @IlyasMoutawwakil

    batched_mm is slow on cpu (#43438) by @IlyasMoutawwakil

    fix: initialize BatchNorm2d buffers only when needed (#43520) by @tarekziade

    Fix loading of Qwen3 FP8 (#43494) by @githubnemo

    fix ShieldGemma2IntegrationTest::test_model (#43343) by @sywangyi

    Update SamHQModelIntegrationTest::test_inference_mask_generation_batched_points_batched_images for XPU (#43511) by @sywangyi

    Revert utils files changes from PR #42845 (#43507) by @ydshieh

    Move hardcoded time_step params to config for Bamba, FalconH1, GraniteMoeHybrid (#43461) by @raimbekovm

    Prepare inputs for generation is called from super() (#43280) by @zucchini-nlp

    Enhance repo. consistency bot (#43503) by @ydshieh

    Add pytest-random-order for reproducible test randomization (#43483) by @tarekziade

    Add missing GPURawMetrics.from_dict() method in benchmark_v2 (#43499) by @Abdennacer-Badaoui

    push dev version 5.0.1.dev0 by @ArthurZucker (direct commit on main)

    Fix failing markuplm & perception_lm integration tests (#43464) by @Sai-Suraj-27

    fix(Phi4Multimodal): Fix incorrect default vision/audio config initialization in Phi4MultimodalConfig (#43480) by @charlieJ107

    handle 1D position_ids for modeling_flash_attention_utils as well (#43403) by @kaixuanliu

    Remove stale TODO comments in UDOP tied weights (#43477) by @raimbekovm

    Fix Mxfp4 dequantize (#43326) by @Cyrilvallez

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    @cyyever

    Remove SDPA workarounds for torch 2.4+ (#43754)

    Update torch minimum version to 2.4 (#41307)

    🚨 Remove deprecated AnnotionFormat (#42983)

    @eustlb

    Add moonshine streaming (#43702)

    @tarekziade

    Added S110 - try-except-pass rule (#43687)

    Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675)

    Fix extras on all supported Python versions (#43490)

    fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557)

    fix: initialize BatchNorm2d buffers only when needed (#43520)

    Add pytest-random-order for reproducible test randomization (#43483)

    @nuxlear

    Add EXAONE-MoE implementations (#43080)

    @vasqu

    [Attn] Fixup interface usage after refactor (#43706)

    the cache class is deprecated

    [HunYuan] Fix RoPE init (#43411)

    [Sam] Fixup training flags (#43567)

    [Rope] Revert #43410 and make inheritance implicit again (#43620)

    [Modular] Allow to add new bases that are not present in the inherited class (#43556)

    [RoPE] Make explicit inheritance (#43410)

    @remi-or

    [CB] Keep order of incoming requests (#43626)

    [CB] Refactor logic for inputs and outputs outside of the main API (#43569)

    [CB] [Serve] Fix broken serve tests (#43594)

    [CB] Minor perf improvements and ty compatibility (#43521)

    @NielsRogge

    Add EoMT with DINOv3 backbone (#41212)

    @YangKai0616

    XPU now supports MoE kernel(MegaBlocks) implementation (#43435)

    [MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583)

    [Model] Refactor modernbert with the attention interface (#43030)

    Add XPU support to the tests for solar_open (#43579)

    @ydshieh

    Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662)

    Fix KeyError in check_bad_commit.py (#43655)

    Add explicit commit info to PR comment CI feedback (#43635)

    Better new failures reporting for PR comment CI (#43629)

    Improve new failures reporting (#43628)

    Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584)

    Fix repo. consistency bot (push permission issue) (#43570)

    check/fix repo. check bot workflow (#43565)

    check PR bot permission - part 3 (try content attribute) (#43555)

    check PR bot permission - part 2 (style only) (#43554)

    check PR bot permission - part 1 (#43553)

    Revert utils files changes from PR #42845 (#43507)

    Enhance repo. consistency bot (#43503)

    @JaredforReal

    [GLM-Image] Add batch > 1 support and fix configuration defaults (#43342)

    @zhang-prog

    [Model] Add PP-DocLayoutV3 Model Support (#43098)

    @LuJunru

    Update test of Youtu-LLM to pr-aligned repos (#43578)

    Add Youtu-LLM model (#43166)

    @zRzRzRzRzRzRzR

    [GLM-OCR] GLM-OCR Support (#43391)

    Original source Report a problem
  • Jan 26, 2026
    • Date parsed from source:
      Jan 26, 2026
    • First seen by Releasebot:
      Mar 20, 2026
    Hugging Face logo

    transformers by Hugging Face

    Transformers v5

    transformers releases its first major v5 update, bringing major API simplification, dynamic weight loading, tokenizer and config refactors, faster model loading, weekly minor releases, and broad bug fixes plus new model support across vision, audio, and language tasks.

    Transformers v5 release notes

    Highlights

    Significant API changes: dynamic weight loading, tokenization

    Backwards Incompatible Changes

    Bugfixes and improvements

    A continuously updated migration guide is available on the main branch; please check it out if you run into issues: migration guide.

    Highlights

    We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and it is significant: 1200 commits have been pushed to main since the latest minor release. This release removes many long-overdue deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.

    We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.

    This is the full v5 release, and it sets something bigger in motion: starting with v5, we'll ship minor releases every week rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week after, and so on.

    We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.

    To install this release:

    pip install transformers
    

For us to deliver the best package possible, it is imperative that we get feedback on how the toolkit is working for you. Please try it out, and open an issue if you face anything inconsistent or a bug.

    Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.

    Significant API changes

    Dynamic weight loading

    We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
    weight loading API is designed to apply operations to the checkpoints loaded by transformers.
    Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
    and split the layers according to how they're defined in this new API. These operations are often a necessity when
    working with quantization or parallelism algorithms.

    This new API is centered around the new WeightConverter class:

class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter applies a list of operations to the source keys, producing the target keys. A common
operation on attention layers is fusing the query, key, and value layers. Doing so with this API amounts
to defining the following conversion:

conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)
    

    In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
    layer.

This allows us to define, per architecture, a list of weight conversions that apply arbitrary transformations
to the layers themselves. This significantly simplified the from_pretrained method
and helped us remove a lot of technical debt accumulated over the past few years.

    This results in several improvements:

    • Much cleaner definition of transformations applied to the checkpoint
    • Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
    • Faster model loading thanks to scheduling of tensor materialization
    • Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)

    Linked PR: #41580
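As a mental model for what such a conversion does, here is a toy, pure-Python stand-in (not the real transformers machinery; weights are nested lists of rows, and concatenation along dim 0 becomes row concatenation):

```python
# Toy sketch of a fused-QKV weight conversion (illustrative, not the transformers API).
def concatenate_dim0(tensors):
    """Stack 2D weights along dim 0 (i.e., concatenate their rows)."""
    out = []
    for t in tensors:
        out.extend(t)
    return out

def convert(state_dict, source_keys, target_key, op):
    """Apply `op` to the tensors at `source_keys`, storing the result at `target_key`."""
    tensors = [state_dict.pop(k) for k in source_keys]
    state_dict[target_key] = op(tensors)
    return state_dict

state = {
    "self_attn.q_proj": [[1, 0], [0, 1]],
    "self_attn.k_proj": [[2, 0], [0, 2]],
    "self_attn.v_proj": [[3, 0], [0, 3]],
}
fused = convert(
    state,
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
    "self_attn.qkv_proj",
    concatenate_dim0,
)
print(len(fused["self_attn.qkv_proj"]))  # 6 rows: 2 per projection
```

The real API additionally schedules tensor materialization and keeps the transformation reversible so saving reproduces the original checkpoint layout.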

    Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers and the Tokenizer object to be much more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty LlamaTokenizer and train it directly on your corpus.

    Defining a new tokenizer object should be as simple as this:

from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
    

Once the tokenizer is defined as above, you can instantiate it with Llama5Tokenizer(). Doing so returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).

    The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.

    Backend Architecture Changes: moving away from the slow/fast tokenizer separation

    Up to now, transformers maintained two parallel implementations for many tokenizers:

    • "Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
    • "Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

    In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

• TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. It generally provides optimal performance and also offers many features commonly adopted across the ecosystem:
  • handling additional tokens
  • a full Python API for setting and updating
  • automatic parallelization
  • automatic offsets
  • customization
  • training
• SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
• PythonBackend: a Python implementation of the features provided by tokenizers; it mainly enables adding tokens.
• MistralCommonBackend: relies on MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer.)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This keeps transformers future-proof and modular, making it easy to support future backends.

Defining a tokenizer outside of the existing backends

    We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.

    To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.

    If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:

    • encode
    • decode
    • vocab_size
    • get_vocab
    • convert_tokens_to_ids
    • convert_ids_to_tokens
    • from_pretrained
    • save_pretrained

among a few others.

    API Changes

    1. Direct tokenizer initialization with vocab and merges

    Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:

    from transformers import LlamaTokenizer
    tokenizer = LlamaTokenizer()
    

    This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.

    These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:

    from transformers import LlamaTokenizer
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
    merges = [("h", "e"), ("l", "l"), ("o", " ")]
    tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
    

    This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comparison of different pre-tokenizers, normalizers, etc.

⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer, as loading from files is reserved for the from_pretrained method.

    2. Simplified decoding API

The batch_decode and decode methods have been unified to mirror the behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    inputs = ["hey how are you?", "fine"]
    tokenizer.decode(tokenizer.encode(inputs))
    

    Gives:

    • 'hey how are you?</s> fine</s>'
    • ['hey how are you?</s>', 'fine</s>']

We expect encode and decode to behave as two sides of the same coin: encode, process, decode should just work.

    Note

A common use-case is: encode, model.generate, decode. generate returns list[list[int]], which was previously incompatible with the single-sequence decode; the unified decode now handles it directly.
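The single/batch dispatch can be sketched with a toy decode that branches on input nesting (a simplification; the real method also handles tensors, special tokens, and more, and the vocabulary here is invented):

```python
# Toy sketch of a decode that accepts both a single sequence and a batch.
ID_TO_TOKEN = {0: "hey", 1: "how", 2: "are", 3: "you", 4: "fine", 5: "</s>"}

def decode(ids):
    # A batch is a list of lists; a single sequence is a flat list of ints.
    if ids and isinstance(ids[0], list):
        return [decode(seq) for seq in ids]
    return " ".join(ID_TO_TOKEN[i] for i in ids)

single = decode([0, 1, 2, 3, 5])
batch = decode([[0, 1, 2, 3, 5], [4, 5]])
print(single)  # hey how are you </s>
print(batch)   # ['hey how are you </s>', 'fine </s>']
```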

    3. Unified encoding API

The encode_plus method is deprecated in favor of the single __call__ method.

    4. apply_chat_template returns BatchEncoding

    Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.

# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]
# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
    

    5. Removed legacy configuration file saving:

    We simplify the serialization of tokenization attributes:

    • special_tokens_map.json - special tokens are now stored in tokenizer_config.json.
    • added_tokens.json - added tokens are now stored in tokenizer.json.
    • added_tokens_decoder is only stored when there is no tokenizer.json.

    When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
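Conceptually, the consolidation is a dict merge; here is a minimal sketch (the file names come from the list above, while the merge logic and field values are illustrative):

```python
import json

# Legacy split layout (v4): special tokens lived in their own file,
# special_tokens_map.json, next to tokenizer_config.json.
special_tokens_map = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
tokenizer_config = {"model_max_length": 4096}

# v5-style consolidated layout: fold the special tokens into tokenizer_config.json.
tokenizer_config.update(special_tokens_map)
print(json.dumps(tokenizer_config, indent=2))
```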

    6. Model-Specific Changes

    Several models that had identical tokenizers now import from their base implementation:

    • LayoutLM → uses BertTokenizer
    • LED → uses BartTokenizer
    • Longformer → uses RobertaTokenizer
    • LXMert → uses BertTokenizer
    • MT5 → uses T5Tokenizer
    • MVP → uses BartTokenizer

    These modules will eventually be removed altogether.

    Removed T5-specific workarounds

    The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.

    Testing Changes

    A few testing changes specific to tokenizers have been applied:

    • Model-specific tokenization test files now focus on integration tests.
• Common tokenization API tests (e.g., add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

    For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.

    7. Deprecated / Modified Features

    Special Tokens Structure:

    SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.

    special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.

    # v4
    tokenizer.special_tokens_map # Included 'additional_special_tokens'
    # v5
    tokenizer.special_tokens_map # Only named tokens
    tokenizer.extra_special_tokens # Additional tokens
    

    special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.

    additional_special_tokens: Still accepted for backward compatibility but is automatically converted to extra_special_tokens.
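For code that still constructs tokenizers with the old kwarg, the rename can be pictured with a tiny shim (illustrative only, not a transformers API; transformers itself performs this conversion automatically):

```python
def migrate_special_tokens(kwargs):
    """Rename the v4 `additional_special_tokens` kwarg to v5 `extra_special_tokens`."""
    if "additional_special_tokens" in kwargs:
        kwargs.setdefault("extra_special_tokens", kwargs.pop("additional_special_tokens"))
    return kwargs

v4_kwargs = {"bos_token": "<s>", "additional_special_tokens": ["<extra_0>", "<extra_1>"]}
migrated = migrate_special_tokens(v4_kwargs)
print(migrated)
```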

    Deprecated Methods:

    sanitize_special_tokens(): Already deprecated in v4, removed in v5.

prepare_seq2seq_batch(): Deprecated; use __call__() with the text_target parameter instead.

    # v4
    model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)
    # v5
    model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
    model_inputs["labels"] = model_inputs.pop("input_ids_target")
    

    BatchEncoding.words(): Deprecated; use word_ids() instead.

    Removed Methods:

    create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.

    prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.

_switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with the text_target parameter instead.

# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
# v5
labels = tokenizer(text_target=tgt_texts, ...)
    

    parse_response(): Removed from base class.

    Performance

    MoE Performance

The v5 release significantly improves the performance of MoE models. We improve and optimize MoE performance through batched and grouped expert implementations, optimized for decoding using batched_mm.

    Core performance

We focused on improving the performance of loading weights onto device (which gives speedups of up to 6x in tensor-parallel situations); this is preliminary work that we'll continue in the coming weeks. Some notable improvements:

    • [saving] Simplify general logic by @Cyrilvallez in #42766
    • Do not rely on config for inferring model dtype by @Cyrilvallez in #42838
    • Improve BatchFeature: stack list and lists of torch tensors by @yonigozlan in #42750
    • Remove tied weights from internal attribute if they are not tied by @Cyrilvallez in #42871
    • Enforce call to post_init and fix all of them by @Cyrilvallez in #42873
    • Simplify tie weights logic by @Cyrilvallez in #42895
    • Add buffers to _init_weights for ALL models by @Cyrilvallez in #42309
    • [loading] Really initialize on meta device for huge perf gains by @Cyrilvallez in #42941
    • Do not use accelerate hooks if the device_map has only 1 device by @Cyrilvallez in #43019
    • Move missing weights and non-persistent buffers to correct device earlier by @Cyrilvallez in #43021

    Library-wide changes with lesser impact

    Default dtype update

We have updated the default dtype for all models loaded with from_pretrained to auto. Model instantiation now respects the dtype in which the model was saved, rather than forcing float32.

    You can, of course, still specify the dtype in which you want to load your model by specifying it as an argument to the from_pretrained method.

    Shard size

The Hugging Face Hub infrastructure has gradually moved to the Xet backend. This significantly simplifies uploads and downloads, bringing higher speeds, partial uploads, and, most notably, a higher threshold for accepted file sizes on the Hugging Face Hub.

    To reflect this, we're increasing the default shard size of models serialized on the Hub to 50GB (up from 5GB).
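To get a feel for the impact, here is the shard-count arithmetic for a hypothetical 140 GB checkpoint (the checkpoint size is invented for illustration):

```python
import math

def num_shards(checkpoint_gb, shard_gb):
    """Number of shards needed to serialize a checkpoint at a given max shard size."""
    return math.ceil(checkpoint_gb / shard_gb)

old = num_shards(140, 5)   # old 5GB default
new = num_shards(140, 50)  # new 50GB default
print(old, new)  # 28 3
```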

    use_auth_token

The use_auth_token argument is deprecated in favor of token everywhere.

    You should be able to search and replace use_auth_token with token and get the same logic.
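If you maintain wrappers that forward kwargs to from_pretrained, the rename can be covered by a small shim (illustrative only, not part of transformers):

```python
def migrate_auth_kwargs(kwargs):
    """Rename the deprecated `use_auth_token` kwarg to `token`."""
    if "use_auth_token" in kwargs and "token" not in kwargs:
        kwargs["token"] = kwargs.pop("use_auth_token")
    return kwargs

migrated = migrate_auth_kwargs({"use_auth_token": "hf_xxx", "revision": "main"})
print(migrated)  # {'revision': 'main', 'token': 'hf_xxx'}
```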

    Linked PR: #41666

    Attention-related features

We decided to remove some features for v5, as they are currently only supported in a few older models and are no longer integrated in new model additions. It's recommended to stick with v4.x if you need them. The following features are affected:

• No more head masking, see #41076. This feature allowed turning off certain heads during the attention calculation and only worked with eager attention.
• No more relative positional biases in Bert-like models, see #41170. This feature was introduced to allow relative position scores within attention calculations (similar to T5). However, it is barely used in official models and added a lot of complexity. It also only worked with eager attention.
• No more head pruning, see #41417 by @gante. As the name suggests, it allowed pruning heads within your attention layers.

    Updates to supported torch APIs

    We dropped support for two torch APIs:

    • torchscript in #41688
    • torch.fx in #41683

    Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs dynamo and export.

    Quantization changes

    We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted
    above.

    We drop support for two quantization arguments that have been deprecated for some time:

    • load_in_4bit
    • load_in_8bit

    We remove them in favor of the quantization_config argument which is much more complete. As an example, here is how
    you would load a 4-bit bitsandbytes model using this argument:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    quantization_config=quantization_config,
)
    

    Configuration

Methods to initialize a nested config, such as from_xxx_config, are deleted. Nested configs can be initialized through the init method in the same way. See #41314.

    It is no longer possible to load a config class from a URL file. Configs must be loaded from either a local path or a repo on the Hub. See #42383.

All parameters configuring a model's rotary embedding are now stored under config.rope_parameters, including rope_theta and rope_type. config.rope_parameters is a simple dictionary in most cases, but can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to get config.rope_theta will throw an attribute error from now on. See #39847 and #42255.
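Code that used to read config.rope_theta can migrate with a small helper; a sketch assuming the flat and nested dict layouts described above (real configs are objects, modeled here as plain dicts, and the layer-type names and values are illustrative):

```python
def get_rope_theta(rope_parameters, layer_type=None):
    """Read rope_theta from a flat rope_parameters dict, or from a
    per-layer-type nested dict when a layer_type is given."""
    if layer_type is not None and layer_type in rope_parameters:
        return rope_parameters[layer_type]["rope_theta"]
    return rope_parameters["rope_theta"]

flat = {"rope_type": "default", "rope_theta": 10000.0}
nested = {
    "full_attention": {"rope_type": "default", "rope_theta": 1000000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
}
flat_theta = get_rope_theta(flat)
full_theta = get_rope_theta(nested, "full_attention")
print(flat_theta, full_theta)  # 10000.0 1000000.0
```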

    Qwen-VL family configuration is in a nested format and trying to access keys directly will throw an error (e.g. config.vocab_size). Users are expected to access keys from their respective sub-configs (config.text_config.vocab_size).

    Configurations of non-generative models (any model that doesn't call model.generate()) will no longer have a generation_config and model.config.generation_config will throw an attribute error.

    Processing

    Tokenization

Slow tokenizer files (aka tokenization_<model>.py) are removed in favor of the fast tokenizer files tokenization_<model>_fast.py, which are renamed to tokenization_<model>.py. As fast tokenizers are backed by 🤗 tokenizers, they include a wider range of features that are maintainable and reliable.

Other backends (sentencepiece, etc.) are supported with a light layer if loading a fast tokenizer fails.

• Remove legacy files like special_tokens_map.json and added_tokens.json
• Remove _eventually_correct_t5_max_length
• encode_plus --> __call__
• batch_decode --> decode
• apply_chat_template used to return naked input_ids rather than a BatchEncoding dict. This was inconvenient - it should return a BatchEncoding like tokenizer.__call__(), but we were stuck with it for backward compatibility. The method now returns a BatchEncoding.

    Linked PRs:

    • #40938
    • #40936
    • #41626

    Processing classes

In processing classes, each attribute will be serialized under processor_config.json as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old-format processors (#41474)

    XXXFeatureExtractors classes are completely removed in favor of XXXImageProcessor class for all vision models (#41174)

    Minor change: XXXFastImageProcessorKwargs is removed in favor of XXXImageProcessorKwargs which will be shared between fast and slow processors (#40931)

    Modeling

    Some RotaryEmbeddings layers will start returning a dict of tuples, in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.

The config attribute for RotaryEmbeddings layers is unified and accessed via config.rope_parameters. The rope_theta config attribute might not be accessible anymore for some models, and instead lives in config.rope_parameters['rope_theta']. Backward compatibility will be maintained as much as possible for a while, and in the near future we'll gradually move to the new RoPE format (#39847)

Vision-language models no longer have shortcut access to their language and vision components from the generative model via model.language_model. It is recommended to access the module with model.model.language_model or model.get_decoder(). See #42156

All models now accept kwargs in their forward methods.

    Generate

    Old, deprecated output type aliases were removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (#40998)

    Removed deprecated classes regarding decoding methods that were moved to the Hub due to low usage (constraints and beam scores) (#41223)

    If generate doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (#41505)

Generation parameters are no longer accessible via the model's config. If generation parameters are serialized in config.json for an old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only via model.generation_config (e.g. model.generation_config.do_sample = True).
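The load-time migration can be pictured as splitting one raw dict in two; a toy sketch (the key set is an illustrative subset, and real config classes are not plain dicts):

```python
# Illustrative subset of generation-related keys.
GENERATION_KEYS = {"do_sample", "temperature", "top_p", "max_length"}

def split_config(raw_config):
    """Move serialized generation parameters out of the model config
    into a separate generation config."""
    generation_config = {
        k: raw_config.pop(k) for k in list(raw_config) if k in GENERATION_KEYS
    }
    return raw_config, generation_config

raw = {"vocab_size": 32000, "do_sample": True, "temperature": 0.7}
model_config, generation_config = split_config(raw)
print(model_config)       # {'vocab_size': 32000}
print(generation_config)  # {'do_sample': True, 'temperature': 0.7}
```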

    Trainer

    New Features

    ALST/Ulysses Sequence Parallelism Integration

Added sequence parallelism support via HF Accelerate for training with longer sequences. Enables splitting sequences across devices using the ALST (Arctic Long Sequence Training) and Ulysses algorithms with DeepSpeed.

    Improved compute_loss_func Handling

    compute_loss_func now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.

    num_items_in_batch in Prediction Step

    The num_items_in_batch argument is now passed to compute_loss during prediction_step, enabling proper loss scaling during evaluation.

    Breaking Changes

    report_to now defaults to "none"

    Logging integrations are no longer auto-detected by default; users must explicitly specify which reporting backends to use.

Removing arguments without a deprecation cycle in TrainingArguments due to low usage

• mp_parameters -> legacy parameter that was later added to the SageMaker trainer
• _n_gpu -> not intended to be set by users; we will initialize it correctly instead of exposing it in TrainingArguments
• overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the example scripts, with no impact on Trainer
• logging_dir -> only used for tensorboard; set the TENSORBOARD_LOGGING_DIR env var instead
• jit_mode_eval -> use use_torch_compile instead, as torchscript is no longer recommended
• tpu_num_cores -> setting the number of cores is not recommended; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
• past_index -> only used for a very small number of models with special architectures, like Transformer-XL, and training those models was never documented
• ray_scope -> a minor argument for the Ray integration; set the RAY_SCOPE env var instead
• warmup_ratio -> use warmup_steps instead. We combined both arguments by allowing float values in warmup_steps
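The combined warmup behavior described in the last bullet can be sketched as follows, assuming a float below 1 is treated as a ratio of total steps (the helper and its exact semantics are illustrative, not the Trainer's code):

```python
def resolve_warmup_steps(warmup, total_steps):
    """A float < 1 acts like the old warmup_ratio; otherwise an absolute step count."""
    if isinstance(warmup, float) and warmup < 1:
        return int(total_steps * warmup)
    return int(warmup)

ratio_style = resolve_warmup_steps(0.1, 1000)  # old warmup_ratio=0.1
absolute = resolve_warmup_steps(500, 1000)     # plain step count
print(ratio_style, absolute)  # 100 500
```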

    Removing deprecated arguments in TrainingArguments

    • fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
    • tpu_metrics_debug -> debug
    • push_to_hub_token -> hub_token
    • push_to_hub_model_id and push_to_hub_organization -> hub_model_id
    • include_inputs_for_metrics -> include_for_metrics
    • per_gpu_train_batch_size -> per_device_train_batch_size
    • per_gpu_eval_batch_size -> per_device_eval_batch_size
    • use_mps_device -> mps will be used by default if detected
    • fp16_backend and half_precision_backend -> we will only rely on torch.amp as everything has been upstreamed to torch
    • no_cuda -> use_cpu
    • include_tokens_per_second -> include_num_input_tokens_seen
    • use_legacy_prediction_loop -> we only use evaluation_loop function from now on

    Removing deprecated arguments in Trainer

    • tokenizer in initialization -> processing_class
    • model_path in train() -> resume_from_checkpoint

    Removed features for Trainer

• SigOpt integration for hyperparameter search was removed, as the library was archived and the API stopped working
• drop support for SageMaker API <1.10
    • bump accelerate minimum version to 1.1.0
    • bump peft minimum version to 0.18.0
    • bump bitsandbytes minimum version to 0.46.1

    New defaults for Trainer

use_cache in the model config will be set to False. You can still change the cache value through the TrainingArguments use_cache argument if needed.

    Pipeline

    Image text to text pipelines will no longer accept images as a separate argument along with conversation chats. Image data has to be embedded in the chat's "content" field. See #42359

    PushToHubMixin

• Removed deprecated organization and repo_url from PushToHubMixin. You must pass a repo_id instead.

• Removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we can't push the card back to the Hub, so it's better to fail early than to provide an option that fails later.

• push_to_hub no longer accepts **kwargs. All accepted parameters are explicitly documented.

• Arguments of push_to_hub are now keyword-only to avoid confusion. Only repo_id can be positional, since it's the main argument.

• Removed the use_temp_dir argument from push_to_hub. We now use a temporary directory in all cases.

    Linked PR: #42391.

    CLI

The deprecated transformers-cli ... command has been removed; transformers ... is now the only CLI entry point.

The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of
the box (improved --help sections, autocompletion).

    Biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model.

    It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped
    version, this feature has been removed in favor of transformers serve. The goal of splitting transformers chat
    and transformers serve is to define clear boundaries between client and server code. It helps with maintenance
    but also makes the commands less bloated. The new signature of transformers chat is:

    Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...
    Chat with a model from the command line.

    It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:

    transformers chat HuggingFaceTB/SmolLM3-3B

    It can however use any OpenAI API compatible HTTP endpoint:

    transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1

    Linked PRs:

    • #40997
    • #41487

    Removal of the run method

The transformers run command (previously transformers-cli run) is an artefact of the past: it was neither documented
nor tested. We're removing it for now; please let us know if it's a command you use, in which case we'll
bring it back with better support.

    Linked PR: #42447

    Environment variables

    Legacy environment variables like TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
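A migration sketch for scripts that exported the legacy variable, assuming the old TRANSFORMERS_CACHE pointed at a subfolder of what is now the HF_HOME root (this layout assumption is illustrative; check your actual paths):

```python
import os

def resolve_hf_home(env):
    """Map a legacy TRANSFORMERS_CACHE value to an HF_HOME root.
    Illustrative mapping: TRANSFORMERS_CACHE pointed at <root>/hub,
    while HF_HOME is the <root> itself."""
    if "HF_HOME" in env:
        return env["HF_HOME"]
    legacy = env.get("TRANSFORMERS_CACHE")
    if legacy:
        return os.path.dirname(legacy.rstrip("/"))
    return os.path.expanduser("~/.cache/huggingface")

hf_home = resolve_hf_home({"TRANSFORMERS_CACHE": "/data/hf/hub"})
print(hf_home)  # /data/hf
```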

Constants HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.

    Linked PR: #42391.

    Requirements update

transformers v5 pins the huggingface_hub version to >=1.0.0. See this migration guide to learn more about that major release. Here are the main aspects to know about:

• We switched the HTTP backend from requests to httpx. This change improves performance and supports synchronous and asynchronous requests the same way. If you currently catch requests.HTTPError in your codebase, you'll need to switch to httpx.HTTPError.
• Related to the previous point, it is no longer possible to set proxies from your script. To handle proxies, set the HTTP_PROXY / HTTPS_PROXY environment variables.
• hf_transfer, and therefore HF_HUB_ENABLE_HF_TRANSFER, has been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!
• typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.

    New model additions in v5

    CWM

    The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.

    Add Code World Model (CWM) by @jacobkahn in #41199

    SAM3

    SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
The SAM3 addition introduces four new architectures:

    • Sam3
    • Sam3Tracker
    • Sam3TrackerVideo
    • Sam3Video

    SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.

    Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.

    SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.

    Add SAM3 to 🤗 Transformers by @yonigozlan in #42285

    LFM2 MoE

    LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.

    LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).

    [Model] Lfm2Moe by @paulpak58 in #41401

    VideoLlama 3

    The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.

    [model] Add VideoLLaMA3 implementation by @lkhl in #40499

    AudioFlamingo 3

    Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.

    The model checkpoint is available at: nvidia/audio-flamingo-3-hf

    Highlights:

    • Unified audio encoder across speech, sound, and music.
    • Long-audio support via windowing and post-pool alignment: the model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total); audio longer than 10 minutes is truncated.
    • Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
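    The windowing arithmetic above is easy to check. The helper below is a hypothetical sketch using only the constants from this description (30-second windows, 20-window cap); it is not part of the AudioFlamingo3 API:

    ```python
    WINDOW_SECONDS = 30
    MAX_WINDOWS = 20  # hard cap: 20 * 30 s = 10 minutes

    def plan_windows(duration_seconds):
        """Return (num_windows, processed_seconds, truncated_seconds)."""
        full = -(-duration_seconds // WINDOW_SECONDS)  # ceiling division
        num_windows = min(full, MAX_WINDOWS)
        processed = min(duration_seconds, WINDOW_SECONDS * MAX_WINDOWS)
        return num_windows, processed, max(0, duration_seconds - processed)
    ```

    A 7-minute clip fits in 14 windows untouched, while a 12-minute clip is cut back to the first 10 minutes.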

    [models] Add AudioFlamingo3 integration by @lashahub in #40290

    Nanochat

    NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. It features several architectural components that are common in modern transformer models, which makes it a good starting point for understanding the principles behind them. NanoChat is a variant of the Llama architecture with a simplified attention mechanism and simplified normalization layers.

    [MODEL] Nanochat implementation by @burtenshaw in #41634

    FastVLM

    FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.
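    Reparameterizable convolutional layers fold train-time branches such as BatchNorm into a single convolution at inference time. The scalar sketch below shows the standard BatchNorm-folding identity; it is a generic illustration of the technique, not FastVLM's actual code:

    ```python
    import math

    def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
        """Fold y = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
        into a single affine map y = w_f * x + b_f (scalar case)."""
        scale = gamma / math.sqrt(var + eps)
        return w * scale, (b - mean) * scale + beta

    # Fold a toy conv weight/bias with toy BatchNorm statistics.
    w_f, b_f = fold_batchnorm(w=2.0, b=0.5, gamma=1.5, beta=0.1, mean=0.4, var=0.25)

    x = 3.0
    direct = 1.5 * ((2.0 * x + 0.5) - 0.4) / math.sqrt(0.25 + 1e-5) + 0.1
    folded = w_f * x + b_f
    ```

    After folding, the two-branch train-time graph and the single folded layer produce the same output, which is what lets reparameterized models run faster at inference.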

    Add FastVLM by @camilla-deckard in #41112

    PaddleOCR-VL

    PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

    [Model] Add PaddleOCR-VL Model Support by @zhang-prog in #42178

    Perception Encoder Audiovisual (PE Audio)

    PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space.
    The model enables cross-modal retrieval and understanding between audio and text.

    Text input

    Produces a single embedding representing the full text.

    Audio input

    PeAudioFrameLevelModel

    Produces a sequence of embeddings, one every 40 ms of audio.
    Suitable for audio event localization and fine-grained temporal analysis.

    PeAudioModel

    Produces a single embedding for the entire audio clip.
    Suitable for global audio-text retrieval tasks.

    The resulting embeddings can be used for:

    • Audio event localization
    • Cross-modal (audio–text) retrieval and matching
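    One embedding every 40 ms implies a simple relation between clip length and frame-level output length. The helper below is an illustrative sketch (the exact rounding in the real model may differ):

    ```python
    FRAME_MS = 40  # one frame-level embedding per 40 ms of audio

    def num_frame_embeddings(duration_seconds):
        """Approximate PeAudioFrameLevelModel-style output length for a clip."""
        return int(duration_seconds * 1000) // FRAME_MS
    ```

    A 10-second clip therefore yields on the order of 250 frame embeddings, while PeAudioModel collapses the whole clip into a single vector.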

    Sam: Perception Encoder Audiovisual by @eustlb in #42905

    Jais2

    Jais2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the ground up with 8B and 70B parameters, Jais2 understands Arabic the way it's truly spoken across dialects, culture, and modern expression. It is developed by MBZUAI, Inception, and Cerebras Systems and is based on the transformer architecture with modifications including:

    • LayerNorm instead of RMSNorm
    • ReLU² activation function
    • Rotary Position Embeddings (RoPE)
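    The ReLU² activation in the list above is simple to state: the square of a standard ReLU. A minimal sketch, applied elementwise:

    ```python
    def relu_squared(x):
        """ReLU² activation: square of max(x, 0)."""
        r = max(x, 0.0)
        return r * r

    activations = [relu_squared(v) for v in [-2.0, 0.0, 3.0]]
    ```

    Negative inputs are zeroed exactly as in ReLU, while positive inputs grow quadratically rather than linearly.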

    adds jais2 model support by @sarathc-cerebras in #42684

    Pixio

    Pixio is a vision foundation model that uses ViT as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.

    Add Pixio pre-trained models by @LiheYoung in #42795

    Ernie 4.5 VL MoE

    The Ernie 4.5 VL MoE model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. The Vision-Language series in particular is composed of a novel multimodal heterogeneous structure, sharing parameters across modalities while dedicating other parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE) layer, which is composed of

    • Dedicated Text Experts
    • Dedicated Vision Experts
    • Shared Experts

    This architecture has the advantage of enhancing multimodal understanding without compromising performance on text-related tasks (and can even improve it). A more detailed breakdown is given in the Technical Report.
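    The modality-partitioned expert layout can be pictured as a toy router. Everything below is an illustrative sketch (names and routing logic are hypothetical, not ERNIE's actual code): each token draws on its modality's dedicated experts plus the shared experts.

    ```python
    def route_token(modality, text_experts, vision_experts, shared_experts):
        """Select the expert pool for one token in a modality-partitioned MoE."""
        if modality == "text":
            dedicated = text_experts
        elif modality == "vision":
            dedicated = vision_experts
        else:
            raise ValueError(f"unknown modality: {modality}")
        # Shared experts serve every token regardless of modality.
        return dedicated + shared_experts

    vision_pool = route_token("vision", ["t0", "t1"], ["v0", "v1"], ["s0"])
    text_pool = route_token("text", ["t0", "t1"], ["v0", "v1"], ["s0"])
    ```

    Because text tokens never touch vision experts (and vice versa), each modality gains capacity without interfering with the other; the shared experts carry cross-modal signal.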

    [Ernie 4.5] Ernie VL models by @vasqu in #39585

    GLM-ASR

    GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for
    real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.

    Key capabilities include:

    • Exceptional Dialect Support: beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
    • Low-Volume Speech Robustness: specifically trained for "Whisper/Quiet Speech" scenarios, it captures and accurately transcribes extremely low-volume audio that traditional models often miss.
    • SOTA Performance: achieves the lowest average error rate (4.10) among comparable open-source models, showing significant advantages on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).

    This model was contributed by Eustache Le Bihan and Yuxuan Zhang.
    You can check the model card and the GitHub repo for more details.

    GLM-ASR Support by @zRzRzRzRzRzRzR in #42875

    GLM 4.7 Flash

    GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

    [GLM-4.7] GLM-Lite Support by @zRzRzRzRzRzRzR in #43031

    GLM Image

    We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation.

    In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks.

    We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. Code, models, and more information are released at https://github.com/zai-org/GLM-V

    [GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in #43100

    LWDetr

    LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.

    The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications:

    • Efficient ViT Encoder: Uses a plain ViT with interleaved window/global attention and a window-major organization to drastically reduce attention complexity and latency.
    • Richer Input: Aggregates multi-level features from the encoder and uses a C2f Projector (from YOLOv8) to pass two-scale features (1/8 and 1/32).
    • Faster Decoder: Employs a shallow 3-layer DETR decoder with deformable cross-attention for lower latency and faster convergence.
    • Optimized Queries: Uses a mixed-query scheme combining learnable content queries and generated spatial queries.
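    The interleaved window/global attention from the first bullet can be written down as a simple layer schedule. The sketch below is a toy illustration: the `global_every` ratio is a hypothetical parameter for demonstration, not LW-DETR's actual configuration.

    ```python
    def attention_schedule(num_layers, global_every=3):
        """Label each encoder layer 'window' or 'global', interleaved so that
        every `global_every`-th layer uses full global attention."""
        return ["global" if (i + 1) % global_every == 0 else "window"
                for i in range(num_layers)]

    schedule = attention_schedule(6)
    ```

    Keeping most layers on cheap window attention while a few global layers mix information across windows is what drives the latency reduction described above.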

    Add LWDetr model by @sbucaille in #40991

    LightOnOCR

    LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.

    Add LightOnOCR model implementation by @baptiste-aubertin in #41621

    Bugfixes and improvements

    • JetMoe Fix jetmoe after #40132 by @ArthurZucker in #41324
    • Fixed tiny incorrect import in gemma3 by @Sai-Suraj-27 in #41354
    • Rope for Qwen2.5-VL by @zucchini-nlp in #41173
    • 🚨 Bump to Python 3.10 and rework how we check 3rd-party libraries existence by @Cyrilvallez in #41268
    • Standardize PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
    • Fix trainer for py3.9 by @SunMarc in #41359
    • Check model inputs - hidden states by @zucchini-nlp in #40994
    • [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
    • Fixing a typo for BLT model by @Narsil in #41325
    • 🚨 [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
    • Fix typo in model proposal template by @Ombucha in #41352
    • Better typehints for apply_chat_template by @Samoed in #41355
    • 🚨 Remove BetterTransformer by @Cyrilvallez in #41367
    • [testing] update test_longcat_generation_cpu by @ydshieh in #41368
    • Fix flash_attention.py: wrong argument passing for attn_implementation by @TKONIY in #41347
    • Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939 by @sonianuj287 in #41284
    • Fixes in check_model_inputs, GPTBigCodeModel and ImageGPTModel by @IlyasMoutawwakil in #40811
    • Remove unnecessary list comprehension by @cyyever in #41305
    • make some ut cases pass on xpu w/ latest torch by @yao-matrix in #41337
    • Remove unused function parameters by @cyyever in #41358
    • [CB] Refactors the way we access paged by @ArthurZucker in #41370
    • serve: add non-streaming mode to /v1/responses; stream event parity; remove placeholder logprobs by @antznette1 in #41353
    • Update from pretrained error when loading by @ArthurZucker in #33380
    • [v5] Sync Bert and Bart eager attention by @vasqu in #41248
    • fix asr ut failures by @yao-matrix in #41332
    • fix resample in asr pipeline by @yhzx233 in #41298
    • Correct numerical regression in vision embeddings by @i3hz in #41374
    • [kernels] Kernel Config by @MekkCyber in #41232
    • [Cache] lfm2 cache: allocate empty kv layers during init by @paulpak58 in #41396
    • Fix test for model with dotted name and relative imports by @st81 in #41343
    • Prefer raising TypeError exception for invalid type by @Sai-Suraj-27 in #41346
    • [v5] Bump accelerate to 1.1.0 by @SunMarc in #41234
    • Fix incorrect assignment in update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
    • [v5] Delete left traces of feature extractor by @zucchini-nlp in #41321
    • Remove deprecation warning by @Cyrilvallez in #41425
    • Fix overriding common_kwargs defaults in processor calls by @yonigozlan in #41381
    • v5 dev version by @LysandreJik in #41436
    • Tiny Cleanup - Removed duplicate class field definition's by @Sai-Suraj-27 in #41293
    • 🚨🚨 Remove all traces of legacy cache format by @Cyrilvallez in #41378
    • 🚨 [v5] Prune prune_heads by @gante in #41417
    • [v5] Bump min version of bitsandbytes to 0.46.1 by @SunMarc in #41283
    • Fixing comments in init file by @MekkCyber in #41414
    • Use accelerator API to free device memory by @cyyever in #41195
    • enable new model uts to xpu and fix some failures on xpu by @yao-matrix in #41386
    • [torchao] Add regex support for ModuleFqnToConfig by @jerryzh168 in #41242
    • 🤦 CB nit! by @ArthurZucker in #41413
    • Remove Python 3.9 classifier by @cyyever in #41410
    • [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
    • [testing] Fix JetMoeIntegrationTest by @ydshieh in #41377
    • Add Top-H decoding (entropy-bounded truncation) as a LogitsWarper for text generation by @ErfanBaghaei in #40837
    • Validate processing kwargs with @strict from huggingface_hub by @zucchini-nlp in #40793
    • Update hqq.md by @prathamesh-chavan-22 in #41452
    • enable some falcon-mamba uts on xpu by @yao-matrix in #41428
    • Fix generate outputs and simplify cache tests by @Cyrilvallez in #41440
    • Fix doc by @Cyrilvallez in #41457
    • 🚨 [v5] Rename left traces of past_key_value in BERT-like models by @zucchini-nlp in #41448
    • Subconfig is a class attribute by @zucchini-nlp in #41308
    • [v5] rm utils/tf_ops/ by @gante in #41402
    • Update GLM-4.1V MMRope implementation by @zRzRzRzRzRzRzR in #41182
    • [kernels] Cleanup deta kernel by @MekkCyber in #41470
    • 🚨 [v5] Rendundant code in nested configs by @zucchini-nlp in #41314
    • Remove KERAS_NLP_IMPORT_ERROR by @cyyever in #41468
    • Fix auto model configuration for encoder of perceptionlm by @fschlatt in #41464
    • Fix tests fsdp by @SunMarc in #41422
    • Import Callable from collections.abc by @cyyever in #41130
    • Pickle - part 2 by @ydshieh in #41476
    • Remove infer_device by @cyyever in #41088
    • Change RT-Detr docs to reflect fixed 640x640 input size by @konstantinos-p in #41364
    • Cleaning hub kernels by @MekkCyber in #41477
    • [v5] remove load_in_4bit and load_in_8bit by @SunMarc in #41287
    • 🚨 [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
    • [Fix] Fix test file error by @YangKai0616 in #40973
    • enhance patched_tearDown to support python 3.11+ by @yao-matrix in #41429
    • RT-Detr correct 2d positional embeddings for non-square images by @konstantinos-p in #41380
    • Fix bnb fsdp loading for pre-quantized checkpoint by @SunMarc in #41415
    • Remove SigOpt by @SunMarc in #41479
    • Remove past_index by @SunMarc in #41384
    • Remove deprecated args in Trainer for v5 by @SunMarc in #41404
    • Update GLM-4.6 doc by @zRzRzRzRzRzRzR in #41471
    • report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
    • deprecate overwrite_output_dir by @SunMarc in #41323
    • [CI] Fix copies on main by @vasqu in #41486
    • [Trainer] deprecate ray scope by @SunMarc in #41403
    • deprecate jit_mode_eval by @SunMarc in #41376
    • Remove local_rank arg from TrainingArguments by @SunMarc in #41382
    • Update philosophy by @molbap in #41438
    • Remove DISABLE_KERNEL_MAPPING flag by @MekkCyber in #41475
    • Streaming should be handled at the request-level rather than at the instance level by @LysandreJik in #41444
    • fix bnb model loading by @jiqing-feng in #41499
    • [kernels] Remove RWKV kernel finally ! by @MekkCyber in #41493
    • [kernels] rm yoso kernel by @MekkCyber in #41495
    • Try to remove pickle - BloomTokenizerFast by @ydshieh in #41466
    • Fixed tiny incorrect imports in glm4v by @Sai-Suraj-27 in #41483
    • [Parakeet] unnecessary warning & auto mapping by @eustlb in #41412
    • [causallm tester] automate pipeline mappings + bloom tests by @gante in #41318
    • Fix some tests by @Cyrilvallez in #41503
    • fix gemma3n case failure by @yao-matrix in #41426
    • [voxtral] language detection + skipping lang:xx by @eustlb in #41225
    • Set truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
    • [QoL] modular conversion shows LoC saved by @molbap in #41500
    • More trainer cleaning by @SunMarc in #41489
    • Bump to hfh 1.0.0.rc5 to fix test by @Wauplin in #41508
    • Revert local_rank deletion and some cleaning by @SunMarc in #41504
    • Fix detectron2 import by @Cyrilvallez in #41510
    • add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc by @benkeene in #41484
    • Remove outdated flags by @Cyrilvallez in #41512
    • remove tpu_num_cores by @SunMarc in #41383
    • Allow optuna's catch kwargs passthrough by @nicha-api in #41496
    • Fix Latex typesetting in documentation by @cyyever in #41177
    • [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
    • [Qwen3VL] fix: hidden_states in place modification error by @HollowMan6 in #41535
    • Add MLlam