transformers Release Notes
Last updated: Apr 10, 2026
- Apr 9, 2026
Patch release: v5.5.3
transformers fixes Gemma4 device_map auto support in a small patch release.
Small patch release to fix device_map support for Gemma4! It contains the following commit:
[gemma4] Fix device map auto (#45347) by @Cyrilvallez
- Apr 9, 2026
Patch release: v5.5.2
transformers fixes Gemma4 inference with use_cache=False and improves weight conversion mappings in a small patch.
Small patch dedicated to Gemma4: it fixes inference with use_cache=False (broken by k/v state sharing between layers) and corrects conversion mappings for models that inconsistently serialized their weight names. It contains the following PRs:
- Add MoE to Gemma4 TP plan (#45219) by @sywangyi and @Cyrilvallez
- [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez
- [gemma4] Remove all shared weights, and silently skip them during loading (#45336) by @Cyrilvallez
- Fix conversion mappings for vlms (#45340) by @Cyrilvallez
- Apr 9, 2026
Patch release v5.5.1
transformers ships patch v5.5.1 with Gemma4 and vLLM fixes plus export and integration test improvements.
This patch is very small and focuses on vLLM and Gemma4!
- Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez
- Fix vllm cis (#45139) by @ArthurZucker
- Apr 2, 2026
Release v5.5.0
transformers releases Gemma 4 multimodal models, NomicBERT text embeddings, and Music Flamingo for audio-language reasoning, while also adding major cache updates, vision fixes, and broader bug fixes and improvements.
New Model additions
Gemma4
Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B sizes. The architecture is mostly the same as previous Gemma versions. The key differences are a vision processor that can output images with a fixed token budget and a spatial 2D RoPE that encodes vision-specific information across the height and width axes.
You can find all the original Gemma 4 checkpoints under the Gemma 4 release.
The key difference from previous Gemma releases is the new design for processing images of different sizes with a fixed token budget. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it to fit. There are a couple of constraints to follow (a minimal sizing sketch follows the list):
- The total number of pixels must fit within a patch budget
- Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
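To make the constraints concrete, here is a minimal sizing sketch. The helper name and its rounding strategy are illustrative assumptions, not the actual Gemma 4 processor logic; the default budget matches the 280-soft-token default shown in the table below.

```python
import math

def gemma4_target_size(height: int, width: int,
                       max_pixels: int = 645_000, multiple: int = 48) -> tuple[int, int]:
    """Illustrative only: pick a resize target that keeps the aspect ratio,
    fits the pixel budget, and is divisible by 48 on both axes."""
    # Scale down (never up) so that height * width <= max_pixels.
    scale = min(1.0, math.sqrt(max_pixels / (height * width)))
    # Snap each side down to the nearest multiple of 48 (patch 16 x pooling 3).
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    return new_h, new_w

print(gemma4_target_size(1080, 1920))  # (576, 1056): 608,256 pixels <= 645,000
```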
Important
Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is 280 soft tokens per image.
| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The position table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector with the same dimensions as the patch embedding. The 2D RoPE that Gemma 4 uses independently rotates half the attention head dimensions for the x-axis and the other half for the y-axis. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
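As a toy illustration of that axis split, the sketch below rotates the first half of the head dimension by the x coordinate and the second half by the y coordinate. The interleaved pair layout and the base are generic RoPE conventions assumed for illustration, not Gemma 4's exact implementation.

```python
import torch

def rope_2d(q: torch.Tensor, x_pos: torch.Tensor, y_pos: torch.Tensor,
            base: float = 10_000.0) -> torch.Tensor:
    """q: (..., seq, head_dim); x_pos, y_pos: (seq,) patch grid coordinates."""
    def rotate(t: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        d = t.shape[-1]
        inv_freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
        angles = pos[:, None].float() * inv_freq              # (seq, d // 2)
        cos, sin = angles.cos(), angles.sin()
        t1, t2 = t[..., 0::2], t[..., 1::2]                   # interleaved pairs
        return torch.stack((t1 * cos - t2 * sin,
                            t1 * sin + t2 * cos), dim=-1).flatten(-2)

    half = q.shape[-1] // 2
    # First half of the head dim encodes x, second half encodes y.
    return torch.cat([rotate(q[..., :half], x_pos),
                      rotate(q[..., half:], y_pos)], dim=-1)

q = torch.randn(1, 8, 4, 64)                                  # (batch, heads, patches, head_dim)
x = torch.tensor([0, 1, 0, 1]); y = torch.tensor([0, 0, 1, 1])  # 2x2 patch grid
print(rope_2d(q, x, y).shape)                                 # torch.Size([1, 8, 4, 64])
```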
NomicBERT
NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to produce reproducible long-context text embeddings. It is the first fully reproducible, open-source text embedding model with an 8192-token context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB and long-context LoCo benchmarks. The model generates dense vector embeddings for tasks including search, clustering, and classification using task-specific instruction prefixes.
Links: Documentation | Paper
Internalise the NomicBERT model (#43067) by @ed22699 in #43067
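A minimal embedding sketch using the instruction-prefix convention; the checkpoint id and the mean pooling are assumptions based on the upstream nomic-embed-text models and may not match the internalised integration exactly.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Checkpoint id assumed from the upstream release; verify on the Hub.
name = "nomic-ai/nomic-embed-text-v1.5"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Task-specific instruction prefixes, e.g. "search_query: " / "search_document: ".
inputs = tok(["search_query: what is rotary position embedding?"], return_tensors="pt")
hidden = model(**inputs).last_hidden_state      # (batch, seq, dim)
emb = F.normalize(hidden.mean(dim=1), dim=-1)   # mean-pool, then L2-normalize
print(emb.shape)
```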
MusicFlamingo
Music Flamingo is a fully open large audio-language model designed for robust understanding and reasoning over music. It builds on the Audio Flamingo 3 architecture by adding Rotary Time Embeddings (RoTE), which inject temporal position information so the model can handle audio sequences up to 20 minutes long. The model features a unified audio encoder across speech, sound, and music, with special sound boundary tokens for improved audio sequence modeling.
Links: Documentation | Paper
Add Music Flamingo (#43538) by @lashahub in #43538
Breaking changes
Mamba and hybrid model caches are now first-class native citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.
🚨 [Cache] Native mamba & hybrid cache (#44950) by @Cyrilvallez
Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.
🚨 [LightGlue] Remove remote code execution (#45122) by @vasqu
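Migration is a one-line change; a sketch assuming the commonly used LightGlue checkpoint id:

```python
from transformers import AutoModel

# Before: AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint", trust_remote_code=True)
# After: the native integration loads directly, so drop the argument:
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")
```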
Vision
Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.
- Generalize gemma vision mask to videos (#45185) by @zucchini-nlp in [#45185]
- Fix explicit local code resolution for tokenizers and image processors (#45169) by @hmellor in [#45169]
- fix bug for janus model image generation (#45044) by @kaixuanliu in [#45044]
- [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045) by @Lidang-Jiang in [#45045]
- Avoid Image.open failure (#44645) by @sywangyi in [#44645]
Cache
Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.
- refactoring: speedup static checks with disk cache (#44992) by @tarekziade in [#44992]
- refactor: added cache in check_repo (#45012) by @tarekziade in [#45012]
- chore: Fix mlinter cache location (#45052) by @tarekziade in [#45052]
Bugfixes and improvements
- Fix resized LM head weights being overwritten by post_init (#45079) by @javierdejesusda in [#45079]
- [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration (#45124) by @danielquintas8 in [#45124]
- fix(models): Fix dtype mismatch in SwitchTransformers and TimmWrapperModel (#45074) by @harshaljanjani in [#45074]
- [misc] fix qwen35 tests: correct the text model type and skip reverse_mapping (#45173) by @JJJYmmm in [#45173]
- 🔒 Pin GitHub Actions to commit SHAs (#45180) by @paulinebm in [#45180]
- Use doc-builder runnable example for GLM-ASR (#44277) by @tarekziade in [#44277]
- [CI] Small T5 expectations updated (#45138) by @Abdennacer-Badaoui in [#45138]
- fix: correct type annotations across config classes for @strict validation (#45007) by @Krishnachaitanyakc in [#45007]
- Fix T5Attention shape mismatch under Tensor Parallelism (#45109) by @aws-zhanxun in [#45109]
- [refactor] Serving into proper modules (#44796) by @SunMarc in [#44796]
- Re-add regex substitutions to the response parsing spec (#45166) by @Rocketknight1 in [#45166]
- Fix incorrect TrainingArguments example in training.md (#45150) by @maanas1234 in [#45150]
- Add parse_response to Processor, make it a bit more official (#45143) by @Rocketknight1 in [#45143]
- DeepGEMM (#44832) by @IlyasMoutawwakil in [#44832]
- fix: prefer registered config over remote code in AutoConfig.from_pretrained (#45094) by @HanFa in [#45094]
- [serving] Fix continuous batching JSON response serialization (#45057) by @NathanHB in [#45057]
- Fix stupid test fetcher (#45140) by @ydshieh in [#45140]
- [CB] Add warmup feature (#45112) by @remi-or in [#45112]
- feature: added import complexity checker (#45013) by @tarekziade in [#45013]
- Fix tests for janus model (#44739) by @kaixuanliu in [#44739]
- CB improvements for serving (#45063) by @SunMarc in [#45063]
- [docs] continuous batching (#44896) by @stevhliu in [#44896]
- Fix few issues in Qwen_3_Omni_Moe (#44848) by @Sai-Suraj-27 in [#44848]
- Fix TypeError in rope validation when ignore_keys is a list (#45069) by @Fr0do in [#45069]
- Remove unused TensorFlow env var (#45065) by @Sai-Suraj-27 in [#45065]
- fix: add identity reverse_op to dequantize ops for save_pretrained (#44983) by @Hyungkeun-Park-Nota in [#44983]
- Fix when RoPE params are in kwargs (#45049) by @zucchini-nlp in [#45049]
- chore: update update_metadata.yml (#45054) by @hf-security-analysis[bot] in [#45054]
- [FA] Fix BC support for a few versions + add deprecation cycle (#45061) by @vasqu in [#45061]
- fix(testing): Fix Parakeet, Evolla, Pi0, and Phi-3 test failures on main CI (#45004) by @harshaljanjani in [#45004]
- Allow advanced users to override model_type in AutoConfig.from_pretrained (#45058) by @hmellor in [#45058]
- Fix failing SmolLM3IntegrationTest (#45048) by @Sai-Suraj-27 in [#45048]
- chore: remove old extras (#45024) by @tarekziade in [#45024]
- Embedding VLMs don't need a head (#45000) by @zucchini-nlp in [#45000]
- Fix GraniteConfig type hints to accept int for multiplier fields (#45019) by @javierdejesusda in [#45019]
- fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985) by @Krishnachaitanyakc in [#44985]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
@ed22699
Internalise the NomicBERT model (#43067)
@tarekziade
Use doc-builder runnable example for GLM-ASR (#44277)
refactoring: speedup static checks with disk cache (#44992)
feature: added import complexity checker (#45013)
refactor: added cache in check_repo (#45012)
chore: remove old extras (#45024)
chore: Fix mlinter cache location (#45052)
refactor: speed up docstring checker (#45009)
@Krishnachaitanyakc
fix: correct type annotations across config classes for @strict validation (#45007)
fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985)
@lashahub
Add Music Flamingo (#43538)
@Lidang-Jiang
[Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045)
- Mar 27, 2026
Release v5.4.0: PaddlePaddle models 🙌, Mistral 4, PI0, VidEoMT, UVDoc, SLANeXt, Jina Embeddings v3
transformers adds a broad release with new multimodal and OCR models, including Mistral 4, Jina Embeddings v3, VidEoMT, UVDoc, PI0, SLANeXt and PP-OCRv5, plus speedups in quantization, tokenization, kernels, caching and parallelism, alongside important breaking-change updates.
New Model additions
VidEoMT
Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.
Links: Documentation | Paper
Add VidEoMT (#44285) by @NielsRogge in #44285
UVDoc
UVDoc is a machine learning model for document image rectification. It applies geometric transformations to correct distortion, skew, perspective deformation, and other problems in document images, and supports both single-image and batched inference for processing distorted documents.
Links: Documentation
[Model] Add UVDoc Model Support (#43385) by @XingweiDeng in #43385
Jina Embeddings v3
Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE) to support input sequences of up to 8192 tokens. It also features 5 built-in task-specific LoRA adapters that let the model generate task-specific embeddings (e.g., for retrieval vs. classification) without significantly increasing inference latency.
Links: Documentation | Paper
Add Jina-Embeddings-V3 Model (#44251) by @Sai-Suraj-27 in #44251
Mistral4
Mistral 4 is a powerful hybrid model that can act as both a general instruction model and a reasoning model. It unifies the capabilities of three model families - Instruct, Reasoning (previously called Magistral), and Devstral - into a single model. It features an MoE architecture with 128 experts (4 active per token), 119B total parameters with 6.5B activated per token, a 256k context length, and multimodal input with both text and image processing capabilities.
Links: Documentation
Add Mistral 4 (#44760) by @juliendenize in #44760
PI0
PI0 is a vision-language-action model for robotics manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks like laundry folding, table cleaning, and assembling boxes across multiple robot platforms including single-arm robots, dual-arm robots, and mobile manipulators.
Links: Documentation | Paper
Add model lerobot PI0 to transformers (#44160) by @molbap in #44160
SLANeXt
SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.
Links: Documentation
[Model] Add SLANeXt Model Support (#43707) by @liu-jiaxuan in #43707
PP-OCRv5_mobile_rec
PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
[Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808
PP-OCRv5_server_rec
PP-OCRv5_server_rec is a high-performance text recognition model optimized for server-side deployment, focusing on accurate recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
[Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808
PP-OCRv5_mobile_det
PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
[Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PPLCNet
PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.
Links: Documentation
[Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PPLCNetV3
PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.
Links: Documentation | Paper
[Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PP-OCRv5_server_det
PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
[Model] Add PP-OCRV5_server_det Model Support (#43274) by @XingweiDeng in #43274
CHMv2
CHMv2 is a global, meter-resolution canopy height mapping model that uses DINOv3 to estimate forest canopy heights from high-resolution optical satellite imagery. Building on the original canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging Meta's self-supervised vision model. The model is trained against airborne laser scanning data and provides essential information for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure.
Links: Documentation | Paper | Blog Post
Add CHMv2 (#44595) by @yonigozlan in #44595
Breaking changes
The dual BaseImageProcessor/BaseImageProcessorFast design has been replaced with a unified backend architecture, and the image_processing_utils_fast module has been removed — users should migrate to the new unified image_processing_utils module.
🚨🚨 Refactor Image Processors to support different backends (#43514) by @yonigozlan
PreTrainedConfig and model config classes have been refactored to use @dataclass and no longer accept positional arguments — users must update any config instantiation calls to use keyword arguments only.
🚨 Validate config attributes (#41250) by @zucchini-nlp
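In practice the migration is mechanical; a minimal sketch, assuming BertConfig keeps its usual field names:

```python
from transformers import BertConfig

# Positional arguments are no longer accepted by the @dataclass-based configs:
# config = BertConfig(30522, 768)   # rejected after this change

# Use keyword arguments instead:
config = BertConfig(vocab_size=30522, hidden_size=768)
```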
Flash Attention 2 (FA2) support now requires version 2.3.3 or newer, and initial Flash Attention 4 (FA4) support has been added — users on older FA2 versions must upgrade to at least 2.3.3.
🚨 [FA4] Initial support (#42435) by @vasqu
Weight tying behavior has changed so that weights are now tied even when both keys are already present in a checkpoint — users relying on the previous behavior (e.g., with .bin checkpoints containing duplicate keys) should verify their models load as expected.
[tie weights] 🚨 If both weights are present with same weights, still tie them (#44497) by @Cyrilvallez
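Users who want to verify how their checkpoints load can compare the storage pointers of the input and output embeddings; a quick check, using a model that ties its embeddings:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # gpt2 ties its embeddings
tied = (model.get_input_embeddings().weight.data_ptr()
        == model.get_output_embeddings().weight.data_ptr())
print(tied)  # True when the weights share storage
```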
The cache_position argument has been removed from the forward signatures of most major models — users passing cache_position directly to these models should remove it, as it is now handled internally by generate.
[core] 🚨 Completely remove cache positions (#44181) by @Cyrilvallez
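A hedged before/after sketch of a manual forward call; the model and variable names are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tok("Hello", return_tensors="pt").input_ids

# Before (older releases), some decoding loops threaded positions by hand:
#   model(input_ids, past_key_values=cache, cache_position=cache_position)
# After: omit the argument entirely; generate() handles positions internally.
outputs = model(input_ids)
```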
Parallelization
Several bug fixes and improvements were made to pipeline parallel (PP) and tensor parallel (TP) support, including fixing supports_tp/pp_plan detection, resolving attribute errors in PP for Qwen2VL-based models, correcting FSDP loading with meta devices, and ensuring TP weight sharding properly updates parent module attributes (e.g., in_features/out_features) to improve compatibility with libraries like PEFT.
Fix several based models' pipeline parallel support (#44699) by @hmellor in [#44699]
[Model] Add PP-Chart2Table Model Support (#43767) by @XingweiDeng in [#43767]
enable tp for benchmark (#43750) by @sywangyi in [#43750]
Fix supports_{tp/pp}_plan (#44696) by @hmellor in [#44696]
Allow to disable stdout hiding for TP (#44608) by @michaelbenayoun in [#44608]
fix FSDP loading with meta devices (#44473) by @winglian in [#44473]
Fix: Conditionally import torch.distributed.fsdp in trainer_seq2seq.py (#44507) by @0xDELUXA in [#44507]
Supplement skip logic for XPU in the CPU-only tp tests (#44536) by @YangKai0616 in [#44536]
Update parent module attributes when sharding with TP (#44421) by @michaelbenayoun in [#44421]
trigger tensor parallel utils test in the CI (#44460) by @3outeille in [#44460]
Quantization
Quantization support was improved with up to 30x faster FP8 grouped and batched matmuls, static FP8 expert support for multi-GPU setups, and a torchao minimum version bump to 0.15.0. Additionally, MXFP4 dependency error messages were made more actionable, and AWQ tests were updated to align with the GPTQModel migration.
fix: split MXFP4 dependency checks for specific error messages (#44930) by @javierdejesusda in [#44930]
Add static FP8 expert support (#44895) by @SunMarc in [#44895]
Bump torchao >=0.15 and fix quantization CI (#44604) by @SunMarc in [#44604]
Fix AWQ tests for GPTQModel migration (#44654) by @jiqing-feng in [#44654]
[Performance] FP8 Grouped and Batched Matmuls (#44231) by @IlyasMoutawwakil in [#44231]
Fix PR comment CI for quantization job (#44579) by @ydshieh in [#44579]
Tokenization
Several performance improvements were made to tokenizer loading and saving, including eliminating redundant file parsing and unnecessary deep copies of large vocabularies that caused significant overhead. Additionally, bug fixes were applied for incorrect tokenizer class names on the Hub (DeepSeek V2/V3, ModernBERT), a clean_up_tokenization_spaces misconfiguration in Llama 3 tokenizer conversion, and a string replacement issue in AutoTokenizer class name resolution.
fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927) by @ydshieh in [#44927]
fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894) by @ydshieh in [#44894]
fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion (#44914) by @maxsloef-goodfire in [#44914]
deepseek_v2, deepseek_v3, and modernbert fix for having incorrect tokenizer class on the hub (#44801) by @itazap in [#44801]
Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428) by @kaixuanliu in [#44428]
fix(tokenizer): Only strip Fast from class names in AutoTokenizer if used as a suffix (#44443) by @harshaljanjani in [#44443]
Kernels
Kernel support has been expanded with Flash Attention 4 fallback integration, a paged_attention kernel for continuous batching, and Neuron device support for custom kernels. Several stability fixes were also made, including bumping the kernels version dependency to prevent crashes and correcting the LFM2 kernel path.
[FA4] Add kernels fallback (#44797) by @vasqu in [#44797]
Bump kernels version dependency to avoid crashes (#44887) by @Cyrilvallez in [#44887]
Fix lfm2 kernel path (#44634) by @Cyrilvallez in [#44634]
[CB] Add paged_attention kernel (#44379) by @remi-or in [#44379]
Neuron kernels integration (#44417) by @michaelbenayoun in [#44417]
Cache
Several cache-related fixes and improvements were made, including aligning LFM2's cache implementation with other Mamba caches, fixing a tensor indexing crash in KV cache continuation for the transformers serve streaming endpoint, and resolving a generation bug in Idefics3 when using use_cache=False. A caching layer was also added to the model linter to skip unchanged valid files and improve build performance.
Align lfm2 cache to other mamba caches (#44866) by @Cyrilvallez in [#44866]
feat: added cache to the model linter (#44790) by @tarekziade in [#44790]
Fix tensor indexing crash in serve generate_response KV cache continuation (#44735) by @mango766 in [#44735]
Idefics3 without cache fix (#44607) by @gabe-l-hart in [#44607]
Vision
Fixed backward compatibility for full-path imports of Fast Image Processors and resolved a Llama4 vision rotary embedding initialization error where freqs_ci was not registered as a buffer, causing failures when loading models with device_map="auto".
Fix backward compatibility for full path imports of Fast Image Processors (#44926) by @yonigozlan in [#44926]
fix(models, testing): Fix Llama4 vision rotary meta tensor initialization and MyT5 get_tokenizer signature (#44581) by @harshaljanjani in [#44581]
Fix AMD Docker image build timeout by pinning Flash Attention commit (#44546) by @Abdennacer-Badaoui in [#44546]
Generation
The cache_position argument has been fully removed from the generation pipeline, as all models have been updated to no longer use it (with a backward-compatibility path retained for remote code models). Additionally, integration tests for LASR with chunked decoding were added, and outdated references to deprecated pipeline tasks were cleaned up.
[generate] Never use cache_position anymore in generation (#44816) by @Cyrilvallez in [#44816]
Add an integration test for LASR using pipe and chunked decoding (#42823) by @kho in [#42823]
Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510) by @math-hiyoko in [#44510]
Bugfixes and improvements
Dynamic weight conversion is recursive (#44300) by @zucchini-nlp in [#44300]
Don't run tests_hub if no tests found (#45014) by @ydshieh in [#45014]
Fix type hint for attention_chunk_size in Llama4TextConfig (#45002) by @hmellor in [#45002]
Fix AutoProcessor.from_pretrained silently dropping hub kwargs (#44710) by @he-yufeng in [#44710]
Fix maybe_autocast crashing on meta device tensors (#44984) by @Butanium in [#44984]
fix: remove Copied from comments between @torch.jit.script and def for Python 3.13 compat (#44986) by @Krishnachaitanyakc in [#44986]
More small vllm fixes (#44990) by @ArthurZucker in [#44990]
fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size (#44899) by @harshaljanjani in [#44899]
Allow mm_token_type be non-padded lists (#44563) by @zucchini-nlp in [#44563]
Fix CPU 16 bytes alignment issue using equivalent fallback (#44970) by @IlyasMoutawwakil in [#44970]
refactor: unify QA calls (#44879) by @tarekziade in [#44879]
Fix tie_word_embedding issues with Qwen2VL (#44976) by @hmellor in [#44976]
Support Modular (!!) + Configs in check_auto_docstrings (#44803) by @yonigozlan in [#44803]
[ vllm x v5] nit (#44971) by @ArthurZucker in [#44971]
LwDetrImageLoss: Fix dtype casting to prevent crash when using amp on cuda device (#44886) by @m-matthias in [#44886]
[AMD CI] Gemma3/Gemma3n Expectations (#44972) by @Abdennacer-Badaoui in [#44972]
Officially launch parse_response (#44674) by @Rocketknight1 in [#44674]
fix load_best_model_checkpoint_at_end do not load the best model chec… (#44583) by @wilnn in [#44583]
Fix failing T5ModelIntegrationTest (#44934) by @Sai-Suraj-27 in [#44934]
Config kwargs (#44953) by @zucchini-nlp in [#44953]
[CB] [Minor] Simplify test suite (#44858) by @remi-or in [#44858]
Allow arbitrary template kwargs in processors (#44881) by @zucchini-nlp in [#44881]
Fix missing post_processor in DebertaV2Tokenizer causing no special t… (#44570) by @umbilnm in [#44570]
incorrect model list update (#44880) by @itazap in [#44880]
refactor: mlinter as its own package (#44939) by @tarekziade in [#44939]
[CB] Add an option to return logprobs (#44835) by @remi-or in [#44835]
[docs] peft (#44804) by @stevhliu in [#44804]
Continuous batching thread safety (#44924) by @Qubitium in [#44924]
Fix variable shadowing in pipeline example and typo in BART docs (BERT → BART) (#44935) by @VanshikaSohal in [#44935]
Fix failing job Update Transformers metadata after #43514 (#44941) by @ydshieh in [#44941]
Clearer type hints and fix rope validation in configs (#44943) by @zucchini-nlp in [#44943]
Correct docstrings for from_pretrained (url input deprecated) (#44946) by @BSchilperoort in [#44946]
fix(i18n): replace broken relative links to awesome-transformers.md with absolute URLs (#44905) by @NicoleRobin in [#44905]
chore(typing): added rule 11 (#44865) by @tarekziade in [#44865]
fix(camembert): add tie_word_embeddings=True to CamembertConfig (#44931) by @r266-tech in [#44931]
Support SizeDict import in get_size_dict (#44903) by @yonigozlan in [#44903]
Add big angry code agent warnings! (#44890) by @Rocketknight1 in [#44890]
[docs] model cards (#44837) by @stevhliu in [#44837]
Add backward compatibility for direct imports from legacy image_processing_utils_fast (#44897) by @yonigozlan in [#44897]
Fix core dumped when NemotronH is torch compiled (#44854) by @ydshieh in [#44854]
fix(testing): Fix PaliGemma 2 and PaddleOCR-VL test failures on main (#44765) by @harshaljanjani in [#44765]
Fix dtype guessing from state dict (#44883) by @Cyrilvallez in [#44883]
Add missing dunder methods to SizeDict (#44884) by @hmellor in [#44884]
Fix VL model rope_deltas batch size mismatch in online RL training (#44873) by @sergiopaniego in [#44873]
Fix layer_types type hint for AFMoE and Llama4 (#44874) by @hmellor in [#44874]
Fix nemotron config docstrings (#44878) by @Cyrilvallez in [#44878]
Fix nemotron_h modular (#44876) by @Cyrilvallez in [#44876]
[Mistral] Fix query scaling for Mistral4 and Ministral3 (#44860) by @Cyrilvallez in [#44860]
Update some type hints (#44851) by @zucchini-nlp in [#44851]
Fix glm dsa (#44564) by @ArthurZucker in [#44564]
Update AFMoE architecture to use v5-style MoE impl (#44063) by @AutumnAurelium in [#44063]
Fix KeyError in convert_to_native_format for dict vocab (#44452) by @ in [#44452]
fix: XLNet: relative_positional_encoding computes on CPU every forward (#44782) by @JiwaniZakir in [#44782]
Fix annotations reader for python 3.14 in PreTrainedModel (#44672) by @neo in [#44672]
[CB] Better parametrization for compile (#44578) by @remi-or in [#44578]
Fix KeyError when patching mistral regex (#43376) by @LeonardoEmili in [#43376]
Correct code block formatting in weightconverter.md (#44839) by @zhulinchng in [#44839]
feat(ci): added a network debug report (#44636) by @tarekziade in [#44636]
Add GreedyLR adaptive learning rate scheduler (#44271) by @balak4 in [#44271]
Fix unexpected position_ids keys when loading OwlViT models (#44508) by @KartikPawade in [#44508]
Update more modular examples (#44834) by @Cyrilvallez in [#44834]
Fix and re-run modular converter on examples (#44833) by @Cyrilvallez in [#44833]
Remove cache_position in more models (4 and last one) (#44828) by @Cyrilvallez in [#44828]
Fix loading issue in Sam3 (#44831) by @zucchini-nlp in [#44831]
feat(integration): Add KubeflowCallback to enable automatic progress … (#44487) by @abhijeet-dhumal in [#44487]
Add GGUF support for MiniMax-M2.1 model (#44526) by @JoursBleu in [#44526]
Centralize AI agent templates in .ai (#44489) by @tarekziade in [#44489]
support xxxFast alias in v5 tokenizers (#44766) by @itazap in [#44766]
Remove cache_position in more models (3) (#44759) by @Cyrilvallez in [#44759]
[CI] Temporarily skip Mistral4 tests as they almost all fail (#44825) by @Cyrilvallez in [#44825]
[Gemma] Update conversion scripts for Transformers v5 Compatibility (#44631) by @RyanMullins in [#44631]
fix bug embedding_size mismatch with hidden_size in electra model test (#44657) by @kaixuanliu in [#44657]
Fix pegasus conversion (#44571) by @ArthurZucker in [#44571]
Fix repo-check bot (#44812) by @ydshieh in [#44812]
[docs] is_causal feature (#44777) by @stevhliu in [#44777]
docs(tasks): remove references to removed question-answering pipeline (#44787) by @ in [#44787]
Fix configs with @strict (#44770) by @zucchini-nlp in [#44770]
[AMD CI] Fix test failures across important models (#44632) by @Abdennacer-Badaoui in [#44632]
Move VLM conversions to the main mapping (#44627) by @zucchini-nlp in [#44627]
Fix config loading issues (type issues) (#44789) by @ydshieh in [#44789]
Remove is_causal from EuroBertConfig (#44774) by @ydshieh in [#44774]
model-linter: Added rule 10 (#44761) by @tarekziade in [#44761]
[fix] mistral 4 docs (#44776) by @stevhliu in [#44776]
Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767) by @tarekziade in [#44767]
fix: sig lip import (#44764) by @tarekziade in [#44764]
Disable async loading when quantizing on the fly (#44576) by @SunMarc in [#44576]
[MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656) by @juliendenize in [#44656]
Fix mlcd auto config/model/mapping issues (#44730) by @ydshieh in [#44730]
Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733) by @kaixuanliu in [#44733]
[medasr] doc update (#44633) by @eustlb in [#44633]
Fix missing / incorrect config class in some model class definitions (#44715) by @ydshieh in [#44715]
Update Nvidia CI docker file to use torch 2.10 (#44712) by @ydshieh in [#44712]
[FA] Fix fa detection (#44703) by @vasqu in [#44703]
Fix set_encoder (#44698) by @hmellor in [#44698]
[docs] cb config (#44675) by @stevhliu in [#44675]
Fix more model tester missing parent issue (#44685) by @ydshieh in [#44685]
Add register method for ParallelInterface (#44640) by @michaelbenayoun in [#44640]
[CB] [Bug] Fix crashes when running without cuda (#44673) by @remi-or in [#44673]
Another (small) set of fixes required for tiny model creation (#44666) by @ydshieh in [#44666]
Fix CookieCutter (#44334) by @NielsRogge in [#44334]
pipelines do not have modelcard (#44621) by @KoichiYasuoka in [#44621]
[Chmv2] Fix conversion after capture refactor (#44665) by @vasqu in [#44665]
[CB] Add dedicated config (#44434) by @remi-or in [#44434]
fix(models): Forward timm model kwargs to timm.create_model for OmDet-Turbo (#44611) by @harshaljanjani in [#44611]
Ensure same dtype for subconfig when _from_config (#44629) by @zucchini-nlp in [#44629]
Remove cache_position in more models (2) (#44602) by @Cyrilvallez in [#44602]
fix: cast to proper dtype in EmbeddingParallel (#44612) by @michaelbenayoun in [#44612]
Remove many output_attentions and other traced outputs on 100+ models (#43590) by @molbap in [#43590]
fix: raise error if mm_token_type_ids not supplied (#44433) by @leopold-tzafon in [#44433]
Fix output capturing for Backbones (#44638) by @Cyrilvallez in [#44638]
Fix for VibeVoiceAcousticTokenizer (#44628) by @ydshieh in [#44628]
Fix off-by-one in decode_spans boundary check (#44584) by @mvanhorn in [#44584]
Fix more wrong HF hub checkpoint names (#44624) by @ydshieh in [#44624]
Update agentic contributions guidelines in AGENTS.md to force yielding. (#44411) by @burtenshaw in [#44411]
Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174) by @tarekziade in [#44174]
feat: add neuron in tensor parallelism initialization (#44498) by @michaelbenayoun in [#44498]
[WIP] FIX Make Mixtral LoRA loading work (#44478) by @BenjaminBossan in [#44478]
Fix Llava tests for torch too! (#44476) by @Rocketknight1 in [#44476]
Fix training ci and clean some tests (#44491) by @SunMarc in [#44491]
Remove useless identity assignment (#44600) by @Cyrilvallez in [#44600]
Add Yoni to run-slow workflow (#44598) by @vasqu in [#44598]
Add shared VLM tests (#42964) by @Rocketknight1 in [#42964]
Fix wrong (non-existing) checkpoints (#44549) by @ydshieh in [#44549]
Remove cache_position in more models (#44330) by @Cyrilvallez in [#44330]
Fix CircleCI summary report not showing due to missing dependency (#44597) by @ydshieh in [#44597]
Fix typos in add_new_model_like docstrings (#43544) by @Olexandr88 in [#43544]
Fix UnboundLocalError for tp_plan_alt when tp_plan is empty (#44540) by @YangKai0616 in [#44540]
FIX Multiple PEFT errors after v5 transition (#44592) by @BenjaminBossan in [#44592]
Fix missing BPE token conversion step in Chameleon (#44582) by @yonigozlan in [#44582]
Make paligemma embed tokens standard (#44432) by @zucchini-nlp in [#44432]
chore(typing): Add type checking to src/transformers/quantizers (#44412) by @tarekziade in [#44412]
Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577) by @tarekziade in [#44577]
[device_map] Fix device_map computation by correctly adjusting memory available (#44565) by @Cyrilvallez in [#44565]
Fix error message label and docstring default in load_sharded_checkpoint (#44523) by @jnMetaCode in [#44523]
Correct Tapas initialization (#44575) by @Rocketknight1 in [#44575]
[fix] Prevent crash with Apertus without xielu installed (#44567) by @tomaarsen in [#44567]
Fix failing MusicgenStereo integration tests (#44527) by @Sai-Suraj-27 in [#44527]
Fix zamba2 rotary embedding call when use_mem_rope is False (#44551) by @echarlaix in [#44551]
[Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474) by @JJJYmmm in [#44474]
add XPU Expectations for higgs_audio_v2 tests (#44482) by @kaixuanliu in [#44482]
chameleon added to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (#44475) by @itazap in [#44475]
Revert "test merge queue 1" (#44552) by @ydshieh in [#44552]
test merge queue 1 (#44529) by @ydshieh2 in [#44529]
fix(testing): Fix MoonshineEncoder UnboundLocalError and Florence2VisionBackbone dtype mismatch (#44503) by @harshaljanjani in [#44503]
Fix: Remove references to transformers run command (#44513) by @math-hiyoko in [#44513]
[LW-DETR] Fix training (#44441) by @NielsRogge in [#44441]
Make _prepare_input_fn and _prepare_output_fn instance methods (#44499) by @michaelbenayoun in [#44499]
Fix ShieldGemma2 non-reproducible outputs by adding _tied_weights_keys (#44358) by @hardikmeisheri in [#44358]
Tensor Parallelism and mps device (#44506) by @michaelbenayoun in [#44506]
Fix failing GPTNeoModelLanguageGenerationTest (#44515) by @Sai-Suraj-27 in [#44515]
Fix failing MarianIntegrationTests (#44519) by @Sai-Suraj-27 in [#44519]
fix pin_memory for contiguous batching (#44455) by @jiqing-feng in [#44455]
Fix continuous batching for multimodal models (#44436) by @jw9603 in [#44436]
Fix KeyError in _parse_type_hint when Union contains Any (#44525) by @jnMetaCode in [#44525]
Fix AssistantTracker.is_active() returning False after activation with empty lists (#44524) by @jnMetaCode in [#44524]
Fix and re-enable extra_state tests (#43510) by @pstjohn in [#43510]
Fix ansi codes in loading reports when not connected to terminal (#44544) by @Cyrilvallez in [#44544]
Follow-up typing checking fixes (#44500) by @tarekziade in [#44500]
Fix backend dependency (#44542) by @Cyrilvallez in [#44542]
Add a new job in build_pr_documentation.yml (will be the new required job) (#44538) by @ydshieh in [#44538]
Update build_pr_documentation workflow for merge_group event (#44532) by @ydshieh in [#44532]
Fixed typo in docs/source/en/kv_cache.md (#44501) by @frogNotToad in [#44501]
Docs: fix SigLIP2 usage examples (#43641) by @KOKOSde in [#43641]
Fix type checker (#44502) by @Cyrilvallez in [#44502]
Add MLU bf16 support to is_torch_bf16_gpu_available (#44381) by @carcel-yu in [#44381]
fix model parallelism bug for eurobert model (#44490) by @kaixuanliu in [#44490]
Update ty to 0.0.20 (#44494) by @tarekziade in [#44494]
Add auto-docstring on configs (#44296) by @zucchini-nlp in [#44296]
Fix failed unit tests for moonshine_streaming model (#43936) by @kaixuanliu in [#43936]
Update distributed tests (#44338) by @SunMarc in [#44338]
Add diffusers to CI docker file (#44480) by @ydshieh in [#44480]
Replace placeholder tokens as specified in added_tokens_decoder (#44468) by @itazap in [#44468]
[vLLM] Fix backward compatibility with hardcoded subprocessors classes in processors (#44447) by @yonigozlan in [#44447]
[remote code/vllm] Fix incorrect tied weights (#44469) by @Cyrilvallez in [#44469]
Integrate the Neuron device to TrainingArguments (#44302) by @michaelbenayoun in [#44302]
Fix failing DepthProModelIntegrationTest (#44456) by @Sai-Suraj-27 in [#44456]
[timesfm2_5] fix loss scaling (#44465) by @kashif in [#44465]
Fix failing ProphetNetModelIntegrationTest (#44439) by @Sai-Suraj-27 in [#44439]
[Trainer] fix SP loss (#44461) by @kashif in [#44461]
skip 1 invalid test case for higgs_audio_v2 (#44350) by @kaixuanliu in [#44350]
Fix position_ids typo in Qwen3_5TextModel forward pass (#44399) by @ in [#44399]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
@ydshieh
Don't run tests_hub if no tests found (#45014)
Fix failing job Update Transformers metadata after #43514 (#44941)
fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927)
fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894)
Fix core dumped when NemotronH is torch compiled (#44854)
Fix repo-check bot (#44812)
Fix config loading issues (type issues) (#44789)
Remove is_causal from EuroBertConfig (#44774)
Fix mlcd auto config/model/mapping issues (#44730)
Fix missing / incorrect config class in some model class definitions (#44715)
Update Nvidia CI docker file to use torch 2.10 (#44712)
Fix more model tester missing parent issue (#44685)
Another (small) set of fixes required for tiny model creation (#44666)
Fix for VibeVoiceAcousticTokenizer (#44628)
Fix more wrong HF hub checkpoint names (#44624)
Fix wrong (non-existing) checkpoints (#44549)
Fix CircleCI summary report not showing due to missing dependency (#44597)
Fix PR comment CI for quantization job (#44579)
Revert "test merge queue 1" (#44552)
Add a new job in build_pr_documentation.yml (will be the new required job) (#44538)
Update build_pr_documentation workflow for merge_group event (#44532)
Add diffusers to CI docker file (#44480)
@NielsRogge
Add VidEoMT (#44285)
Fix CookieCutter (#44334)
[LW-DETR] Fix training (#44441)
@tarekziade
refactor: unify QA calls (#44879)
refactor: mlinter as its own package (#44939)
chore(typing): added rule 11 (#44865)
feat: added cache to the model linter (#44790)
feat(ci): added a network debug report (#44636)
Centralize AI agent templates in .ai (#44489)
model-linter: Added rule 10 (#44761)
Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767)
fix: sig lip import (#44764)
Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174)
chore(typing): Add type checking to src/transformers/quantizers (#44412)
Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577)
Follow-up typing checking fixes (#44500)
Update ty to 0.0.20 (#44494)
@Sai-Suraj-27
Fix failing T5ModelIntegrationTest (#44934)
Add Jina-Embeddings-V3 Model (#44251)
Fix failing MusicgenStereo integration tests (#44527)
Fix failing GPTNeoModelLanguageGenerationTest (#44515)
Fix failing MarianIntegrationTests (#44519)
Fix failing DepthProModelIntegrationTest (#44456)
Fix failing ProphetNetModelIntegrationTest (#44439)
@remi-or
[CB] [Minor] Simplify test suite (#44858)
[CB] Add an option to return logprobs (#44835)
[CB] Better parametrization for compile (#44578)
[CB] [Bug] Fix crashes when running without cuda (#44673)
[CB] Add dedicated config (#44434)
[CB] Add paged_attention kernel (#44379)
@XingweiDeng
[Model] Add UVDoc Model Support (#43385)
[Model] Add PP-Chart2Table Model Support (#43767)
[Model] Add PP-OCRV5_mobile_det Model Support (#43247)
[Model] Add PP-OCRV5_server_det Model Support (#43274)
@vasqu
[FA4] Add kernels fallback (#44797)
[FA] Fix fa detection (#44703)
🚨 [FA4] Initial support (#42435)
[Chmv2] Fix conversion after capture refactor (#44665)
Add Yoni to run-slow workflow (#44598)
@liu-jiaxuan
[Model] Add SLANeXt Model Support (#43707)
@zhang-prog
[Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808)
@balak4
Add GreedyLR adaptive learning rate scheduler (#44271)
@kaixuanliu
fix bug embedding_size mismatch with hidden_size in electra model test (#44657)
Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733)
Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428)
add XPU Expectations for higgs_audio_v2 tests (#44482)
fix model parallelism bug for eurobert model (#44490)
Fix failed unit tests for moonshine_streaming model (#43936)
skip 1 invalid test case for higgs_audio_v2 (#44350)
@juliendenize
Add Mistral 4 (#44760)
[MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656)
@molbap
Add model lerobot PI0 to transformers (#44160)
Remove many output_attentions and other traced outputs on 100+ models (#43590)
@JJJYmmm
[Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474)
@math-hiyoko
Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510)
Fix: Remove references to transformers run command (#44513)
- Mar 4, 2026
v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2
transformers adds new multilingual, audio, time-series, and document models, including EuroBERT, VibeVoice ASR, TimesFM 2.5, PP-DocLayoutV2, OLMo Hybrid, ModernVBERT, and Higgs Audio V2, alongside breaking changes, quantization updates, and broad fixes.
New Model additions
EuroBERT
EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
Add eurobert (#39455) by @ArthurZucker in #39455
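A minimal fill-mask sketch; the checkpoint id is assumed from the EuroBERT release and should be verified on the Hub:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="EuroBERT/EuroBERT-210m")
mask = fill.tokenizer.mask_token  # the mask token varies by checkpoint
print(fill(f"The capital of France is {mask}.")[0]["token_str"])
```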
VibeVoice ASR
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
Add VibeVoice ASR (#43625) by @ebezzam in #43625
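A hedged usage sketch through the standard ASR pipeline; the checkpoint id below is a placeholder, not a confirmed Hub id:

```python
from transformers import pipeline

# Placeholder id: substitute the released VibeVoice ASR weights.
asr = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
print(asr("meeting.wav")["text"])
```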
TimesFM2.5
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
Timesfm 2.5 (#41763) by @kashif in #41763
PP-DocLayoutV2
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
[Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018
OlmoHybrid
OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.
Links: Documentation
Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358
ModernVBert
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
Add ModernVBERT models (#42504) by @paultltc in #42504
ColModernVBert
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
Add ModernVBERT models (#42504) by @paultltc in #42504
Higgs Audio V2
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
Add Higgs Audio V2 Model (#40294) by @szhengac in #40294
Higgs Audio V2 Tokenizer
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
Add Higgs Audio V2 Model (#40294) by @szhengac in #40294
Breaking changes
Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille
The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu
Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.
🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1
3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
🚨 Unify 3D position ids (#43972) by @zucchini-nlp
🚨 Tokenizer x vLLM fixes 🚨:
Unigram tokenizers were missing the spm precompiled charsmap support. We ran an overall v4 vs v5 regression test and fixed what we had missed.
This was done in:
[vllm + v5 fix] handle TokenizersBackend fallback properly for v5 (#44255) by @itazap
Generation
Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
[higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]
fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]
Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]
[generate] Always pass full input_ids in prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]
fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]
[generate] Completely stop relying on cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]
Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]
Tokenization
Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.
[tiny] Add olmo_hybrid to tokenizer auto-mapping (#44416) by @tyler-romero in [#44416]
fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor (#44362) by @harshaljanjani in [#44362]
update fuyu tokenizer class (#44235) by @itazap in [#44235]
fix(testing): Fix LayoutXLM tokenization test and LightOnOCR SDPA flash test failures on main CI (#43988) by @harshaljanjani in [#43988]
[docs] tokenizer summary (#43965) by @stevhliu in [#43965]
[docs] refactor tokenizer docs (#43900) by @stevhliu in [#43900]
Kernels
Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.
Fix kernels security issue (#44395) by @Cyrilvallez in [#44395]
Enable Liger Kernel when doing hyperparameter search. (#44329) by @linfeng-du in [#44329]
[Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]
[Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]
Fix percentage formatting in help messages for gradient checkpointing, Liger Kernel, and empty cache steps (#44100) by @qgallouedec in [#44100]
Quantization
This release adds several new quantization backends and fixes, including Metal quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.
[Quantization] Fixing mxfp4 saving using reverse_op (#43148) by @MekkCyber
[Quantization] Add metal quantization for MPS devices! (#43934) by @MekkCyber
Enable mxfp4 model on CPU (#43512) by @jiqing-feng
Add Four Over Six quantization integration (#43970) by @jackcook
Vision
Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.
[AMD CI] Add missing timm dependency to ROCm Docker image (#44389) by @Abdennacer-Badaoui
update glm image model expected out for tests (#43907) by @kaixuanliu
Fix image processors from_dict backward compatibility with old remote code (#44245) by @yonigozlan
Bugfixes and improvements
Update PR template (#44415) by @SunMarc
Add Qwen3.5 support for sequence classification (#44406) by @medhakimbedhief
update the expected output for qwen2_5_vl w/ pytorch 2.10 XPU (#44426) by @kaixuanliu
add support for nemotron_3 (#44390) by @liding-nv
[ Dynamic weight loader] fix remote code when format matches (#44396) by @ArthurZucker
[timesfm2_5] fix timesfm2.5 loss (#44331) by @kashif
Fix peft conversion mappings (#44413) by @Cyrilvallez
Reduce tqdm verbosity during model loading (#44414) by @Cyrilvallez
docs: Add NeMo Automodel community integration docs (#44304) by @adil-a
[CB] Small fixes (#44227) by @remi-or
Support non-gated experts (#44319) by @IlyasMoutawwakil
[Bugfix] fix qwen3.5 no split module (#44382) by @JJJYmmm
Fix mutable default arguments and resource leaks (#44287) by @jashshah999
skip 2 invalid test cases for voxtral_realtime model (#44321) by @kaixuanliu
Mamba-1/-2 init weights in mixer class (#43778) by @kevinli573
add expectations for xpu for olmo_hybrid model (#44353) by @kaixuanliu
[VITS] Add speaking_rate as an optionl forward argument (#43283) by @gau-nernst
Strict export cleanup (#44293) by @IlyasMoutawwakil
[docs] kernelconfig fix (#44337) by @stevhliu
Add ProcessingKwargs ImagesKwargs etc. to docs (#44269) by @yonigozlan
Fix typos in comments and docstrings (#44332) by @tysoncung
Add testing guide for agents for trainer tests (#44328) by @SunMarc
Update common tests Trainer (#44260) by @SunMarc
[timesfm2_5] fix timesfm mlp bias (#44325) by @kashif
fix zero3 init config (#44236) by @SunMarc
Update expected output for Jais2 model tests (#43910) by @kaixuanliu
Improve has_similar_generate_outputs assertions (#44166) by @tarekziade
Fix failed test case for exaone_moe model (#43938) by @kaixuanliu
fix(modeling_attn_mask_utils): remove FutureWarning from logger.warning_once() (#44307) by @imstevenpmwork
Remove remaining vestiges of the TranslationPipeline (#43869) by @Rocketknight1
XPU now supports backward for the FA2 fixed path (#43905) by @YangKai0616
Fix: use TokenizersBackend for Olmo3 to preserve custom pre_tokenizer (#44294) by @mario-sanz
Fix special token maps BC (#44281) by @ArthurZucker
[Modular] Fix file type regression (#44283) by @vasqu
[auto_docstring] Improve typing parsing and add tests (#43748) by @yonigozlan
Restore response_schema saving-loading (#44282) by @Rocketknight1
Use associative scan HOP mamba recurrentgemma (#43737) by @riccardofelluga
chore: fixes in Trainer class docs (compute_loss & hyperparameter_search) (#44268) by @ethanknights
fix(trainer): pass optim_args to SGD, Adagrad, and RMSprop optimizers (#44203) by @nightcityblade
fix(utils): Make torch_compilable_check compatible with torch.export strict mode (#44266) by @harshaljanjani
Fix TypeError in convert_rope_params_to_dict when ignore_keys is a list (#44272) by @hangjun-ezra
[docs] callbacks and collators (#44239) by @stevhliu
[docs] trainer part 1 (#44185) by @stevhliu
Remove refs to grouped_entities (#44182) by @Rocketknight1
[mimi] nit (#44237) by @eustlb
Fix local dataset loading priority in run_image_classification_no_tra… (#44199) by @gowthamr-tech
chore: added CLAUDE.md alias (#44232) by @tarekziade
fix: add missing return type annotations to type-checking utilities in generic.py (#44241) by @yushiran
Fix return value - fixes #44238 (#44240) by @tarekziade
fix regression report_to "all" (#44250) by @SunMarc
[fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen
Add processing tests for phi4 multimodal (#44234) by @yonigozlan
fix: VersionComparison.from_string return type mismatch (#43709) by @tarekziade
refactor _inner_training_loop to smaller methods (#44041) by @winglian
[docs] fix broken chat_templating links in tasks docs (#44115) by @Deep-unlearning
Add missing backtick in AnyToAnyPipeline.call docstring (#44229) by @alvarobartt
Docs(it): fix typo in sentencepiece install command (#44218) by @matisgagneux21
Docs(it): fix typo in docstring wording (#44219) by @matisgagneux21
fix bug with position_ids on qwen3-vl models, such that position_ids include text position (#44158) by @leopold-tzafon
Update 404ing BillSum dataset URL on Summarization Task guide (#44212) by @alexandercarruthers
fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding (#44187) by @harshaljanjani
[CB] [Major] Asynchronous batching (#43960) by @remi-or
Fix LASR feature extractor regression from invalid center argument (#44207) by @ainergiz
Models with incorrect tokenizer_class in tokenization_config.json tha… (#44179) by @itazap
chore(typing): initial ty integration (#44167) by @tarekziade
fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173) by @tarekziade
[docs] Add Chinese translations for common NLP task tutorials (#44144) by @TinderZ
[Mimi] Calibrate to ensure encoder streaming performs correctly (#43971) by @caffeinism
ESM2 attention_mask and token_dropout fix (#44163) by @lhallee
bring back our demons: clean_up_tokenization_spaces (#44035) by @ArthurZucker
Fix Seq2SeqTrainingArguments documentation (#35258) by @qgallouedec
AutoGrad support for grouped_mm fallback (#44152) by @IlyasMoutawwakil
Patch setitem on ModelOutput even if the parameter was previously None (#44080) by @tomaarsen
[simple] Fix up repr whitespace/brackets (#44048) by @tomaarsen
[chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen
Raise informative error when loading video processors (#44125) by @zucchini-nlp
fix(flaky): Different approach to make sure loss exists (#43804) by @tarekziade
[voxtral] fix voxtral proc (#44132) by @eustlb
[docs] Fix typos in GenerationConfig docstring (#44143) by @nightcityblade
Fix gemma3n get_audio_features (#44040) by @zucchini-nlp
Fix UMT5EncoderModel embedding weights not being tied after loading (#43880) by @jiqing-feng
fix(testing): Update stale device override test in GraniteSpeech (#44113) by @harshaljanjani
[Misc][vlms] Use text_config when initializing the fine-grained FP8Expert (#44032) by @JJJYmmm
docs: fix typo 'AuoQuant' → 'AutoQuant' and clarify FINEGRAINED_FP8 library column (#44131) by @cluster2600
Update post proc (#44090) by @itazap
Fix: flaky Kosmos2ModelTest test (#44061) by @tarekziade
AutoTokenizer ignores config when model_type is None (#44127) by @itazap
Migrate GPT2 to standardized output capture decorators (#43983) by @Aki-07
grouped_mm fallback (#44043) by @IlyasMoutawwakil
Bump dev version (#44099) by @qgallouedec
Fix loading logic issue (#44095) by @Cyrilvallez
[docs] customizing tokenizers (#43929) by @stevhliu
Merge test_keep_in_fp32_modules and test_keep_in_fp32_modules_strict (#44097) by @Rocketknight1
[voxtral-realtime] update runner expected values (#44096) by @eustlb
Use torch.isfinite (#44069) by @cyyever
add default flash impl (#44081) by @ArthurZucker
Remove unused dependencies (#43904) by @cyyever
Fix patchtsmixer call to post_init (#44082) by @Cyrilvallez
Fix false positive right-padding warning for decoder-only models in pipeline (#44021)
Significant community contributions
The following contributors have made significant changes to the library over the last release:
@ArthurZucker
Add eurobert (#39455)
[ Dynamic weight loader] fix remote code when format matches (#44396)
Fix special token maps BC (#44281)
bring back our demons: clean_up_tokenization_spaces (#44035)
add default flash impl (#44081)
@liding-nv
add support for nemotron_3 (#44390)
@kashif
[timesfm2_5] fix timesfm2.5 loss (#44331)
[timesfm2_5] fix timesfm mlp bias (#44325)
Timesfm 2.5 (#41763)
@remi-or
[CB] Small fixes (#44227)
[CB] [Major] Asynchronous batching (#43960)
@ebezzam
[VibeVoice ASR] Use updated padding cache for ASR model. (#44392)
Add VibeVoice ASR (#43625)
@MekkCyber
[Quantization] Fixing mxfp4 saving using reverse_op (#43148)
[Quantization] Add metal quantization for MPS devices! (#43934)
@tarekziade
perf: Optimize SynthID logits processor batch index construction (#44172)
Improve has_similar_generate_outputs assertions (#44166)
fix(flaky): idefics generate cache flake (#44180)
chore: added CLAUDE.md alias (#44232)
Fix return value - fixes #44238 (#44240)
fix: VersionComparison.from_string return type mismatch (#43709)
fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201)
chore(typing): initial ty integration (#44167)
fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173)
fix(flaky): Different approach to make sure loss exists (#43804)
Fix: flaky Kosmos2ModelTest test (#44061)
@zhang-prog
[Model] Add PP-DocLayoutV2 Model Support (#43018)
@yanhong-lbh
Add OLMo Hybrid model (#43358)
@vasqu
🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)
[Modular] Fix file type regression (#44283)
[Mamba] Fix kernel loading (#44176)
[Flash Attn] Enable compatible implementations (#44177)
@jackcook
Add Four Over Six quantization integration (#43970)
@winglian
refactor _inner_training_loop to smaller methods (#44041)
@paultltc
Add ModernVBERT models (#42504)
@TinderZ
[docs] Add Chinese translations for common NLP task tutorials (#44144)
@szhengac
Add Higgs Audio V2 Model (#40294)
Original source Report a problem - Feb 17, 2026
- Date parsed from source:Feb 17, 2026
- First seen by Releasebot:Mar 20, 2026
v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer
transformers releases VoxtralRealtime, GLM-5, Qwen3.5 and VibeVoice support, bringing new streaming speech, multimodal and large-scale model additions plus a breaking new attention mask interface and broad bug fixes.
New Model additions
VoxtralRealtime
VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
Add Voxtral Realtime (#43769) by @eustlb
GLM-5 - GlmMoeDsa
The Z.ai team launches GLM-5, and introduces it as such:
GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.
Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
Add GlmMoeDsa (#43858) by @Cyrilvallez
Qwen3.5, Qwen3.5 Moe
The Qwen team launches Qwen 3.5, and introduces it as such:
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
Adding Support for Qwen3.5 (#43830) by @bozheng-hit
VibeVoice Acoustic Tokenizer
VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.
One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
Add VibeVoice Acoustic Tokenizer (#43400) by @ebezzam
Breaking changes
🚨 [Attn] New attn mask interface everywhere (#42848)
🚨 Modify ModernBERT's default attention implementation to stop using FA (#43764)
🚨 This one is quite breaking for super super super old models: 🚨 🚨
fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791)
If the config does not have a model_type field, we no longer infer the tokenizer type from the folder name, as was previously done for e.g. https://huggingface.co/prajjwal1/bert-tiny/blob/main/config.json
Bugfixes and improvements
- [docs] deploying (#43241) by @stevhliu
- [Trainer] Move NEFTune impl to standalone functions (#43714) by @SunMarc
- Fix convert_rope_params_to_dict so it uses rope_theta from the config (#43766) by @hmellor
- Bump dev version (#43777) by @qgallouedec
- Improved AGENTS.md (#43763) by @tarekziade
- Fix-release-ubild (#43773) by @ArthurZucker
- unpin torch for CircleCI (#43790) by @ydshieh
- [Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
- fix(testing): Fix BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests (#43798) by @harshaljanjani
- Remove unconditional train_batch_size assignment (#43770) by @lordaarush
- [Repo Consistency] Fix rms norm (#43803) by @vasqu
- fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791) by @tarekziade
- Refactor trainer data_collator and callbacks tests (#43776) by @SunMarc
- [core] Faster and thread-safe check_model_inputs implementation (#43765) by @Cyrilvallez
- [Trainer] use deepspeed SP process group when Accelerate doesn’t build a mesh (#43799) by @kashif
- fix(flaky): enforce manual seed to reduce flakiness (#43794) by @tarekziade
- Add TRL CI bot workflow to trigger tests on PR comments (#43809) by @qgallouedec
- Fix DeepSpeed model preparation logic in Trainer class (#43780) by @qgallouedec
- [docs] reveal more in toctree (#43808) by @stevhliu
- Fix markdown documentation (#43076) by @cyyever
- Fix slack-report workflow file (#43851) by @ydshieh
- add do_sample=False to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
- Fix incorrect timestamp calculation in Qwen3VL Processor (#43659) by @jonathan-fulton
- Remove GPU tracking from TrackioCallback and remove env var support (#43371) by @qgallouedec
- Add id and resume support to SwanLab integration (#43719) by @i-pj
- fix gptoss crash in tp (#43853) by @sywangyi
- Delete batch_split from EncoderDecoderCache (#43814) by @cyyever
- delete unnecessary code to make moe compatible to full graph compile (#43855) by @kaixuanliu
- Update ModelType for Unigram tokenizer (#43860) by @pavel-esir
- [docs] Remove pipeline() examples from summarization/translation tasks (#43831) by @Mr-Neutr0n
- Fix video interpolation in pe_audio_video (#43811) by @Rocketknight1
- Look for the pad_token_id in the right place for Llama4 (#43539) by @Rocketknight1
- Fix cardinality error for DETR models without explicit background class (#43513) by @heathdutton
- docs: Add Switch Transformers docstring notes and update spectrogram comment (#43336) by @harshaljanjani
- [xLSTM] Fix bugs preventing small model training (#43209) by @Anri-Lombard
- docs: correct typo 'neccessary' to 'necessary' (#43868) by @thecaptain789
- Improve PR comment CI feedback (#43852) by @ydshieh
- Fix init weights in remote code (#43768) by @zucchini-nlp
- Fix GlmMoeDsaConfig default mlp_layer_types in modular conversion (#43876) by @OiPunk
- [MistralCommonBackend] fix loading proc (#43887) by @eustlb
- [Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
- Fix SwanLab callback to forward resume init args (#43848) by @OiPunk
- Fix old tech stack in doc (#43879) by @cyyever
- Update TrainingArguments (#43806) by @SunMarc
- Remove unnecessary code or checks for PT 2.4+ (#43787) by @cyyever
- Make it possible to evaluate when using sequence parallel in HF Trainer (#43517) by @jp1924
- [Trainer] Move optimizer cls init to trainer_optimizer.py (#43738) by @SunMarc
- fix the error of tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py::Fb… (#43547) by @sywangyi
- fix fbgemm fp8 multi-device load failure. (#43581) by @sywangyi
- Refactor trainer init (#43807) by @SunMarc
- [fix] Use last_hidden_state key from get_image_features for llama4 (#43882) by @tomaarsen
- [Docs] Add docs for GLM-OCR and fix EomT-DINOv3 (#43710) by @NielsRogge
- Update hub metadata (#43892) by @zucchini-nlp
- [fix] DAC model: Apply STE in Dac.from_latents to match the forward pass (#43820) by @harshaljanjani
- Separate check_model_inputs into capture_outputs and merge_with_config_defaults + ensure correctness (#43862) by @Cyrilvallez
- Remove mask slicing in all eager attentions (#42186) by @Cyrilvallez
- Fix expected DAC outputs due to (old) change in CI settings. (#43896) by @ebezzam
- Minor changes trainer (#43744) by @SunMarc
- adding BC for custom toks accessing slow tok attrs deprecated in v5 (#43898) by @itazap
- Fix typo in quantization_operations in PEFT integrations (#43821) by @redpanda1995
- Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753) by @cyyever
- Decorate cache updates with no_grad, just in case (#43897) by @Rocketknight1
- revert place_model_on_device to property (#43895) by @SunMarc
- Train sampler unification (#43138) by @jiosephlee
- fix(moe): Handle dtype mismatch in torch._grouped_mm with autocast (#43839) by @Mr-Neutr0n
- Fix missing fast image patch counter in Glm46V (#43877) by @OiPunk
- Fix old tech stack in doc (#43902) by @cyyever
- Move _keys_to_ignore_on_load_missing for now (#43893) by @ArthurZucker
- Changes to cache_utils should trigger all tests all the time (#43920) by @Cyrilvallez
- Ernie4 5 vl moe (#43755) by @kaixuanliu
- Harmonize input_embeds to inputs_embeds everywhere (#43916) by @Cyrilvallez
- fix: TextClassificationPipeline docs mentioning deprecated return_all_scores (#43903) by @math-hiyoko
- Revert #43897 (#43923) by @Rocketknight1
- Fix AttributeError in OwlViT conversion script for Python 3.10+ (#43922) by @DimiChatzipavlis
- add openAI style image_url content support in apply_chat_template (#43786) by @kaixuanliu
- Prepare and keep track of position ids in generate (#43734) by @zucchini-nlp
- Fix lifted_tensor in Gemma3n export which dynamo can't reason about (#43801) by @robell
- Fix bark test (#43942) by @Cyrilvallez
- Fix docker files (#43946) by @ydshieh
- Fix flaky test for multimodal LLMs (#43944) by @Rocketknight1
- Add explicit utf-8 encoding to CircleCI scripts for Windows compatibility (#43925)
- Modernize string formatting (f-strings) in conversion scripts (#43943)
- Fix weight decay exclusions in run_*_no-trainer.py examples (#42769) by @casinca
- fix: Better weight decay exclusion in run_*_no-trainer.py examples (#43947) by @casinca
- Timm backbone saves and loads out_features (#43886) by @zucchini-nlp
- Fix qwen-vl position ids when generating several times (#43952) by @zucchini-nlp
- Fix get_number_of_image_tokens (#43948) by @zucchini-nlp
- Fix typos in docstrings, comments, and error messages (#43949)
- Fix LASR test layerdrop issue (#43954) by @Rocketknight1
- [kernels] fix kernel versions (#43955) by @MekkCyber
- [Doc tests] Fix bug (#43729) by @NielsRogge
- fix(models): Preserve custom token IDs through DiaConfig save and load (#43928) by @harshaljanjani
- update somes audio models (#43865) by @Deep-unlearning
- Improve memory allocator during loading (#43945) by @Cyrilvallez
- Inclusion of process_group in the gather_full_tensor function in tensor_parallel.py (#43932) by @quic-meetkuma
- Fix sync gradient (#43919) by @SunMarc
- Reorder Trainer methods (#43914) by @SunMarc
- Fix TypeError in dot_natural_key when state_dict keys have mixed types at same position (#43966) by @shtse8
- Enhance JSON schema generation to support instance, static, and class methods (#43968) by @qgallouedec
- Remove unused squeeze from VJEPA2 embeddings rotation (#43984) by @materight
- Improve new failing test analysis for PR comment CI (#44033) by @ydshieh
- Remove other_workflow_run_ids for issue_comment in utils/notification_service.py (#44036) by @ydshieh
- stable grouped_mm API (#43977) by @IlyasMoutawwakil
- create .git-blame-ignore-revs file (#43982) by @SunMarc
- docs: fix typos across documentation files (#43993) by @saurav0369
- update python requirement to 3.10+ to match codebase (#44009) by @mariam851
- Improve use of torch.is_autocast_enabled (#43930) by @cyyever
- Use torch.xlogy (#44006) by @cyyever
- [Deespeed] fix WeightConverter.convert() use (#43926) by @kashif
- Reduce reduce CUDA sync (#44005) by @cyyever
- split out accelerator args builder method (#43987) by @winglian
- SINQ quantization strategy integration (adapted for Transformers V5) (#43112) by @ChiaraBoretti
- fix(models): Unpack BitNet packed weights to fix CI failure (#43721) by @harshaljanjani
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ChiaraBoretti
- SINQ quantization strategy integration (adapted for Transformers V5) (#43112)
- @cyyever
- Reduce reduce CUDA sync (#44005)
- Use torch.xlogy (#44006)
- Improve use of torch.is_autocast_enabled (#43930)
- Fix old tech stack in doc (#43902)
- Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753)
- Remove unnecessary code or checks for PT 2.4+ (#43787)
- Fix old tech stack in doc (#43879)
- Delete batch_split from EncoderDecoderCache (#43814)
- Fix markdown documentation (#43076)
- @eustlb
- Add Voxtral Realtime (#43769)
- [MistralCommonBackend] fix loading proc (#43887)
- @ebezzam
- Fix expected DAC outputs due to (old) change in CI settings. (#43896)
- Add VibeVoice Acoustic Tokenizer (#43400)
- @vasqu
- [Jamba] Fallback to slow path and warn instead of error out (#43889)
- 🚨 [Attn] New attn mask interface everywhere (#42848)
- [Repo Consistency] Fix rms norm (#43803)
- [Modular Dependencies] Fixup qwen rms norms (#43772)
- @bozheng-hit
- Adding Support for Qwen3.5 (#43830)
- Feb 5, 2026
- Date parsed from source:Feb 5, 2026
- First seen by Releasebot:Mar 20, 2026
v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR
transformers adds new model support for EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, and GLM-OCR, while also shipping broad bug fixes, generation cache updates, MoE and XPU improvements, and multiple breaking refactors to keep the library evolving.
New Model additions
EXAONE-MoE
K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.
Add EXAONE-MoE implementations (#43080) by @nuxlear
PP-DocLayoutV3
PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.
[Model] Add PP-DocLayoutV3 Model Support (#43098) by @zhang-prog
Youtu-LLM
Youtu-LLM is a new, small yet powerful LLM: it contains only 1.96B parameters, supports 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in terms of Commonsense, STEM, Coding and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end-to-end agent tasks.
Add Youtu-LLM model (#43166) by @LuJunru
GlmOcr
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
[GLM-OCR] GLM-OCR Support (#43391) by @zRzRzRzRzRzRzR
Breaking changes
🚨 T5Gemma2 model structure (#43633) - Makes sure that the attn implementation is set on all sub-configs. The config.encoder.text_config was not getting its attn set because we aren't passing it to PreTrainedModel.__init__. We can't change the model structure without breaking, so I manually re-added a call to self.adjust_attn_implementation in the modeling code
🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation to ensure sliding window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.
🚨 Delete duplicate code in backbone utils (#43323) - This PR cleans up backbone utilities. Specifically, we currently have 5 different config attributes for deciding which backbone to load, most of which can be merged into one and seem redundant
After this PR, we'll have only one config.backbone_config as a single source of truth. The models will load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.
🚨 Refactor DETR to updated standards (#41549) - standardizes the DETR model to be closer to other vision models in the library.
🚨 Fix floating-point precision in JanusImageProcessor resize (#43187) - replaces an int() with round(); expect slight numerical differences
🚨 Remove deprecated AnnotionFormat (#42983) - removes a misnamed class in favour of AnnotationFormat.
Bugfixes and improvements
fix(models): Migrate legacy segmentation_indices to out_indices in BeitConfig (#43505) by @harshaljanjani
[docs] Update torch version (#42135) by @stevhliu
Remove SDPA workarounds for torch 2.4+ (#43754) by @cyyever
add use_deterministic to guarantee the consistency for youtu-llm model (#43759) by @kaixuanliu
fix: add compatible_model_types to suppress model type mismatch warnings (#43495) by @leoneperdigao
Fix T5 v1.1 detection (#43681) by @githubnemo
Add moonshine streaming (#43702) by @eustlb
Allow bi-directional attention for all models (#43705) by @Cyrilvallez
Docs: fix Training step by removing tokenizer from trainer initialization (#43733) by @nesjett
Fix scheduler initialization order (#43711) by @SunMarc
Fix accelerate integration import (#43732) by @SunMarc
Update torch minimum version to 2.4 (#41307) by @cyyever
Fix dtype in image-text-to-text pipe (#43731) by @zucchini-nlp
Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 (#43574) by @jp1924
fix: AttributeError for Qwen3_omni_moe (#43593) by @Vallabh-1504
Improve typing/explanations for general model properties (#43712) by @Cyrilvallez
[Kernels] kernel migration updates for activation kernels (#43518) by @ariG23498
[feat] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen
Added S110 - try-except-pass rule (#43687) by @tarekziade
[docs] benchmarks (#43694) by @stevhliu
fix norm_eps dtype (#43669) by @fschlatt
Llava onevision: output align for tests and add image_sizes input param (#43678) by @kaixuanliu
Fix CLIPOutput attentions not being returned (#43657) by @jonathan-fulton
[Attn] Fixup interface usage after refactor (#43706) by @vasqu
Fix model/processor mismatch in SigLIP2 quantization example (#43652) by @jonathan-fulton
Fix crash of custom models in Notebook or Repl (#43690) by @Cyrilvallez
Simplify TrainingArguments docstring (#43568) by @SunMarc
Composite model inherit automatically all important properties from their children (#43691) by @Cyrilvallez
Update configuration_qwen3.py (#43703) by @francesco-bertolotti
fix gptoss tp crash (#43695) by @sywangyi
[CB] Keep order of incoming requests (#43626) by @remi-or
Fix Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!) (#43473) by @xenova
Remove num_frames in ASR pipeline (#43546) by @jiqing-feng
remove ipex and ccl for xpu and cpu (#42852) by @yao-matrix
update guide with new attr name for toks (#43689) by @itazap
Docs: fix typos in Get started (index, quicktour) (#43666) by @CodeByKodi
the cache class is deprecated by @vasqu (direct commit on main)
custom tok init fix (#43591) by @itazap
More export friendly rewrites and skipping the failing ones (#43436) by @IlyasMoutawwakil
Cast byte_count to int in caching_allocator_warmup for MPS compatibility (#43608) by @tobyliu2004
[Docs] Complete missing Llama4 configuration docs (#43460) by @udaymehta
Fix t5 failures (#43374) by @Abdennacer-Badaoui
Add EoMT with DINOv3 backbone (#41212) by @NielsRogge
Update DBRX docs to reference re-uploaded checkpoint (#43196) by @qgallouedec
[loading] Fix forced upcasting to fp32 (#43683) by @Cyrilvallez
Fix FP8Expert for Qwen (#43670) by @yiliu30
Simplify loading structure (#43589) by @Cyrilvallez
[CB] Refactor logic for inputs and outputs outside of the main API (#43569) by @remi-or
Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675) by @tarekziade
Fix FP8Expert for DeepSeek R1 (#43616) by @yiliu30
Use correct sampling rate in chat template (#43674) by @zucchini-nlp
[HunYuan] Fix RoPE init (#43411) by @vasqu
XPU now supports MoE kernel(MegaBlocks) implementation (#43435) by @YangKai0616
[Sam] Fixup training flags (#43567) by @vasqu
remove torchao.autoquant from transformers (#43561) by @vkuzo
[DeepSpeed] properly handle MoE weight conversion (#43524) by @kashif
Tie zamba weights correctly (#43623) by @zucchini-nlp
[kernels] Centralize kernels tests (#42819) by @MekkCyber
Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662) by @ydshieh
Fix KeyError in check_bad_commit.py (#43655) by @ydshieh
[Benchmark] Minor fix for benchmark: kernel is not correctly called (#43428) by @sywangyi
Add explicit commit info to PR comment CI feedback (#43635) by @ydshieh
Better new failures reporting for PR comment CI (#43629) by @ydshieh
[docs] serving (#42853) by @stevhliu
add XPU expected output for MixedInt8GPT2Test (#43615) by @kaixuanliu
Don't modify mappings in tests (#43634) by @Rocketknight1
Allow Attention and Experts to be used as standalone modules (#43622) by @Cyrilvallez
Don't modify tied_weight_keys in-place (#43619) by @zucchini-nlp
[Rope] Revert #43410 and make inheritance implicit again (#43620) by @vasqu
[vllm compat] Separate renaming from conversion ops (#43621) by @Cyrilvallez
refactor + robusts tests for Tensor Parallel (#42809) by @3outeille
add contiguous operation for diffllama model for xpu to enable compile mode. (#43614) by @kaixuanliu
add xpu expectation for lw_detr model (#43339) by @kaixuanliu
minimax_m2: fix failed test case for XPU (#43324) by @kaixuanliu
Improve new failures reporting (#43628) by @ydshieh
Fix extras on all supported Python versions (#43490) by @tarekziade
fix(models): Fix suno/bark-small CPU offload device mismatch causing CI failures (#43607) by @harshaljanjani
[CB] [Serve] Fix broken serve tests (#43594) by @remi-or
Docs: fix typo in weight converter guide (#43610) by @KOKOSde
[MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583) by @YangKai0616
Fixes configuration default values (#43592) by @zucchini-nlp
Fix make_batched_video with 5D arrays (#43486) by @zucchini-nlp
Operation Green CI II (#43537) by @Rocketknight1
enable cpu paged cache (#42869) by @jiqing-feng
Qwen3 omni - fix get video features (#43588) by @zucchini-nlp
[GLM-Image] Add batch > 1 support and fix configuration defaults (#43342) by @JaredforReal
[Model] Refactor modernbert with the attention interface (#43030) by @YangKai0616
Regex post processing in loading (#43585) by @Cyrilvallez
simplify extra tokens logic in base (#43230) by @itazap
Add XPU support to the tests for solar_open (#43579) by @YangKai0616
remove FbgemmFp8LinearTest (#43545) by @sywangyi
Increase default ReadTimeout in tests (#43586) by @Wauplin
Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584) by @ydshieh
[CI][AMD] Fix Pipeline CI (#43178) by @Abdennacer-Badaoui
fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557) by @tarekziade
Improve GPU monitoring: switch to multiprocessing and use amdsmi for AMD GPUs (#43552) by @Abdennacer-Badaoui
Update test of Youtu-LLM to pr-aligned repos (#43578) by @LuJunru
Rework dependencies and extras + Remove outdated templates folder (#43536) by @Cyrilvallez
Fix repo. consistency bot (push permission issue) (#43570) by @ydshieh
Fix Wav2vec and a few others (#43566) by @Cyrilvallez
[Modular] Allow to add new bases that are not present in the inherited class (#43556) by @vasqu
add an option to disable Sam3VideoModel progress bar (#43564) by @ndeybach
check/fix repo. check bot workflow (#43565) by @ydshieh
Increase timeout when preparing CI (#43560) by @Rocketknight1
43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults (#43101) by @vaibhav-research
check PR bot permission - part 3 (try content attribute) (#43555) by @ydshieh
check PR bot permission - part 2 (style only) (#43554) by @ydshieh
check PR bot permission - part 1 (#43553) by @ydshieh
Fix failing tests due to no attribute pad_token_id (#43453) by @Sai-Suraj-27
fix: GPT OSS Conversion Script Enhancements (#42901) by @KyleMylonakisProtopia
[Quantization] Fix triton_kernels name after being renamed to gpt-oss-triton-kernels (#43528) by @MekkCyber
[Quantization] Add cutlass kernel for FP8 (#43304) by @MekkCyber
[CB] Minor perf improvements and ty compatibility (#43521) by @remi-or
Fix tiles mixing for batched input, add tie_word_embeddings to LFM2VL config (#43379) by @ankke
fix: return labels instead of label in reduce_label method in BeitImageProcessorFast (#43527) by @sbucaille
[RoPE] Make explicit inheritance (#43410) by @vasqu
Fix for #43530 (#43535) by @Rocketknight1
Operation Green CI (#43530) by @Rocketknight1
Tie the weights even if initializing from a config on meta device (#43523) by @Cyrilvallez
[kernels] Update cv_utils name (#43529) by @MekkCyber
add trackio to training notebooks (#43442) by @merveenoyan
Mark test_prompt_lookup_decoding as flaky (#42184) by @Rocketknight1
Fix some MoE routers (#43445) by @IlyasMoutawwakil
batched_mm is slow on cpu (#43438) by @IlyasMoutawwakil
fix: initialize BatchNorm2d buffers only when needed (#43520) by @tarekziade
Fix loading of Qwen3 FP8 (#43494) by @githubnemo
fix ShieldGemma2IntegrationTest::test_model (#43343) by @sywangyi
Update SamHQModelIntegrationTest::test_inference_mask_generation_batched_points_batched_images for XPU (#43511) by @sywangyi
Revert utils files changes from PR #42845 (#43507) by @ydshieh
Move hardcoded time_step params to config for Bamba, FalconH1, GraniteMoeHybrid (#43461) by @raimbekovm
Prepare inputs for generation is called from super() (#43280) by @zucchini-nlp
Enhance repo. consistency bot (#43503) by @ydshieh
Add pytest-random-order for reproducible test randomization (#43483) by @tarekziade
Add missing GPURawMetrics.from_dict() method in benchmark_v2 (#43499) by @Abdennacer-Badaoui
push dev version 5.0.1.dev0 by @ArthurZucker (direct commit on main)
Fix failing markuplm & perception_lm integration tests (#43464) by @Sai-Suraj-27
fix(Phi4Multimodal): Fix incorrect default vision/audio config initialization in Phi4MultimodalConfig (#43480) by @charlieJ107
handle 1D position_ids for modeling_flash_attention_utils as well (#43403) by @kaixuanliu
Remove stale TODO comments in UDOP tied weights (#43477) by @raimbekovm
Fix Mxfp4 dequantize (#43326) by @Cyrilvallez
Significant community contributions
The following contributors have made significant changes to the library over the last release:
@cyyever
Remove SDPA workarounds for torch 2.4+ (#43754)
Update torch minimum version to 2.4 (#41307)
🚨 Remove deprecated AnnotionFormat (#42983)
@eustlb
Add moonshine streaming (#43702)
@tarekziade
Added S110 - try-except-pass rule (#43687)
Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675)
Fix extras on all supported Python versions (#43490)
fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557)
fix: initialize BatchNorm2d buffers only when needed (#43520)
Add pytest-random-order for reproducible test randomization (#43483)
@nuxlear
Add EXAONE-MoE implementations (#43080)
@vasqu
[Attn] Fixup interface usage after refactor (#43706)
the cache class is deprecated
[HunYuan] Fix RoPE init (#43411)
[Sam] Fixup training flags (#43567)
[Rope] Revert #43410 and make inheritance implicit again (#43620)
[Modular] Allow to add new bases that are not present in the inherited class (#43556)
[RoPE] Make explicit inheritance (#43410)
@remi-or
[CB] Keep order of incoming requests (#43626)
[CB] Refactor logic for inputs and outputs outside of the main API (#43569)
[CB] [Serve] Fix broken serve tests (#43594)
[CB] Minor perf improvements and ty compatibility (#43521)
@NielsRogge
Add EoMT with DINOv3 backbone (#41212)
@YangKai0616
XPU now supports MoE kernel(MegaBlocks) implementation (#43435)
[MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583)
[Model] Refactor modernbert with the attention interface (#43030)
Add XPU support to the tests for solar_open (#43579)
@ydshieh
Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662)
Fix KeyError in check_bad_commit.py (#43655)
Add explicit commit info to PR comment CI feedback (#43635)
Better new failures reporting for PR comment CI (#43629)
Improve new failures reporting (#43628)
Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584)
Fix repo. consistency bot (push permission issue) (#43570)
check/fix repo. check bot workflow (#43565)
check PR bot permission - part 3 (try content attribute) (#43555)
check PR bot permission - part 2 (style only) (#43554)
check PR bot permission - part 1 (#43553)
Revert utils files changes from PR #42845 (#43507)
Enhance repo. consistency bot (#43503)
@JaredforReal
[GLM-Image] Add batch > 1 support and fix configuration defaults (#43342)
@zhang-prog
[Model] Add PP-DocLayoutV3 Model Support (#43098)
@LuJunru
Update test of Youtu-LLM to pr-aligned repos (#43578)
Add Youtu-LLM model (#43166)
@zRzRzRzRzRzRzR
[GLM-OCR] GLM-OCR Support (#43391)
Original source Report a problem - Jan 26, 2026
- Date parsed from source:Jan 26, 2026
- First seen by Releasebot:Mar 20, 2026
Transformers v5
transformers releases its first major v5 update, bringing major API simplification, dynamic weight loading, tokenizer and config refactors, faster model loading, weekly minor releases, and broad bug fixes plus new model support across vision, audio, and language tasks.
Transformers v5 release notes
Highlights
Significant API changes: dynamic weight loading, tokenization
Backwards Incompatible Changes
Bugfixes and improvements
We have a migration guide, continuously updated on the main branch; please check it out in case you're facing issues: migration guide.
Highlights
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is the full v5 release, and it sets in motion something bigger: starting with v5, we'll release a minor version every week rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week after, and so on.
We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.
In order to install this release, please do so with the following:
```
pip install transformers
```
For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.
Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.
Significant API changes
Dynamic weight loading
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:

```python
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]
```

The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common operation done on the attention layers is to fuse the query, key, values layers. Doing so with this API would amount to defining the following conversion:

```python
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)
```

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.
This allows us to define a mapping from architecture to a list of weight conversions. Applying those weight conversions can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt that we accumulated over the past few years.
This results in several improvements:
- Much cleaner definition of transformations applied to the checkpoint
- Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
- Faster model loading thanks to scheduling of tensor materialization
- Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)
Linked PR: #41580
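To make the reversibility point concrete, here is a plain-torch sketch of what a fuse/split round trip does to the tensors themselves. This is an illustration only; the real API expresses these steps through ConversionOps such as Concatenate.

```python
import torch

# Fusing q/k/v on load (what Concatenate(dim=0) does to the tensors)...
q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
qkv = torch.cat([q, k, v], dim=0)  # single fused projection weight

# ...and splitting back on save, recovering the original checkpoint layout
q2, k2, v2 = torch.chunk(qkv, 3, dim=0)
assert torch.equal(q, q2) and torch.equal(k, k2) and torch.equal(v, v2)
```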
Tokenization
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges
        self._tokenizer = Tokenizer(BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True))
        # _get_prepend_scheme is an internal transformers helper (see transformers.convert_slow_tokenizer)
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
```

Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns you an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).
The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
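For instance, training such an empty tokenizer could look like the following. This is a minimal sketch, assuming the backend tokenizers.Tokenizer stays reachable through the _tokenizer attribute shown in the class above:

```python
from tokenizers.trainers import BpeTrainer

tokenizer = Llama5Tokenizer()  # empty, trainable

trainer = BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
corpus = ["hello world", "hello tokenizers", "tokenizers build vocabularies"]

# Train the underlying BPE model directly on the corpus
tokenizer._tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer._tokenizer.encode("hello world").tokens)
```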
Backend Architecture Changes: moving away from the slow/fast tokenizer separation
Up to now, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
- "Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:
- TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, and it also offers a lot more features that are commonly adopted across the ecosystem:
  - handling additional tokens
  - a full Python API for setting and updating
  - automatic parallelization
  - automatic offsets
  - customization
  - training
- SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
- PythonBackend: a Python implementation of the features provided by tokenizers. It basically allows adding tokens.
- MistralCommonBackend: relies on MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer)
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent, you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular to easily support future backends.
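In practice, nothing changes at the call site; only the resulting class differs depending on the files and dependencies found (the checkpoint name below is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)  # backend-specific class, selected automatically
```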
Defining a tokenizer outside of the existing backends
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding; see the toy sketch after the list below.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained
among a few others
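As a toy illustration of building on PythonBackend, a whitespace tokenizer might look like this. This sketch assumes PythonBackend keeps the v4 PreTrainedTokenizer hooks (_tokenize, _convert_token_to_id, _convert_id_to_token, get_vocab):

```python
from transformers import PythonBackend  # formerly PreTrainedTokenizer

class WhitespaceTokenizer(PythonBackend):
    def __init__(self, vocab=None, unk_token="<unk>", **kwargs):
        # vocab must exist before super().__init__ so added-token logic can use it
        self.vocab = vocab or {unk_token: 0}
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab[str(self.unk_token)])

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, str(self.unk_token))
```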
API Changes
1. Direct tokenizer initialization with vocab and merges
Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()
```

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
```python
from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
```

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab, therefore enabling the comparison of different pre-tokenizers, normalizers, etc.
⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer as loading from files is reserved to the from_pretrained method.
2. Simplified decoding API
The batch_decode and decode methods have been unified to reflect the behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
```

Gives:
- 'hey how are you?</s> fine</s>'
- ['hey how are you?</s>', 'fine</s>']
We expect encode and decode to behave as two sides of the same coin: encode, process, decode should just work.
Note
A common use-case is: encode, model.generate, decode. generate returns list[list[int]], which previously had to go through batch_decode; with the unified API, decode now handles it directly.
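Putting it together, the typical round trip looks like this. This is a hedged example: "gpt2" is only a stand-in checkpoint, and we assume the unified decode accepts a batched tensor the same way it accepts nested lists:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer(["hey how are you?", "fine"], return_tensors="pt", padding=True)
generated = model.generate(**inputs, max_new_tokens=10)

# With the unified API, decode handles the batch directly
print(tokenizer.decode(generated))
```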
3. Unified encoding API
The encode_plus method is deprecated in favor of the single __call__ method.
4. apply_chat_template returns BatchEncoding
Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.
```python
# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
```

5. Removed legacy configuration file saving:
We simplify the serialization of tokenization attributes:
- special_tokens_map.json - special tokens are now stored in tokenizer_config.json.
- added_tokens.json - added tokens are now stored in tokenizer.json.
- added_tokens_decoder is only stored when there is no tokenizer.json.
When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
6. Model-Specific Changes
Several models that had identical tokenizers now import from their base implementation:
- LayoutLM → uses BertTokenizer
- LED → uses BartTokenizer
- Longformer → uses RobertaTokenizer
- LXMert → uses BertTokenizer
- MT5 → uses T5Tokenizer
- MVP → uses BartTokenizer
These modules will eventually be removed altogether.
Removed T5-specific workarounds
The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.
Testing Changes
A few testing changes specific to tokenizers have been applied:
- Model-specific tokenization test files now focus on integration tests.
- Common tokenization API tests (e.g., add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior
For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.
7. Deprecated / Modified Features
Special Tokens Structure:
SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.
```python
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map    # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
```

special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
additional_special_tokens: Still accepted for backward compatibility but is automatically converted to extra_special_tokens.
Deprecated Methods:
sanitize_special_tokens(): Already deprecated in v4, removed in v5.
prepare_seq2seq_batch(): Deprecated; use __call__() with the text_target parameter instead.
```python
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
```

BatchEncoding.words(): Deprecated; use word_ids() instead.
Removed Methods:
create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
_switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with the text_target parameter instead.
```python
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
```

parse_response(): Removed from base class.
Performance
MoE Performance
The v5 release significantly improves the performance of MoE models. We improve and optimize MoE performance through batched and grouped experts implementations, and we optimize them for decoding using batched_mm.
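For intuition, decoding with batched experts boils down to replacing a Python loop over expert weights with a single batched matmul. This is a conceptual sketch, not the transformers implementation:

```python
import torch

num_experts, tokens_per_expert, d_model, d_ff = 4, 3, 16, 32
expert_w1 = torch.randn(num_experts, d_model, d_ff)       # all experts in one tensor
x = torch.randn(num_experts, tokens_per_expert, d_model)  # tokens grouped per expert

# One batched matmul computes every expert's projection at once
h = torch.bmm(x, expert_w1)  # shape: (num_experts, tokens_per_expert, d_ff)
```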
Core performance
We focus on improving the performance of loading weights on device (which gives speedups up to 6x in tensor parallel situations); this is preliminary work that we'll continue to work on in the coming weeks. Some notable improvements:
- [saving] Simplify general logic by @Cyrilvallez in #42766
- Do not rely on config for inferring model dtype by @Cyrilvallez in #42838
- Improve BatchFeature: stack list and lists of torch tensors by @yonigozlan in #42750
- Remove tied weights from internal attribute if they are not tied by @Cyrilvallez in #42871
- Enforce call to post_init and fix all of them by @Cyrilvallez in #42873
- Simplify tie weights logic by @Cyrilvallez in #42895
- Add buffers to _init_weights for ALL models by @Cyrilvallez in #42309
- [loading] Really initialize on meta device for huge perf gains by @Cyrilvallez in #42941
- Do not use accelerate hooks if the device_map has only 1 device by @Cyrilvallez in #43019
- Move missing weights and non-persistent buffers to correct device earlier by @Cyrilvallez in #43021
Library-wide changes with lesser impact
Default dtype update
We have updated the default dtype for all models loaded with from_pretrained to auto. Model instantiation now respects the dtype in which the model was saved, rather than forcing it to load in float32.
You can, of course, still specify the dtype in which you want to load your model by specifying it as an argument to the from_pretrained method.
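A minimal sketch of both behaviors (the checkpoint name is illustrative, and we assume the dtype keyword argument of from_pretrained):
```python
import torch
from transformers import AutoModelForCausalLM

# v5 default: dtype="auto" respects the dtype the checkpoint was saved in
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Explicitly request a dtype to override the default
model_fp32 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B", dtype=torch.float32)
```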
Shard size
The Hugging Face Hub infrastructure has gradually moved to the Xet backend. This significantly simplifies uploads and downloads, with higher download and upload speeds, partial uploads, and, most notably, a higher threshold for accepted file sizes on the Hugging Face Hub.
To reflect this, we're increasing the default shard size of models serialized on the Hub to 50GB (up from 5GB).
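If you still need smaller files, the max_shard_size argument of save_pretrained can be set explicitly; a minimal sketch (checkpoint name illustrative):
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# v5 defaults to 50GB shards; pass max_shard_size to keep the old 5GB behavior
model.save_pretrained("./my-model", max_shard_size="5GB")
```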
use_auth_token
The use_auth_token argument/parameter is deprecated in favor of token everywhere.
You should be able to search and replace use_auth_token with token and get the same logic.
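A minimal sketch of the rename (repo name and token value are placeholders):
```python
from transformers import AutoModel

# v4 (removed)
# model = AutoModel.from_pretrained("my-org/private-model", use_auth_token="hf_...")

# v5
model = AutoModel.from_pretrained("my-org/private-model", token="hf_...")
```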
Linked PR: #41666
Attention-related features
We decided to remove some features in v5, as they are only supported in a few older models and are no longer integrated into new model additions. It's recommended to stick to v4.x if you need them. The following features are affected:
- No more head masking, see #41076. This feature allowed turning off certain heads during the attention calculation and only worked with eager attention.
- No more relative positional biases in BERT-like models, see #41170. This feature was introduced to allow relative position scores within attention calculations (similar to T5). However, it is barely used in official models and added a lot of complexity. It also only worked with eager attention.
- No more head pruning, see #41417 by @gante. As the name suggests, it allowed pruning heads within your attention layers.
Updates to supported torch APIs
We dropped support for two torch APIs:
- torchscript in #41688
- torch.fx in #41683
Those APIs were deprecated by the PyTorch team; we're focusing instead on the supported dynamo and export APIs.
Quantization changes
We cleaned up the quantization API in transformers and significantly refactored the weight loading, as highlighted above. We drop support for two quantization arguments that have been deprecated for some time:
- load_in_4bit
- load_in_8bit
We remove them in favor of the quantization_config argument, which is much more complete. As an example, here is how you would load a 4-bit bitsandbytes model using this argument:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    quantization_config=quantization_config,
)
```
Configuration
Methods to initialize a nested config, such as from_xxx_config, are deleted. Nested configs can be initialized directly through the __init__ method in the same way. See #41314.
It is no longer possible to load a config class from a URL file. Configs must be loaded from either a local path or a repo on the Hub. See #42383.
All parameters configuring a model's rotary embedding are now stored under config.rope_parameters, including rope_theta and rope_type. A model's config.rope_parameters is a simple dictionary in most cases, but can also be a nested dict in special cases (e.g. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to access config.rope_theta will throw an attribute error from now on, as sketched below. See #39847 and #42255
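A minimal sketch of the migration, assuming a single-RoPE model (the checkpoint name is illustrative):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B")

# v4 (now raises AttributeError)
# theta = config.rope_theta

# v5
theta = config.rope_parameters["rope_theta"]
```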
Qwen-VL family configuration is in a nested format and trying to access keys directly will throw an error (e.g. config.vocab_size). Users are expected to access keys from their respective sub-configs (config.text_config.vocab_size).
Configurations of non-generative models (any model that doesn't call model.generate()) will no longer have a generation_config and model.config.generation_config will throw an attribute error.
Processing
Tokenization
Slow tokenizer files (aka tokenization_<model>.py) are removed in favor of the fast tokenizer files (tokenization_<model>_fast.py), which are renamed to tokenization_<model>.py. As fast tokenizers are backed by 🤗 tokenizers, they include a wider range of features that are maintainable and reliable.
Other backends (sentencepiece, etc.) will be supported with a light layer if loading a fast tokenizer fails.
- Remove legacy files like special_tokens_map.json and added_tokens.json
- Remove _eventually_correct_t5_max_length
- encode_plus --> __call__
- batch_decode --> decode
- apply_chat_template previously returned naked input_ids by default rather than a BatchEncoding dict. This was inconvenient: it should return a BatchEncoding like tokenizer.__call__(), but we were stuck with the old behavior for backward compatibility. The method now returns a BatchEncoding, as sketched below.
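A minimal sketch of the new default (the checkpoint and messages are illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [{"role": "user", "content": "Hello!"}]

# v5: returns a BatchEncoding (input_ids, attention_mask), not a bare tensor
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
print(inputs["input_ids"].shape)
```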
Linked PRs:
- #40938
- #40936
- #41626
Processing classes
In processing classes each attribute will be serialized under processor_config.json as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old format processors (#41474)
XXXFeatureExtractor classes are completely removed in favor of the XXXImageProcessor classes for all vision models (#41174)
Minor change: XXXFastImageProcessorKwargs is removed in favor of XXXImageProcessorKwargs which will be shared between fast and slow processors (#40931)
Modeling
Some RotaryEmbeddings layers will start returning a dict of tuples, in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
The config attribute for RotaryEmbeddings layers is unified and accessed via config.rope_parameters. The rope_theta config attribute might no longer be accessible for some models; it lives in config.rope_parameters['rope_theta'] instead. Backward compatibility will be kept for a while as much as possible, and we'll gradually move to the new RoPE format in the near future (#39847)
Vision-language models no longer have shortcut access to their language and vision components from the generative model via model.language_model. It is recommended to access the module with model.model.language_model or model.get_decoder(). See #42156
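A minimal sketch of the migration for a loaded VLM:
```python
# v4 (removed)
# language_model = model.language_model

# v5
language_model = model.model.language_model
# or, for the decoder specifically:
decoder = model.get_decoder()
```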
All models now accept kwargs in their forward methods.
Generate
Old, deprecated output type aliases were removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (#40998)
Removed deprecated classes regarding decoding methods that were moved to the Hub due to low usage (constraints and beam scores) (#41223)
If generate doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (#41505)
Generation parameters are no longer accessible via the model's config. If generation parameters are serialized in config.json for an old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only through model.generation_config, as sketched below.
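A minimal sketch, assuming a loaded generative model:
```python
# v5: read and modify generation parameters on the generation config
model.generation_config.do_sample = True
model.generation_config.temperature = 0.7

# v4-style access such as model.config.do_sample is no longer available
```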
Trainer
New Features
ALST/Ulysses Sequence Parallelism Integration
Added sequence parallelism support via HF Accelerate for training with longer sequences. Enables splitting sequences across devices using the ALST (Arctic Long Sequence Training) and Ulysses algorithms with DeepSpeed.
Improved compute_loss_func Handling
compute_loss_func now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.
num_items_in_batch in Prediction Step
The num_items_in_batch argument is now passed to compute_loss during prediction_step, enabling proper loss scaling during evaluation.
Breaking Changes
report_to now defaults to "none"
Logging integrations are no longer auto-detected by default; users must explicitly specify which reporting backends to use.
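A minimal sketch of opting back in (the backend name is illustrative):
```python
from transformers import TrainingArguments

# v5: report_to defaults to "none"; request logging backends explicitly
args = TrainingArguments(output_dir="out", report_to="tensorboard")
```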
Removing arguments without a deprecation cycle in TrainingArguments due to low usage
- mp_parameters -> legacy parameter that was later added to the SageMaker trainer
- _n_gpu -> not intended for users to set; we now initialize it correctly instead of exposing it in TrainingArguments
- overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the example scripts, with no impact on Trainer
- logging_dir -> only used for TensorBoard; set the TENSORBOARD_LOGGING_DIR env var instead
- jit_mode_eval -> use use_torch_compile instead, as torchscript is no longer recommended
- tpu_num_cores -> it is not recommended to set the number of cores; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
- past_index -> only used for a very small number of models with special architectures like Transformer-XL, and it was never documented how to train those models
- ray_scope -> a minor argument for the Ray integration; set the RAY_SCOPE env var instead
- warmup_ratio -> use warmup_steps instead; we combined both arguments by allowing float values in warmup_steps
Removing deprecated arguments in TrainingArguments
- fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
- tpu_metrics_debug -> debug
- push_to_hub_token -> hub_token
- push_to_hub_model_id and push_to_hub_organization -> hub_model_id
- include_inputs_for_metrics -> include_for_metrics
- per_gpu_train_batch_size -> per_device_train_batch_size
- per_gpu_eval_batch_size -> per_device_eval_batch_size
- use_mps_device -> mps will be used by default if detected
- fp16_backend and half_precision_backend -> we will only rely on torch.amp as everything has been upstreamed to torch
- no_cuda -> use_cpu
- include_tokens_per_second -> include_num_input_tokens_seen
- use_legacy_prediction_loop -> we only use evaluation_loop function from now on
Removing deprecated arguments in Trainer
- tokenizer in initialization -> processing_class
- model_path in train() -> resume_from_checkpoint
Removed features for Trainer
- SigOpt integration for hyperparameter search was removed, as the library was archived and the API stopped working
- drop support for SageMaker API <1.10
- bump accelerate minimum version to 1.1.0
- bump peft minimum version to 0.18.0
- bump bitsandbytes minimum version to 0.46.1
New defaults for Trainer
use_cache in the model config will be set to False during training. You can still change the cache behavior through the TrainingArguments use_cache argument if needed.
Pipeline
Image-text-to-text pipelines no longer accept images as a separate argument alongside conversation chats. Image data must be embedded in the chat's "content" field, as sketched below. See #42359
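A minimal sketch of the new calling convention (the checkpoint and image URL are illustrative):
```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

# v5: the image is embedded in the chat's "content", not passed separately
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=20))
```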
PushToHubMixin
Removed the deprecated organization and repo_url arguments from PushToHubMixin. You must pass a repo_id instead.
Removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early rather than provide an option that fails later.
push_to_hub no longer accepts **kwargs. All accepted parameters are explicitly documented.
Arguments of push_to_hub are now keyword-only to avoid confusion. Only repo_id can be positional, since it's the main argument.
Removed the use_temp_dir argument from push_to_hub. We now use a temporary directory in all cases.
Linked PR: #42391.
CLI
The deprecated transformers-cli ... command has been removed; transformers ... is now the only CLI entry point.
The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of the box (improved --help section, autocompletion).
The biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model. It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped version, that feature has been removed in favor of transformers serve. The goal of splitting transformers chat and transformers serve is to define clear boundaries between client and server code: it helps with maintenance but also makes the commands less bloated. The new signature of transformers chat is:
```
Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...

  Chat with a model from the command line.
```
It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:
```
transformers chat HuggingFaceTB/SmolLM3-3B
```
It can, however, use any OpenAI API-compatible HTTP endpoint:
```
transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1
```
Linked PRs:
- #40997
- #41487
Removal of the run method
The transformers run command (previously transformers-cli run) is an artifact of the past: it was neither documented nor tested, and isn't part of any public documentation. We're removing it for now; please let us know if this is a command you rely on, in which case we'll bring it back with better support.
Linked PR: #42447
Environment variables
Legacy environment variables like TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
Constants HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.
Linked PR: #42391.
Requirements update
transformers v5 pins the huggingface_hub version to >=1.0.0. See this migration guide to learn more about this major release. Here are the main aspects to know about:
- switched the HTTP backend from requests to httpx. This change was made to improve performance and to support synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError in your codebase, you'll need to switch to httpx.HTTPError (see the sketch after this list).
- related to the above, it is no longer possible to set proxies from your script. To handle proxies, you must set the HTTP_PROXY / HTTPS_PROXY environment variables.
- hf_transfer, and therefore HF_HUB_ENABLE_HF_TRANSFER, has been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!
- typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.
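A minimal sketch of the error-handling migration (the checkpoint name is illustrative):
```python
import httpx
from transformers import AutoModel

try:
    model = AutoModel.from_pretrained("bert-base-uncased")
except httpx.HTTPError as err:  # was requests.HTTPError with huggingface_hub < 1.0
    print(f"Download failed: {err}")
```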
New model additions in v5
CWM
The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.
Add Code World Model (CWM) by @jacobkahn in #41199
SAM3
SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
The SAM3 addition adds four new architectures:
- Sam3
- Sam3Tracker
- Sam3TrackerVideo
- Sam3Video
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.
Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.
SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.
Add SAM3 to 🤗 Transformers by @yonigozlan in #42285
LFM2 MoE
LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
[Model] Lfm2Moe by @paulpak58 in #41401
VideoLlama 3
The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.
[model] Add VideoLLaMA3 implementation by @lkhl in #40499
AudioFlamingo 3
Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: nvidia/audio-flamingo-3-hf
Highlights:
- Unified audio encoder across speech, sound, and music.
- Long-audio support via windowing and post-pool alignment: audio is processed in 30-second windows with a hard limit of 20 windows (10 minutes total); longer audio is truncated.
- Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
[models] Add AudioFlamingo3 integration by @lashahub in #40290
Nanochat
NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. It features several fundamental architectural components common in modern transformer models, which makes it a good starting point for understanding their principles. NanoChat is a variant of the Llama architecture, with a simplified attention mechanism and normalization layers.
[MODEL] Nanochat implementation by @burtenshaw in #41634
FastVLM
FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.
Add FastVLM by @camilla-deckard in #41112
PaddleOCR-VL
PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
[Model] Add PaddleOCR-VL Model Support by @zhang-prog in #42178
SAM: Perception Encoder Audiovisual
PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space.
The model enables cross-modal retrieval and understanding between audio and text.
Text input
Produces a single embedding representing the full text.
Audio input
PeAudioFrameLevelModel
Produces a sequence of embeddings, one every 40 ms of audio. Suitable for audio event localization and fine-grained temporal analysis.
PeAudioModel
Produces a single embedding for the entire audio clip. Suitable for global audio-text retrieval tasks.
The resulting embeddings can be used for:
- Audio event localization
- Cross-modal (audio–text) retrieval and matching
Sam: Perception Encoder Audiovisual by @eustlb in #42905
Jais2
Jais2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the ground up with 8B and 70B parameters, Jais 2 understands Arabic the way it's truly spoken across dialects, culture, and modern expression. It is developed by MBZUAI, Inception, and Cerebras Systems, and is based on the transformer architecture with modifications including:
- LayerNorm instead of RMSNorm
- ReLU² activation function
- Rotary Position Embeddings (RoPE)
adds jais2 model support by @sarathc-cerebras in #42684
Pixio
Pixio is a vision foundation model that uses ViT as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.
Add Pixio pre-trained models by @LiheYoung in #42795
Ernie 4.5 VL MoE
The Ernie 4.5 VL MoE model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. The Vision-Language series in particular is composed of a novel multimodal heterogeneous structure, sharing parameters across modalities while also dedicating parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE), which is composed of
- Dedicated Text Experts
- Dedicated Vision Experts
- Shared Experts
This architecture has the advantage of enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. A more detailed breakdown is given in the Technical Report.
[Ernie 4.5] Ernie VL models by @vasqu in #39585
GLM-ASR
GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key capabilities include:
- Exceptional Dialect Support: beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
- Low-Volume Speech Robustness: specifically trained for "Whisper/Quiet Speech" scenarios, it captures and accurately transcribes extremely low-volume audio that traditional models often miss.
- SOTA Performance: achieves the lowest average error rate (4.10) among comparable open-source models, showing significant advantages on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
This model was contributed by Eustache Le Bihan and Yuxuan Zhang.
You can check the model card for more details, as well as our GitHub repo.
GLM-ASR Support by @zRzRzRzRzRzRzR in #42875
GLM 4.7 Flash
GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.
[GLM-4.7] GLM-Lite Support by @zRzRzRzRzRzRzR in #43031
GLM Image
We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. Code, models, and more information are released at https://github.com/zai-org/GLM-V
[GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in #43100
LWDetr
LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.
The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications:
- Efficient ViT Encoder: Uses a plain ViT with interleaved window/global attention and a window-major organization to drastically reduce attention complexity and latency.
- Richer Input: Aggregates multi-level features from the encoder and uses a C2f Projector (YOLOv8) to pass two-scale features (1/8 and 1/32).
- Faster Decoder: Employs a shallow 3-layer DETR decoder with deformable cross-attention for lower latency and faster convergence.
- Optimized Queries: Uses a mixed-query scheme combining learnable content queries and generated spatial queries.
Add LWDetr model by @sbucaille in #40991
LightOnOCR
LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.
Add LightOnOCR model implementation by @baptiste-aubertin in #41621
Bugfixes and improvements
- JetMoe Fix jetmoe after #40132 by @ArthurZucker in #41324
- Fixed tiny incorrect import in gemma3 by @Sai-Suraj-27 in #41354
- Rope for Qwen2--5-vl by @zucchini-nlp in #41173
- 🚨 Bump to Python 3.10 and rework how we check 3rd-party libraries existence by @Cyrilvallez in #41268
- Standardize PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
- Fix trainer for py3.9 by @SunMarc in #41359
- Check model inputs - hidden states by @zucchini-nlp in #40994
- [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
- Fixing a typo for BLT model by @Narsil in #41325
- 🚨 [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
- Fix typo in model proposal template by @Ombucha in #41352
- Better typehints for apply_chat_template by @Samoed in #41355
- 🚨 Remove BetterTransformer by @Cyrilvallez in #41367
- [testing] update test_longcat_generation_cpu by @ydshieh in #41368
- Fix flash_attention.py: wrong argument passing for attn_implementation by @TKONIY in #41347
- Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939 by @sonianuj287 in #41284
- Fixes in check_model_inputs, GPTBigCodeModel and ImageGPTModel by @IlyasMoutawwakil in #40811
- Remove unnecessary list comprehension by @cyyever in #41305
- make some ut cases pass on xpu w/ latest torch by @yao-matrix in #41337
- Remove unused function patameters by @cyyever in #41358
- [CB] Refactors the way we access paged by @ArthurZucker in #41370
- serve: add non-streaming mode to /v1/responses; stream event parity; remove placeholder logprobs by @antznette1 in #41353
- Update from pretrained error when loading by @ArthurZucker in #33380
- [v5] Sync Bert and Bart eager attention by @vasqu in #41248
- fix asr ut failures by @yao-matrix in #41332
- fix resample in asr pipeline by @yhzx233 in #41298
- Correct numerical regression in vision embeddings by @i3hz in #41374
- [kernels] Kernel Config by @MekkCyber in #41232
- [Cache] lfm2 cache: allocate empty kv layers during init by @paulpak58 in #41396
- Fix test for model with dotted name and relative imports by @st81 in #41343
- Prefer raising TypeError exception for invalid type by @Sai-Suraj-27 in #41346
- [v5] Bump accelerate to 1.1.0 by @SunMarc in #41234
- Fix incorrect assignment in update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
- [v5] Delete left traces of feature extractor by @zucchini-nlp in #41321
- Remove deprecation warning by @Cyrilvallez in #41425
- Fix overriding common_kwargs defaults in processor calls by @yonigozlan in #41381
- v5 dev version by @LysandreJik in #41436
- Tiny Cleanup - Removed duplicate class field definition's by @Sai-Suraj-27 in #41293
- 🚨🚨 Remove all traces of legacy cache format by @Cyrilvallez in #41378
- 🚨 [v5] Prune prune_heads by @gante in #41417
- [v5] Bump min version of bitsandbytes to 0.46.1 by @SunMarc in #41283
- Fixing comments in init file by @MekkCyber in #41414
- Use accelerator API to free device memory by @cyyever in #41195
- enable new model uts to xpu and fix some failures on xpu by @yao-matrix in #41386
- [torchao] Add regex support for ModuleFqnToConfig by @jerryzh168 in #41242
- 🤦 CB nit! by @ArthurZucker in #41413
- Remove Python 3.9 classifier by @cyyever in #41410
- [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
- [testing] Fix JetMoeIntegrationTest by @ydshieh in #41377
- Add Top-H decoding (entropy-bounded truncation) as a LogitsWarper for text generation by @ErfanBaghaei in #40837
- Validate processing kwargs with @strict from huggingface_hub by @zucchini-nlp in #40793
- Update hqq.md by @prathamesh-chavan-22 in #41452
- enable some falcon-mamba uts on xpu by @yao-matrix in #41428
- Fix generate outputs and simplify cache tests by @Cyrilvallez in #41440
- Fix doc by @Cyrilvallez in #41457
- 🚨 [v5] Rename left traces of past_key_value in BERT-like models by @zucchini-nlp in #41448
- Subconfig is a class attribute by @zucchini-nlp in #41308
- [v5] rm utils/tf_ops/ by @gante in #41402
- Update GLM-4.1V MMRope implementation by @zRzRzRzRzRzRzR in #41182
- [kernels] Cleanup deta kernel by @MekkCyber in #41470
- 🚨 [v5] Rendundant code in nested configs by @zucchini-nlp in #41314
- Remove KERAS_NLP_IMPORT_ERROR by @cyyever in #41468
- Fix auto model configuration for encoder of perceptionlm by @fschlatt in #41464
- Fix tests fsdp by @SunMarc in #41422
- Import Callable from collections.abc by @cyyever in #41130
- Pickle - part 2 by @ydshieh in #41476
- Remove infer_device by @cyyever in #41088
- Change RT-Detr docs to reflect fixed 640x640 input size by @konstantinos-p in #41364
- Cleaning hub kernels by @MekkCyber in #41477
- [v5] remove load_in_4bit and load_in_8bit by @SunMarc in #41287
- 🚨 [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
- [Fix] Fix test file error by @YangKai0616 in #40973
- enhance patched_tearDown to support python 3.11+ by @yao-matrix in #41429
- RT-Detr correct 2d positional embeddings for non-square images by @konstantinos-p in #41380
- Fix bnb fsdp loading for pre-quantized checkpoint by @SunMarc in #41415
- Remove SigOpt by @SunMarc in #41479
- Remove past_index by @SunMarc in #41384
- Remove deprecated args in Trainer for v5 by @SunMarc in #41404
- Update GLM-4.6 doc by @zRzRzRzRzRzRzR in #41471
- report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
- deprecate overwrite_output_dir by @SunMarc in #41323
- [CI] Fix copies on main by @vasqu in #41486
- [Trainer] deprecate ray scope by @SunMarc in #41403
- deprecate jit_mode_eval by @SunMarc in #41376
- Remove local_rank arg from TrainingArguments by @SunMarc in #41382
- Update philosophy by @molbap in #41438
- Remove DISABLE_KERNEL_MAPPING flag by @MekkCyber in #41475
- Streaming should be handled at the request-level rather than at the istance level by @LysandreJik in #41444
- fix bnb model loading by @jiqing-feng in #41499
- [kernels] Remove RWKV kernel finally ! by @MekkCyber in #41493
- [kernels] rm yoso kernel by @MekkCyber in #41495
- Try to remove pickle - BloomTokenizerFast by @ydshieh in #41466
- Fixed tiny incorrect imports in glm4v by @Sai-Suraj-27 in #41483
- [Parakeet] unnecessary warning & auto mapping by @eustlb in #41412
- [causallm tester] automate pipeline mappings + bloom tests by @gante in #41318
- Fix some tests by @Cyrilvallez in #41503
- fix gemma3n case failure by @yao-matrix in #41426
- [voxtral] language detection + skipping lang:xx by @eustlb in #41225
- Set truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
- [QoL] modular conversion shows LoC saved by @molbap in #41500
- More trainer cleaning by @SunMarc in #41489
- Bump to hfh 1.0.0.rc5 to fix test by @Wauplin in #41508
- Revert local_rank deletion and some cleaning by @SunMarc in #41504
- Fix detectron2 import by @Cyrilvallez in #41510
- add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc by @benkeene in #41484
- Remove outdated flags by @Cyrilvallez in #41512
- remove tpu_num_cores by @SunMarc in #41383
- Allow optuna's catch kwargs passthrough by @nicha-api in #41496
- Fix Latex typesetting in documentation by @cyyever in #41177
- [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
- [Qwen3VL] fix: hidden_states in place modification error by @HollowMan6 in #41535
- Add MLlam
- Jan 26, 2026
- Date parsed from source:Jan 26, 2026
- First seen by Releasebot:Mar 20, 2026
Release candidate v5.0.0rc3
transformers ships release candidate v5.0.0rc3 with new model support for GLM-Lite, GLM-Image, LWDetr, LightOnOCR, and MiniMax-M2, while also tightening deprecations, fixing bugs, improving docs, and strengthening test and CI coverage on the road to the official release.
Release candidate v5.0.0rc3
New models
- [GLM-4.7] GLM-Lite Support by @zRzRzRzRzRzRzR in #43031
- [GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in #43100
- Add LWDetr model by @sbucaille in #40991
- Add LightOnOCR model implementation by @baptiste-aubertin in #41621
What's Changed
We are getting closer and closer to the official release!
This RC focuses on removing more deprecated code, fixing some minor issues, and updating the docs.
- Update Japanese README to match English version by @lilin-1 in #43069
- [docs] Deploying by @stevhliu in #42263
- [docs] inference engines by @stevhliu in #42932
- Fix typos: Remove duplicate duplicate words words by @efeecllk in #43040
- [style] Rework ruff rules and update all files by @Cyrilvallez in #43144
- [CB] Minor fix in kwargs by @remi-or in #43147
- [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT by @sniper35 in #43068
- Fix some deprecated practices in torch 2.9 by @Cyrilvallez in #43167
- Fix Fuyu processor width dimension bug in _get_num_multimodal_tokens by @Abhinavexists in #43137
- Inherit from PreTrainedTokenizerBase by @juliendenize in #43143
- Generation config boolean defaults by @zucchini-nlp in #43000
- Fix failing BartModelIntegrationTest by @Sai-Suraj-27 in #43160
- fix failure of llava/pixtral by @sywangyi in #42985
- GemmaTokenizer: remove redundant whitespace pre-tokenizer by @vaibhav-research in #43106
- Support auto_doctring in Processors by @yonigozlan in #42101
- Fix failing BitModelIntegrationTest by @Sai-Suraj-27 in #43164
- [Fp8] Fix experts by @vasqu in #43154
- Docs: improve wording for documentation build instructions by @Sailnagale in #43007
- [makefile] Cleanup and improve the rules by @Cyrilvallez in #43171
- Some new models added stuff that was already removed by @Cyrilvallez in #43179
- Fixes and compilation warning in torchao docs by @merveenoyan in #42909
- [cache] Remove all deprecated classes by @Cyrilvallez in #43168
- Bump huggingface_hub minimal version by @Wauplin in #43188
- Rework check_config_attributes.py by @Cyrilvallez in #43191
- Fix generation config validation by @zucchini-nlp in #43175
- [style] Use 'x | y' syntax for processors as well by @Wauplin in #43189
- Remove deprecated objects by @Cyrilvallez in #43170
- fix chunked prefill implementation issue-43082 by @marcndo in #43132
- Reduce add_dates verbosity by @yonigozlan in #43184
- Add support for MiniMax-M2 by @rogeryoungh in #42028
- Fix failing salesforce-ctrl, xlm & gpt-neo model generation tests by @Sai-Suraj-27 in #43180
- Less verbose library helpers by @Cyrilvallez in #43197
- run all test files on CircleCI by @ydshieh in #43146
- Clamp temperature to >=1.0 for Dia generation by @Haseebasif7 in #43029
- Fix spelling typos in comments and code by @raimbekovm in #43046
- [docs] llama.cpp by @stevhliu in #43185
- [docs] gptq formatting fix by @victorywwong in #43216
- Grouped beam search from config params by @zucchini-nlp in #42472
- [Generate] Allow custom config values in generate config by @vasqu in #43181
- Fix failing Pix2StructIntegrationTest by @Sai-Suraj-27 in #43229
- Fix missing UTF-8 encoding in check_repo.py for Windows compatibility by @aarushisingh04 in #43123
- [Tokenizer] Change default value of return_dict to True in doc string for apply_chat_template by @kashif in #43223
- Fix failing PhiIntegrationTests by @Sai-Suraj-27 in #43214
- Use HF_TOKEN directly and remove require_read_token by @ydshieh in #43233
- Fix failing Owlv2ModelIntegrationTest & OwlViTModelIntegrationTest by @Sai-Suraj-27 in #43182
- Fix flashattn wrt quantized models by @SunMarc in #43145
- Remove unused imports by @cyyever in #43078
- Fix unsafe torch.load() in _load_rng_state allowing arbitrary code execution by @ColeMurray in #43140
- Reapply modular to examples by @Cyrilvallez in #43234
- More robust diff checks in add_dates by @yonigozlan in #43199
- docs: fix grammatical error in README.md by @davidfertube in #43236
- Fix typo: seperately → separately in lw_detr converter by @skyvanguard in #43235
- Qwen-VL video processor accepts min/max pixels by @zucchini-nlp in #43228
- Deprecate dtype per sub config by @zucchini-nlp in #42990
- Remove more deprecated objects/args by @Cyrilvallez in #43195
- [CB] Soft-reset offloading by @remi-or in #43150
- Make benchmark-v2 to be device agnostic, to support more torch built-in devices like xpu by @yao-matrix in #43153
- Fix benchmark script by @Cyrilvallez in #43253
- Adding to run slow by @IlyasMoutawwakil in #43250
- Fix failing Vip-llava model integration test by @Sai-Suraj-27 in #43252
- Remove deprecated and unused position_ids in all apply_rotary_pos_emb by @Cyrilvallez in #43255
- fix _get_test_info in testing_utils.py by @ydshieh in #43259
- Fix failing Hiera, SwiftFormer & LED Model integration tests by @Sai-Suraj-27 in #43225
- [style] Fix init isort and align makefile and CI by @Cyrilvallez in #43260
- [docs] tensorrt-llm by @stevhliu in #43176
- [consistency] Ensure models are added to the _toctree.yml by @Cyrilvallez in #43264
- Fix failing PegasusX, Mvp & LED model integration tests by @Sai-Suraj-27 in #43245
- [CB] Ensure parallel decoding test passes using FA by @remi-or in #43277
- fix crash in when running FSDP2+TP by @sywangyi in #43226
- [ci] Fixing some failing tests for important models by @Abdennacer-Badaoui in #43231
New Contributors
- @efeecllk made their first contribution in #43040
- @sniper35 made their first contribution in #43068
- @Abhinavexists made their first contribution in #43137
- @vaibhav-research made their first contribution in #43106
- @Sailnagale made their first contribution in #43007
- @rogeryoungh made their first contribution in #42028
- @Haseebasif7 made their first contribution in #43029
- @victorywwong made their first contribution in #43216
- @aarushisingh04 made their first contribution in #43123
- @ColeMurray made their first contribution in #43140
- @davidfertube made their first contribution in #43236
- @skyvanguard made their first contribution in #43235
- @baptiste-aubertin made their first contribution in #41621
Full Changelog: v5.0.0rc2...v5.0.0rc3
Original source Report a problem