Meta AI Release Notes

Last updated: Apr 23, 2026

  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Segment Anything 2 Demo

    Meta AI launches Segment Anything 2 demo for video cutouts and effects with a few clicks.

    Segment Anything 2 Demo

    Create video cutouts and effects with a few clicks

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    FAIRChem v2

    Meta AI reports FAIRChem v2 introduces UMA, a universal machine learning potential with state-of-the-art accuracy.

    FAIRChem v2 introduces the UMA model — a universal machine learning potential for atoms. This is a breaking change from v1 and is not compatible with previous pretrained models.

    UMA is trained on 500M+ DFT calculations across molecules, materials, and catalysts — achieving state-of-the-art accuracy with energy conservation and fast inference.
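    As a rough illustration of how a universal machine learning potential such as UMA is typically used, the sketch below runs the standard ASE calculator workflow. ASE's toy EMT calculator stands in for the actual UMA calculator, whose loading API is defined in the FAIRChem v2 release; only the ASE calls shown here are standard parts of that workflow.

    ```python
    # Minimal sketch of the ASE calculator workflow a universal ML potential
    # plugs into. EMT is a toy stand-in; in practice you would attach the
    # FAIRChem v2 UMA calculator here instead (see the FAIRChem docs).
    from ase.build import molecule
    from ase.calculators.emt import EMT
    from ase.optimize import BFGS

    atoms = molecule("H2O")          # a water molecule from ASE's built-in library
    atoms.calc = EMT()               # stand-in calculator; swap in the UMA potential here

    print("energy (eV):", atoms.get_potential_energy())
    print("forces shape:", atoms.get_forces().shape)

    BFGS(atoms, logfile=None).run(fmax=0.05)   # relax the geometry with the attached potential
    print("relaxed energy (eV):", atoms.get_potential_energy())
    ```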

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Seamless Communication

    Meta AI releases Seamless Communication, a suite of AI translation models that aims to make cross-language speech more natural, expressive and fast. It includes SeamlessExpressive, SeamlessStreaming and SeamlessM4T v2, and Meta is publicly releasing the models, data and tools.

    AI research by Meta

    Seamless Communication

    A significant step towards removing language barriers through expressive, fast and high-quality AI translation

    A family of AI research models that enable more natural and authentic communication across languages

    The Seamless Communication models

    SeamlessExpressive

    A model that aims to preserve expression and intricacies of speech across languages.

    SeamlessStreaming

    A model that can deliver speech and text translations with around two seconds of latency.

    SeamlessM4T v2

    A foundational multilingual and multitask model that allows people to communicate effortlessly through speech and text.

    Seamless

    A model that merges capabilities from SeamlessExpressive, SeamlessStreaming and SeamlessM4T v2 into one.

    Preserving prosody

    SeamlessExpressive

    Translations should capture the nuances of human expression. While existing translation tools are skilled at capturing the content within a conversation, they typically rely on monotone, robotic text-to-speech systems for their output. SeamlessExpressive aims to preserve the intricacies of speech, such as pauses and speech rate, in addition to vocal style and emotional tone.

    Try the SeamlessExpressive demo

    English input: whisper

    Please keep the volume down. We just put the baby to sleep.

    Spanish output: non-expressive

    Spanish output: expressive

    English input: sad

    Please, don't leave. I hate being here alone.

    French output: non-expressive

    French output: expressive

    Near real-time translation

    SeamlessStreaming

    SeamlessStreaming is the first massively multilingual model that delivers translations with around two seconds of latency and nearly the same accuracy as an offline model. Built upon SeamlessM4T v2, SeamlessStreaming supports automatic speech recognition and speech-to-text translation for nearly 100 input and output languages, in addition to speech-to-speech translation for nearly 100 input languages and 36 output languages.

    Foundational model for universal translation

    SeamlessM4T v2

    In August 2023, we introduced the first version of SeamlessM4T, a foundational multilingual and multitask model that delivered state-of-the-art results for translation and transcription across speech and text. Built upon this work, our improved model, SeamlessM4T v2, serves as the foundation for our new SeamlessExpressive and SeamlessStreaming models. It features a new architecture with a non-autoregressive text-to-unit decoder that delivers improved consistency between text and speech output.

    More model details

    Learn more about the research behind Seamless Communication

    Try the SeamlessExpressive demo

    Try the SeamlessExpressive demo to hear how you sound in a different language while maintaining elements of your expression and tone.

    Our approach to research

    Open innovation

    We believe in the power of collaboration and open research to break down communication barriers. To enable our fellow researchers to build upon this work, we’re publicly releasing the full suite of Seamless Communication models, along with metadata, data and tools.

    Safety and responsibility

    We’re dedicated to promoting a safe and responsible AI ecosystem. We have taken a number of steps to improve the safety of our Seamless Communication models, including significantly reducing hallucinated toxicity in translations and implementing a custom watermarking approach for audio outputs from our expressive models.

    Resources

    More on Seamless Communication

    Explore additional resources, including the research paper, model details and more.

    Technical overview

    More details on how we developed the suite of Seamless Communication models.

    Seamless research paper

    Methodology, benchmarks, research findings and more from the Seamless Communication project.

    AI at Meta blog

    Read the full post about the journey, research and milestones achieved.

    Download the models

    Get access to our suite of publicly available models.

    SeamlessExpressive Demo

    Hear how you sound in a different language while maintaining elements of your expression and tone.

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Meta Video Seal

    Meta AI introduces Video Seal, an open-source video watermarking model that embeds durable, invisible watermarks and hidden messages to help verify video origin even after editing.

    Introducing Meta Video Seal

    A state-of-the-art, open-source model for video watermarking

    With AI-generated content on the rise, verifying video origins is crucial. Video Seal is a neural watermarking model that embeds durable, invisible watermarks that remain detectable even after video editing.

    Imperceptible watermarks

    Video Seal embeds an invisible watermark into videos, with the option to include a hidden message.

    Robust and Resilient

    Video Seal's watermarks are resilient, withstanding distortions such as flipping and blurring.

    Origin Verification

    The watermark and hidden message can be revealed to verify the video's origin.

    How the demo works

    1. Choose a video from the library to explore the model, or upload your own to get started.
    2. Embed a watermark and a hidden message of up to 6 characters in your video.
    3. Use the comparison slider to view an enhanced X-ray visualization of the watermark on the video.
    4. Stress test the watermark by distorting the video and verifying if the watermark and hidden message remain detectable.
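    For readers who prefer code, here is a hedged sketch of the embed-and-verify workflow the demo steps describe. The embed and detect functions are placeholders, not the real open-source Video Seal API; only the message-to-bits packing and the accuracy check are concrete.

    ```python
    # Sketch of the demo workflow: pack a short message into bits, "embed" it,
    # then "detect" it and measure bit accuracy. The embed/detect functions are
    # placeholders standing in for the real Video Seal model.
    import numpy as np

    def message_to_bits(msg: str, max_chars: int = 6) -> np.ndarray:
        """Pack an ASCII message of up to `max_chars` characters into a bit vector."""
        padded = msg[:max_chars].ljust(max_chars)
        return np.unpackbits(np.frombuffer(padded.encode("ascii"), dtype=np.uint8))

    def bits_to_message(bits: np.ndarray) -> str:
        return np.packbits(bits).tobytes().decode("ascii").rstrip()

    # --- placeholders standing in for the real watermarking model ---
    def embed_watermark(frames, bits):       # would return imperceptibly modified frames
        return frames, bits                  # placeholder: carry the bits alongside

    def detect_watermark(marked):            # would recover the bits from the frames alone
        return marked[1]

    frames = np.zeros((16, 256, 256, 3), dtype=np.uint8)   # dummy 16-frame clip
    bits = message_to_bits("meta42")
    recovered = detect_watermark(embed_watermark(frames, bits))
    print(bits_to_message(recovered), "| bit accuracy:", (bits == recovered).mean())
    ```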
    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Introducing Meta Motivo

    Meta AI releases Meta Motivo, a behavioral foundation model for zero-shot control of a virtual physics-based humanoid. It also adds a new humanoid benchmark, training code, and a demo, with strong whole-body task performance across motion tracking, pose reaching, and reward optimization.

    A Meta FAIR release

    Introducing Meta Motivo

    A first-of-its-kind behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.

    Try the demo

    Download the model

    Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models

    Meta Motivo is a behavioral foundation model pre-trained with a novel unsupervised reinforcement learning algorithm to control the movements of a complex virtual humanoid agent. At test time, our model can be prompted to solve unseen tasks such as motion tracking, pose reaching, and reward optimization without any additional learning or fine-tuning.

    Read the research paper

    Physics-based environment

    The model has learned to control the agent, subject to the physics of its body and environment. Its behaviors are robust to variations and perturbations.

    Different prompts for behaviors

    The model can be prompted with motions to track, poses to reach, and rewards to optimize.

    Zero-shot capability

    The model computes the best behavior for each prompt without any additional learning or fine-tuning.

    Explore the Research

    We are releasing the pre-trained model together with the new humanoid benchmark and the training code. We hope this will encourage the community to advance research toward building behavioral foundation models that can generalize to more complex tasks, and potentially to different types of agents.

    Key takeaways

    • We introduce a new algorithm that grounds the forward-backward unsupervised reinforcement learning method with an imitation objective leveraging a dataset of unlabeled trajectories.
    • With this new approach, we train Meta Motivo, a behavioral foundation model that controls a high-dimensional virtual humanoid agent to solve a wide range of tasks.
    • We evaluated our model using a new humanoid benchmark across motion tracking, pose reaching, and reward optimization tasks. Meta Motivo achieved competitive performance with task-specific methods, while outperforming state-of-the-art unsupervised RL and model-based baselines.

    The Algorithm

    Forward-Backward representations with Conditional Policy Regularization (FB-CPR) is a novel algorithm combining unsupervised forward-backward representations [1, 2, 3] with an imitation learning loss regularizing policies to cover states observed in a dataset of unlabeled trajectories. Our algorithm is trained online through direct access to the environment and, crucially, it learns a representation that aligns the embeddings of states, motions, and rewards in the same latent space. As a result, we can train models whose policies are grounded in useful behaviors while remaining capable of zero-shot inference across a wide range of tasks, such as goal-based RL, imitation learning, reward optimization, and tracking.

    The final model includes two components: 1) an embedding network that receives the state of the agent as input and returns its embedding; 2) a policy network, parameterized by the same embedding, that receives the state as input and returns the action to take.

    Inference from various types of prompts

    Our algorithm learns a representation that aligns states, rewards, and policies in the same latent space. We can then leverage this representation to perform zero-shot inference for different tasks.

    Motion tracking

    Pose reaching

    Reward optimization
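    As a concrete sketch of the two-component structure and zero-shot prompting described above, the snippet below builds a reward prompt as a reward-weighted average of state embeddings and a pose prompt as the embedding of a goal state, then feeds either latent to a single shared policy. The network sizes, the latent normalization, and the exact prompt formulas are assumptions for illustration, not Meta Motivo's actual recipe.

    ```python
    # Sketch of zero-shot prompting with a forward-backward style model.
    # Stand-in MLPs with made-up sizes; not the released Meta Motivo model.
    import torch
    import torch.nn as nn

    STATE_DIM, ACT_DIM, LATENT_DIM = 358, 69, 256   # illustrative sizes only

    embed = nn.Sequential(nn.Linear(STATE_DIM, 512), nn.ReLU(), nn.Linear(512, LATENT_DIM))   # B(s)
    policy = nn.Sequential(nn.Linear(STATE_DIM + LATENT_DIM, 512), nn.ReLU(), nn.Linear(512, ACT_DIM))

    def normalize(z):
        return z / z.norm() * LATENT_DIM ** 0.5      # project onto the latent sphere (assumption)

    def prompt_from_reward(states, rewards):
        # Reward prompt: reward-weighted average of state embeddings.
        return normalize((embed(states) * rewards.unsqueeze(-1)).mean(dim=0))

    def prompt_from_pose(goal_state):
        # Pose-reaching prompt: embed the goal state directly.
        return normalize(embed(goal_state))

    def act(state, z):
        # The same policy serves every prompt type; only the latent z changes.
        return policy(torch.cat([state, z], dim=-1))

    states = torch.randn(1024, STATE_DIM)            # dummy dataset of states
    z_run = prompt_from_reward(states, torch.randn(1024))
    print(act(torch.randn(STATE_DIM), z_run).shape)  # torch.Size([69])
    ```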

    Performance improvement during pre-training

    Meta Motivo is a behavioral foundation model trained on a SMPL-based humanoid simulated with the MuJoCo simulator using a subset of the AMASS motion capture dataset and 30 million online interaction samples.

    The videos below illustrate the behaviors corresponding to one motion tracking task (a cartwheel motion), one pose reaching task (an arabesque pose), and one reward optimization task (running) at different stages of the pre-training process. Despite the model not being explicitly trained to optimize any of these tasks, we see the performance improving during training and more human-like behaviors emerge.

    Motion tracking

    Pose reaching

    Reward optimization

    Evaluation Results

    For evaluation, we have developed a new humanoid benchmark including motions to track, stable poses to reach, and reward functions to optimize. We consider several different baselines including 1) methods that are retrained for each task separately; 2) behavioral foundation models and model-based algorithms. We are releasing the code with the specification files needed to use the simulator and evaluate the model performance on the tasks that are used in the paper.

    Quantitative

    Our model achieves between 61% and 88% of the performance of top-line methods retrained for each task, while outperforming all other algorithms except on tracking, where it is second best behind Goal-TD3, which cannot be used for reward-based tasks.

    Results

    Motion tracking

    Pose reaching

    Reward optimization

    Qualitative

    To further analyze the performance gap in reward-based and goal-based tasks between Meta Motivo and single-task TD3, we ran a human evaluation with the objective of having a qualitative assessment of the learned behaviors in terms of human-likeness. This evaluation reveals that policies purely optimized for performance (TD3) produce much less natural behaviors than Meta Motivo, which better trades off performance and qualitative behaviors.

    Results

    Pose reaching

    Reward optimization

    Understanding the behavioral latent space

    One of the crucial aspects of our new algorithm is that it uses the same representation to embed states, rewards, and motions in the same space. We therefore investigated the structure of the learned behavioral latent space.

    Visualization

    Interpolation

    In the image above, we visualize the embedding of motions classified by their activity (e.g., jumping, running, crawling) and reward-based tasks. Not only does the representation group semantically similar motions into clusters, it also creates a latent space where rewards and motions are well aligned.

    Limitations

    Meta Motivo is our first attempt to train behavioral foundation models with zero-shot capabilities across several different prompt types. While the model achieved strong quantitative and qualitative results, it still suffers from several limitations.

    Motion tracking

    Pose reaching

    Reward optimization

    Fast movements and motions on the ground are poorly tracked. The model also exhibits unnatural jittering.

    Try it yourself

    Control the behavior of an embodied virtual agent through various prompts, including creating your own! See how the agent adjusts to changes in physics and environmental conditions, like gravity and wind.

    Try the demo

    References

    1. Ahmed Touati, Yann Ollivier. Learning One Representation to Optimize All Rewards. NeurIPS 2021.
    2. Ahmed Touati, Jérémy Rapin, Yann Ollivier. Does Zero-shot Reinforcement Learning Exist? ICLR 2023.
    3. Matteo Pirotta, Andrea Tirinzoni, Ahmed Touati, Alessandro Lazaric, Yann Ollivier. Fast Imitation via Behavior Foundation Models. ICLR 2024.
    4. Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, Michael J. Black. SMPL: A Skinned Multi-Person Linear Model. ACM Transactions on Graphics 2015.
    5. MuJoCo: Advanced physics simulation.
    6. Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, Michael J. Black. AMASS: Archive of Motion Capture as Surface Shapes. ICCV 2019.
    7. https://github.com/facebookresearch/humenv

    Acknowledgements

    Research Authors

    Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, Matteo Pirotta

    Project Contributors (alphabetical)

    Claire Roberts, Dominic Burt, Jiemin Zhang, Leonel Sentana, Maria Ruiz, Matt Hanson, Morteza Behrooz, Ryan Winstead, Spaso Ilievski, Vincent Moens, Vlad Bodurov, William Ngan


    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    DINOv3

    Meta AI releases DINOv3, a self-supervised vision foundation model that brings stronger universal backbones, dense image features, and broad performance across detection, segmentation, depth estimation, and tracking. It also expands the model suite with efficient options for diverse deployment needs.

    INTRODUCING DINOV3

    Self-supervised learning for vision at unprecedented scale

    DINOv3 scales self-supervised learning (SSL) for images to produce our strongest universal vision backbones, enabling breakthrough performance across diverse domains.

    Download DINOv3

    Read the research paper

    DINOV3 OVERVIEW

    Cutting-edge image representations, trained without human supervision

    We scaled unsupervised training to 7B-parameter models and a 1.7B-image dataset, using a fraction of the compute required by weakly supervised methods. Even with the backbones kept frozen during evaluation, they achieve state-of-the-art performance across diverse domains.

    Read the research paper

    Exceptional performance across visual domains

    SSL unlocks domains where annotations are scarce or costly. Our backbones enable state-of-the-art results for tasks ranging from object detection in web imagery to canopy height mapping in satellite and aerial imagery.

    Versatile backbone with powerful dense image features

    High-resolution dense features from a single DINOv3 backbone enable leading performance across vision tasks, including object detection, depth estimation, and segmentation, without any finetuning.

    Efficient model sizes and architectures

    We release a comprehensive model suite addressing a wide range of use cases, including broad coverage of ViT sizes and efficient ConvNeXt models for on-device deployment.

    PERFORMANCE

    Evaluating DINOv3's Performance

    DINOv3 sets a new standard in vision foundation models. For the first time, a model trained with SSL outperforms weakly-supervised models on a broad range of probing tasks, from fine-grained image classification, to semantic segmentation, to object tracking in video.

    APPLICATIONS

    DINO in action

    From challenging annotation scenarios to efficiency-critical deployments, see how researchers and developers use DINO to build breakthrough applications.

    Download DINOv3

    World Resources Institute

    WRI measures tree canopy heights with DINO, helping civil society organizations worldwide monitor reforestation.

    Learn more

    NASA JPL

    NASA JPL uses DINO for Mars exploration robots, enabling multiple vision tasks with minimal compute.

    Learn more

    Orakl Oncology & CentraleSupelec

    Orakl Oncology & CentraleSupelec pre-trains DINO on organoid images, producing a backbone to power prediction of patient responses to cancer treatments.

    Learn more

    APPROACH

    Self-supervised pre-training unlocks simple task adaptation

    Pre-training data is curated from a large unlabeled dataset. During pre-training, the model learns general-purpose visual representations, matching features between different augmented views of the same image. In post-training, the model is distilled into more efficient models.

    A pre-trained DINOv3 model can be easily tailored by training a lightweight adapter on a small amount of annotated data.
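    A minimal sketch of that adapter idea is shown below: a linear probe trained on top of a frozen backbone. The torch.hub repository and entry-point names, as well as the 768-dimensional feature size, are assumptions for illustration; the official DINOv3 release documents the real loading API.

    ```python
    # Linear probe on frozen backbone features (sketch). The hub names and the
    # feature dimension below are assumptions; see the DINOv3 release for the
    # actual loading API.
    import torch
    import torch.nn as nn

    backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vitb16")  # hypothetical entry point
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                 # keep the backbone frozen

    head = nn.Linear(768, 10)                   # lightweight task-specific adapter (10 classes assumed)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(images, labels):
        with torch.no_grad():
            feats = backbone(images)            # global features from the frozen model
        loss = loss_fn(head(feats), labels)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()
    ```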

    DINO Evolution

    DINOv3 marks a new milestone in self-supervised training at scale. It builds upon the scaling progress of DINOv2, further increasing the model size by 6x and the training data by 12x.

    DINO

    Initial research proof-of-concept, with 80M-parameter models trained on 1M images.

    Read the research paper

    Download the model

    DINOv2

    First successful scaling of an SSL algorithm, with 1B-parameter models trained on 142M images.

    Read the research paper

    Download the model

    DINOv3

    An order-of-magnitude increase in training scale compared to v2, with a particular focus on dense features.

    Read the research paper

    Download the model

    Explore additional resources

    Read the AI at Meta blog

    Read the research paper

    Download DINOv3

    DINOv3 on Hugging Face

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Introducing Meta Segment Anything Model Audio (SAM Audio)

    Meta AI launches SAM Audio, a multimodal sound separation model that uses text, visual, and span prompts to isolate target audio from complex mixes. It also adds PE-AV to Perception Encoder and releases a new OSS evaluation set with a judge model.

    With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audio-visual source.

    SAM AUDIO CAPABILITIES

    SAM Audio separates target and residual sounds from any audio or audiovisual source—across general sound, music, and speech.

    Text prompts

    SAM Audio enables you to use text-based prompts to describe the specific target audio you want to separate.

    Visual prompts

    SAM Audio lets you pick out and separate sounds by clicking on the part of the video where you hear them.

    Span prompts

    SAM Audio is the first model to introduce span prompting, letting you select the span of the timeline that contains the target audio.

    Multi-modal prompts

    SAM Audio provides flexibility with three unified prompt modalities (text, visual, timespan).

    A NEW WAY TO EXPERIENCE SOUND

    State-of-the-art model for all sound

    SAM Audio is a state-of-the-art, unified multimodal model that sets a new standard for audio separation, enabling users to isolate general sounds, music, and speech from complex mixtures using intuitive prompts.

    PERFORMANCE

    State-of-the-art model performance

    SAM Audio achieves beyond state-of-the-art performance for all prompting capabilities.

    OUR APPROACH

    Model architecture

    SAM Audio is a generative separation model that extracts both target and residual stems from an audio mixture using text, visual, or temporal prompts. It is powered by a flow-matching Diffusion Transformer and operates in a DAC-VAE latent space, enabling high-quality joint generation of target and residual audio.
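    To make the flow-matching idea concrete, here is a generic sketch of the training objective in an arbitrary latent space. It is a minimal illustration of the technique, not Meta's implementation: the DiT architecture, DAC-VAE latents, and prompt conditioning used by SAM Audio are abstracted into a small placeholder network.

    ```python
    # Generic flow-matching training step in a latent space (illustrative only).
    import torch
    import torch.nn as nn

    class TinyVelocityNet(nn.Module):
        """Predicts the velocity field v(x_t, t | cond)."""
        def __init__(self, latent_dim=64, cond_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 256),
                                     nn.SiLU(), nn.Linear(256, latent_dim))

        def forward(self, x_t, t, cond):
            return self.net(torch.cat([x_t, cond, t], dim=-1))

    model = TinyVelocityNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    def flow_matching_step(target_latents, cond):
        # x1 = target stem latents, x0 = Gaussian noise; learn the straight-line velocity.
        x1, x0 = target_latents, torch.randn_like(target_latents)
        t = torch.rand(x1.shape[0], 1)              # random time in [0, 1]
        x_t = (1 - t) * x0 + t * x1                 # interpolate between noise and data
        v_target = x1 - x0                          # velocity of the straight path
        loss = ((model(x_t, t, cond) - v_target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    print(flow_matching_step(torch.randn(8, 64), torch.randn(8, 32)))
    ```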

    OUR APPROACH

    Audiovisual Perception Encoder

    PE-AV is a new open source model, bringing audio capabilities to Meta's Perception Encoder.

    THE SAM AUDIO EVALUATION DATASET

    A first-of-its-kind audio separation OSS evaluation set

    We're releasing a first-of-its-kind OSS evaluation set for prompted audio separation, along with a judge model highly correlated with human subjective evaluation.

    Real world opportunities

    "Artificial Intelligence has been a game changer for the disabled community and the use cases for AI-focused start-ups in our ecosystem are vast. By incorporating open source models like SAM Audio into their work, 2GI’s cohort participants can advance their missions while gaining competitive advantage, showcasing that disabled founders are on the cutting edge of technology."

    • Diego Mariscal, CEO of 2gether-International

    2gether-International empowers disabled founders with resources to launch high-impact startups. In partnership with Meta’s AI for Good team, 2GI leverages open AI models like SAM Audio to accelerate innovation for early-stage, founder-led AI companies.

    "For years, Starkey has led the industry in applying artificial intelligence to revolutionize hearing technology. Our ground-breaking work continues to elevate what hearing aids can achieve, particularly in challenging listening situations like noisy environments and overlapping speech. With open models like SAM audio, we see tremendous opportunity to build on our innovations and further our mission to help people hear better and live better."

    • Achin Bhowmik, Chief Technology Officer and Executive Vice President of Engineering at Starkey

    Starkey is the global leader in hearing technology and the only global American-owned hearing aid manufacturer. Using AI, Starkey transforms hearing aids into smart health and communication devices—delivering innovative, connected solutions that enhance lives.

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Introducing Meta SAM 3D

    Meta AI introduces SAM 3D, a new single-image 3D reconstruction system that brings objects and humans to life with accurate shape, pose, geometry, texture, and full scene context. It includes SAM 3D Body and SAM 3D Objects and is aimed at practical 3D applications.

    AI RESEARCH FROM META

    Introducing Meta SAM 3D

    SAM 3D can bring any 2D image to life, accurately reconstructing objects and humans, including their shape and pose.

    SAM 3D CAPABILITIES

    Accurately reconstruct objects and bodies

    Object reconstruction

    SAM 3D enables precise 3D reconstruction of objects from real images, accurately recovering their geometry and texture.

    Body pose & shape estimation

    SAM 3D allows for accurate 3D reconstruction of human body shape and position from a single image.

    Scene reconstruction

    SAM 3D works on real images in the wild, maintaining strong fidelity and quality.

    Real world 3D perception

    SAM 3D enables full scene reconstructions, placing objects and humans in a shared context together.

    The SAM 3D models

    SAM 3D contains two state-of-the-art models that enable 3D reconstruction of objects and humans from a single image.

    SAM 3D Objects

    Single image input

    Detailed 3D reconstruction of any masked objects, including geometry and texture

    Independent, posed 3D models, suitable for manipulation & interaction

    Reconstructions are robust to occlusion in the input image

    Position multiple objects into a scene, jointly with SAM 3D Body reconstructions

    SAM 3D Body

    Single image input

    Reconstructs body shape and pose, including unique positions and partial visibility

    Suitable for manipulation and interaction

    Promptable with joint reconstructions

    Position multiple people into a scene, jointly with SAM 3D Objects reconstructions

    Designed for practical 3D applications

    Enhancing Facebook Marketplace shopping

    Place a 3D AR overlay of home decor, like a lamp or a table, from Marketplace in your room to visualize the style and fit within your space before purchasing.

    Experiment with SAM 3D today

    OUR APPROACH

    Model architecture

    SAM 3D is a suite of two models, SAM 3D Body and SAM 3D Objects:

    • The SAM 3D Body model uses a transformer-based encoder-decoder architecture to predict 3D human pose and mesh parameters directly from images, enabling accurate and interactive pose regression.
    • The SAM 3D Objects model employs two stages of DiTs—first generating 3D object shape and pose, then refining texture and details—to deliver high-fidelity, realistic 3D reconstructions.

    BENCHMARKS

    State-of-the-art performance

    SAM 3D achieves beyond state-of-the-art performance across a series of benchmarks for both its models.

    THE SAM 3D ARTIST OBJECT DATASET

    A dataset of diverse and high-quality 3D meshes

    A new first-of-its-kind evaluation set for visually grounded 3D reconstruction in real-world images, with diverse images and objects that are significantly more challenging than existing 3D benchmarks. This represents a new way to measure research progress in 3D, and pushes the field away from curated images/synthetic assets and towards real-world perception and common-sense 3D understanding.

    More from Segment Anything

    SAM 3

    With SAM 3, you can use text and visual prompts to precisely detect, segment and track any object in an image or video.

    Original source
  • April 2026
    • No date parsed from source.
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Introducing Meta Segment Anything Model 3 (SAM 3)

    Meta AI adds SAM 3, a promptable segmentation model that uses text, exemplars and visual clicks to identify, segment and track objects in images and videos, with state-of-the-art performance and upcoming support for Instagram Edits and Vibes.

    With SAM 3 you can use text and visual prompts to precisely identify, segment, and follow any object in images or videos—coming soon to Instagram Edits and Vibes on the Meta AI app.

    SAM 3 CAPABILITIES

    Advanced features, simple prompts

    Using open vocabulary text or visual prompts, SAM 3 can detect, segment and track all matching objects in images and videos.

    Text prompts

    You can prompt SAM 3 with words and short phrases to mask all objects matching the text description.

    Exemplar Prompts

    With exemplar prompts, you can simply draw a box around an example of the object you want to segment, and SAM 3 will mask all objects matching the outlined example.

    Visual prompts

    With all the capabilities of SAM 2, SAM 3 allows you to segment objects using positive and negative clicks.

    Interactivity

    If SAM 3 ever misses an object or makes a mistake, you can easily add follow-up prompts to help further guide the model.
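    The prompt types above can be pictured as simple data structures. The sketch below is purely illustrative: the class names and the segment function are hypothetical, and the released SAM 3 code defines its own API.

    ```python
    # Illustrative prompt containers and a stub call; not the SAM 3 API.
    from dataclasses import dataclass, field

    @dataclass
    class TextPrompt:
        phrase: str                                   # open-vocabulary noun phrase, e.g. "yellow school bus"

    @dataclass
    class ExemplarPrompt:
        box: tuple                                    # (x0, y0, x1, y1) drawn around one example object

    @dataclass
    class ClickPrompt:
        points: list = field(default_factory=list)    # [(x, y, is_positive), ...] refinement clicks

    def segment(image, prompts):
        """Placeholder: a real model would return one mask per matching object instance."""
        return [f"mask set for {p}" for p in prompts]

    prompts = [TextPrompt("yellow school bus"),
               ExemplarPrompt(box=(120, 40, 320, 200)),
               ClickPrompt(points=[(200, 150, True), (50, 60, False)])]
    print(segment(None, prompts))
    ```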

    BENCHMARKS

    State-of-the-art performance

    SAM 3 is state-of-the-art across all text and visual segmentation tasks in both images and videos. The model additionally maintains all the performance and functionality of SAM 2.

    Designed for real-world applications

    Edits is the new video creation app by Instagram that helps creators make great videos on their phones. Creators will soon be able to use SAM 3 in Edits to quickly apply effects to people or objects in their videos, helping their creations stand out.

    ENHANCED CAPABILITIES

    Evolution of SAM

    The Segment Anything models build on each other, offering increasingly advanced capabilities for developers and researchers to create, experiment and uplevel media workflows.

    SAM 3

    Detect, segment and track every example of any object category in an image or video, using text or examples

    Segment an object from a click

    Track segmented objects in videos

    Refine prediction with follow up clicks

    Detect and segment matching instances from text

    Refine detection with visual examples

    SAM 2

    Segment and track any object in any image or video using click, box or mask prompts

    Segment an object from a click

    Track segmented objects in videos

    Refine prediction with follow up clicks

    SAM 1

    Segment any object in any image with as little as a single click

    Segment an object from a click

    Refine prediction with follow up clicks

    Try SAM 3 today

    Experiment with SAM 3 in the Segment Anything Playground.

    OUR APPROACH

    New unified architecture

    SAM 3 is built as a unified, promptable model that enables segmentation with language, exemplars and visual prompts across images and videos. It leverages a large-scale, diverse training dataset and a powerful perception encoder backbone to achieve state-of-the-art performance in segmentation and tracking using open-vocabulary short text phrases and visual prompts.

    More from Segment Anything

    SAM 3D enables precise reconstruction and analysis of 3D people and objects, providing new opportunities for spatial understanding and applications.

    Original source
  • Apr 8, 2026
    • Date parsed from source:
      Apr 8, 2026
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    Introducing Muse Spark: Scaling Towards Personal Superintelligence

    Meta AI releases Muse Spark, a new natively multimodal reasoning model with tool-use, visual chain of thought, and multi-agent orchestration. It is available in meta.ai and the Meta AI app, with a private API preview and Contemplating mode rolling out for harder reasoning tasks.

    Today, we’re excited to introduce Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.

    Muse Spark is the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts. To support further scaling, we are making strategic investments across the entire stack — from research and model training to infrastructure, including the Hyperion data center.

    In this post, we'll first explore Muse Spark's new capabilities and applications. After these results, we’ll look behind the curtain at the scaling axes driving our progress toward personal superintelligence.

    Muse Spark is available today at meta.ai and the Meta AI app. We’re opening a private API preview to select users.

    Capabilities for Personal Superintelligence

    Muse Spark offers competitive performance in multimodal perception, reasoning, health, and agentic tasks. We continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows.

    With larger models in development, these results demonstrate that our stack is scaling effectively.

    We’re also releasing Contemplating mode, which orchestrates multiple agents that reason in parallel. This allows Muse Spark to compete with the extreme reasoning modes of frontier models such as Gemini Deep Think and GPT Pro. Contemplating mode provides significant capability improvements in challenging tasks, achieving 58% in Humanity’s Last Exam and 38% in FrontierScience Research.

    Muse Spark is available now, and Contemplating mode will be rolling out gradually in meta.ai.

    *For more details about our evaluations, see our methodology document.

    Applications

    Muse Spark is the first step toward a personal superintelligence that understands your world. From analyzing your immediate environment to supporting your wellness, the advanced reasoning capabilities of Muse Spark enable powerful, highly personal use cases.

    Multimodal.

    Muse Spark is built from the ground up to integrate visual information across domains and tools. It achieves strong performance on visual STEM questions, entity recognition, and localization. These capabilities come together to enable interactive experiences like creating fun minigames or troubleshooting your home appliances with dynamic annotations.

    Health.

    One major application of personal superintelligence is to help people learn about and improve their health. To improve Muse Spark's health reasoning capabilities, we collaborated with over 1,000 physicians to curate training data that enables more factual and comprehensive responses. Muse Spark can generate interactive displays that unpack and explain health information such as the nutritional content of various foods or muscles activated during exercise.

    Scaling Axes

    To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark's scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning.

    Pretraining.

    The pretraining phase is where Muse Spark acquires its core multimodal understanding, reasoning, and coding abilities — the foundation that reinforcement learning and test-time compute build upon.

    Over the last nine months, we rebuilt our pretraining stack with improvements to model architecture, optimization, and data curation. Together, these advancements increase the capability we can extract from every unit of compute. To rigorously evaluate our new recipe, we fit a scaling law to a series of small models and compare the training FLOPs required to hit a specific level of performance. The results are clear: we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.
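    The sketch below shows the kind of scaling-law comparison described here: fit L(C) = a * C^(-alpha) + L_inf to a handful of small-model runs per recipe, then compare the compute each fitted curve needs to reach the same loss. All numbers are synthetic and purely illustrative.

    ```python
    # Fit a power-law scaling curve per recipe and compare compute-to-target-loss.
    # All numbers are synthetic; this only illustrates the methodology.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(C, a, alpha, L_inf):
        return a * C ** (-alpha) + L_inf

    rng = np.random.default_rng(0)
    C = np.array([1e19, 3e19, 1e20, 3e20, 1e21])            # training FLOPs of the small runs
    loss_old = scaling_law(C, 400.0, 0.10, 1.8) + rng.normal(0, 0.01, C.size)
    loss_new = scaling_law(C, 318.0, 0.10, 1.8) + rng.normal(0, 0.01, C.size)   # ~10x more efficient recipe

    p_old, _ = curve_fit(scaling_law, C, loss_old, p0=(300, 0.1, 1.5), maxfev=50000)
    p_new, _ = curve_fit(scaling_law, C, loss_new, p0=(300, 0.1, 1.5), maxfev=50000)

    def compute_to_reach(target_loss, a, alpha, L_inf):
        # Invert the fitted law: C = (a / (L - L_inf)) ** (1 / alpha).
        return (a / (target_loss - L_inf)) ** (1.0 / alpha)

    target = 5.5
    saving = compute_to_reach(target, *p_old) / compute_to_reach(target, *p_new)
    print(f"estimated compute saving at loss {target}: {saving:.1f}x")
    ```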

    Reinforcement Learning.

    After pretraining, reinforcement learning (RL) leverages compute to scalably amplify model capabilities. Even though large-scale RL is notoriously prone to instability, our new stack delivers smooth, predictable gains.

    The plots below show the benefits of scaling RL compute (measured in steps) for Muse Spark. On the left, we see log-linear growth in pass@1 and pass@16 (at least one success across 16 attempts) on the training data. This indicates that RL is improving model reliability without compromising reasoning diversity. On the right, accuracy growth on a held-out evaluation set establishes that the gains from RL predictably generalize: Muse Spark smoothly improves on tasks that were not seen in training.
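    pass@k in this setting is commonly estimated with the standard unbiased estimator: given n sampled attempts per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether Meta computes pass@16 exactly this way is an assumption; the snippet below only illustrates the estimator itself.

    ```python
    # Unbiased pass@k estimator (Chen et al., 2021); an assumption about how
    # these pass@1 / pass@16 numbers are computed.
    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """P(at least one of k attempts succeeds), given n samples with c correct."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    print(f"pass@1  = {pass_at_k(16, 3, 1):.3f}")   # ~0.188 for 3/16 correct samples
    print(f"pass@16 = {pass_at_k(16, 3, 16):.3f}")  # 1.000 (at least one correct sample exists)
    ```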

    Test-Time Reasoning.

    RL trains our models to "think" before they answer — a process known as test-time reasoning. Serving this capability to billions of users requires efficient use of reasoning tokens. To achieve this, we rely on two key levers: thinking time penalties to optimize token use, and multi-agent orchestration that boosts performance without slowing down response times.

    To deliver the most intelligence per token, our RL training maximizes correctness subject to a penalty on thinking time. On a subset of evaluations such as AIME, this causes a phase transition. After an initial period where the model improves by thinking longer, the length penalty causes thought compression — Muse Spark compresses its reasoning to solve problems using significantly fewer tokens. After compressing, the model again extends its solutions to achieve stronger performance.
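    A minimal sketch of such a length-penalized reward is below; the exact penalty form and coefficient used for Muse Spark are not public, so both are assumptions.

    ```python
    # Length-penalized correctness reward (illustrative form and coefficient).
    def reward(is_correct: bool, num_thinking_tokens: int, lam: float = 1e-4) -> float:
        # Correctness dominates; the penalty nudges the model to compress its reasoning.
        return float(is_correct) - lam * num_thinking_tokens

    print(reward(True, 2_000), reward(True, 12_000), reward(False, 500))
    ```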

    To spend more test-time reasoning without drastically increasing latency, we can scale the number of parallel agents that collaborate to solve hard problems. The figure below illustrates the benefits of this approach. While standard test-time scaling has a single agent think for longer, scaling Muse Spark with multi-agent thinking enables superior performance with comparable latency.
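    A minimal sketch of this parallel-agent pattern: sample several independent attempts concurrently and aggregate their answers, here by majority vote (a judge model is another option). The solve function is a placeholder for a model call; all names are assumptions.

    ```python
    # Parallel test-time scaling sketch: N agents run concurrently, answers are
    # aggregated by majority vote. `solve` is a placeholder for a model call.
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    def solve(problem: str, seed: int) -> str:
        """Placeholder for one agent's independent reasoning attempt."""
        return "42" if seed % 3 else "41"            # dummy answers for illustration

    def contemplate(problem: str, n_agents: int = 8) -> str:
        with ThreadPoolExecutor(max_workers=n_agents) as pool:
            answers = list(pool.map(lambda s: solve(problem, s), range(n_agents)))
        # Wall-clock latency stays close to a single agent's thinking time.
        return Counter(answers).most_common(1)[0][0]

    print(contemplate("What is 6 * 7?"))
    ```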

    Safety

    Muse Spark has broad reasoning capabilities across dual-use scientific domains, so we conducted extensive safety evaluations before deployment. Our process follows the updated Advanced AI Scaling Framework, which defines threat models, evaluation protocols, and deployment thresholds for our most advanced models. We evaluated Muse Spark both before and after applying safety mitigations across frontier risk categories, behavioral alignment, and adversarial robustness.

    We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios. Our evaluations show Muse Spark falls within safe margins across all frontier risk categories we measured given its deployment context. Full results are available in our Safety & Preparedness Report.

    In third-party evaluations on a near-launch checkpoint, Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed. The model frequently identified scenarios as "alignment traps" and reasoned that it should behave honestly because it was being evaluated. This matters because models that recognize evaluation contexts may behave differently during testing than in deployment. However, these results do not confirm that awareness directly alters behavior, and our own follow-up investigation found initial evidence that evaluation awareness may affect model behavior on a small subset of alignment evaluations, all unrelated to hazardous capabilities or propensities affecting model launch decisions. We concluded this was not a blocking concern for release, though it warrants further research. Read more in our Safety & Preparedness Report.

    Conclusion

    With Muse Spark, we're on a predictable and efficient scaling trajectory. We look forward to sharing increasingly capable models on the path to personal superintelligence soon.

    Original source
  • Mar 27, 2026
    • Date parsed from source:
      Mar 27, 2026
    • First seen by Releasebot:
      Apr 23, 2026

    Meta AI by Meta

    SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning

    Meta AI releases Segment Anything Model 3 and Segment Anything Playground, bringing promptable concept segmentation, object tracking, new model weights and fine-tuning code, plus faster video processing with SAM 3.1’s object multiplexing for real-time performance.

    Update March 27, 2026

    We’ve seen incredible adoption of SAM 3 over the last few months, and during that time, we’ve been working behind the scenes on updates to improve video processing efficiency. Today, we’re pleased to introduce SAM 3.1.

    As a drop-in replacement for SAM 3, our updated model delivers a significant boost in video processing efficiency by introducing object multiplexing, which allows the model to track up to 16 objects in a single forward pass. This innovation doubles the processing speed for videos with a medium number of objects, increasing throughput from 16 to 32 frames per second on a single H100 GPU. As a result, SAM 3.1 enables real-time object tracking in complex videos while reducing overall GPU resource requirements, making high-performance applications feasible on smaller, more accessible hardware.

    This improvement comes from a shift in how the model handles multiple objects. Previously, each object required its own dedicated pass, but with multiplexing, SAM 3.1 processes all tracked objects together, eliminating redundant computation and memory bottlenecks. This global reasoning approach streamlines performance and enhances accuracy in crowded scenes.
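    The per-object versus multiplexed difference can be pictured with a toy tracker, shown below. The module, its shapes, and the grouping of 16 objects into one pass are simplified stand-ins; the real SAM 3.1 tracker is far more involved.

    ```python
    # Toy contrast between per-object passes and one multiplexed pass.
    import torch
    import torch.nn as nn

    class ToyTracker(nn.Module):
        """Maps per-frame features plus per-object queries to per-object masks."""
        def __init__(self, dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, frame_feats, object_queries):
            # frame_feats: (H*W, dim); object_queries: (n_obj, dim) -> masks: (n_obj, H*W)
            return object_queries @ self.proj(frame_feats).T

    tracker = ToyTracker()
    frame_feats = torch.randn(1024, 64)           # shared per-frame embedding
    objects = torch.randn(16, 64)                 # queries for 16 tracked objects

    # SAM 3-style: one dedicated pass per object.
    masks_seq = torch.stack([tracker(frame_feats, obj.unsqueeze(0))[0] for obj in objects])

    # SAM 3.1-style multiplexing: all 16 objects in a single forward pass.
    masks_mux = tracker(frame_feats, objects)

    print(torch.allclose(masks_seq, masks_mux))   # True: same result, far fewer passes
    ```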

    We encourage the community to download the SAM 3.1 model checkpoint, explore the updates to the SAM 3 codebase and research paper, and test drive the updated model on the Segment Anything Playground.

    Introducing Meta Segment Anything Model 3 and Segment Anything Playground

    Takeaways:

    • We’re announcing Meta Segment Anything Model 3 (SAM 3), a unified model for detection, segmentation, and tracking of objects in images and video using text, exemplar, and visual prompts.
    • As part of this release, we’re sharing SAM 3 model checkpoints, evaluation datasets, and fine-tuning code.
    • We’re also introducing Segment Anything Playground, a new platform that makes it easy for anyone to understand the capabilities of SAM and experiment with cutting-edge AI models for creative media modification.
    • In Edits, Instagram’s video creation app, SAM 3 will soon enable new effects that creators can apply to specific people or objects in their videos. New creation experiences enabled by SAM 3 will also be coming to Vibes on the Meta AI app and meta.ai on the web.
    • Separately, we’re sharing SAM 3D, a suite of open source models, code, and data for 3D object and human reconstruction from a single image, setting a new standard for grounded 3D reconstruction in physical world scenarios.
    • SAM 3 and SAM 3D are powering Facebook Marketplace’s new View in Room feature, helping people visualize the style and fit of home decor items, like a lamp or a table, in their spaces before purchasing.
    • Together with our partners at Conservation X Labs and Osa Conservation, we’re also launching a first-of-its-kind, publicly available video dataset for wildlife monitoring using SAM 3.

    We’re unveiling the next generation of the Segment Anything collection of models, advancing image and video understanding. Segment Anything Model 3 (SAM 3) introduces some of our most highly requested features like text and exemplar prompts — enabling detection, segmentation, and tracking of any visual concept across images and video. We also want to make it easier for more people to use our models. As part of this release, we’re debuting the Segment Anything Playground, the simplest way for anyone to experiment with applying our state-of-the-art models to media modification.

    Today, we’re releasing the SAM 3 model weights, a demo on Segment Anything Playground, and a research paper that details how we built SAM 3. Additionally, we’re sharing the Segment Anything with Concepts (SA-Co) evaluation dataset to serve as a new benchmark for the community. Separately, we’re sharing SAM 3D, which includes a model for object and scene reconstruction and another for human pose and shape estimation. More information about this release can be found in our SAM 3D blog post.

    At Meta, we’re using these advancements to help build the next generation of creative media tools. SAM 3 and SAM 3D are being used to enable the new View in Room feature on Facebook Marketplace, helping people visualize the style and fit of home decor items, like a lamp or a table, in their spaces before purchasing. New creation experiences enabled by SAM 3 will be coming to Vibes on the Meta AI app and meta.ai on the web, where people can use AI visual creation tools and remix existing AI-generated videos. We’ll also soon be introducing new effects on our Edits app that use SAM 3. Creators can apply dynamic effects to people or objects in their videos — simplifying a complex editing workflow to just one tap.

    Introducing Meta Segment Anything Model 3

    Linking language to specific visual elements in images or videos is a major challenge in computer vision. Traditional models often focus on object segmentation with a fixed set of text labels, restricting their ability to address the full spectrum of user requests, which frequently involve segmenting concepts not present in predefined lists. This means that existing models can segment frequent concepts like “person,” but struggle with more nuanced concepts like “the striped red umbrella”.

    SAM 3 overcomes these limitations by introducing the promptable concept segmentation capability: finding and segmenting all instances of a concept defined by a text or exemplar prompt. SAM 3 accepts text prompts — open-vocabulary short noun phrases — and image exemplar prompts, eliminating the constraints of fixed label sets. To assess large-vocabulary detection and segmentation performance, we created the Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation in images and videos that challenges models to recognize a much larger vocabulary of concepts compared to prior benchmarks. As part of this release, we’re making SA-Co publicly available to support reproducibility and further innovation in open-ended visual segmentation.

    SAM 3 supports a variety of prompt modalities, including both concept prompts such as simple noun phrases and image exemplars, as well as visual prompts, such as masks, boxes, and points, which were introduced in SAM 1 and SAM 2. This increases the flexibility and usability of segmentation, particularly for concepts that are rare or hard to describe with text alone.

    SAM 3 excels at segmenting objects described by short noun phrases, reflecting common user intent in interactive and natural settings. Our model can also be used as a perception tool for multimodal large language models to segment objects described by more complex prompts, such as: “people sitting down, but not holding a gift box in their hands.”

    Overall, SAM 3 delivers a 2x gain over existing systems in both image and video on our promptable concept segmentation benchmark, SA-Co, and improves upon previous SAM capabilities in interactive visual segmentation tasks.

    Building a Novel Data Engine Using AI and Human Annotators

    Obtaining high-quality annotated images with segmentation masks and text labels across a broad range of categories and visual domains is a significant challenge. This type of data doesn’t exist at scale on the web. Exhaustively masking every occurrence of an object category — particularly in video — is a time-intensive and complex task for human annotators. Additionally, building comprehensive coverage for a large and diverse vocabulary across multiple visual domains requires considerable time and resources. Overall, the process is both time-consuming and expensive.

    We address this challenge by creating a scalable data engine that leverages SAM 3, human annotators, and AI models in the loop, which allows dramatic speed-ups in annotation — approximately 5x faster than humans on negative prompts (concepts not present in the image/video) and 36% faster for positive prompts even in challenging fine-grained domains. This hybrid human and AI system enabled us to create a large and diverse training set with over 4 million unique concepts.

    A pipeline of AI models, including SAM 3 and systems such as a Llama-based captioner, automatically mine images and videos, generate captions, parse the captions into text labels, and create initial segmentation masks, which are shown as “candidates” in the above figure.

    Human and AI annotators then verify and correct these proposals, yielding a feedback loop that rapidly scales dataset coverage while continuously improving data quality. AI annotators are based on Llama 3.2v models that were specifically trained to match or surpass human accuracy on annotation tasks, such as verifying if a mask is high quality, or if all instances of a concept are exhaustively masked in an image.

    By delegating some human annotation tasks to AI annotators, we more than double the throughput compared to a human-only annotation pipeline. AI annotators also automatically filter out easy examples, focusing valuable human annotation effort on the most challenging cases where the current version of SAM 3 fails. We also leverage a concept ontology — a dictionary of concepts and their relationships based on Wikipedia — to map text labels into a shared concept space and increase the coverage of less frequent concepts in the data.

    We validate this approach through ablation studies, demonstrating that integrating AI- and human-annotated labels results in measurable improvements in model performance. We further validate that an entirely automated data engine can be used to generate data to automatically expand coverage to new visual and text domains.

    Model Architecture

    Building a model that excels at promptable concept segmentation requires us to maintain strong performance on all tasks compared to individual, task-specific models. This presents significant challenges in model design and in the development of a training recipe, due to potential task conflicts. For example, the task of re-detecting and tracking instances requires visual features that distinguish them from other instances of the same concept. This conflicts with the concept detection task, which requires visual features that are similar for all instances of a concept. Finding the right architecture is an important step in being able to solve all tasks in a unified model. Additionally, designing strong data recipes is essential to prevent issues like catastrophic forgetting as new tasks and data are introduced.

    The SAM 3 model architecture also builds on many previous AI advancements from Meta. The text and image encoders in SAM 3 are from the Meta Perception Encoder, an open source model we shared in April that enables the building of more advanced computer vision systems that can assist people in everyday tasks, such as image recognition and object detection. Using the Meta Perception Encoder enabled us to achieve a significant leap in performance compared to previous encoder choices. The detector component is based on the DETR model, which was the first to use transformers for object detection. The memory bank and memory encoder used in SAM 2 is the basis for the Tracker component. We also used several open source components, including datasets, benchmarks, and model improvements, to advance our work.

    Results

    We achieve a step change in concept segmentation performance in images (measured on SA-Co Gold subset) and videos (on SA-Co Video), with SAM 3 doubling cgF1 scores (a measure of how well the model can recognize and localize concepts) relative to existing models. SAM 3 consistently outperforms both foundational models like Gemini 2.5 Pro and strong specialist baselines such as GLEE, OWLv2, and LLMDet. In studies, users prefer SAM 3 outputs over the strongest baseline, OWLv2, approximately three to one. We also achieve state-of-the-art results on the SAM 2 visual segmentation tasks (mask-to-masklet, point-to-mask), matching or exceeding the state-of-the-art performance of previous models like SAM 2. Furthermore, we see notable gains on challenging benchmarks like zero-shot LVIS (not shown) and object counting (shown on CountBench).

    This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU. In video, the inference latency scales with the number of objects, sustaining near real-time performance for approximately five concurrent objects.

    We also show that a multimodal large language model (MLLM) that uses SAM 3 as a tool, called SAM 3 Agent, can segment more complex text queries such as, “What object in the picture is used for controlling and guiding a horse?” The MLLM proposes noun phrase queries to prompt SAM 3 and analyzes the returned masks, iterating until the masks are satisfactory. Without training on any referring expression segmentation or reasoning segmentation data, SAM 3 Agent surpasses prior work on challenging free-text segmentation benchmarks that require reasoning, such as ReasonSeg (shown above) and OmniLabel.

    Applications to Science

    SAM 3 is already being applied for use cases in scientific fields. For example, Meta collaborated with Conservation X Labs and Osa Conservation to combine on-the-ground wildlife monitoring with SAM 3 to build an open dataset of research-ready, raw video footage. The publicly available SA-FARI dataset includes over 10,000 camera trap videos of more than 100 species, annotated with bounding boxes and segmentation masks for every animal in each frame.

    FathomNet is a unique research collaboration led by MBARI that is working to advance AI tools for ocean exploration. Segmentation masks and a new instance segmentation benchmark tailored for underwater imagery are now available to the marine research community via the FathomNet Database.

    SA-FARI and FathomNet can be used by the broader AI community to develop innovative new ways to discover, monitor, and conserve wildlife on land and in the ocean.

    Future Areas of Exploration for the Open Source Community

    While SAM 3 demonstrates strong performance for segmenting objects in images and short videos with simple text phrases, the model performance can be further improved, especially in challenging scenarios.

    SAM 3 struggles to generalize to fine-grained out-of-domain concepts in a zero-shot manner, such as identifying specific terms that require domain knowledge like “platelet,” especially in niche visual domains involving medical or scientific imagery. We experimented with strategies to extend the capability of SAM 3 and found that the model quickly adapts to new concepts and visual domains when fine-tuned on small quantities of annotated data. As part of our code release, we’re sharing fine-tuning approaches that the community can leverage to adapt SAM 3 for their use cases. We’re also partnering with Roboflow to enable people to annotate data, fine-tune, and deploy SAM 3 for their particular needs.

    Additionally, while SAM 3 performs well with short open-vocabulary prompts, such as “a hardcover book,” the model doesn’t support longer, complex phrases like, “the second to last book from the right on the top shelf.” However, when paired with multimodal large language models, the model can be trained to support longer, more complex descriptions including cases that require reasoning.

    When applied to video, SAM 3 tracks every object with a SAM 2-style masklet, which means the cost of SAM 3 inference scales linearly with the number of objects being tracked. Each object is processed separately, utilizing only shared per-frame embeddings, without inter-object communication. Incorporating shared object-level contextual information could aid in improving efficiency and model performance in complex scenes with many visually similar objects.

    There’s plenty more work to be done to propel research in this field even further. We hope the AI community will join us by building with SAM 3, adopting the SA-Co benchmark, and leveraging these new resources to help push these capabilities further. Together, we can accelerate open science to build impactful new experiences and use cases that benefit people and society.

    Explore SAM 3 on the Segment Anything Playground

    We’re bringing all of this work together in the Segment Anything Playground, our new platform that enables anyone to try our latest models — no technical expertise needed. You can start from scratch by uploading an image or video, or jump right in with one of the available templates. These include practical options like pixelating faces, license plates, and screens, as well as fun video edits such as adding a spotlight effect, motion trails, or magnifying specific objects. The templates also help with annotating visual data and provide a way to stress test SAM 3. We’ve designed SAM Playground to be the simplest way to experiment with our models for media modification, and we can’t wait to see how people use it to enhance their creativity.

    SAM 3 also performs well on first-person footage captured by wearable devices like Meta’s Aria Gen 2 research glasses. This enables robust segmentation and tracking of objects from a first-person perspective, handling the dynamic challenges of wearable-captured scenes. Select recordings from the Aria Gen 2 Pilot Dataset are now featured on the Segment Anything Playground. This integration demonstrates SAM 3’s value for research and applications in areas like machine perception, contextual AI, and robotics, where understanding the world from the human perspective is crucial.

    Get Started With Segment Anything Model 3

    We want to continue empowering creators, developers, and researchers to experiment, build, and push the boundaries of what’s possible with Meta Segment Anything Model 3. Looking ahead, we’re optimistic about the transformative potential of SAM 3 to unlock new use cases and create positive impact across diverse fields. As always, we welcome continued iteration and feedback from the community to help us evolve and advance the field together.

    Original source
  • Mar 26, 2026
    • Date parsed from source:
      Mar 26, 2026
    • First seen by Releasebot:
      Apr 23, 2026
    Meta logo

    Meta AI by Meta

    Introducing TRIBE v2: A Predictive Foundation Model Trained to Understand How the Human Brain Processes Complex Stimuli

    Meta AI releases TRIBE v2, a next-gen brain-response model that predicts high-resolution neural activity from sights, sounds, and language with faster, more accurate, 70x higher-resolution results. It also shares the model, code, paper, and interactive demo for researchers.

    Takeaways

    • We're introducing TRIBE v2, our next-gen model that acts as a digital twin of human neural activity. It offers unprecedented speed and accuracy, and a 70x increase in resolution over similar models, in predicting how the brain responds to almost any sight or sound — enabling neuroscientists and clinical researchers to test theories without requiring human subjects.
    • We're releasing the model, codebase, paper, and an interactive demo to help researchers push the boundaries of neuroscience, apply brain insights to build better AI systems, and use computational simulation to accelerate breakthroughs in the treatment of neurological disorders.

    Understanding how the human brain processes the world around us is one of the greatest open challenges in neuroscience. Breakthroughs here could transform how we understand and treat neurological conditions affecting hundreds of millions of people — and improve AI systems by directly guiding their development from neuroscientific principles.

    Today, we're announcing TRIBE v2: our first AI model of human brain responses to sights, sounds, and language. Building on our Algonauts 2025 award-winning model, which was trained on the low-resolution fMRI recordings of four individuals, we leverage a massive dataset of more than 700 healthy volunteers who were presented with a wide variety of media, including images, podcasts, videos, and text. TRIBE v2 reliably predicts high-resolution fMRI brain activity — enabling zero-shot predictions for new subjects, languages, and tasks — and consistently outperforms standard modeling approaches. By creating a digital model of the human brain, researchers can rapidly test hypotheses about its underlying functions without the need for human subjects in every experiment.
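
    TRIBE v2's architecture isn't detailed here, but the underlying task (predicting voxelwise brain responses from stimulus features) can be illustrated with a classic linear encoding-model baseline. The snippet below uses random stand-in data and ridge regression; it is a simplified baseline for intuition only, not the TRIBE v2 model.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Toy stand-ins: per-timepoint stimulus features (e.g., audio/text/vision
    # embeddings) and the fMRI responses they should predict.
    rng = np.random.default_rng(0)
    stim_features = rng.normal(size=(600, 512))     # 600 timepoints x 512 features
    brain_responses = rng.normal(size=(600, 1000))  # 600 timepoints x 1000 voxels

    # Fit a ridge encoding model on the first part of the session...
    encoder = Ridge(alpha=10.0)
    encoder.fit(stim_features[:500], brain_responses[:500])

    # ...and score how well it predicts held-out responses, voxel by voxel.
    pred = encoder.predict(stim_features[500:])
    true = brain_responses[500:]
    corr = [np.corrcoef(pred[:, v], true[:, v])[0, 1] for v in range(true.shape[1])]
    print("median voxel correlation:", np.median(corr))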

    To accelerate the pace of neuroscience discovery and open up new avenues for clinical practice, we’re sharing a research paper, along with model weights and code, under a CC BY-NC license. We also invite everyone to explore TRIBE v2 on our demo website. By sharing this work, we hope to help accelerate neuroscience research that will unlock scientific and clinical breakthroughs for the greater good.

    Original source
  • Mar 10, 2026
    • Date parsed from source:
      Mar 10, 2026
    • First seen by Releasebot:
      Apr 23, 2026
    Meta logo

    Meta AI by Meta

    Mapping the World's Forests with Greater Precision: Introducing Canopy Height Maps v2

    Meta AI launches Canopy Height Maps v2, an open source model and world-scale forest maps that bring sharper, more consistent canopy measurements for researchers and governments. Built on DINOv3, it improves accuracy, detail, and bias control for global forest monitoring.

    Collaborations with the Public Sector in Europe, the United States, and Beyond

    Forests are essential to life on Earth — storing carbon, sheltering wildlife, and shaping our climate. To protect and restore them, we must see them as never before. Today, in partnership with the World Resources Institute, we’re announcing Canopy Height Maps v2 (CHMv2): an open source model and world-scale maps generated with it. Together, they will help researchers and governments measure and understand every tree, gap, and canopy edge — enabling smarter biodiversity support and land-management decisions.

    At the heart of CHMv2 is DINOv3, Meta’s self-supervised vision model, which brings unprecedented clarity and detail to forest mapping worldwide. But visibility isn’t enough — having accurate, high-resolution data on forest structure is essential for turning insights into action. Tree canopy height measurements are important for monitoring forest health, tracking restoration efforts, detecting degradation, and estimating carbon storage.

    Building on our original high-resolution canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency. This comes from replacing the DINOv2 backbone with our more capable DINOv3 backbone, pre-trained on SAT-493M, a large and diverse dataset of satellite imagery.

    “DINOv3 strengthens our ability to measure forest structure across diverse landscapes, making high-resolution restoration monitoring more consistent and more scalable,” says John Brandt, Data Science Lead at the World Resources Institute.

    DINOv3 learns robust visual features from large amounts of unlabeled imagery. By training on diverse satellite data, DINOv3 captures the subtle visual cues that indicate tree height, such as shadows, textures, and crown shapes — without requiring millions of manually labeled examples. This enables CHMv2 to deliver major gains in accuracy and detail over the previous version.

    Additionally, the model's R² — a way of measuring how closely predictions match real-world measurements — has soared from 0.53 to 0.86. The model now delivers sharper canopy maps and minimizes bias for tall trees, making its predictions more trustworthy for scientific and operational use.
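
    For readers unfamiliar with the metric, R² is one minus the ratio of residual variance to total variance. The snippet below is a generic illustration with made-up canopy heights, not CHMv2's evaluation code.

    import numpy as np

    def r_squared(y_true, y_pred):
        """Coefficient of determination: 1 - residual variance / total variance."""
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1.0 - ss_res / ss_tot

    # Made-up example: lidar-measured canopy heights (m) vs. model predictions.
    measured = np.array([4.0, 12.5, 30.0, 18.2, 7.5])
    predicted = np.array([5.1, 11.8, 27.6, 19.0, 8.2])
    print(r_squared(measured, predicted))  # 1.0 would be a perfect match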

    The training dataset for CHMv2 was also expanded and improved by adding more geographically diverse, high-quality lidar examples. To better align satellite imagery with real-world lidar measurements, we built automated matching tools and developed a specialized loss function to address the unique challenges of canopy height estimation. Together, these advances enable CHMv2 to set a new bar for global forest mapping.

    Our previous AI model and associated maps, CHMv1, are already supporting climate migration, restoration, and biodiversity efforts. In the United Kingdom, Forest Research — the research agency of the Forestry Commission — is using these to transform how they monitor and manage Great Britain’s forests. Their work demonstrates how these tools can support national-scale forest inventory and help track progress toward climate commitments.

    Beyond the United Kingdom, Canopy Height Maps are helping national and local governments across Europe advance their environmental goals. The European Commission’s Joint Research Centre used the first version of Canopy Height Maps in its Global Forest Cover map for 2020 research (ESSD paper, EU Forest Observatory) and hopes to use CHMv2 for future map versions and other tree monitoring efforts, including the 3 Billion Tree Initiative — a commitment to plant at least 3 billion biodiverse trees across the European Union by 2030.

    In the United States, these maps have also been leveraged in city planning tools being used for the implementation of Cities for Smart Surfaces, an initiative led by the Smart Surfaces Coalition and signed on by the mayors of 10 cities, including Atlanta, Baltimore, Boston, Columbia (South Carolina), Dallas, and New Orleans. Cities for Smart Surfaces is a multiyear project funded by Waverley Street Foundation and the MacArthur Foundation to cool cities and metropolitan areas with reflective (cool) roofs and pavements, green roofs, solar energy, porous pavements, rain gardens, and trees. Additionally, WRI Ross Center for Sustainable Cities is making use of these maps in Cool Cities Lab, a forthcoming globally relevant scenario planning tool — initially available for cities in 11 countries — that helps cities assess the temperature effects of urban cooling interventions.

    Looking Ahead

    CHMv2 represents a significant step forward, but challenges remain. We’re continuing to improve predictions in regions where data is sparse, address viewing-geometry effects, and extend temporal coverage to better support change detection over time.

    By making these advances available to the research community, we hope to accelerate progress in forest monitoring worldwide. Better maps enable better decisions — for conservation, climate action, and the countless communities that depend on healthy forests.

    Original source
  • Mar 10, 2026
    • Date parsed from source:
      Mar 10, 2026
    • First seen by Releasebot:
      Apr 23, 2026
    Meta logo

    Meta AI by Meta

    DINOv3

    Meta AI releases DINOv3 and CHMv2, bringing improved high-resolution canopy height maps, stronger dense vision features, and broad support across PyTorch Hub, Hugging Face Hub, Transformers, and timm, with code, weights, and usage guides now available.

    [2026-03-10] 🔥 The Canopy Height Maps v2 (CHMv2) model and inference code are now available

    More details on downloading the model weights and using the code are available here. The model weights are also available on the Hugging Face Hub and supported by the Hugging Face Transformers library. Building on our original high-resolution canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging DINOv3.

    [2025-11-20] Distillation code and configurations for ConvNeXt backbones are now released!

    [2025-10-13] Semantic segmentation (ADE20K) and monocular depth estimation (NYUv2-Depth) linear probing code are now released!

    [2025-09-17] DINOv3 backbones are now supported by the PyTorch Image Models / timm library starting with version 1.0.20.

    [2025-08-29] DINOv3 backbones are supported by released versions of the Hugging Face Transformers library starting with version 4.56.0.

    [2025-08-14] DINOv3 backbones are now available in Hugging Face Hub and supported by the development version of the Hugging Face Transformers library.

    DINOv3 🦖🦖🦖

    Meta AI Research, FAIR

    Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

    Reference PyTorch implementation and models for DINOv3. For details, see the DINOv3 paper.

    Overview

    High-resolution dense features.

    We visualize the cosine similarity maps obtained with DINOv3 output features between the patches marked with a red cross and all other patches.
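
    A rough sketch of how such a similarity map can be computed from dense patch features follows; the tensor shapes and the random stand-in features are illustrative assumptions rather than the repository's visualization code.

    import torch
    import torch.nn.functional as F

    # Assume patch_features holds DINOv3 patch embeddings for one image on an
    # H x W patch grid with D channels; a random tensor stands in for real output.
    H, W, D = 32, 32, 1024
    patch_features = torch.randn(H, W, D)

    # Pick the patch marked with the red cross and compare it to every other patch.
    query = patch_features[12, 20]                               # (D,)
    flat = patch_features.reshape(-1, D)                         # (H*W, D)
    similarity = F.cosine_similarity(flat, query.unsqueeze(0), dim=-1)
    similarity_map = similarity.reshape(H, W)                    # values in [-1, 1]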

    An extended family of versatile vision foundation models that produce high-quality dense features and achieve outstanding performance on a wide range of vision tasks, outperforming the specialized state of the art across a broad range of settings without fine-tuning.

    Pretrained models

    Please follow the link provided below to get access to all the model weights: once accepted, an e-mail will be sent with the complete list of URLs pointing to all the available model weights (both backbones and adapters). These URLs can then be used to either:

    • download the model or adapter weights to a local filesystem and point torch.hub.load() to these local weights via the weights or backbone_weights parameters, or
    • directly invoke torch.hub.load() to download and load a backbone or an adapter from its URL, again via the weights or backbone_weights parameters.

    See the example code snippets below.

    ⚠️ Please use wget instead of a web browser to download the weights.
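
    A minimal example of the second option might look like the following; the hub repository path and entrypoint name are assumptions based on the public DINOv3 repository, and the placeholder should be replaced with one of the weight URLs (or a locally downloaded copy).

    import torch

    # Assumed hub repo and backbone entrypoint; replace the placeholder with a
    # weight URL received by e-mail, or with a path to locally downloaded weights.
    backbone = torch.hub.load(
        "facebookresearch/dinov3",      # assumed hub repository
        "dinov3_vitl16",                # assumed backbone entrypoint
        weights="<URL-or-local-path>",  # per the instructions above
    )
    backbone.eval()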

    The release includes detailed instructions for installation, getting started with notebooks, data preparation, training setups, evaluation, and usage examples for various pretrained heads (image classification, depther, detector, segmentor, zero-shot tasks).

    Canopy Height Maps v2 (CHMv2)

    The CHMv2 model can be loaded via PyTorch Hub and Hugging Face Transformers. CHMv2 uses the DINOv3 ViT-L/16 satellite model as its backbone. The model weights can be requested and downloaded by following the instructions provided. Example code snippets demonstrate how to load and use the CHMv2 model for canopy height prediction.
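
    As a purely illustrative sketch of what loading via PyTorch Hub could look like (the entrypoint name below is hypothetical; follow the repository's CHMv2 instructions for the actual call):

    import torch

    # Hypothetical entrypoint and placeholder paths; see the repository's CHMv2
    # instructions for the real names and the backbone_weights requirement.
    chm = torch.hub.load(
        "facebookresearch/dinov3",                     # assumed hub repository
        "dinov3_vitl16_chm",                           # hypothetical CHMv2 entrypoint
        weights="<CHMv2-weights-URL-or-path>",
        backbone_weights="<satellite-backbone-weights-URL-or-path>",
    )
    chm.eval()
    with torch.inference_mode():
        heights = chm(torch.rand(1, 3, 512, 512))      # per-pixel canopy height map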

    License

    DINOv3 code and model weights are released under the DINOv3 License.

    Contributing

    See contributing guidelines and code of conduct.

    Citing DINOv3

    If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

    @misc{simeoni2025dinov3,
      title={{DINOv3}},
      author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
      year={2025},
      eprint={2508.10104},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.10104},
    }
    
    Original source

This is the end. You've seen all the release notes in this feed!
