Moondream Release Notes

Last updated: Apr 17, 2026

  • Apr 16, 2026
    • Date parsed from source:
      Apr 16, 2026
    • First seen by Releasebot:
      Apr 17, 2026
    Moondream logo

    Moondream

    Lens: Moondream's Finetune Service

    Moondream launches Lens, a fine-tuning product for production-ready vision AI. It helps users improve model accuracy with simple pay-as-you-go RL and supervised fine-tuning, then run the tuned model in Cloud or locally with Photon.

    Lens in Action: PTZOptics

    VLMs are a big leap for vision AI. They reason about vision at a higher level, and they're much easier to use than the previous generation of models. But they've been hard to put into production for three reasons. First, they're slow, and production systems often need real-time decisions. Second, they struggle to run locally, which production systems often require for security, reliability, or cost. Third, they suffer from a "last-mile" problem: the VLM looks promising in the lab, but in the real world, accuracy falls short.

    Moondream is a different kind of VLM. It's purpose-built for production systems, and unsurprisingly, we've been tackling these three problems. First, we built small models that are lightweight enough to run everywhere. Then we launched Photon, our inference engine that achieves 20ms inference time on an H100. And today, we're happy to announce the launch of Lens, our fine-tuning product that solves the "last-mile" problem.

    We've been working with a partner, PTZOptics, which makes network-attached, remote-controlled cameras. In many cases, customers want the camera to act as if a smart camera operator were controlling it: following the action of a soccer game, zooming in and out at crucial moments in a presentation, or detecting anomalies for security or operational reasons.

    With Moondream, this is now a reality. You can have the camera track complicated things ("the person in the red shirt"), take inventory of what's shown, or get alerted when actions occur ("someone's hurt"). And with Lens, you can teach Moondream new skills, or tune it when the accuracy is lacking.

    Simple API, Pay-as-You-Go

    Lens is a simple API that provides fine-tuning through both reinforcement learning and supervised fine-tuning. There's no hardware to set up or binaries to worry about. And it's simple pay-as-you-go. We've seen great results with as few as a dozen images.

    As soon as you're done improving the model, you can invoke it immediately through our Cloud, or run it locally with Photon. It's the easiest and fastest way to go from fine-tune to production.

    See the Difference

    Here are a few examples of Lens fine-tunes across very different domains. In each case, the base model struggled, and the fine-tuned model nailed it.

    Broadcast Sports: Detecting the Ball Handler

    We fine-tuned Moondream to detect the player with the ball in NBA broadcast footage. The base model returned dozens of false positives (red boxes). After fine-tuning with RL, it finds just the ball handler (green box). F1 jumped from 28% to 79%, and false positives dropped from 61 to 2.

    Before fine-tuning
    After fine-tuning

    Training took 54 minutes and cost $16.89.
    See the full interactive example →

    Geolocation: Country Identification from Street View

    We trained Moondream to identify countries from street-view imagery. The base model guesses the wrong continent entirely. After fine-tuning with just 25 images per country, it reads road markings, signage, and landscape cues correctly, beating GPT-5.4's 69.8% accuracy with 71.1%.

    Before fine-tuning
    After fine-tuning

    See the full interactive example →

    Medical Imaging: Glaucoma Staging

    We fine-tuned Moondream to classify retinal images by glaucoma severity. The base model defaulted to "early" for nearly every image. The fine-tuned model distinguishes severity correctly, performing 2x better than GPT-5.4.

    Before fine-tuning
    After fine-tuning

    Training took 47 minutes and cost $15.68.
    See the full interactive example →

    You can explore all of our fine-tune examples at moondream.ai/p/lens.

    Need Help Getting Started?

    For customers who are new to vision AI and fine-tuning, we offer help. Our dedicated production team can work with you to deliver a fine-tuned model for your specific use case. Contact [email protected] for more info.

    What We've Been Building Toward

    This is what we've been building toward. VLMs are finally in a form factor that fits the real world.

    Try Lens today, and if you're at NAB next week, come see the PTZOptics demo live. Questions? Drop us a line at [email protected].

    Original source
  • Mar 25, 2026
    • Date parsed from source:
      Mar 25, 2026
    • First seen by Releasebot:
      Mar 26, 2026

    Moondream

    Photon: Real-Time Vision AI Is Finally Here

    Moondream introduces Photon, a faster production inference engine for vision AI, delivering 2x speed over similar vLLM setups and over 60 inferences per second on H100s. It is built for real-time image and video analysis across edge to data center.

    The era of production vision AI isn't coming. It's here.

    Vision Language Models (VLMs) changed the game. Instead of building custom CV pipelines for every task, you can now just prompt a model about an image in plain language. That alone made vision AI easier and cheaper to adopt. But VLMs also unlocked something deeper: visual reasoning that simply wasn't possible before. Problems that were out of reach for traditional AI systems are now solvable, and almost anyone can afford to try.

    The result has been an explosion of new vision AI applications. Manufacturing defect detection. Broadcast video analysis. Retail inventory and loss prevention. What used to be research-grade problems are now powering a new wave of startups, and Moondream is at the center of many of them.

    But there's a gap between what VLMs can do and what they can do fast enough to matter.

    Most people's experience with a VLM looks like this: you ask it a question about an image, wait a few seconds (sometimes tens of seconds), and get an answer back. The answers are often impressive. The wait is often a dealbreaker. When you're processing live video, running a manufacturing line, or making real-time decisions, a few seconds of latency kills the use case entirely.

    We heard this over and over from customers. They wanted everything Moondream offers: the accuracy, the grounding, the ease of use. But they needed it faster than any VLM had delivered before.

    Photon is our answer.

    Why We Could Build This

    Photon isn't just fast inference code. The real advantage is that we own the entire stack. We design the model. We design the inference engine. We design the fine-tuning platform and the deployment tools.

    We made architectural decisions at model design time to optimize for the hardware we actually deploy on. We knew which GPU operations would matter on which chips, and shaped the model around that. You can't retrofit those decisions onto an existing model. Compared to similar-sized models on vLLM, Photon is 2x faster. You don't get that from better kernels alone.

    On an H100, Photon delivers over 60 inferences per second. That's frame-by-frame video processing on server-class hardware. On edge devices, including older ones limited by supply chain realities, we still deliver meaningful throughput.

    Here's what matters in practice: production vision AI systems rarely run just one inference per image. You're often analyzing the same frame in multiple ways. Photon gives you the headroom to do that.
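    To put that number in context, here's a quick back-of-the-envelope calculation (the 30 fps frame rate is our assumption, not a figure from the post):

```python
PHOTON_IPS = 60  # inferences per second on an H100, per this post
FPS = 30         # assumed frame rate of a typical live video feed

# Number of distinct analysis passes (e.g., one detection pass plus one
# captioning pass) Photon can run on every frame while keeping up with the feed.
analyses_per_frame = PHOTON_IPS // FPS
print(analyses_per_frame)  # => 2
```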

    What This Changes

    Live broadcasting with real-time moderation. Manufacturing lines running at full speed with frame-by-frame defect detection. Security systems that keep pace with camera feeds. These were theoretically possible before. Now they're operationally viable.

    Speed also affects cost. When you run inference faster on a GPU, each inference gets cheaper. Photon supports operation batching, which lets you trade slightly higher per-inference latency for much better total throughput. The result is that real-time image and video analysis can fit much tighter budgets than before.

    Moondream Is Now Production-Ready, End to End

    Moondream is becoming a complete stack for production vision AI. The model works well at grounding tasks across industries. Lens, our upcoming fine-tuning platform, makes it easy and cheap to improve accuracy on your specific use case. And now Photon gives you a best-in-class inference engine that runs everywhere, edge to data center, with full support.

    Getting started takes minutes, not days:

    pip install moondream
    
    import moondream as md
    from PIL import Image
    
    model = md.vl(api_key="YOUR_API_KEY", local=True)
    image = Image.open("photo.jpg")
    
    print(model.caption(image))
    # => {"caption": "A golden retriever sitting on a park bench, looking ..."}
    

    Moondream is free to download and run however you want. Photon is for teams that need faster, production-ready performance. See pricing for details and the documentation to get started.

    What's Next

    Lens, our fine-tuning product, is launching soon. More hardware support for Photon is on the way. As both products mature, they'll integrate more tightly so you can fine-tune on your data and deploy through Photon in a single step.

    We're going to stay focused on making Moondream the best production-ready VLM. Faster. Less memory. Lower cost. Running everywhere.

    Original source
  • Mar 10, 2026
    • Date parsed from source:
      Mar 10, 2026
    • First seen by Releasebot:
      Mar 11, 2026

    Moondream

    Moondream Segmenting Update: Better Masks, Better Benchmarks, 40% Faster

    Moondream announces a faster, higher-accuracy segmenting upgrade now live in Moondream Cloud. The new model boosts RefCOCO benchmark scores, produces native SVG masks, and improves referring-expression handling at lower latency. Local inference support will follow later in the week.

    We introduced segmenting as a Moondream skill in September 2025 with Moondream 3 Preview. It launched with state-of-the-art scores on segmenting benchmarks. Despite the launch of several segmenting vision models since then, Moondream remains top dog.

    Today, we're excited to announce that we've raised the bar even further with an improvement now live on Moondream Cloud. This new version produces better segmenting results, achieves better benchmark scores, and does it 40% faster than before.

    Examples

    • Car closest to the top of lombard street
    • castle-like building
    • Headlights
    • man wearing blue shirt and jeans, standing near the left railing of the bridge, looking down
    • Runner in the lead with the longest hair
    • Transamerica Pyramid
    • Waldo wearing the number 25317
    • White 911

    Moondream Segmenting Recap

    Put simply, what makes Moondream segmenting different is that it:

    • produces native SVG masks (vectors, not bitmasks)
    • achieves state-of-the-art scores on segmentation benchmarks
    • offers fast inference, even though raw speed isn't the only thing we optimize for
    • supports deep, native referring capabilities such as "the person touching the door"
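    To make the vector-vs-bitmask point concrete, here's a back-of-the-envelope comparison. The SVG path below is a made-up rectangle, not actual Moondream output; the bitmask size assumes one bit per pixel on a full-HD frame:

```python
# Hypothetical vector mask: a short path description, independent of image size.
svg_path = "M 10 10 L 120 10 L 120 80 L 10 80 Z"
vector_bytes = len(svg_path.encode("utf-8"))

# Bitmask over a 1920x1080 frame: one bit per pixel, no matter how
# simple the mask shape actually is.
bitmask_bytes = 1920 * 1080 // 8

print(vector_bytes, "bytes as an SVG path")
print(bitmask_bytes, "bytes as a full-frame bitmask")
```

A vector mask also scales losslessly, which is part of why the masks stay crisp at any resolution.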

    Benchmark Improvements

    This latest segmentation model delivers a significant leap in performance across all major referring expression benchmarks. On RefCOCO+ Val, which tests attribute-based reasoning without positional cues, we achieve 79.1 mIoU, a 4.4-point improvement over the previous state-of-the-art (which was also Moondream!). RefCOCOg Val, which evaluates complex natural language descriptions, sees similar gains at 80.7 mIoU. We also report 88.2 mIoU on RefCOCO-M, our high-fidelity benchmark with pixel-accurate masks, underscoring that these gains translate to real-world precision, not just benchmark optimization.

    | Metric: mIoU | RefCOCO Val | RefCOCO+ Val | RefCOCOg Val | RefCOCO-M |
    | Old | 81.8 | 74.7 | 76.4 | 86.9 |
    | New | 83.2 (+1.4) | 79.1 (+4.4) | 80.7 (+4.3) | 88.2 (+1.3) |

    How We Compare

    Most segmentation-capable VLMs are either accurate or fast, but not both. Large multimodal models with bolted-on segmentation decoders can handle complex queries, but they are slow and expensive to run at scale. Lightweight models are fast, but they choke on anything beyond simple noun phrases. Moondream closes this gap: state-of-the-art accuracy at speeds that make latency-sensitive and high-throughput applications practical.

    Moondream vs. SAM 3

    SAM 3 can segment generic concepts like "car" or "person", but it can't natively resolve referring expressions. For prompts like "the person touching the door" or "laundry on the floor," you need to pair it with a larger reasoning model that adds tens of seconds of latency and drives up cost. Moondream handles complex prompts natively and returns crisper, higher-quality SVG masks at a 5x lower price point.

    Conclusion

    This update is live now on Moondream Cloud. If you're already using segmentation, you get better quality and lower latency immediately. Later this week, we'll also be releasing the model for local inference, along with a technical whitepaper for those who want to go deeper. Learn more about Moondream's segmentation skill at /skills/segment.

    Original source
  • Dec 19, 2025
    • Date parsed from source:
      Dec 19, 2025
    • First seen by Releasebot:
      Dec 20, 2025

    Moondream

    We added Moondream 3 Preview support to Moondream Station

    Moondream Station launches on Mac with Moondream 3 Preview, delivering native MLX performance and quantized models for Apple Silicon. The one-click installer runs on Mac, Windows, and Linux and showcases snappy inference, with 35+ tokens per second on an M1 Max.

    Moondream Station

    Moondream Station is our free on-prem client for Moondream. A one-click (or one-command) installer makes it a snap to get Moondream running on your Mac, PC, or Linux box instantly. Today we're happy to announce Moondream 3 Preview support on Mac. Try it out for yourself (works on Mac, Windows, and Linux):

    pip install moondream-station

    Built for Apple Silicon

    To get the most out of Apple Silicon, we built Mac inference to be fully MLX native and added quantized Moondream 3 support. The result is snappy performance. You'll need a Mac with at least 16GB of memory. On an M1 Max with 64GB, we are seeing over 35 tokens per second. Here's a demo of how it works on that M1 Max:

    Moondream uses dedicated grounding tokens, so any x or y coordinate only requires one token. This means inferences for grounded skills like point or detect feel near instantaneous.

    Using the API

    Once Moondream Station is running, you can connect to it using our Python client:

    # pip install moondream
    import moondream as md
    from PIL import Image
    
    # Connect to Moondream Station
    model = md.vl(endpoint="http://localhost:2020/v1")
    
    # Load an image
    image = Image.open("path/to/image.jpg")
    
    # Ask a question
    answer = model.query(image, "What's in this image?")["answer"]
    print("Answer:", answer)
    

    What's next

    We are planning to make more improvements to Moondream Station over the next few weeks. If you have ideas or requests, reach out on Discord.

    Happy holidays from the Moondream team.

    Original source
  • Oct 17, 2025
    • Date parsed from source:
      Oct 17, 2025
    • First seen by Releasebot:
      Dec 6, 2025

    Moondream

    Announcing Moondream Cloud

    Moondream launches Moondream Cloud, a hosted vision AI platform built on Moondream 3 Preview. It emphasizes speed and cost, with pay‑as‑you‑go pricing, $5 in free credits, and enterprise options including on‑prem deployment and compliance.

    Fast, cheap, smart. Pick three.

    We're excited to launch Moondream Cloud, a hosted version of Moondream that makes it easy to build cutting-edge vision applications.

    When choosing vision AI tech, three things matter most: intelligence, speed, and cost. With our recent launch of Moondream 3 Preview, our model already delivers top-tier intelligence, reaching SOTA on visual reasoning and grounding tasks, outperforming top frontier models. Our Moondream Cloud release focuses on the other two: speed and cost.

    Pricing

    Moondream Cloud is pay-as-you-go. No subscriptions, no commitments, just load up credits and you're done. To help you start building right away, you get $5 in free monthly credits too (no credit card required!).

    Our pricing is token-based: Moondream 3 Preview costs $0.30 per million input tokens, and $2.50 per million output tokens. These token rates are similar to Gemini 2.5 Flash and GPT-5 Mini. But token pricing doesn't tell the full story. Moondream uses a custom SuperBPE tokenizer, which means we generate 21% fewer tokens for the same output text. We have dedicated grounding tokens that represent points with two tokens and object bounding boxes with three tokens, where competing models have to use tens of tokens. And we represent images of all resolutions with 729 tokens, leading to significant savings on prefill.

    We simulated a workload in which each of the three examples below is processed once a minute for 30 days. The monthly cost of each example is shown below.

    Comparisons

    We compared Moondream Cloud with Gemini Flash 2.5 and GPT-5 Mini. Both are vision-capable and similarly priced. (We skipped Claude Haiku 4.5 because its vision capabilities were significantly behind on the tasks we evaluated.)

    Example 1: Pointing

    Average runtime: Moondream 3 (Preview) 1.52 seconds, Gemini 2.5 Flash 3.02 seconds, GPT-5 Mini 27.58 seconds
    Input tokens: Moondream 737, Gemini 1,352, GPT-5 Mini 419
    Output tokens: Moondream 25, Gemini 241, GPT-5 Mini 1,372
    Monthly cost (1 RPM): Moondream $12, Gemini $35, GPT-5 Mini $123

    In this example, Moondream is cheaper because we use both fewer input and fewer output tokens. We require fewer tokens both because we encode the image efficiently (compared to Gemini 2.5 Flash), and because we don't need a complicated text prompt to get the model to output just the list of 2D points. On the outputs, Moondream benefits from having dedicated grounding tokens, requiring only two tokens per point. The result is that Moondream is significantly cheaper to run.

    Example 2: Object detection

    Average runtime: Moondream 4.56 seconds, Gemini 7.69 seconds, GPT-5 Mini 52.88 seconds
    Input tokens: Moondream 737, Gemini 1,839, GPT-5 Mini 1,849
    Output tokens: Moondream 103, Gemini 1,524, GPT-5 Mini 3,271
    Monthly cost (1 RPM): Moondream $21, Gemini $170, GPT-5 Mini $302

    Again, Moondream is more efficient because our grounding tokens mean we only emit three tokens per bounding box -- two tokens encoding the position of the center of the box, and one token encoding both the height and width. As before, you'll notice we're also significantly more accurate.

    Example 3: OCR

    Average runtime: Moondream 3.92 seconds, Gemini 3.44 seconds, GPT-5 Mini 18.47 seconds
    Input tokens: Moondream 743, Gemini 1,395, GPT-5 Mini 1,812
    Output tokens: Moondream 414, Gemini 533, GPT-5 Mini 528
    Monthly cost (1 RPM): Moondream $54, Gemini $75, GPT-5 Mini $65

    This one is more evenly matched, since we're emitting normal text output. But Moondream still wins on cost because of more efficient image encoding, and more efficient output tokenization (using our custom tokenizer).
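    The Moondream figures above follow directly from the published token rates. Here's a quick sanity check (the helper below is our own illustration, not part of the moondream client; the rates and token counts are the ones quoted in this post):

```python
def monthly_cost(input_tokens, output_tokens, rpm=1, days=30):
    """Monthly cost in dollars at $0.30/M input and $2.50/M output tokens."""
    requests = rpm * 60 * 24 * days            # 43,200 requests at 1 RPM
    cost_in = requests * input_tokens / 1e6 * 0.30
    cost_out = requests * output_tokens / 1e6 * 2.50
    return cost_in + cost_out

print(round(monthly_cost(737, 25)))    # pointing          => 12
print(round(monthly_cost(737, 103)))   # object detection  => 21
print(round(monthly_cost(743, 414)))   # OCR               => 54
```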

    Throughput and Data Privacy

    On the free tier, we allow up to two requests per second. When you hold $10 or more in paid credits, we increase that to 10 requests per second. We never train on your data, and no data is persisted after returning responses.

    We also offer enterprise plans with:

    • On-prem inference (run Moondream in your own infrastructure)
    • Compliance options (e.g. HIPAA)
    • Dedicated consulting and support
    • Volume-based pricing

    Reach out at [email protected] to discuss your needs.

    Conclusion

    Moondream exists for one reason: to power the next wave of vision AI agents. Our new 9B parameter mixture-of-experts Moondream 3 (Preview) model combines the speed of a 2B model with state-of-the-art visual reasoning and grounding, with no compromises. And now, with Moondream Cloud, using it is as simple as it gets. Fast, cheap, smart -- pick three.

    Go to the cloud console to grab an API key, then check out our documentation to get started!

    Original source
  • Sep 23, 2025
    • Date parsed from source:
      Sep 23, 2025
    • First seen by Releasebot:
      Dec 6, 2025

    Moondream

    Moondream Station 2: Simpler, more features, and Windows!

    Moondream Station 2 is live with a one‑command installer for PC, Mac, and Linux, faster starts, and full Windows support. Run Moondream locally with Moondream 3 Preview, a new architecture, PyPI install, and support for heavy workloads with smart port handling. Mac support for Moondream 3 Preview is still in progress.

    Moondream Station 2 is live

    Running Moondream on your own machine just got way smoother. Moondream Station 2 is a one-command installer for PC, Mac, and Linux that’s faster, simpler, and more capable than ever.

    Highlights:

    • Moondream 3 Preview support: run our latest model locally (Mac support coming soon).
    • New architecture: quicker starts, easier to operate.
    • Windows ready: full native support.
    • PyPI package: install with pip install moondream-station.
    • Built for heavy workloads: request queuing, multiple workers for big-GPU setups, and built-in metrics to track usage and history.
    • Smart port handling: defaults to port 2020, finds a free one if needed, or lets you set your own.
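    The port fallback described above can be sketched like this (a minimal standard-library illustration, not Station's actual implementation):

```python
import socket

def choose_port(preferred=2020):
    """Return the preferred port if it's free, otherwise an OS-assigned one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", preferred))
            return preferred          # default port 2020 was available
        except OSError:
            pass
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))      # port 0 asks the OS for any free port
        return s.getsockname()[1]

print(choose_port())
```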

    Whether you’re on Windows, Linux, or Mac, Station 2 puts Moondream’s visual reasoning power a single command away. Apologies to our Mac audience, we're working on quantizing the model and addressing a few technical hiccups. We'll release it asap.

    Get started →

    Original source
  • Sep 18, 2025
    • Date parsed from source:
      Sep 18, 2025
    • First seen by Releasebot:
      Dec 6, 2025

    Moondream

    Moondream 3 Preview: Frontier-level reasoning at a blazing speed

    Moondream 3 unveils a 9B MoE with 2B active params, delivering frontier visual reasoning at speed and low cost. It grows the context length to 32K and uses RL‑driven training for real-world tasks like object detection and OCR. Available on the Moondream playground and HuggingFace.

    Moondream 3 Preview

    We're excited to announce a preview release of Moondream 3. It uses a new architecture: a 9B MoE with 2B active params. Moondream now achieves frontier-level visual reasoning while retaining blazingly fast and efficient inference.

    Why A New Architecture

    The impact of AI today has largely been confined to the digital realm. We have agents that can code, produce digital art, and so on - but very few cases of AI operating in our physical world. No robots to clean our houses, act as receptionists, or inspect buildings. For Moondream 3, we focused on four key areas.

    • Visual reasoning: despite our focus on smaller models, we don't want that to come at the cost of capability. We want Moondream to be the most capable VLM at real-world tasks.

    • Trainable: Many vision tasks require specialization. It's not enough for VLMs to be as good as humans; even humans need training for complex tasks, like accurately interpreting an X-ray image or detecting struggling people in crowds. Moondream must be easily trainable.

    • Fast: Vision AI applications often need near-realtime performance. Sorting produce, or detecting missing herd animals from a drone, or recognizing security incidents - none of these tasks can be built without fast vision inference.

    • Inexpensive: Vision AI apps often deal with huge quantities of images, and cost can often be a blocker to adoption. Moondream must be cheap to run at scale.

    Moondream 3 achieves these goals with a 9B MoE model that activates only 2B parameters per token. This lets it match, and in some cases beat, frontier-level models while staying fast and inexpensive. We also improved its training dynamics, making Moondream 3 more efficient at learning, especially when using Reinforcement Learning (more on that in subsequent announcements). For more details on the architecture, head to the "Tech Notes" below. One final detail: we grew the context length from 2k to 32k, making Moondream much better at understanding and producing more complex queries and answers.

    Moondream 3 in action

    Here are some examples of Moondream 3.

    Object Detection

    Moondream 3 is astonishingly good at object detection. It goes beyond simple labels (e.g., "car") and can understand more complex queries. We show results from frontier models alongside for comparison. These models don't support grounding skills like object detection and pointing natively, so we used a templated query for them (see footer).

    • Example 1
      Prompt: "Runner with purple socks"

    • Example 2
      Prompt: "Quantity input"

    Pointing

    Moondream supports pointing as a native skill.

    • Example 3
      Prompt:
      "Bottle"

    • Example 4
      Prompt: "Best utensil for pasta"

    Structured output

    With a longer context length, Moondream 3 generates intelligent structured outputs with minimal prompting.

    • Example 5: Sled dogs
      Prompt
      "A JSON array with keys: dog_id, fur_color, harness_color."

    • Result
      [
      { "dog_id": 1, "fur_color": "light brown", "harness_color": "red" },
      { "dog_id": 2, "fur_color": "dark brown", "harness_color": "red" },
      { "dog_id": 3, "fur_color": "gray", "harness_color": "red" },
      { "dog_id": 4, "fur_color": "white", "harness_color": "red" },
      { "dog_id": 5, "fur_color": "dark brown", "harness_color": "green" },
      { "dog_id": 6, "fur_color": "light brown", "harness_color": "green" },
      { "dog_id": 7, "fur_color": "dark brown", "harness_color": "black" },
      { "dog_id": 8, "fur_color": "white", "harness_color": "black" }
      ]

    OCR

    Moondream 3 has drastically improved its OCR abilities. Our vision encoder can get tripped up on tiny fonts (working on it), but it's now useful in many real-world cases.

    • Example 6
      Prompt
      "Convert to markdown"

    • Result

      | Metal | Reaction | Electrode Potential (V) |
      | Gold | Au⁺ + e⁻ = Au | +1.692 |
      | Silver | Ag⁺ + e⁻ = Ag | +0.7996 |
      | Copper | Cu²⁺ + 2e⁻ = Cu | +0.342 |
      | Iron | Fe³⁺ + 3e⁻ = Fe | -0.037 |
      | Lead | Pb²⁺ + 2e⁻ = Pb | -0.126 |
      | Nickel | Ni²⁺ + 2e⁻ = Ni | -0.257 |
      | Cadmium | Cd²⁺ + 2e⁻ = Cd | -0.403 |
      | Iron | Fe²⁺ + 2e⁻ = Fe | -0.447 |
      | Zinc | Zn²⁺ + 2e⁻ = Zn | -0.762 |
      | Aluminum | Al³⁺ + 3e⁻ = Al | -1.662 |

    Benchmarks

    Here are some early benchmark results, shown alongside some top frontier models for comparison. In practice, the raw numbers undersell Moondream, since it produces answers in a fraction of the time these bigger models take. We'll publish more complete results later and include inference times to make this clearer.

    MD3 Preview Technical Notes

    Here are some details on our new model architecture. Moondream 3 is a fine-grained sparse mixture-of-experts model with 64 experts, of which 8 are activated for each token. We initialized it from Moondream 2 (a 2B dense model) using drop upcycling. We also extended the usable context length to 32K tokens, which is critical for few-shot prompting and agentic workflows with tool use. We don’t fully leverage this longer context in our post-training yet (part of why it's only a preview release). The full 32K context is available if you're interested in fine-tuning the model.

    (Figure: Long-context perplexity evaluation on GovReport dataset. Each point shows the average cross-entropy loss (nats per token) for a 128-token sliding window at that position, measured across 100 documents truncated to 32,768 tokens.)

    We do not use a separate context-length extension phase during training, instead opting to interleave long-context samples while pretraining with a default context length of 4096 tokens. Many context length extension methods like YaRN include an attention temperature scaling component. Inspired by this, we adjust the architecture to enable learned temperature scaling as a function of position, and find this helps with long context modeling.

    Like our last 2B release, this is a hybrid reasoning model that supports both reasoning and non-reasoning mode. Unlike other reasoning models, however, Moondream focuses on visual reasoning with grounding. Here’s an example of what that means:

    Each chunk of underlined text in the reasoning is grounded, meaning the model references a particular part of the image. In our playground, you can see what the model is focusing on by hovering over the text.

    The model starts with only a small set of visual-reasoning examples, and gradually learns to rely on them more during our reinforcement learning (RL) post-training phase. RL proved so effective that, as we refined our training approach, post-training ended up using more compute than the initial pre-training itself.

    The model was trained with load-balancing and router orthogonality losses to help similar tokens specialize together early on, then had load balancing disabled in post-training to avoid catastrophic forgetting from distribution shift. Finally, attention tweaks like learnable temperature and LSE suppression sharpened focus and cut noise, boosting accuracy and clarity.

    Conclusion

    This preview release comes with some caveats. We haven't optimized the inference code yet, so inferences are much slower than anticipated (we're working on it!). We're also still actively training this model, and we expect the capabilities and benchmarks scores to improve. We also plan to produce variants of this model (e.g., quantized versions and distilled smaller versions).

    The model is now available on the Moondream playground, and you can download it on HuggingFace (Moondream Station will be updated soon). Hit us up on our Discord if you have any questions.

    (1) Frontier models don't support object detection natively, so this prompt was used instead:
    Detect these objects in the image: [comma-separated list].

    Original source
  • Jun 23, 2025
    • Date parsed from source:
      Jun 23, 2025
    • First seen by Releasebot:
      Dec 6, 2025

    Moondream

    Moondream Update: Grounded Reasoning, Better Detection, Faster Generation

    Moondream’s latest release adds grounded visual reasoning for sharper interpretation, improved object detection, and 20–40% faster responses. It can think step by step, audit its reasoning, and switch between reasoning and normal modes for speed or accuracy, via playground, API, or HuggingFace.

    Grounded Reasoning

    Simple tasks, like reading a date off a receipt, need little thought. Harder ones, like finding the median value on a chart, demand real reasoning about where things are and how they relate.

    Moondream now supports grounded reasoning, allowing the model to spend some time to think about the problem and reason precisely about positions and spatial relationships within images before generating an answer. This unlocks performance gains for tasks that depend on accurate visual interpretation.

    Moondream can now pause, look around the picture, and think step-by-step before answering. We call this grounded reasoning because the model can reason, using both logic and visual facts about the image to produce more accurate answers.

    Take chart understanding, for example. Without reasoning, Moondream does its best by essentially guessing the answer in one shot. With reasoning on, it breaks the job into three small steps, then nails the answer.

    Moondream's reasoning is specifically designed for accurate visual reasoning. The model can choose to "ground" its reasoning with spatial positions in the image when needed to accurately solve a task. Consider counting objects in images, for example. When there are more than a couple of instances of an object in an image, the model chooses to explicitly point at them in the reasoning trace, similar to how a human might tackle the same problem.

    The recent Vision Language Models are Biased paper shows that many VLMs suffer from confirmation bias when counting, returning memorized knowledge instead of actually counting when they see familiar objects. As we deploy VLMs in high-stakes applications, it's critical that we are able to ensure and audit that the models are actually reasoning about the image instead of simply performing sophisticated pattern matching. Our approach to visual reasoning not only helps the model reason about images, but also provides a way for users to audit what it's doing and understand failure modes.

    Moondream supports both reasoning and normal queries with the same model, meaning you can trade off accuracy vs. speed depending on the complexity of the task you're trying to perform. You can enable reasoning by passing reasoning=True with the query skill. This reasoning mode is powerful but still experimental. For simpler tasks, the original mode may perform better, so we recommend trying both.
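    A minimal sketch of that toggle, assuming the query interface described above (the FakeModel stand-in lets the sketch run without model weights; a real call would go through our client or Hugging Face):

```python
# Hypothetical wrapper around the query skill's reasoning flag.
def ask(model, image, question, hard_task=False):
    """Use grounded reasoning only when the task warrants the extra latency."""
    return model.query(image, question, reasoning=hard_task)

class FakeModel:
    """Stand-in so the sketch runs without downloading weights."""
    def query(self, image, question, reasoning=False):
        return {"answer": "42", "reasoning": reasoning}

result = ask(FakeModel(), None, "What is the median value?", hard_task=True)
assert result["reasoning"] is True
```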

    How We Taught It to Think

    We've started using reinforcement learning (RL) to train Moondream. If you're not familiar with RL, here's a short description of how it works. Traditionally, models are trained by asking them questions where the correct answer is known (aka "Ground Truth"). If the model doesn't answer correctly, we apply a corrective change in the model weights to encourage it to answer better next time. This process is called "supervised learning".

    RL works a little differently. We start the same way, with a question where we know the correct answer. With RL, however, we ask Moondream to generate numerous answers using different temperatures, then we grade the answers on how good they are: not only whether the answer is correct, but whether it used correct reasoning. This is easier with tasks that have a single answer (e.g., "What's the sum of the numbers in the table?"). For more open-ended answers (e.g., "Caption this image"), we use another Moondream model to judge the answer.
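    A toy sketch of that grading loop (the sampler and grader below are illustrative stand-ins, not our training code):

```python
import random

def sample_answers(question, n=8):
    """Stand-in for sampling n completions at different temperatures."""
    return [random.choice(["7", "9", "the sum is 7"]) for _ in range(n)]

def grade(answer, ground_truth):
    """Reward correct answers; the real grader also scores the reasoning trace."""
    return 1.0 if ground_truth in answer else 0.0

def rl_step(question, ground_truth):
    """One RL step: sample, grade, and report the batch's average reward."""
    rewards = [grade(a, ground_truth) for a in sample_answers(question)]
    # A policy update would then push the model toward high-reward answers.
    return sum(rewards) / len(rewards)

avg = rl_step("What's the sum of the numbers in the table?", "7")
assert 0.0 <= avg <= 1.0
```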

    So far we've used RL to train Moondream on 55 tasks and the results are impressive. We plan to increase this to ~120 before the next update to the model.

    With smaller models such as Moondream, it's common practice to "bootstrap" the model with reasoning traces from a bigger model. We haven't taken this approach for two reasons: first, our context length is currently limited to 2048 tokens, and this will need to be increased before we can train on longer reasoning traces. Second, most open reasoning models are focused on mathematical and coding reasoning, and this is not as effective for visual reasoning.

    Sharper Object Detection

    Moondream's Object Detection skill just got a lot better with this release. Previously, Moondream had a tendency to clump together objects that were close to each other, and sometimes struggled with finer-grained object specifications (e.g. "blue bottle" instead of just "bottle"), compared to Moondream's pointing capability.

    This was largely due to the quality of the datasets we used. Object detection datasets generated by humans tend to be messy and imprecise, as drawing highly accurate bounding boxes is tedious. Annotators often take shortcuts, and sometimes draw a single box around multiple instances when they're close to each other in the image.

    We used RL to overcome this, and the results are impressive. We'll be sharing more details about this in a separate blog post, but for now, here's a sample of the results.

    Faster Text Generation

    Moondream now generates answers 20-40% faster than before. This is because we upgraded the model to use a "superword" tokenizer that encodes text more efficiently. This means Moondream needs to emit fewer tokens to generate the same answer, and we achieve this without any drop in accuracy.
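    A toy illustration of why a "superword" vocabulary emits fewer tokens for the same text (both vocabularies here are invented for the example):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

word_vocab = {"the ", "cat ", "sat ", "on ", "mat"}
superword_vocab = word_vocab | {"the cat ", "sat on the "}

text = "the cat sat on the mat"
assert len(greedy_tokenize(text, superword_vocab)) < len(greedy_tokenize(text, word_vocab))
```

    Fewer tokens per answer means fewer decode steps, which is where the speedup comes from.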

    Changing a tokenizer typically involves a costly step to retrain the entire model. We built a lightweight "tokenizer transfer hypernetwork" that enabled us to adapt smoothly to new tokenizers without retraining.

    Lastly, this "tokenizer transfer hypernetwork" also makes it easier to train multilingual variants in the future.

    UI Understanding

    Moondream's performance on ScreenSpot, a benchmark for UI understanding, jumped significantly from 60.3 to 80.4. This makes Moondream a great choice for UI-focused applications that require fast element localization.

    While the model cannot be used as a standalone computer use agent yet, it can work very effectively when treated as a tool to be used by a larger agentic model. This is the setup used by projects like Magnitude, where a bigger LLM writes test cases that leverage Moondream for UI understanding tasks. This separation of planning and execution models allows them to run tests more quickly and reliably than using alternatives like OpenAI or Anthropic's Computer Use APIs.

    Looking Ahead

    Grounded reasoning, smarter object detection, faster tokenizer, and better UI understanding represent a big step forward for Moondream. These fundamental advances also open the door to more improvements on the horizon. We look forward to pushing the model to achieve deeper reasoning capabilities and broader task coverage, and even more speed optimizations.

    There's more to this release than what we've covered here, so check it out yourself. Try our free online playground or our free cloud API, or run it locally using Moondream Station - also free. If you prefer really low-level stuff, you can use it directly via Hugging Face Transformers.

    Happy Moondreamin’!

    Original source
  • May 21, 2025
    • Date parsed from source:
      May 21, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Fewer bits, more dreams

    Moondream unveils 4-bit quantization to slash memory use and speed up inference. The quantized model hits 99.4% of full precision accuracy with a 42% memory drop and 34% faster performance on RTX 3090. Open source on Linux via Moondream Station; Mac coming soon; HuggingFace access.

    Introduction

    When we compare model sizes, we usually quote the number of parameters. But that doesn't tell the whole story. Depending on other factors, a model with fewer parameters may actually use more memory and run inference slower than a larger model. That's why we prefer to focus on actual memory size and inference speed.

    4-bit quantization

    Today we're excited to announce a new feature that makes Moondream run faster and use less memory: 4-bit quantization. In case you're not familiar, quantization is a technique that reduces the number of bits used to store a model's weights. For example, weights are usually stored as 16-bit float, which take 2 bytes each. A 4-bit weight only takes 0.5 bytes.
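    Back-of-the-envelope math for a model with roughly 1.9B parameters (weights only; real peak memory also includes activations and other buffers, which is why measured numbers are higher):

```python
PARAMS = 1.9e9  # approximate parameter count

fp16_gb = PARAMS * 2 / 1e9    # 16-bit floats: 2 bytes per weight
int4_gb = PARAMS * 0.5 / 1e9  # 4-bit weights: half a byte each

print(f"weights: {fp16_gb:.1f} GB -> {int4_gb:.2f} GB "
      f"({1 - int4_gb / fp16_gb:.0%} smaller)")
```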

    The challenge with quantization is that it can lead to a loss of model accuracy. We've been working on this for a while, and we're excited to share that our 4-bit quantized model reaches 99.4% of the accuracy of the full precision model. In practice you'd probably never notice the difference.

    Performance and Availability

    Meanwhile, you probably would notice the speedup and memory improvement. Peak memory usage is reduced by 42% (from 4.2GB to 2.4GB), and inference speed is increased by 34% (on an RTX 3090), although the speedup may vary by machine. On the accuracy front, we measure the average score on 8 popular vision benchmarks. The 4-bit quantized model achieved an average score of 74.5 vs 74.9 for the full precision model.

    So let's update our chart from our 2025-04-14 model release.

    Both the full precision model and the 4-bit quantized model are available as open source. The 4-bit model is currently available for Linux users in Moondream Station (our one-click solution), with Mac support coming soon. Advanced users can also access the model directly via Hugging Face Transformers.

    Happy Moondreamin'.

    Original source
  • May 1, 2025
    • Date parsed from source:
      May 1, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream Station Launch

    Moondream Station launches as a one-click way to run Moondream locally, handling download, setup, and updates. It lets you run Moondream from the CLI or via a local port 2020, with Mac live and Windows/Linux coming soon.

    Moondream Station

    tl;dr: Moondream Station is a one-click way to run Moondream locally. It's free and fast, with no technical headaches. Download here.

    After we launched Moondream last year, we hosted a hackathon in Seattle. It was clear right away that Vision Language Models (VLMs) are changing the game. Just by prompting, people were unlocking new ways to work with images.

    What we learned that day stuck with us: even though Moondream can run on a laptop, developers would trip up on the installation and setup. They wanted something that just works. So we launched a cloud version, free.

    However, we also kept seeing people wanting to run it locally, and still stubbing their toes on the download and setup. So we finally rolled up our sleeves and did something about it. Today we're launching Moondream Station – a one-click solution for running Moondream locally.

    Moondream Station manages all of the tedious parts for you. It handles the download, setup, and updates needed to run Moondream on your desktop. Once it's running, you can invoke Moondream either through the command-line client, or write code that calls it through local port 2020 (hah!). Here's a video showing it in action:
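    Here's a rough sketch of calling it from code; the endpoint path and payload shape are assumptions for illustration, so check the Moondream Station docs for the exact API:

```python
import json

def build_query(question, image_b64, base_url="http://localhost:2020"):
    """Assemble a (hypothetical) query request for the local server."""
    url = f"{base_url}/v1/query"  # assumed endpoint path
    body = json.dumps({"image_url": image_b64, "question": question}).encode()
    return url, body

url, body = build_query("What's in this image?", "data:image/jpeg;base64,...")
# To send it: urllib.request.urlopen(
#     urllib.request.Request(url, body, {"Content-Type": "application/json"}))
```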

    It's Mac only for now, with Windows and Linux coming soon. We have a bunch of feature updates planned. Drop by our Discord if you have questions or run into any problems.

    Original source
  • Apr 14, 2025
    • Date parsed from source:
      Apr 14, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-04-14

    Moondream 2025-04-14 introduces a more efficient VLM with improved document and chart understanding, better OCR, UI comprehension, and stronger counting benchmarks, all while reducing compute. The note also outlines training tricks and the roadmap for future upgrades.

    Moondream Release Notes

    Moondream is not just one of the tiniest VLMs. It's the world's most efficient VLM. It produces highly accurate answers with the least computing possible. On a graph with intelligence on the Y axis and resource usage on the X axis, Moondream aims to be top left.
    AI is still early in its development, and just as we've seen mainstream computing evolve from mainframes to desktops, mobile, and even smaller devices, so will AI. This is especially true for Vision AI. Almost every physical device is improved if it can reason about its surroundings. Today this often means streaming video or images back to the cloud, which is slow, costly, and privacy-problematic. But with improving hardware and models, this is about to change, and today is another solid step forward.
    But efficiency isn't just for the edge, it's for the cloud too. Analyzing vision at scale can be costly. Customers turn to Moondream when scale becomes a concern. Analyzing millions of images, or thousands of hours of video with Moondream is more cost effective than with any other VLM.
    That's why we're excited to announce Moondream 2025-04-14. As usual, we've improved on all of the benchmarks we focus on, with some notably large improvements. Let's first look at how it stacks up vs our previous release just a few weeks ago:
    [Comparison chart with previous Moondream version]
    Now let's see where that puts us vs other top open source small VLMs.
    [Comparison chart with other small VLMs]

    Tech Notes

    This release of Moondream was trained on about 450B tokens. For contrast, models like Gemma 3 4B have been trained on 4 trillion tokens, and Qwen 2.5 VL utilized 18 trillion tokens for text modeling, plus an additional 4 trillion tokens for their VLM. Our efficiency in producing a high performance model with a fraction of the training budget is the result of:

    • High-Quality Data: We've observed that small models are especially sensitive to noisy data. We produce training data that contains both rigorously filtered real-world data and carefully crafted synthetic data, designed to minimize domain gaps.
    • Focused Scope: Moondream is specifically designed for developers creating computer vision applications. We prioritize relevant capabilities over broader use-cases like multi-turn conversations or haiku writing.
    • Training Techniques: We've developed a set of training methods that maximize training efficiency. We keep most of them proprietary, but here are two we're disclosing today:
      • We use a custom second-order optimizer, crucial for balancing conflicting gradients, such as object detection versus text generation tasks.
      • We use a self-supervised auxiliary image loss that significantly accelerates model convergence.
        Our focus this release was to improve on our previous one by targeting document understanding, charts, and user interfaces. Moondream has become quite proficient at reading documents. Here are examples of document and layout understanding:
        [Document understanding example]
        [Layout understanding example]
        This improvement in document and text reading has also yielded sizeable bumps in our text-related benchmarks:
    • ChartQA: Improved from 74.8 to 77.5 (82.2 with PoT)
    • DocVQA: Improved from 76.5 to 79.3
    • TextVQA: Improved from 74.6 to 76.3

    Performance

    Counting

    This release of Moondream has serious counting chops (e.g., "how many birds are in this image?"). To see how good it's gotten, here's a chart comparing ourselves to all the big names in VLMs.
    [CountBenchQA performance chart]

    Chart Understanding

    Chart understanding has been a key focus for this release. Charts require models to ground text and numbers in a visual layout, then reason over them precisely. On ChartQA, Moondream improves from 74.8 in our last release to 77.5, and 82.2 with Program of Thought (PoT) prompting.
    PoT is a prompting strategy where the model generates and executes code to solve problems step-by-step. This is especially valuable in chart QA, where reasoning failures often stem not from misreading the chart, but from making small but critical logical errors, like summing three correct numbers incorrectly. Rather than expecting the model to always reason flawlessly in natural language, we let it write and run code.
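    Here's a toy illustration of the idea; the chart values are invented for the example:

```python
# Values the model has (correctly) read off a hypothetical bar chart.
bars = {"2021": 41.5, "2022": 47.2, "2023": 52.8}

# A PoT-style trace: the model emits code for the arithmetic, so a slip
# like mis-summing three correctly-read numbers can't happen.
program = "total = sum(bars.values())"
scope = {"bars": bars}
exec(program, scope)

assert round(scope["total"], 1) == 141.5
```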
    [Chart understanding example]

    Here's a few more notes from this update:

    1. To access the OCR capability for docs and tables, use the prompt "Transcribe the text" or "Transcribe the text in natural reading order".
    2. Object detection supports document layout detection (figure, formula, text, etc).
    3. UI understanding has improved, with our ScreenSpot score up from 53.3 to 60.3.

    Conclusion

    As excited as we are for this launch, we have a lot more coming up for next releases too. Here's a few areas we're focused on:

    1. Repetition Handling: We've seen an increase in repetitions in our inferences, especially when generating long document answers. We've added temperature and nucleus sampling to reduce output repetition, with a repetition penalty setting coming soon. Training adjustments will further mitigate this issue in future releases.
    2. Tokenizer Upgrade: Our current tokenizer, derived from the three-year-old CodeGen model, hinders optimal training and inference performance. We plan to adopt either a traditional BPE tokenizer (ensuring broad ecosystem compatibility) or a BPE variant (optimized for efficiency).
    3. Bounding Box Accuracy: Currently, the model occasionally generates bounding boxes encompassing multiple items. We have identified the root cause and a solution is forthcoming. Meanwhile, prefixing object detection queries with "(coco)" can help mitigate this issue.
    4. Continued Training: As performance continues to steadily improve, we anticipate training for an additional 200 billion tokens before the next release.
      We invite you to go check it out for yourself in our playground, or start coding today.
      You can run it locally using our new Moondream Server (Mac and Linux for now, Windows coming…), or in our cloud (with a generous free tier).
      Happy Moondreamin'.
    Original source
  • Mar 28, 2025
    • Date parsed from source:
      Mar 28, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-03-27 Release

    Moondream unveils a new release with Long captions, image tagging, and a faster transformer client, plus near-state-of-the-art object detection gains. Built on real user feedback, this update is ready to try today in the playground or via download.

    We're excited to announce a new Moondream release. There's a lot to unpack, but we'll give you the highlights here, and share more over the coming weeks. The improvements in this release were driven from real-world usage and feedback from our community and customers. We want to extend a huge thank you to everyone who contributed to that. Keep it coming, let's gooooo!

    Longer

    The ability to caption images is one of the top Moondream use cases. Super accurate captions from a fast, efficient, and easy to run model seems to be a winning combo! Use cases range from synthetic data generation to real-world understanding and robotics.
    Until now, Moondream offered a choice of "Short" or "Normal" length captions. This model introduces "Long" format. From our testing, this generates roughly 2x longer captions than "Normal".

    Better

    We're still doing exhaustive evals, but so far we've seen major improvements on object detection (COCO mAP), OCR (OCRBench), and counting (CountBenchQA):
    We had some customers request improvements in our object detection capability. We were excited to work on that, and we're especially pleased with the results. This new COCO mAP score now makes Moondream near state of the art on object detection.
    We also had a customer with a specific need: the ability to tag all the things visible in an image. While we don't have a public benchmark available to highlight, our internal benchmark and vibe checks show a huge improvement in this ability.
    We call it "Image Tagging," and you can try it out by using this prompt in your image query:
    "List all visible objects, features, and characteristics of this image. Return the result as a JSON array."
    Here's an example of how it works:


    Faster

    We have a client update planned in the next few weeks that includes mobile, but in the meantime, we snuck in one key improvement to our transformers-based client. It just got a lot faster.
    From our testing, calling compile() on the model sped up inference from 61.4 tok/s to 123.4 tok/s on an Nvidia 3090.
    That not only makes it cheaper to run, it also opens up more possibilities for near-realtime processing, especially for video streaming.

    Stronger

    Moondream's improvements are driven by the feedback and engagement of its growing community and customers.
    This is creating a flywheel effect, where the feedback and requests from the community drive us to make more improvements to the model, and these improvements drive more adoption from an ever-growing community.
    In other words, you're part of the reason Moondream keeps growing. We extend our tip of the hat to you, our Moondreamers.

    Conclusion

    We'll be sharing more details over the next few weeks, but the great news is that you don't have to wait.
    You can download the model now, or go kick its tires in our playground, or even better yet, build something with our free cloud offering.

    Original source
  • Jan 9, 2025
    • Date parsed from source:
      Jan 9, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection

    Moondream 1.9B arrives with a new Gaze Detection capability, stronger OCR, and easier structured output across JSON, XML, Markdown, and CSV. It benchmarks against top small vision models and invites developers to try in the playground or download.

    Today, we’re announcing a new release of Moondream 1.9B. It has improvements across a bunch of areas and includes a new capability, Gaze Detection. This release marks the first time we’ve focused on industry benchmarks, and we’re excited to share some results. Despite these upgrades, the model is still just 1.9B, so it's fast and can run everywhere. Try it out in our playground or download it now.

    1. Structured Output

    Building with Moondream is easier than ever with our new support for structured output formats such as JSON, XML, Markdown, and CSV. Here are some examples:

    Example 1: JSON structured output

    Example 2: XML structured output

    Example 3: Markdown structured output

    2. New Capability: Gaze Detection

    Traditional Vision AI consists of specialized models built for different tasks like “object detection” (outline a specified object's region in an image) or “captioning” (create a caption for an image). Moondream supports several of these common Vision AI tasks as “capabilities,” all within a single model. Moondream already supports object detection and captioning, as well as “visual querying” (ask any question to a photo) and “pointing” (get the x,y coordinates of elements within a photo).
    Today, we are excited to launch a new capability: Gaze Detection.
    This capability tracks human attention. Note that this capability is experimental. We’re releasing it to get feedback from developers so we can improve it over time.

    Example 1: Driver Gaze Detection

    Example 2: Sport Gaze Detection

    3. Benchmarks

    We’ve always been a bit iffy about benchmarks. Some focus on problems we don’t think are relevant to Moondream (e.g., solving math equations). Others include weird questions and wrong answers (at least to us — see the Weird Benchmarks appendix below). And focusing too much on benchmarks can lead to weird behaviors, with allegations that some models "cheat" by training on the actual benchmarks themselves.
    Despite this, we decided to improve our scores because we don’t want anyone sleeping on Moondream because of low results. We benchmarked ourselves along with the top small vision language models.
    You can find our individual benchmark results below:

    4. Better OCR

    We made changes to Moondream's vision layer that have significantly improved text reading/OCR. We've also trained it on a lot more document querying and understanding. Here are some examples:

    Example 1: OCR Example

    Example 2: Chart OCR Example

    Looking Ahead

    As pumped as we are about this release, the best part, for us, is seeing what you build with it. VLMs are making it faster, cheaper, and easier than ever to build next-generation vision-enabled apps. Getting set up takes minutes, or you can try out Moondream in our playground. We offer Cloud inference with a generous free tier, or you can download it and run it yourself. Check out our docs for a getting started guide and lots of sample code.

    Happy Moondreaming!

    Appendix 1: Weird Benchmark Questions

    Here’s a few examples of weird benchmark questions...

    Example 1: Confusing Benchmark Question
    In GQA, the following image has a question that asks “Is the traffic signal on the right side or the left?” If you look closely, you can see there are traffic lights on both sides of the street. However, GQA expects the answer to be “Left.”

    Example 2: Nonsensical Benchmark Question
    In the following image, GQA asks “What animal sits in the bench that is on the right side?" It expects the answer to be “bird” 🤯.

    Original source
