- Oct 17, 2025
- Parsed from source: Oct 17, 2025
- Detected by Releasebot: Dec 6, 2025
Announcing Moondream Cloud
Moondream launches Moondream Cloud, a hosted vision AI platform built on Moondream 3 Preview. It emphasizes speed and cost, with pay‑as‑you‑go pricing, $5 free credits, and enterprise options including on‑prem and compliance. A clear release highlighting a new product rollout.
Fast, cheap, smart. Pick three.
We're excited to launch Moondream Cloud, a hosted version of Moondream that makes it easy to build cutting-edge vision applications.
When choosing vision AI tech, three things matter most: intelligence, speed, and cost. With our recent launch of Moondream 3 Preview, our model already delivers top-tier intelligence, reaching SOTA on visual reasoning and grounding tasks, outperforming top frontier models. Our Moondream Cloud release focuses on the other two: speed and cost.
Pricing
Moondream Cloud is pay-as-you-go. No subscriptions, no commitments, just load up credits and you're done. To help you start building right away, you get $5 in free monthly credits too (no credit card required!).
Our pricing is token based: Moondream 3 Preview costs $0.30 per million input tokens and $2.50 per million output tokens. These token rates are similar to Gemini 2.5 Flash and GPT-5 Mini. But token pricing doesn't tell the full story. Moondream uses a custom SuperBPE tokenizer, which means we generate 21% fewer tokens for the same output text. We have dedicated grounding tokens that represent points with two tokens and object bounding boxes with three tokens, where competing models have to use tens of tokens. And we represent images of all resolutions with 729 tokens, leading to significant savings on prefill.
We simulated a workload where each of the three examples below is processed once a minute for 30 days. Running all three at that rate would cost:
Comparisons
We compared Moondream Cloud with Gemini 2.5 Flash and GPT-5 Mini. Both are vision-capable and similarly priced. (We skipped Claude Haiku 4.5 because its vision capabilities were significantly behind on the tasks we evaluated.)
Example 1: Pointing
- Average runtime: Moondream 3 (Preview) 1.52 seconds, Gemini 2.5 Flash 3.02 seconds, GPT-5 Mini 27.58 seconds
- Input tokens: Moondream 737, Gemini 1,352, GPT-5 Mini 419
- Output tokens: Moondream 25, Gemini 241, GPT-5 Mini 1,372
- Monthly cost (1 RPM): Moondream $12, Gemini $35, GPT-5 Mini $123

In this example, Moondream is cheaper because we use both fewer input and fewer output tokens. We require fewer input tokens both because we encode the image efficiently (compared to Gemini 2.5 Flash), and because we don't need a complicated text prompt to get the model to output just the list of 2D points. On the outputs, Moondream benefits from having dedicated grounding tokens, requiring only two tokens per point. The result is that Moondream is significantly cheaper to run.
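If you want to sanity-check these figures, the arithmetic is straightforward. Here's a minimal sketch using Moondream's published rates and Example 1's token counts (the 1 RPM, 30-day workload described above):

```python
# Minimal sketch of the monthly-cost arithmetic (1 request per minute for 30 days).
# Rates are Moondream 3 Preview's published prices; token counts are from Example 1.
INPUT_RATE_PER_M = 0.30   # USD per million input tokens
OUTPUT_RATE_PER_M = 2.50  # USD per million output tokens

def monthly_cost(input_tokens, output_tokens, requests_per_minute=1.0):
    requests = requests_per_minute * 60 * 24 * 30  # requests in a 30-day month
    input_cost = requests * input_tokens / 1e6 * INPUT_RATE_PER_M
    output_cost = requests * output_tokens / 1e6 * OUTPUT_RATE_PER_M
    return input_cost + output_cost

# Example 1 (pointing): 737 input tokens and 25 output tokens per request
print(f"${monthly_cost(737, 25):.2f}/month")  # ~$12
```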
Example 2: Object detection
- Average runtime: Moondream 4.56 seconds, Gemini 7.69 seconds, GPT-5 Mini 52.88 seconds
- Input tokens: Moondream 737, Gemini 1,839, GPT-5 Mini 1,849
- Output tokens: Moondream 103, Gemini 1,524, GPT-5 Mini 3,271
- Monthly cost (1 RPM): Moondream $21, Gemini $170, GPT-5 Mini $302

Again, Moondream is more efficient because our grounding tokens mean we only emit three tokens per bounding box: two tokens encoding the position of the middle of the box, and one token encoding both the height and width. As before, you'll notice we're also significantly more accurate.
Example 3: OCR
- Average runtime: Moondream 3.92 seconds, Gemini 3.44 seconds, GPT-5 Mini 18.47 seconds
- Input tokens: Moondream 743, Gemini 1,395, GPT-5 Mini 1,812
- Output tokens: Moondream 414, Gemini 533, GPT-5 Mini 528
- Monthly cost (1 RPM): Moondream $54, Gemini $75, GPT-5 Mini $65

This one is more evenly matched, since we're emitting normal text output. But Moondream still wins on cost because of more efficient image encoding and more efficient output tokenization (using our custom tokenizer).
Throughput and Data Privacy
On the free tier, we allow up to two requests per second. When you hold $10 or more in paid credits, we increase that to 10 requests per second. We never train on your data, and no data is persisted after returning responses.
We also offer enterprise plans with:
- On-prem inference (run Moondream in your own infrastructure)
- Compliance options (e.g. HIPAA)
- Dedicated consulting and support
- Volume-based pricing
Reach out at [email protected] to discuss your needs.
Conclusion
Moondream exists for one reason: to power the next wave of vision AI agents. Our new 9B parameter mixture-of-experts Moondream 3 (Preview) model combines the speed of a 2B model with state-of-the-art visual reasoning and grounding, with no compromises. And now, with Moondream Cloud, using it is as simple as it gets. Fast, cheap, smart -- pick three.
Go to the cloud console to grab an API key, then check out our documentation to get started!
- Sep 23, 2025
- Parsed from source: Sep 23, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream Station 2: Simpler, more features, and Windows!
Moondream Station 2 is live with a one-command installer for PC, Mac, and Linux, faster starts, and full Windows support. Run Moondream locally with Moondream 3 Preview, a new architecture, PyPI install, and support for heavy workloads with smart port handling. Mac support for Moondream 3 Preview is a work in progress.
Moondream Station 2 is live
Running Moondream on your own machine just got way smoother. Moondream Station 2 is a one-command installer for PC, Mac, and Linux that’s faster, simpler, and more capable than ever.
Highlights:
- Moondream 3 Preview support: run our latest model locally (Mac support coming soon).
- New architecture: quicker starts, easier to operate.
- Windows ready: full native support.
- PyPI package: install with pip install moondream-station.
- Built for heavy workloads: request queuing, multiple workers for big-GPU setups, and built-in metrics to track usage and history.
- Smart port handling: defaults to port 2020, finds a free one if needed, or lets you set your own.
Whether you're on Windows, Linux, or Mac, Station 2 puts Moondream's visual reasoning power a single command away. Apologies to our Mac audience: we're working on quantizing the model and addressing a few technical hiccups, and we'll release it as soon as we can.
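As a quick illustration of what "a single command away" means in code, here's a rough sketch of pointing the Moondream Python client at a locally running Station on its default port. The exact client call shape and endpoint path may differ by version, so check the Station docs for the canonical snippet.

```python
# Rough sketch: point the Moondream Python client at a locally running Station
# on its default port (2020). Endpoint path, client version, and return shapes
# may differ; see the Station docs for the canonical snippet.
import moondream as md
from PIL import Image

model = md.vl(endpoint="http://localhost:2020/v1")  # assumes the default Station port
image = Image.open("warehouse.jpg")                 # any local test image (placeholder name)

print(model.caption(image, length="short"))
print(model.point(image, "forklift"))               # grounding skills run locally too
```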
Get started →
- Sep 18, 2025
- Parsed from source: Sep 18, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream 3 Preview: Frontier-level reasoning at a blazing speed
Moondream 3 unveils a 9B MoE with 2B active params, delivering frontier visual reasoning at speed and low cost. It boosts context length to 32K and uses RL-driven training for real-world tasks like object detection and OCR. Available on the Moondream playground and HuggingFace.
Moondream 3 Preview
We're excited to announce a preview release of Moondream 3. It uses a new 9B mixture-of-experts (MoE) architecture with 2B active parameters. Moondream now achieves frontier-level visual reasoning while still retaining blazingly fast and efficient inference.
Why A New Architecture
The impact of AI today has largely been confined to the digital realm. We have agents that can code, produce digital art, and so on, but very few cases of AI operating in our physical world: no robots to clean our houses, act as receptionists, or inspect buildings. For Moondream 3, we focused on four key areas.
Visual reasoning: despite our focus on smaller models, we don't want that to come at the cost of capability. We want Moondream to be the most capable VLM at real-world tasks.
Trainable: Many vision tasks require specialization. It's not enough for VLMs to be as good as humans; even humans need training for complex tasks, like accurately interpreting an X-ray image or detecting struggling people in crowds. Moondream must be easily trainable.
Fast: Vision AI applications often need near-realtime performance. Sorting produce, detecting missing herd animals from a drone, recognizing security incidents - none of these can be built without fast vision inference.
Inexpensive: Vision AI apps often deal with huge quantities of images, and cost can often be a blocker to adoption. Moondream must be cheap to run at scale.
Moondream 3 achieves these goals by adopting a 9B MoE architecture with only 2B active parameters. This lets it match, and in some cases beat, frontier-level models while staying fast and inexpensive. We also improved its training dynamics, making Moondream 3 more efficient at learning, especially when using Reinforcement Learning (more on that in subsequent announcements). For more details on the architecture, head to the "Tech Notes" below. One final detail: we grew the context length from 2k to 32k, making Moondream much better at understanding and producing more complex queries and answers.
Moondream 3 in action
Here are some examples of Moondream 3.
Object Detection
Moondream 3 is astonishingly good at object detection. It goes beyond simple labels (e.g., "car") and can understand more complex queries. We show results compared to frontier models alongside. These models don't natively support grounding skills like object detection and pointing, so we used a templated query for them (see footer).
Example 1
Prompt: "Runner with purple socks"Example 2
Prompt: "Quantity input"
Pointing
Moondream supports pointing as a native skill.
Example 3
Prompt: "Bottle"
Example 4
Prompt: "Best utensil for pasta"
Structured output
With a longer context length, Moondream 3 generates intelligent structured outputs with minimal prompting.
Example 5: Sled dogs
Prompt: "A JSON array with keys: dog_id, fur_color, harness_color."
Result:
[
{ "dog_id": 1, "fur_color": "light brown", "harness_color": "red" },
{ "dog_id": 2, "fur_color": "dark brown", "harness_color": "red" },
{ "dog_id": 3, "fur_color": "gray", "harness_color": "red" },
{ "dog_id": 4, "fur_color": "white", "harness_color": "red" },
{ "dog_id": 5, "fur_color": "dark brown", "harness_color": "green" },
{ "dog_id": 6, "fur_color": "light brown", "harness_color": "green" },
{ "dog_id": 7, "fur_color": "dark brown", "harness_color": "black" },
{ "dog_id": 8, "fur_color": "white", "harness_color": "black" }
]
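Under the hood this is just the normal query skill with a format-describing prompt. Here's a rough sketch of what that request looks like with the Moondream Python client (the client call shape is illustrative; consult the docs for exact usage):

```python
# Rough sketch of requesting structured output via the query skill.
# The client call shape is illustrative; consult the docs for exact usage.
import json
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY")  # placeholder key
image = Image.open("sled_dogs.jpg")    # placeholder image path

prompt = "A JSON array with keys: dog_id, fur_color, harness_color."
answer = model.query(image, prompt)["answer"]

dogs = json.loads(answer)              # the longer context makes well-formed JSON easy to get
print(len(dogs), "dogs described")
```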
OCR
Moondream 3 has drastically improved its OCR abilities. Our vision encoder can get tripped up on tiny fonts (working on it), but it's now useful in many real-world cases.
Example 6
Prompt: "Convert to markdown"
Result:

| Metal | Reaction | Electrode Potential (V) |
| --- | --- | --- |
| Gold | Au⁺ + e⁻ = Au | +1.692 |
| Silver | Ag⁺ + e⁻ = Ag | +0.7996 |
| Copper | Cu²⁺ + 2e⁻ = Cu | +0.342 |
| Iron | Fe³⁺ + 3e⁻ = Fe | -0.037 |
| Lead | Pb²⁺ + 2e⁻ = Pb | -0.126 |
| Nickel | Ni²⁺ + 2e⁻ = Ni | -0.257 |
| Cadmium | Cd²⁺ + 2e⁻ = Cd | -0.403 |
| Iron | Fe²⁺ + 2e⁻ = Fe | -0.447 |
| Zinc | Zn²⁺ + 2e⁻ = Zn | -0.762 |
| Aluminum | Al³⁺ + 3e⁻ = Al | -1.662 |
Benchmarks
Here are some early benchmark results, shown alongside some top frontier models for comparison. In practice, however, it's probably not a fair comparison for Moondream since, in practical terms, Moondream produces answers in a fraction of the time of these bigger models. We'll publish more complete results later and include inference times to make this clearer.
MD3 Preview Technical Notes
Here are some details on our new model architecture. Moondream 3 is a fine-grained sparse mixture-of-experts model with 64 experts, of which 8 are activated for each token. We initialized it from Moondream 2 (a 2B dense model) using drop upcycling. We also extended the usable context length to 32K tokens, which is critical for few-shot prompting and agentic workflows with tool use. We don't fully leverage this longer context in our post-training yet (part of why this is only a preview release). The full 32k context is available if you're interested in fine-tuning the model.
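For readers unfamiliar with fine-grained sparse MoE, here's a generic top-k routing sketch (64 experts, 8 active per token). It's purely illustrative and not Moondream's actual layer implementation:

```python
# Generic sketch of fine-grained sparse MoE routing (64 experts, top-8 per token).
# Purely illustrative; this is not Moondream's actual layer implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=2048, hidden=512, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                        # x: [tokens, dim]
        scores = self.router(x)                  # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only the chosen experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(4, 2048)).shape)  # torch.Size([4, 2048])
```

Only 8 of the 64 expert MLPs run for any given token, which is how the model keeps a 9B parameter budget while doing roughly 2B parameters' worth of work per token.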
(Figure: Long-context perplexity evaluation on GovReport dataset. Each point shows the average cross-entropy loss (nats per token) for a 128-token sliding window at that position, measured across 100 documents truncated to 32,768 tokens.)
We do not use a separate context-length extension phase during training, instead opting to interleave long-context samples while pretraining with a default context length of 4096 tokens. Many context length extension methods like YaRN include an attention temperature scaling component. Inspired by this, we adjust the architecture to enable learned temperature scaling as a function of position, and find this helps with long context modeling.
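To give a rough sense of what learned, position-dependent attention temperature can look like (this is an illustrative reading of the idea, not our actual code), one can scale the attention logits by a learned function of the query position:

```python
# Speculative sketch of learned, position-dependent attention temperature,
# in the spirit of YaRN-style scaling. Not Moondream's actual implementation.
import torch
import torch.nn as nn

class PositionalTemperature(nn.Module):
    def __init__(self):
        super().__init__()
        self.slope = nn.Parameter(torch.zeros(1))  # learned; zero means no scaling

    def forward(self, attn_logits, positions):
        # attn_logits: [batch, heads, q_len, k_len]; positions: [q_len] query indices
        temp = 1.0 + self.slope * torch.log1p(positions.float())  # grows slowly with position
        return attn_logits * temp.view(1, 1, -1, 1)

scale = PositionalTemperature()
print(scale(torch.randn(1, 8, 16, 16), torch.arange(16)).shape)  # torch.Size([1, 8, 16, 16])
```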
Like our last 2B release, this is a hybrid reasoning model that supports both reasoning and non-reasoning mode. Unlike other reasoning models, however, Moondream focuses on visual reasoning with grounding. Here’s an example of what that means:
Each chunk of underlined text in the reasoning is grounded, meaning the model references a particular part of the image. In our playground, you can see what the model is focusing on by hovering over the text.
The model starts with only a small set of visual-reasoning examples, and gradually learns to rely on them more during our reinforcement learning (RL) post-training phase. RL proved so effective that, as we refined our training approach, post-training ended up using more compute than the initial pre-training itself.
The model was trained with load-balancing and router orthogonality losses to help similar tokens specialize together early on; load balancing was then disabled in post-training to avoid catastrophic forgetting from distribution shift. Finally, attention tweaks like learnable temperature and LSE suppression sharpened focus and cut noise, boosting accuracy and clarity.
Conclusion
This preview release comes with some caveats. We haven't optimized the inference code yet, so inference is much slower than anticipated (we're working on it!). We're also still actively training this model, and we expect the capabilities and benchmark scores to improve. We also plan to produce variants of this model (e.g., quantized versions and distilled smaller versions).
The model is now available on the Moondream playground, and you can download it on HuggingFace (Moondream Station will be updated soon). Hit us up on our Discord if you have any questions.
(1) Frontier models don't support object detection natively, so this prompt was used instead: "Detect these objects in the image: [comma-separated list]."
- Jun 23, 2025
- Parsed from source: Jun 23, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream Update: Grounded Reasoning, Better Detection, Faster Generation
Moondream’s latest release adds grounded visual reasoning for sharper interpretation, improved object detection, and 20–40% faster responses. It can think step by step, audit its reasoning, and switch between reasoning and normal modes for speed or accuracy, via playground, API, or HuggingFace.
Grounded Reasoning
Simple tasks, like reading a date off a receipt, need little thought. Harder ones, like finding the median value on a chart, demand real reasoning about where things are and how they relate.
Moondream now supports grounded reasoning, allowing the model to spend some time to think about the problem and reason precisely about positions and spatial relationships within images before generating an answer. This unlocks performance gains for tasks that depend on accurate visual interpretation.
Moondream can now pause, look around the picture, and think step-by-step before answering. We call this grounded reasoning because the model reasons using both logic and visual facts about the image to produce more accurate answers.
Take chart understanding, for example. Without reasoning, Moondream does its best by essentially guessing the answer in one shot. With reasoning on, it breaks the job into three small steps, then nails the answer.
Moondream's reasoning is specifically designed for accurate visual reasoning. The model can choose to "ground" its reasoning with spatial positions in the image when needed to accurately solve a task. Consider counting objects in images, for example. When there are more than a couple of instances of an object in an image, the model chooses to explicitly point at them in the reasoning trace, similar to how a human might tackle the same problem.
The recent Vision Language Models are Biased paper shows that many VLMs suffer from confirmation bias when counting, returning memorized knowledge instead of actually counting when they see familiar objects. As we deploy VLMs in high-stakes applications, it's critical that we are able to ensure and audit that the models are actually reasoning about the image instead of simply performing sophisticated pattern matching. Our approach to visual reasoning not only helps the model reason about images, but also provides a way for users to audit what it's doing and understand failure modes.
Moondream supports both reasoning and normal queries with the same model, meaning you can trade off accuracy vs. speed depending on the complexity of the task you're trying to perform. You can enable reasoning by passing reasoning=True with the query skill. This reasoning mode is powerful but still experimental. For simpler tasks, the original mode may perform better, so we recommend trying both.
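Here's a minimal example of toggling that flag. The reasoning=True argument is the one described above; the surrounding client call is illustrative, so check the docs for exact usage:

```python
# Minimal sketch: trade speed for accuracy by toggling reasoning on the query skill.
# reasoning=True is the flag described above; the surrounding client API is illustrative.
import moondream as md
from PIL import Image

model = md.vl(api_key="YOUR_API_KEY")  # placeholder; or point at a local Moondream instance
chart = Image.open("bar_chart.png")    # placeholder chart image

fast = model.query(chart, "What is the median value?")                     # normal mode
careful = model.query(chart, "What is the median value?", reasoning=True)  # grounded reasoning

print(fast["answer"], careful["answer"])
```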
How We Taught It to Think
We've started using reinforcement learning (RL) to train Moondream. If you're not familiar with RL, here's a short description of how it works. Traditionally, models are trained by asking them questions where the correct answer is known (aka "Ground Truth"). If the model doesn't answer correctly, we apply a corrective change in the model weights to encourage it to answer better next time. This process is called "supervised learning".
RL works a little differently. We start the same way, with a question where we know the correct answer. With RL, however, we ask Moondream to generate numerous answers using different temperatures, then grade the answers on how good they are: not only whether the answer is correct, but whether it used correct reasoning. This is easier with tasks that have a single answer (e.g., "What's the sum of the numbers in the table?"). For more open-ended answers (e.g., "Caption this image"), we use another Moondream model to judge the answer.
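Conceptually, the grading step is just a reward function over sampled answers. Here's a toy sketch (not our actual grader) to make the idea concrete:

```python
# Toy sketch of how sampled answers could be graded during RL (not the actual grader).
# Closed-form tasks get an exact-match reward; open-ended ones would call a judge model.
def grade(answer, ground_truth, reasoning):
    reward = 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0
    # Crude stand-in for "used correct reasoning": the trace mentions the true value.
    if ground_truth.lower() in reasoning.lower():
        reward += 0.2
    return reward

samples = [("27", "the values are 12, 8 and 7, so the sum is 27"),
           ("25", "the values look like 12, 8 and 5, so the sum is 25")]
print([grade(answer, "27", reasoning) for answer, reasoning in samples])  # [1.2, 0.0]
```

In practice both the correctness check and the judging of open-ended answers are far more involved, but the shape is the same: sample, score, and nudge the weights toward higher-scoring samples.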
So far we've used RL to train Moondream on 55 tasks and the results are impressive. We plan to increase this to ~120 before the next update to the model.
With smaller models such as Moondream, it's common practice to "bootstrap" the model with reasoning traces from a bigger model. We haven't taken this approach for two reasons: first, our context length is currently limited to 2048 tokens, and this will need to be increased before we can train on longer reasoning traces. Secondly, most open reasoning models are focused on mathematical and coding reasoning, and this is not as effective for visual reasoning.
Sharper Object Detection
Moondream's Object Detection skill just got a lot better with this release. Previously, Moondream had a tendency to clump together objects that were close to each other, and sometimes struggled with finer-grained object specifications (e.g. "blue bottle" instead of just "bottle"), compared to Moondream's pointing capability.
This was largely due to the quality of the datasets we used. Object detection datasets generated by humans tend to be messy and imprecise, as drawing highly accurate bounding boxes is tedious. Annotators often take shortcuts, and sometimes draw a single box around multiple instances when they're close to each other in the image.
We used RL to overcome this, and the results are impressive. We'll be sharing more details about this in a separate blog post, but for now, here's a sample of the results.
Faster Text Generation
Moondream now generates answers 20-40% faster than before. This is because we upgraded the model to use a "superword" tokenizer that encodes text more efficiently. This means Moondream needs to emit fewer tokens to generate the same answer, and we achieve this without any drop in accuracy.
Changing a tokenizer typically involves a costly step to retrain the entire model. We built a lightweight "tokenizer transfer hypernetwork" that enabled us to adapt smoothly to new tokenizers without retraining.
Lastly, this "tokenizer transfer hypernetwork" also makes it easier to train multilingual variants in the future.
UI Understanding
Moondream's performance on ScreenSpot, a benchmark for UI understanding, jumped significantly from 60.3 to 80.4. This makes Moondream a great choice for UI-focused applications that require fast element localization.
While the model cannot be used as a standalone computer use agent yet, it can work very effectively when treated as a tool to be used by a larger agentic model. This is the setup used by projects like Magnitude, where a bigger LLM writes test cases that leverage Moondream for UI understanding tasks. This separation of planning and execution models allows them to run tests more quickly and reliably than using alternatives like OpenAI or Anthropic's Computer Use APIs.
Looking Ahead
Grounded reasoning, smarter object detection, faster tokenizer, and better UI understanding represent a big step forward for Moondream. These fundamental advances also open the door to more improvements on the horizon. We look forward to pushing the model to achieve deeper reasoning capabilities and broader task coverage, and even more speed optimizations.
There's more to this release than what we've covered here, so check it out yourself. Check out our free online playground, or our free cloud API, or run it locally using Moondream Station - also free. If you prefer really low-level stuff, you can use it directly via Hugging Face Transformers.
Happy Moondreamin’!
- May 21, 2025
- Parsed from source: May 21, 2025
- Detected by Releasebot: Dec 6, 2025
Fewer bits, more dreams
Moondream unveils 4-bit quantization to slash memory use and speed up inference. The quantized model hits 99.4% of full precision accuracy with a 42% memory drop and 34% faster performance on RTX 3090. Open source on Linux via Moondream Station; Mac coming soon; HuggingFace access.
Introduction
When we compare model sizes, we usually quote the number of parameters. But that doesn't tell the whole story. Depending on other factors, a model with fewer parameters may actually use more memory, and run inference slower than a larger model. That's why we prefer to focus on actual memory size and inference speed.
4-bit quantization
Today we're excited to announce a new feature that makes Moondream run faster and use less memory: 4-bit quantization. In case you're not familiar, quantization is a technique that reduces the number of bits used to store a model's weights. For example, weights are usually stored as 16-bit floats, which take 2 bytes each. A 4-bit weight takes only 0.5 bytes.
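The back-of-the-envelope weight-memory math looks like this (weights only; activations and KV cache add more, which is why the measured peak numbers below are higher):

```python
# Back-of-the-envelope weight memory for a roughly 1.9B-parameter model.
# Weights only; real peak memory also includes activations and KV cache.
def weight_memory_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

params = 1.9e9
print(f"16-bit: {weight_memory_gb(params, 16):.2f} GB")  # ~3.80 GB
print(f" 4-bit: {weight_memory_gb(params, 4):.2f} GB")   # ~0.95 GB
```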
The challenge with quantization is that it can lead to a loss of model accuracy. We've been working on this for a while, and we're excited to share that our 4-bit quantized model reaches 99.4% of the accuracy of the full precision model. In practice you'd probably never notice the difference.
Performance and Availability
Meanwhile, you probably would notice the speedup and memory improvement. Peak memory usage is reduced by 42% (from 4.2GB to 2.4GB) and inference speed is increased by 34% (on an RTX 3090), although the speedup may vary by machine. On the accuracy front, we measure the average score on 8 popular vision benchmarks: the 4-bit quantized model achieved an average score of 74.5 vs. 74.9 for the full precision model.
So let's update our chart from our 2025-04-14 model release.
Both the full precision model and the 4-bit quantized model are available as open source. The 4-bit model is currently available for Linux users in Moondream Station (our one-click solution), with Mac support coming soon. Advanced users can also access the model directly via Hugging Face Transformers.
Happy Moondreamin'.
- May 1, 2025
- Parsed from source: May 1, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream Station Launch
Moondream Station launches as a one-click way to run Moondream locally, handling download, setup, and updates. It lets you run Moondream from the CLI or via a local port 2020, with Mac live and Windows/Linux coming soon.
Moondream Station
tl;dr: Moondream Station is a one-click way to run Moondream locally. It's free, fast, and comes with no technical headaches. Download here.
After we launched Moondream last year, we hosted a hackathon in Seattle. It was clear right away that Vision Language Models (VLMs) are changing the game. Just by prompting, people were unlocking new ways to work with images.
What we learned that day stuck with us: even though Moondream can run on a laptop, developers would trip up on the installation and setup. They wanted something that just works. So we launched a cloud version, free.
However, we also kept seeing people wanting to run it locally, and still stubbing their toes on the download and setup. So we finally rolled up our sleeves and did something about it. Today we're launching Moondream Station – a one-click solution for running Moondream locally.
Moondream Station manages all of the tedious parts for you: the download, setup, and updates needed to run Moondream on your desktop. Once it's running, you can invoke Moondream through the command-line client, or write code that calls it through local port 2020 (hah!). Here's a video showing it in action:
It's Mac only for now, but Windows and Linux are coming soon. We have a bunch of feature updates planned. Drop by our Discord if you have questions or any problems.
- Apr 14, 2025
- Parsed from source: Apr 14, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream 2025-04-14
Moondream 2025-04-14 introduces a more efficient VLM with improved document and chart understanding, better OCR, UI comprehension, and stronger counting benchmarks, all while reducing compute. The note also outlines training tricks and the roadmap for future upgrades.
Moondream Release Notes
Moondream is not just one of the tiniest VLMs. It's the world's most efficient VLM. It produces highly accurate answers with the least computing possible. On a graph with intelligence on the Y axis and resource usage on the X axis, Moondream aims to be top left.
AI is still early in its development, and just as we've seen mainstream computing evolve from mainframes to desktops, mobile, and even smaller devices, so will AI. This is especially true for Vision AI. Almost every physical device is improved if it can reason about its surroundings. Today this often means streaming video or images back to the cloud, which is slow, costly, and privacy-problematic. But with improving hardware and models, this is about to change, and today is another solid step forward.
But efficiency isn't just for the edge, it's for the cloud too. Analyzing vision at scale can be costly. Customers turn to Moondream when scale becomes a concern. Analyzing millions of images, or thousands of hours of video with Moondream is more cost effective than with any other VLM.
That's why we're excited to announce Moondream 2025-04-14. As usual, we've improved on all of the benchmarks we focus on, with some notably large improvements. Let's first look at how it stacks up vs our previous release just a few weeks ago:
[Comparison chart with previous Moondream version]
Now let's see where that puts us vs other top open source small VLMs.
[Comparison chart with other small VLMs]
Tech Notes
This release of Moondream was trained on about 450B tokens. For contrast, models like Gemma 3 4B have been trained on 4 trillion tokens, and Qwen 2.5 VL utilized 18 trillion tokens for text modeling, plus an additional 4 trillion tokens for their VLM. Our efficiency in producing a high performance model with a fraction of the training budget is the result of:
- High-Quality Data: We've observed that small models are especially sensitive to noisy data. We produce training data that contains both rigorously filtered real-world data and carefully crafted synthetic data, designed to minimize domain gaps.
- Focused Scope: Moondream is specifically designed for developers creating computer vision applications. We prioritize relevant capabilities over broader use-cases like multi-turn conversations or haiku writing.
- Training Techniques: We've developed a set of training methods that maximize training efficiency. We keep most of them proprietary, but here are two we're disclosing today:
- We use a custom second-order optimizer, crucial for balancing conflicting gradients, such as object detection versus text generation tasks.
- We use a self-supervised auxiliary image loss that significantly accelerates model convergence.
Our focus this release was to improve on our previous one by targeting document understanding, charts, and user interfaces. Moondream has become quite proficient at reading documents. Here are examples of document and layout understanding:
[Document understanding example]
[Layout understanding example]
This improvement in document and text reading has also yielded sizeable bumps in our text-related benchmarks:
- ChartQA: Improved from 74.8 to 77.5 (82.2 with PoT)
- DocVQA: Improved from 76.5 to 79.3
- TextVQA: Improved from 74.6 to 76.3
Performance
Counting
This release of Moondream has serious counting chops (e.g., "how many birds are in this image?"). To see how good we got, here's a chart comparing ourselves to all the big names in VLMs.
[CountBenchQA performance chart]
Chart Understanding
Chart understanding has been a key focus for this release. Charts require models to ground text and numbers in a visual layout, then reason over them precisely. On ChartQA, Moondream improves from 74.8 in our last release to 77.5, and 82.2 with Program of Thought (PoT) prompting.
PoT is a prompting strategy where the model generates and executes code to solve problems step-by-step. This is especially valuable in chart QA, where reasoning failures often stem not from misreading the chart, but from making small but critical logical errors, like summing three correct numbers incorrectly. Rather than expecting the model to always reason flawlessly in natural language, we let it write and run code.
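To make the PoT idea concrete, here's an illustrative (not exact) version of the flow: the model emits a short program for a chart question, and we execute it rather than trusting the model to do the arithmetic in prose:

```python
# Illustrative Program-of-Thought flow (not the exact Moondream pipeline):
# the model emits a short program for a chart question, and we execute it.
model_generated_code = """
values = {"Q1": 14, "Q2": 9, "Q3": 21}   # hypothetical values the model read off a chart
answer = sum(values.values())
"""

namespace = {}
exec(model_generated_code, namespace)     # in practice, run inside a sandbox
print(namespace["answer"])                # 44
```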
[Chart understanding example]

Here are a few more notes from this update:
- To access the OCR capability for docs and tables, use the prompt "Transcribe the text" or "Transcribe the text in natural reading order".
- Object detection supports document layout detection (figure, formula, text, etc).
- UI understanding has improved, with ScreenSpot [email protected] up from 53.3 to 60.3.
Conclusion
As excited as we are for this launch, we have a lot more coming up for next releases too. Here's a few areas we're focused on:
- Repetition Handling: We've seen an increase in repetitions in our inferences, especially when generating long document answers. We've added temperature and nucleus sampling to reduce output repetition, with a repetition penalty setting coming soon. Training adjustments will further mitigate this issue in future releases.
- Tokenizer Upgrade: Our current tokenizer, derived from the three-year-old CodeGen model, hinders optimal training and inference performance. We plan to adopt either a traditional BPE tokenizer (ensuring broad ecosystem compatibility) or a BPE variant (optimized for efficiency).
- Bounding Box Accuracy: Currently, the model occasionally generates bounding boxes encompassing multiple items. We have identified the root cause and a solution is forthcoming. Meanwhile, prefixing object detection queries with "(coco)" can help mitigate this issue.
- Continued Training: As performance continues to steadily improve, we anticipate training for an additional 200 billion tokens before the next release.
We invite you to go check it out for yourself in our playground, or start coding today.
You can run it locally using our new Moondream Server (Mac and Linux for now, Windows coming…), or in our cloud (with a generous free tier).
Happy Moondreamin'.
- Mar 28, 2025
- Parsed from source: Mar 28, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream 2025-03-27 Release
Moondream unveils a new release with Long captions, image tagging, and a faster transformer client, plus near-state-of-the-art object detection gains. Built on real user feedback, this update is ready to try today in the playground or via download.
We're excited to announce a new Moondream release. There's a lot to unpack, but we'll give you the highlights here, and share more over the coming weeks. The improvements in this release were driven from real-world usage and feedback from our community and customers. We want to extend a huge thank you to everyone who contributed to that. Keep it coming, let's gooooo!
Longer
The ability to caption images is one of the top Moondream use cases. Super accurate captions from a fast, efficient, and easy to run model seems to be a winning combo! Use cases range from synthetic data generation to real-world understanding and robotics.
Until now, Moondream offered a choice of "Short" or "Normal" length captions. This model introduces a "Long" format. From our testing, this generates roughly 2x longer captions than "Normal".
Better
We're still doing exhaustive evals, but so far we've seen major improvements on object detection (COCO mAP), OCR (OCRBench), and counting (CountBenchQA):
We had some customers request improvements in our object detection capability. We were excited to work on that, and we're especially pleased with the results. This new COCO mAP score now makes Moondream near state of the art on object detection.
We also had a customer with a specific need: the ability to tag all the things visible in an image. While we don't have a public benchmark available to highlight, our internal benchmark and vibe checks show a huge improvement in this ability.
We call it "Image Tagging," and you can try it out by using this prompt in your image query:
"List all visible objects, features, and characteristics of this image. Return the result as a JSON array."
Here's an example of how it works:
Faster
We have a client update planned in the next few weeks that includes mobile, but in the meantime, we snuck in one key improvement to our transformers-based client. It just got a lot faster.
From our testing, calling compile() on the model sped up inference from 61.4 tok/s to 123.4 tok/s on an Nvidia 3090.
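For reference, here's roughly what that looks like with the transformers client. Treat it as a sketch: the method surface can vary between model revisions, so check the model card for current usage.

```python
# Rough sketch of the transformers client with compile() for faster decoding.
# Method surface and recommended revision can change; check the model card.
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    device_map={"": "cuda"},  # assumes a CUDA GPU, e.g. the 3090 used above
)
model.compile()               # the speedup described here (61.4 -> 123.4 tok/s on a 3090)

image = Image.open("street.jpg")  # placeholder image
prompt = ("List all visible objects, features, and characteristics of this image. "
          "Return the result as a JSON array.")
print(model.query(image, prompt)["answer"])
```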
That not only makes it cheaper to run, it also opens up more possibilities for near-realtime processing, especially for video streaming.
Stronger
Moondream's improvements are driven by the feedback and engagement of its growing community and customers.
This is creating a flywheel effect, where the feedback and requests from the community drive us to make more improvements to the model, and these improvements drive more adoption from an ever-growing community.
In other words, you're part of the reason Moondream keeps growing. We extend our tip of the hat to you, our Moondreamers.
Conclusion
We'll be sharing more details over the next few weeks, but the great news is that you don't have to wait.
You can download the model now, or go kick its tires in our playground, or, better yet, build something with our free cloud offering.
- Jan 9, 2025
- Parsed from source: Jan 9, 2025
- Detected by Releasebot: Dec 6, 2025
Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection
Moondream 1.9B arrives with a new Gaze Detection capability, stronger OCR, and easier structured output across JSON, XML, Markdown, and CSV. It benchmarks against top small vision models and invites developers to try in the playground or download.
Today, we’re announcing a new release of Moondream 1.9B. It has improvements across a bunch of areas and includes a new capability, Gaze Detection. This release marks the first time we’ve focused on industry benchmarks, and we’re excited to share some results. Despite these upgrades, the model is still just 1.9B, so it's fast and can run everywhere. Try it out in our playground or download it now.
1. Structured Output
Building with Moondream is easier than ever with our new support for structured output formats such as JSON, XML, Markdown, and CSV. Here are some examples:
Example 1: JSON structured output
Example 2: XML structured output
Example 3: Markdown structured output
2. New Capability: Gaze Detection
Traditional Vision AI consists of specialized models built for different tasks like “object detection” (outline a specified object's region in an image) or “captioning” (create a caption for an image). Moondream supports several of these common Vision AI tasks as “capabilities,” all within a single model. Moondream already supports object detection and captioning, as well as “visual querying” (ask any question to a photo) and “pointing” (get the x,y coordinates of elements within a photo).
Today, we are excited to launch a new capability: Gaze Detection.
This capability tracks human attention. Note that this capability is experimental. We're releasing it to get feedback from developers so we can improve it over time.
Example 1: Driver Gaze Detection
Example 2: Sport Gaze Detection
3. Benchmarks
We’ve always been a bit iffy about benchmarks. Some focus on problems we don’t think are relevant to Moondream (e.g., solving math equations). Others include weird questions and wrong answers (at least to us — see the Weird Benchmarks appendix below). And focusing too much on benchmarks can lead to weird behaviors, with allegations that some models "cheat" by training on the actual benchmarks themselves.
Despite this, we decided to improve our scores because we don’t want anyone sleeping on Moondream because of low results. We benchmarked ourselves along with the top small vision language models.
You can find our individual benchmark results below:
4. Better OCR
We made changes to Moondream's vision layer that have improved text reading/OCR significantly. We've also trained it on a lot more document querying and understanding. Here are some examples:
Example 1: OCR Example
Example 2: Chart OCR Example
Looking Ahead
As pumped as we are about this release, the best part, for us, is seeing what you build with it. VLMs are making it faster, cheaper, and easier than ever to build next-generation vision-enabled apps. Getting set up takes minutes, or you can try out Moondream in our playground. We offer cloud inference with a generous free tier, or you can download it and run it yourself. Check out our docs for a getting-started guide and lots of sample code.
Happy Moondreaming!
Appendix 1: Weird Benchmark Questions
Here’s a few examples of weird benchmark questions...
Example 1: Confusing Benchmark Question
In GQA, the following image has a question that asks "Is the traffic signal on the right side or the left?" If you look closely, you can see there are traffic lights on both sides of the street. However, GQA expects the answer to be "Left."
Example 2: Nonsensical Benchmark Question
In the following image, GQA asks “What animal sits in the bench that is on the right side?" It expects the answer to be “bird” 🤯.