Moondream Release Notes


15 release notes curated from 16 sources by the Releasebot Team. Last updated: Jun 5, 2026

  • Jun 4, 2026
    • Date parsed from source:
      Jun 4, 2026
    • First seen by Releasebot:
      Jun 5, 2026
    Moondream

    Popping the GPU Bubble

    Moondream's Photon inference engine now achieves near-realtime VLM inference, with up to 35% higher decode throughput by hiding GPU bubbles through pipelined decoding, ping-pong slots, safer constrained decoding, and cleaner request teardown.

    Photon, Moondream's inference engine, achieves near-realtime VLM inference (~33ms on NVIDIA B200). This is a peek into how it delivers up to 35% higher decode throughput by optimizing how the GPU works.

    The bubble

    How do you make an AI model run as fast as possible? This is a question we obsess over at Moondream HQ. The GPU handles all the math involved in model inference, so at first glance it doesn't seem like there's much to it: just tell it what to do and wait for the answer. But if you start looking at how it actually works under the hood, you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.

    When a typical AI model generates text, it produces one token at a time (a token is a chunk of text, roughly a few characters). Each token depends on the tokens before it, which makes the model autoregressive, so generation is sequential. You can't compute the third token before you have the second. This decode loop involves a round trip between the CPU and GPU. The GPU does most of the heavy lifting to run the actual model, performing billions of arithmetic operations to produce the next token. But there's also a surprising amount of work done by the CPU. It selects which requests to run next, sets up the metadata the GPU needs for them, picks the actual token out of the model's output and records it, and more.

    The challenge is that one token's worth of GPU work is small, while the CPU housekeeping is a fixed cost paid on every trip. If the GPU has to wait for that housekeeping before it can start the next token, it sits idle for part of every loop. This is why we get GPU bubbles.

    In this post we're going to dive into how Photon hides these bubbles using a technique called pipelined decoding. The idea is to overlap the two kinds of work: we start GPU work on the next token while the CPU is still finishing the last one.

    Here's the shape of the problem.

    In the blocking version (top), every step is a baton pass. The CPU plans and launches a forward, the GPU runs it, then the CPU synchronizes, waits for the results to land, commits them, and only then starts planning the next step. This is because the plan depends on the token we select. For example, if the model indicates it has finished answering, then we need to schedule a new pending request from our queue. The GPU sits idle waiting for the CPU to finish its commit-plan-launch work.

    The fix is to pipeline the loop. Launch the next forward while the current step's token is still coming back and being committed. That's the pipelined version (bottom): the forwards run back-to-back, and the CPU work is overlapped underneath them.

    The reason we can is that the token we just sampled doesn't have to leave the GPU. The next forward reads it straight from GPU memory as its input. We still want a copy on the CPU eventually, to detokenize it, stream it, and decide whether the request is done, but that is bookkeeping we can do a moment later, in the background, while the next forward already runs. Not waiting on that copy is the move that removes the bubble.
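To make the overlap concrete, here is a toy timing model of the two loops. It is only a sketch: the function names and the 4 ms / 1 ms step costs are illustrative assumptions, not Photon code or measured values.

```python
def blocking_time_ms(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """Blocking loop: every forward waits for the CPU bookkeeping before it."""
    return steps * (gpu_ms + cpu_ms)


def pipelined_time_ms(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """Pipelined loop: bookkeeping for step t overlaps the forward for t+1.

    The steady-state period is whichever side is slower. The shorter side
    is paid once at the edge of the run, where there is nothing to overlap.
    """
    return steps * max(gpu_ms, cpu_ms) + min(gpu_ms, cpu_ms)


# Hypothetical costs: a 4 ms forward and 1 ms of CPU bookkeeping per step.
blocking = blocking_time_ms(100, gpu_ms=4.0, cpu_ms=1.0)
pipelined = pipelined_time_ms(100, gpu_ms=4.0, cpu_ms=1.0)
print(f"blocking={blocking} ms, pipelined={pipelined} ms")
```

With these made-up numbers the bookkeeping is a fifth of every blocking step, and pipelining hides all of it except the final commit.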

    Making it safe requires three things, which we cover in the rest of this post: keeping step buffers from colliding (ping-pong slots), getting the sampling order right for constrained decoding (forward now, sample later), and cleaning up after a request finishes (zombies).

    Mechanism 1: ping-pong slots

    To run a decode step, the GPU needs a working set of buffers: a place to stage the input (the last generated token and its position in the sequence), a place for the model to write its output (the logits, one score per word in the vocabulary), a place to land the sampled token, and some bookkeeping the attention kernel needs to find each sequence's cached keys and values (its KV cache). We keep pinned (page-locked) host buffers on both ends, so the copies on and off the GPU run as background DMA (direct memory access) transfers instead of blocking the CPU.

    These buffers are allocated once and reused on every step. We work hard to avoid performing GPU memory allocations at runtime, because they can cause device synchronization and introduce bubbles. Fixed buffer addresses are also needed for capturing the decode step once as a CUDA graph and replaying it, reducing kernel launch overhead. We call this bundle a DecodeSlot.

    This works, but introduces a blocker for pipelining. The buffers stay in use until the step is done, so we cannot start the next step until the current one finishes. To overlap two steps, the second step needs its own working set, otherwise it can overwrite the results of the first step before the CPU has read them. So we keep two slots and alternate between them, ping-pong style.

    One thing to note about launch: we don't execute kernels the instant we issue a launch from CPU. Instead, we enqueue them onto a stream -- an ordered queue that the GPU drains in order. Work on the same stream runs sequentially, while work on separate streams can overlap. Both slots put their forwards onto the same compute stream. The slots are not for GPU parallelism. They only exist so the CPU can process one slot's results while the GPU runs the other slot's forward.

    The forwards all share that one compute stream, but the copies do not. Each step's device-to-host copy, the one that brings the sampled token back for bookkeeping, goes on a separate copy stream, so it can run while the GPU is busy with the next forward. That is what lets us not wait for it. We anchor the copy to an event recorded the instant the step's outputs are written, so it waits on exactly that step's work and nothing queued behind it.

    A slot only becomes free once its results have been read, not just once the GPU is done with it. Its pinned host buffer is the landing site for a copy that may still be in flight, so handing the slot to a new step too early would overwrite a copy mid-transfer, creating a hard-to-debug corruption bug. So the slot stays reserved through the commit that reads it, and is released only once that commit has finished.
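The reservation rule can be sketched in a few lines. This is a CPU-only illustration under stated assumptions: the DecodeSlot fields, the SlotPool class, and its method names are hypothetical stand-ins, and no real GPU buffers, streams, or events are involved.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeSlot:
    """Stand-in for one preallocated working set (fields are hypothetical)."""
    index: int
    busy: bool = False
    host_tokens: list = field(default_factory=list)  # stands in for the pinned host buffer


class SlotPool:
    """Hands out slots in strict ping-pong order."""

    def __init__(self, num_slots: int = 2):
        self.slots = [DecodeSlot(i) for i in range(num_slots)]
        self._next = 0

    def acquire(self):
        """Return the next slot in order, or None if it is still reserved."""
        slot = self.slots[self._next]
        if slot.busy:
            return None
        slot.busy = True
        self._next = (self._next + 1) % len(self.slots)
        return slot

    def release_after_commit(self, slot):
        """Free a slot only after the CPU has finished reading its results."""
        slot.busy = False


pool = SlotPool()
first = pool.acquire()    # step t uses slot 0
second = pool.acquire()   # step t+1 uses slot 1
blocked = pool.acquire()  # slot 0 has not been committed yet
pool.release_after_commit(first)
reused = pool.acquire()   # slot 0 is available again
```

The third acquire fails because slot 0 is still the landing site for step t's results; it only succeeds after the commit releases it.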

    Mechanism 2: forward now, sample later

    The next forward can run ahead because it doesn't depend on anything the CPU does with the last token. But two things about the next step do depend on the last step's committed result. One is which sequences are still in the batch: if a request just finished, it shouldn't be in the next forward. That is the next section (zombies). The other is what tokens the next step is even allowed to sample, and that one is this section.

    It comes from constrained decoding. Moondream's spatial skills return structured output instead of free text: point returns a coordinate, detect returns boxes, segment returns an outline. We get those from the same decode loop by restricting which tokens the model may produce at each step: we force the scores (the logits) of the disallowed ones to negative infinity before we sample. A point step has to emit a coordinate, a detect request walks an x, y, size cycle, and so on. Which tokens are allowed, the mask, depends on what has been produced so far, so the mask for step t+1 depends on the token we sampled at t.
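As a minimal sketch of the masking step, the snippet below forces disallowed scores to negative infinity and then picks a token. The helper names, the four-entry vocabulary, and the greedy argmax are assumptions made for illustration; they are not Moondream's actual sampler.

```python
import math


def apply_mask(logits, allowed):
    """Set the score of every disallowed token id to negative infinity."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_sample(logits):
    """Pick the highest-scoring token id; a real sampler may be stochastic."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 5.0, 1.0, 3.0]  # toy scores for a 4-token vocabulary
allowed = {0, 3}               # hypothetical: only these ids are valid at this step

unconstrained = greedy_sample(logits)
constrained = greedy_sample(apply_mask(logits, allowed))
print(unconstrained, constrained)
```

The model's favourite token (id 1) is not in the allowed set, so the constrained pick falls to the best allowed id instead.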

    The dependency is in sampling, not in the forward.

    Each scheduler tick goes through three phases (launch, commit, and finalize):

    1. Launch the forward for t+1. It doesn't depend on the mask, so it goes immediately.
    2. Commit step t: wait on the in-flight copy and advance the request's decode state. That is needed to decide the mask for t+1.
    3. Finalize sampling for t+1: with the state current, build the mask and sample.

    Sampling t+1 lands after committing t because the commit is what makes t+1's mask correct. We call this "commit-before-finalize" ordering. The GPU runs the t+1 forward through steps 2 and 3, so the commit disappears from the critical path.

    For plain text there is no mask, so forward and sampling can both run a step ahead. For constrained sequences the forward still runs ahead, but sampling waits on the previous commit, which caps how far ahead we get with no special-casing. One loop handles both.
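The three phases above can be traced with a small event log. This is a sketch of the ordering only, with hypothetical event names; it does not run a model or touch the GPU.

```python
def run_ticks(num_steps: int) -> list:
    """Record the phase order across scheduler ticks (names are illustrative)."""
    events = ["launch forward 0", "finalize sample 0"]  # nothing to commit yet
    for t in range(num_steps - 1):
        events.append(f"launch forward {t + 1}")   # 1. does not need the mask
        events.append(f"commit {t}")               # 2. advances the decode state
        events.append(f"finalize sample {t + 1}")  # 3. mask is now correct
    events.append(f"commit {num_steps - 1}")       # drain the last step
    return events


events = run_ticks(3)
for event in events:
    print(event)
```

In the log, each forward is launched before the previous step is committed, while each sample is finalized only after that commit, which is the commit-before-finalize ordering described above.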

    Mechanism 3: zombies: finalize early, release late

    Back in "forward now, sample later" we flagged two ways the next step depends on the last step's committed result. The sampling mask was one. Batch membership is the other, and it takes a bit of care to handle right.

    To launch step t+1 we first decide its batch, which sequences are in it, and we do that before committing step t. So what happens when a sequence hits its stop token at t, but is already baked into t+1's forward? You can't un-launch GPU work. The sequence is finished, yet still physically present in a batch that's executing.

    Photon calls these zombies, and instead of bolting on cancellation logic, it lets the behavior emerge from two per-sequence fields:

    • finalized: True after the sequence has hit EOS or its length cap.
    • inflight_refs: the number of in-flight steps that still reference this sequence (0, 1, or 2).

    When step t commits and detects EOS, the sequence is marked finalized and its result is emitted — but it isn't torn down, because inflight_refs is still nonzero (step t+1 references it). At step t+1's commit, the sequence is already finalized, so the commit is skipped: no token is appended, no state mutates. The zombie was harmlessly along for the ride — it occupied its slot and wrote some KV that nobody will read. Only when inflight_refs finally hits 0 are its KV pages and LoRA slot released.

    This finalize-early, release-late dance is a small amount of refcounting that replaces what would otherwise be a thicket of "cancel this row mid-flight" special cases.
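Here is a minimal sketch of that refcounting. The two field names come from the post; everything else (the Sequence class, the launch and commit helpers, the EOS id, and the released flag that stands in for freeing KV pages and the LoRA slot) is a hypothetical illustration.

```python
from dataclasses import dataclass, field

EOS = -1  # hypothetical stop-token id


@dataclass
class Sequence:
    tokens: list = field(default_factory=list)
    finalized: bool = False
    inflight_refs: int = 0
    released: bool = False  # stands in for freeing KV pages and the LoRA slot


def launch(seq: Sequence) -> None:
    """A step that includes this sequence has been launched."""
    seq.inflight_refs += 1


def commit(seq: Sequence, token: int) -> None:
    """Commit one step's sampled token for this sequence."""
    seq.inflight_refs -= 1
    if not seq.finalized:
        if token == EOS:
            seq.finalized = True      # the result would be emitted here
        else:
            seq.tokens.append(token)
    # A finalized sequence is torn down only when no step references it.
    if seq.finalized and seq.inflight_refs == 0:
        seq.released = True


seq = Sequence()
launch(seq)                  # step 0
launch(seq)                  # step 1 launched before step 0 commits
commit(seq, 7)               # step 0 commits an ordinary token
launch(seq)                  # step 2 launched before step 1 commits
commit(seq, EOS)             # step 1 hits EOS: finalized, but step 2 is in flight
zombie_alive = not seq.released
commit(seq, 42)              # step 2 is the zombie: its token is ignored
```

After the last commit the sequence holds only the token committed before EOS, and release happens exactly when the reference count reaches zero.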

    Prefill rides the same pipeline

    So far this has all been about decode steps, but a real serving loop is constantly doing two different kinds of work: prefill (processing a new request's prompt + image, the expensive one-shot forward over many tokens) and decode (one token at a time for everyone already running).

    Photon doesn't separate them. A prefill is just another kind="prefill" launch in the same two-slot pipeline. Because the pipeline only cares that a slot is free, not what kind of work last used it, a prefill forward can be launched into one slot while a decode step from the other slot is still being committed, and vice versa. The expensive prefill forward runs on the GPU while the CPU commits decode results; the next decode forward runs while the CPU finishes admitting the just-prefilled request. The same commit ordering (and the same inflight_refs bookkeeping) keeps everything correct across the two kinds, so none of the zombie or constrained-decode logic needs a special case for "what if a prefill is in flight."

    This matters most when outputs are short. A request that emits three tokens spends almost all of its life in prefill and admission, not decode, so a workload of many short requests is really a stream of prefills with a little decode sprinkled in. Sharing one pipeline is what lets that stream overlap its own CPU bookkeeping instead of serializing prefill behind decode and back again.

    A cost model for the bubble

    How much should pipelining actually buy you? You can predict it from the parts of a decode step, and then check the prediction against measurement.

    A decode step is three pieces of work:

    • forward: the heavy GPU matmuls. At decode this is memory-bandwidth bound: every token streams the whole weight set through the cores, so it has a floor near weight_bytes / memory_bandwidth. It shrinks as memory gets faster or as the model gets smaller.
    • sampling: turning the scores into a committed token: the constrained-decode mask, the argmax/sample, the spatial (grounding) decode, and the device→host copy of the result. All GPU work.
    • bookkeeping: the CPU around it. Choose the next batch (plan), launch the graph (launch), commit the previous step (commit).
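The forward floor in the first bullet is a one-line calculation. The sketch below uses made-up round numbers (4 GB of weights, 1 TB/s and 2 TB/s of memory bandwidth) purely for illustration; they are not Moondream or GPU specifications.

```python
def forward_floor_ms(weight_bytes: float, memory_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on a memory-bound decode forward: stream every weight once."""
    return weight_bytes / memory_bandwidth_bytes_per_s * 1000.0


# Hypothetical figures, chosen only to show the scaling.
slow_gpu_ms = forward_floor_ms(4e9, 1e12)
fast_gpu_ms = forward_floor_ms(4e9, 2e12)
print(f"{slow_gpu_ms:.1f} ms vs {fast_gpu_ms:.1f} ms")
```

Doubling the bandwidth halves the floor, while the CPU bookkeeping stays the same size, so it becomes a larger share of each step.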

    A blocking loop runs the three in series, so the GPU sits idle through the bookkeeping — that idle is the bubble. Pipelining slides the bookkeeping of one step underneath the forward + sampling of the next, so the period collapses toward forward + sampling and the bubble disappears. Measured per step, pipelined, that's exactly what we see — the GPU is busy for essentially the whole period (steady-state medians, moondream2, ms):

    forward + sampling ≈ period; the leftover GPU idle is under 0.05 ms. So what was hiding it worth? It comes down to a tug-of-war between two things — how much of a step you manage to tuck away, against a small penalty for running ahead:

    speedup = (T_block / T_pipe) × (1 − z)

    Here T_block / T_pipe is the bubble hidden, and (1 − z) is the zombie tax.

    Two symbols, two ideas. The first term is the win, and it's the whole GPU-speed story: how long a step takes blocking (T_block) over how long it takes pipelined (T_pipe) — i.e. how much faster the step runs once the bookkeeping is tucked underneath it.

    The second, z, is the price of running ahead — the zombie tax from Mechanism 3. Launch step t+1 before committing t, and a sequence that just finished still has a forward in flight: a wasted step. On a single stream that's one wasted forward for every L tokens the request generated, so about 1% at L ≈ 110. Pack a batch, though, and it nearly vanishes — the zombie is just one more row in a step that's already paying full price to stream the weights, so it rides along almost free. The tax bites hardest at one stream and fades exactly where throughput lives, which is why predicting it needs both L and the batch size.
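The formula is easy to evaluate by hand. The sketch below assumes the single-stream tax is roughly z = 1/L, as the paragraph above suggests, and plugs in hypothetical step times of 5 ms blocking and 4 ms pipelined; those two timings are illustrative, not measurements.

```python
def zombie_tax_single_stream(tokens_per_request: float) -> float:
    """Assumed single-stream tax: one wasted forward per L generated tokens."""
    return 1.0 / tokens_per_request


def predicted_speedup(t_block_ms: float, t_pipe_ms: float, z: float) -> float:
    """speedup = (T_block / T_pipe) * (1 - z)."""
    return (t_block_ms / t_pipe_ms) * (1.0 - z)


z = zombie_tax_single_stream(110)  # a little under 1%
speedup = predicted_speedup(t_block_ms=5.0, t_pipe_ms=4.0, z=z)
print(f"z={z:.4f}, speedup={speedup:.3f}x")
```

At large batch sizes the post argues the tax nearly vanishes, which corresponds to calling predicted_speedup with z close to 0.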

    Here's that step, measured both ways — blocking idles each step while the CPU commits the last token and re-launches; pipelining runs that work (and the async mask upload) underneath the forward, so the forwards never stop:

    Now put real numbers in it. Measure each piece on its own — the two step times and L — and the model's prediction should land on what the benchmark actually delivers (depth-1 blocking vs depth-2 pipelined, nothing else changed):

    Three things to read out of it:

    1. The win grows with GPU speed. Same workload, +12% on a 3090 but +35% on a B200 at 32 streams. The bookkeeping is GPU-speed-independent, so as the forward shrinks — faster memory, or a smaller model — the bubble is a bigger share of the step. Pipelining is insurance against the GPU getting faster, which for us is the same thing as the model getting smaller.
    2. The zombie tax is real but small, and it amortizes. At one stream the zombie is a whole wasted forward — about 1% at L≈110. At batch it's one extra row in a step that's memory-bound on the weights, not the row count, so it costs almost nothing: at 32 streams the 3090's observed +11.6% lands right on the no-zombie per-step ratio. The tax bites at a single stream and fades exactly where throughput lives. (The B200's 32-stream row sits a few points under prediction for a duller reason — at ~4 ms/step the whole run is under half a second, so prefill and the end-of-run batch ramp-down are a visible slice of the wall.)
    3. It only pays once the bubble is actually hideable. (This is how we caught a bug, in fact: the pipelined numbers came out at blocking speed, traced to an accidental synchronous copy while building the constrained-decode mask. Moving it to the copy stream was worth +11% on the 3090 and +34% on the B200.)

    It's never just one thing

    That's the whole technique: ping-pong slots so two steps don't collide, a forward/sampling split so even constrained decoding can run ahead, and a little zombie refcounting so finished requests tear down cleanly. The GPU stops waiting on the CPU, and you get back anywhere from a few percent to a third of your decode throughput, with larger gains the faster your accelerator or the smaller your model.

    But Photon isn't fast because of this one technique, or any single technique. It's fast because dozens of these details compound across the serving stack: how we resize and tile images on the way in, the kernels that run the model, the scheduler ordering here, and the synchronization points we remove from the hot path. No one piece is the whole story; the stack gets fast when enough of them line up.

    We'll keep writing these up, one corner of the stack at a time. Follow us on Twitter so you don't miss the next one. And keep an eye out for Photon 2.0, coming soon: we can't share details yet, but it's a big one.

  • May 1, 2026
    • Date parsed from source:
      May 1, 2026
    • First seen by Releasebot:
      May 2, 2026
    Moondream

    Photon 1.2.0: Faster Inference, Now on Mac, Windows, Blackwell, and Jetson Thor

    Moondream ships Photon 1.2.0 with faster local vision AI and broader native hardware support across Apple Silicon, Windows x86_64, NVIDIA Blackwell, Jetson Thor, and existing GPUs. It lowers latency, boosts throughput, and makes on-device deployment easier.

    Production vision AI that runs everywhere — now faster, on more hardware.

    Moondream's mission is simple: production vision AI that runs everywhere. Most cloud VLMs take seconds to respond, which doesn't work for systems that need to be fast and often run on-device or at the edge. So we built the full stack ourselves: our own models, a fine-tune service (Lens), and an inference engine (Photon). Today's Photon update in the Moondream 1.2.0 release makes it faster still.

    Getting Started

    To install:

    pip install moondream
    

    Then run locally by setting local=True:

    import moondream as md
    from PIL import Image
    
    model = md.vl(api_key="YOUR_API_KEY", local=True)
    image = Image.open("photo.jpg")
    
    print(model.caption(image)["caption"])
    

    That local=True flag is the important part. It tells Moondream to run inference on your machine using Photon instead of sending the request to the hosted API. With Photon 1.2.0, local Moondream inference now supports:

    Platform | What's new
    Apple Silicon | Native inference on M-series Macs
    Windows x86_64 | Native CUDA inference (no WSL or Linux containers required)
    NVIDIA Blackwell | Support for B200 and RTX PRO 6000
    NVIDIA Jetson Thor | Edge inference on JetPack 7 / CUDA 13
    Existing NVIDIA GPUs | Faster prefill, MoE, dispatch, and tail latency

    The result: Moondream is now easier to deploy across laptops, workstations, edge devices, and production GPU servers.

    Why This Release Matters

    Production vision AI depends on more than model quality. It needs to:
    • be strong and fast enough for real applications.
    • run on the hardware teams already use.
    • work outside a single cloud or GPU environment.
    • be simple enough to install and ship.

    Photon 1.2.0 improves Moondream across all of those dimensions. It expands native hardware support, reduces setup complexity, improves single-request latency, and increases throughput on both new and existing GPUs.

    That matters for applications like:

    Use case | What Photon improves
    Interactive image apps | Faster answers from a single request
    Production APIs | Higher request throughput
    Robotics and inspection | Local inference without cloud round trips
    Desktop tools | Native Mac and Windows support
    Edge devices | Vision AI where network latency or privacy matters
    Private workflows | Images can stay on-device

    Native Moondream Inference on Apple Silicon

    Photon now runs on Apple M-series Macs starting with macOS 13 Ventura and Python 3.12. Photon uses native Metal kernels across the decode path, including paged attention, rotary embeddings, KV cache management, MoE routing, sampling, and layer norm. KV cache sizing is automatically tuned to the Mac's unified memory.

    Reference performance on ChartQA, batch size 4, direct mode:

    Hardware | Moondream 2 | Moondream 3
    MacBook Pro, M5 Max, 48 GB | 7.26 requests/sec | 4.58 requests/sec
    Mac mini, M2, 24 GB | 0.79 requests/sec | 0.55 requests/sec
    Mac mini, M4 base, 16 GB | 0.84 requests/sec | —

    Apple Silicon support makes local Moondream development much more practical: demos, prototypes, desktop apps, and privacy-sensitive workflows can run directly on a Mac.

    Native Windows Support

    Photon now supports native Windows x86_64 inference. This is not a Linux wrapper. Photon's kernel-loading runtime has been rebuilt to support Windows directly, including MSVC compatibility, Windows DLL loading semantics, and cross-platform library naming across the kernel stack. Windows systems now run the same CUDA kernels as Linux x86_64.

    Low-Latency and High-Throughput Inference on Blackwell

    Photon 1.2.0 adds support for NVIDIA Blackwell, including B200 data-center GPUs and RTX PRO 6000 workstation GPUs.

    B200 is now the fastest hardware Photon supports:

    Hardware | Model | Single-request latency | Batch 64 throughput
    NVIDIA B200 | Moondream 2 | ~23 ms | 93.61 requests/sec
    NVIDIA B200 | Moondream 3 | ~30 ms | 71.27 requests/sec

    The single-request latency is derived from batch size 1 performance. This is the number that matters when an application needs an answer immediately. The batch 64 number shows high-volume throughput. This is the number that matters when a system is serving many requests at once. At batch size 64, B200 is:

    Model | Speedup vs. H100
    Moondream 2 | 1.49× faster
    Moondream 3 | 1.23× faster
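A quick calculation relates the figures in the two B200 tables above. It is a sketch that uses only the published numbers; the single-stream rate treats the approximate latencies as exact, and the H100 throughput is merely implied by dividing by the stated speedup, not a separately reported measurement.

```python
b200_latency_ms = {"Moondream 2": 23.0, "Moondream 3": 30.0}      # approximate
b200_batch64_rps = {"Moondream 2": 93.61, "Moondream 3": 71.27}
speedup_vs_h100 = {"Moondream 2": 1.49, "Moondream 3": 1.23}


def single_stream_rps(model: str) -> float:
    """Requests per second if requests ran one at a time at the quoted latency."""
    return 1000.0 / b200_latency_ms[model]


def implied_h100_rps(model: str) -> float:
    """Batch-64 H100 throughput implied by the B200 figure and the speedup."""
    return b200_batch64_rps[model] / speedup_vs_h100[model]


for model in b200_batch64_rps:
    print(
        f"{model}: ~{single_stream_rps(model):.1f} req/s one at a time, "
        f"{b200_batch64_rps[model]} req/s at batch 64, "
        f"implied H100 ~{implied_h100_rps(model):.1f} req/s"
    )
```

Batching roughly doubles throughput over one-at-a-time serving on the B200, at the cost of each request waiting on a larger step.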

    Photon also supports RTX PRO 6000, which reaches 39.3 requests/sec on Moondream 2 and 39.7 requests/sec on Moondream 3 at batch size 64. Under the hood, this release includes Blackwell-specific MoE kernels and dedicated Blackwell flash-attention kernels for both decode and prefill. The practical result is lower latency for interactive workloads and higher throughput for production serving.

    Edge Inference on Jetson Thor

    Photon now also supports NVIDIA Jetson AGX Thor 64 GB on JetPack 7.

    This brings Moondream to a new class of edge deployments: robotics, inspection systems, kiosks, vehicles, cameras, and embedded vision products where cloud inference may add latency, cost, or privacy concerns.

    Reference performance:

    Hardware | Model | Single-request latency | Batch 64 throughput
    Jetson AGX Thor | Moondream 2 | ~152 ms | 14.53 requests/sec
    Jetson AGX Thor | Moondream 3 | ~147 ms | 12.05 requests/sec

    That means Moondream can run locally on Jetson Thor and return vision-language answers in well under a second.

    Photon also now ships a multi-CUDA Linux aarch64 wheel. The same install works across Jetson Thor, Jetson Orin, and GH200 systems. Photon selects the correct CUDA build automatically: CUDA 13 for Thor on JetPack 7, and CUDA 12 for Jetson Orin on JetPack 6 and for GH200 systems.

    Faster on Existing NVIDIA GPUs

    Photon 1.2.0 also improves performance on existing NVIDIA hardware, including L40S, RTX 4090, Jetson Orin, A100, A10/A10G, L4, and RTX 6000.

    The main improvements are:

    Improvement | Impact
    Faster FP8 prefill on Ada and Jetson Orin | Better performance for FP8 KV cache deployments
    New native paged flash-attention kernels | Faster prefill and decode paths
    Faster MoE inference | Better Moondream 3 performance across GPUs
    Lower per-call dispatch overhead | Faster batch 1 and small-batch inference
    More consistent tail latency | More predictable application performance

    Small-batch performance is especially important in real applications. When a user asks a question about an image, batch size 1 latency determines how fast the answer comes back. Photon 1.2.0 reduces overhead in that path while also improving throughput for larger production batches.

    Conclusion

    Photon 1.2.0 expands where Moondream can be deployed and improves how fast it responds. Full benchmark details, including additional batch sizes and chain-of-thought mode results, are available in PERFORMANCE.md.

    With Moondream, you don't have to compromise. You can get sophisticated visual reasoning at near-realtime speeds, and it runs everywhere. Got a production-level vision challenge? Contact us, we'd love to talk.

  • Apr 16, 2026
    • Date parsed from source:
      Apr 16, 2026
    • First seen by Releasebot:
      Apr 17, 2026
    Moondream

    Lens: Moondream's Finetune Service

    Moondream launches Lens, a fine-tuning product for production-ready vision AI. It helps users improve model accuracy with simple pay-as-you-go RL and supervised fine-tuning, then run the tuned model in Cloud or locally with Photon.

    Lens in Action: PTZOptics

    VLMs are a big leap for vision AI. They reason about vision at a higher level, and they're much easier to use than the previous generation of models. But they've been hard to put into production for three reasons. First, they're slow, and production systems often need real-time decisions. Second, they struggle to run locally, which production systems often require for security, reliability, or cost. Third, they suffer from a "last-mile" problem: the VLM looks promising in the lab, but in the real world, accuracy falls short.

    Moondream is a different kind of VLM. It's purpose-built for production systems, and unsurprisingly, we've been tackling these three problems. First, we built small models that are lightweight enough to run everywhere. Then we launched Photon, our inference engine that achieves 20ms inference time on an H100. And today, we're happy to announce the launch of Lens, our fine-tuning product that solves the "last-mile" problem.

    We've been working with a partner, PTZOptics, which makes network-attached, remote-controlled cameras. In many cases, customers want the camera to act as if it had a smart camera operator controlling it: following the action of a soccer game, zooming in and out at crucial times in a presentation, or detecting anomalies for security or operational reasons.

    With Moondream, this is now a reality. You can have the camera track complicated things ("the person in the red shirt"), take inventory of what's shown, or get alerted when actions occur ("someone's hurt"). And with Lens, you can teach Moondream new skills, or tune it when the accuracy is lacking.

    Simple API, Pay-as-You-Go

    Lens is a simple API that provides fine-tuning, through both reinforcement learning and supervised fine-tuning. There's no hardware to set up or binaries to worry about. And it's simple pay-as-you-go. We've seen great results with as few as a dozen images.

    As soon as you're done improving the model, you can invoke it immediately through our Cloud, or run it locally with Photon. It's the easiest and fastest way to go from fine-tune to production.

    See the Difference

    Here are a few examples of Lens fine-tunes across very different domains. In each case, the base model struggled, and the fine-tuned model nailed it.

    Broadcast Sports: Detecting the Ball Handler

    We fine-tuned Moondream to detect the player with the ball in NBA broadcast footage. The base model returned dozens of false positives (red boxes). After fine-tuning with RL, it finds just the ball handler (green box). F1 jumped from 28% to 79%, and false positives dropped from 61 to 2.

    Before fine-tuning
    After fine-tuning

    Training took 54 minutes and cost $16.89.
    See the full interactive example →

    Geolocation: Country Identification from Street View

    We trained Moondream to identify countries from street-view imagery. The base model guesses the wrong continent entirely. After fine-tuning with just 25 images per country, it reads road markings, signage, and landscape cues correctly, beating GPT-5.4's 69.8% accuracy with 71.1%.

    Before fine-tuning
    After fine-tuning

    See the full interactive example →

    Medical Imaging: Glaucoma Staging

    We fine-tuned Moondream to classify retinal images by glaucoma severity. The base model defaulted to "early" for nearly every image. The fine-tuned model distinguishes severity correctly, performing 2x better than GPT-5.4.

    Before fine-tuning
    After fine-tuning

    Training took 47 minutes and cost $15.68.
    See the full interactive example →

    You can explore all of our fine-tune examples at moondream.ai/p/lens.

    Need Help Getting Started?

    For customers that are new to vision AI and fine-tuning, we offer help. Our dedicated production team can work with you to deliver a fine-tuned model for your specific use case. Contact [email protected] for more info.

    What We've Been Building Toward

    This is what we've been building toward. VLMs are finally in a form factor that fits the real world.

    Try Lens today, and if you're at NAB next week, come see the PTZOptics demo live. Questions? Drop us a line at [email protected].

  • Mar 25, 2026
    • Date parsed from source:
      Mar 25, 2026
    • First seen by Releasebot:
      Mar 26, 2026
    Moondream

    Photon: Real-Time Vision AI Is Finally Here

    Moondream introduces Photon, a faster production inference engine for vision AI, delivering 2x speed over similar vLLM setups and over 60 inferences per second on H100s. It is built for real-time image and video analysis across edge to data center.

    The era of production vision AI isn't coming. It's here.

    Vision Language Models (VLMs) changed the game. Instead of building custom CV pipelines for every task, you can now just prompt a model about an image in plain language. That alone made vision AI easier and cheaper to adopt. But VLMs also unlocked something deeper: visual reasoning that simply wasn't possible before. Problems that were out of reach for traditional AI systems are now solvable, and almost anyone can afford to try.

    The result has been an explosion of new vision AI applications. Manufacturing defect detection. Broadcast video analysis. Retail inventory and loss prevention. What used to be research-grade problems are now powering a new wave of startups, and Moondream is at the center of many of them.

    But there's a gap between what VLMs can do and what they can do fast enough to matter.

    Most people's experience with a VLM looks like this: you ask it a question about an image, wait a few seconds (sometimes tens of seconds), and get an answer back. The answers are often impressive. The wait is often a dealbreaker. When you're processing live video, running a manufacturing line, or making real-time decisions, a few seconds of latency kills the use case entirely.

    We heard this over and over from customers. They wanted everything Moondream offers: the accuracy, the grounding, the ease of use. But they needed it faster than any VLM had delivered before.

    Photon is our answer.

    Why We Could Build This

    Photon isn't just fast inference code. The real advantage is that we own the entire stack. We design the model. We design the inference engine. We design the fine-tuning platform and the deployment tools.

    We made architectural decisions at model design time to optimize for the hardware we actually deploy on. We knew which GPU operations would matter on which chips, and shaped the model around that. You can't retrofit those decisions onto an existing model. Compared to similar-sized models on vLLM, Photon is 2x faster. You don't get that from better kernels alone.

    On an H100, Photon delivers over 60 inferences per second. That's frame-by-frame video processing on server-class hardware. On edge devices, including older ones limited by supply chain realities, we still deliver meaningful throughput.
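    As a rough back-of-the-envelope sketch of what that throughput means for video: the 60 inferences per second is the H100 figure quoted above, while the 30 fps camera rate is an assumed example and not a Moondream number.

```python
# Back-of-the-envelope budget. 60 inferences/s is the H100 figure quoted above;
# the 30 fps camera rate is an assumed example, not a Moondream number.

def inferences_per_frame(inferences_per_second: float, frames_per_second: float) -> float:
    """How many inferences fit into one frame interval at a sustained throughput."""
    return inferences_per_second / frames_per_second


def average_ms_per_inference(inferences_per_second: float) -> float:
    """Average time per inference implied by a throughput figure (not a latency guarantee)."""
    return 1000.0 / inferences_per_second


if __name__ == "__main__":
    print(round(average_ms_per_inference(60), 1))  # 16.7
    print(inferences_per_frame(60, 30))            # 2.0
```

    Under those assumptions there is room for about two inferences on every frame of a 30 fps stream.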

    Here's what matters in practice: production vision AI systems rarely run just one inference per image. You're often analyzing the same frame in multiple ways. Photon gives you the headroom to do that.

    What This Changes

    Live broadcasting with real-time moderation. Manufacturing lines running at full speed with frame-by-frame defect detection. Security systems that keep pace with camera feeds. These were theoretically possible before. Now they're operationally viable.

    Speed also affects cost. When you run inference faster on a GPU, each inference gets cheaper. Photon supports operation batching, which lets you trade slightly higher per-inference latency for much better total throughput. The result is that real-time image and video analysis can fit much tighter budgets than before.

    Moondream Is Now Production-Ready, End to End

    Moondream is becoming a complete stack for production vision AI. The model works well at grounding tasks across industries. Lens, our upcoming fine-tuning platform, makes it easy and cheap to improve accuracy on your specific use case. And now Photon gives you a best-in-class inference engine that runs everywhere, edge to data center, with full support.

    Getting started takes minutes, not days:

    pip install moondream
    
    import moondream as md
    from PIL import Image
    
    model = md.vl(api_key="YOUR_API_KEY", local=True)
    image = Image.open("photo.jpg")
    
    print(model.caption(image))
    # => {"caption": "A golden retriever sitting on a park bench, looking ..."}
    

    Moondream is free to download and run however you want. Photon is for teams that need faster, production-ready performance. See pricing for details and the documentation to get started.

    What's Next

    Lens, our fine-tuning product, is launching soon. More hardware support for Photon is on the way. As both products mature, they'll integrate more tightly so you can fine-tune on your data and deploy through Photon in a single step.

    We're going to stay focused on making Moondream the best production-ready VLM. Faster. Less memory. Lower cost. Running everywhere.

    Original source
  • Mar 10, 2026
    • Date parsed from source:
      Mar 10, 2026
    • First seen by Releasebot:
      Mar 11, 2026
    Moondream logo

    Moondream

    Moondream Segmenting Update: Better Masks, Better Benchmarks, 40% Faster

    Moondream announces a faster, higher-accuracy segmenting upgrade, now live on Moondream Cloud. The new model improves RefCOCO benchmark scores, produces native SVG masks, and handles referring expressions better at lower latency. Local inference will follow later in the week.

    We introduced segmenting as a Moondream skill in September 2025 with Moondream 3 Preview. It launched with state-of-the-art scores on segmenting benchmarks. Despite the launch of several segmenting vision models since then, Moondream remains top dog.

    Today, we're excited to announce that we've raised the bar even further with an improvement now live on Moondream Cloud. This new version produces better segmenting results, achieves better benchmark scores, and does it 40% faster than before.

    Examples

    • Car closest to the top of lombard street
    • castle-like building
    • Headlights
    • man wearing blue shirt and jeans, standing near the left railing of the bridge, looking down
    • Runner in the lead with the longest hair
    • Transamerica Pyramid
    • Waldo wearing the number 25317
    • White 911

    Moondream Segmenting Recap

    Put simply, what makes Moondream segmenting different is that it:

    • produces native SVG masks (vectors, not bitmasks)
    • achieves state-of-the-art scores on segmentation benchmarks
    • offers quick inference speeds, even if raw speed is not the only thing we optimize for
    • supports deep, native referring capabilities such as "the person touching the door"

    Benchmark Improvements

    This latest segmentation model delivers a significant leap in performance across all major referring expression benchmarks. On RefCOCO+ Val, which tests attribute-based reasoning without positional cues, we achieve 79.1 mIoU, a 4.4-point improvement over the previous state-of-the-art (which was also Moondream!). RefCOCOg Val, which evaluates complex natural language descriptions, sees similar gains at 80.7 mIoU. We also report 88.2 mIoU on RefCOCO-M, our high-fidelity benchmark with pixel-accurate masks, underscoring that these gains translate to real-world precision, not just benchmark optimization.

    | Model (mIoU) | RefCOCO Val | RefCOCO+ Val | RefCOCOg Val | RefCOCO-M |
    |---|---|---|---|---|
    | Old | 81.8 | 74.7 | 76.4 | 86.9 |
    | New | 83.2 (+1.4) | 79.1 (+4.4) | 80.7 (+4.3) | 88.2 (+1.3) |
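
    For readers unfamiliar with the metric, here is a minimal sketch of intersection-over-union on binary masks. It assumes mIoU means per-example IoU averaged over the dataset and reported as a percentage; the post does not spell out the exact evaluation protocol, and the tiny masks below are made up.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    scores = [iou(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores)


# Made-up 2x3 masks: they overlap in the middle column only.
PRED = [[1, 1, 0], [1, 1, 0]]
TRUTH = [[0, 1, 1], [0, 1, 1]]

if __name__ == "__main__":
    print(iou(PRED, TRUTH))                               # 2 shared pixels / 6 covered pixels
    print(mean_iou([(PRED, TRUTH), (TRUTH, TRUTH)]))      # average of 1/3 and 1.0
```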

    How We Compare

    Most segmentation-capable VLMs are either accurate or fast, but not both. Large multimodal models with bolted-on segmentation decoders can handle complex queries, but they are slow and expensive to run at scale. Lightweight models are fast, but they choke on anything beyond simple noun phrases. Moondream closes this gap: state-of-the-art accuracy at speeds that make latency-sensitive and high-throughput applications practical.

    Moondream vs. SAM 3

    SAM 3 can segment generic concepts like "car" or "person", but it can't natively resolve referring expressions. For prompts like "the person touching the door" or "laundry on the floor," you need to pair it with a larger reasoning model that adds tens of seconds of latency and drives up cost. Moondream handles complex prompts natively and returns crisp, higher-quality SVG masks at a 5x lower price point.

    Conclusion

    This update is live now on Moondream Cloud. If you're already using segmentation, you get better quality and lower latency immediately. Later this week, we'll also be releasing the model for local inference, along with a technical whitepaper for those who want to go deeper. Learn more about Moondream's segmentation skill at /skills/segment.

    Original source
  • Dec 19, 2025
    • Date parsed from source:
      Dec 19, 2025
    • First seen by Releasebot:
      Dec 20, 2025
    Moondream logo

    Moondream

    We added Moondream 3 Preview support to Moondream Station

    Moondream Station launches on Mac with Moondream 3 Preview support, delivering native MLX performance and quantized models for Apple Silicon. The one-click installer runs on Mac, Windows, and Linux and delivers snappy inference, with 35+ tokens per second on an M1 Max.

    Moondream Station

    Moondream Station is our free on-prem client for Moondream. A one-click (or one-command) installer makes it a snap to get Moondream running on your Mac, PC, or Linux box. Today we're happy to announce Moondream 3 Preview support on Mac. Try it out for yourself (works on Mac, Windows, and Linux):

    pip install moondream-station

    Built for Apple Silicon

    To get the most out of Apple Silicon, we built Mac inference to be fully MLX native and added quantized Moondream 3 support. The result is snappy performance. You'll need a Mac with at least 16GB of memory. On an M1 Max with 64GB, we are seeing over 35 tokens per second. Here's a demo of how it works on that M1 Max:

    Moondream uses dedicated grounding tokens, so any x or y coordinate only requires one token. This means inferences for grounded skills like point or detect feel near instantaneous.
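
    A rough sketch of why that feels instantaneous: with one token per coordinate, decode time scales with the number of coordinates. The 35 tokens per second is the M1 Max figure above. The estimate ignores image encoding, prompt processing, and any separator or end tokens, so treat it as a lower bound on decode time only.

```python
def decode_time_ms(num_coordinates: int, tokens_per_second: float) -> float:
    """Decode time if every x or y coordinate costs exactly one token."""
    return 1000.0 * num_coordinates / tokens_per_second


if __name__ == "__main__":
    # One point is an (x, y) pair, i.e. two coordinate tokens.
    print(round(decode_time_ms(2, 35.0), 1))   # 57.1
    # Three points, six coordinate tokens.
    print(round(decode_time_ms(6, 35.0), 1))   # 171.4
```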

    Using the API

    Once Moondream Station is running, you can connect to it using our Python client:

    # pip install moondream
    import moondream as md
    from PIL import Image
    
    # Connect to Moondream Station
    model = md.vl(endpoint="http://localhost:2020/v1")
    
    # Load an image
    image = Image.open("path/to/image.jpg")
    
    # Ask a question
    answer = model.query(image, "What's in this image?")["answer"]
    print("Answer:", answer)
    

    What's next

    We are planning to make more improvements to Moondream Station over the next few weeks. If you have ideas or requests, reach out on Discord.

    Happy holidays from the Moondream team.

    Original source
  • Jan 9, 2025
    • Date parsed from source:
      Jan 9, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection

    Moondream 1.9B arrives with a new Gaze Detection capability, stronger OCR, and easier structured output across JSON, XML, Markdown, and CSV. It benchmarks against top small vision models and invites developers to try in the playground or download.

    Today, we’re announcing a new release of Moondream 1.9B. It has improvements across a bunch of areas and includes a new capability, Gaze Detection. This release marks the first time we’ve focused on industry benchmarks, and we’re excited to share some results. Despite these upgrades, the model is still just 1.9B, so it's fast and can run everywhere. Try it out in our playground or download it now.

    1. Structured Output

    Building with Moondream is easier than ever with our new support for structured output formats such as JSON, XML, Markdown, and CSV. Here are some examples:

    Example 1: JSON structured output

    Example 2: XML structured output

    Example 3: Markdown structured output

    2. New Capability: Gaze Detection

    Traditional Vision AI consists of specialized models built for different tasks like “object detection” (outline a specified object's region in an image) or “captioning” (create a caption for an image). Moondream supports several of these common Vision AI tasks as “capabilities,” all within a single model. Moondream already supports object detection and captioning, as well as “visual querying” (ask any question to a photo) and “pointing” (get the x,y coordinates of elements within a photo).
    Today, we are excited to launch a new capability: Gaze Detection.
    This capability tracks human attention. Note that this capability is experimental. We’re releasing it to get feedback from developers so we can improve it over time.

    Example 1: Driver Gaze Detection

    Example 2: Sport Gaze Detection

    3. Benchmarks

    We’ve always been a bit iffy about benchmarks. Some focus on problems we don’t think are relevant to Moondream (e.g., solving math equations). Others include weird questions and wrong answers (at least to us — see the Weird Benchmarks appendix below). And focusing too much on benchmarks can lead to weird behaviors, with allegations that some models "cheat" by training on the actual benchmarks themselves.
    Despite this, we decided to improve our scores because we don’t want anyone sleeping on Moondream because of low results. We benchmarked ourselves along with the top small vision language models.
    You can find our individual benchmark results below:

    4. Better OCR

    We made changes to Moondream’s vision layer that have significantly improved text reading/OCR. We’ve also trained it on a lot more document querying and understanding data. Here are some examples:

    Example 1: OCR Example

    Example 2: Chart OCR Example

    Looking Ahead

    As pumped as we are about this release, the best part, for us, is seeing what you build with it. VLMs are making it faster, cheaper, and easier than ever to build next generation vision-enabled apps. Getting set up takes minutes, or you can try out Moondream in our playground. We offer Cloud inference with a generous free tier, or you can download it and run it yourself. Check out our docs for a getting started guide and lots of sample code.

    Happy Moondreaming!

    Appendix 1: Weird Benchmark Questions

    Here’s a few examples of weird benchmark questions...

    Example 1: Confusing Benchmark Question
    In GQA, the following image has a question that asks “Is the traffic signal on the right side or the left?” If you look closely, you can see there are traffic lights on both sides of the street. However, GQA expects the answer to be “Left.”

    Example 2: Nonsensical Benchmark Question
    In the following image, GQA asks “What animal sits in the bench that is on the right side?" It expects the answer to be “bird” 🤯.

    Original source
  • Mar 28, 2025
    • Date parsed from source:
      Mar 28, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-03-27 Release

    Moondream unveils a new release with long captions, image tagging, and a faster transformers client, plus near-state-of-the-art object detection gains. Built on real user feedback, this update is ready to try today in the playground or via download.

    We're excited to announce a new Moondream release. There's a lot to unpack, but we'll give you the highlights here, and share more over the coming weeks. The improvements in this release were driven by real-world usage and feedback from our community and customers. We want to extend a huge thank you to everyone who contributed to that. Keep it coming, let's gooooo!

    Longer

    The ability to caption images is one of the top Moondream use cases. Super accurate captions from a fast, efficient, and easy to run model seems to be a winning combo! Use cases range from synthetic data generation to real-world understanding and robotics.
    Until now, Moondream offered a choice of "Short" or "Normal" length captions. This model introduces "Long" format. From our testing, this generates roughly 2x longer captions than "Normal".

    Better

    We're still doing exhaustive evals, but so far we've seen major improvements on object detection (COCO mAP), OCR (OCRBench), and counting (CountBenchQA):
    We had some customers request improvements in our object detection capability. We were excited to work on that, and we're especially pleased with the results. This new COCO mAP score now makes Moondream near state of the art on object detection.
    We also had a customer with a specific need: the ability to tag all the things visible in an image. While we don't have a public benchmark available to highlight, our internal benchmark and vibe checks show a huge improvement in this ability.
    We call it "Image Tagging," and you can try it out by using this prompt in your image query:
    "List all visible objects, features, and characteristics of this image. Return the result as a JSON array."
    Here's an example of how it works:
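
    A minimal sketch of how the tagging prompt's output might be consumed in code. The prompt string is the one quoted above; `parse_tags` is a hypothetical helper, the sample answers are made up, and the assumption that the answer may carry extra text around the JSON array is ours.

```python
import json
from typing import List

# The prompt quoted in the release notes.
TAGGING_PROMPT = (
    "List all visible objects, features, and characteristics of this image. "
    "Return the result as a JSON array."
)


def parse_tags(answer: str) -> List[str]:
    """Extract a JSON array of tags from a model answer; return [] if none is found."""
    start = answer.find("[")
    end = answer.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return []
    return [str(item).strip() for item in items]


if __name__ == "__main__":
    # With a real client this string would come from something like:
    #   answer = model.query(image, TAGGING_PROMPT)["answer"]
    sample_answer = '["dog", "park bench", "grass"]'  # made-up output
    print(parse_tags(sample_answer))
```

    Returning an empty list on malformed output keeps downstream code simple; stricter callers may prefer to raise instead.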

    Faster

    We have a client update planned in the next few weeks that includes mobile, but in the meantime, we snuck in one key improvement to our transformers-based client. It just got a lot faster.
    From our testing, calling compile() on the model sped up inference from 61.4 tok/s to 123.4 tok/s on an Nvidia 3090.
    That not only makes it cheaper to run, it also opens up more possibilities for near-realtime processing, especially for video streaming.

    Stronger

    Moondream's improvements are driven by the feedback and engagement of its growing community and customers.
    This is creating a flywheel effect, where the feedback and requests from the community drive us to make more improvements to the model, and these improvements drive more adoption from an ever-growing community.
    In other words, you're part of the reason Moondream keeps growing. We extend our tip of the hat to you, our Moondreamers.

    Conclusion

    We'll be sharing more details over the next few weeks, but the great news is that you don't have to wait.
    You can download the model now, or go kick its tires in our playground, or even better yet, build something with our free cloud offering.

    Original source
  • Apr 14, 2025
    • Date parsed from source:
      Apr 14, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 2025-04-14

    Moondream 2025-04-14 introduces a more efficient VLM with improved document and chart understanding, better OCR, UI comprehension, and stronger counting benchmarks, all while reducing compute. The note also outlines training tricks and the roadmap for future upgrades.

    Moondream Release Notes

    Moondream is not just one of the tiniest VLMs. It's the world's most efficient VLM. It produces highly accurate answers with the least computing possible. On a graph with intelligence on the Y axis and resource usage on the X axis, Moondream aims to be top left.
    AI is still early in its development, and just as we've seen mainstream computing evolve from mainframes to desktops, mobile, and even smaller devices, so will AI. This is especially true for Vision AI. Almost every physical device is improved if it can reason about its surroundings. Today this often means streaming video or images back to the cloud, which is slow, costly, and privacy-problematic. But with improving hardware and models, this is about to change, and today is another solid step forward.
    But efficiency isn't just for the edge, it's for the cloud too. Analyzing vision at scale can be costly. Customers turn to Moondream when scale becomes a concern. Analyzing millions of images, or thousands of hours of video with Moondream is more cost effective than with any other VLM.
    That's why we're excited to announce Moondream 2025-04-14. As usual, we've improved on all of the benchmarks we focus on, with some notably large improvements. Let's first look at how it stacks up vs our previous release just a few weeks ago:
    [Comparison chart with previous Moondream version]
    Now let's see where that puts us vs other top open source small VLMs.
    [Comparison chart with other small VLMs]

    Tech Notes

    This release of Moondream was trained on about 450B tokens. For contrast, models like Gemma 3 4B have been trained on 4 trillion tokens, and Qwen 2.5 VL utilized 18 trillion tokens for text modeling, plus an additional 4 trillion tokens for their VLM. Our efficiency in producing a high performance model with a fraction of the training budget is the result of:

    • High-Quality Data: We've observed that small models are especially sensitive to noisy data. We produce training data that contains both rigorously filtered real-world data and carefully crafted synthetic data, designed to minimize domain gaps.
    • Focused Scope: Moondream is specifically designed for developers creating computer vision applications. We prioritize relevant capabilities over broader use-cases like multi-turn conversations or haiku writing.
    • Training Techniques: We've developed a set of training methods that maximize training efficiency. We keep most of them proprietary, but here are two we're disclosing today:
      • We use a custom second-order optimizer, crucial for balancing conflicting gradients, such as object detection versus text generation tasks.
      • We use a self-supervised auxiliary image loss that significantly accelerates model convergence.

    Our focus this release was to improve on our previous one by targeting document understanding, charts, and user interfaces. Moondream has become quite proficient at reading documents. Here are examples of document and layout understanding:
    [Document understanding example]
    [Layout understanding example]
    This improvement in document and text reading has also yielded sizeable bumps in our text-related benchmarks:
    • ChartQA: Improved from 74.8 to 77.5 (82.2 with PoT)
    • DocVQA: Improved from 76.5 to 79.3
    • TextVQA: Improved from 74.6 to 76.3

    Performance

    Counting

    This release of Moondream has serious counting chops (e.g. "how many birds in this image"). To see how good we got, here's a chart comparing ourselves to all the big names in VLMs.
    [CountBenchQA performance chart]

    Chart Understanding

    Chart understanding has been a key focus for this release. Charts require models to ground text and numbers in a visual layout, then reason over them precisely. On ChartQA, Moondream improves from 74.8 in our last release to 77.5, and 82.2 with Program of Thought (PoT) prompting.
    PoT is a prompting strategy where the model generates and executes code to solve problems step-by-step. This is especially valuable in chart QA, where reasoning failures often stem not from misreading the chart, but from making small but critical logical errors, like summing three correct numbers incorrectly. Rather than expecting the model to always reason flawlessly in natural language, we let it write and run code.
    [Chart understanding example]
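
    To make the PoT idea concrete, here is a minimal sketch. The "generated" program is hard-coded and hypothetical, the three values stand in for numbers read off a chart, and `run_program` is an illustrative helper, not Moondream's implementation. Emptying the builtins is not a real sandbox; executing model-written code in production needs proper isolation.

```python
# Hypothetical program a model might emit for a question such as
# "What is the total of the three bars?". The values are made up.
GENERATED_PROGRAM = """
q1, q2, q3 = 12.5, 17.25, 9.75
answer = q1 + q2 + q3
"""


def run_program(source: str) -> object:
    """Execute generated code in a namespace without builtins and return its `answer`."""
    namespace = {"__builtins__": {}}
    exec(source, namespace)
    return namespace["answer"]


if __name__ == "__main__":
    # The arithmetic is done by the interpreter rather than by the model.
    print(run_program(GENERATED_PROGRAM))  # 39.5
```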

    Here's a few more notes from this update:

    1. To access the OCR capability for docs and tables, use the prompt "Transcribe the text" or "Transcribe the text in natural reading order".
    2. Object detection supports document layout detection (figure, formula, text, etc).
    3. UI understanding has improved, with ScreenSpot F1@0.5 up from 53.3 to 60.3.

    Conclusion

    As excited as we are for this launch, we have a lot more coming up for next releases too. Here's a few areas we're focused on:

    1. Repetition Handling: We've seen an increase in repetitions in our inferences, especially when generating long document answers. We've added temperature and nucleus sampling to reduce output repetition, with a repetition penalty setting coming soon. Training adjustments will further mitigate this issue in future releases.
    2. Tokenizer Upgrade: Our current tokenizer, derived from the three-year-old CodeGen model, hinders optimal training and inference performance. We plan to adopt either a traditional BPE tokenizer (ensuring broad ecosystem compatibility) or a BPE variant (optimized for efficiency).
    3. Bounding Box Accuracy: Currently, the model occasionally generates bounding boxes encompassing multiple items. We have identified the root cause and a solution is forthcoming. Meanwhile, prefixing object detection queries with "(coco)" can help mitigate this issue.
    4. Continued Training: As performance continues to steadily improve, we anticipate training for an additional 200 billion tokens before the next release.
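
    For context on the sampling settings mentioned in item 1, here is a generic sketch of temperature plus nucleus (top-p) sampling over a toy vocabulary. It illustrates the general technique and is not Moondream's decoding code; the token names and logits are made up.

```python
import math
import random
from typing import Dict, Optional


def sample_top_p(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token using temperature scaling followed by nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    nucleus = []
    cumulative = 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = {"the": 2.0, "a": 1.5, "cat": 0.5, "zzz": -3.0}
    print(sample_top_p(toy_logits, temperature=0.8, top_p=0.9, rng=random.Random(0)))
```

    Cutting off the low-probability tail is what keeps unlikely tokens out of the output, while the remaining randomness helps the model escape loops that greedy decoding can get stuck in.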

    We invite you to go check it out for yourself in our playground, or start coding today.
    You can run it locally using our new Moondream Server (Mac and Linux for now, Windows coming…), or in our cloud (with a generous free tier).
    Happy Moondreamin'.

    Original source
  • May 1, 2025
    • Date parsed from source:
      May 1, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream Station Launch

    Moondream Station launches as a one-click way to run Moondream locally, handling download, setup, and updates. It lets you run Moondream from the CLI or via a local port 2020, with Mac live and Windows/Linux coming soon.

    Moondream Station

    tl;dr: Moondream Station is a one-click way to run Moondream locally. It's free, fast, and comes with no technical headaches. Download here.

    After we launched Moondream last year, we hosted a hackathon in Seattle. It was clear right away that Vision Language Models (VLMs) are changing the game. Just by prompting, people were unlocking new ways to work with images.

    What we learned that day stuck with us: even though Moondream can run on a laptop, developers would trip up on the installation and setup. They wanted something that just works. So we launched a cloud version, free.

    However, we also kept seeing people who wanted to run it locally and were still stubbing their toes on the download and setup. So we finally rolled up our sleeves and did something about it. Today we're launching Moondream Station, a one-click solution to running Moondream locally.

    Moondream Station manages all of the tedious parts for you. It manages the download, setup, and updates to run Moondream on your desktop. Once it's running, you can invoke Moondream either through the client command line, or write code that invokes it through a local port 2020 (hah!). Here's a video showing it in action:

    It's Mac only for now, but Windows and Linux are coming soon. We have a bunch of feature updates planned. Drop by our Discord if you have questions or any problems.

    Original source
  • May 21, 2025
    • Date parsed from source:
      May 21, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Fewer bits, more dreams

    Moondream unveils 4-bit quantization to slash memory use and speed up inference. The quantized model hits 99.4% of full precision accuracy with a 42% memory drop and 34% faster performance on RTX 3090. Open source on Linux via Moondream Station; Mac coming soon; HuggingFace access.

    Introduction

    When we compare model sizes, we usually quote the number of parameters. But that doesn't tell the whole story. Depending on other factors, a model with fewer parameters may actually use more memory, and run inference slower than a larger model. That's why we prefer to focus on actual memory size and inference speed.

    4-bit quantization

    Today we're excited to announce a new feature that makes Moondream run faster and use less memory: 4-bit quantization. In case you're not familiar, quantization is a technique that reduces the number of bits used to store a model's weights. For example, weights are usually stored as 16-bit floats, which take 2 bytes each. A 4-bit weight only takes 0.5 bytes.
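
    To illustrate the storage arithmetic, here is a minimal sketch of symmetric 4-bit quantization with a single shared scale, packing two weights into each byte. This is a textbook scheme chosen for illustration; the post does not describe Moondream's actual quantization method, and real schemes typically use per-group scales and other refinements. The example weights are made up.

```python
from typing import List, Tuple


def weight_bytes(num_weights: int, bits_per_weight: int) -> float:
    """Storage needed for the weights alone."""
    return num_weights * bits_per_weight / 8


def quantize_4bit(weights: List[float]) -> Tuple[bytes, float]:
    """Map floats to integers in [-8, 7] with one shared scale, two values per byte."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7
    # `& 0xF` keeps the low four bits, i.e. the two's-complement nibble.
    codes = [max(-8, min(7, round(w / scale))) & 0xF for w in weights]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so values pair up
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_4bit(packed: bytes, scale: float, count: int) -> List[float]:
    """Unpack the nibbles and rescale them back to approximate floats."""
    values = []
    for byte in packed:
        for nibble in (byte >> 4, byte & 0xF):
            signed = nibble - 16 if nibble >= 8 else nibble
            values.append(signed * scale)
    return values[:count]


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.1, 0.0]  # made-up example weights
    packed, scale = quantize_4bit(weights)
    print(len(packed), "bytes instead of", int(weight_bytes(len(weights), 16)))
    print(dequantize_4bit(packed, scale, len(weights)))
```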

    The challenge with quantization is that it can lead to a loss of model accuracy. We've been working on this for a while, and we're excited to share that our 4-bit quantized model reaches 99.4% of the accuracy of the full precision model. In practice you'd probably never notice the difference.

    Performance and Availability

    Meanwhile, you probably would notice the speedup and memory improvement. Peak memory usage is reduced by 42% (from 4.2GB to 2.4GB), and inference speed is increased by 34% (on an RTX 3090), although the speedup may vary by machine. On the accuracy front, we measure the average score on 8 popular vision benchmarks. The 4-bit quantized model achieved an average score of 74.5 vs 74.9 for the full precision model.

    So let's update our chart from our 2025-04-14 model release.

    Both the full precision model and the 4-bit quantized model are available as open source. The 4-bit model is currently available for Linux users in Moondream Station (our one-click solution), with Mac support coming soon. Advanced users can also access the model directly via Hugging Face Transformers.

    Happy Moondreamin'.

    Original source
  • Jun 23, 2025
    • Date parsed from source:
      Jun 23, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream Update: Grounded Reasoning, Better Detection, Faster Generation

    Moondream’s latest release adds grounded visual reasoning for sharper interpretation, improved object detection, and 20–40% faster responses. It can think step by step, audit its reasoning, and switch between reasoning and normal modes for speed or accuracy, via playground, API, or HuggingFace.

    Grounded Reasoning

    Simple tasks, like reading a date off a receipt, need little thought. Harder ones, like finding the median value on a chart, demand real reasoning about where things are and how they relate.

    Moondream now supports grounded reasoning, allowing the model to spend some time to think about the problem and reason precisely about positions and spatial relationships within images before generating an answer. This unlocks performance gains for tasks that depend on accurate visual interpretation.

    Moondream can now pause, look around the picture, and think step-by-step before answering. We call this grounded reasoning because the model can reason, using both logic and visual facts about the image to produce more accurate answers.

    Take chart understanding, for example. Without reasoning, Moondream does its best by essentially guessing the answer in one shot. With reasoning on, it breaks the job into three small steps, then nails the answer.

    Moondream's reasoning is specifically designed for accurate visual reasoning. The model can choose to "ground" its reasoning in spatial positions in the image when needed to accurately solve a task. Consider counting objects in images, for example. When there are more than a couple of instances of an object in an image, the model chooses to explicitly point at them in the reasoning trace, similar to how a human might tackle the same problem.

    The recent Vision Language Models are Biased paper shows that many VLMs suffer from confirmation bias when counting, returning memorized knowledge instead of actually counting when they see familiar objects. As we deploy VLMs in high-stakes applications, it's critical that we are able to ensure and audit that the models are actually reasoning about the image instead of simply performing sophisticated pattern matching. Our approach to visual reasoning not only helps the model reason about images, but also provides a way for users to audit what it's doing and understand failure modes.

    Moondream supports both reasoning and normal queries with the same model, meaning you can trade off accuracy vs speed depending on the complexity of the task you're trying to perform. You can enable reasoning by passing reasoning=True with the query skill. This reasoning mode is powerful but still experimental. For simpler tasks, the original mode may perform better, so we recommend trying both.

    How We Taught It to Think

    We've started using reinforcement learning (RL) to train Moondream. If you're not familiar with RL, here's a short description of how it works. Traditionally, models are trained by asking them questions where the correct answer is known (aka "Ground Truth"). If the model doesn't answer correctly, we apply a corrective change in the model weights to encourage it to answer better next time. This process is called "supervised learning".

    RL works a little differently. We start the same way, with a question where we know the correct answer. With RL, however, we ask Moondream to generate numerous answers using different temperatures, then we grade the answers on how good they are: not only whether the answer is correct, but also whether it used correct reasoning. This is easier with tasks that have a single correct answer (e.g., "What's the sum of the numbers in the table?"). For more open-ended answers (e.g., "Caption this image"), we use another Moondream model to judge the answer.
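
    The grading step can be sketched roughly as follows. This illustrates the idea only and is not Moondream's training code: the reward weights are made up, `reasoning_ok` stands in for whatever check decides that the reasoning was sound, and centering rewards around the group mean is a common RL baseline that we assume here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    answer: str          # final answer produced by the model
    reasoning_ok: bool   # whether the reasoning was judged correct (hypothetical check)
    temperature: float   # sampling temperature used for this attempt


def grade(sample: Sample, ground_truth: str,
          answer_weight: float = 1.0, reasoning_weight: float = 0.5) -> float:
    """Reward a sample for a correct answer and, separately, for sound reasoning."""
    reward = 0.0
    if sample.answer.strip() == ground_truth.strip():
        reward += answer_weight
    if sample.reasoning_ok:
        reward += reasoning_weight
    return reward


def centered_rewards(samples: List[Sample], ground_truth: str) -> List[float]:
    """Rewards relative to the group mean: positive means better than average."""
    rewards = [grade(s, ground_truth) for s in samples]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    attempts = [
        Sample("42", True, 0.2),
        Sample("42", False, 0.7),
        Sample("41", False, 1.0),
    ]
    print(centered_rewards(attempts, "42"))
```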

    So far we've used RL to train Moondream on 55 tasks and the results are impressive. We plan to increase this to ~120 before the next update to the model.

    With smaller models such as Moondream, it's common practice to "bootstrap" the model with reasoning traces from a bigger model. We haven't taken this approach for two reasons. First, our context length is currently limited to 2048 tokens, and this will need to be increased before we can train on longer reasoning traces. Second, most open reasoning models are focused on mathematical and coding reasoning, and this is not as effective for visual reasoning.

    Sharper Object Detection

    Moondream's Object Detection skill just got a lot better with this release. Previously, Moondream had a tendency to clump together objects that were close to each other, and sometimes struggled with finer-grained object specifications (e.g. "blue bottle" instead of just "bottle"), compared to Moondream's pointing capability.

    This was largely due to the quality of the datasets we used. Object detection datasets generated by humans tend to be messy and imprecise, as drawing highly accurate bounding boxes is tedious. Annotators often take shortcuts, and sometimes draw a single box around multiple instances when they're close to each other in the image.

    We used RL to overcome this, and the results are impressive. We'll be sharing more details about this in a separate blog post, but for now, here's a sample of the results.

    Faster Text Generation

    Moondream now generates answers 20-40% faster than before. This is because we upgraded the model to use a "superword" tokenizer that encodes text more efficiently. This means Moondream needs to emit fewer tokens to generate the same answer, and we achieve this without any drop in accuracy.
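
    As a rough illustration of why emitting fewer tokens makes answers faster, the sketch below relates a speedup to the token savings it would require. It assumes the time per generated token is unchanged and that "X% faster" means X% more text per second; both are assumptions of this sketch rather than statements from the release, and the function name is hypothetical.

```python
def token_reduction_for_speedup(speedup: float) -> float:
    """Fraction of tokens that must be saved to gain `speedup` (0.20 = 20% faster).

    Assumes constant time per token, so total time is proportional to token count.
    """
    return 1.0 - 1.0 / (1.0 + speedup)


for speedup in (0.20, 0.40):
    saved = token_reduction_for_speedup(speedup)
    print(f"{speedup:.0%} faster needs about {saved:.1%} fewer tokens")
```

    Under these assumptions, the quoted 20-40% range corresponds to roughly 17-29% fewer tokens for the same answer.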

    Changing a tokenizer typically involves a costly step to retrain the entire model. We built a lightweight "tokenizer transfer hypernetwork" that enabled us to adapt smoothly to new tokenizers without retraining.

    Lastly, this "tokenizer transfer hypernetwork" also makes it easier to train multilingual variants in the future.

    UI Understanding

    Moondream's performance on ScreenSpot, a benchmark for UI understanding, jumped significantly from 60.3 to 80.4. This makes Moondream a great choice for UI-focused applications that require fast element localization.

    While the model cannot be used as a standalone computer use agent yet, it can work very effectively when treated as a tool to be used by a larger agentic model. This is the setup used by projects like Magnitude, where a bigger LLM writes test cases that leverage Moondream for UI understanding tasks. This separation of planning and execution models allows them to run tests more quickly and reliably than using alternatives like OpenAI or Anthropic's Computer Use APIs.

    Looking Ahead

    Grounded reasoning, smarter object detection, faster tokenizer, and better UI understanding represent a big step forward for Moondream. These fundamental advances also open the door to more improvements on the horizon. We look forward to pushing the model to achieve deeper reasoning capabilities and broader task coverage, and even more speed optimizations.

    There's more to this release than what we've covered here, so check it out yourself. Try our free online playground or our free cloud API, or run it locally using Moondream Station - also free. If you prefer really low-level stuff, you can use it directly through Hugging Face Transformers.

    Happy Moondreamin’!

    Original source
  • Sep 18, 2025
    • Date parsed from source:
      Sep 18, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream 3 Preview: Frontier-level reasoning at a blazing speed

    Moondream 3 unveils a 9B MoE with 2B active params, delivering frontier visual reasoning at speed and low cost. It boosts context length to 32K and RL‑driven training for real world tasks like object detection and OCR. Available on the Moondream playground and HuggingFace.

    Moondream 3 Preview

    We're excited to announce a preview release of Moondream 3. It's a new architecture of 9B MoE, with 2B active params. Moondream now achieves frontier-level visual reasoning while still retaining blazingly fast and efficient inference.

    Why A New Architecture

    The impact of AI today has largely been relegated to the digital realm. We have agents that can code, produce digital art, and so on - but very few cases of AI operating in our physical world. No robots to clean our houses, or act as receptionists, or inspect buildings, etc… For Moondream 3, we focused on 4 key areas.

    • Visual reasoning: despite our focus on smaller models, we don't want that to come at the cost of capability. We want Moondream to be the most capable VLM at real-world tasks.

    • Trainable: Many vision tasks require specialization. It's not enough for VLMs to be as good as humans. Even humans need training when it comes to complex tasks, such as accurately interpreting an X-ray image or detecting struggling people in crowds. Moondream must be easily trainable.

    • Fast: Vision AI applications often need near-realtime performance. Sorting produce, or detecting missing herd animals from a drone, or recognizing security incidents - none of these tasks can be built without fast vision inference.

    • Inexpensive: Vision AI apps often deal with huge quantities of images, and cost can often be a blocker to adoption. Moondream must be cheap to run at scale.

    Moondream 3 achieves these goals by adopting a 9B MoE architecture with only 2B active parameters. This lets it match, and in some cases beat, frontier-level models while staying fast and inexpensive. We also improved its training dynamics, making Moondream 3 more efficient at learning, especially when using Reinforcement Learning (more on that in subsequent announcements). For more details on the architecture, head to the "Tech Notes" below. One final detail, however: we grew the context length from 2k to 32k, making Moondream much better at understanding and producing more complex queries and answers.

    Moondream 3 in action

    Here are some examples of Moondream 3.

    Object Detection

    Moondream 3 is astonishingly good at object detection. It goes beyond simple labels (e.g., "car") and can understand more complex queries. We show results from frontier models alongside for comparison. These models don't support grounding skills like object detection and pointing natively, so we used a templated query for those (see footer).

    • Example 1
      Prompt: "Runner with purple socks"

    • Example 2
      Prompt: "Quantity input"

    Pointing

    Moondream supports pointing as a native skill.

    • Example 3
      Prompt: "Bottle"

    • Example 4
      Prompt: "Best utensil for pasta"

    Structured output

    With a longer context length, Moondream 3 generates intelligent structured outputs with minimal prompting.

    • Example 5: Sled dogs
      Prompt
      "A JSON array with keys: dog_id, fur_color, harness_color."

    • Result
      [
      { "dog_id": 1, "fur_color": "light brown", "harness_color": "red" },
      { "dog_id": 2, "fur_color": "dark brown", "harness_color": "red" },
      { "dog_id": 3, "fur_color": "gray", "harness_color": "red" },
      { "dog_id": 4, "fur_color": "white", "harness_color": "red" },
      { "dog_id": 5, "fur_color": "dark brown", "harness_color": "green" },
      { "dog_id": 6, "fur_color": "light brown", "harness_color": "green" },
      { "dog_id": 7, "fur_color": "dark brown", "harness_color": "black" },
      { "dog_id": 8, "fur_color": "white", "harness_color": "black" }
      ]
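
    Because the result is plain JSON, it can be checked mechanically before downstream use. The sketch below parses a shortened copy of the output above and verifies that each entry has exactly the keys named in the prompt. The helper name parse_dogs is hypothetical, and the strict key check is one possible design choice, not something the release prescribes.

```python
import json

# Model output copied from the result above (shortened to two entries).
raw_output = """
[
  { "dog_id": 1, "fur_color": "light brown", "harness_color": "red" },
  { "dog_id": 2, "fur_color": "dark brown", "harness_color": "red" }
]
"""

EXPECTED_KEYS = {"dog_id", "fur_color", "harness_color"}


def parse_dogs(text: str) -> list:
    """Parse the JSON array and check every entry has exactly the requested keys."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    for record in records:
        if not isinstance(record, dict) or set(record) != EXPECTED_KEYS:
            raise ValueError(f"unexpected entry: {record!r}")
    return records


dogs = parse_dogs(raw_output)
print(len(dogs), dogs[0]["harness_color"])
```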

    OCR

    Moondream 3 has drastically improved its OCR abilities. Our vision encoder can get tripped up on tiny fonts (working on it), but it's now useful in many real-world cases.

    • Example 6
      Prompt
      "Convert to markdown"

    • Result

      | Metal | Reaction | Electrode Potential (V) |
      |---|---|---|
      | Gold | Au⁺ + e⁻ = Au | +1.692 |
      | Silver | Ag⁺ + e⁻ = Ag | +0.7996 |
      | Copper | Cu²⁺ + 2e⁻ = Cu | +0.342 |
      | Iron | Fe³⁺ + 3e⁻ = Fe | -0.037 |
      | Lead | Pb²⁺ + 2e⁻ = Pb | -0.126 |
      | Nickel | Ni²⁺ + 2e⁻ = Ni | -0.257 |
      | Cadmium | Cd²⁺ + 2e⁻ = Cd | -0.403 |
      | Iron | Fe²⁺ + 2e⁻ = Fe | -0.447 |
      | Zinc | Zn²⁺ + 2e⁻ = Zn | -0.762 |
      | Aluminum | Al³⁺ + 3e⁻ = Al | -1.662 |

    Benchmarks

    Here are some early benchmark results. We show them alongside some top frontier models for comparison. In practice, however, it's probably not a fair comparison for Moondream, since Moondream produces answers in a fraction of the time of these bigger models. We'll publish more complete results later and include inference times to make this clearer.

    MD3 Preview Technical Notes

    Here are some details on our new model architecture. Moondream 3 is a fine-grained sparse mixture-of-experts model with 64 experts, of which 8 are activated for each token. We initialized it from Moondream 2 (a 2B dense model) using drop upcycling. We also extended the usable context length to 32K tokens, which is critical for few-shot prompting and agentic workflows with tool use. We don’t fully leverage this longer context in our post-training yet (part of why it's only a preview release). The full 32k context is available for you if you're interested in fine-tuning the model.
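
    To make "64 experts, of which 8 are activated for each token" concrete, here is a minimal top-k routing sketch in plain Python. The softmax normalization over the selected experts, the random router logits, and the name route_token are illustrative assumptions; the post does not describe how Moondream 3's router actually scores or normalizes experts.

```python
import math
import random

NUM_EXPERTS = 64  # from the text
TOP_K = 8         # experts activated per token, from the text


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and softmax-normalize their weights.

    Assumption: softmax over the selected logits only (illustrative choice).
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
weights = route_token(logits)
print(sorted(weights))  # indices of the 8 experts that would process this token
```

    Only the selected experts run for a given token, which is how a model with many total parameters can keep its per-token compute close to that of a much smaller dense model.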

    (Figure: Long-context perplexity evaluation on GovReport dataset. Each point shows the average cross-entropy loss (nats per token) for a 128-token sliding window at that position, measured across 100 documents truncated to 32,768 tokens.)

    We do not use a separate context-length extension phase during training, instead opting to interleave long-context samples while pretraining with a default context length of 4096 tokens. Many context length extension methods like YaRN include an attention temperature scaling component. Inspired by this, we adjust the architecture to enable learned temperature scaling as a function of position, and find this helps with long context modeling.

    Like our last 2B release, this is a hybrid reasoning model that supports both reasoning and non-reasoning modes. Unlike other reasoning models, however, Moondream focuses on visual reasoning with grounding. Here’s an example of what that means:

    Each chunk of underlined text in the reasoning is grounded, meaning the model references a particular part of the image. In our playground, you can see what the model is focusing on by hovering over the text.

    The model starts with only a small set of visual-reasoning examples, and gradually learns to rely on them more during our reinforcement learning (RL) post-training phase. RL proved so effective that, as we refined our training approach, post-training ended up using more compute than the initial pre-training itself.

    The model was trained with load-balancing and router orthogonality losses to help similar tokens specialize together early on, then had load balancing disabled in post-training to avoid catastrophic forgetting from distribution shift. Finally, attention tweaks like learnable temperature and LSE suppression sharpened focus and cut noise, boosting accuracy and clarity.

    Conclusion

    This preview release comes with some caveats. We haven't optimized the inference code yet, so inference is much slower than anticipated (we're working on it!). We're also still actively training this model, and we expect the capabilities and benchmark scores to improve. We also plan to produce variants of this model (e.g., quantized versions and distilled smaller versions).

    The model is now available on the Moondream playground, and you can download it on HuggingFace (Moondream Station will be updated soon). Hit us up on our Discord if you have any questions.

    (1) Frontier models don't support object detection natively, so this prompt was used instead:
    Detect these objects in the image: [comma-separated list].

    Original source
  • Sep 23, 2025
    • Date parsed from source:
      Sep 23, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Moondream Station 2: Simpler, more features, and Windows!

    Moondream Station 2 is live with a one‑command installer for PC, Mac, and Linux, faster starts, and full Windows support. Run Moondream locally with Moondream 3 Preview, a new architecture, PyPI install, support for heavy workloads, and smart port handling. Mac support for Moondream 3 Preview is a work in progress.

    Moondream Station 2 is live

    Running Moondream on your own machine just got way smoother. Moondream Station 2 is a one-command installer for PC, Mac, and Linux that’s faster, simpler, and more capable than ever.

    Highlights:

    • Moondream 3 Preview support: run our latest model locally (Mac support coming soon).
    • New architecture: quicker starts, easier to operate.
    • Windows ready: full native support.
    • PyPI package: install with pip install moondream-station.
    • Built for heavy workloads: request queuing, multiple workers for big-GPU setups, and built-in metrics to track usage and history.
    • Smart port handling: defaults to port 2020, finds a free one if needed, or lets you set your own.
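
    The port behavior described in the last bullet can be sketched with the standard socket module. This is a generic illustration of "try the default, otherwise take a free port", not Moondream Station's actual code; the function name pick_port and the choice of 127.0.0.1 as the bind address are assumptions.

```python
import socket

DEFAULT_PORT = 2020  # default named in the release note


def pick_port(preferred: int = DEFAULT_PORT, host: str = "127.0.0.1") -> int:
    """Return `preferred` if it can be bound, otherwise an OS-assigned free port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, preferred))
            return preferred
        except OSError:
            pass  # preferred port is taken or not permitted; fall through
    # Binding to port 0 asks the operating system for any currently free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as fallback:
        fallback.bind((host, 0))
        return fallback.getsockname()[1]


print(pick_port())
```

    Note that the probe socket is closed before the port is returned, so another process could claim the port in between; a real server would typically keep the bound socket open instead.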

    Whether you’re on Windows, Linux, or Mac, Station 2 puts Moondream’s visual reasoning power a single command away. Apologies to our Mac audience, we're working on quantizing the model and addressing a few technical hiccups. We'll release it asap.

    Get started →

    Original source
  • Oct 17, 2025
    • Date parsed from source:
      Oct 17, 2025
    • First seen by Releasebot:
      Dec 6, 2025
    Moondream logo

    Moondream

    Announcing Moondream Cloud

    Moondream launches Moondream Cloud, a hosted vision AI platform built on Moondream 3 Preview. It emphasizes speed and cost, with pay‑as‑you‑go pricing, $5 free credits, and enterprise options including on‑prem and compliance. A clear release highlighting a new product rollout.

    Fast, cheap, smart. Pick three.

    We're excited to launch Moondream Cloud, a hosted version of Moondream that makes it easy to build cutting-edge vision applications.

    When choosing vision AI tech, three things matter most: intelligence, speed, and cost. With our recent launch of Moondream 3 Preview, our model already delivers top-tier intelligence, reaching SOTA on visual reasoning and grounding tasks, outperforming top frontier models. Our Moondream Cloud release focuses on the other two: speed and cost.

    Pricing

    Moondream Cloud is pay-as-you-go. No subscriptions, no commitments, just load up credits and you're done. To help you start building right away, you get $5 in free monthly credits too (no credit card required!).

    Our pricing is token based: Moondream 3 Preview costs $0.30 per million input tokens, and $2.50 per million output tokens. These token rates are similar to Gemini 2.5 Flash and GPT-5 Mini. But token pricing doesn't tell the full story. Moondream uses a custom SuperBPE tokenizer, which means we generate 21% fewer tokens for the same output text. We have dedicated grounding tokens that represent points with two tokens and object bounding boxes with three tokens, where competing models have to use tens of tokens. And we represent images of all resolutions with 729 tokens, leading to significant savings on prefill.

    We simulated a workload where each of the three examples below is processed once a minute, for 30 days. To do this for all three images would cost:

    Comparisons

    We compared Moondream Cloud with Gemini 2.5 Flash and GPT-5 Mini. Both are vision-capable and similarly priced. (We skipped Claude Haiku 4.5 because its vision capabilities were significantly behind on the tasks we evaluated.)

    Example 1: Pointing

    Average runtime: Moondream 3 (Preview) 1.52 seconds, Gemini 2.5 Flash 3.02 seconds, GPT-5 Mini 27.58 seconds
    Input tokens: Moondream 737, Gemini 1,352, GPT-5 Mini 419
    Output tokens: Moondream 25, Gemini 241, GPT-5 Mini 1,372
    Monthly cost (1 RPM): Moondream $12, Gemini $35, GPT-5 Mini $123

    In this example, Moondream is cheaper because we use both fewer input and fewer output tokens. We require fewer tokens both because we encode the image efficiently (compared to Gemini 2.5 Flash), and because we don't need a complicated text prompt to get the model to output just the list of 2D points. On the outputs, Moondream benefits from having dedicated grounding tokens, requiring only two tokens per point. The result is that Moondream is significantly cheaper to run.

    Example 2: Object detection

    Average runtime: Moondream 4.56 seconds, Gemini 7.69 seconds, GPT-5 Mini 52.88 seconds
    Input tokens: Moondream 737, Gemini 1,839, GPT-5 Mini 1,849
    Output tokens: Moondream 103, Gemini 1,524, GPT-5 Mini 3,271
    Monthly cost (1 RPM): Moondream $21, Gemini $170, GPT-5 Mini $302

    Again, Moondream is more efficient because our grounding tokens mean we only emit three tokens per bounding box -- two tokens encoding the position of the middle of the box, and one token encoding both the height and width. As before, you'll notice we're also significantly more accurate.

    Example 3: OCR

    Average runtime: Moondream 3.92 seconds, Gemini 3.44 seconds, GPT-5 Mini 18.47 seconds
    Input tokens: Moondream 743, Gemini 1,395, GPT-5 Mini 1,812
    Output tokens: Moondream 414, Gemini 533, GPT-5 Mini 528
    Monthly cost (1 RPM): Moondream $54, Gemini $75, GPT-5 Mini $65

    This one is more evenly matched, since we're emitting normal text output. But Moondream still wins on cost because of more efficient image encoding, and more efficient output tokenization (using our custom tokenizer).
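
    The Moondream monthly figures in the three examples above can be reproduced from the listed token prices and token counts. The sketch below assumes list prices with no free credits applied and exactly 30 days at one request per minute; the function name monthly_cost is hypothetical. Gemini and GPT-5 Mini are left out because their per-token prices are not given here.

```python
INPUT_PRICE_PER_M = 0.30   # USD per million input tokens (from the pricing section)
OUTPUT_PRICE_PER_M = 2.50  # USD per million output tokens (from the pricing section)
REQUESTS_PER_MONTH = 60 * 24 * 30  # one request per minute for 30 days


def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of sending one request per minute for 30 days."""
    per_request = (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return per_request * REQUESTS_PER_MONTH


# (input tokens, output tokens) for Moondream, copied from the examples above.
examples = {
    "pointing": (737, 25),
    "object detection": (737, 103),
    "ocr": (743, 414),
}
for name, (inp, out) in examples.items():
    print(f"{name}: ${monthly_cost(inp, out):.2f}")
```

    Rounded to whole dollars, these come out to the $12, $21, and $54 quoted above, with output tokens dominating the OCR case.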

    Throughput and Data Privacy

    On the free tier, we allow up to two requests per second. When you hold $10 or more in paid credits, we increase that to 10 requests per second. We never train on your data, and no data is persisted after returning responses.
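
    A client can stay under these limits by pacing its own requests. The sketch below is a generic client-side throttle, not part of any Moondream SDK; the class name Throttle and its parameters are hypothetical, and how the service itself responds to excess requests is not covered here.

```python
import time


class Throttle:
    """Space calls at least 1/rate seconds apart (client-side pacing only)."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may be sent; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


FREE_TIER_RPS = 2   # from the text
PAID_TIER_RPS = 10  # from the text, with $10 or more in paid credits

throttle = Throttle(FREE_TIER_RPS)
for _ in range(2):
    throttle.wait()  # call this immediately before each API request
```

    The clock and sleep functions are injectable so the pacing logic can be exercised without real delays.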

    We also offer enterprise plans with:

    • On-prem inference (run Moondream in your own infrastructure)
    • Compliance options (e.g. HIPAA)
    • Dedicated consulting and support
    • Volume-based pricing

    Reach out at [email protected] to discuss your needs.

    Conclusion

    Moondream exists for one reason: to power the next wave of vision AI agents. Our new 9B parameter mixture-of-experts Moondream 3 (Preview) model combines the speed of a 2B model with state-of-the-art visual reasoning and grounding, with no compromises. And now, with Moondream Cloud, using it is as simple as it gets. Fast, cheap, smart -- pick three.

    Go to the cloud console to grab an API key, then check out our documentation to get started!

    Original source
Releasebot

Curated by the Releasebot team

Releasebot is an aggregator of official release notes from hundreds of software vendors and thousands of sources.

Our editorial process involves the manual review and audit of release notes procured with the help of automated systems.

Similar to Moondream with recent updates: