
When Liquid AI, a startup founded by MIT computer scientists in 2023, announced Liquid Foundation Models Series 2 (LFM2) in July 2025, the pitch was simple: deliver the fastest on-device foundation model on the market. The "liquid" architecture's training and inference efficiency has made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI’s GPT series and Google’s Gemini.
The first release offered dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture built around gated short convolutions, and benchmark numbers showing LFM2 outperforming similarly sized competitors such as Qwen3, Llama 3.2, and Gemma 3 in both quality and CPU throughput. The message to businesses was clear: with real-time, privacy-preserving AI on your phone, laptop, or vehicle, you no longer have to sacrifice functionality for latency.
In the months since launch, Liquid has expanded LFM2 into a broader product line, adding task- and domain-specific variants, compact vision and audio models, and an edge-focused deployment stack called LEAP, positioning these models as a control layer for on-device and on-premises agent systems.
Now, with the publication of a detailed 51-page LFM2 technical report on arXiv, the company is going a step further, exposing the architecture search process, training data mix, distillation objectives, curriculum strategy, and post-training pipeline behind its models.
Unlike many previous open models, LFM2 is built around repeatable recipes: a hardware-in-the-loop architecture search, a training curriculum that compensates for a smaller parameter budget, and a post-training pipeline tuned for instruction following and tool use.
In addition to providing weights and APIs, Liquid effectively exposes detailed blueprints that other organizations can use as a reference to train their own small, efficient models from scratch, tailored to their own hardware and deployment constraints.
A model family designed around real-world constraints, not GPU labs
The technical report starts from a constraint that enterprises know well: real AI systems hit their limits long before the benchmarks do. Latency budgets, peak memory limits, and thermal throttling define what can actually run in production, especially on laptops, tablets, general-purpose servers, and mobile devices.
To address this, Liquid AI ran architecture searches directly on target hardware, such as Snapdragon mobile SoCs and Ryzen laptop CPUs. The search converged on the same design at every scale: gated short-convolution blocks interleaved with a few grouped-query attention (GQA) layers. This design was repeatedly chosen over more exotic linear-attention and SSM hybrids because it offers a better quality/latency/memory Pareto profile under real device conditions.
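To make the winning block concrete, here is a minimal numpy sketch of what a gated short-convolution operator can look like: an input-dependent gate, a short causal depthwise convolution, and an output gate. The exact gating layout and projections in LFM2 may differ; all weight names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_short_conv_block(x, w_in_gate, w_out_gate, kernel):
    """Illustrative gated short-convolution block (not LFM2's exact layout).

    x:          (seq_len, d) token activations
    w_in_gate:  (d, d) projection for the input gate
    w_out_gate: (d, d) projection for the output gate
    kernel:     (K, d) depthwise causal short-convolution weights
    """
    in_gate = sigmoid(x @ w_in_gate)    # input-dependent gate
    out_gate = sigmoid(x @ w_out_gate)  # output gate
    gated = in_gate * x
    K, d = kernel.shape
    # causal depthwise conv: pad K-1 zeros on the left so position t
    # only sees positions t-K+1 .. t
    padded = np.vstack([np.zeros((K - 1, d)), gated])
    conv = np.stack([(padded[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])
    return out_gate * conv
```

The appeal for constrained hardware is visible even in this toy version: the receptive field is a fixed, small window, so compute and memory per token are constant, unlike attention's quadratic growth.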
This is important for enterprise teams in three ways:
- Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.
- Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.
- On-device feasibility. Prefill and decode throughput on CPU is often roughly double that of comparable open models, reducing the need to offload routine tasks to cloud inference endpoints.
The report reads as a systematic attempt to design a model that companies will actually ship, rather than one optimized for academic novelty. That is notable, and more practical for enterprises, in a field where many open models implicitly assume access to multi-H100 clusters even at inference time.
A training pipeline tuned for enterprise-relevant behavior
LFM2 uses a training approach that compensates for small model size through structure rather than brute force. The main elements are:
- Pre-training on 10-12T tokens, followed by a mid-training phase at 32K context length, which extends the model’s useful context window without exploding compute costs.
- A decoupled top-K knowledge distillation objective, which avoids the instability of standard KL distillation when the teacher provides only partial logits.
- A three-stage post-training sequence (SFT, length-normalized preference optimization, and model merging) designed to produce more reliable instruction-following and tool-use behavior.
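The distillation idea can be illustrated with a toy numpy sketch. The report's exact formulation may differ; here, in the spirit of decoupled KD, the loss splits into a binary KL over "probability mass inside vs. outside the teacher's top-K set" and a KL between the distributions renormalized over the top-K tokens, with the teacher's in-top-K mass treated as a hyperparameter since its tail logits are unavailable.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoupled_topk_kd_loss(student_logits, teacher_topk_logits, topk_idx,
                           teacher_topk_mass=0.99, eps=1e-12):
    """Toy decoupled top-K distillation loss (illustrative, not the paper's exact form)."""
    s = softmax(student_logits)
    s_top = s[topk_idx]
    s_mass = s_top.sum()
    # (1) binary KL: how much student mass lands inside vs. outside the top-K set
    t_bin = np.array([teacher_topk_mass, 1.0 - teacher_topk_mass])
    s_bin = np.clip(np.array([s_mass, 1.0 - s_mass]), eps, None)
    kl_bin = float(np.sum(t_bin * np.log(t_bin / s_bin)))
    # (2) KL between the distributions renormalized over the top-K tokens only
    t_in = softmax(teacher_topk_logits)
    s_in = np.clip(s_top / s_mass, eps, None)
    kl_in = float(np.sum(t_in * np.log(t_in / s_in)))
    return kl_bin + kl_in
```

Because neither term ever asks the student to match unknown tail logits, the objective stays well-defined even though the teacher only exposes K values per position.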
For enterprise AI developers, the important point is that LFM2 models behave less like "little LLMs" and more like pragmatic agents that follow structured formats, adhere to JSON Schema, and manage multi-turn chat flows. Many open models of similar size fail not because of a lack of reasoning ability, but because of weak adherence to instruction templates. LFM2's post-training recipes directly target these rough edges.
In other words, Liquid AI has optimized its small models for operational reliability, not just the benchmark scoreboard.
Multimodality designed for device constraints rather than lab demos
The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency.
Rather than embedding a large vision transformer directly into the LLM, LFM2-VL attaches a SigLIP2 encoder via a connector and aggressively reduces the number of visual tokens with PixelUnshuffle. High-resolution inputs trigger dynamic tiling, keeping token budgets under control even on mobile hardware. LFM2-Audio uses bifurcated audio paths (one for embedding, one for generation) to support real-time transcription and speech synthesis on modest CPUs.
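The PixelUnshuffle step is easy to sketch: it trades spatial resolution for channel depth, cutting the visual token count by r² without discarding any values. A minimal numpy version, assuming the token grid is laid out row-major:

```python
import numpy as np

def pixel_unshuffle_tokens(tokens, h, w, r=2):
    """Fold each r x r patch of visual tokens into one token with r*r times the channels.

    tokens: (h*w, c) row-major grid of visual tokens
    returns: ((h//r)*(w//r), c*r*r)
    """
    assert h % r == 0 and w % r == 0
    c = tokens.shape[1]
    grid = tokens.reshape(h, w, c)
    # split the grid into r x r patches, then move the patch dims next to channels
    grid = grid.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape((h // r) * (w // r), c * r * r)
```

With r=2, an 8x8 grid of 64 visual tokens becomes 16 fatter tokens, so the LLM pays a quarter of the sequence-length cost for the same pixels.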
For enterprise platform architects, this design represents a practical future:
- Document understanding happens directly at the endpoint, such as a field device.
- Audio transcription and voice agents run locally for privacy compliance.
- Multimodal agents operate within a fixed latency envelope without streaming data off the device.
The throughline is the same: multimodal functionality that doesn’t require a GPU farm.
A retrieval model built for agent systems rather than traditional search
LFM2-ColBERT compresses late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector-database accelerators.
This is especially meaningful as organizations begin to coordinate fleets of agents. Fast local retrieval running on the same hardware as the inference model reduces latency and improves governance: documents never leave the device boundary.
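Late interaction, the core idea of the ColBERT family, is compact enough to sketch directly: each query token embedding is matched against its best document token embedding, and the per-token maxima are summed (MaxSim). A minimal numpy version:

```python
import numpy as np

def l2_normalize(m, eps=1e-12):
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)

def maxsim_score(query_emb, doc_emb):
    """ColBERT-style MaxSim: for each query token, take the max cosine
    similarity against any document token, then sum over query tokens.

    query_emb: (nq, dim) query token embeddings
    doc_emb:   (nd, dim) document token embeddings
    """
    sim = l2_normalize(query_emb) @ l2_normalize(doc_emb).T  # (nq, nd)
    return float(sim.max(axis=1).sum())
```

Because document embeddings can be precomputed and scoring is just a matrix multiply plus a row-wise max, this runs comfortably on the same CPU that serves the LLM.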
Taken together, the VL, Audio, and ColBERT variants show LFM2 to be a modular system rather than a single model drop.
A new blueprint for hybrid enterprise AI architecture
Across all variants, the LFM2 report implicitly sketches what tomorrow’s enterprise AI stack will look like: hybrid local-cloud orchestration, in which small, fast models running on-device handle time-critical recognition, formatting, tool invocation, and decision-making, while larger cloud models provide heavyweight reasoning when needed.
Several trends come together here.
- Cost control. Running routine inference locally avoids unpredictable cloud bills.
- Latency determinism. TTFT and decode stability matter in agent workflows; on-device execution eliminates network jitter.
- Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.
- Resilience. If the cloud path becomes unavailable, the agent system degrades gracefully.
Enterprises adopting these architectures may treat smaller on-device models as the “control plane” for agent workflows, with larger cloud models acting as on-demand accelerators.
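A control-plane router of this kind can be as simple as a policy function. Everything below (model names, thresholds, task fields) is hypothetical, but it illustrates the division of labor implied above: default to the local small model, and escalate to the cloud only when the task genuinely needs frontier-scale reasoning and the latency budget allows the round trip.

```python
from dataclasses import dataclass

@dataclass
class Task:
    needs_deep_reasoning: bool  # e.g. multi-step planning over long context
    latency_budget_ms: int      # hard deadline for the response
    contains_pii: bool          # data that must not leave the device

def route(task: Task) -> str:
    """Hypothetical control-plane policy: local-first, cloud as an accelerator."""
    if task.contains_pii:
        return "local-small-model"      # governance: PII stays on-device
    if task.needs_deep_reasoning and task.latency_budget_ms >= 2000:
        return "cloud-frontier-model"   # escalate only when worth the round trip
    return "local-small-model"          # default: deterministic latency, no egress
```

The resilience property falls out of the same structure: if the cloud branch is unreachable, the caller can simply fall back to the local default.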
LFM2 is one of the clearest open-source foundations to date for that control layer.
Strategic takeaway: On-device AI is now a design choice, not a compromise
Organizations building AI capabilities have long accepted that "real AI" requires cloud inference. LFM2 challenges that assumption. These models perform competitively across reasoning, instruction following, multilingual tasks, and RAG, while delivering significant latency improvements over other open small-model families.
For CIOs and CTOs finalizing their 2026 roadmaps, the implications are direct. The small, open, on-device model is now powerful enough to run meaningful slices of production workloads.
LFM2 is not intended to replace frontier cloud models for frontier-scale reasoning. But it provides what companies arguably need more: a reproducible, open, operational foundation for agent systems that must run everywhere, from phones to industrial endpoints to air-gapped secure facilities.
LFM2 is less a research milestone than a sign of architectural convergence as enterprise AI scales out. The future is neither cloud nor edge alone but both working together, and releases like LFM2 provide the building blocks for organizations ready to build that hybrid future by design, not by accident.
