Together AI’s ATLAS Adaptive Speculator speeds up inference by 400% by learning from workloads in real time



Companies expanding their AI deployments are hitting an invisible performance wall. The culprit? Static speculators that cannot keep up with changing workloads.

Speculators are small AI models that run alongside larger language models during inference. They draft multiple tokens ahead of time, and the main model validates them in parallel. This technique, called speculative decoding, has become essential for companies looking to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens per step, significantly increasing throughput.
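The core loop is simple enough to sketch. The toy Python below uses stand-in functions rather than real models, and checks tokens in a sequential loop where production systems such as vLLM would use a single batched GPU pass; it is only meant to illustrate the draft-then-verify pattern described above.

```python
# Toy sketch of speculative decoding: a cheap "draft" function proposes
# several tokens, and a more expensive "target" function verifies them.
# Both are made-up stand-ins, not real models.

def draft_next(context):          # toy speculator: cheap, sometimes wrong
    return (context[-1] + 1) % 50

def target_next(context):         # toy target model: the ground truth
    return (context[-1] + 1) % 53

def speculative_step(context, k=5):
    # 1. The speculator drafts k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model checks each drafted position. Here it is a loop;
    #    on a GPU all k positions are scored in one forward pass.
    accepted, ctx = [], list(context)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)         # draft matches: keep it
            ctx.append(t)
        else:
            accepted.append(expected)  # mismatch: take the target's token
            break
    return accepted                    # 1..k tokens per target-model pass

print(speculative_step([0], k=5))      # e.g. [1, 2, 3, 4, 5]
```

Because a drafted token is only kept when it matches what the target model would have produced, the output is unchanged; only the number of expensive target-model passes shrinks.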

Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) aimed at helping companies overcome the limits of static speculators. The technology provides self-learning inference optimization that can deliver inference performance up to 400% faster than the baseline offered by existing inference technologies such as vLLM. The system addresses a critical problem: inference slowing down as AI workloads evolve, even when specialized speculators are in place.

The company has made inference optimization a focus of its enterprise AI platform since 2023. Earlier this year, it raised $305 million as customer adoption and demand grew.

"The companies we work with typically see workload changes as they scale, but then they don’t see as much speedup from speculative execution as they used to." Tri Dao, Chief Scientist at Together AI, told VentureBeat in an exclusive interview. "These speculators typically do not perform well when the workload domain begins to change."

The workload drift problem that no one talks about

Most speculators in production today are "static" models. They are trained once on a fixed dataset representing the expected workload and then deployed without the ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their primary models, and inference platforms like vLLM use these static speculators to increase throughput without changing output quality.

But there’s a catch. As companies’ use of AI evolves, the accuracy of static speculators will rapidly decline.

"If you’re a company that produces coding agencies and most of your developers are writing in Python, and suddenly some of them are writing in Rust or C, you’ll see things start to slow down." Mr. Dao explained. "Speculators have a mismatch between what they were trained on and the actual workload."

This workload drift represents a hidden burden on scaling AI. Companies either accept degraded performance or invest in retraining custom speculators, and even retraining only captures a snapshot in time that can quickly become outdated.

How adaptive speculators work: A dual-model approach

ATLAS uses a dual speculator architecture that combines stability and adaptability.

A static speculator – A heavyweight model trained on broad data that provides consistent baseline performance and acts as a "speed floor."

An adaptive speculator – A lightweight model that continuously learns from live traffic, specializing on the fly in emerging domains and usage patterns.

A confidence-aware controller – An orchestration layer that dynamically chooses which speculator to use and adjusts the speculation "lookahead" based on confidence scores.

"Before we learn what adaptive throwers are, there are still static throwers that offer speed improvements first." Ben Athiwaratkun, staff AI scientist at Together AI, explained to VentureBeat. "As adaptive speculators gain confidence, their speed increases over time."

The innovation lies in balancing acceptance rate (how often the target model agrees with the drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller leans more on the lightweight speculator and extends the lookahead, which further improves performance.
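Together AI has not published the internals of this controller. Purely as an illustration, the sketch below shows one way a confidence-aware controller could route drafting between two speculators and stretch the lookahead as a running acceptance-rate estimate improves; the class, parameter names, and the `static_spec`/`adaptive_spec` objects are all hypothetical.

```python
# Illustrative sketch only -- not ATLAS's published implementation.
# `static_spec` and `adaptive_spec` are hypothetical speculator objects.
# The controller tracks a running acceptance-rate estimate for each one,
# routes drafting to whichever is currently more reliable, and stretches
# the lookahead window as confidence grows.

class SpeculatorController:
    def __init__(self, static_spec, adaptive_spec,
                 min_lookahead=2, max_lookahead=8, decay=0.9):
        self.specs = {"static": static_spec, "adaptive": adaptive_spec}
        self.acceptance = {"static": 0.5, "adaptive": 0.0}  # running estimates
        self.min_lookahead = min_lookahead
        self.max_lookahead = max_lookahead
        self.decay = decay  # weight of the exponential moving average

    def choose(self):
        # Prefer the adaptive speculator once it has earned enough confidence;
        # until then the static speculator acts as the "speed floor".
        name = max(self.acceptance, key=self.acceptance.get)
        confidence = self.acceptance[name]
        # Higher confidence -> draft more tokens per verification pass.
        lookahead = round(self.min_lookahead +
                          confidence * (self.max_lookahead - self.min_lookahead))
        return name, self.specs[name], lookahead

    def report(self, name, accepted, drafted):
        # After each verification pass, fold the observed acceptance rate
        # into the chosen speculator's running estimate.
        rate = accepted / max(drafted, 1)
        self.acceptance[name] = (self.decay * self.acceptance[name]
                                 + (1 - self.decay) * rate)
```

In a real system the adaptive speculator would also be trained online on the accepted and rejected tokens; the sketch only shows the routing and lookahead logic.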

Users do not need to adjust any parameters. "No knobs need to be turned on the user's side," Dao said. "On our end, we have turned these knobs, adjusting the configuration to deliver the speedup."

Performance comparable to custom silicon

Together AI's tests show that ATLAS reaches 500 tokens per second on DeepSeek-V3.1 when fully adapted. Even more impressive, those numbers were achieved on Nvidia B200 GPUs, matching or exceeding specialized inference chips such as Groq's custom hardware.

"Software and algorithm improvements can close the gap with truly specialized hardware." Dao said. "These huge models generate 500 tokens per second, which is even faster than some customized chips."

The 400% inference speedup the company claims is the cumulative effect of Together's Turbo optimization suite: FP4 quantization delivers an 80% speedup over the FP8 baseline, the static Turbo speculator adds another 80-100% gain, and the adaptive system is layered on top of that. Each optimization compounds the benefits of the others.
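As a rough sanity check on how those layers multiply (the exact contribution of each layer is not broken out publicly, and "up to 400% faster" is read here as roughly a 4x end-to-end speedup):

```python
# Back-of-the-envelope compounding of the individually quoted gains.
# The adaptive layer's exact contribution is not published; the last
# figure is simply whatever remainder closes the gap to ~4x overall.

fp4_gain = 1.80            # "80% speedup" over the FP8 baseline
static_spec_gain = 1.90    # midpoint of the quoted 80-100% gain
combined = fp4_gain * static_spec_gain
print(f"quantization + static speculator: {combined:.2f}x")    # ~3.42x

target_total = 4.0         # "up to 400% faster" read as roughly 4x overall
adaptive_layer = target_total / combined
print(f"implied adaptive-layer contribution: ~{adaptive_layer:.2f}x")
```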

The gains are measured against standard inference engines such as vLLM and Nvidia's TensorRT-LLM. Together AI benchmarks each workload against whichever of the two is the stronger baseline before applying its speculative optimizations.

Understanding the memory-compute trade-off

Performance gains are achieved by exploiting a fundamental inefficiency in modern inference: wasted computing power.

Dao explained that much of the computational power is typically underutilized during inference.

"During inference, which is our main workload today, we primarily use the memory subsystem." he said.

Speculative decoding trades idle compute for fewer memory accesses. When a model generates one token at a time, it is memory-bound: the GPU sits idle waiting for memory. But when the speculator proposes five tokens and the target model verifies them all at once, compute utilization spikes while memory accesses stay roughly constant.

"The total amount of computation to generate five tokens is the same, but memory needs to be accessed only once instead of five times." Dao said.

Think of it as an intelligent cache for AI.

For infrastructure teams accustomed to traditional database optimization, the adaptive speculator acts like an intelligent caching layer, but with key differences.

Traditional caching systems like Redis and memcached require an exact match: they store the precise result of a query and return it only when that identical query is run again. Adaptive speculators work differently.

"This can be seen as an intelligent way of figuring out and caching some patterns that appear, rather than storing them exactly." Mr. Dao explained. "In general, we observe that you all use similar code or control your computing in similar ways. That way you can predict what the big model is going to say. We’re getting better and better at predicting it."

Rather than storing exact responses, the system learns patterns in how the model generates tokens. It recognizes that when you are editing Python files in a particular codebase, certain token sequences become more likely. The speculator adapts to those patterns and improves its predictions over time without requiring identical inputs.
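To make the contrast concrete, the deliberately simplified snippet below (not Together AI's design) compares an exact-match cache with a small pattern learner that predicts the next token from recently observed sequences.

```python
# Deliberately simplified contrast (not Together AI's design) between
# exact-match caching and a pattern-based predictor that keeps adapting.

from collections import Counter, defaultdict

# Exact-match caching (Redis / memcached style): only identical keys hit.
exact_cache = {"def add(a, b):": "    return a + b"}
print(exact_cache.get("def add(a, b):"))   # hit
print(exact_cache.get("def add(x, y):"))   # miss -- not byte-identical

# Pattern-based prediction: learn which token tends to follow a two-token
# context, so similar-but-not-identical traffic still benefits.
follows = defaultdict(Counter)

def observe(tokens):
    # Keep learning from live traffic.
    for a, b, nxt in zip(tokens, tokens[1:], tokens[2:]):
        follows[(a, b)][nxt] += 1

def predict(a, b):
    counts = follows.get((a, b))
    return counts.most_common(1)[0][0] if counts else None

observe(["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"])
observe(["def", "mul", "(", "x", ",", "y", ")", ":", "return", "x", "*", "y"])
print(predict("def", "mul"))   # '(' -- a learned pattern, not an exact match
```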

Use cases: RL training and evolving workloads

Two enterprise scenarios benefit particularly from adaptive speculators.

Reinforcement learning training: Static speculators quickly fall out of alignment as the policy evolves during training. ATLAS adapts continuously to the shifting policy distribution.

Evolving workloads: Workload composition changes as enterprises discover new AI use cases. "Maybe they started using AI for chatbots, but then they realized AI can write code and started moving toward code," Dao said. "Or they realize these AIs can actually call tools, control computers, handle accounting and so on."

In a vibe-coding session, the adaptive system can specialize in the specific codebase being edited, files that were never seen during training, which further improves acceptance rates and decoding speed.

What it means for enterprises and the inference ecosystem

ATLAS is currently available on Together AI's dedicated endpoints as part of the platform at no additional cost. The optimization is available to the company's more than 800,000 developers (up from 450,000 in February).

But the broader impact goes beyond one vendor’s products. The transition from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As companies deploy AI across multiple domains, the industry must move beyond one-time trained models to systems that continuously learn and improve.

Together AI has previously released some of its research as open source and collaborated with projects such as vLLM. Although the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem.

For companies looking to lead with AI, the message is clear: adaptive algorithms on commodity hardware can compete with custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization will increasingly rival specialized hardware.
