
Nous Research, an open source artificial intelligence startup backed by cryptocurrency venture firm Paradigm, released a new competitive programming model on Monday. The model rivals or exceeds some large-scale proprietary systems and was trained in just four days using 48 of Nvidia’s latest B200 graphics processors.
The model, called NousCoder-14B, is another entry in the crowded field of AI coding assistants, but it arrives at a particularly tense moment. Claude Code, rival Anthropic's agentic programming tool, has dominated social media discussions since New Year's Day, with developers posting about breathtaking experiences with its capabilities. The concurrent developments highlight how rapidly AI-assisted software development is evolving, and how fiercely companies large and small are competing to control what many believe will become the foundational technology for how software is created.
NousCoder-14B achieved 67.87 percent accuracy on LiveCodeBench v6, a standardized assessment that tests models based on competitive programming problems published from August 2024 to May 2025. According to Nous Research’s technical report published alongside the release, this number represents an improvement of 7.08 percentage points compared to the base model it was trained on, Alibaba’s Qwen3-14B.
"I gave Claude Code a description of the problem. It generated something I built last year in an hour," Jaana Dogan, Google's lead engineer for the Gemini API, wrote in a viral post on X last week, capturing the general mood around AI coding tools. Dogan was describing a distributed agent orchestration system that her team spent a year developing, and that Claude Code approximated from a three-paragraph prompt.
The juxtaposition is instructive. While Anthropic's Claude Code captures imaginations with demonstrations of end-to-end software development, Nous Research is betting that open source alternatives trained on verifiable problems can close the gap, and that transparency in how these models are built matters as much as raw capability.
How Nous Research built an AI coding model that anyone can replicate
What sets the NousCoder-14B release apart from many competitor announcements is its radical openness. In addition to model weights, Nous Research has published a complete reinforcement learning environment, benchmark suite, and training harness built on the company’s Atropos framework, allowing any researcher with sufficient computing power to reproduce or extend the work.
"Open sourcing the Atropos stack provides the infrastructure needed for reproducible Olympiad-level reasoning research," one observer noted on X, summing up the release's importance to the academic and open source communities.
The model was trained by Joe Li, a resident researcher at Nous Research and a former competitive programmer himself. Li's technical report reveals an unexpectedly personal side: he compared the model's improvement trajectory to his own journey on Codeforces, a competitive programming platform where participants earn ratings based on their performance in contests.
Using a rough mapping of LiveCodeBench scores to Codeforces ratings, Li calculated that NousCoder-14B's improvement (from roughly the 1600-1750 rating range to 2100-2200) mirrors a climb that took him nearly two years of consistent practice between the ages of 14 and 16. The model achieved comparable gains in four days.
"It was quite a surreal experience watching the final training run unfold," Li wrote in the technical report.
But Li was quick to note an important caveat, one that touches on broader questions about AI's efficiency: the model required 24,000 problems, while he solved roughly 1,000 over those two years. Humans, at least for now, remain vastly more sample-efficient learners.
Inside the reinforcement learning system that trains on 24,000 competitive programming problems
NousCoder-14B's training process provides a window into the increasingly sophisticated techniques researchers use to improve AI reasoning capabilities through reinforcement learning.
The approach relies on what researchers call "verifiable rewards": a system in which the model generates code solutions, those solutions are executed against test cases, and the model receives a simple binary signal (correct or incorrect). Although the feedback loop is conceptually simple, running it at scale requires substantial infrastructure.
Nous Research used Modal, a cloud computing platform, to run sandboxed code executions in parallel. Each of the 24,000 training problems carries, on average, several hundred test cases, and the system must verify that generated code produces correct output within time and memory constraints (15 seconds and 4 GB, respectively).
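The reward loop described above can be sketched in a few lines. This is a minimal, illustrative version, not the Modal-based harness Nous Research actually used: it runs a candidate Python program in a subprocess against input/output test pairs, enforces the 15-second time limit mentioned in the report, and emits the binary pass/fail signal. The function and variable names are my own.

```python
# Minimal sketch of a "verifiable reward" check: execute a candidate
# solution against test cases and return a binary correct/incorrect signal.
# Illustrative only; the real harness runs sandboxed on Modal at scale.
import subprocess
import sys

def binary_reward(solution_code: str, test_cases: list[tuple[str, str]],
                  timeout_s: float = 15.0) -> int:
    """Return 1 only if the program maps every input to the expected output."""
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text, capture_output=True,
                text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0  # a time-limit violation counts as failure
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return 0  # runtime error or wrong output
    return 1

# Toy problem: read an integer from stdin and print its double.
tests = [("3", "6"), ("10", "20")]
print(binary_reward("print(int(input()) * 2)", tests))  # correct solution: 1
print(binary_reward("print(int(input()) + 2)", tests))  # wrong solution: 0
```

Memory limits (the 4 GB cap) would additionally require OS-level resource controls, which the sketch omits.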
The training employed a technique called DAPO (Dynamic Sampling Policy Optimization), which the researchers found performed slightly better than alternatives in their experiments. A key innovation is "dynamic sampling": discarding training examples where the model solves every attempt or fails every attempt, since those provide no useful gradient signal for learning.
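The dynamic-sampling filter can be illustrated with a small sketch. Assume each training problem has a group of sampled rollouts scored 1 (passed) or 0 (failed); the names below are illustrative, not from the Atropos codebase:

```python
# DAPO-style "dynamic sampling" sketch: keep only prompts whose rollout
# group has mixed outcomes. All-pass or all-fail groups carry zero
# advantage and thus no gradient signal, so they are discarded.
def dynamic_sample_filter(rollout_groups: dict[str, list[int]]) -> dict[str, list[int]]:
    """rollout_groups maps problem_id -> binary rewards, one per rollout."""
    return {
        pid: rewards
        for pid, rewards in rollout_groups.items()
        if 0 < sum(rewards) < len(rewards)  # mixed outcomes only
    }

groups = {
    "easy_problem":   [1, 1, 1, 1],  # solved every time: discarded
    "hard_problem":   [0, 0, 0, 0],  # failed every time: discarded
    "useful_problem": [1, 0, 1, 0],  # mixed: kept for the gradient update
}
print(sorted(dynamic_sample_filter(groups)))  # ['useful_problem']
```

In practice the filtered-out prompts are resampled or replaced so each batch stays full, a detail the sketch leaves out.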
The researchers also adopted "iterative context expansion": training began with a 32,000-token context window, which was then scaled to 40,000 tokens. During evaluation, expanding the context further to approximately 80,000 tokens yielded the best result, the 67.87 percent accuracy.
Perhaps most importantly, the training pipeline overlaps inference and validation. As soon as the model generates a solution, it starts working on the next problem while the previous solution is checked. This pipelining, combined with asynchronous training where multiple model instances work in parallel, maximizes hardware utilization on expensive GPU clusters.
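A toy version of that overlap can be shown with a thread pool: validation of earlier solutions runs in the background while generation continues. The generate and validate functions below are stand-ins for model inference and sandboxed test execution, not Nous Research's code:

```python
# Sketch of pipelined generation and validation: while workers check
# earlier solutions, the "model" is already producing the next one,
# keeping the expensive generation side busy.
from concurrent.futures import ThreadPoolExecutor
import time

def generate(problem_id: int) -> str:
    time.sleep(0.01)                 # stand-in for model inference
    return f"solution-{problem_id}"

def validate(solution: str) -> bool:
    time.sleep(0.01)                 # stand-in for sandboxed test execution
    return True

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = []
    for pid in range(8):
        sol = generate(pid)                         # generate now...
        pending.append(pool.submit(validate, sol))  # ...validate in background
    results = [f.result() for f in pending]

print(all(results), len(results))  # True 8
```

The production version is fully asynchronous across multiple model instances and GPUs; the thread pool here only illustrates the overlap.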
The imminent data shortage that could slow progress in AI coding models
Li's technical report contains a finding with profound implications for the future of AI development: the NousCoder-14B training dataset already includes "the majority of readily available and verifiable competitive programming problems" published in standardized datasets.
In other words, researchers are approaching the limits of high-quality training data for this particular area.
"The total number of competitive programming problems on the internet is on roughly the same order of magnitude," Li wrote, referring to the 24,000 problems used in training. "This suggests that we are nearing the limits of high-quality data in the realm of competitive programming."
The observation reflects growing concern across the AI industry about data constraints. While compute continues to scale according to well-understood economic and engineering principles, training data is becoming "increasingly limited," as Li put it.
"Some of the most important research to be done going forward appears to be in synthetic data generation and in data-efficient algorithms and architectures," he concluded.
The challenge is particularly acute in competitive programming, which requires problems with known correct solutions that can be automatically verified. Unlike natural language tasks, where human evaluation or proxy metrics suffice, code either works or it doesn't, which makes generating synthetic data much harder.
Li identified one potential avenue: training models not only to solve problems but to generate solvable ones, enabling a form of self-play similar to techniques that have proven successful in game-playing AI systems. "Once problem generation is solved, self-play becomes a very interesting direction," he wrote.
A $65 million bet on whether open source AI can compete with Big Tech
Nous Research occupies an unusual position in the AI field: a company whose open source releases compete with, and in some cases surpass, proprietary alternatives.
In April 2025, the company raised $50 million in a round led by Paradigm, the cryptocurrency-focused venture firm co-founded by Coinbase co-founder Fred Ehrsam. According to some reports, total funding has reached $65 million. The investment reflects growing interest in decentralized approaches to AI training, an area where Nous Research has developed its Psyche platform.
Previous releases include Hermes 4, a family of models that, as we reported, "outperforms ChatGPT without content restrictions," and DeepHermes-3, which the company describes as the first "toggle-on reasoning model," letting users activate extended thinking capabilities on demand.
The company has cultivated a distinctive aesthetic and community, which has drawn some skepticism that style masks substance. "Oh, I'm supposed to believe the anime PFP company. Stop benchmarkmaxxing, FFS," one commenter wrote on X, referring to Nous Research's anime-style branding and the industry practice of optimizing for benchmark performance.
Others raised technical questions. "Nemotron is better based on benchmarks, but…," one commenter pointed out, referring to Nvidia's family of language models. Another asked whether NousCoder-14B is "agent-focused or just 'one-shot' coding": an important distinction for real-world software development, where iterative feedback usually yields better results than a single attempt.
Researchers say what needs to happen next for AI coding tools to keep improving
The release outlines several directions for future research that suggest where AI coding work is headed.
Multiturn reinforcement learning tops the list. Currently, the model receives only a final binary reward (pass or fail) after generating a solution. But competitive programming problems typically include public test cases that provide intermediate feedback such as compilation errors, incorrect output, and time-limit violations. Training a model to incorporate this feedback across multiple attempts could significantly improve performance.
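What such a multiturn loop might look like can be sketched, with the caveat that this is a hypothetical illustration rather than the proposed training setup: run the candidate against public tests, classify the failure (runtime error, time limit, wrong output), and hand that feedback to a revise step standing in for the model.

```python
# Hedged sketch of a multiturn feedback loop on public test cases.
# `revise` is a hypothetical stand-in for a model rewriting its code.
import subprocess
import sys

def run_case(code: str, stdin_text: str, timeout_s: float = 2.0) -> tuple[str, str]:
    """Run one test case, classifying failures as the feedback signal."""
    try:
        r = subprocess.run([sys.executable, "-c", code], input=stdin_text,
                           capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "time_limit", ""
    if r.returncode != 0:
        return "runtime_error", r.stderr
    return "ok", r.stdout.strip()

def solve_with_retries(code: str, revise, public_tests, max_turns: int = 3) -> bool:
    for _ in range(max_turns):
        feedback = None
        for stdin_text, expected in public_tests:
            status, out = run_case(code, stdin_text)
            if status != "ok":
                feedback = status
                break
            if out != expected:
                feedback = f"wrong output on input {stdin_text!r}"
                break
        if feedback is None:
            return True                  # all public tests pass
        code = revise(code, feedback)    # model incorporates the feedback
    return False

# Toy demo: a buggy doubling program gets "fixed" after one feedback round.
fix = lambda code, fb: "print(int(input()) * 2)"
print(solve_with_retries("print(int(input()) + 2)", fix, [("3", "6")]))  # True
```

In a real training setup, the per-turn feedback strings would be appended to the model's context and the reward assigned over the whole trajectory.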
Controlling response length also remains a challenge. The researchers found that incorrect solutions tended to be longer than correct ones, and that during training, response lengths quickly saturated the available context window. The pattern persisted despite various algorithmic modifications.
Perhaps most ambitious of all, Li proposed "problem generation and self-play": training models both to solve and to create programming problems. This would let a model generate its own training curriculum, directly addressing the data scarcity problem.
"Although humans are good at generating problems that are interesting and useful to other competitive programmers, there still appears to be a large gap in LLMs' ability to generate creative problems," Li wrote.
The model is available now on Hugging Face under the Apache 2.0 license. For researchers and developers who want to build on the work, Nous Research has published the complete Atropos training stack alongside it.
In 96 hours, the model replicated what took Li two years of youthful dedication: the climb from a 1600-rated novice to a 2100-rated competitor on Codeforces. He needed 1,000 problems; the model required 24,000. But soon, these systems may learn to write their own problems, teach themselves, and leave human benchmarks behind altogether.
The question is no longer whether machines can learn to code. It’s whether they can quickly become better teachers than us.
