Andrej Karpathy’s weekend “vibe code” hack quietly sketches the missing layer of enterprise AI orchestration.



This weekend, Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he didn't want to read it alone. He wanted to read it alongside a council of artificial intelligences, each offering its own perspective, critiquing the others, and ultimately synthesizing a final answer under the guidance of a designated "chairman."

To make this happen, Karpathy wrote what he calls a "vibe code project": software created quickly, largely by AI assistants, and intended more for fun than for function. He posted the result to GitHub as a repository called "LLM Council," with a clear disclaimer: "I have no intention of supporting it in any way... Code is ephemeral now and libraries are over."

But for technical decision-makers across the enterprise, looking past that casual disclaimer reveals something far more significant than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy sketched a reference architecture for the most important and least-defined layer of the modern software stack: the orchestration middleware that sits between enterprise applications and the volatile AI model market.

As companies finalize their platform investments for 2026, LLM Council offers a rare, unvarnished look at the "build or buy" reality of AI infrastructure. It shows that while the routing and aggregation logic of a multi-model AI system is surprisingly simple, the real complexity lives in the operational wrapper required to make it enterprise-ready.

How LLM Council works: four AI models discuss, critique, and synthesize answers

To the average person, the LLM Council web application looks almost identical to ChatGPT: a user types a query into a chat box. But behind the scenes, the application triggers a three-stage workflow that mirrors how human deliberative bodies work.

First, the system dispatches the user's query to a panel of frontier models. Karpathy's default configuration includes OpenAI's GPT-5.1, Google's Gemini 3.0 Pro, Anthropic's Claude Sonnet 4.5, and xAI's Grok 4. These models generate their initial responses in parallel.

In the second stage, the software performs peer review. Each model is shown the anonymized responses of the other council members and asked to rank them on accuracy and insight. This step turns each AI from a generator into a critic, enforcing a layer of quality control that is rare in standard chatbot interactions.

Finally, a designated "Chairman" (currently configured as Google's Gemini 3) receives the original query, the individual responses, and the peer rankings, and synthesizes that mass of context into a single, authoritative answer for the user.
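In code, that pipeline is compact. The sketch below is a minimal illustration of the three-stage council pattern, not Karpathy's actual implementation: the model slugs, prompts, and helper names are assumptions, and it routes everything through OpenRouter's OpenAI-compatible endpoint (the same broker the repo uses, discussed below).

```python
# Minimal sketch of the three-stage council pattern (illustrative, not the repo's code).
import asyncio

from openai import AsyncOpenAI

# OpenRouter speaks the OpenAI API, so one client covers every vendor.
# Model slugs below are assumptions; check OpenRouter's catalog for real IDs.
client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key="OPENROUTER_API_KEY")

COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]
CHAIRMAN_MODEL = "google/gemini-3-pro"


async def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its text reply."""
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


async def council(query: str) -> str:
    # Stage 1: dispatch the query to every council member in parallel.
    drafts = await asyncio.gather(*(ask(m, query) for m in COUNCIL_MODELS))

    # Stage 2: each member ranks the anonymized drafts of its peers.
    anonymized = "\n\n".join(f"Response {i + 1}:\n{d}" for i, d in enumerate(drafts))
    review = (
        f"Question: {query}\n\n{anonymized}\n\n"
        "Rank these responses by accuracy and insight."
    )
    rankings = await asyncio.gather(*(ask(m, review) for m in COUNCIL_MODELS))

    # Stage 3: the chairman folds drafts and rankings into one final answer.
    synthesis = (
        f"Question: {query}\n\nDraft answers:\n{anonymized}\n\n"
        "Peer rankings:\n" + "\n\n".join(rankings) +
        "\n\nSynthesize a single, authoritative final answer."
    )
    return await ask(CHAIRMAN_MODEL, synthesis)
```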

Karpathy said the results were often surprising. "Very often, models are surprisingly willing to choose another LLM's response as better than their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters and observed that the models consistently praised GPT-5.1 as the most insightful and ranked Claude lowest. But Karpathy's own qualitative assessment differed from his digital council's: he found GPT-5.1 "too verbose" and preferred Gemini's more "condensed and processed" output.

A case for treating FastAPI, OpenRouter, and frontier models as interchangeable components

For CTOs and platform architects, the value of LLM Council lies in its construction, not its literary criticism. The repository serves as a primary document of what a minimal viable AI stack looks like in late 2025.

The application uses a deliberately "thin" architecture. The backend is FastAPI, a modern Python framework; the frontend is a standard React application built with Vite. Data storage is handled by simple JSON files written to local disk rather than a complex database.
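The whole backend shape fits on one screen. A minimal sketch of that thin stack, with hypothetical endpoint paths and file layout rather than the repo's actual ones:

```python
# Thin-backend sketch: one FastAPI route, JSON files on disk as the "database".
import json
import uuid
from pathlib import Path

from fastapi import FastAPI
from pydantic import BaseModel

from council import council  # hypothetical module holding the orchestration sketched above

app = FastAPI()
DATA_DIR = Path("data/conversations")
DATA_DIR.mkdir(parents=True, exist_ok=True)


class Query(BaseModel):
    text: str


@app.post("/api/chat")
async def chat(query: Query) -> dict:
    answer = await council(query.text)
    record = {"id": str(uuid.uuid4()), "query": query.text, "answer": answer}
    # Persistence is one JSON file per conversation: no schema, no migrations.
    (DATA_DIR / f"{record['id']}.json").write_text(json.dumps(record, indent=2))
    return record
```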

At the heart of the operation sits OpenRouter, an API aggregator that normalizes the differences between model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, and Anthropic. The application neither knows nor cares which company is providing the intelligence; it simply sends a prompt and waits for a response.

This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped by editing a single line in a configuration file (specifically, the COUNCIL_MODELS list in the backend code), the architecture insulates the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
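The swap really is a one-line edit. An illustration, again with assumed, unverified model slugs:

```python
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
    # "google/gemini-3-pro",          # unseat a member...
    "meta-llama/new-frontier-model",  # ...or seat next week's leader, in one line
]
```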

What's missing from prototype to production: authentication, PII redaction, compliance

The core logic of LLM Council may be elegant, but the distance between a "weekend hack" and a production system is where enterprises spend their money. For an enterprise platform team, cloning Karpathy's repository is only the first step of a marathon.

A technical audit of the code reveals what's missing: the "boring" infrastructure that commercial vendors sell at a premium. The system has no authentication; anyone with access to the web interface can query the models. There is no concept of user roles, which means a junior developer has the same access as the CIO.

Furthermore, there is no governance layer. In a corporate environment, sending data to four different external AI providers simultaneously raises immediate compliance concerns. There is no mechanism to redact personally identifiable information (PII) before it leaves the local network, and no audit log to track who asked what.
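What would even a first-pass governance layer look like? A deliberately naive sketch, using regex-based redaction and a flat-file audit log; nothing like this exists in the repo, and real deployments would use dedicated PII-detection services and tamper-evident logging:

```python
# Naive governance sketch: scrub obvious PII, then record who asked what.
import json
import re
import time

# Toy patterns only; production systems use dedicated PII-detection services.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace recognizable PII before the prompt leaves the local network."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


def audit(user: str, prompt: str, providers: list[str]) -> None:
    """Append who asked what, and which providers saw it, to a local log."""
    entry = {"ts": time.time(), "user": user, "prompt": prompt, "providers": providers}
    with open("audit.log", "a") as f:
        f.write(json.dumps(entry) + "\n")
```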

Reliability is also an open question. The system assumes the OpenRouter API is always up and that every model responds in a timely fashion. It lacks the circuit breakers, fallback strategies, and retry logic needed to keep a business-critical application running through a provider outage.
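That reliability layer is equally concrete to sketch. Below, bounded retries with exponential backoff and a cross-provider fallback; this is an assumption about what a platform team would bolt on, not anything in Karpathy's code:

```python
# Reliability sketch: retry with backoff, then fail over to another provider.
import asyncio

from council import ask  # hypothetical module holding the single-model helper sketched earlier


async def ask_with_fallback(model: str, prompt: str, fallback: str, retries: int = 3) -> str:
    delay = 1.0
    for _ in range(retries):
        try:
            return await ask(model, prompt)
        except Exception:
            # Back off exponentially before retrying the same provider.
            await asyncio.sleep(delay)
            delay *= 2
    # Primary provider exhausted its retries; switch to a council peer.
    return await ask(fallback, prompt)
```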

These absences are not flaws in Karpathy's code; he made clear he has no intention of supporting or improving the project. But they define the value proposition of the commercial AI infrastructure market.

Companies like LangChain, AWS Bedrock, and various AI gateway startups are essentially in the business of "hardening" the core logic Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.

Why Karpathy thinks code is now "ephemeral" and traditional software libraries are over

Perhaps the most provocative aspect of the project is the philosophy behind how it was built. Karpathy described the development process as "99% vibe coded," meaning he relied heavily on AI assistants to generate the code rather than writing it line by line himself.

"The code is now temporary and the library is finished. Ask your LLM to change it to your liking." He writes in the repository documentation:

This statement signals a fundamental shift in software engineering. Traditionally, companies have built internal libraries and abstractions to manage complexity, then maintained them for years. Karpathy hints at a future where code is treated as cheap scaffolding: disposable, easily regenerated by AI, and not built to last.

This poses a hard strategic question for corporate decision-makers. If an internal tool can be "vibe coded" in a weekend, does it still make sense to buy an expensive, rigid software suite for internal workflows? Or should platform teams empower engineers to generate disposable, custom tools that meet their exact needs at a fraction of the cost?

When AI models judge AI: The dangerous gap between machine preferences and human needs

Beyond architecture, the LLM Council project inadvertently highlights a specific risk in automated AI deployments: the divergence between human and machine judgment.

Karpathy observed that his models preferred GPT-5.1 while he preferred Gemini, suggesting that AI models may share common biases. They may reward verbosity, particular formatting, or rhetorical confidence that doesn't necessarily align with human business needs for brevity and precision.

This discrepancy matters as companies grow more dependent on "LLM as a judge" systems to evaluate the quality of customer-facing bots. If automated graders consistently reward verbose, exhaustive answers while human customers want quick solutions, metrics will show success even as customer satisfaction plummets. Karpathy's experiment suggests that relying solely on AI to grade AI is a strategy fraught with hidden alignment problems.

What enterprise platform teams can learn from a weekend hack before building their 2026 stacks

Ultimately, LLM Council serves as a Rorschach test for the AI industry. To a hobbyist, it is a fun way to read books. To a vendor, it is a threat, proving that the core functionality of commercial products can be replicated in a few hundred lines of code.

But for enterprise technology leaders, it is a reference architecture. It demystifies the orchestration layer and shows that the real technical challenge lies in managing the data, not in routing prompts.

As platform teams head into 2026, many will study Karpathy's code not to deploy it but to understand it. It proves that a multi-model strategy is not technically out of reach. The open question is whether companies will build the governance layer themselves or pay someone else to wrap the "vibe code" in enterprise-grade armor.


