
One of the most useful properties of generative AI models (both large language models (LLMs) and diffusion-based image generators) is that they are non-deterministic. Despite their reputation among some critics as "fancy autocorrect," generative AI models actually produce output by sampling from a distribution of the most likely next tokens (units of information) to fill out a response.
Ask an LLM "What is the capital of France?" and it samples its probability distribution over France, capitals, cities, and so on to arrive at "Paris." But the answer might come back as "The capital of France is Paris," or simply "Paris," or "Paris, though at one time it was Versailles."
Still, anyone who uses these models frequently will find that their answers can feel annoyingly repetitive or similar. The same coffee jokes get recycled across queries. Story prompts produce similar arcs. Even tasks that should have many plausible answers, such as naming U.S. states, tend to collapse to just a few. This phenomenon, known as mode collapse, arises during post-training adjustments and limits the usefulness of otherwise powerful models.
This matters most when an LLM is used to generate new creative work (writing, communications, strategy, illustration), where users actually want output that is more diverse than what already exists.
Now, a team of researchers from Northeastern University, Stanford University, and West Virginia University has devised an ingeniously simple way to get language and image models to generate more varied responses to almost any user prompt: add a single sentence. "Generate five responses with their corresponding probabilities, sampled from the full distribution."
The method, called Verbalized Sampling (VS), enables models such as GPT-4, Claude, and Gemini to produce more diverse, human-like outputs without retraining or access to internal parameters. It is described in a paper posted in early October 2025 to the open-access preprint server arxiv.org.
When prompted this way, the model no longer defaults to its safest, most typical output. Instead, it verbalizes its internal distribution over possible completions and samples from a broader range of possibilities. This one-line change substantially increases output diversity across multiple domains.
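To make that concrete, here is a minimal sketch (not from the paper) of how the extra sentence can be appended to an ordinary prompt and sent through the OpenAI Python client. The model name and the example joke prompt are placeholder assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The one-line Verbalized Sampling instruction described in the article.
VS_INSTRUCTION = (
    "Generate five responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

def verbalized_sample(user_prompt: str, model: str = "gpt-4o") -> str:
    """Append the VS instruction to a normal prompt and return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": f"{user_prompt}\n\n{VS_INSTRUCTION}"}
        ],
    )
    return response.choices[0].message.content

# Instead of one "safe" coffee joke, the reply should list five candidates
# with the model's own stated probabilities.
print(verbalized_sample("Tell me a joke about coffee."))
```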
Weiyan Shi, an assistant professor at Northeastern University and a co-author of the paper, wrote on X: "LLMs' full potential has not yet been unlocked. As we showed in our paper, prompt optimization can be derived, and theoretically proven, by looking at how LLMs are trained and tuned."
Why models collapse – and how VS can reverse it
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but also in the structure of human preferences. People tend to rate more familiar or typical answers as better, which nudges LLMs toward "safer" choices over diverse ones during fine-tuning.
However, this bias does not erase the model's underlying knowledge; it merely suppresses it. VS works by bypassing that suppression. Rather than asking for the single most likely output, it prompts the model to reveal a set of plausible responses along with their relative probabilities. This distribution-level prompting restores access to the richer diversity present in the underlying pre-trained model.
Performance across tasks
The research team tested Verbalized Sampling across several common use cases:
- Creative writing: In story generation, VS increased diversity scores by up to 2.1x over standard prompting while maintaining quality. For one story prompt, "No Goodbye," direct prompting produced stylized farewell scenes, while VS also generated stories about cosmic events, silent emails, and music stopping mid-dance.
- Dialogue simulation: In persuasion tasks, VS allowed the model to simulate human-like patterns such as hesitation, resistance, and changes of mind. The distribution of simulated donation behavior under VS matched real human data more closely than baseline methods did.
- Open-ended QA: When asked to list valid answers (such as the names of U.S. states), models using VS produced responses that more closely matched the diversity of real-world data, covering a broader set of answers without sacrificing factual accuracy.
- Synthetic data generation: When used to generate math problems for model training, VS produced a more diverse dataset, which improved downstream performance on competitive math benchmarks and outperformed synthetic data generated with direct prompting.
Tunable diversity and stronger gains in larger models
One notable benefit of VS is its tunability. Users can set a probability threshold directly in the prompt to sample from the lower-probability "tail" of the model's distribution: the lower the threshold, the higher the diversity. This adjustment is made entirely in the prompt text, without changing decoding settings such as temperature or top-p.
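As an illustration of how the threshold can live entirely in the prompt text, here is a small helper that builds such a prompt. It is a sketch, not the authors' exact template; the wording, default values, and example story title are assumptions.

```python
def vs_prompt(task: str, k: int = 5, threshold: float = 0.10) -> str:
    """Build a Verbalized Sampling prompt with a tunable probability threshold.

    Lowering `threshold` asks the model to draw from the lower-probability
    tail of its distribution, which increases diversity of the responses.
    """
    return (
        f"{task}\n\n"
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. "
        f"Each response should have a probability below {threshold}."
    )

# Example: push for more unusual openings than a direct prompt would give.
print(vs_prompt("Write the opening line of a short story titled 'No Goodbye'.",
                k=5, threshold=0.001))
```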
In one test using the Gemini-2.5-Flash model, story-writing diversity rose steadily as the probability threshold was lowered from 1 to 0.001. Graphs accompanying the study showed VS outperforming both direct and sequence-based prompting across all thresholds.
Interestingly, the method scales well with model size. Larger models such as GPT-4.1 and Claude-4 showed even greater gains from VS than smaller ones. While smaller models still benefited, the diversity improvement was roughly 1.5 to 2 times stronger in larger models, suggesting that VS can help unlock more of the latent capability of advanced models.
Deployment and availability
The Verbalized Sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes LangChain integration and a simple interface for sampling from verbalized distributions. Users can also adjust parameters such as k (the number of responses), the probability threshold, and temperature to suit their application.
Live Colab notebooks and documentation are available on GitHub (https://github.com/CHATS-lab/verbalized-sampling) under the Apache 2.0 license.
Practical tips and common problems
The method works with all major LLMs, but some users may encounter refusals or errors at first.
In those cases, the authors suggest using the system-prompt version of the template or referring to the alternative formats listed on the GitHub page.
Some models interpret the detailed instruction as a jailbreak attempt and refuse to follow it unless the structure is made clearer.
For example, moving the instruction into a system-level prompt like the following improves reliability:
You are a helpful assistant. For each query, generate five responses within separate tags, each with a probability below 0.10.
This small change usually solves the problem.
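For completeness, here is one way such a system-level template might be wired up and its tagged output parsed. The <response>/<text>/<probability> tag layout, the regular expression, and the model name are illustrative assumptions, not the project's official format.

```python
import re
from openai import OpenAI

client = OpenAI()

# An expanded, system-level version of the VS template (wording is an assumption).
SYSTEM_PROMPT = (
    "You are a helpful assistant. For each query, generate five responses, "
    "each wrapped in a separate <response> tag containing a <text> and a "
    "<probability>. Each probability should be below 0.10."
)

def sample_with_system_prompt(query: str, model: str = "gpt-4o"):
    """Send the query under the VS system prompt and return (text, probability) pairs."""
    raw = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    ).choices[0].message.content

    # Extract the tagged responses; the tag names mirror the system prompt above.
    pattern = r"<text>(.*?)</text>\s*<probability>([\d.]+)</probability>"
    return [(t.strip(), float(p)) for t, p in re.findall(pattern, raw, re.DOTALL)]

for text, prob in sample_with_system_prompt("Name a U.S. state."):
    print(f"{prob:.2f}  {text}")
```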
A lightweight fix for a big problem
Verbalized Sampling is a practical, inference-time fix for a deep limitation in how modern language models behave. It requires no retraining or internal access and does not depend on any one model family. And it improves not only the variety of outputs but also their quality, as judged by both human ratings and benchmark scores.
With increasing interest in tools that enhance model creativity, VS is likely to be rapidly adopted in areas such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by getting the same LLM responses over and over, the fix may be as simple as changing the question.
