As we mature from childhood, our vocabulary, as well as the ways we use it, grows, and our experiences become richer, allowing us to think, reason, and interact with others with specificity and intention. Accordingly, our word choices evolve to align with our personal values, ethics, cultural norms, and views. Over time, most of us develop an internal “guide” that lets us learn the context behind a conversation; it also frequently steers us away from sharing information and sentiments that are harmful or inappropriate. As it turns out, large language models (LLMs), which are trained on extensive public datasets and therefore often have bias and toxic language baked in, can acquire a similar capacity to moderate their own language.
A new method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called self-disciplined autoregressive sampling (SASA), allows LLMs to detoxify their own outputs without sacrificing fluency.
Unlike other detoxification methods, this decoding algorithm learns the boundary between toxic and nontoxic subspaces within the LLM’s own internal representation, without altering the model’s parameters, requiring retraining, or relying on an external reward model. Then, during inference, the algorithm assesses the toxicity value of the partially generated phrase: the tokens (words) already generated and accepted, together with each potential new token that could reasonably be chosen, relative to the classifier boundary. Next, it selects a word option that places the phrase in the nontoxic space, ultimately offering a fast and efficient way to generate less-toxic language.
“We wanted to find a way, with any existing language model, in which during the generation process the decoding can be subject to human values; the example we are using here is toxicity,” says study author Ching-Yun “Irene” Ko PhD ’24.
Ko’s co-authors include Luca Daniel, a professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and Ko’s graduate advisor, as well as several members of the MIT-IBM Watson AI Lab and/or IBM Research: Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, and Tejaswini Pedapati. The work will be presented at the International Conference on Learning Representations.
Finding the “guardrails”
The training resources behind LLMs almost always include content collected from public spaces like the internet and other readily available datasets. As such, curse words and bullying or unpleasant language are part of the mix, even if some of it appears in the context of literary works. It follows that LLMs can innately produce, or be tricked into generating, dangerous and/or biased content, which often contains offensive or hateful words, even from innocuous prompts. Further, it has been found that they can learn and amplify language that is undesirable or even harmful for many applications and downstream tasks, leading to the need for mitigation or correction strategies.
There are many ways to achieve fair, valuable, and robust language generation. Some methods retrain the LLM on a sanitized dataset, which can be expensive, time-consuming, and may alter the LLM’s performance; others employ external reward models in decoding strategies such as sampling or beam search, which take longer to run and require more memory. In the case of SASA, Ko, Daniel, and the IBM Research team developed a method that exploits the autoregressive nature of LLMs, using a decoding-based strategy during inference to gradually steer the generation away from unwanted output.
The research group achieved this by building a linear classifier that operates on the learned subspace of the LLM’s embedding. When LLMs are trained, words with similar meanings are placed close together in vector space and farther away from dissimilar words; the researchers therefore hypothesized that an LLM’s embedding also captures contextual information that can be used for detoxification. They used datasets that contained sets of a prompt (the first half of a sentence or thought), a response (the completion of that sentence), and human-attributed annotations, such as toxic or nontoxic. A Bayes-optimal classifier was then applied to learn and, figuratively, draw a line between the binary subspaces within the sentence embeddings, represented by positive values (the nontoxic space) and negative values (the toxic space).
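To make the idea concrete, here is a minimal, hypothetical sketch of such a classifier, not the authors’ implementation: it assumes a mean-pooled hidden-state embedding as the sentence representation, a tiny toy dataset in place of the annotated corpora, and an off-the-shelf logistic regression standing in for the Bayes-optimal classifier described above.

```python
# Illustrative sketch only: fit a linear classifier on sentence embeddings
# taken from a language model's hidden states, separating nontoxic from
# toxic completions. The embedding choice and the data are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large", output_hidden_states=True)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer as a simple sentence embedding
    (an assumed choice; the paper defines its own representation)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1]   # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)             # shape (dim,)

# Toy labeled completions: 1 = nontoxic, 0 = toxic (real data would be
# a large human-annotated corpus).
labeled = [("Have a wonderful day.", 1), ("You are a worthless idiot.", 0)]
X = torch.stack([sentence_embedding(text) for text, _ in labeled]).numpy()
y = [label for _, label in labeled]

# Linear boundary whose signed decision value separates the nontoxic
# (positive) and toxic (negative) subspaces.
clf = LogisticRegression(max_iter=1000).fit(X, y)
```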
The SASA system works by re-weighting the sampling probabilities of the newest potential token, based on its value and the generated phrase’s distance to the classifier boundary, with the goal of staying close to the original sampling distribution.
To illustrate, if a user is generating potential token #12 in a sentence, the LLM will look for a reasonable word based on the 11 words that came before it, and, using top-k and top-p filtering, it will produce roughly 10 tokens to choose from. SASA then evaluates each of those tokens within the partially completed sentence for its proximity to the classifier boundary (that is, the value of tokens 1-11 plus each potential token 12). Tokens that produce sentences in the positive space are encouraged, while those in the negative space are penalized; additionally, the farther a sentence is from the classifier boundary, the stronger the effect.
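As a rough sketch of that re-weighting step (an approximation of the idea rather than SASA itself, reusing the hypothetical `clf` and `sentence_embedding` from the sketch above), one could shift each candidate token’s logit by the signed distance of the resulting continuation from the classifier boundary, scaled by an assumed strength parameter `beta`:

```python
import torch

def reweight_and_sample(logits, candidate_ids, context_text, tokenizer,
                        clf, sentence_embedding, beta=5.0):
    """Illustrative re-weighting of candidate next tokens: boost candidates
    whose continuation lands on the nontoxic side of the linear classifier,
    penalize those on the toxic side. `logits` is a 1-D tensor over the
    vocabulary; `candidate_ids` are the ~10 tokens kept by top-k/top-p."""
    adjusted = logits.clone()
    for tok_id in candidate_ids:
        continuation = context_text + tokenizer.decode([tok_id])
        emb = sentence_embedding(continuation).numpy().reshape(1, -1)
        margin = clf.decision_function(emb)[0]   # >0 nontoxic side, <0 toxic side
        adjusted[tok_id] = logits[tok_id] + beta * float(margin)
    # Sample the next token from the re-weighted distribution over candidates,
    # which stays close to the original distribution when the margins are small.
    probs = torch.softmax(adjusted[candidate_ids], dim=-1)
    return candidate_ids[torch.multinomial(probs, 1).item()]
```

This sketch re-embeds the full continuation for every candidate, the simplest possible formulation; the essential point is only that candidates on the nontoxic side of the boundary gain probability mass while those on the toxic side lose it, with the effect growing as the margin grows.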
“The goal is to change the autoregressive sampling process by re-weighting the probability of good tokens. If the next token is likely to be toxic given the context, then we reduce the sampling probability of those tokens that are prone to be toxic,” says Ko. The researchers chose this approach because a token’s toxicity depends on its context.
Reducing toxicity for value alignment
The researchers evaluated their method against several baseline interventions using three LLMs of increasing size, all of them autoregressive transformers: GPT2-Large, Llama2-7B, and Llama 3.1-8B-Instruct, with 762 million, 7 billion, and 8 billion parameters, respectively. For each prompt, the LLM was tasked with completing the sentence or phrase 25 times, and PerspectiveAPI scored each completion from 0 to 1, with anything over 0.5 counted as toxic. The team considered two metrics: the average maximum toxicity score over the 25 generations for all prompts, and the toxic rate, the probability that at least one of the 25 generations was a toxic phrase. Decreased fluency (and therefore increased perplexity) was also analyzed. SASA was tested on completing the RealToxicityPrompts (RPT), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.
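As an illustration of those two metrics (not the official evaluation code), given per-prompt lists of toxicity scores from a scorer such as Perspective API, they could be computed along these lines:

```python
import statistics

def toxicity_metrics(scores_per_prompt, threshold=0.5):
    """Illustrative computation of the two reported metrics.

    scores_per_prompt: one list per prompt, holding the toxicity scores
    (0 to 1) of the 25 completions generated for that prompt.
    Returns (expected maximum toxicity, toxic rate), where the toxic rate
    is the fraction of prompts with at least one completion above the
    threshold."""
    max_scores = [max(scores) for scores in scores_per_prompt]
    expected_max_toxicity = statistics.mean(max_scores)
    toxic_rate = statistics.mean(1.0 if m > threshold else 0.0 for m in max_scores)
    return expected_max_toxicity, toxic_rate

# Made-up example with two prompts and three completions each:
print(toxicity_metrics([[0.1, 0.7, 0.3], [0.2, 0.4, 0.1]]))  # roughly (0.55, 0.5)
```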
The researchers ramped up the complexity of their detoxification trials for SASA, beginning with nontoxic prompts from the RPT dataset and checking for harmful sentence completions. Then they escalated to more challenging prompts from RPT that were more likely to produce concerning results, and also applied SASA to an instruction-tuned model to assess whether the technique could further reduce unwanted outputs. They also used the BOLD and AttaQ benchmarks to examine the general applicability of SASA for detoxification. With the BOLD dataset, the researchers further looked for gender bias in language generation and tried to achieve a balanced toxic rate between the genders. Lastly, the team looked at runtime, memory usage, and how SASA could be combined with word filtering to achieve healthy and/or helpful language generation.
“If we think about how humans think and react in the world, we do see bad things, so it’s not about allowing the language model to see only the good things; it’s about understanding both the good and the bad,” says Ko.
Overall, SASA achieved significant reductions in toxic language generation, performing on par with RAD, a state-of-the-art external reward model technique. It was universally observed, however, that stronger detoxification came with a decrease in fluency. Before intervention, the LLMs produced more toxic responses for female-labeled prompts than for male-labeled ones; SASA, however, was able to significantly cut down harmful responses as well, making them more equalized. Similarly, word filtering on top of SASA lowered toxicity levels even further, but it also hampered the LLM’s ability to respond coherently.
A great aspect of this work is that it is a well-defined, constrained optimization problem, says Ko, meaning that the balance between open-ended language generation that sounds natural and the need to reduce unwanted language can be achieved and tuned.
Furthermore, SASA could work well for multiple attributes in the future, says Ko: “Humans have multiple human values. We don’t want to say toxic things, but we also want to be truthful, helpful, and loyal. Because SASA is a lightweight method, it can easily be applied to these circumstances: if you want to work with multiple values, you simply check the position of the generation in multiple subspaces. It adds only marginal overhead in terms of computation and parameters.”
This work was supported in part by the MIT-IBM Watson AI Lab and the National Science Foundation.