
Is this movie review a rave or a pan? Is this news article about business or technology? Is this online chatbot conversation veering toward giving financial advice? Does this online medical information site give out misinformation?
These kinds of automated evaluations are becoming more and more common, whether the text in question is a review of a movie or a restaurant, or information about bank accounts or health records. More than ever, such judgments are being made not by humans but by highly sophisticated algorithms known as text classifiers. But how can we tell how accurate these classifications really are?
Now, a team at MIT's Laboratory for Information and Decision Systems (LIDS) has come up with an innovative approach that not only measures how well these classifiers are doing their job, but goes one step further and shows how they can be made more accurate.
The new evaluation and repair software was developed by Kalyan Veeramachaneni, a principal research scientist at LIDS, and his students Lei Xu and Sarah Alnegheimish. The software package is freely available for download by anyone who wants to use it.
A standard way to test these classification systems is to create what are known as synthetic examples: sentences that closely resemble ones that have already been classified. For instance, researchers might take a sentence that a classifier program has already tagged as a rave review and change a word or a few words while keeping the same meaning, to see whether the classifier can be fooled into calling it a pan. Or a sentence that was determined to be misinformation might be misclassified as accurate. Sentences that can fool the classifiers in this way are called adversarial examples.
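To picture the procedure, here is a minimal sketch in Python; the toy classifier and the small hand-written synonym list are made-up stand-ins, not the team's code. It swaps one word at a time in an already-labeled sentence and reports any swap that flips the classifier's label.

```python
# Minimal sketch of single-word adversarial probing (toy components throughout).

def classify(sentence: str) -> str:
    """Placeholder for any text classifier; here a crude rave/pan rule."""
    return "rave" if "love" in sentence.lower() else "pan"

# Hypothetical meaning-preserving substitutions for a few words.
SYNONYMS = {"loved": ["adored", "enjoyed"], "great": ["wonderful", "terrific"]}

def single_word_variants(sentence: str):
    """Yield sentences that differ from the original by exactly one word."""
    words = sentence.split()
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word.lower(), []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])

original = "I loved this movie , the acting was great"
base_label = classify(original)
for variant in single_word_variants(original):
    if classify(variant) != base_label:
        print("Adversarial example found:", variant)
```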
People have tried various ways to find the vulnerabilities in these classifiers, Veeramachaneni says. But existing methods struggle with the task, he says, and miss many examples that they ought to catch.
More and more companies are trying to use such evaluation tools in real time, monitoring the output of chatbots used for a variety of purposes to make sure they do not put out improper responses. For example, a bank might use a chatbot to respond to routine customer queries, such as checking account balances or applying for a credit card, but it needs to ensure that its answers could never be interpreted as financial advice, which could expose the company to liability. "Before showing the chatbot's response to the end user, they want to use a text classifier to detect whether it's giving financial advice," Veeramachaneni says. But then it is important to test that classifier to see how reliable its evaluations are.
"These chatbots, or summarization engines, and so on are being set up across the board," he says, to deal with external customers as well as within organizations, for example to provide information about HR issues. It is important to put these text classifiers in the loop to detect things the chatbots are not supposed to say and to filter them out before the output is sent to the user.
That is where adversarial examples come in: sentences that have already been classified but that produce a different response when slightly modified while keeping the same meaning. How can people confirm that the meaning is the same? By using another large language model (LLM) that interprets and compares meanings. If the LLM says two sentences mean the same thing but the classifier labels them differently, "that is an adversarial sentence; it can fool the classifier," Veeramachaneni says. And when the researchers examined these adversarial sentences, "we found that most of the time this was just a one-word change."
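That check can be wired together roughly as follows. In this sketch the meaning test is a crude word-overlap stand-in for the LLM judgment, and the classifier is again a toy placeholder; neither is the actual system described in the article.

```python
# Sketch: a variant counts as adversarial only if it keeps the original meaning
# (crudely approximated here) while the classifier flips its label.

def crude_meaning_check(a: str, b: str) -> bool:
    """Stand-in for the LLM judgment: high word overlap taken as 'same meaning'.
    In the real workflow an LLM would compare the two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) > 0.7

def classify(sentence: str) -> str:
    """Placeholder for the text classifier under test."""
    return "advice" if "should" in sentence.lower() else "no_advice"

def is_adversarial(original: str, variant: str) -> bool:
    same_meaning = crude_meaning_check(original, variant)
    labels_differ = classify(original) != classify(variant)
    return same_meaning and labels_differ

print(is_adversarial("You should move your savings into stocks",
                     "You could move your savings into stocks"))
```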
Further analysis, using LLMs to examine many thousands of examples, showed that certain specific words had an outsized influence on changing classifications, so testing a classifier's accuracy can focus on the small subset of words that seem to make the most difference. They found that one-tenth of 1 percent of all 30,000 words in the system's vocabulary could account for almost half of these classification reversals in some specific applications.
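The bookkeeping behind that observation can be pictured like this: across many discovered adversarial pairs, tally which substituted word was responsible for each flip and see how concentrated the counts are. The sentence pairs below are invented for illustration.

```python
from collections import Counter

# Sketch: given (original, adversarial variant) pairs that differ by one word,
# tally which substituted word triggered each classification flip.

adversarial_pairs = [
    ("I loved this movie", "I adored this movie"),
    ("The plot was great", "The plot was terrific"),
    ("I loved the soundtrack", "I adored the soundtrack"),
]

def changed_word(original: str, variant: str) -> str:
    """Return the single word in the variant that differs from the original."""
    for a, b in zip(original.split(), variant.split()):
        if a != b:
            return b
    return ""

flip_counts = Counter(changed_word(o, v) for o, v in adversarial_pairs)
total = sum(flip_counts.values())
for word, count in flip_counts.most_common():
    print(f"{word}: {count / total:.0%} of observed flips")
```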
Lei Xu PhD ’23, a recent LIDS graduate, performed much of the analysis as part of his thesis work. The goal is to allow much narrower, more targeted searches rather than examining all possible word substitutions, thus making the computational task of generating adversarial examples more manageable. "He is using a large language model, interestingly, as a way to understand the power of a single word."
The team then uses LLMs to find other words closely related to these powerful words, and so on, allowing an overall ranking of words according to their influence on the outcomes. Once these adversarial sentences have been found, they can in turn be used to retrain the classifier to take them into account, increasing its robustness against those mistakes (see the sketch below).
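In outline, the retraining step amounts to folding the discovered adversarial sentences, with their correct labels, back into the training data and refitting. The sketch below uses a simple scikit-learn pipeline and toy data rather than the team's released package.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sketch: retrain a simple classifier with adversarial sentences added back in
# under their correct labels. The data here is a tiny invented example.

train_texts = ["I loved this movie", "Terrible plot and bad acting"]
train_labels = ["rave", "pan"]

# Adversarial sentences found earlier, paired with the label they should get.
adv_texts = ["I adored this movie", "I enjoyed this movie"]
adv_labels = ["rave", "rave"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts + adv_texts, train_labels + adv_labels)
print(model.predict(["I adored this film"]))
```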
Making a classifier more accurate may not sound like a big deal if it is just a matter of sorting news articles into categories, or deciding whether reviews of anything from movies to restaurants are positive or negative. But classifiers are increasingly used in settings where the outcomes really matter, whether that is preventing the inadvertent release of sensitive medical, financial, or security information, helping to guide important research such as work on the properties of chemical compounds or the folding of proteins for biomedical applications, or identifying and blocking hate speech or known misinformation.
As a result of this research, the team introduced a new metric, which they call p, that provides a measure of how robust a given classifier is against single-word attacks. And because of the importance of such misclassifications, the team has made its software available as open access for anyone to use. The package consists of two components: one generates adversarial sentences to test classifiers in a particular application, and the other retrains the classifier using those adversarial sentences to improve its robustness.
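One plausible way to read such a single-word robustness score (an illustrative definition, not necessarily the paper's exact formula for p) is the fraction of test sentences whose label survives every meaning-preserving single-word substitution:

```python
# Illustrative single-word robustness score, building on the helpers sketched
# above (a classifier, a variant generator, and a meaning check).

def single_word_robustness(sentences, classify, single_word_variants, means_same):
    """Fraction of sentences whose label cannot be flipped by any
    meaning-preserving single-word substitution the generator proposes."""
    robust = 0
    for sentence in sentences:
        label = classify(sentence)
        flipped = any(
            means_same(sentence, variant) and classify(variant) != label
            for variant in single_word_variants(sentence)
        )
        robust += not flipped
    return robust / len(sentences)
```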
In some tests, where competing methods of testing classifier output allowed a 66 percent success rate for adversarial attacks, the team's system cut that attack success rate almost in half, to 33.7 percent. In other applications the improvement was as little as a 2 percent difference, but even that can be quite important, Veeramachaneni says, because these systems are used for billions of interactions, so even a small percentage can affect millions of transactions.
The team's results were published July 7 in the journal Expert Systems, in a paper by Xu, Veeramachaneni, and Alnegheimish of LIDS, along with Laure Berti-Equille of IRD in Marseille, France, and Alfredo Cuesta-Infante of the Universidad Rey Juan Carlos in Spain.
