Study: Platforms that rank the latest LLMs may be unreliable | Massachusetts Institute of Technology News



Companies that want to use large language models (LLMs) to summarize sales reports or triage customer inquiries can choose from hundreds of models, each with dozens of variations and slightly different performance.

To narrow down their choices, companies often rely on LLM ranking platforms, which collect user feedback on model interactions and rank the latest LLMs by how well they perform on specific tasks.

However, MIT researchers found that a small number of user interactions can skew these results, leading people to mistakenly believe that one LLM is the ideal choice for a given use case. Their analysis showed that removing a tiny fraction of the crowdsourced data can change which models rank highest.

They developed a fast method to test ranking platforms and determine whether they are susceptible to this issue. The method identifies the individual votes most responsible for skewing the results, so users can examine those influential votes.

The researchers say the study highlights the need for more rigorous strategies for evaluating model rankings. While the study did not focus on mitigation, it offers suggestions that could improve the robustness of these platforms, such as collecting more detailed feedback when building rankings.

The study also serves as a caution for anyone who relies on rankings when making decisions about LLMs, since those decisions can have far-reaching and costly implications for businesses and organizations.

“We were surprised that these ranking platforms were so sensitive to this issue. If we find that the top-ranked LLM relies on only two or three out of tens of thousands of user votes, we can’t assume that the top-ranked LLM will consistently outperform all other LLMs upon deployment,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS), an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and the study’s senior author.

Broderick is joined on the paper by lead authors Jenny Huang and Yunyi Shen, both EECS graduate students, and by Dennis Wei, a senior research scientist at IBM Research. The research will be presented at the International Conference on Learning Representations.

Dropping data

There are many kinds of LLM ranking platforms, but in the most common variation, users submit the same query to two models and vote for the LLM that provides the better response.

The platform aggregates the results of these matches to create a ranking that shows which LLMs performed best on specific tasks such as coding or visual comprehension.
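The article doesn’t say which aggregation algorithm each platform uses; arena-style leaderboards commonly fit a Bradley-Terry model to the pairwise votes, so here is a minimal sketch under that assumption (the model names and vote format are illustrative, not from the study):

```python
from collections import Counter

def bradley_terry(votes, iters=200):
    """Rank models from pairwise votes [(winner, loser), ...] by fitting
    Bradley-Terry strengths with a minorization-maximization loop."""
    models = sorted({m for pair in votes for m in pair})
    wins = Counter(winner for winner, _ in votes)
    n_pair = Counter(frozenset(v) for v in votes)      # comparisons per pair
    p = {m: 1.0 for m in models}                       # strength parameters
    for _ in range(iters):
        new_p = {}
        for m in models:
            denom = sum(n_pair[frozenset((m, o))] / (p[m] + p[o])
                        for o in models if o != m)
            new_p[m] = wins[m] / denom if denom else p[m]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}   # normalize each pass
    return sorted(models, key=p.get, reverse=True)

# Hypothetical vote log: "gpt-x" usually beats "llama-y", which usually
# beats "mistral-z".
votes = ([("gpt-x", "llama-y")] * 8 + [("llama-y", "gpt-x")] * 2 +
         [("llama-y", "mistral-z")] * 7 + [("mistral-z", "llama-y")] * 3 +
         [("gpt-x", "mistral-z")] * 9 + [("mistral-z", "gpt-x")] * 1)
print(bradley_terry(votes))   # strongest model first
```

Because each vote feeds directly into the fitted strengths, a handful of votes on a closely contested pair can move a model up or down the final list.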

By selecting the top-performing LLM, users implicitly expect that the model’s high ranking will generalize: that it will continue to outperform other models in similar, but not identical, applications with new data.

The MIT researchers had previously studied this kind of generalization in fields such as statistics and economics, identifying specific cases where removing a small amount of data changes a study’s results, indicating that its conclusions may not hold beyond narrow settings.

The researchers wanted to see if the same analysis could be applied to LLM ranking platforms.

“At the end of the day, users want to know if they are choosing the best LLM, and if only a small number of prompts are driving this ranking, it suggests that the ranking may not be final,” Broderick says.

However, testing this data-dropping phenomenon by hand is impossible. For example, one ranking the researchers evaluated comprised more than 57,000 votes. Testing a 0.1 percent data drop would mean removing every possible subset of 57 votes out of 57,000 (there are more than 10^194 such subsets) and recalculating the ranking each time.
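The combinatorics behind that infeasibility claim can be checked directly; a quick sketch using the 57,000-vote and 0.1 percent figures from the text above:

```python
import math

n_votes = 57_000
k = round(n_votes * 0.001)        # a 0.1 percent drop is 57 votes
subsets = math.comb(n_votes, k)   # number of distinct 57-vote subsets
digits = len(str(subsets))        # order-of-magnitude check
print(k, digits)                  # 57 195  -> roughly 10^194 subsets
```

Even recalculating a billion rankings per second, exhaustively checking every subset would take unimaginably longer than the age of the universe, which is why an approximation is needed.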

Instead, the researchers built on prior work to develop an efficient approximation method, which they adapted to the LLM ranking setting.

“There is theory that proves that the approximation works under certain assumptions, but users don’t have to trust it. With our method, users are notified of problematic data points at the end, so all they have to do is remove those data points, rerun the analysis, and see if the ranking changes,” she says.
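As an illustration of the remove-and-rerun check described above, here is a deliberately simplified sketch. It uses a plain win-rate ranking and a naive influence proxy rather than the researchers’ actual approximation, and all function names and data are hypothetical:

```python
from collections import Counter

def top_model(votes):
    # Rank by raw win rate; a stand-in for the platform's real aggregation.
    wins, games = Counter(), Counter()
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return max(games, key=lambda m: wins[m] / games[m])

def most_influential(votes, k):
    # Naive proxy: votes won by the current leader erode its margin most
    # when removed. (The paper uses a principled approximation instead.)
    leader = top_model(votes)
    return [i for i, (winner, _) in enumerate(votes) if winner == leader][:k]

def drop_and_rerun(votes, k):
    # The check users can run themselves: drop the k flagged votes,
    # recompute, and see whether the top-ranked model changes.
    dropped = set(most_influential(votes, k))
    kept = [v for i, v in enumerate(votes) if i not in dropped]
    return top_model(votes), top_model(kept)

# A toy leaderboard where model "A" leads "B" by a single vote:
votes = [("A", "B")] * 6 + [("B", "A")] * 5
print(drop_and_rerun(votes, 2))   # dropping two flagged votes flips the leader
```

The point of the workflow is that the final verdict never rests on trusting the approximation: the user confirms the ranking flip by rerunning the exact analysis on the reduced data.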

Surprisingly sensitive

When the researchers applied their technique to a popular ranking platform, they were surprised at how few data points needed to be removed to cause significant changes in the top LLMs. In one example, removing just 2 votes (0.0035%) from over 57,000 votes changed which model ranked at the top.

Another ranking platform, which used expert annotators and high-quality prompts, was more robust: there, it took removing 83 of 2,575 ratings (about 3 percent) to flip the top model.

Their analysis found that many influential votes may be the result of user error. In some cases, there seemed to be a clear answer as to which LLM performed better, but the user chose the other model, Broderick says.

“You can’t know what was going through the user’s head at the time, but maybe they made a wrong click, weren’t paying attention, or honestly didn’t know which one was better. The takeaway here is that you don’t want noise, user error, or outliers to determine which is the top-ranked LLM,” she added.

The researchers suggest that collecting additional feedback from users, such as the confidence level of each vote, could provide richer information to help alleviate this problem. Ranking platforms can also use human intermediaries to evaluate crowdsourced responses.

The researchers would like to continue exploring generalization in other contexts while developing better approximation techniques that can capture more instances of non-robustness.

“The work of Broderick and her students shows how valid estimates of the impact of specific data on downstream processes can be obtained, even though thorough calculations are difficult given the size of modern machine learning models and datasets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved in the study. “Recent research provides a glimpse into powerful data dependencies in the routinely applied, yet highly fragile, methods of aggregating human preferences and using them to update models. Seeing how few preferences can actually change the behavior of fine-tuned models could inspire more thoughtful ways to collect these data.”

This research was funded in part by the Office of Naval Research, MIT-IBM Watson AI Lab, National Science Foundation, Amazon, and a CSAIL Seed Award.