New method could improve efficiency of LLM training | Massachusetts Institute of Technology News



Reasoning large language models (LLMs) are designed to solve complex problems by breaking them down into a series of smaller steps. These powerful models excel at difficult tasks such as advanced programming and multistep planning.

However, developing a reasoning model requires an enormous amount of computation and energy, in part because the training process is inefficient: some high-powered processors keep churning through complex queries while other processors in the group sit idle.

Researchers at MIT and elsewhere have found a way to harness this computational downtime to accelerate the training of reasoning models.

Their new method automatically trains a smaller, faster model to predict the output of a larger reasoning LLM; the larger model then verifies those predictions. This reduces the amount of work the reasoning model must perform itself and speeds up the training process.

The key to the system is that it trains and deploys the small model adaptively, kicking in only when some processors are idle. By harnessing otherwise wasted computational resources, it accelerates training without adding overhead.

In tests with multiple reasoning LLMs, the method roughly doubled training speed while maintaining accuracy. This could reduce the cost of developing advanced LLMs for applications such as financial trend forecasting and power grid risk detection, while also improving energy efficiency.

“People want models that can handle more complex tasks,” says Qinghao Hu, an MIT postdoc and co-lead author of a paper on the technique. “But if that is the goal of model development, efficiency must be a priority. We found a lossless solution to this problem and built a full-stack system that actually achieves significant speedups.”

Hu is joined on the paper by co-lead authors Shang Yang and Junxian Guo, both graduate students in electrical engineering and computer science (EECS); senior author Song Han, an associate professor in EECS, a member of the Research Laboratory of Electronics, and a distinguished scientist at NVIDIA; and other researchers at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.

A training bottleneck

Developers want a reasoning LLM to identify and correct mistakes in its own chain of thought. This ability lets it handle complex queries that would trip up a standard LLM.

To teach this skill, developers train a reasoning LLM using a technique called reinforcement learning (RL). The model generates multiple candidate answers to a query, receives a reward for the best ones, and is updated based on the top answer. These steps are repeated thousands of times as the model learns.
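The loop described above can be sketched in a few lines of Python. This is a toy illustration of the rollout-reward-update pattern, not the researchers' actual training code; the `generate`, `update`, and `reward_fn` callables are placeholders standing in for the real model and reward machinery.

```python
def rl_training_step(generate, update, reward_fn, query, n_candidates=4):
    """One RL step for a reasoning model, as described above (toy sketch).

    `generate`, `update`, and `reward_fn` are placeholders for the real
    model and reward machinery.
    """
    # Rollout: the model generates several candidate answers to the query.
    candidates = [generate(query) for _ in range(n_candidates)]
    # Reward: score each candidate and pick the best one.
    scores = [reward_fn(c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    # Update: nudge the model toward the highest-reward answer.
    update(query, best)
    return best
```

In real systems, the rollout line is by far the most expensive part of this loop, which is the bottleneck the researchers set out to attack.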

However, the researchers found that generating these multiple answers, a process called rollout, can consume as much as 85 percent of the total execution time of RL training.

“The actual ‘training’ part, updating the model, takes very little time by comparison,” Hu says.

This bottleneck arises because, in standard RL algorithms, every processor in a training group must finish generating its responses before the next step can begin. Since some processors may be producing very long responses, the processors that generated shorter responses sit idle until the stragglers are done.
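A quick back-of-the-envelope calculation shows how severe this stall can be. The timings below are invented purely for illustration; they are not measurements from the paper.

```python
def idle_fraction(generation_times):
    """Fraction of total processor-time wasted waiting for the slowest
    rollout in a synchronous RL step (illustrative only)."""
    step_time = max(generation_times)          # everyone waits for the longest response
    busy = sum(generation_times)               # time actually spent generating
    total = step_time * len(generation_times)  # wall-clock time across all processors
    return (total - busy) / total

# Seven processors finish quickly; one long response holds the step open.
times = [10, 12, 11, 9, 13, 10, 12, 80]
print(f"{idle_fraction(times):.0%} of processor-time is idle")  # prints "75% of processor-time is idle"
```

One long-tail response is enough to leave most of the group's compute unused, which is exactly the waste TLT recycles.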

“Our goal was to turn this idle time into speedup without incurring any unnecessary costs,” Hu adds.

To speed up generation, the researchers turned to an existing technique called speculative decoding, in which a small model called a drafter is trained to quickly guess the future output of a larger model.

The larger model verifies the drafter’s guesses, and the accepted responses are used for training.

Because the larger model can verify all of the drafter’s guesses at once, rather than generating each output token in turn, this speeds up the process.
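The draft-then-verify pattern can be sketched with toy next-token functions. This is a simplified greedy version for illustration; real speculative decoding verifies all draft tokens in a single batched forward pass and handles sampling, which this sketch does not attempt.

```python
def speculative_step(target_next, drafter_next, context, k=4):
    """One speculative-decoding step (simplified greedy sketch): the
    drafter proposes k tokens; the target keeps the longest prefix it
    agrees with, plus its own correction or one bonus token."""
    # The cheap drafter guesses k tokens ahead.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = drafter_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # The target checks the guesses (one batched forward pass in practice).
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)      # target overrides the first mismatch
            break
    else:
        accepted.append(target_next(ctx))  # all guesses accepted: one bonus token
    return accepted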

An adaptive solution

However, in standard speculative decoding, the drafter model is typically trained once and then remains static. That makes the technique a poor fit for reinforcement learning, where the reasoning model is updated thousands of times during training.

A static drafter quickly becomes stale after only a few training steps.

To overcome this problem, the researchers built a flexible system they call “Taming the Long Tail” (TLT).

The first component of TLT is an adaptive drafter trainer, which exploits processors’ idle time to train the drafter model on the fly, keeping it in sync with the target model without consuming extra computational resources.
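One way to picture this scheduling is as a simple simulation: each worker generates for some number of time ticks, then spends the remaining ticks of the synchronous step training the drafter instead of idling. This is a cartoon of the idea, not the paper's actual scheduler, and the tick-based API is invented for illustration.

```python
def run_step(rollout_lengths, train_drafter_tick):
    """Simulate one synchronous rollout step: worker i generates for
    rollout_lengths[i] ticks, then reuses its idle ticks to train the
    drafter (toy sketch; `train_drafter_tick` is a placeholder)."""
    step_len = max(rollout_lengths)   # the step lasts until the slowest rollout ends
    drafter_ticks = 0
    for length in rollout_lengths:
        for tick in range(step_len):
            if tick >= length:        # this worker's rollout is already done
                train_drafter_tick()  # recycle the idle tick for drafter training
                drafter_ticks += 1
    return drafter_ticks
```

With rollouts of 2, 5, and 3 ticks, the step lasts 5 ticks and the two fast workers recover 5 ticks of drafter training that would otherwise be wasted.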

The second component, an adaptive rollout engine, manages speculative decoding, automatically selecting the best strategy for each new batch of inputs. This mechanism adjusts the speculative decoding configuration based on characteristics of the training workload, such as the number of tokens the draft model proposes and the number the target model accepts during verification.
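A minimal version of such a controller might tune the draft length from recent acceptance rates. The thresholds and policy below are invented for illustration; the article does not describe TLT's actual selection logic, which is surely more sophisticated.

```python
class AdaptiveRolloutEngine:
    """Toy controller in the spirit of TLT's adaptive rollout engine:
    adjust the speculative draft length based on how many draft tokens
    the target model accepted in the last batch (hypothetical policy)."""

    def __init__(self, draft_len=4, min_len=1, max_len=8):
        self.draft_len = draft_len
        self.min_len = min_len
        self.max_len = max_len

    def update(self, proposed, accepted):
        """Lengthen drafts when most guesses are accepted; shorten them
        when most are rejected. Returns the draft length for the next batch."""
        rate = accepted / proposed if proposed else 0.0
        if rate > 0.8 and self.draft_len < self.max_len:
            self.draft_len += 1   # drafter is tracking the target well: guess further ahead
        elif rate < 0.4 and self.draft_len > self.min_len:
            self.draft_len -= 1   # too many rejections: wasted draft work, be conservative
        return self.draft_len
```

The design intuition is that a well-matched drafter makes longer speculation profitable, while a poorly matched one wastes verification work, so the configuration should follow the workload rather than stay fixed.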

In addition, the researchers designed the draft model to be lightweight and quick to train. TLT reuses components of the reasoning model’s training pipeline to train the drafter, yielding further speed improvements.

“As soon as some processors finish a short query and become idle, we switch them to training the draft model, using the same data we are already using for the rollout process. The key mechanism is adaptive speculative decoding; without it, these benefits would not be possible,” Hu says.

They tested TLT on multiple reasoning LLMs trained with real-world datasets. The system sped up training by 70 to 210 percent while maintaining each model’s accuracy.

As an added bonus, the small drafter model emerges from the process ready for efficient deployment, essentially a free byproduct.

In the future, the researchers hope to integrate TLT into more types of training and inference frameworks, and to identify other reinforcement learning applications that this approach could accelerate.

“As reasoning continues to become the primary workload driving the demand for compute, Qinghao’s TLT is a great technique for addressing the computational bottlenecks of training these reasoning models. We believe this method will be very useful in the context of efficient AI computing,” Han says.

This research was funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.
