
When researchers build large language models (LLMs), they aim to maximize performance under a given computational and financial budget. Since training a model can cost millions of dollars, developers need to be judicious about cost-related decisions concerning, for example, the model architecture, optimizers, and training datasets before committing to a model. To anticipate the quality and accuracy of a large model's predictions, practitioners often turn to scaling laws: using smaller, cheaper models to try to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to create a scaling law.
New work by researchers at MIT and the MIT-IBM Watson AI Lab addresses this by amassing and releasing a collection of hundreds of models and metrics concerning their training and performance, and fitting more than 1,000 scaling laws. From this, the team developed a meta-analysis and a guide for how to select small models and estimate scaling laws for different LLM model families, so that a budget is optimally applied toward generating reliable performance predictions.
“The idea that you might want to try to build mathematical models of the training process is a couple of years old, but what’s new here is that most of the work people had done before could only say something after the fact, once all of these models had already been trained,” says Jacob Andreas, associate professor in the MIT Department of Electrical Engineering and Computer Science and a principal investigator with the MIT-IBM Watson AI Lab.
The study by Andreas, along with MIT-IBM Watson AI Lab researcher Leshem Choshen and Yang Zhang of IBM Research, was recently presented at the International Conference on Machine Learning (ICML).
Performance extrapolation
No matter how you slice it, developing LLMs is an expensive endeavor: from decisions about the numbers of parameters and tokens, data selection and size, and training techniques, to determining output accuracy and tuning to the target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model's loss to the performance of smaller, cheaper models from the same family, avoiding the need to fully train every candidate. The main differences among the small models are their number of parameters and the size of their token training data. According to Choshen, elucidating scaling laws not only enables better pre-training decisions, but also democratizes the field by enabling researchers without vast resources to understand and build effective scaling laws.
The functional form of scaling laws is relatively simple, incorporating components from small models that capture the number of parameters and its scaling effect, the number of training tokens and its scaling effect, and the baseline performance of the model family of interest. Together, these help researchers estimate a target large model's performance loss; the smaller the loss, the better the target model's outputs are likely to be.
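In a common Chinchilla-style formulation (one standard form; the paper surveys variants, and the symbols here are illustrative), the predicted loss $L$ of a model with $N$ parameters trained on $D$ tokens is:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here $E$ is the baseline (irreducible) loss of the model family, $A$ and $\alpha$ capture the effect of scaling parameters, and $B$ and $\beta$ capture the effect of scaling training tokens. Fitting these five quantities to a handful of small, cheap runs lets researchers extrapolate the loss of a much larger target model.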
These laws allow research teams to weigh trade-offs efficiently and to test how best to allocate limited resources. They are particularly useful for evaluating the scaling of a specific variable, such as the number of tokens, and for A/B testing different pre-training setups.
In general, scaling laws aren't new; however, in the field of AI, they emerged as models grew and costs skyrocketed. “It’s like scaling laws just appeared at some point in the field,” says Choshen. “They started getting attention, but no one really tested how good they are and what you need to do to make a good scaling law.” Further, scaling laws were themselves something of a black box. “Whenever people have created scaling laws in the past, it has always been one model, or one model family, and one dataset, and one developer,” says Andreas. “There hadn’t really been a lot of systematic meta-analysis, as everybody is individually training their own scaling laws. So [we wanted to know,] are there high-level trends that you see across those things?”
Building better
To investigate this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMo, LLaMA, Bloom, T5-Pile, ModuleFormer mixture-of-experts, GPT, and other families. These included 485 unique, pre-trained models, and where available, data about their training checkpoints, computational cost (FLOPs), training epochs, and random seeds, along with 1.9 million performance metrics of loss and downstream tasks. The models differed in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws for target models. They used measurements of absolute relative error (ARE): the difference between the scaling law's prediction and the observed loss of a large, trained model. With this, the team compared the scaling laws, and after analysis, distilled practical recommendations for AI practitioners about what makes an effective scaling law.
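Concretely, for a target model with observed loss $L_{\text{target}}$ and scaling-law prediction $\hat{L}_{\text{target}}$, the absolute relative error can be written as (a standard definition; the notation here is illustrative):

$$\mathrm{ARE} = \frac{\lvert \hat{L}_{\text{target}} - L_{\text{target}} \rvert}{L_{\text{target}}}$$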
Their shared guidelines walk developers through the steps, options, and expectations to consider. First, it's critical to decide on a compute budget and a target model accuracy. The team found that 4 percent ARE is about the best achievable accuracy one could expect, due to random seed noise, but up to 20 percent ARE is still useful for decision-making. The researchers identified several factors that improve predictions, like including intermediate training checkpoints rather than relying only on final losses, which makes scaling laws more reliable. However, very early training data, before 10 billion tokens, are noisy, reduce accuracy, and should be discarded. They recommend prioritizing training more models spread across a range of sizes, rather than simply larger models, to improve the robustness of the scaling law's predictions for the target model; selecting five models provides a solid starting point.
In general, including larger models improves prediction, but costs can be saved by partially training the target model to about 30 percent of its dataset and using that run for extrapolation. If the budget is considerably constrained, developers should consider training one smaller model within the target model family and borrowing scaling-law parameters from a model family with a similar architecture; however, this may not work for encoder-decoder models. Lastly, the MIT-IBM research group found that, when scaling laws were compared across model families, there was a strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model behavior. Together, these guidelines provide a systematic approach to making scaling-law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints. A rough sketch of how they come together in practice follows below.
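As an illustration only, here is a minimal sketch, not the authors' released code, of fitting the five-parameter form above to a few small models using SciPy, following the guidelines of keeping intermediate checkpoints, discarding points before 10 billion tokens, and spreading roughly five models across sizes. All model sizes, token counts, and loss values below are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, E, A, alpha, B, beta):
    """Predicted loss for N parameters trained on D tokens."""
    N, D = X
    return E + A * N**(-alpha) + B * D**(-beta)

# Hypothetical observations: (params, tokens seen, measured loss),
# including intermediate checkpoints from each small training run.
runs = np.array([
    # N (params), D (tokens), loss
    [70e6,  20e9, 3.12],
    [70e6,  60e9, 2.95],
    [160e6, 20e9, 2.88],
    [160e6, 60e9, 2.70],
    [410e6, 20e9, 2.64],
    [410e6, 60e9, 2.47],
    [1.0e9, 20e9, 2.45],
    [1.0e9, 60e9, 2.29],
    [1.4e9, 5e9,  2.75],  # early checkpoint, will be dropped below
    [1.4e9, 60e9, 2.23],
])

# Discard noisy early-training checkpoints (< 10B tokens).
runs = runs[runs[:, 1] >= 10e9]

N, D, loss = runs[:, 0], runs[:, 1], runs[:, 2]
popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[1.5, 400, 0.3, 400, 0.3], maxfev=20000)

# Extrapolate to a hypothetical large target model.
pred = scaling_law((7e9, 300e9), *popt)
print(f"Predicted loss for 7B params / 300B tokens: {pred:.2f}")

# ARE against an observed target loss (value here is made up).
observed = 2.05
print(f"ARE: {abs(pred - observed) / observed:.1%}")
```

In a real setting, the fitted parameters and the resulting ARE would be checked against the accuracy targets described above (roughly 4 percent at best, up to 20 percent still useful) before trusting the extrapolation.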
There were several surprises along the way: small, partially trained models are still very predictive, and further, the intermediate training stages of a fully trained model can be used (as if they were individual models) to predict another target model. “Basically, you don’t pay anything in training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did,” says Choshen. Another feature Andreas pointed out was that, when aggregated, the variability across model families and different experiments jumped out and was noisier than expected. Unexpectedly, the researchers also found that it's possible to use scaling laws on large models to predict the performance of smaller models. Other research in the field has hypothesized that small models are a “different beast” compared to large ones; however, Choshen disagrees. “If they’re totally different, they should have shown totally different behavior, and they don’t.”
Although this work focused on model training time, the researchers plan to extend their analysis to model inference. “As you add training data and more parameters, the model gets better; but at inference time, it can instead improve by drawing more samples,” says Andreas. “I think there are lessons to be learned here about how to build predictive models of how much thinking a model needs to do at runtime. You’re not going to train just one model and be done, so the theory of inference-time scaling may become even more important.”
This study was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship.
