The Google team has discovered a new Scaling Law! Its method, DiLoCo, has proven better, faster, and stronger for training ever-larger LLMs across multiple data centers.
After extensive experiments and calculation, three Google teams pooled their efforts to arrive at the new Scaling Law!
Just now, Google researcher Zachary Charles announced that “a major breakthrough in distributed training has been made on larger and larger models”.
At the core is the Scaling Law of DiLoCo.
The new training method is undaunted by model size; in the future, training large models across "multiple data centers" will no longer be a problem.
The paper presents four major findings, showing that the scaling behavior of the DiLoCo training method far outperforms that of "data parallelism":
- More robust (Harder): DiLoCo's hyperparameters remain stable and predictable across different model sizes.
- Better: As model size grows, DiLoCo's advantage over data-parallel training grows with it.
- Faster: DiLoCo requires orders of magnitude less bandwidth than data-parallel training.
- Stronger: DiLoCo tolerates a much wider range of batch sizes than data-parallel training.
It is worth mentioning that this work brings together three major Google teams: Google Research, Google Search, and Google DeepMind.
Paper link: https://arxiv.org/pdf/2503.09799
Under a fixed computing budget, the researchers explored the scaling law of DiLoCo when training large models.
The paper focuses on how algorithmic factors (such as the number of model replicas, hyperparameter settings, and token budgets) affect the training process, and shows that these effects can be accurately predicted by a scaling law.
The results show that DiLoCo exhibits stable and predictable scalability as model size increases. Arthur Douillard, a co-author of the paper, emphasized once again: DiLoCo works!
The future of intelligence will be distributed, and DiLoCo may well be the key ingredient.
With reasonable tuning, DiLoCo scales better than data-parallel training, and may outperform it even on small-scale models.
These findings reveal the powerful advantages of DiLoCo: it not only solves the communication bottleneck, but also opens up new possibilities for large-scale model training.
Some netizens exclaimed, "DiLoCo may redefine the way LLMs scale! Lower bandwidth requirements, greater efficiency."
01
The end of “data parallelism” training?
Data-parallel training excels on large models, but only when the computing resources are concentrated and tightly interconnected.
If the computation is widely distributed, communication can be a huge bottleneck, especially when the model size grows!
A common solution in machine learning, used in both federated learning and data-center training, is to train multiple independent models and synchronize them periodically.
As machine learning models scale up, the frequent synchronization inherent in data-parallel approaches can cause significant performance degradation, which poses a key challenge to scaling models further.
So how can this bottleneck be broken, reducing the need for synchronization while preserving model quality?
The answer already exists: DiLoCo (Distributed Low-Communication training).
Paper link: https://arxiv.org/abs/2311.08105
In DiLoCo, each model replica is trained independently with an inner optimizer.
The replicas are synchronized through an outer optimization step, which typically applies a momentum mechanism across outer steps.
The figure below shows an example with a total of M=4 model replicas.
DiLoCo's effectiveness has been demonstrated time and again; it works much like FedOpt in federated learning.
Researchers have also repeatedly demonstrated DiLoCo's excellent performance in large language model (LLM) training.
So what’s wrong with DiLoCo? In a nutshell – scale.
Unlike data-parallel training, DiLoCo introduces additional "outer" hyperparameters, and in practice it behaves quite differently from what theory alone would suggest.
That’s exactly what studying scaling laws is for!
In this study, the researchers built scaling laws for DiLoCo and data-parallel training from scratch, in order to predict how the two compare on large-scale models.
In data-parallel training, each training step is processed with a batch of data of size B.
For the purposes of this study, batch size refers to the number of tokens in a batch (not the number of sequences).
A batch gradient is computed and an optimization step is taken with learning rate γ.
During DiLoCo training, a global batch of data is processed at each step and is evenly distributed across the M DiLoCo replicas at the sequence level.
The global batch size is therefore still B, while the local batch size of each DiLoCo replica is B/M. As in data-parallel training, each replica computes a batch gradient and uses the learning rate γ to perform an inner optimization step.
Unlike data parallelism, however, DiLoCo performs an "outer optimization" every H steps, based on outer gradients computed in parameter space, with updates applied using an outer learning rate η.
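To make this two-level structure concrete, here is a minimal PyTorch-style sketch of a single DiLoCo round. It is an illustration of the procedure described above, not the authors' code; the helpers `get_local_batch` and `loss_fn`, and the way parameters are passed in, are hypothetical.

```python
import torch

def diloco_round(outer_params, replicas, inner_opts, outer_opt,
                 get_local_batch, loss_fn, H):
    """One DiLoCo outer step: H inner steps per replica, then one outer update.

    outer_params : shared ("global") parameters held by the outer optimizer
    replicas     : M model copies, each initialized from outer_params
    inner_opts   : one inner optimizer per replica (e.g. AdamW with lr gamma)
    outer_opt    : optimizer over outer_params (e.g. Nesterov-momentum SGD, lr eta)
    """
    # Inner phase: each replica trains independently for H steps on its own
    # local batches of B / M tokens, with no communication between replicas.
    for replica, inner_opt in zip(replicas, inner_opts):
        for _ in range(H):
            x, y = get_local_batch()
            loss = loss_fn(replica(x), y)
            inner_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(replica.parameters(), 1.0)  # inner clipping
            inner_opt.step()

    # Outer phase: the "outer gradient" is the average drift of the replicas
    # away from the shared parameters, computed in parameter space.
    outer_opt.zero_grad()
    for p_global, *p_locals in zip(outer_params, *(r.parameters() for r in replicas)):
        deltas = [p_global.detach() - p.detach() for p in p_locals]
        p_global.grad = torch.stack(deltas).mean(dim=0)
    outer_opt.step()  # outer update with learning rate eta

    # Broadcast: reset every replica to the updated shared parameters.
    with torch.no_grad():
        for replica in replicas:
            for p, p_global in zip(replica.parameters(), outer_params):
                p.copy_(p_global)
```

With M=1 this reduces to ordinary training plus a Lookahead-style outer momentum step every H steps, which is exactly the DiLoCo (M=1) vs. Data-Parallel comparison discussed next.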
An important comparison is data-parallel training vs. DiLoCo with M=1.
While they are similar, they are not identical.
DiLoCo still contains an external optimizer (OuterOpt) step in the case of M=1, so it can be considered as a variant of the Lookahead optimizer.
In DiLoCo, OuterOpt typically uses gradient descent with Nesterov momentum, which means that DiLoCo (M=1) is effectively a variant of data-parallel training in which the momentum operation is performed only once every H steps.
A large number of experiments were also carried out, covering all aspects of the training process, and the scaling behavior was analyzed comprehensively.
02
Experimental Methods
In most experiments, the research team trained on the training split of the C4 dataset and evaluated on the C4 validation split.
In addition, zero-shot evaluation metrics were computed on three downstream tasks: HellaSwag, PIQA, and ARC-Easy.
03
Model architecture: Chinchilla variant
The research team used a decoder-only Transformer architecture similar to "Chinchilla", added QK-LayerNorm, and used z-loss regularization to make training more stable.
They packed multiple sequences into each batch, with a maximum sequence length of 2,048 throughout.
All models were trained from scratch, because the goal was to study scaling behavior in the pre-training phase.
The research team trained a bunch of models, adjusting the number of Transformer layers, the number of attention heads, the QKV dimension, and the hidden dimension of the feedforward layer.
Unless otherwise specified, all models use the Chinchilla token budget, and extensive hyperparameter tuning was performed for all but the two largest models (4B and 10B parameters).
04
Algorithms and optimizers
The research team used AdamW as the optimizer for Data-Parallel training and as the inner optimizer for DiLoCo. In both cases, β1 is set to 0.9 and β2 to 0.99.
Training starts with a 1,000-step warmup, followed by cosine learning-rate decay. The weight-decay parameter λ is set to T⁻¹, where T is the total number of training steps (which depends on the batch size and token budget). By the end of training, the learning rate has decayed to 5% of its peak value.
For training stability, the global L2 norm of the (inner) gradient is clipped to 1; the outer gradient is left unclipped.
For DiLoCo, SGD with Nesterov momentum serves as the outer optimizer, with momentum set to 0.9 and an outer learning rate that is held constant throughout training.
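As a rough sketch of how this recipe fits together (standard PyTorch classes, but the wiring below is an illustrative assumption rather than the paper's code; `model`, `outer_params`, `peak_lr`, `outer_lr`, and `total_steps` are placeholders):

```python
import math
import torch

def build_optimizers(model, outer_params, peak_lr, outer_lr, total_steps,
                     warmup_steps=1000):
    # Inner optimizer (also the Data-Parallel optimizer): AdamW with
    # beta1 = 0.9, beta2 = 0.99 and weight decay lambda = 1 / T.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.99),
                                  weight_decay=1.0 / total_steps)

    # Schedule: linear warmup for 1,000 steps, then cosine decay
    # down to 5% of the peak learning rate by the end of training.
    def lr_factor(step):
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.05 + 0.95 * 0.5 * (1.0 + math.cos(math.pi * progress))
    schedule = torch.optim.lr_scheduler.LambdaLR(inner_opt, lr_factor)

    # Outer optimizer (DiLoCo only): SGD with Nesterov momentum 0.9
    # and a constant outer learning rate.
    outer_opt = torch.optim.SGD(outer_params, lr=outer_lr,
                                momentum=0.9, nesterov=True)
    return inner_opt, schedule, outer_opt
```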
05
Built from scratch, the new Scaling Law is here
Finding 1: Scale
DiLoCo's evaluation loss improves relative to Data-Parallel as N increases.
The scaling law predicts that with M=2, DiLoCo achieves a lower loss than data parallelism once models reach the multi-billion-parameter range. This was verified when training the largest model tuned in the study, as well as the 4B and 10B models.
Figure 2 below shows the performance of DiLoCo and Data-Parallel algorithms at different model scales (N).
Figure (a) shows that as the model size grows from 2^25 to 2^31, the evaluation loss of both DiLoCo (with M=1, 2, 4, 8) and Data-Parallel decreases, but DiLoCo's loss decreases more sharply, especially for larger M.
Figure (b) shows the percentage difference in evaluation loss of DiLoCo relative to Data-Parallel: DiLoCo's loss falls further and further below Data-Parallel's as model size increases, indicating that DiLoCo performs relatively better at larger scales.
There are two separate but related parts to this finding:
DiLoCo (M=1) performs better: as noted above, at M=1 DiLoCo has a lower evaluation loss than Data-Parallel at every model size, and the gap between Data-Parallel and DiLoCo (M=1) keeps widening as the parameter count N grows.
Performance of DiLoCo (M≥2): with M≥2, DiLoCo has a higher evaluation loss at most model scales. However, looking at the signed percentage difference between DiLoCo and Data-Parallel, DiLoCo improves steadily relative to Data-Parallel as N grows, and at M=2 it even overtakes Data-Parallel at the largest model scales.
For example, the research team lists the evaluation losses of Data-Parallel and DiLoCo at different model sizes N in Table 4 below.
It can be seen that regardless of M, the percentage difference decreases strictly as N increases.
This trend is also illustrated in Figure 2: as N increases, the relative evaluation loss of DiLoCo gradually decreases.
The research team also tested this by training 4-billion- and 10-billion-parameter models using hyperparameters extrapolated from the scaling laws.
While Figure 2 shows results for the “interpolation” range (based on a large number of experimental scans), these findings can also be generalized to extrapolation, allowing DiLoCo to be used to train 4 billion and 10 billion parameter models with lower evaluation losses at M=1 or 2.
Table 5 below shows the results of training with extrapolated hyperparameters: it compares the evaluation losses of DiLoCo and Data-Parallel on the larger 4B and 10B models, and DiLoCo performs well overall at these larger scales.
Finding 2: Single-replica DiLoCo
When the number of replicas M=1, the evaluation loss obtained by DiLoCo is lower than that of Data-Parallel at different model scales.
Figure 3 below shows the comparison of the evaluation loss and zero-shot accuracy of HellaSwag between DiLoCo and Data-Parallel at different model sizes (35M, 550M, 1.3B, 2.4B) and global batch sizes (from 2^16 to 2^20 in terms of tokens) when the number of replicas M=1.
Figure (a) shows that the evaluation loss of DiLoCo is consistently lower than that of Data-Parallel, and the gap widens with the increase of batch size. Figure (b) shows that DiLoCo is also better than Data-Parallel in HellaSwag zero-shot accuracy, and the trend is similar.
In almost all cases, at M=1, DiLoCo not only has lower evaluation losses, but also has a higher zero-shot accuracy for downstream tasks than Data-Parallel.
Moreover, DiLoCo (M=1) is far less sensitive to batch size: doubling or quadrupling the batch size greatly affects Data-Parallel's performance but has almost no effect on DiLoCo (M=1), as Figure 3 clearly shows.
Finding 3: The impact of batch size on performance
DiLoCo increases the optimal batch size, and the optimal global batch size grows as the number of replicas M increases. This means that DiLoCo scales out more readily than Data-Parallel.
Although DiLoCo with M>1 tends to have a slightly worse evaluation loss when only the single best result over all hyperparameters is compared, it behaves significantly better with respect to batch size.
Both Data-Parallel and DiLoCo (M=1) performed well in small batches, but as the batch size increased, Data-Parallel’s performance degraded rapidly.
In contrast, DiLoCo's performance is far more stable across batch sizes, for every value of M.
Figure 4 below shows evaluation-loss results: the optimal batch size for DiLoCo is larger than for Data-Parallel at every M, and it grows further as M increases.
For example, in the 550M model, Data-Parallel’s evaluation loss was lowest at smaller batch sizes, while DiLoCo performed better at larger batch sizes, and a similar trend held true in the 1.3B and 2.4B models.
Figure 5 below shows the zero-shot accuracy on the HellaSwag dataset. The results show that even at smaller model sizes, DiLoCo achieves higher accuracy at larger global batch sizes at M=2.
For example, on the 550M model, DiLoCo's accuracy curve beats Data-Parallel's as the batch size increases; the 1.3B and 2.4B models show a similar trend.
Finding 4: External learning rate
The optimal outer learning rate is essentially independent of the model size N, but it does vary with the number of replicas M.
An important consequence is that DiLoCo scales horizontally much more naturally. In all cases, the token budget D depends only on the model size N, so using a 4× larger batch size cuts the number of training steps to a quarter.
DiLoCo maintains good performance in this regime, and it can use more resources at once, reducing total training time; Data-Parallel, by contrast, relies more on serial training. The reduction in training time is compounded by the lower communication volume.
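As a back-of-the-envelope illustration of that point (the numbers below are made up for illustration and are not taken from the paper):

```python
# With a fixed token budget D, the number of optimizer steps is T = D / B,
# so a 4x larger global batch means a quarter of the serial steps.
D = 2**33                          # token budget, set only by model size N (illustrative)
B_small, B_large = 2**18, 2**20    # global batch sizes in tokens, 4x apart
H = 100                            # DiLoCo synchronizes every H inner steps (illustrative)

steps_small = D // B_small         # 32,768 steps
steps_large = D // B_large         # 8,192 steps: 4x the batch -> 1/4 the steps

sync_data_parallel = steps_large   # data-parallel: one all-reduce on every step
sync_diloco = steps_large // H     # DiLoCo: only 81 synchronization rounds
```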
Figure 6 below illustrates the ideal wall-clock time, simulating different network bandwidths.
As you can see, DiLoCo's tolerance for larger batch sizes lets it reach a loss comparable to Data-Parallel's much faster, and the effect is even more pronounced in low-bandwidth settings.
Finding 5: External learning rate
As shown in Figure 7 below, for sufficiently large models (N ≥ 335 million parameters), the optimal η is fixed for each M, and the larger M is, the larger the optimal η. This is consistent with prior research in federated learning: the outer learning rate should increase with the number of clients.
In fact, the outer learning rate depends only on the number of DiLoCo replicas and the synchronization frequency.
In other words, although the optimal inner learning rate varies with the model size N, DiLoCo's optimal outer learning rate η does not depend on N; it depends only on M.
DiLoCo also helps with overtraining!
Overtraining can be very expensive, but the larger usable batch size and the reduced communication mean that DiLoCo can often do 4× overtraining (OT) in the same wall-clock time in which data-parallel training manages only 1×.
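A rough calculation shows why (again with illustrative numbers, not figures from the paper): because DiLoCo tolerates a 4× larger batch, it can push 4× as many tokens through the same number of serial steps.

```python
# Illustrative only: 4x overtraining with DiLoCo at the same serial depth.
D_chinchilla = 2**33               # nominal token budget (illustrative)
tokens_diloco = 4 * D_chinchilla   # 4x overtraining

steps_data_parallel = D_chinchilla // 2**18   # at its preferred (smaller) batch size
steps_diloco        = tokens_diloco // 2**20  # at a 4x larger, still-efficient batch
assert steps_data_parallel == steps_diloco    # same number of serial steps, 4x the data
```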
There is more in the paper, including the scaling laws themselves and even methods for predicting optimal hyperparameters.
The scaling laws show that for models with more than 2 billion parameters, DiLoCo with 2 replicas beats the data-parallel method.
For more details and content of the experiment, please refer to the original article.
06
Chinchilla is dying? AI’s $3 trillion fork in the road
DiLoCo makes it easier to tune hyperparameters and train models. But the problem is that the AI models themselves are old wine in new bottles: they are still Chinchilla-style.
After all, the old pre-training Scaling Law is coming to an end, and the new AI scaling laws are no longer about pre-training.
Now, with the rise of new "reasoning models," a question has emerged: what will happen to the future of AI if Chinchilla dies?
About 5 years ago, OpenAI researchers found that investing more computing power and data in large-scale training can significantly improve the performance of AI models.
A few years later, Google researchers went one step further and proved that increasing the amount of data would lead to better results by building a model called “Chinchilla”.
This combination of “computing + data” has given rise to today’s giant models, such as GPT-4.
Paper Link: https://arxiv.org/pdf/2203.15556
However, the success of this strategy relies on a huge upfront investment.
Massive amounts of data are crammed into complex and energy-intensive pre-training processes, and tech giants are frantically building data centers crammed with NVIDIA GPUs.
But the question is: how far can this model of throwing money and data go?
Ross Sandler, a top analyst at Barclays Capital, points out that there are two very different scenarios ahead of us:
First, “Chinchilla” continues to dominate, and the huge computing power and data investment continue to rise;
The second is a "stagnation" alternative, in which new technologies and models achieve higher performance with fewer resources.
The gap between the two paths in capital expenditures is more than $3 trillion, enough to affect the direction of the entire industry.
The rise of "reasoning models"
Driving this potential change is the rise of "reasoning models".
New models such as OpenAI's o1 and o3, DeepSeek R1, and Google Gemini 2.0 Flash Thinking use a technique called "test-time compute".
This approach breaks down complex queries into small tasks and processes them one by one, rather than relying on long pre-training.
Reasoning models may respond somewhat more slowly than traditional models, but their output is more accurate and they are less expensive to run.
What’s more, they get rid of the reliance on large-scale pre-training.
DeepSeek R1 has even shown that open-source reasoning models can achieve a leap in performance in a short period of time.
This means that AI companies may no longer need to spend 18-24 months and huge sums of money building the next giant model.
In addition, mixture-of-experts (MoE) models have become a widely adopted technique: a large model is built from many smaller "expert" sub-models, and only a fraction of them is activated for any given input, so only part of the compute is used at a time.
This further reduces infrastructure requirements.
Where is Chinchilla headed?
Over the past five years, the Chinchilla strategy has fueled a boom in the AI supply chain, and many companies’ stock prices have soared as a result.
But today, its sustainability is being questioned.
Barclays analysts noted that the cost-performance equation is deteriorating: input costs are skyrocketing (for example, a $10 billion pre-training run), while the performance gains may be getting smaller and smaller.
To make matters worse, training data may be drying up.
The supply of high-quality data is limited, and AI’s appetite for data is growing. How long can Chinchilla live without enough “food”?
Some industry heavyweights even predict that companies like OpenAI may stop endless scaling after GPT-5.
In the face of data depletion, the AI industry is pinning its hopes on “synthetic data”. According to the researchers, this “self-sufficient” feedback loop allows the model to evolve itself and push the technology to new heights.
In essence, Chinchilla could keep itself alive by "feeding on its own output".
“If the AI industry makes a breakthrough in synthetic data and recursive self-improvement, then we will be back on the Chinchilla scaling path, and the demand for computation will continue to rise rapidly.”
Is Chinchilla dead? The AI market will give the final answer to this question.
If reasoning models and MoE techniques mature, AI may move toward a lightweight, efficient future, and trillions of dollars in infrastructure investment may no longer be necessary.
However, if “synthetic data” brings Chinchilla back to life, the computing power race will make a comeback.
Resources:
- https://arxiv.org/pdf/2503.09799
- https://x.com/MatharyCharles/status/1900593694216253827
- https://www.businessinsider.com/ai-chinchilla-openai-google-anthropic-compute-demand-capex-scaling-laws-2025-3
Author: 新智元
Source: 谷歌重磅推出全新Scaling Law，抢救Transformer！3万亿美元AI面临岔路 (Google unveils a new Scaling Law to rescue the Transformer! $3 trillion of AI at a fork in the road)
The copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.