Resource Info
Paper: http://arxiv.org/abs/2509.07558
Code & Data: https://github.com/zerolllin/Delta-L-Normalization
Published: arXiv
Date: 2025.09.22
Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address the issue of dynamic response lengths in RLVR, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on the policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed $\Delta L$ Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory.
GRPO applies sample-level normalization, dividing each sample's loss by its response length, while DAPO uses a batch-level approach, normalizing the total loss by the sum of response lengths in the batch. Because such length-dependent factors deviate from standard reinforcement learning theory, Dr. GRPO, in contrast, avoids any length-dependent factor and normalizes the gradient with a fixed constant.
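To make the contrast concrete, below is a minimal sketch of the three aggregation schemes applied to a batch of per-token losses. This is not the authors' released code; the function name `aggregate`, the tensor layout, and the constant `max_len` used for the Dr. GRPO-style variant are illustrative assumptions.

```python
import torch

def aggregate(token_loss: torch.Tensor, mask: torch.Tensor, mode: str, max_len: int = 1024) -> torch.Tensor:
    """Aggregate per-token policy losses [B, T] into a scalar under different schemes.

    token_loss: per-token loss, shape [B, T]
    mask:       1 for valid response tokens, 0 for padding, shape [B, T]
    max_len:    fixed constant used by the Dr. GRPO-style variant (assumed value)
    """
    lengths = mask.sum(dim=1).clamp(min=1)        # L_i for each sample
    per_sample = (token_loss * mask).sum(dim=1)   # unnormalized sample-level loss

    if mode == "grpo":      # sample-level: divide each sample by its own length, then average
        return (per_sample / lengths).mean()
    if mode == "dapo":      # batch-level: divide the total loss by the total response length
        return per_sample.sum() / lengths.sum()
    if mode == "dr_grpo":   # length-independent: divide by a fixed constant
        return per_sample.sum() / (token_loss.size(0) * max_len)
    raise ValueError(f"unknown mode: {mode}")
```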
There is a lack of analysis on how they influence the statistical properties of gradient estimation in RLVR, with gradient variance being particularly important because high variance leads to inefficient training and even model collapse.
(1) The aggregation strategies in GRPO and DAPO introduce length-dependent bias in estimating the true policy gradient. As response lengths increase, their parameter updates shrink in gradient norm, slowing convergence. (2) The aggregation strategies in DAPO and Dr. GRPO lead to a higher coefficient of variation, meaning greater relative noise for the same gradient norm and resulting in less stable training.
The variance of the unnormalized gradient grows approximately proportionally to the response length.
Consider the variance of a sample-level gradient, $\mathrm{Var}(g_i) = \mathrm{Var}\big(\sum_{t=1}^{L_i} g_{i,t}\big)$, where $g_{i,t}$ denotes the token-level gradient contribution. We approximate this as $\sum_{t=1}^{L_i} \mathrm{Var}(g_{i,t})$ by ignoring covariance between individual token-level gradients. Assuming each token-level term contributes approximately a constant variance $\sigma^2$, we thus have $\mathrm{Var}(g_i) \approx L_i \sigma^2$. This approximation indicates that samples with longer responses inherently exhibit higher gradient variance due to increased randomness, and the gradient variance grows proportionally with response length.
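A quick numerical sanity check of this approximation, assuming i.i.d. token-level contributions with constant variance (purely illustrative, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed per-token standard deviation

for L in (64, 256, 1024):
    # each sample-level "gradient" is the sum of L independent token-level terms
    g = rng.normal(0.0, sigma, size=(10_000, L)).sum(axis=1)
    print(L, g.var())  # empirical variance grows roughly linearly in L (about L * sigma**2)
```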
GRPO and DAPO introduce length-dependent coefficients determined by the response lengths $\{L_i\}$, which results in a biased estimate of the true policy gradient.
Rethink loss normalization in RLVR
Existing aggregation methods introduce either bias or excessive variance. We therefore ask: can we construct an aggregation scheme that is unbiased and, at the same time, has minimum gradient variance? The answer is yes. We observe that this problem can be naturally reformulated within the framework of minimum-variance unbiased estimation in statistics. Specifically, we treat the gradients obtained from responses of different lengths as independent observations of the same underlying quantity (the ground-truth policy gradient), each with its own variance. Our objective is then to construct a new unbiased estimator by optimally combining these observations so that the resulting variance is minimized. Formally, we define the problem as follows.
Problem Definition. Given a set of independent sample-level gradient estimators $\{g_i\}_{i=1}^{N}$ satisfying $\mathbb{E}[g_i] = g$ and $\mathrm{Var}(g_i) = L_i \sigma^2$, where $L_i$ denotes the length associated with sample $i$ and $\sigma^2$ is a constant scalar, the objective is to determine coefficients $\{w_i\}_{i=1}^{N}$ in the linear combination $\hat{g} = \sum_{i=1}^{N} w_i g_i$, such that $\mathbb{E}[\hat{g}] = g$ for any given set of lengths $\{L_i\}$, while minimizing the variance $\mathrm{Var}(\hat{g})$.
Noting that the unbiasedness constraint is equivalent to $\sum_{i=1}^{N} w_i = 1$ and, by independence, the variance satisfies $\mathrm{Var}(\hat{g}) = \sigma^2 \sum_{i=1}^{N} w_i^2 L_i$, the problem reduces to a convex quadratic program with a single linear constraint. Solving it with the Lagrange multiplier method yields the unique minimizer $w_i = \frac{1/L_i}{\sum_{j=1}^{N} 1/L_j}$.
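For completeness, here is a brief sketch of the derivation (standard inverse-variance weighting via a Lagrange multiplier; notation follows the problem definition above):

$$
\min_{w}\ \sigma^2 \sum_{i=1}^{N} w_i^2 L_i
\quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1,
\qquad
\mathcal{L}(w,\lambda) = \sigma^2 \sum_{i=1}^{N} w_i^2 L_i + \lambda\Big(1 - \sum_{i=1}^{N} w_i\Big),
$$

$$
\frac{\partial \mathcal{L}}{\partial w_i} = 2\sigma^2 w_i L_i - \lambda = 0
\;\Rightarrow\; w_i \propto \frac{1}{L_i}
\;\Rightarrow\; w_i = \frac{1/L_i}{\sum_{j=1}^{N} 1/L_j}.
$$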
In practice, we find it beneficial to introduce a hyperparameter $\alpha$ into the normalization factor, which gives the following normalization weights: $w_i = \frac{1/L_i^{\alpha}}{\sum_{j=1}^{N} 1/L_j^{\alpha}}$.
The parameter $\alpha$ provides a tradeoff between variance reduction and utilization of long responses. While longer responses tend to introduce higher variance, they sometimes also carry richer learning signals. Choosing $\alpha < 1$ allows these signals to contribute more effectively, at the cost of increased gradient variance. We name this method $\Delta L$ Normalization, as it is specially designed to match the dynamic-length nature of RLVR. It has four key properties:
Thus, $\Delta L$ Normalization guarantees a lower coefficient of variation (CV) than DAPO and Dr. GRPO, while matching GRPO at $\alpha = 1$. When $\alpha < 1$, variance increases slightly, but long responses contribute more effectively.
These properties make $\Delta L$ Normalization highly valuable for RLVR training. The unbiasedness property ensures consistency with standard reinforcement learning theory, preventing unexpected slowdowns caused by biased gradient estimates. Variance reduction further stabilizes training and accelerates convergence. In practice, we find that setting $\alpha = 1$, which minimizes the variance, is a universally good choice. Choosing $\alpha < 1$ can further improve performance on math tasks, which might be because the long responses in math tasks should be better utilized.
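As a concrete illustration, here is a minimal sketch of how the $\Delta L$ weights could be applied when aggregating sample-level losses. It follows the weight formula above with hyperparameter $\alpha$, but the function and tensor names are my own and may differ from the released implementation.

```python
import torch

def delta_l_weights(lengths: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Per-sample weights w_i = L_i^{-alpha} / sum_j L_j^{-alpha}.

    alpha = 1 gives the minimum-variance unbiased combination;
    alpha < 1 up-weights long responses at the cost of extra variance.
    """
    inv = lengths.float().pow(-alpha)
    return inv / inv.sum()

def delta_l_loss(token_loss: torch.Tensor, mask: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Aggregate per-token losses [B, T] with length-dependent Delta-L style weights."""
    lengths = mask.sum(dim=1).clamp(min=1)
    per_sample = (token_loss * mask).sum(dim=1)   # unnormalized sample-level loss
    weights = delta_l_weights(lengths, alpha)
    return (weights * per_sample).sum()
```

With `alpha = 1` this reduces to inverse-length (inverse-variance) weighting, matching the minimizer derived above up to the batch-dependent normalizing constant.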
Innovations and unique contributions of the paper:
Problems in the paper and suggested improvements:
Proposed innovations or research directions based on the paper's content and findings:
Research plan formulated for the new research direction: