<p><strong>GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation</strong>, by Sijia Li and nine other authors</p>
<blockquote class="abstract mathjax">
<span class="descriptor">Abstract:</span> Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
</blockquote>
<div>
<h2>Submission History</h2>
From: Sijia Li <br/>
<strong>[v1]</strong> Tue, 12 May 2026 09:38:38 UTC (1,713 KB)<br/>
<strong>[v2]</strong> Thu, 14 May 2026 10:19:32 UTC (1,713 KB)<br/>
</div>
Understanding GEAR: Granularity-Adaptive Advantage Reweighting
In the ever-evolving landscape of machine learning, particularly in the realm of large language models (LLMs), researchers are continuously exploring methods that ensure more effective learning experiences. The paper titled “GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation” introduces an innovative framework that promises to enhance the performance of LLM agents using adaptive credit assignment techniques.
The Challenge of Reinforcement Learning in LLMs
Reinforcement learning (RL) has garnered significant attention as a post-training method for LLMs. Traditional approaches typically depend on outcome-level rewards, which provide only coarse, trajectory-wide supervision. This overlooks the finer details of the agent's behavior during long-horizon trajectories: assigning credit to the specific actions or decisions an agent makes has proven difficult, particularly when the outcome is shaped by a long sequence of actions, making it hard to pinpoint which decisions led to success or failure.
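To make the coarse-supervision problem concrete, here is a minimal sketch, assuming the standard group-relative normalization used by GRPO-style methods (the paper's exact reward normalization may differ): a single outcome reward per rollout becomes one scalar advantage that is then shared by every token of that trajectory.

```python
import numpy as np

def grpo_trajectory_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's scalar outcome
    reward against the group of rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled trajectories, binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_trajectory_advantages(rewards)

# In standard outcome-level GRPO, the same scalar advantage is broadcast to
# every token of its trajectory, so all decisions in a long rollout share
# exactly the same credit.
token_counts = [212, 340, 187, 295]
per_token = [np.full(n, a) for n, a in zip(token_counts, advantages)]
print(advantages)        # [ 1., -1., -1.,  1.] for these rewards
print(per_token[0][:5])  # first trajectory: identical advantage at every token
```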
Introducing GEAR: A Novel Approach
The GEAR framework aims to fill this gap by reshaping the trajectory-level Group Relative Policy Optimization (GRPO) advantage using signals derived from self-distillation. At its core, GEAR compares an on-policy student model with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal over the rollout's tokens. These divergences indicate where the student deviates from the teacher's ideal behavior, allowing for more refined local adjustments.
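The abstract does not spell out the exact form of the divergence, so the sketch below assumes a per-token KL divergence between the teacher's and the student's next-token distributions as the reference-guided signal; the function name and logits interface are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def per_token_divergence(student_logits, teacher_logits):
    """Per-token KL(teacher || student) between next-token distributions.

    student_logits, teacher_logits: [seq_len, vocab] logits scored over the
    same rollout tokens; the teacher is additionally conditioned on the
    ground-truth answer (that conditioning is not shown here).
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    p_teacher = log_p_teacher.exp()
    # KL per position: sum_v p_T(v) * (log p_T(v) - log p_S(v))
    return (p_teacher * (log_p_teacher - log_p_student)).sum(dim=-1)

# Random logits, just to show shapes: one divergence value per token.
seq_len, vocab = 16, 32000
div = per_token_divergence(torch.randn(seq_len, vocab),
                           torch.randn(seq_len, vocab))
print(div.shape)  # torch.Size([16])
```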
Adaptive Granularity in Credit Assignment
One of the standout features of GEAR is its adaptive granularity in credit assignment. Instead of treating the entire trajectory as a single entity, GEAR segments the trajectory around moments of divergence. Because the divergence tends to spike at the onset of a semantic deviation, while later tokens in the same autoregressive continuation can return to low divergence, GEAR treats these spikes as anchors for adaptive credit regions. Where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, the corresponding continuation is grouped into an adaptive segment whose advantage is modulated by the divergence at the departure point. This dual-level approach allocates credit more accurately, leading to better policy updates.
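The abstract likewise does not give the precise segmentation rule or modulation function, so the following sketch makes two hypothetical choices: a simple threshold (`spike_thresh`) to detect divergence spikes, and a linear modulation (`alpha`) of each segment's weight by the divergence at its departure point. It is only meant to show the shape of the mechanism: token-level weights where the student stays aligned, segment-level reweighting after each spike.

```python
import numpy as np

def reweight_advantage(divergence, traj_advantage, spike_thresh=1.0, alpha=0.5):
    """Reshape one trajectory-level advantage into per-token advantages.

    divergence: per-token student/teacher divergence (e.g. from the sketch
    above). traj_advantage: scalar GRPO advantage for this rollout.
    Tokens below the threshold keep token-level resolution (weight 1); at
    each spike, the following continuation is grouped into a segment whose
    weight is set by the divergence at the departure point.
    """
    d = np.asarray(divergence, dtype=np.float64)
    weights = np.ones_like(d)
    spikes = np.flatnonzero(d > spike_thresh)        # candidate segment anchors
    bounds = list(spikes) + [len(d)]                 # each segment runs to the next spike (or end)
    for start, end in zip(spikes, bounds[1:]):
        weights[start:end] = 1.0 + alpha * d[start]  # one hypothetical modulation choice
    return weights * traj_advantage                  # per-token reshaped advantage

div = np.array([0.1, 0.2, 0.1, 2.3, 0.4, 0.3, 1.8, 0.2])
print(reweight_advantage(div, traj_advantage=1.0))
# tokens 0-2 keep the base advantage; tokens 3-5 and 6-7 form spike-anchored segments
```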
Experimental Validation and Results
The efficacy of the GEAR framework has been validated through experiments on eight benchmarks spanning mathematical reasoning and agentic tool use, using Qwen3 models at 4B and 8B parameters. GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. Notably, the gains are strongest on benchmarks where the GRPO baseline accuracy is lower, reaching up to around 20% over GRPO, underscoring GEAR's effectiveness on more challenging long-horizon tasks.
Implications for Future Research
As LLMs become increasingly integral in various applications, GEAR’s contribution to adaptive credit assignment offers a pathway for future research to explore even more refined learning techniques. By moving beyond coarse supervision and leveraging token- and segment-level insights, researchers can devise strategies that further enhance the capabilities and efficiency of LLM agents.