Understanding Gradient Accumulation Issues in Transformers
In the field of machine learning, particularly when working with deep learning models like Transformers, gradient accumulation is a pivotal technique. Our friends at Unsloth recently highlighted a significant issue with gradient accumulation in the Transformers Trainer. The initial report, courtesy of @bnjmn_marie, reveals that toggling gradient accumulation on and off produces different loss values, breaking the expected mathematical equivalence to full-batch training.
What Is Gradient Accumulation?
Gradient accumulation is a strategy used to effectively increase the batch size without requiring additional memory. By accumulating gradients over several mini-batches before updating the model weights, users can simulate the effects of a larger batch size, which often leads to more stable training. This technique is especially useful in scenarios where hardware limitations restrict the use of large batches.
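The idea can be sketched in a few lines of plain PyTorch. The snippet below is an illustrative sketch, not Trainer internals: it verifies that, for equal-sized micro-batches, accumulating `loss / accumulation_steps` gradients over four micro-batches reproduces the full-batch gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
data = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: gradient of the mean loss over the full batch of 8.
model.zero_grad()
nn.functional.mse_loss(model(data), targets).backward()
full_grad = model.weight.grad.clone()

# Same gradient, accumulated over 4 micro-batches of 2.
model.zero_grad()
accumulation_steps = 4
for x, y in zip(data.split(2), targets.split(2)):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale each micro-batch loss so the summed gradients
    # match the full-batch mean before the optimizer step.
    (loss / accumulation_steps).backward()

assert torch.allclose(model.weight.grad, full_grad, atol=1e-5)
```

The equivalence holds here because every micro-batch contributes the same number of items; the issue described below arises precisely when that assumption breaks, for example when micro-batches contain varying numbers of non-padding tokens.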
Where Does the Issue Stem From?
At the heart of this issue lies the default loss function utilized by each model in the transformers library. This function is tailored to the specific tasks the model is designed for—be it question answering, token classification, causal language modeling (LM), or masked LM. While the default loss function simplifies the training process for users, it is inherently limited and not intended for customization.
The default loss function is computed only when both `labels` and `input_ids` are provided as inputs to the model. This design spares users from calculating the loss manually, but it can lead to complications when different training scenarios arise. Consequently, while the simplicity of the Transformers Trainer is appealing, it can sometimes produce unexpected behavior, particularly during gradient accumulation.
The Technical Breakdown
For tasks involving token-level outputs, such as causal LM training, the correct loss should be calculated based on the total loss over all batches in a gradient accumulation step. This total loss needs to be divided by the number of non-padding tokens present in those batches, rather than merely averaging the per-batch loss values. The current implementation fails to adhere to this principle, causing discrepancies in loss calculations.
```diff
def ForCausalLMLoss(logits, labels, vocab_size, **kwargs):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    num_items = kwargs.pop("num_items", None)
+   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100, reduction="sum")
+   loss = loss / num_items
-   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
    return loss
```
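A small numeric check makes the discrepancy concrete. In the sketch below (illustrative, plain PyTorch), two micro-batches carry different numbers of non-padding tokens, so averaging the two per-batch mean losses disagrees with summing the losses and dividing by the total token count `num_items`:

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size = 10
# Two micro-batches of 4 positions each, but with different numbers of
# real (non -100) tokens: 4 in the first, only 1 in the second.
logits_a, labels_a = torch.randn(4, vocab_size), torch.randint(0, vocab_size, (4,))
logits_b, labels_b = torch.randn(4, vocab_size), torch.tensor([3, -100, -100, -100])

ce = nn.functional.cross_entropy

# Buggy behaviour: average each micro-batch's mean loss.
per_batch_mean = (ce(logits_a, labels_a, ignore_index=-100)
                  + ce(logits_b, labels_b, ignore_index=-100)) / 2

# Correct behaviour: sum losses, then divide by total non-padding tokens.
num_items = (labels_a != -100).sum() + (labels_b != -100).sum()  # 5 tokens
token_mean = (ce(logits_a, labels_a, ignore_index=-100, reduction="sum")
              + ce(logits_b, labels_b, ignore_index=-100, reduction="sum")) / num_items

# The two values disagree: the mean-of-means overweights the sparse batch.
print(per_batch_mean.item(), token_mean.item())
```

With equal token counts in every micro-batch the two computations would coincide, which is why the bug went unnoticed for so long.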
How We’re Fixing It
To address the issues surrounding gradient accumulation, we are introducing two key changes to our models and training processes:
- For users relying on the default loss functions, we will now automatically adjust the calculations to ensure accurate loss reporting during gradient accumulation. This change aims to resolve the core issue identified.
- To empower users in the meantime, we will introduce an API that lets them pass their own loss functions directly to the Trainer. This flexibility ensures they can apply their own fixes until we finalize our internal adjustments and release an updated version of the Transformers library.
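The shape of such a pluggable-loss hook can be sketched with a toy stand-in class. Note that `MiniTrainer`, the hook name `compute_loss_func`, and `token_normalized_loss` below are illustrative names and signatures for this sketch, not the finalized `transformers.Trainer` API:

```python
import torch
from torch import nn

class MiniTrainer:
    """Toy stand-in for a trainer with a user-suppliable loss function."""

    def __init__(self, compute_loss_func=None):
        self.compute_loss_func = compute_loss_func

    def compute_loss(self, logits, labels, num_items=None):
        if self.compute_loss_func is not None:
            # A user-supplied loss takes precedence over the built-in default.
            return self.compute_loss_func(logits, labels, num_items)
        # Built-in default: per-micro-batch mean (the behaviour being fixed).
        return nn.functional.cross_entropy(logits, labels, ignore_index=-100)

def token_normalized_loss(logits, labels, num_items):
    # Sum over tokens, then divide by the total non-padding token count,
    # which the caller tracks across the whole gradient-accumulation step.
    loss = nn.functional.cross_entropy(
        logits, labels, ignore_index=-100, reduction="sum"
    )
    return loss / num_items

trainer = MiniTrainer(compute_loss_func=token_normalized_loss)
```

Passing `num_items` in from outside is the key design point: only the caller knows the total number of non-padding tokens across all micro-batches in the accumulation step.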
Custom Loss Functions
Models that inherit from the PreTrainedModel class will now feature a loss_function property, which can be defined based on:
- The `config.loss_type`: this approach lets users specify custom loss functions easily. Modifying the `LOSS_MAPPING` allows for this customization.
```python
def my_super_loss(logits, labels):
    return nn.functional.cross_entropy(logits, labels, ignore_index=-100)

LOSS_MAPPING["my_loss_type"] = my_super_loss
```
Next Steps
We are diligently working on implementing these changes. The first adjustment is set to be deployed for the most widely used models, as noted in our pull request here. Following this, we will issue a call for contributions to ensure that a wider array of models is supported in the next release.
Additionally, our second change, which lets users supply custom loss functions and accurately tracks the number of items seen per batch, is being developed in this pull request: here.
By tomorrow, users can expect the Trainer to function correctly with gradient accumulation. To access the fix, make sure to install from the main branch:
```
pip install git+https://github.com/huggingface/transformers
```
We pride ourselves on being responsive to bug reports submitted through our issue tracker: here. Although this issue has persisted within the Transformers framework for some time, we recognize the importance of keeping our defaults intuitive and up-to-date. Our commitment to rapid fixes, like the one implemented here within 24 hours, is a testament to our dedication to enhancing user experience. Your feedback is invaluable, so please don’t hesitate to reach out with any further issues to help us refine Transformers to better suit your needs.
The Transformers team 🤗
Inspired by: Source