Understanding Gradient Accumulation Issues in Transformers
In the field of machine learning, particularly when working with deep learning models like Transformers, gradient accumulation is a pivotal technique. Our friends at Unsloth recently highlighted a significant issue with gradient accumulation in the Transformers Trainer. The initial report, courtesy of @bnjmn_marie, reveals that toggling gradient accumulation on and off produces different loss values, breaking the expected mathematical equivalence to full-batch training.
What Is Gradient Accumulation?
Gradient accumulation is a strategy used to effectively increase the batch size without requiring additional memory. By accumulating gradients over several mini-batches before updating the model weights, users can simulate the effects of a larger batch size, which often leads to more stable training. This technique is especially useful in scenarios where hardware limitations restrict the use of large batches.
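The idea can be sketched in a few lines of plain PyTorch. The snippet below is an illustrative sketch, not Trainer internals: it verifies that, for equal-sized micro-batches, accumulating `loss / accumulation_steps` gradients over four micro-batches reproduces the full-batch gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
data = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: gradient of the mean loss over the full batch of 8.
model.zero_grad()
nn.functional.mse_loss(model(data), targets).backward()
full_grad = model.weight.grad.clone()

# Same gradient, accumulated over 4 micro-batches of 2.
model.zero_grad()
accumulation_steps = 4
for x, y in zip(data.split(2), targets.split(2)):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale each micro-batch loss so the summed gradients
    # match the full-batch mean before the optimizer step.
    (loss / accumulation_steps).backward()

assert torch.allclose(model.weight.grad, full_grad, atol=1e-5)
```

The equivalence holds here because every micro-batch contributes the same number of items; the issue described below arises precisely when that assumption breaks, for example when micro-batches contain varying numbers of non-padding tokens.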
Where Does the Issue Stem From?
At the heart of this issue lies the default loss function utilized by each model in the transformers library. This function is tailored to the specific tasks the model is designed for—be it question answering, token classification, causal language modeling (LM), or masked LM. While the default loss function simplifies the training process for users, it is inherently limited and not intended for customization.
The default loss function is computed only when both `labels` and `input_ids` are provided as inputs to the model. This design spares users from calculating the loss manually, but it can lead to complications when different training scenarios arise. Consequently, while the simplicity of the Transformers Trainer is appealing, it can sometimes produce unexpected behavior, particularly during gradient accumulation.
The Technical Breakdown
For tasks involving token-level outputs, such as causal LM training, the correct loss should be calculated based on the total loss over all batches in a gradient accumulation step. This total loss needs to be divided by the number of non-padding tokens present in those batches, rather than merely averaging the per-batch loss values. The current implementation fails to adhere to this principle, causing discrepancies in loss calculations.
```diff
def ForCausalLMLoss(logits, labels, vocab_size, **kwargs):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    num_items = kwargs.pop("num_items", None)
+   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100, reduction="sum")
+   loss = loss / num_items
-   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
    return loss
```
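A small numeric check makes the discrepancy concrete. In the sketch below (illustrative, plain PyTorch), two micro-batches carry different numbers of non-padding tokens, so averaging the two per-batch mean losses disagrees with summing the losses and dividing by the total token count `num_items`:

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size = 10
# Two micro-batches of 4 positions each, but with different numbers of
# real (non -100) tokens: 4 in the first, only 1 in the second.
logits_a, labels_a = torch.randn(4, vocab_size), torch.randint(0, vocab_size, (4,))
logits_b, labels_b = torch.randn(4, vocab_size), torch.tensor([3, -100, -100, -100])

ce = nn.functional.cross_entropy

# Buggy behaviour: average each micro-batch's mean loss.
per_batch_mean = (ce(logits_a, labels_a, ignore_index=-100)
                  + ce(logits_b, labels_b, ignore_index=-100)) / 2

# Correct behaviour: sum losses, then divide by total non-padding tokens.
num_items = (labels_a != -100).sum() + (labels_b != -100).sum()  # 5 tokens
token_mean = (ce(logits_a, labels_a, ignore_index=-100, reduction="sum")
              + ce(logits_b, labels_b, ignore_index=-100, reduction="sum")) / num_items

# The two values disagree: the mean-of-means overweights the sparse batch.
print(per_batch_mean.item(), token_mean.item())
```

With equal token counts in every micro-batch the two computations would coincide, which is why the bug went unnoticed for so long.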
How We’re Fixing It
To address the issues surrounding gradient accumulation, we are introducing two key changes to our models and training processes:
- For users relying on the default loss functions, we will now automatically adjust the calculations to ensure accurate loss reporting during gradient accumulation. This change aims to resolve the core issue identified.
- To empower users in the meantime, we will introduce an API that lets them pass their own loss functions directly to the Trainer. This flexibility ensures they can apply their own fixes until we finalize our internal adjustments and release an updated version of the Transformers library.
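The shape of such a pluggable-loss hook can be sketched with a toy stand-in class. Note that `MiniTrainer`, the hook name `compute_loss_func`, and `token_normalized_loss` below are illustrative names and signatures for this sketch, not the finalized `transformers.Trainer` API:

```python
import torch
from torch import nn

class MiniTrainer:
    """Toy stand-in for a trainer with a user-suppliable loss function."""

    def __init__(self, compute_loss_func=None):
        self.compute_loss_func = compute_loss_func

    def compute_loss(self, logits, labels, num_items=None):
        if self.compute_loss_func is not None:
            # A user-supplied loss takes precedence over the built-in default.
            return self.compute_loss_func(logits, labels, num_items)
        # Built-in default: per-micro-batch mean (the behaviour being fixed).
        return nn.functional.cross_entropy(logits, labels, ignore_index=-100)

def token_normalized_loss(logits, labels, num_items):
    # Sum over tokens, then divide by the total non-padding token count,
    # which the caller tracks across the whole gradient-accumulation step.
    loss = nn.functional.cross_entropy(
        logits, labels, ignore_index=-100, reduction="sum"
    )
    return loss / num_items

trainer = MiniTrainer(compute_loss_func=token_normalized_loss)
```

Passing `num_items` in from outside is the key design point: only the caller knows the total number of non-padding tokens across all micro-batches in the accumulation step.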
Custom Loss Functions
Models that inherit from the PreTrainedModel class will now feature a loss_function property, which can be defined based on:
- The `config.loss_type`: this approach lets users specify custom loss functions easily. Modifying the `LOSS_MAPPING` allows for this customization.
```python
def my_super_loss(logits, labels):
    return nn.functional.cross_entropy(logits, labels, ignore_index=-100)

LOSS_MAPPING["my_loss_type"] = my_super_loss
```
Next Steps
We are diligently working on implementing these changes. The first adjustment is set to be deployed for the most widely used models, as noted in our pull request here. Following this, we will issue a call for contributions to ensure that a wider array of models is supported in the next release.
Additionally, our second change, which lets users supply custom loss functions and accurately tracks the number of items seen per batch, is being developed in this pull request: here.
By tomorrow, users can expect the Trainer to function correctly with gradient accumulation. To access the fix, make sure to install from the main branch:
```
pip install git+https://github.com/huggingface/transformers
```
We pride ourselves on being responsive to bug reports submitted through our issue tracker: here. Although this issue has persisted within the Transformers framework for some time, we recognize the importance of keeping our defaults intuitive and up-to-date. Our commitment to rapid fixes, like the one implemented here within 24 hours, is a testament to our dedication to enhancing user experience. Your feedback is invaluable, so please don’t hesitate to reach out with any further issues to help us refine Transformers to better suit your needs.
The Transformers team 🤗
Inspired by: Source