When training Transformer models on a single GPU, it’s important to optimize for both speed and memory efficiency to make the most of limited resources. Here are some key parameters and techniques to consider:
Mixed Precision Training
- Use fp16 or bf16 mixed precision to speed up training while keeping most of fp32's accuracy. Set `fp16=True` or `bf16=True` in `TrainingArguments`. Storing activations in `torch.float16` or `torch.bfloat16` roughly halves their memory footprint with minimal impact on model quality, allowing larger batch sizes.
- bfloat16 is more numerically stable than float16 and is recommended if your GPU supports it (e.g., NVIDIA A100).
- Mixed precision speeds up math-heavy operations like linear layers and convolutions.
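A minimal sketch of what this looks like with the Trainer API; the output directory and batch size below are arbitrary placeholders:

```python
# Enable bf16 mixed precision through TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",            # placeholder
    per_device_train_batch_size=16,  # placeholder; raise it if memory allows
    bf16=True,                       # use fp16=True instead on GPUs without bfloat16 support
)
```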
Optimizer Choice
- AdamW is the most common optimizer, but it keeps two extra state tensors per parameter and therefore has high memory usage.
- Alternatives selected via the `optim` argument, such as `adamw_hf`, `adamw_torch`, `adamw_apex_fused`, `adamw_anyprecision`, or `adafactor`, trade off speed, precision, and memory; `adafactor` in particular stores far less optimizer state, as shown in the sketch after this list.
- Fused optimizers such as `adamw_apex_fused` (from NVIDIA Apex) are faster because they fuse multiple operations into fewer GPU kernels, while `adamw_anyprecision` saves memory by keeping optimizer state in lower precision.
- Fused Adam, for example, combines the Adam update's elementwise operations into a single kernel, reducing memory accesses.
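For example, switching the Trainer to Adafactor or a fused AdamW is a one-line change via the `optim` argument; whether the Apex variant is available depends on your installation:

```python
# Select a lower-memory or fused optimizer via TrainingArguments.optim.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",            # placeholder
    optim="adafactor",               # low-memory alternative to AdamW
    # optim="adamw_apex_fused",      # faster fused AdamW, requires NVIDIA Apex
)
```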
Gradient Accumulation
- Accumulate gradients over several smaller batches before performing an optimizer step to emulate training with a larger batch size. Set `gradient_accumulation_steps` in `TrainingArguments`, as shown below.
- Allows large effective batch sizes that would not fit in GPU memory.
- In distributed setups it also reduces gradient-synchronization overhead, but running many small micro-batches can lower single-GPU throughput, and the larger effective batch size can slow convergence.
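A rough sketch, assuming a micro-batch of 4 is what fits in memory and an effective batch size of 64 is the target:

```python
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",            # placeholder
    per_device_train_batch_size=4,   # what fits in GPU memory
    gradient_accumulation_steps=16,  # 4 * 16 = 64 samples per optimizer step
)
```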
Gradient Checkpointing
- Trade compute for memory by recomputing activations in the backward pass instead of storing them.
- Frees memory for larger batch sizes, but slows down each iteration (roughly 20% in practice).
- Enable with `gradient_checkpointing=True` in `TrainingArguments`.
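In Trainer terms this is a single flag, shown here combined with mixed precision; both values are illustrative:

```python
# Recompute activations in the backward pass instead of storing them.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",            # placeholder
    gradient_checkpointing=True,
    bf16=True,                       # checkpointing is usually combined with mixed precision
)
```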
Efficient Data Loading
- Use `DataLoader(pin_memory=True)` to speed up data transfer from CPU to GPU memory.
- Set `DataLoader(num_workers=N)` to preload data with multiple worker processes.
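A short sketch of both styles; the dataset below is a trivial stand-in, and when the Trainer builds the DataLoader for you the equivalent knobs are `dataloader_pin_memory` (on by default) and `dataloader_num_workers`:

```python
from torch.utils.data import DataLoader
from transformers import TrainingArguments

# Plain PyTorch DataLoader with pinned memory and 4 worker processes.
train_dataset = list(range(1024))    # stand-in for a real Dataset object
loader = DataLoader(train_dataset, batch_size=16, pin_memory=True, num_workers=4)

# The same settings when the Trainer constructs the DataLoader.
args = TrainingArguments(
    output_dir="outputs",            # placeholder
    dataloader_pin_memory=True,      # default
    dataloader_num_workers=4,
)
```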
Offload to CPU
- Offload the optimizer state and model parameters to CPU when they are not in use to free up GPU memory, for example with DeepSpeed ZeRO-Offload by pointing `offload_optimizer` and `offload_param` at the CPU in the DeepSpeed config.
- Each step becomes slower because of CPU-GPU transfers, but it allows training larger models and batch sizes than would otherwise fit.
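With the Trainer, one way to do this is a DeepSpeed ZeRO-3 config passed through `TrainingArguments(deepspeed=...)`. The snippet below is a minimal sketch rather than a tuned recipe; the `"auto"` values let the Trainer fill in its own batch-size settings:

```python
# DeepSpeed ZeRO-3 config that offloads optimizer state and parameters to CPU.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

args = TrainingArguments(
    output_dir="outputs",            # placeholder
    deepspeed=ds_config,             # requires the deepspeed package
)
```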
TF32 on Ampere GPUs
- Enable TF32 mode on Ampere or newer GPUs so that matmuls and convolutions automatically use the TF32 format, which keeps the numeric range of fp32 with roughly the precision of fp16. Set `tf32=True` in `TrainingArguments`.
- Speeds up linear layers, convolutions, and matmuls with minimal accuracy impact.
- Alternatively, set `torch.backends.cuda.matmul.allow_tf32 = True` directly in PyTorch.
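Both routes are sketched below; they have the same effect, and the setting only matters on Ampere-class or newer hardware:

```python
import torch
from transformers import TrainingArguments

# Allow TF32 math directly in PyTorch for matmuls and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Or let the Trainer set it from TrainingArguments.
args = TrainingArguments(output_dir="outputs", tf32=True)  # output_dir is a placeholder
```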
Flash Attention 2
- Use Flash Attention 2 kernels integrated in the Transformers library to speed up attention computation and reduce memory usage.
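In recent Transformers versions this is requested at model load time; the model id below is just a placeholder, and the separate flash-attn package must be installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model id
    torch_dtype=torch.bfloat16,               # FlashAttention-2 needs fp16 or bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```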
The Trainer API in 🤗 Transformers supports most of these techniques through the `TrainingArguments` class. Experiment with combinations of these approaches to find the best trade-off between speed, memory efficiency, and model quality for your specific model and hardware setup.
Source: https://huggingface.co/docs/transformers/en/perf_train_gpu_one