Efficient Training Techniques for Transformers on a Single GPU

When training Transformer models on a single GPU, it’s important to optimize for both speed and memory efficiency to make the most of limited resources. Here are some key parameters and techniques to consider:

Mixed Precision Training

  • Use fp16 or bf16 mixed precision to reduce memory usage while maintaining most of the fp32 precision. Set fp16=True or bf16=True in TrainingArguments.
  • Storing activations and gradients in torch.float16 or torch.bfloat16 roughly halves their memory footprint with minimal impact on model quality, allowing training with larger batch sizes. Note that mixed precision keeps an fp32 master copy of the weights, so the weights themselves are not shrunk.
  • bfloat16 has a wider dynamic range than float16 and is therefore more numerically stable; prefer it if your GPU supports it (Ampere or newer, e.g. NVIDIA A100).
  • Mixed precision speeds up math-heavy operations like linear layers and convolutions.
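As a minimal sketch of enabling it with the Trainer API (the model, dataset, and output path below are placeholders assumed to be defined for your task):

    from transformers import TrainingArguments, Trainer

    training_args = TrainingArguments(
        output_dir="out",                  # illustrative output path
        per_device_train_batch_size=8,
        bf16=True,                         # use fp16=True on GPUs without bfloat16 support
    )
    trainer = Trainer(
        model=model,                       # assumed: a pretrained 🤗 Transformers model
        args=training_args,
        train_dataset=train_dataset,       # assumed: a tokenized training dataset
    )
    trainer.train()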

Optimizer Choice

  • AdamW is the most common optimizer but has high memory usage.
  • Alternatives are selected with the optim argument of TrainingArguments, e.g. adamw_hf, adamw_torch, adamw_apex_fused, adamw_anyprecision, or adafactor. Adafactor (and 8-bit Adam, adamw_bnb_8bit) reduces optimizer-state memory; the fused variants keep AdamW's memory footprint but run faster. See the sketch after this list.
  • Fused optimizers such as adamw_apex_fused (from NVIDIA Apex) or adamw_anyprecision fuse multiple operations into fewer GPU kernels for higher throughput.
  • Fused Adam combines the Adam update's elementwise operations into a single kernel, reducing memory traffic.
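A hedged sketch of switching optimizers; adafactor is just one option, and the commented alternatives require extra packages (NVIDIA Apex for the fused variant, bitsandbytes for 8-bit Adam):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        optim="adafactor",             # lower optimizer-state memory than AdamW
        # optim="adamw_apex_fused",    # fused AdamW (needs NVIDIA Apex) for speed
        # optim="adamw_bnb_8bit",      # 8-bit Adam (needs bitsandbytes) for memory
    )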

Gradient Accumulation

  • Accumulate gradients over multiple smaller batches before doing an optimizer step to emulate training with larger batch sizes. Set gradient_accumulation_steps in TrainingArguments.
  • Allows training with large effective batch sizes that don’t fit in GPU memory.
  • On a single GPU the benefit is memory, not speed: the extra forward/backward passes add some overhead per optimizer step, and very large effective batch sizes can also affect convergence, so tune the accumulation factor rather than maximizing it. A sketch follows this list.
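For example, the following sketch reaches an effective batch size of 64 samples per optimizer step while only 4 samples reside on the GPU at a time (the numbers are illustrative):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=4,    # what actually fits in GPU memory
        gradient_accumulation_steps=16,   # 4 x 16 = 64 samples per optimizer step
    )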

Gradient Checkpointing

  • Trade off compute for memory by recomputing activations in the backward pass instead of storing them.
  • Saves a large amount of activation memory, enabling larger batch sizes or models, at the cost of roughly 20% slower iterations due to the recomputation.
  • Enable with gradient_checkpointing=True in TrainingArguments.
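A sketch of enabling it alongside a larger batch size (the batch size is illustrative and depends on your model and GPU):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        gradient_checkpointing=True,      # recompute activations in the backward pass
        per_device_train_batch_size=16,   # checkpointing often makes room for a larger batch
    )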

Efficient Data Loading

  • Use DataLoader(pin_memory=True) to speed up data transfer from CPU to GPU memory.
  • Set DataLoader(num_workers=N) to preload data with multiple worker processes.
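A plain-PyTorch sketch (train_dataset is assumed to be defined elsewhere; the Trainer exposes the same knobs as dataloader_pin_memory and dataloader_num_workers in TrainingArguments):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        train_dataset,       # assumed: a map-style or iterable dataset
        batch_size=8,
        pin_memory=True,     # page-locked host memory for faster, async CPU-to-GPU copies
        num_workers=4,       # background worker processes that prefetch batches
    )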

Offload to CPU

  • Offload the optimizer state and model parameters to CPU RAM when they are not needed on the GPU to free up GPU memory. In 🤗 Transformers this is available through the DeepSpeed ZeRO integration, via the offload_optimizer and offload_param sections of the DeepSpeed config.
  • Offloading does not make training faster; each step pays for CPU-GPU transfers, but it lets you train models and batch sizes that would otherwise not fit on a single GPU (see the sketch below).
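A hedged sketch using the DeepSpeed ZeRO integration; a DeepSpeed config can be passed to TrainingArguments as a dict or as a path to a JSON file, and the values below are illustrative:

    from transformers import TrainingArguments

    ds_config = {
        "zero_optimization": {
            "stage": 3,                                               # ZeRO-3 allows parameter offload
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
        "train_micro_batch_size_per_gpu": "auto",                     # filled in by the Trainer
    }

    training_args = TrainingArguments(output_dir="out", deepspeed=ds_config)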

TF32 on Ampere GPUs

  • Enable TF32 mode on Ampere or newer GPUs to run fp32 matrix math in the TF32 format, which keeps the dynamic range of fp32 but uses the 10-bit mantissa precision of fp16. Set tf32=True in TrainingArguments.
  • Speeds up linear layers, convolutions, and matmuls with minimal accuracy impact.
  • Equivalently, set torch.backends.cuda.matmul.allow_tf32 = True in PyTorch (and torch.backends.cudnn.allow_tf32 = True to cover cuDNN convolutions), as in the sketch below.
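Both routes are shown below; the PyTorch flags apply globally to the process, while tf32=True lets the Trainer handle it:

    import torch
    from transformers import TrainingArguments

    # Global PyTorch switches (Ampere or newer GPUs)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True    # also cover cuDNN convolutions

    # Or let the Trainer set it
    training_args = TrainingArguments(output_dir="out", tf32=True)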

Flash Attention 2

  • Use the Flash Attention 2 kernels integrated in the Transformers library (requested with attn_implementation="flash_attention_2" when loading a model) to speed up attention computation and reduce its memory usage, especially for long sequences.
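A sketch of loading a model with Flash Attention 2; the checkpoint name is illustrative, and this path requires the flash-attn package, a supported GPU, and half-precision weights:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",               # illustrative checkpoint
        torch_dtype=torch.bfloat16,               # Flash Attention 2 expects fp16/bf16 weights
        attn_implementation="flash_attention_2",
    )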

The Trainer API in 🤗 Transformers supports most of these techniques via the TrainingArguments class. Experiment with combinations of these approaches to find the optimal tradeoff between speed, memory efficiency, and model quality for your specific model and hardware setup.

Source: https://huggingface.co/docs/transformers/en/perf_train_gpu_one