Optimize open LLMs using GPTQ and Hugging Face Optimum

https://www.philschmid.de/gptq-llama

  • The Hugging Face Optimum team collaborated with the AutoGPTQ library to provide a simple API for applying GPTQ quantization to language models.
  • GPTQ quantization compresses open LLMs to 8, 4, 3, or 2 bits, enabling them to run on smaller hardware with minimal performance loss.
  • The blog covers the following steps (illustrative sketches for the key ones follow this list):
  1. Setting up the development environment.
  2. Preparing the quantization dataset.
  3. Loading and quantizing the model.
  4. Testing performance and inference speed.
  5. Bonus: Running inference with text generation.
  • The post explains the purpose of GPTQ before diving into the tutorial.
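
For the quantization dataset, the Transformers GPTQ integration accepts either a named calibration set (e.g. "c4") or a plain list of strings. A minimal sketch of building such a list, assuming the `datasets` library is installed and using wikitext-2 purely as a stand-in corpus:

```python
# Assumes: pip install datasets
from datasets import load_dataset

# wikitext-2 is an illustrative stand-in; any representative corpus works.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Keep a few hundred non-empty snippets; GPTQ only needs a small sample.
calibration_texts = [t for t in raw["text"] if t.strip()][:256]
```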
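Loading and quantizing the model can then go through the `GPTQConfig` integration in Transformers, which calls AutoGPTQ under the hood. A sketch, assuming `transformers`, `optimum`, `accelerate`, and `auto-gptq` are installed; the model id and output directory are placeholders:

```python
# Assumes: pip install transformers optimum accelerate auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ; `dataset` can be a named set like "c4"
# or a custom list of strings such as `calibration_texts` above.
gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="c4",
    tokenizer=tokenizer,
)

# Quantization runs during loading, needs a GPU, and can take a while.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Persist the quantized weights for later inference.
model.save_pretrained("llama-7b-gptq-4bit")
tokenizer.save_pretrained("llama-7b-gptq-4bit")
```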
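For the performance and inference-speed check, a rough latency sketch along these lines works; the saved directory, prompt, and token counts are arbitrary choices, and `auto-gptq` must be installed to load the quantized weights:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_dir = "llama-7b-gptq-4bit"  # placeholder path from the quantization step
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="auto")

inputs = tokenizer("GPTQ is a quantization method that", return_tensors="pt").to(model.device)

# Warm-up run so one-time CUDA setup doesn't skew the timing.
model.generate(**inputs, max_new_tokens=8)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Tokens generated beyond the prompt, divided by wall-clock time.
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```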