- LLMLingua-2 is a novel compression technology developed by Microsoft Research, achieving state-of-the-art results with 8 times less GPU memory on tasks typically handled by models like GPT-4.
- It introduces innovative approaches such as « Data Distillation, » « Bidirectional Token Classification, » and optimized compression objectives to efficiently compress prompts without losing key information.
- The technology has shown superior performance across various language tasks and demonstrated remarkable generalization across different LLMs and languages, from GPT-3.5 to Mistral-7B and from English to Chinese.
- Compared to existing prompt compression methods, LLMLingua-2 is 3 to 6 times faster, accelerates end-to-end inference by 1.6 to 2.9 times, and significantly reduces GPU memory usage by a factor of 8.
- This advancement represents a significant step forward in making language AI more practical and scalable for real-world applications, demonstrating Microsoft Research’s leadership in the field.
https://arxiv.org/pdf/2403.12968.pdf
sample
https://huggingface.co/microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank