The History of Open-Source LLMs: Early Days (Part One)

https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-early

  • Language modeling research traces back to models like GPT, GPT-2, and pre-transformer methods such as ULMFit.
  • GPT-3 marked the first surge in popularity for these models, showcasing impressive few-shot learning through a combination of self-supervised pre-training and in-context learning.
  • The recognition of GPT-3 led to the creation of various large language models (LLMs), including InstructGPT and ChatGPT, sparking widespread interest in generative AI.
  • Early LLMs often remained closed source, which limited researchers’ ability to understand and improve them.
  • Open-source variants of popular language models began to emerge gradually, although they initially lagged behind proprietary models in performance.
  • These early open-source models laid the groundwork for increased transparency in LLM research and inspired the development of more powerful subsequent models like Falcon and LLaMA-2.
  • The overview is part of a three-part series that delves into the history of open-source language models, exploring their beginnings, recent developments, and the application of imitation and alignment techniques to enhance their performance.

Large Transformer Model Inference Optimization

https://lilianweng.github.io/posts/2023-01-10-inference-optimization/#quantization

Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a big bottleneck for adopting a powerful transformer for solving real-world tasks at scale.

Why is it hard to run inference for large transformer models? Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (Pope et al. 2022):

  1. Large memory footprint. Both model parameters and intermediate states are needed in memory at inference time. For example,
    • The KV cache must be kept in memory during decoding; e.g., for a batch size of 512 and a context length of 2048, the KV cache totals 3TB, that is 3x the model size (!).
    • Inference cost from the attention mechanism scales quadratically with input sequence length.
  2. Low parallelizability. Inference generation is executed in an autoregressive fashion, making the decoding process hard to parallelize.
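
The memory figures above can be sanity-checked with a little arithmetic. The sketch below is only illustrative: the function names and the model configuration (118 layers, 48 attention heads, head dimension 128, 2 bytes per stored value) are assumptions that are not stated in the summary, chosen because they describe a 500B-class multi-head model and land close to the quoted 3TB. The second function simply counts attention score entries to show the quadratic growth with sequence length.

```python
def kv_cache_bytes(batch_size, context_len, n_layers, n_heads, head_dim,
                   bytes_per_value=2):
    """Approximate KV cache size for standard multi-head attention.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return (2 * n_layers * n_heads * head_dim
            * context_len * batch_size * bytes_per_value)


def attention_score_entries(seq_len, n_layers, n_heads):
    """Number of query-key scores computed for one full-sequence pass.

    Each head in each layer builds a seq_len x seq_len score matrix,
    so the count grows quadratically with sequence length.
    """
    return n_layers * n_heads * seq_len * seq_len


if __name__ == "__main__":
    # Hypothetical configuration (an assumption, not taken from the source).
    size = kv_cache_bytes(batch_size=512, context_len=2048,
                          n_layers=118, n_heads=48, head_dim=128)
    print(f"KV cache: {size / 1e12:.2f} TB")  # prints "KV cache: 3.04 TB"

    # Doubling the sequence length quadruples the number of attention scores.
    ratio = (attention_score_entries(4096, 118, 48)
             / attention_score_entries(2048, 118, 48))
    print(f"Score entries ratio for 2x length: {ratio:.0f}x")
```

Under these assumed settings the cache grows linearly with both batch size and context length, which is why large-batch, long-context serving is dominated by cache memory rather than by the weights themselves.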

In this post, we will look into several approaches for making transformer inference more efficient. Some are general network compression methods, while others are specific to transformer architecture.


Universal and Transferable Adversarial Attacks on Aligned Language Models

https://llm-attacks.org/

This research examines the safety of large language models (LLMs) such as ChatGPT, Bard, and Claude. It demonstrates the potential for automated creation of adversarial attacks, using character sequences added to user queries that manipulate the LLM into following harmful commands. Unlike traditional “jailbreaks,” these attacks are automated and can affect both open-source and closed-source chatbots. The study raises concerns about the effectiveness of mitigation measures and suggests that the challenges posed by adversarial behavior might persist due to the nature of deep learning models. The findings highlight the need for careful consideration of the safety implications as LLMs become more integrated into various applications.
