The History of Open-Source LLMs: Early Days (Part One)

Published on August 2, 2023 by loic

https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-early

Language modeling research traces back to models like GPT and GPT-2, and to pre-transformer methods such as ULMFiT.
The proposal of GPT-3 marked the first surge in the popularity of these models, showcasing impressive few-shot learning through a combination of self-supervised pre-training and in-context learning.
The recognition of GPT-3 led to the creation of various large language models (LLMs), including InstructGPT and ChatGPT, sparking widespread interest in generative AI.
Early LLMs often remained closed source, limiting researchers’ understanding and improvement of their workings.
Open-source variants of popular language models began to emerge gradually, although they initially lagged behind proprietary models in performance.
These early open-source models laid the groundwork for increased transparency in LLM research and inspired the development of more capable later models such as Falcon and LLaMA-2.
This overview is part one of a three-part series on the history of open-source language models, covering their beginnings, recent developments, and the use of imitation and alignment techniques to improve their performance.

Large Transformer Model Inference Optimization

Published on August 2, 2023 by loic

https://lilianweng.github.io/posts/2023-01-10-inference-optimization/#quantization

Large transformer models are mainstream nowadays, producing state-of-the-art (SoTA) results on a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a major bottleneck for adopting a powerful transformer to solve real-world tasks at scale.

Why is it hard to run inference for large transformer models? Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (Pope et al. 2022):

Large memory footprint. Both model parameters and intermediate states are needed in memory at inference time. For example,
- The KV cache must be kept in memory during decoding; e.g., for a batch size of 512 and a context length of 2048, the KV cache totals 3TB, which is 3x the model size (!).
- Inference cost from the attention mechanism scales quadratically with input sequence length.
Low parallelizability. Inference generation is executed in an autoregressive fashion, making the decoding process hard to parallelize.
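
The memory pressure from the KV cache can be estimated with simple arithmetic. The sketch below assumes standard multi-head attention, where every layer caches one key and one value vector per token; the function name `kv_cache_bytes` and the model dimensions are hypothetical, chosen for round numbers, and are not the configuration behind the 3TB figure quoted above.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, batch_size, context_len,
                   bytes_per_value=2):
    """Estimate KV cache size for standard multi-head attention.

    Assumes each layer stores one key and one value vector per token,
    each of width n_heads * head_dim, at bytes_per_value bytes per entry
    (2 corresponds to fp16/bf16).
    """
    per_token_per_layer = 2 * n_heads * head_dim  # key + value entries
    total_entries = n_layers * per_token_per_layer * batch_size * context_len
    return total_entries * bytes_per_value


# Hypothetical model dimensions, for illustration only.
small = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                       batch_size=1, context_len=2048)
large = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                       batch_size=512, context_len=2048)

print(f"batch size 1:   {small / 2**30:.0f} GiB")
print(f"batch size 512: {large / 2**30:.0f} GiB")
```

Under these assumptions the cache grows linearly with both batch size and context length, which is why large-batch serving can need more memory for the cache than for the weights.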

In this post, we will look into several approaches for making transformer inference more efficient. Some are general network compression methods, while others are specific to transformer architecture.

Universal and Transferable Adversarial Attacks on Aligned Language Models

Published on August 2, 2023 by loic

https://llm-attacks.org/

This research examines the safety of large language models (LLMs) such as ChatGPT, Bard, and Claude. It demonstrates that adversarial attacks can be constructed automatically: character sequences appended to a user query cause the LLM to follow harmful commands. Unlike traditional "jailbreaks," these attacks are automated and can affect both open-source and closed-source chatbots. The study raises concerns about the effectiveness of mitigation measures and suggests that the challenges posed by adversarial behavior might persist due to the nature of deep learning models. The findings highlight the need to weigh the safety implications carefully as LLMs become more integrated into various applications.

Time Series Made Easy in Python: DARTS

Published on March 12, 2023 by loic

Darts is a Python library for user-friendly forecasting and anomaly detection on time series. It contains a variety of models, from classics such as ARIMA to deep neural networks.

Some of the key features of Darts include:

A simple and intuitive interface for defining and fitting models
Support for different types of time series data, including univariate, multivariate, and panel data
A wide range of built-in models, including ARIMA, Exponential Smoothing, Prophet, LSTM, and TCN
Tools for hyperparameter tuning and model selection, such as cross-validation and grid search
Visualization tools for exploring and analyzing time series data and model outputs
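
To give a feel for what one of the classical models named above computes, here is a minimal pure-Python sketch of simple exponential smoothing. It is not Darts code and does not use the Darts API; the function name `simple_exp_smoothing_forecast`, the `alpha` value, and the toy series are assumptions for illustration only.

```python
def simple_exp_smoothing_forecast(values, alpha=0.5, horizon=3):
    """Flat forecast from simple exponential smoothing (no trend, no seasonality)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]  # initialize the level with the first observation
    for x in values[1:]:
        # Blend the new observation with the previous level.
        level = alpha * x + (1 - alpha) * level
    # Without trend or seasonality, every future step repeats the last level.
    return [level] * horizon


series = [10.0, 12.0, 14.0, 16.0]
forecast = simple_exp_smoothing_forecast(series, alpha=0.5, horizon=2)
print(forecast)  # [14.25, 14.25]
```

A library model would typically also fit `alpha` from the data and handle trend and seasonality; this sketch fixes `alpha` by hand to keep the update rule visible.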

Library

Model	Univariate	Multivariate	Probabilistic	Multiple series (global)	Past-observed covariates	Future-known covariates	Static covariates	Reference
`ARIMA`	✅		✅			✅
`VARIMA`	✅	✅				✅
`AutoARIMA`	✅					✅
`StatsForecastAutoARIMA` (faster AutoARIMA)	✅		✅			✅		Nixtla’s statsforecast
`ExponentialSmoothing`	✅		✅
`StatsForecastETS`	✅					✅		Nixtla’s statsforecast
`BATS` and `TBATS`	✅		✅					TBATS paper
`Theta` and `FourTheta`	✅							Theta & 4 Theta
`Prophet` (see install notes)	✅		✅			✅		Prophet repo
`FFT` (Fast Fourier Transform)	✅
`KalmanForecaster` using the Kalman filter and N4SID for system identification	✅	✅	✅			✅		N4SID paper
`Croston` method	✅
`RegressionModel`; generic wrapper around any sklearn regression model	✅	✅		✅	✅	✅	✅
`RandomForest`	✅	✅		✅	✅	✅	✅
`LinearRegressionModel`	✅	✅	✅	✅	✅	✅	✅
`LightGBMModel`	✅	✅	✅	✅	✅	✅	✅
`CatBoostModel`	✅	✅	✅	✅	✅	✅	✅
`XGBModel`	✅	✅	✅	✅	✅	✅	✅
`RNNModel` (incl. LSTM and GRU); equivalent to DeepAR in its probabilistic version	✅	✅	✅	✅		✅		DeepAR paper
`BlockRNNModel` (incl. LSTM and GRU)	✅	✅	✅	✅	✅
`NBEATSModel`	✅	✅	✅	✅	✅			N-BEATS paper
`NHiTSModel`	✅	✅	✅	✅	✅			N-HiTS paper
`TCNModel`	✅	✅	✅	✅	✅			TCN paper, DeepTCN paper, blog post
`TransformerModel`	✅	✅	✅	✅	✅
`TFTModel` (Temporal Fusion Transformer)	✅	✅	✅	✅	✅	✅	✅	TFT paper, PyTorch Forecasting
`DLinearModel`	✅	✅	✅	✅	✅	✅	✅	DLinear paper
`NLinearModel`	✅	✅	✅	✅	✅	✅	✅	NLinear paper
Naive Baselines	✅	✅

Category Encoders

Published on March 11, 2023 by loic

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.

Category Encoders is a Python library for encoding categorical variables for machine learning tasks. It is available on contrib.scikit-learn.org and extends the capabilities of scikit-learn’s preprocessing module.

The library provides several powerful encoding techniques for dealing with categorical data, including:

Ordinal encoding: maps categorical variables to integer values based on their order of appearance
One-hot encoding: creates a binary feature for each category in a variable
Binary encoding: maps each category to a binary code
Target encoding: encodes each category with the mean target value for that category
Hashing encoding: maps each category to one of a fixed number of output columns using a hash function
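
The first four techniques in this list can be sketched in a few lines of plain Python. This illustrates the ideas only; it is not the library's implementation or API. All function names are hypothetical, and the target encoder omits the smoothing that practical implementations typically apply.

```python
from collections import defaultdict


def ordinal_encode(values):
    """Map each category to an integer in order of first appearance (from 1)."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    return [mapping[v] for v in values]


def one_hot_encode(values):
    """One binary column per category, columns in order of first appearance."""
    categories = list(dict.fromkeys(values))
    return [[int(v == c) for c in categories] for v in values]


def binary_encode(values):
    """Write each category's ordinal code in base 2, one bit per column."""
    codes = ordinal_encode(values)
    width = max(codes).bit_length()
    return [[(code >> bit) & 1 for bit in reversed(range(width))]
            for code in codes]


def target_encode(values, targets):
    """Replace each category with the mean target over rows of that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


colors = ["red", "green", "red", "blue"]
labels = [1, 0, 0, 1]

print(ordinal_encode(colors))         # [1, 2, 1, 3]
print(one_hot_encode(colors))         # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(binary_encode(colors))          # [[0, 1], [1, 0], [0, 1], [1, 1]]
print(target_encode(colors, labels))  # [0.5, 0.0, 0.5, 1.0]
```

Binary encoding needs only two columns for the three categories here, where one-hot encoding needs three; the gap widens as the number of categories grows.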

Category Encoders also supports a range of advanced features, such as handling missing values, combining multiple encoders, and applying encoders to specific subsets of features.

Overall, Category Encoders is a useful tool for preprocessing categorical data and improving the accuracy and performance of machine learning models.

Cleaning labels: Cleanlab

Published on March 11, 2023 by loic

cleanlab automatically detects problems in an ML dataset. This data-centric AI package facilitates machine learning with messy, real-world data by providing clean labels for robust training and flagging errors in your data.
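
As a rough illustration of how label errors can be flagged from a model's predicted probabilities, the sketch below loosely follows the per-class thresholding idea behind confident learning. It is a simplified assumption-laden toy, not cleanlab's implementation or API: the function name `find_label_issues_sketch`, the flagging rule, and the data are all hypothetical.

```python
def find_label_issues_sketch(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    Simplified rule:
    1. The threshold for class j is the mean predicted probability of
       class j over the examples whose given label is j.
    2. An example is flagged when its most probable class among those that
       reach their own threshold differs from its given label.
    """
    n_classes = len(pred_probs[0])
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for y, p in zip(labels, pred_probs) if y == j]
        thresholds.append(sum(own) / len(own))  # assumes every class appears

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            best = max(confident, key=lambda j: p[j])
            if best != y:
                issues.append(i)
    return issues


# Toy data: example 2 is labeled 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

issues = find_label_issues_sketch(labels, pred_probs)
print(issues)  # [2]
```

Using per-class thresholds rather than a single cutoff keeps the rule from over-flagging classes on which the model is systematically less confident.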

Paper: https://arxiv.org/pdf/1911.00068.pdf


Yellowbrick: Machine Learning Visualization

Published on March 11, 2023 by loic

Feature Visualization

Rank Features: pairwise ranking of features to detect relationships
Parallel Coordinates: horizontal visualization of instances
Radial Visualization: separation of instances around a circular plot
PCA Projection: projection of instances based on principal components
Manifold Visualization: high dimensional visualization with manifold learning
Joint Plots: direct data visualization with feature selection

Classification Visualization

Class Prediction Error: shows error and support in classification
Classification Report: visual representation of precision, recall, and F1
ROC/AUC Curves: receiver operating characteristic and area under the curve
Precision-Recall Curves: precision vs recall for different probability thresholds
Confusion Matrices: visual description of class decision making
Discrimination Threshold: find a threshold that best separates binary classes

Regression Visualization

Prediction Error Plot: find model breakdowns along the domain of the target
Residuals Plot: show the difference in residuals of training and test data
Alpha Selection: show how the choice of alpha influences regularization
Cook’s Distance: show the influence of instances on linear regression

Clustering Visualization

K-Elbow Plot: select k using the elbow method and various metrics
Silhouette Plot: select k by visualizing silhouette coefficient values
Intercluster Distance Maps: show relative distance and size/importance of clusters

Model Selection Visualization

Validation Curve: tune a model with respect to a single hyperparameter
Learning Curve: show if a model might benefit from more data or less complexity
Feature Importances: rank features by importance or linear coefficients for a specific model
Recursive Feature Elimination: find the best subset of features based on importance

Target Visualization

Balanced Binning Reference: generate a histogram with vertical lines showing the recommended value point to bin the data into evenly distributed bins
Class Balance: see how the distribution of classes affects the model
Feature Correlation: display the correlation between features and dependent variables

Text Visualization

Term Frequency: visualize the frequency distribution of terms in the corpus
t-SNE Corpus Visualization: use stochastic neighbor embedding to project documents
Dispersion Plot: visualize how key terms are dispersed throughout a corpus
UMAP Corpus Visualization: plot similar documents closer together to discover clusters
PosTag Visualization: plot the counts of different parts-of-speech throughout a tagged corpus