Introducing Skrub: A Powerful Data Cleaning and Preprocessing Library

Data scientists and analysts often spend significant time cleaning and preparing data before analysis. The Skrub library emerges as a powerful solution for streamlining this process, offering efficient tools for data wrangling and preprocessing.

Key Features

Data Type Handling
The library excels at managing various data types, from categorical variables to numerical data, with built-in support for handling null values and unique value identification[1].

Automated Processing
Skrub’s standout feature is its ability to process complex datasets with minimal manual intervention. The library can handle diverse data structures, including employee records, departmental information, and temporal data[1].

Statistical Analysis
The library provides comprehensive statistical analysis capabilities, offering (illustrated in the sketch after this list):

  • Mean and standard deviation calculations
  • Median and IQR measurements
  • Range identification (minimum to maximum values)[1]
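
Recent skrub versions surface these summaries through a TableReport helper; the snippet below is a minimal sketch assuming that API (the export method name is taken from recent docs and may differ across versions):

    import pandas as pd
    from skrub import TableReport

    df = pd.DataFrame({
        "department": ["IT", "HR", "IT"],
        "salary": [52000, 61000, 58000],
    })

    # per-column statistics: mean, std, median, IQR, nulls, unique counts
    report = TableReport(df)
    report.write_html("report.html")  # assumed export method; in a notebook, `report` renders inline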

Real-World Application

To demonstrate Skrub’s capabilities, consider its handling of employee data (a code sketch follows this list):

  • Processes multiple data types simultaneously
  • Manages categorical data like department names and position titles
  • Handles temporal data such as hire dates
  • Provides detailed statistical summaries of numerical fields[1][2]
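
A minimal sketch of this workflow, assuming skrub’s TableVectorizer and its bundled employee-salaries dataset helper:

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline
    from skrub import TableVectorizer
    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    X, y = dataset.X, dataset.y  # mixed categorical, text, and date columns

    # TableVectorizer picks a suitable encoder for each column type automatically
    model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
    model.fit(X, y)

The pipeline fits directly on the raw dataframe: dates, high-cardinality categories, and numbers are each routed to an appropriate transformer.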

Performance Metrics

The library shows impressive efficiency in handling large datasets:

  • Processes thousands of unique entries
  • Maintains data integrity with zero null values in critical fields
  • Handles datasets with hundreds of unique categories[1]

Integration and Usage

Skrub seamlessly integrates with existing data science workflows, focusing on reducing preprocessing time and enhancing machine learning pipeline efficiency. Its intuitive interface makes it accessible for both beginners and experienced data scientists[2].
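
For an even shorter path to a full pipeline, recent skrub releases ship a tabular_learner helper (name and availability depend on the installed version); a hedged sketch:

    from skrub import tabular_learner
    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()

    # one call assembles preprocessing plus a gradient-boosted model
    model = tabular_learner("regressor")
    model.fit(dataset.X, dataset.y)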

This powerful library represents a significant step forward in data preprocessing, living up to its motto: “Less wrangling, more machine learning”[2].

Sources
[1] https://skrub-data.org/stable
[2] https://skrub-data.org/stable/auto_examples/00_getting_started.html

Chronos: Learning the Language of Time Series

  • Chronos is a framework designed for pretrained probabilistic time series models.
  • It utilizes scaling and quantization to tokenize time series values into a fixed vocabulary (sketched in the code after this list).
  • Chronos trains transformer-based language model architectures (specifically, models from the T5 family with parameters ranging from 20M to 710M) using cross-entropy loss.
  • The models are pretrained on a mix of publicly available datasets and a synthetic dataset generated via Gaussian processes, enhancing generalization.
  • In a comprehensive benchmark on 42 datasets, evaluated against both classical local models and deep learning approaches, Chronos models (a) significantly outperform other methods on datasets included in the training corpus, and (b) achieve comparable, and occasionally superior, zero-shot performance on new datasets relative to methods trained specifically on them.
  • These results demonstrate the potential of pretrained models to leverage time series data across various domains for improving zero-shot accuracy on unseen forecasting tasks, suggesting a simplified approach to forecasting pipelines.
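
The scaling-and-quantization step can be sketched in a few lines of NumPy; the bin range and vocabulary size below are illustrative rather than the paper’s exact defaults:

    import numpy as np

    def tokenize(series, num_bins=4096, low=-15.0, high=15.0):
        # mean scaling: normalize by the mean absolute value of the context
        scale = np.mean(np.abs(series)) or 1.0
        scaled = series / scale
        # uniform quantization: each scaled value becomes a bin index,
        # i.e. a token id in a fixed vocabulary for the language model
        edges = np.linspace(low, high, num_bins - 1)
        return np.digitize(scaled, edges), scale

    tokens, scale = tokenize(np.array([10.0, 12.0, 11.0, 15.0]))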

https://arxiv.org/pdf/2403.07815.pdf

https://github.com/amazon-science/chronos-forecasting/
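
The repository’s README documents an inference API along these lines (model name and call signatures follow the README’s examples; verify against the current version):

    import torch
    from chronos import ChronosPipeline

    pipeline = ChronosPipeline.from_pretrained(
        "amazon/chronos-t5-small",
        device_map="cpu",
        torch_dtype=torch.float32,
    )

    context = torch.tensor([10.0, 12.0, 11.0, 15.0, 14.0, 16.0])
    # returns sample paths with shape [num_series, num_samples, prediction_length]
    forecast = pipeline.predict(context, prediction_length=4)
    median = forecast.quantile(0.5, dim=1)  # point forecast from the samples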

Category Encoders

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.

Category Encoders is a Python library for encoding categorical variables for machine learning tasks. It is available on contrib.scikit-learn.org and extends the capabilities of scikit-learn’s preprocessing module.

The library provides several powerful encoding techniques for dealing with categorical data, including the following (a short sketch follows the list):

  • Ordinal encoding: maps categorical variables to integer values based on their order of appearance
  • One-hot encoding: creates a binary feature for each category in a variable
  • Binary encoding: maps each category to a binary code
  • Target encoding: encodes each category with a smoothed mean of the target value for that category
  • Hashing encoding: uses a hash function to map each category to an index in a fixed-size feature space
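
A minimal sketch using the category_encoders package (column names and data are illustrative):

    import pandas as pd
    import category_encoders as ce

    X = pd.DataFrame({"dept": ["IT", "HR", "IT", "Sales"]})
    y = pd.Series([1, 0, 1, 0])

    # target encoding: each category is replaced by a (smoothed) mean of y
    encoder = ce.TargetEncoder(cols=["dept"])
    X_encoded = encoder.fit_transform(X, y)

    # the other encoders share the same fit/transform interface, e.g.
    # ce.BinaryEncoder(cols=["dept"]).fit_transform(X)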

Category Encoders also supports a range of advanced features, such as handling missing values, combining multiple encoders, and applying encoders to specific subsets of features.

Overall, Category Encoders is a useful tool for preprocessing categorical data and improving the accuracy and performance of machine learning models.

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

The main idea in the paper is that the performance of a regular multi-layer perceptron (MLP) can be significantly improved by using Transformers to transform regular categorical embeddings into contextual ones.

The TabTransformer is built upon self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy.
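
A minimal PyTorch sketch of that idea (not the authors’ reference implementation; all sizes are illustrative): per-column categorical embeddings pass through Transformer encoder layers to become contextual, then join the continuous features in an MLP.

    import torch
    import torch.nn as nn

    class TabTransformerSketch(nn.Module):
        def __init__(self, cardinalities, n_continuous, d_model=32, n_heads=4, n_layers=2):
            super().__init__()
            # one embedding table per categorical column
            self.embeddings = nn.ModuleList(
                nn.Embedding(card, d_model) for card in cardinalities
            )
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, n_layers)
            self.mlp = nn.Sequential(
                nn.Linear(d_model * len(cardinalities) + n_continuous, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, x_cat, x_cont):
            # stack per-feature embeddings: (batch, n_categorical, d_model)
            tokens = torch.stack(
                [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
            )
            contextual = self.transformer(tokens)   # contextual embeddings
            flat = contextual.flatten(start_dim=1)  # (batch, n_categorical * d_model)
            return self.mlp(torch.cat([flat, x_cont], dim=1))

    model = TabTransformerSketch(cardinalities=[10, 5], n_continuous=3)
    logits = model(torch.randint(0, 5, (8, 2)), torch.randn(8, 3))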