Building Your Private AI Stack: A 2024 Guide to Self-Hosted Solutions

Published on December 6, 2024 by loic

Are you concerned about data privacy and AI costs? The latest self-hosted AI tools offer powerful alternatives to cloud services. Let’s explore how to build a complete private AI stack using open-source solutions.

Why Self-Host Your AI Stack?

Private AI deployment brings multiple benefits:

Complete data privacy and control
No per-token or API costs
Customizable to your needs
Independence from cloud providers

The Essential Components

Let’s break down the key players in a self-hosted AI stack and how they work together.

LlamaIndex: Your Data Foundation

Think of LlamaIndex as your data’s brain. It ingests, indexes, and organizes your information so that AI applications can retrieve it on demand. With connectors for over 160 data sources, it is a solid foundation for private AI deployments.

Flowise: Your Visual AI Builder

Flowise transforms complex AI workflows into visual puzzles. Drag, drop, and connect components to create sophisticated AI applications without diving deep into code. It’s particularly powerful for:

Building RAG pipelines
Creating custom chatbots
Designing knowledge bases
Developing AI agents

Ollama: Your Model Runner

Running AI models locally has never been easier. Ollama manages your models like a skilled librarian, supporting popular options like:

Mistral
Llama 2
CodeLlama
And many others

OpenWebUI: Your Interface Layer

Think of OpenWebUI as your AI’s front desk. It provides:

Clean chat interfaces
Multi-user support
Custom pipeline configurations
Local data storage

n8n: Your Automation Hub

n8n connects everything together, automating workflows and integrating with your existing tools. With over 350 pre-built integrations, it’s the glue that holds your AI stack together.

Real-World Applications

Document Processing System

Imagine a system where documents flow seamlessly from upload to intelligent responses:

Documents enter through OpenWebUI
LlamaIndex processes and indexes them
Flowise manages the RAG pipeline
Ollama provides local inference
n8n automates the entire workflow

Knowledge Management Solution

Create a private alternative to ChatGPT that answers from your own data:

LlamaIndex manages your knowledge base
Flowise designs the interaction flows
OpenWebUI provides the interface
Ollama serves responses locally
n8n handles integrations

Making It Work Together

The magic happens when these tools collaborate:

LlamaIndex + Flowise:

Seamless data processing
Visual RAG pipeline creation
Efficient knowledge retrieval

Flowise + OpenWebUI:

User-friendly interfaces
Custom interaction flows
Real-time responses

n8n + Everything:

Automated workflows
System integrations
Process orchestration

Looking Ahead

The self-hosted AI landscape continues to evolve. These tools receive regular updates, adding features and improving performance. By building your stack now, you’re investing in a future of AI independence.

Final Thoughts

Building a private AI stack isn’t just about privacy or cost savings—it’s about taking control of your AI future. With these tools, you can create sophisticated AI solutions while keeping your data secure and your costs predictable.

Ready to start building your private AI stack? Begin with one component and gradually expand. The journey to AI independence starts with a single step.

Building a Complete Self-hosted AI Development Environment

Published on December 1, 2024 by loic

Introduction

In today’s AI landscape, having a secure, efficient, and self-contained development environment is crucial. This guide presents a comprehensive solution that combines best-in-class open-source tools for AI development, all running locally on your infrastructure.

Key Components

Ollama: Run state-of-the-art language models locally
n8n: Create automated AI workflows
Qdrant: Vector database for semantic search
Unstructured: Advanced document processing
Argilla: Data labeling and validation
Opik: Model evaluation and monitoring
JupyterLab: Interactive development environment

Benefits

Complete data privacy and control
No cloud dependencies
Cost-effective solution
Customizable infrastructure
Seamless tool integration

Prerequisites

Hardware Requirements

CPU: 4+ cores recommended
RAM: 16GB minimum, 32GB recommended
Storage: 50GB+ free space
GPU: NVIDIA GPU with 8GB+ VRAM (optional)

Software Requirements

# Check Docker version
docker --version
# Should be 20.10.0 or higher

# Check Docker Compose version
docker compose version
# Should be 2.0.0 or higher

# Check Git version
git --version
# Should be 2.0.0 or higher
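
The three checks above can also be scripted. The following is a minimal sketch rather than part of the original guide: the helper names (parse_version, meets_minimum, check_all) are hypothetical, and the parsing assumes each tool prints a dotted version number somewhere in its output.

```python
import re
import shutil
import subprocess

# Minimum versions, taken from the comments in the shell checks above.
MINIMUMS = {
    ("docker", "--version"): (20, 10, 0),
    ("docker", "compose", "version"): (2, 0, 0),
    ("git", "--version"): (2, 0, 0),
}


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(text, minimum):
    """True when the version found in text is at least the given minimum."""
    return parse_version(text) >= minimum


def check_all():
    """Print one status line per required tool."""
    for command, minimum in MINIMUMS.items():
        label = " ".join(command)
        if shutil.which(command[0]) is None:
            print(f"{label}: not installed")
            continue
        result = subprocess.run(list(command), capture_output=True, text=True)
        try:
            ok = meets_minimum(result.stdout, minimum)
        except ValueError:
            ok = False
        print(f"{label}: {'ok' if ok else 'missing or too old'}")
```

Calling check_all() on the host prints the status of Docker, Docker Compose, and Git in turn.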

System Preparation

# Create project directory
mkdir -p ai-development-environment
cd ai-development-environment

# Create required subdirectories (matching the directory structure below)
mkdir -p notebooks/examples notebooks/templates
mkdir -p shared/documents shared/processed
mkdir -p n8n/backup
mkdir -p data/documents data/processed data/vectors

Directory Structure

ai-development-environment/
├── docker-compose.yml
├── .env
├── notebooks/
│   ├── examples/
│   └── templates/
├── shared/
│   ├── documents/
│   └── processed/
├── n8n/
│   └── backup/
└── data/
    ├── documents/
    ├── processed/
    └── vectors/
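
The same layout can be created from Python, for instance from a setup script. This is a minimal sketch: the function name create_layout is hypothetical, and the demonstration runs in a temporary directory so that nothing is written to the real project.

```python
import tempfile
from pathlib import Path

# Leaf directories from the tree above; parent directories are created implicitly.
SUBDIRECTORIES = [
    "notebooks/examples",
    "notebooks/templates",
    "shared/documents",
    "shared/processed",
    "n8n/backup",
    "data/documents",
    "data/processed",
    "data/vectors",
]


def create_layout(root):
    """Create the project directories under root and return all directory paths."""
    root = Path(root)
    for relative in SUBDIRECTORIES:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_dir()
    )


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(tmp)
    print(created)
```

In a real setup, pass the project root (for example "ai-development-environment") instead of the temporary directory.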

Configuration Files

Environment Variables (.env)

# Database Configuration
POSTGRES_USER=n8n
POSTGRES_PASSWORD=n8n
POSTGRES_DB=n8n

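# n8n Security (placeholders: replace with long random values before use)
N8N_ENCRYPTION_KEY=1234567890
N8N_USER_MANAGEMENT_JWT_SECRET=1234567890

# Service Configuration (change these default credentials)
JUPYTER_TOKEN=masterclass
ARGILLA_PASSWORD=masterclass

# Optional: forwarded to JupyterLab by docker-compose.yml; leave empty if unused
OPENAI_API_KEY=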

# Resource Limits
POSTGRES_MAX_CONNECTIONS=100
ELASTICSEARCH_HEAP_SIZE=1g
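
The key, secret, and token values in this file are short, guessable strings. One way to produce stronger values is Python's standard secrets module. The sketch below is an illustration only; the helper names generate_secret and render_env_lines are hypothetical.

```python
import secrets


def generate_secret(num_bytes=32):
    """Return a random hex string carrying num_bytes bytes of entropy."""
    return secrets.token_hex(num_bytes)


def render_env_lines(names):
    """Build KEY=value lines with a fresh random secret for each name."""
    return [f"{name}={generate_secret()}" for name in names]


# Print replacement lines for the two n8n secrets in .env.
for line in render_env_lines(
    ["N8N_ENCRYPTION_KEY", "N8N_USER_MANAGEMENT_JWT_SECRET"]
):
    print(line)
```

Paste the printed lines over the placeholder values. Keep the n8n encryption key stable once n8n has stored credentials, since it is used to encrypt them.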

Docker Compose Configuration

Create docker-compose.yml:

version: '3.8'

volumes:
  n8n_storage:
    driver: local
  postgres_storage:
    driver: local
  ollama_storage:
    driver: local
  qdrant_storage:
    driver: local
  open-webui:
    driver: local
  jupyter_data:
    driver: local
  opik_data:
    driver: local
  elasticsearch_data:
    driver: local

networks:
  demo:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

# Shared service templates referenced below via <<: *service-n8n,
# <<: *service-ollama, and <<: *init-ollama. This is a sketch based on the
# n8n self-hosted AI starter kit; the image tags and settings are assumptions,
# so check them against the upstream file.
x-n8n: &service-n8n
  image: n8nio/n8n:latest
  networks: ['demo']
  environment:
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=postgres
    - DB_POSTGRESDB_USER=${POSTGRES_USER}
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
    - N8N_ENCRYPTION_KEY
    - N8N_USER_MANAGEMENT_JWT_SECRET

x-ollama: &service-ollama
  image: ollama/ollama:latest
  container_name: ollama
  networks: ['demo']
  restart: unless-stopped
  ports:
    - 11434:11434
  volumes:
    - ollama_storage:/root/.ollama

x-init-ollama: &init-ollama
  image: ollama/ollama:latest
  container_name: ollama-pull-llama
  networks: ['demo']
  volumes:
    - ollama_storage:/root/.ollama
  entrypoint: /bin/sh
  command:
    - "-c"
    - "sleep 3; OLLAMA_HOST=ollama:11434 ollama pull llama3.1"

services:
  jupyter:
    image: jupyter/datascience-notebook:lab-4.0.6
    networks: ['demo']
    ports:
      - "8888:8888"
    volumes:
      - jupyter_data:/home/jovyan
      - ./notebooks:/home/jovyan/work
      - ./shared:/home/jovyan/shared
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    command: start-notebook.py --NotebookApp.token='${JUPYTER_TOKEN}'
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/api"]
      interval: 30s
      timeout: 10s
      retries: 3

  unstructured:
    image: quay.io/unstructured-io/unstructured-api:latest
    networks: ['demo']
    ports:
      - "8000:8000"
    volumes:
      - ./shared:/home/unstructured/shared
    command: --port 8000 --host 0.0.0.0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  opik:
    image: comet/opik:latest
    networks: ['demo']
    ports:
      - "5173:5173"
    volumes:
      - opik_data:/root/opik
      - ./shared:/root/shared
    environment:
      - OPIK_BASE_URL=http://localhost:5173/api
    restart: unless-stopped

  argilla:
    image: argilla/argilla-server:latest
    networks: ['demo']
    ports:
      - "6900:6900"
    environment:
      - ARGILLA_ELASTICSEARCH=http://elasticsearch:9200
      - DEFAULT_USER_PASSWORD=${ARGILLA_PASSWORD}
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped

  elasticsearch:
    image: elasticsearch:8.11.0
    networks: ['demo']
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx${ELASTICSEARCH_HEAP_SIZE}
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200/_cluster/health | grep -vq '\"status\":\"red\"'"]
      interval: 20s
      timeout: 10s
      retries: 5

  # Workflow Automation
  n8n:
    <<: *service-n8n
    container_name: n8n
    restart: unless-stopped
    ports:
      - 5678:5678
    volumes:
      - n8n_storage:/home/node/.n8n
      - ./n8n/backup:/backup
      - ./shared:/data/shared
    depends_on:
      postgres:
        condition: service_healthy
      n8n-import:
        condition: service_completed_successfully

  n8n-import:
    <<: *service-n8n
    container_name: n8n-import
    entrypoint: /bin/sh
    command:
      - "-c"
      - "n8n import:credentials --separate --input=/backup/credentials && n8n import:workflow --separate --input=/backup/workflows"
    volumes:
      - ./n8n/backup:/backup
    depends_on:
      postgres:
        condition: service_healthy

  # Database and vector store required by the services above (n8n depends on
  # postgres, and the notebook examples connect to qdrant). Sketch based on the
  # n8n self-hosted AI starter kit; image tags are assumptions.
  postgres:
    image: postgres:16-alpine
    networks: ['demo']
    restart: unless-stopped
    environment:
      - POSTGRES_USER
      - POSTGRES_PASSWORD
      - POSTGRES_DB
    volumes:
      - postgres_storage:/var/lib/postgresql/data
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 10

  qdrant:
    image: qdrant/qdrant
    container_name: qdrant
    networks: ['demo']
    restart: unless-stopped
    ports:
      - 6333:6333
    volumes:
      - qdrant_storage:/qdrant/storage

  # Chat Interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    networks: ['demo']
    restart: unless-stopped
    container_name: open-webui
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data

  # Language Models
  ollama-cpu:
    profiles: ["cpu"]
    <<: *service-ollama

  ollama-gpu:
    profiles: ["gpu-nvidia"]
    <<: *service-ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama-pull-llama-cpu:
    profiles: ["cpu"]
    <<: *init-ollama
    depends_on:
      - ollama-cpu

  ollama-pull-llama-gpu:
    profiles: ["gpu-nvidia"]
    <<: *init-ollama
    depends_on:
      - ollama-gpu

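This completes the docker-compose.yml configuration, combining the original starter kit services with our additional AI development tools. The n8n and Ollama entries rely on the YAML anchors service-n8n, service-ollama, and init-ollama, and on the postgres and qdrant services from the original starter kit, so all of these must be present in the file for it to start. The setup provides a complete environment for AI development, document processing, and workflow automation.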

Service Integration Examples

Python Code Examples

Create a new notebook in JupyterLab with these integration examples:

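# Document Processing Pipeline
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Unstructured API integration: upload a file and get back a list of elements
def process_document(file_path):
    with open(file_path, 'rb') as f:
        response = requests.post(
            'http://unstructured:8000/general/v0/general',
            files={'files': f}
        )
    response.raise_for_status()
    return response.json()

# Ollama integration: 'stream': False makes the API return a single JSON
# object instead of newline-delimited chunks, so response.json() can parse it
def query_llm(prompt):
    response = requests.post(
        'http://ollama:11434/api/generate',
        json={'model': 'llama3.1', 'prompt': prompt, 'stream': False}
    )
    response.raise_for_status()
    return response.json()

# Qdrant integration: each point carries an id, a vector, and a payload.
# The "documents" collection must already exist with a matching vector size.
# Ids are list positions here, so a second call overwrites earlier points.
def store_embeddings(vectors, metadata):
    client = QdrantClient(host='qdrant', port=6333)
    points = [
        PointStruct(id=index, vector=vector, payload=payload)
        for index, (vector, payload) in enumerate(zip(vectors, metadata))
    ]
    client.upsert(collection_name="documents", points=points)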

AI Templates and Workflows

Document Processing Workflow

Upload documents to shared directory
Process with Unstructured API
Generate embeddings with Ollama
Store in Qdrant
Query through n8n workflows
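
The middle steps of this workflow (turning parsed elements into chunks, then pairing each chunk with an embedding) can be sketched without any running service. The code below is an illustration under stated assumptions: it assumes each parsed element is a dict with a "text" field, the function names are hypothetical, and embed is any callable that maps a string to a vector (in the real stack it would call an embedding model served by Ollama).

```python
def elements_to_chunks(elements, max_chars=500):
    """Greedily merge element texts into chunks.

    A new chunk starts when adding the next text would exceed max_chars.
    A single element longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for element in elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_points(chunks, embed, source):
    """Pair each chunk with its embedding in a point-like dict (id, vector, payload)."""
    return [
        {
            "id": index,
            "vector": embed(chunk),
            "payload": {"text": chunk, "source": source},
        }
        for index, chunk in enumerate(chunks)
    ]


# Demonstration with a stand-in embedder (text length as a 1-dimensional vector).
sample_elements = [
    {"type": "Title", "text": "Quarterly report"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter."},
    {"type": "NarrativeText", "text": ""},
]
sample_chunks = elements_to_chunks(sample_elements, max_chars=30)
sample_points = build_points(
    sample_chunks, lambda text: [float(len(text))], "report.pdf"
)
print(sample_points)
```

Keeping the embedder as a parameter makes the chunking logic easy to test and lets the embedding model be swapped without touching the rest of the pipeline.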

Docker Compose Profiles

The project uses different Docker Compose profiles to accommodate various hardware configurations:

For NVIDIA GPU Users

docker compose --profile gpu-nvidia pull
docker compose create && docker compose --profile gpu-nvidia up

This profile enables GPU acceleration for Ollama, providing faster inference times for language models[1].

For Apple Silicon (M1/M2)

docker compose pull
docker compose create && docker compose up

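Docker on Apple Silicon cannot access the GPU, so these commands start the stack without a profile. Because the Ollama services are tied to the cpu and gpu-nvidia profiles, no Ollama container starts in this mode; run Ollama directly on the host and reach it from the containers via host.docker.internal[1].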

For CPU-only Systems

docker compose --profile cpu pull
docker compose create && docker compose --profile cpu up

This profile configures services to run on CPU only, suitable for systems without dedicated GPUs[1].

Service Configurations

Core Services

n8n: Workflow automation platform with AI capabilities
Ollama: Local LLM service with configurable GPU/CPU profiles
Qdrant: Vector database for embeddings
PostgreSQL: Database backend for n8n
Open WebUI: Chat interface for model interaction

Additional Services

Unstructured: Document processing service
Argilla: Data labeling platform
Opik: Model evaluation tools
JupyterLab: Development environment

Volume Management

Each service has dedicated persistent storage:

n8n_storage
postgres_storage
ollama_storage
qdrant_storage
elasticsearch_data
jupyter_data
open-webui
opik_data

Networking

All services communicate through a shared ‘demo’ network, allowing internal service discovery and communication.

Browser Use Agent

Published on November 20, 2024 by loic

Make websites accessible for AI agents 🤖.

Browser Use is the easiest way to connect your AI agents with the browser, and it’s free.

https://github.com/gregpr07/browser-use

import asyncio

from langchain_openai import ChatOpenAI
from browser_use import Agent


async def main():
    agent = Agent(
        task="Find a one-way flight from Bali to Oman on 12 January 2025 on Google Flights. Return me the cheapest option.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    # agent.run() is a coroutine, so it must be awaited inside an async function
    result = await agent.run()
    print(result)


asyncio.run(main())

PROMPT++ Automatic prompt engineering

Published on October 29, 2024 by loic

Your Ultimate AI Prompt Rewriting Assistant! 🤖✍️

Are you struggling to get the right responses from AI?
Say hello to Prompt++, the game-changing tool that’s revolutionizing how we interact with AI!

🔑 Key Features:
• FREE Intelligent Prompt Rewriting
• Real-Time Optimization
• Detailed Explanation of Improvements
• User-Friendly Interface

💡 How it works:
1. Input your original prompt
2. Watch it transform instantly
3. Understand the improvements
4. Learn and enhance your skills

🏆 Benefits:
• Better AI outputs
• Time-saving
• Educational
• Increased productivity

Whether you’re a seasoned AI user or just getting started, Prompt++ is your personal prompt engineering expert. It rewrites and optimizes your prompts, ensuring every AI interaction is as effective as possible.

https://baconnier-prompt-plus-plus.hf.space

LLM explained in 2 videos

Published on October 14, 2024 by loic

If you want to learn about LLMs in just 3 hours, two lectures given by Yann Dubois are just what you need:

Overview of LLM training and post-training:

https://youtu.be/9vM4p9NN0Ts

Scalable LLM evaluation:

Singular Value-Rotation Adaptation with Full Rank (SVRA-FR)

Published on October 12, 2024 by loic

A Novel Approach for Efficient Fine-Tuning of Large Language Models

Abstract

We present Singular Value-Rotation Adaptation with Full Rank (SVRA-FR), a novel method for efficient fine-tuning of large language models. SVRA-FR leverages the full singular value decomposition (SVD) of weight matrices, allowing for comprehensive adjustments through singular value modification and singular vector rotation. This approach offers a parameter-efficient, interpretable, and potentially more effective alternative to existing fine-tuning methods, particularly Low-Rank Adaptation (LoRA).

1. Introduction

Large language models have demonstrated remarkable performance across various natural language processing tasks. However, fine-tuning these models for specific tasks remains computationally expensive and often requires significant amounts of data. Recent work on parameter-efficient fine-tuning methods, such as LoRA, has shown promise in reducing these costs. Our work builds upon these approaches by introducing a method that directly manipulates the full SVD components of weight matrices.

2. Method

SVRA-FR consists of the following key components:

2.1 Singular Value Decomposition

We begin by performing SVD on the original weight matrix W:

W = UΣV^T

where U and V are orthogonal matrices containing left and right singular vectors, respectively, and Σ is a diagonal matrix of singular values. This decomposition allows us to represent the weight matrix in terms of its principal components, with singular values indicating the importance of each component.

2.2 Trainable Parameters

SVRA-FR introduces three sets of trainable parameters:

a) Δσ: A vector for adjusting all singular values
b) θ_U: A vector for rotating all left singular vectors
c) θ_V: A vector for rotating all right singular vectors

These parameters allow for fine-grained control over the matrix’s structure and information content.

2.3 Singular Value Adjustment

We modify all singular values:

σ’_i = σ_i + Δσ_i

This adjustment allows us to amplify or attenuate the importance of different components in the weight matrix. By modifying singular values, we can control the "strength" of different features or directions in the weight space.

2.4 Singular Vector Rotation

We apply rotation to all left and right singular vectors:

u’_i = R(θ_U_i)u_i
v’_i = R(θ_V_i)v_i

where R(θ) is a 2D rotation matrix:

R(θ) = [cos(θ) -sin(θ); sin(θ) cos(θ)]

Rotation of singular vectors allows us to adjust the directions of the principal components in the weight space. This can be particularly useful for aligning the model’s features with task-specific requirements without drastically changing the overall structure of the weight matrix.
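
As a small illustration of the rotation matrix R(θ) defined above, the sketch below applies it to a 2-dimensional vector and shows that the vector's length is preserved. The text does not specify how a 2D rotation is extended to singular vectors with more than two entries, so this example deliberately stays in two dimensions; the function names are hypothetical.

```python
import math


def rotation_matrix(theta):
    """Return R(theta) = [[cos, -sin], [sin, cos]] as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def rotate(theta, vector):
    """Apply R(theta) to a 2-dimensional vector."""
    (a, b), (c, d) = rotation_matrix(theta)
    x, y = vector
    return [a * x + b * y, c * x + d * y]


# Rotating the unit vector along x by 90 degrees gives (up to rounding)
# the unit vector along y.
u = [1.0, 0.0]
u_rotated = rotate(math.pi / 2, u)
print(u_rotated)

# Rotations preserve length: the norm of [3, 4] stays 5 for any angle.
print(math.hypot(*rotate(0.3, [3.0, 4.0])))
```

Because R(θ) is orthogonal, a rotated unit vector remains a unit vector, which is what allows the rotated vectors to keep playing the role of singular vectors.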

2.5 Matrix Reconstruction

We reconstruct the adaptation matrix:

W_adapt = U’Σ’V’^T

where U’ and V’ contain the rotated singular vectors and Σ’ is the diagonal matrix of adjusted singular values. This reconstruction combines the effects of singular value adjustments and vector rotations into a single adaptation matrix.

2.6 Weight Update

The final weight update is applied additively:

W_new = W + αW_adapt

where α is a scaling factor. This additive update allows us to preserve the original pre-trained weights while incorporating task-specific adaptations.
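
Sections 2.1 to 2.6 can be collected into a single chain of equations. The block below only restates the definitions above in LaTeX. Its last line is an observation that follows from those definitions, not a claim made in the original text: with zero adjustments and zero rotation angles, R(0) is the identity, so W_adapt equals W and the update reduces to a rescaling of W.

```latex
\begin{align}
W &= U \Sigma V^{T} \\
\Sigma' &= \Sigma + \operatorname{diag}(\Delta\sigma) \\
u'_i &= R(\theta_{U,i})\, u_i, \qquad v'_i = R(\theta_{V,i})\, v_i \\
W_{\text{adapt}} &= U' \Sigma' V'^{T} \\
W_{\text{new}} &= W + \alpha\, W_{\text{adapt}} \\
\Delta\sigma = 0,\ \theta_U = \theta_V = 0 \;&\Longrightarrow\; W_{\text{new}} = (1 + \alpha)\, W
\end{align}
```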

3. Comparison with LoRA

SVRA-FR differs from LoRA in several key aspects:

3.1 Parameter Efficiency

For a weight matrix of size m x n, SVRA-FR introduces min(m, n) + m + n trainable parameters, compared to LoRA’s r(m + n), where r is the LoRA rank. Since min(m, n) is at most (m + n)/2, SVRA-FR uses fewer parameters than LoRA for any rank r of 2 or more. This efficiency stems from directly modifying the SVD components rather than introducing separate low-rank matrices.
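
A quick calculation makes the comparison concrete. This sketch uses the SVRA-FR count given above and, for LoRA, the standard count of r(m + n) parameters for the two factors (one of size m x r, one of size r x n). The function names and the 4096 x 4096 example matrix are illustrative assumptions.

```python
def svra_fr_params(m, n):
    """Trainable parameters for SVRA-FR: one per singular value plus one
    rotation angle per left and right singular vector dimension."""
    return min(m, n) + m + n


def lora_params(m, n, r):
    """Trainable parameters for LoRA with rank r: an m x r and an r x n factor."""
    return r * (m + n)


m = n = 4096
print(svra_fr_params(m, n))  # 12288
for rank in (1, 2, 8, 16):
    print(rank, lora_params(m, n, rank))
# 1 8192
# 2 16384
# 8 65536
# 16 131072
```

For this matrix size, only rank-1 LoRA uses fewer parameters than SVRA-FR; at rank 8, LoRA uses roughly five times as many.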

3.2 Full Rank Adaptation

Unlike LoRA, which uses low-rank matrices, SVRA-FR works with the full SVD, potentially allowing for more comprehensive adaptations. This full-rank approach enables adjustments across the entire weight space, which may be beneficial for tasks requiring fine-grained modifications.

3.3 Direct Manipulation of Matrix Structure

SVRA-FR directly modifies the singular values and vectors of the original matrix, potentially preserving more of the pre-trained structure. This direct manipulation allows for more interpretable changes and may lead to better preservation of the model’s original capabilities.

4. Advantages

Parameter Efficiency: SVRA-FR introduces a small number of trainable parameters relative to the original matrix size, enabling efficient fine-tuning even for very large models.
Comprehensive Adaptation: By working with the full SVD, SVRA-FR allows for adjustments across the entire weight space, potentially capturing complex task-specific requirements.
Interpretability: Changes to singular values and singular vector rotations have clear mathematical interpretations, providing insights into how the model adapts to new tasks.
Preservation of Pre-trained Knowledge: By manipulating the existing SVD structure, SVRA-FR potentially preserves more of the pre-trained model’s knowledge while allowing for task-specific adaptations.
Flexibility: The method allows for both global (singular value adjustments) and targeted (rotations) modifications to the weight matrices, providing a versatile approach to fine-tuning.

5. Potential Challenges

Computational Cost: Computing the full SVD for large matrices can be computationally expensive during initialization. This could be mitigated by using approximate or iterative SVD algorithms.
Optimization Complexity: Training rotations might require careful optimization strategies, as the parameter space for rotations can be more complex than standard linear transformations.
Overfitting Risk: The flexibility of full-rank adaptation might lead to overfitting on smaller datasets. Regularization techniques specific to SVD components might need to be developed.

6. Discussion

SVRA-FR offers a novel approach to fine-tuning large language models by directly manipulating their SVD structure. This method combines the efficiency of parameter-efficient fine-tuning techniques with the comprehensiveness of full-rank adaptations. By allowing for targeted adjustments to singular values and rotations of singular vectors, SVRA-FR provides a flexible framework for adapting pre-trained models to specific tasks.

The full-rank nature of SVRA-FR is a key differentiator from methods like LoRA. While this could potentially lead to more comprehensive adaptations, it also raises questions about the trade-off between flexibility and the risk of overfitting. Empirical studies will be crucial to understand these trade-offs across various tasks and model sizes.

7. Future Work

Future research directions include:

Empirical evaluation of SVRA-FR across various NLP tasks and model sizes
Comparison with other parameter-efficient fine-tuning methods, including LoRA and adapter-based approaches
Investigation of fast SVD techniques to reduce initialization time
Exploration of regularization techniques specific to SVD components to mitigate potential overfitting
Analysis of the interplay between singular value adjustments and singular vector rotations
Development of visualization tools to interpret the changes made by SVRA-FR during fine-tuning

8. Conclusion

SVRA-FR represents a promising new direction in efficient fine-tuning of large language models. By leveraging the full SVD structure of weight matrices, it offers a parameter-efficient, interpretable, and flexible approach to model adaptation. While further empirical validation is needed, SVRA-FR has the potential to significantly improve the efficiency and effectiveness of fine-tuning large language models for specific tasks, particularly in scenarios where comprehensive adaptations are beneficial. The method’s ability to directly manipulate the core structure of weight matrices opens up new possibilities for understanding and controlling the adaptation process in deep learning models.

Sources: Loic Baconnier

Scalable. Interactive. Interpretable Data Science

Published on September 20, 2024 by loic

Safe, interpretable, trustworthy AI, through interactive intelligent visualization, with applications in adversarial machine learning (protecting AI from harm and doing harm), scalable discoveries of deep learning models, and inclusive AI for everyone.

A Must.

https://poloclub.github.io/

Transformer Explainer

Learn How Transformer Models Work with Interactive Visualization

https://poloclub.github.io/transformer-explainer/

Diffusion Explainer

Learn how Stable Diffusion transforms your text prompt into an image.

https://poloclub.github.io/diffusion-explainer/

Open Source LLM Tools

Published on September 12, 2024 by loic

If you are looking for useful open-source LLM tools, this is a really useful resource.

It covers categories such as tutorials, AI engineering, and applications, among others, and shows the number of GitHub stars for each project.

https://huyenchip.com/llama-police

PDF-Extract-Kit

Published on August 25, 2024 by loic

PDF-Extract-Kit, a comprehensive toolkit for high-quality PDF content extraction, including layout detection, formula detection, formula recognition, and OCR.

PDF documents contain a wealth of knowledge, yet extracting high-quality content from PDFs is not an easy task. To address this, we have broken down the task of PDF content extraction into several components:

Layout Detection: Using the LayoutLMv3 model for region detection, such as images, tables, titles, text, etc.;
Formula Detection: Using YOLOv8 for detecting formulas, including inline formulas and isolated formulas;
Formula Recognition: Using UniMERNet for formula recognition;
Table Recognition: Using StructEqTable for table recognition;
Optical Character Recognition: Using PaddleOCR for text recognition.

https://github.com/opendatalab/PDF-Extract-Kit

https://www.perplexity.ai/search/look-at-this-github-https-gith-8ZVtYO.2SA6_q5Vg.VXy.g

BERTopic

Published on August 24, 2024 by loic

BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with a class-based TF-IDF procedure.
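
The last step, class-based TF-IDF, can be sketched in a few lines. This is a simplified illustration and not BERTopic's own implementation: it assumes documents are already grouped into clusters (the embedding and clustering steps are skipped), uses raw term counts with whitespace tokenization, and scores each term t in class c as tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the term's count across all classes. The function name class_tf_idf is hypothetical.

```python
import math
from collections import Counter


def class_tf_idf(docs_by_class):
    """Score terms per class with tf(t, c) * log(1 + A / f(t))."""
    # Treat all documents of a class as one long document.
    term_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_by_class.items()
    }
    # f(t): count of each term across all classes.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    average_words = sum(total_counts.values()) / len(term_counts)
    return {
        label: {
            term: count * math.log(1 + average_words / total_counts[term])
            for term, count in counts.items()
        }
        for label, counts in term_counts.items()
    }


clusters = {
    "pets": ["the cat sat", "the cat purred"],
    "cars": ["the car drove", "the engine roared"],
}
scores = class_tf_idf(clusters)
top_pets = sorted(scores["pets"], key=scores["pets"].get, reverse=True)
print(top_pets[0])  # cat
```

Words that are frequent inside one cluster but rare elsewhere (here "cat") score higher than words shared by every cluster (here "the"), which is what makes the top-scoring terms usable as topic labels.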

https://ritvik19.medium.com/papers-explained-193-bertopic-f9aec10cd5a6