Text splitting | Deeplearning.fr

Large language models (LLMs) can be used for many tasks, but their context size is limited and is often smaller than the documents you want to use. To work with longer documents, you usually have to split the text into chunks that fit within this context size.

The text-splitter crate provides methods for splitting longer pieces of text into smaller chunks. It aims to fill each chunk as close to a desired chunk size as possible, while still splitting at semantically sensible boundaries whenever possible.
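To make the idea of "maximize the chunk size, but split at sensible boundaries" concrete, here is a minimal sketch in plain Python. It is not the crate's implementation or API: the function name `pack_sentences`, the naive sentence regex, and the character-based size limit are all assumptions chosen for illustration.

```python
import re


def pack_sentences(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration only, not part of any library.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            # The sentence still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-cut as a last resort.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in pack_sentences("One. Two three. Four five six.", max_chars=15):
        print(repr(chunk))
```

Each chunk grows until the next sentence would exceed the limit, so chunks end on sentence boundaries except when a single sentence is itself too long. A real splitter would typically measure size in tokens rather than characters and use a more robust notion of sentence boundary.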

Levels Of Text Splitting

Level 1: Character Splitting – Simple static character chunks of data
Level 2: Recursive Character Text Splitting – Recursive chunking based on a list of separators
Level 3: Document Specific Splitting – Various chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic Splitting – Embedding-walk-based chunking
Level 5: Agentic Splitting – Experimental method of splitting text with an agent-like system. Worth considering if you believe that token cost will trend toward $0.00
*Bonus Level:* Alternative Representation Chunking + Indexing – Derivative representations of your raw text that will aid in retrieval and indexing
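
The first two levels are simple enough to sketch in a few lines of plain Python. The code below is an illustrative approximation, not the code of any particular library: the function names `split_fixed` and `split_recursive`, the default separator list, and the choice to drop the separator at chunk boundaries are assumptions.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Level 1: cut every `size` characters, ignoring content."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_recursive(
    text: str,
    size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Level 2: try coarse separators first, fall back to finer ones."""
    if len(text) <= size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to Level 1.
        return split_fixed(text, size)
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= size:
            # Merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursive(piece, size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon"
    print(split_fixed(sample, 12))
    print(split_recursive(sample, 12))
```

On the sample text, the fixed-size splitter cuts through words and across the paragraph break, while the recursive splitter keeps the first paragraph whole and only breaks the second one on spaces. Note that in this sketch the separator at a chunk boundary is discarded, so joining the chunks does not reproduce the input exactly.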

Semantic text splitting library

https://github.com/benbrandt/text-splitter

Chunks Visualizer

https://chunkviz.up.railway.app/