Text Extract API: A Powerful Tool for Document Conversion and OCR

Converting documents to structured formats like Markdown or JSON can be challenging, especially when dealing with PDFs, images, or Office files. The Text Extract API addresses this problem with OCR-based conversion, optional LLM post-processing, and PII removal, all running on infrastructure you control.

Key Features

Document Processing
The API converts various document types to Markdown or JSON and is designed to handle complex elements such as tables, numbers, and mathematical formulas. It combines PyTorch-based OCR (EasyOCR) with LLM processing through Ollama.

Privacy-First Architecture
OCR and LLM processing run locally within your environment, with no dependency on external cloud services. The system ships with Docker Compose configurations for self-hosting, so sensitive data stays under your control unless you opt into one of the cloud storage options.

Advanced Processing Capabilities

OCR enhancement through LLM technology
PII (Personally Identifiable Information) removal
Distributed queue processing with Celery
Redis-based caching for OCR results
Flexible storage options including local filesystem, Google Drive, and AWS S3
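
To illustrate the caching idea from the list above, here is a minimal sketch, not the project's actual implementation. It assumes results are keyed by a hash of the file content plus the OCR strategy name. A plain dictionary stands in for Redis, and the names cache_key, ocr_with_cache, and run_ocr are hypothetical.

```python
import hashlib
from typing import Callable, Dict

# In-memory stand-in for Redis (assumption for illustration only);
# a real deployment would talk to a Redis server instead.
_cache: Dict[str, str] = {}


def cache_key(file_bytes: bytes, strategy: str) -> str:
    """Derive a cache key from the document content and the OCR strategy name."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"


def ocr_with_cache(file_bytes: bytes, strategy: str,
                   run_ocr: Callable[[bytes], str]) -> str:
    """Return cached text if present; otherwise run OCR once and store the result."""
    key = cache_key(file_bytes, strategy)
    if key in _cache:
        return _cache[key]
    text = run_ocr(file_bytes)  # the expensive step we want to avoid repeating
    _cache[key] = text
    return text
```

Because the key depends on the file bytes rather than the file name, re-uploading the same document under a different name would still hit the cache in this sketch.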

Technical Implementation

Core Components
The system is built with FastAPI for the API layer and Celery for asynchronous tasks. Long-running OCR jobs are queued and picked up by workers, so the API stays responsive while multiple documents are processed in parallel.

Storage Options
The API supports multiple storage strategies:

Local filesystem with customizable paths
Google Drive integration
Amazon S3 compatibility
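
One common way to support interchangeable backends like these is a strategy interface. The sketch below shows only the local filesystem case. The class and function names (StorageStrategy, LocalFilesystemStorage, store_result) are hypothetical and not taken from the project.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageStrategy(ABC):
    """Common interface that each storage backend would implement."""

    @abstractmethod
    def save(self, name: str, content: str) -> str: ...

    @abstractmethod
    def load(self, name: str) -> str: ...


class LocalFilesystemStorage(StorageStrategy):
    """Stores results under a configurable root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, content: str) -> str:
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        return str(path)

    def load(self, name: str) -> str:
        return (self.root / name).read_text(encoding="utf-8")


def store_result(storage: StorageStrategy, name: str, markdown: str) -> str:
    """Calling code depends only on the interface, not on a concrete backend."""
    return storage.save(name, markdown)


def demo() -> str:
    """Round-trip a small Markdown result through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        storage = LocalFilesystemStorage(Path(tmp) / "results")
        store_result(storage, "invoice.md", "# Invoice")
        return storage.load("invoice.md")
```

A Google Drive or S3 backend would implement the same two methods, so switching storage would be a configuration choice rather than a code change in this design.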

Getting Started

Prerequisites

Docker and Docker Compose for containerized deployment
Ollama for LLM processing
Python environment for local development

Installation

git clone text-extract-api
cd text-extract-api
make install

Use Cases

Document Processing
Well suited to organizations that need to:

Convert legacy documents to modern formats
Extract structured data from PDFs
Process large volumes of documents efficiently
Remove sensitive information from documents
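
To make the last item concrete, here is a deliberately simple redaction sketch. It uses regular expressions rather than the LLM-based approach the tool is described as using, so treat it as a swapped-in technique for illustration. The patterns and the redact_pii name are assumptions, and they cover only email addresses and phone-like numbers.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact jane@example.com or +1 555-123-4567."))
# prints: Contact [EMAIL] or [PHONE].
```

Pattern matching like this misses names, addresses, and free-form identifiers, which is one reason a language model may be a better fit for this step.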

Integration Options

The API offers multiple integration methods:

RESTful API endpoints
Command-line interface
TypeScript client library
Custom storage profile configurations
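
As a rough idea of what calling a REST endpoint from a script might look like, the sketch below builds (but does not send) a JSON POST request using only the standard library. The base URL, the path, and the field names are placeholders, not the project's documented API; check the repository for the real endpoints and parameters.

```python
import json
import urllib.request


def build_extract_request(base_url: str, file_name: str,
                          path: str = "/hypothetical/extract",
                          strategy: str = "easyocr") -> urllib.request.Request:
    """Build a POST request; path and payload fields are hypothetical placeholders."""
    payload = {"file": file_name, "strategy": strategy}
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The request object can be inspected before anything goes over the network.
request = build_extract_request("http://localhost:8000/", "report.pdf")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing the object to urllib.request.urlopen once the real path and payload are known.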

Conclusion

Text Extract API is a self-hosted document processing tool that pairs OCR with LLM post-processing while keeping privacy in focus. Whether you’re dealing with document conversion, data extraction, or PII removal, it provides the necessary capabilities while keeping your data under your control.

Sources:

https://github.com/CatchTheTornado/text-extract-api

Deeplearning.fr
