Text Extract API: A Powerful Tool for Document Conversion and OCR

Converting documents to structured formats like Markdown or JSON can be challenging, especially when dealing with PDFs, images, or Office files. The Text Extract API offers a robust solution to this common problem, providing high-accuracy conversion with advanced features.

Key Features

Document Processing
The API excels at converting various document types to Markdown or JSON, handling complex elements like tables, numbers, and mathematical formulas with remarkable accuracy. It utilizes a combination of PyTorch-based OCR (EasyOCR) and Ollama for processing.

Privacy-First Architecture
All processing occurs locally within your environment, with no external cloud dependencies. The system ships with Docker Compose configurations, ensuring your sensitive data never leaves your control.

Advanced Processing Capabilities

  • OCR enhancement through LLM technology
  • PII (Personally Identifiable Information) removal
  • Distributed queue processing with Celery
  • Redis-based caching for OCR results
  • Flexible storage options including local filesystem, Google Drive, and AWS S3

Technical Implementation

Core Components
The system is built using FastAPI for the API layer and Celery for handling asynchronous tasks. This architecture ensures efficient processing of multiple documents simultaneously while maintaining responsiveness.

Storage Options
The API supports multiple storage strategies:

  • Local filesystem with customizable paths
  • Google Drive integration
  • Amazon S3 compatibility

Getting Started

Prerequisites

  • Docker and Docker Compose for containerized deployment
  • Ollama for LLM processing
  • Python environment for local development

Installationgit clone text-extract-api cd text-extract-api make install

Use Cases

Document Processing
Perfect for organizations needing to:

  • Convert legacy documents to modern formats
  • Extract structured data from PDFs
  • Process large volumes of documents efficiently
  • Remove sensitive information from documents

Integration Options

The API offers multiple integration methods:

  • RESTful API endpoints
  • Command-line interface
  • TypeScript client library
  • Custom storage profile configurations

Conclusion

Text Extract API represents a significant advancement in document processing technology, offering a self-hosted solution that combines accuracy with privacy. Whether you’re dealing with document conversion, data extraction, or PII removal, this tool provides the necessary capabilities while keeping your data secure and under your control.

Sources :

https://github.com/CatchTheTornado/text-extract-api