PDF-Extract-Kit

PDF-Extract-Kit is a comprehensive toolkit for high-quality PDF content extraction, including layout detection, formula detection, formula recognition, and OCR.

PDF documents contain a wealth of knowledge, yet extracting high-quality content from PDFs is not an easy task. To address this, we have broken down the task of PDF content extraction into several components:

  • Layout Detection: Using the LayoutLMv3 model for region detection, such as images, tables, titles, text, etc.;
  • Formula Detection: Using YOLOv8 for detecting formulas, including inline formulas and isolated formulas;
  • Formula Recognition: Using UniMERNet for formula recognition;
  • Table Recognition: Using StructEqTable for table recognition;
  • Optical Character Recognition: Using PaddleOCR for text recognition.
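
The components above can be read as a two-stage pipeline: detectors propose regions, then each region goes to the recognizer for its type. The sketch below only illustrates that routing idea and is not the toolkit's actual API. `Region`, `RECOGNIZERS`, `extract_page`, and the three placeholder recognizers are hypothetical names, and the placeholder functions return fixed strings where the real models (UniMERNet, StructEqTable, PaddleOCR) would run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """A detected page region (hypothetical structure, not the toolkit's own)."""
    category: str                      # e.g. "text", "title", "table", "image", "formula"
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1) in page pixels
    content: str = ""                  # filled in by a recognizer, if one applies


# Placeholder recognizers: each stands in for a real model call.
def recognize_formula(region: Region) -> str:
    return "<latex placeholder>"       # assumption: UniMERNet would run here


def recognize_table(region: Region) -> str:
    return "<table placeholder>"       # assumption: StructEqTable would run here


def recognize_text(region: Region) -> str:
    return "<text placeholder>"        # assumption: PaddleOCR would run here


# Map each region category to the recognizer that handles it.
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "formula": recognize_formula,
    "table": recognize_table,
    "text": recognize_text,
    "title": recognize_text,
}


def extract_page(regions: List[Region]) -> List[Region]:
    """Route each detected region to its recognizer; leave others untouched."""
    for region in regions:
        recognizer = RECOGNIZERS.get(region.category)
        if recognizer is not None:
            region.content = recognizer(region)
    return regions


if __name__ == "__main__":
    # Regions as a layout/formula detector might emit them (made-up boxes).
    page = [
        Region("title", (50, 40, 550, 80)),
        Region("formula", (100, 200, 400, 240)),
        Region("image", (60, 300, 540, 600)),
    ]
    for r in extract_page(page):
        print(r.category, r.bbox, repr(r.content))
```

Keeping the category-to-recognizer mapping in a dictionary means a region type with no recognizer (here, images) passes through unchanged, and swapping a model only touches one entry.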

https://github.com/opendatalab/PDF-Extract-Kit

https://www.perplexity.ai/search/look-at-this-github-https-gith-8ZVtYO.2SA6_q5Vg.VXy.g