PDF-Extract-Kit is a comprehensive toolkit for high-quality PDF content extraction, including layout detection, formula detection, formula recognition, and OCR.
PDF documents contain a wealth of knowledge, yet extracting high-quality content from PDFs is not an easy task. To address this, we have broken down the task of PDF content extraction into several components:
- Layout Detection: Using the LayoutLMv3 model for region detection, such as images, tables, titles, text, etc.;
- Formula Detection: Using YOLOv8 for detecting formulas, including inline formulas and isolated formulas;
- Formula Recognition: Using UniMERNet for formula recognition;
- Table Recognition: Using StructEqTable for table recognition;
- Optical Character Recognition: Using PaddleOCR for text recognition.
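
The decomposition above can be pictured as a set of detectors that propose regions, followed by per-type recognizers that fill in their content. The sketch below is only an illustration of that idea, not PDF-Extract-Kit's actual API: `Region`, `extract_page`, and the stub detectors and recognizers are hypothetical names, and the stubs stand in for the real models (LayoutLMv3, YOLOv8, UniMERNet, StructEqTable, PaddleOCR).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not part of PDF-Extract-Kit.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Region:
    kind: str          # e.g. "text", "title", "table", "image", "isolated_formula"
    bbox: BBox
    content: str = ""  # filled in by a recognizer, if one is registered


Detector = Callable[[object], List[Region]]
Recognizer = Callable[[object, Region], str]


def extract_page(
    page: object,
    detectors: List[Detector],
    recognizers: Dict[str, Recognizer],
) -> List[Region]:
    """Run every detector, then route each region to the recognizer for its kind."""
    regions: List[Region] = []
    for detect in detectors:
        regions.extend(detect(page))
    for region in regions:
        recognize = recognizers.get(region.kind)
        if recognize is not None:  # kinds without a recognizer (e.g. images) stay empty
            region.content = recognize(page, region)
    return regions


# Stub stages standing in for the real models (assumptions, for demonstration only).
def stub_layout_detector(page: object) -> List[Region]:
    return [Region("text", (0, 50, 100, 60)), Region("table", (0, 100, 100, 200))]


def stub_formula_detector(page: object) -> List[Region]:
    return [Region("isolated_formula", (0, 10, 100, 30))]


stub_recognizers: Dict[str, Recognizer] = {
    "text": lambda page, region: "hello",                  # OCR stand-in
    "table": lambda page, region: "<table></table>",       # table recognition stand-in
    "isolated_formula": lambda page, region: "E = mc^2",   # formula recognition stand-in
}

result = extract_page(None, [stub_layout_detector, stub_formula_detector], stub_recognizers)
```

Keeping detection and recognition as separate, swappable callables mirrors the component split in the list: each model can be replaced or skipped without touching the others.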
https://github.com/opendatalab/PDF-Extract-Kit
https://www.perplexity.ai/search/look-at-this-github-https-gith-8ZVtYO.2SA6_q5Vg.VXy.g