PDF-Extract-Kit is a comprehensive toolkit for high-quality PDF content extraction, including layout detection, formula detection, formula recognition, and OCR.
PDF documents contain a wealth of knowledge, yet extracting high-quality content from PDFs is not an easy task. To address this, we have broken down the task of PDF content extraction into several components:
- Layout Detection: Using the LayoutLMv3 model for region detection, such as images, tables, titles, text, etc.;
- Formula Detection: Using YOLOv8 for detecting formulas, including inline formulas and isolated formulas;
- Formula Recognition: Using UniMERNet for formula recognition;
- Table Recognition: Using StructEqTable for table recognition;
- Optical Character Recognition: Using PaddleOCR for text recognition;
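
The decomposition above can be pictured as a detect-then-route pipeline: detectors produce labelled regions, and each region is passed to the recognizer that matches its category. The sketch below illustrates only that idea. It does not use PDF-Extract-Kit's actual API; `Region`, `route_regions`, the category names, and the placeholder recognizers are all hypothetical, and the top-to-bottom sort is a naive stand-in for real reading-order logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (x0, y0, x1, y1) in page coordinates, origin at the top-left corner.
BBox = Tuple[float, float, float, float]


@dataclass
class Region:
    """A detected page region (hypothetical structure, not the toolkit's own)."""
    category: str  # e.g. "title", "text", "table", "image", "isolated_formula"
    bbox: BBox
    content: str = ""


# A recognizer turns one region into a string (plain text, LaTeX, table markup).
Recognizer = Callable[[Region], str]


def route_regions(
    regions: List[Region], recognizers: Dict[str, Recognizer]
) -> List[Region]:
    """Send each region to the recognizer registered for its category.

    Regions with no registered recognizer (e.g. images) keep empty content.
    The result is sorted top-to-bottom, then left-to-right.
    """
    results = []
    for region in regions:
        recognizer = recognizers.get(region.category)
        content = recognizer(region) if recognizer else ""
        results.append(Region(region.category, region.bbox, content))
    return sorted(results, key=lambda r: (r.bbox[1], r.bbox[0]))


# Placeholder recognizers standing in for the OCR, formula, and table models.
recognizers: Dict[str, Recognizer] = {
    "title": lambda r: "[ocr]",
    "text": lambda r: "[ocr]",
    "inline_formula": lambda r: "[latex]",
    "isolated_formula": lambda r: "[latex]",
    "table": lambda r: "[table]",
}

# Pretend output of the layout and formula detectors for one page.
detected = [
    Region("table", (50, 400, 500, 600)),
    Region("title", (50, 40, 500, 80)),
    Region("isolated_formula", (100, 200, 400, 260)),
    Region("image", (50, 650, 500, 800)),
]

page = route_regions(detected, recognizers)
for region in page:
    print(region.category, region.content)
```

Keeping the recognizers in a dictionary keyed by category means a model can be swapped or added without touching the routing code, which mirrors the way the task is split into independent components.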
https://github.com/opendatalab/PDF-Extract-Kit
https://www.perplexity.ai/search/look-at-this-github-https-gith-8ZVtYO.2SA6_q5Vg.VXy.g