Extensive List of Document Parsers
- 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
- Feel free to contribute by submitting a pull request.
- Cells marked with ✅ or ❌ have been independently tested. Blank cells mean the feature has not been tested yet.
- See the `results` folder for the outputs from the models.
PDF-to-Text Converters
These tools usually output raw text or Markdown.

Machine-generated Documents only
| Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|---|
| PyMuPDF | | Raw text | N | β | β | β | β | β | β |
| PDFPlumber | | Raw text | N | β (separate from text) | β | β | β | β | β |
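For the machine-generated case, a minimal sketch of raw-text extraction with PyMuPDF might look like the following. It assumes PyMuPDF is installed (`pip install pymupdf`). The names `join_pages` and `extract_raw_text` and the path `sample.pdf` are hypothetical and used only for illustration; they are not part of this repository.

```python
from typing import Iterable


def join_pages(pages: Iterable[str], separator: str = "\n\n") -> str:
    """Join per-page text into one string, skipping pages with no text."""
    cleaned = (page.strip() for page in pages)
    return separator.join(text for text in cleaned if text)


def extract_raw_text(pdf_path: str) -> str:
    """Extract the embedded text layer of a machine-generated PDF.

    Scanned pages typically have no text layer, so they come back empty here.
    """
    # Imported lazily so join_pages stays usable without PyMuPDF installed.
    try:
        import pymupdf  # module name in newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # module name in older PyMuPDF releases

    with pymupdf.open(pdf_path) as doc:
        # page.get_text() returns the page's plain text by default.
        return join_pages(page.get_text() for page in doc)
```

Calling `extract_raw_text("sample.pdf")` would return the text of every page separated by blank lines. Tables, equations, and column order are not reconstructed by this sketch.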
Machine-generated and Scanned Documents
| Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|---|
| Marker | | Markdown | N | β (markdown) | β | β | β | β |
| MonkeyOCR | | Markdown | Y | β (html) | β | β | β | β |
| Nougat | | Markdown | N | β | β | β | β | β |
| MinerU | | Markdown | N | β (html) | β | β | β | β |
| Llamaparse (balanced mode) | - | Markdown | Y | β (markdown) | β | β | β | β |
| Llamaparse (premium mode) | - | Markdown | Y | β (markdown) | β | β | β | β |
| Docling | | Markdown | N | β (markdown) | β | β | β | β |
| RolmOCR | | Markdown | Y | β (markdown) | β | β | β | β |
| olmOCR | | Markdown | Y | β (markdown) | β | β | β | β |
| Unstructured | | Raw text | N | β | β | β | β | β |
| Pytesseract | | Raw text | N | β | β | β | β | β |
| MarkItDown | | Markdown | N | β | β | β | β | β |
| Amazon textract | - | |||||||
| Azure AI Document Intelligence | - | |||||||
| Google Cloud OCR | - | |||||||
| Mathpix | - | |||||||
| MistralOCR | - | |||||||
| Upstage | - | |||||||
| OmniAI | - | |||||||
| ChatDoc PDF parser | - | |||||||
| Reducto | - | |||||||
| OCRFlux | ||||||||
| Nanonets | ||||||||
| PaddleOCR | ||||||||
| ClovaOCR | - | |||||||
| ParseExtract | - | |||||||
| Tensorlake | - | |||||||
| Vectorize | - | |||||||
| MassivePix | - | |||||||
| Dolphin | ||||||||
| GOT | ||||||||
| Manga OCR | ||||||||
| EasyOCR | ||||||||
| PDFeditify | - |
β Process took too long
Layout Parsers
These tools usually output JSON containing bounding-box coordinates, the content (as raw text or Markdown), and sometimes an element type (header, figure, paragraph, etc.).
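To make that output shape concrete, here is a small sketch that reads a layout-parser-style JSON payload and prints the blocks in a naive top-to-bottom, left-to-right order. The field names (`bbox`, `content`, `type`), the coordinate convention, and the sample payload are all assumptions for illustration; each parser defines its own schema.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    # (x0, y0, x1, y1); a top-left origin is assumed in this sketch.
    bbox: Tuple[float, float, float, float]
    content: str
    type: Optional[str] = None  # e.g. "header", "figure", "paragraph"


def load_blocks(payload: str) -> List[Block]:
    """Parse a JSON array of blocks; `type` is treated as optional."""
    return [
        Block(bbox=tuple(item["bbox"]), content=item["content"], type=item.get("type"))
        for item in json.loads(payload)
    ]


def reading_order(blocks: List[Block]) -> List[Block]:
    """Naive single-column ordering: sort by top edge, then by left edge."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


# Hypothetical payload, not taken from any specific parser.
SAMPLE = """
[
  {"bbox": [72, 140, 540, 300], "content": "Body text.", "type": "paragraph"},
  {"bbox": [72, 60, 540, 100], "content": "Title", "type": "header"},
  {"bbox": [72, 320, 540, 500], "content": "Figure 1"}
]
"""

for block in reading_order(load_blocks(SAMPLE)):
    print(block.type or "unknown", "|", block.content)
```

Sorting by the top edge alone interleaves text on two-column and multi-column pages, which is one reason those layouts get their own columns in the tables here.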

🚧 WORK IN PROGRESS
| Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
|---|---|---|---|---|---|---|---|
| Chunkr | |||||||
| GroundX | - | ||||||
| ChatDOC | - | ||||||
| Unstract |
Contributing
If you would like to contribute in any way, please read CONTRIBUTING.md first. Thank you!