pdfdeal
pdfdeal is a Python wrapper for the Doc2X API that simplifies PDF handling and enhances document conversion for RAG applications. Doc2X is a universal OCR tool that converts images and PDFs into Markdown or LaTeX text while preserving formulas and formatting, outperforming similar tools in most scenarios.

pdfdeal provides abstracted classes to streamline Doc2X API requests and includes native text processing features designed to improve PDF recall rates in retrieval-augmented generation workflows. It uses OCR and PDF recognition tools to identify images and embed them in the original text, with configurable output formats that retain original page numbering.

The library integrates with knowledge base platforms such as GraphRAG, Dify, and FastGPT, converting PDFs into compatible text formats for better recognition. Additional Markdown processing tools include converting HTML tables to Markdown, uploading local and online images to remote storage services for persistence, downloading online images to local file
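
To make one of the Markdown processing steps listed above concrete, here is a minimal sketch of converting an HTML table to a Markdown table using only the Python standard library. It is an illustration, not pdfdeal's own implementation: the names `html_table_to_markdown` and `_TableParser` are hypothetical, and the sketch assumes a single simple table (no nested tables, no merged cells) whose first row serves as the header.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Hypothetical helper: collects cell text from <tr>/<td>/<th> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html: str) -> str:
    """Convert a simple HTML table to a pipe-style Markdown table.

    Assumption: the first row is treated as the header row.
    Returns an empty string if no rows are found.
    """
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in parser.rows)
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<table><tr><th>Name</th><th>Pages</th></tr>"
        "<tr><td>a.pdf</td><td>3</td></tr></table>"
    )
    print(html_table_to_markdown(sample))
```

A parser-based approach is used here instead of regular expressions because cell contents may contain inline tags or irregular whitespace; a real converter would also need to handle `colspan`, `rowspan`, and nested tables, which this sketch ignores.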