epstein-ocr
epstein-ocr is a convolutional neural network (CNN)-based OCR tool that extracts text from PDF documents in which black-box redactions cover the text. Developed to recover readable content from the Epstein court archive, it targets cases where a black rectangle hides text whose underlying character data was never actually destroyed and is therefore still recoverable. The system processes page images extracted from the PDFs with pdfimages at 2x resolution. It learns character positions from a known anchor page (such as page-001) and reuses those grid coordinates on all subsequent pages, so that vertical drift and pixel shifts are handled consistently. Training involves annotating the top and bottom of one page with hyphenated word continuations to teach the model line breaks and character placement. The workflow has two steps: train.sh builds the model from the annotated samples, and run.sh performs inference across all pages, producing a searchable recovered.pdf. The tool includes debug mo