epstein-ocr

Open source Python
Stars: 132
Forks: 14
Issues: 0
Watchers: 5
Last commit: 2 months ago

About epstein-ocr

epstein-ocr is a convolutional neural network (CNN) based OCR tool designed to extract text from redacted PDF documents containing black-box redactions over text. Developed to recover readable content from the Epstein court archive, it specifically targets cases where a black rectangle hides text that is still recoverable because the underlying character data was not actually destroyed. The system processes images extracted from PDFs using pdfimages at 2x resolution, learning character positions from a known anchor page (such as page-001) and reusing those grid coordinates across all subsequent pages to handle vertical drift and pixel shifts consistently. Training involves annotating the top and bottom of one page with hyphenated word continuations to teach the model line breaks and character placement. The workflow consists of running train.sh to build the model from annotated samples and run.sh to perform inference across all pages, ultimately producing a searchable recovered.pdf. The tool includes debug mo

Platforms

Web, Self-hosted

Languages

Python

Project Summary

If you're new here, this project was started in response to an effort to extract some unredacted content in the Epstein archives. The CNN in this repo was used to successfully exfiltrate DBC12.pdf from EFTA00400459.

You can read about this code and approach in the post "EFTA00400459 has been cracked, DBC12.pdf liberated."

The code in this project runs against the images extracted from the PDF with pdfimages. You can download an archive containing them here.

Basic Usage Info

Expects ../EFTA00400459-{000..=075}_2x.png to exist

  • Run ./train.sh to train the model from train_top.txt and train_bot.txt, which correspond to page-001_2x.png
  • Run ./run.sh to OCR all pages and generate recovered.pdf
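Before running either script, it can help to confirm that the expected page images are in place. The following is a minimal sketch, not part of the repository: it assumes `{000..=075}` denotes an inclusive range (pages 000 through 075, 76 files), and the helper names `expected_pages` and `missing_pages` are hypothetical.

```python
from pathlib import Path

# Assumption: "{000..=075}" is an inclusive range, so pages 000 through 075.
FIRST_PAGE, LAST_PAGE = 0, 75


def expected_pages(parent: Path = Path("..")) -> list[Path]:
    """Return the page image paths the scripts expect, in page order."""
    return [
        parent / f"EFTA00400459-{n:03d}_2x.png"
        for n in range(FIRST_PAGE, LAST_PAGE + 1)
    ]


def missing_pages(parent: Path = Path("..")) -> list[Path]:
    """Return the expected page images that are not present on disk."""
    return [p for p in expected_pages(parent) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_pages()
    if missing:
        print(f"{len(missing)} page image(s) missing, first: {missing[0]}")
    else:
        print("all expected page images found")
```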

Training uses the top and the bottom of page-001, two non-contiguous regions, to capture vertical drift. The grid location is memorized and reused for subsequent pages (non-training runs) to prevent pixel shifts.
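To illustrate why sampling both ends of the page helps, here is a small sketch that fits a row grid from a few rows near the top and a few near the bottom. It is not the repository's code: the linear model `y = origin + pitch * row`, the function names, and the measurements are all assumptions made for illustration.

```python
def fit_row_grid(samples):
    """Least-squares fit of y = origin + pitch * row from (row, y) samples."""
    n = len(samples)
    mean_row = sum(r for r, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var = sum((r - mean_row) ** 2 for r, _ in samples)
    cov = sum((r - mean_row) * (y - mean_y) for r, y in samples)
    pitch = cov / var
    origin = mean_y - pitch * mean_row
    return origin, pitch


def row_y(grid, row):
    """Predict the y pixel of a text row from a fitted (origin, pitch) grid."""
    origin, pitch = grid
    return origin + pitch * row


# Hypothetical measurements: rows 0-2 near the top, rows 60-62 near the bottom.
top = [(0, 100.0), (1, 130.5), (2, 161.0)]
bottom = [(60, 1930.0), (61, 1960.5), (62, 1991.0)]

# Fit once on the training page, then reuse the same grid for every other page.
grid = fit_row_grid(top + bottom)
print(row_y(grid, 30))
```

With rows this far apart, even a small error in the line pitch becomes visible: in this toy data, a pitch that is off by 0.5 px would misplace row 62 by 31 px. Fitting across both ends keeps that accumulated error in check.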

Training runs with -d/--debug generate a debug view that helps you spot anything you mistyped by showing the greatest outliers compared to the other members assigned to the same bucket:

[Image: Typo sanity checking when training]
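The idea behind this check can be sketched in a few lines. The example below is an illustration only, assuming each training sample is a label plus a flat list of pixel values and that "outlier" means the largest Euclidean distance from the mean of the label's bucket; the repository may measure this differently, and `bucket_outliers` is a hypothetical name.

```python
from collections import defaultdict
from math import sqrt


def bucket_outliers(samples, top_n=3):
    """Rank samples by distance from the mean of their label's bucket.

    samples: list of (label, pixels), where pixels is a flat list of floats.
    Returns the top_n (distance, index, label) tuples, largest first.
    """
    buckets = defaultdict(list)
    for index, (label, pixels) in enumerate(samples):
        buckets[label].append((index, pixels))

    scored = []
    for label, members in buckets.items():
        size = len(members[0][1])
        # Per-pixel mean over every member of this bucket.
        mean = [sum(p[i] for _, p in members) / len(members) for i in range(size)]
        for index, pixels in members:
            dist = sqrt(sum((a - b) ** 2 for a, b in zip(pixels, mean)))
            scored.append((dist, index, label))

    scored.sort(reverse=True)
    return scored[:top_n]


# Toy 4-pixel glyphs: the last "a" looks like a "b", as a typo would.
samples = [
    ("a", [1.0, 0.0, 0.0, 1.0]),
    ("a", [1.0, 0.0, 0.0, 1.0]),
    ("a", [1.0, 0.1, 0.0, 1.0]),
    ("b", [0.0, 1.0, 1.0, 0.0]),
    ("b", [0.0, 1.0, 1.0, 0.1]),
    ("a", [0.0, 1.0, 1.0, 0.0]),  # mislabeled in the transcript
]
worst = bucket_outliers(samples, top_n=1)[0]
print(worst[1], worst[2])  # index and label of the most suspicious sample
```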

Inference runs generate a debug view (when -d is used without -q/--quiet) that shows the greatest outliers compared to the rest of the characters in the image. When -o/--output is specified, the debug view is saved to <basename>-proof.png so you can inspect it later.

[Image: Post-inference analysis]
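The interaction between these flags can be sketched with `argparse`. This is a hypothetical reconstruction, not the tool's actual command-line code: the positional `image` argument, the function names, and two readings of the text are assumptions, namely that `<basename>` is the output path without its extension and that -q suppresses only the on-screen view, not the saved proof.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the three flags named in the text."""
    parser = argparse.ArgumentParser(description="hypothetical inference CLI")
    parser.add_argument("image", type=Path, help="page image to OCR")
    parser.add_argument("-d", "--debug", action="store_true")
    parser.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("-o", "--output", type=Path, default=None)
    return parser


def debug_plan(args: argparse.Namespace):
    """Return (show_view, proof_path) for one inference run."""
    # The on-screen view needs -d and is suppressed by -q.
    show_view = args.debug and not args.quiet
    proof_path = None
    if args.debug and args.output is not None:
        # Assumption: <basename> is the output path without its extension.
        proof_path = args.output.with_name(args.output.stem + "-proof.png")
    return show_view, proof_path


args = build_parser().parse_args(["page.png", "-d", "-o", "out/page-005.txt"])
print(debug_plan(args))
```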