ocr-table

Open source MIT Python

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

  1. Tesseract OCR

     sudo apt-get install tesseract-ocr

  2. ImageMagick

     sudo apt-get install imagemagick

  3. PDF utilities (Poppler)

     sudo apt-get install poppler-utils

  4. Python packages

     sudo pip install -r requirements.txt
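
Before running the OCR, it can help to confirm that the system packages above are actually on your PATH. The following is a minimal sketch, not part of this repository: the helper name `missing_tools` is hypothetical, and the executable names (`tesseract`, `convert`, `pdftoppm`) are the ones these apt packages normally install.

```python
import shutil

# Executable name -> apt package that normally provides it (assumed mapping).
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "convert": "imagemagick",
    "pdftoppm": "poppler-utils",
}


def missing_tools(which=shutil.which):
    """Return the apt package names whose executable is not found on PATH."""
    return [pkg for exe, pkg in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use, call `missing_tools()` with no arguments.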

Usage

  1. Clear the pdf/ folder, then copy the PDF files you want to scan into it.

  2. Run the OCR:

     python3 shellocr.py

  3. Once the process completes, the extracted text files are available in the txt/ folder.
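
For readers curious about what a batch OCR pass over the pdf/ folder might look like, here is a rough sketch. It is an illustration under stated assumptions, not the actual contents of shellocr.py: the function names, the `pages` working folder, and the 300 DPI setting are all hypothetical. It rasterizes each PDF with `pdftoppm` and then runs the `tesseract` command-line tool on every page image.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, prefix):
    """pdftoppm writes one PNG per page, named <prefix>-<page>.png."""
    return ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, out_base):
    """tesseract appends ".txt" to the output base on its own."""
    return ["tesseract", str(image_path), str(out_base)]


def ocr_folder(pdf_dir="pdf", txt_dir="txt", work_dir="pages", run=subprocess.run):
    """OCR every PDF in pdf_dir, writing one text file per page to txt_dir."""
    pdf_dir, txt_dir, work_dir = Path(pdf_dir), Path(txt_dir), Path(work_dir)
    txt_dir.mkdir(exist_ok=True)
    work_dir.mkdir(exist_ok=True)  # hypothetical scratch folder for page images
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        prefix = work_dir / pdf.stem
        run(rasterize_command(pdf, prefix), check=True)
        # Pick up the page images pdftoppm just produced for this PDF.
        for image in sorted(work_dir.glob(pdf.stem + "-*.png")):
            run(tesseract_command(image, txt_dir / image.stem), check=True)
```

Calling `ocr_folder()` from the repository root would process pdf/ into txt/ under these assumptions. The `run` parameter defaults to `subprocess.run` and is injectable so the loop can be exercised without the external tools installed.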

Alternate

If the steps above don't work for you, try the alternate method:

  1. Save your file as input.pdf in the root directory.

  2. Run:

     python3 pdf_miner.py
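
If you use the alternate method often, staging the file can be scripted. This is a small sketch with a hypothetical helper name (`stage_input`); it only copies a PDF to input.pdf in the given root directory and returns the command from the steps above, without assuming anything about what pdf_miner.py does internally.

```python
import shutil
from pathlib import Path


def stage_input(pdf_path, root="."):
    """Copy pdf_path to <root>/input.pdf and return the command to run next."""
    source = Path(pdf_path)
    target = Path(root) / "input.pdf"
    # Skip the copy when the file is already in place.
    if source.resolve() != target.resolve():
        shutil.copyfile(source, target)
    return ["python3", "pdf_miner.py"]
```

The returned list can be passed to `subprocess.run(..., cwd=root, check=True)` to launch the script from the repository root.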