About pdfhelper

Some useful functions to process pdf file

y

Published by

yuchen-lea

Visit View Profile

README.md

View on GitHub

+TITLE: Readme

[[file:README_CN.org][中文文档]]

[[file:CHANGELOG.org][Changelog]]

About

This is a command-line tool designed to handle PDF files, with the goal of seamlessly integrating PDFs into various note-taking workflows. Currently, the following capabilities are offered:

Managing the TOC: Export the table of contents from a PDF to a user-friendly, plain text list and import it back into the PDF after any modifications. This TOC file is easy to read and convenient for users to modify. The format can be viewed [[*TOC format][here]].
Annotations Management:
- Export formatted text annotations: Extract annotations like highlights, text, squares, and other types of annots from a PDF, capture relevant document images, and support OCR extraction from these images. Using [[https://pypi.org/project/Mako/][Mako]] templates, you can import the formatted PDF annotations into your preferred note-taking system.
- Manage XFDF Annotations: Export XFDF annotations from the PDF and import them back into the PDF. XFDF files can be imported by PDF readers like XChange. Since some OCR software may flatten annotations during the OCR process, you can export to XFDF before OCR and then import the XFDF after OCR to retain the full annotation functionality.
- Delete annotations from the PDF: Easily share original PDFs with others.
Page Label and Number Conversion: Convert page labels to page numbers and vice versa. Sometimes, while the data is stored as a page number, readers might require navigation based on page labels. This feature addresses that discrepancy.

Start
+begin_src bash

git clone --depth 1 https://github.com/yuchen-lea/pdfhelper.git cd pdfhelper make install pip install -r requirements.txt

+end_src

Then you can use the =pdfhelper= command-line tool!

For subsequent updates, simply pull the latest code from git and run =make update=.

Usage

+begin_src bash :results raw

pdfhelper -h

+end_src

+RESULTS:

usage: pdfhelper [-h] [--version] {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,export-info,import-info,page-label-to-number,page-number-to-label} ... INFILE

Some useful functions to process a PDF file.

positional arguments: {export-toc,import-toc,delete-annot,export-xfdf-annot,import-xfdf-annot,export-annot,export-info,import-info,page-label-to-number,page-number-to-label} export-toc Export the TOC of the PDF. import-toc Import TOC from a file into the PDF. delete-annot Delete annotations from the PDF. export-xfdf-annot Export XFDF annotations of the PDF. import-xfdf-annot Import XFDF annotations of the PDF. export-annot Export formatted text annotations of the PDF. export-info Export information of the PDF. import-info Import information of the PDF. page-label-to-number Convert page label to page number. page-number-to-label Convert page number to page label. INFILE PDF file to process

options: -h, --help show this help message and exit --version, -v show program's version number and exit

** TOC format Sample toc file:

+begin_example

@label 1=A @label 8=[p-]II @label 16=1

Cover#1
The Ten Commandments#2
The Five Rules#3
Contents#8
Foreword#10
Preface#12
1 = 16
1. Toys#2
1. Do It, Do It Again, and Again, and Again ...#14
1. Cons the Magnificent#32
  +2
1. Numbers Games#58
  +end_example

Here, you see three ways of customization:

Defining a Page Number: To bookmark page 3 with the title "The Five Rules":
+begin_example

The Five Rules#3
+end_example
- 🙋‍ List indentation is the same as toc indentation.

Setting the First Page: To bookmark page 17 with the title "1. Toys" (considering the first page is numbered 16):
+begin_example

1 = 16

1. Toys#2
  +end_example
- 🙋‍Note: Using the pattern "# number1=number2" will treat the physical page number2 of the PDF as number1. Any subsequent page numbers set as 'x' will actually point to the physical page calculated as =x+number2-number1=. This is suitable for setting the first page number, for example, "# 1=19", as well as setting the starting page number for the second volume, like "# 250=5".

Accounting for Page Gaps: To bookmark the title "4. Numbers Games" on page 75 (calculated as 58 + (16-1) + 2):
+begin_example

+2

1. Numbers Games#58
  +end_example
- useful when there are missing or extra pages. At the location of missing pages (for instance, where blank pages counted in the pagination have been removed), set "# -[number of missing pages]". At the location where pages are added (like illustration pages not counted in the pagination), set "# +[number of added pages]".

Set the labels:

Starting from page 1, the page numbering style will be uppercase letters. Page 1 will display as "A", page 2 as "B", and so on, until page 7.
Starting from page 8, the page numbering style will switch to uppercase Roman numerals with the prefix "p-". Page 8 will display as "p-II", page 9 as "p-III", and so on, until page 15.
Starting from page 16, the page numbering will be in Arabic numerals. Page 16 will display as "1", page 17 as "2", and so on, until the end of the document.

** Export Annotations

Currently, the following annotation types are supported:

| Type | Result | |-----------+----------------------------------------------------------------------------------------------------------------------------------------| | Text | comment | | FreeText | comment | | Square | comment + picture (set the zoom factor by ~--image-zoom~) + text (extract from the PDF, or use the ~--ocr-service~ and ~--ocr-language~ to recognize text within images.) | | Highlight | comment + text (extract from the PDF) | | Underline | comment + text (extract from the PDF) | | Squiggly | comment + text (extract from the PDF) | | StrikeOut | comment + text (extract from the PDF) | | Ink | comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by ~--image-zoom~) + text (extract from the PDF, or use the ~--ocr-service~ and ~--ocr-language~ to recognize text within images.) | | Line | comment + picture (captures the content within the marked height of the document, rather than just the mark itself. set the zoom factor by ~--image-zoom~) + text (extract from the PDF, or use the ~--ocr-service~ and ~--ocr-language~ to recognize text within images.) |

You can customize the note format by:

~--with-toc~
~--toc-list-item-format~
~--annot-list-item-format~ *** Template Variables

For usage reference, see file:format_annots_template.py

| Variable | Description | toc-list-item | annot-list-item | |--------------+---------------------------------------+---------------+-----------------| | level | Level | ✓ | ✓ | | page | Page number | ✓ | ✓ | | content | Table of contents text | ✓ | ✗ | | pdf_path | PDF file path | ✓ | ✓ | | bib_key | Reference key | ✓ | ✓ | | type | Note type | ✗ | ✓ | | text | annot's own text | ✗ | ✓ | | comment | annot's comment | ✗ | ✓ | | annot_number | annot number, increments from 1 | ✗ | ✓ | | annot_id | annot ID in PDF | ✗ | ✓ | | height | annot position on page (0-1) | ✗ | ✓ | | color | annot color's hex code, e.g., #e44234 | ✗ | ✓ | | pic_path | annot image path | ✗ | ✓ |

Credits This project is inspired by the following tool:

[[https://github.com/0xabu/pdfannots][0xabu/pdfannots: Extracts and formats text annotations from a PDF file]]: based on pdfminer and format as markdown text. It deals with hyphens but donot extract rectangle annot.
[https://www.cnblogs.com/pdfpatcher/archive/2011/04/12/2013974.html][PDFPatcher] a great pdf utility tool.

pdfhelper

About pdfhelper

Platforms

Languages

Links

README.md

+TITLE: Readme

+begin_src bash

+end_src

+begin_src bash :results raw

+end_src

+RESULTS:

+begin_example

1 = 16

+2

+end_example

+begin_example

+end_example

+begin_example

1 = 16

+end_example

+begin_example

+2

+end_example