ai-pdf-autofiller

About ai-pdf-autofiller

Automated PDF form filling with intelligent field mapping using AI-assisted semantic inference

Platforms

Web Self-hosted

Languages

Python

PDF Autofiller

Backend service for filling AcroForm PDFs from structured user data.

The project favors deterministic behavior first: it normalizes keys, applies stable aliases, coerces values, and only uses optional semantic inference or controlled fallback mapping when explicitly enabled. The result is a small, testable pipeline that is easier to audit than heuristic-only form filling.
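The deterministic steps named above (key normalization, stable aliases, value coercion) can be illustrated with a minimal sketch. It is not the project's actual code: the alias table, the function names, and the "Yes"/"Off" checkbox values are assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical alias table: normalized user key -> normalized PDF field name.
ALIASES = {
    "firstname": "givenname",
    "lastname": "familyname",
    "dob": "dateofbirth",
}


def normalize_key(key: str) -> str:
    """Lowercase the key and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def coerce_value(value: object) -> str:
    """Turn common Python values into strings suitable for a form field."""
    if isinstance(value, bool):
        # Assumed checkbox convention; real forms may use other export values.
        return "Yes" if value else "Off"
    if isinstance(value, date):
        return value.isoformat()
    return str(value).strip()


def map_fields(user_data: dict, pdf_fields: list) -> dict:
    """Map user data onto PDF field names using exact and alias matches only."""
    by_normalized = {normalize_key(name): name for name in pdf_fields}
    mapped = {}
    for key, value in user_data.items():
        norm = normalize_key(key)
        # Exact normalized match first, then the alias table; no guessing.
        target = by_normalized.get(norm) or by_normalized.get(ALIASES.get(norm, ""))
        if target is not None:
            mapped[target] = coerce_value(value)
    return mapped
```

Keys that match neither directly nor through an alias are simply left unmapped, which is the point at which an optional semantic or fallback step could take over if enabled.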

What It Does

  • Reads PDF metadata, form fields, and visible page text
  • Infers semantic meaning for fields when optional semantic inference is enabled
  • Maps user data to fields using deterministic rules first
  • Rejects outputs with unresolved required fields
  • Returns a new filled PDF through a small FastAPI service
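As a rough illustration of the required-field rejection listed above, the check could look like the sketch below. The exception class and function name are hypothetical and do not come from the repository.

```python
class UnresolvedFieldsError(ValueError):
    """Raised when required fields have no value (hypothetical error type)."""

    def __init__(self, missing: list) -> None:
        super().__init__("unresolved required fields: " + ", ".join(missing))
        self.missing = missing


def enforce_required(mapped: dict, required: set) -> None:
    """Fail if any required field is absent or blank in the mapped values."""
    missing = sorted(
        name for name in required if not mapped.get(name, "").strip()
    )
    if missing:
        raise UnresolvedFieldsError(missing)
```

Sorting the missing names keeps the error message stable across runs, which fits the project's emphasis on deterministic, auditable behavior.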

Quick Start

Install development dependencies:

poetry install

or

pip install -r requirements-dev.txt

Run the API locally:

make run-api

Run the local smoke check:

PYTHONPATH=src python -m scripts.smoke_check

Run the demo workflow against the bundled sample:

PYTHONPATH=src python -m scripts.demo_workflow samples/sample_form.pdf

API Example

curl -s -X POST http://localhost:8000/fill \
  -F "pdf_file=@samples/sample_form.pdf;type=application/pdf" \
  -F 'user_data={"firstname":"Jane","lastname":"Doe","dob":"1990-01-01"}' \
  -F "strict=true" \
  -o filled.pdf
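
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the three form parts of the curl call (pdf_file, user_data, strict). The helper names, the form.pdf upload filename, and the timeout are assumptions. If API auth is enabled, the token would presumably go in the header named by API_KEY_HEADER, passed here through extra_headers.

```python
import json
import urllib.request
import uuid
from typing import Optional


def _text_part(boundary: str, name: str, value: str) -> bytes:
    """Encode one plain-text form field."""
    return (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
    ).encode("utf-8")


def build_multipart(pdf_bytes: bytes, user_data: dict, strict: bool = True):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="pdf_file"; filename="form.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    body = b"".join([
        file_header, pdf_bytes, b"\r\n",
        _text_part(boundary, "user_data", json.dumps(user_data)),
        _text_part(boundary, "strict", "true" if strict else "false"),
        f"--{boundary}--\r\n".encode("utf-8"),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def fill_pdf(url: str, pdf_bytes: bytes, user_data: dict, strict: bool = True,
             extra_headers: Optional[dict] = None, timeout: float = 30.0) -> bytes:
    """POST the form and return the response body (the filled PDF on success)."""
    body, content_type = build_multipart(pdf_bytes, user_data, strict)
    headers = {"Content-Type": content_type, **(extra_headers or {})}
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the API running locally, a call such as fill_pdf("http://localhost:8000/fill", open("samples/sample_form.pdf", "rb").read(), {"firstname": "Jane"}) would return the bytes to write to filled.pdf. Error statuses surface as urllib.error.HTTPError.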

Configuration

  • MODEL_PROVIDER_API_KEY: enables semantic inference and fallback mapping
  • API_AUTH_ENABLED: enables API key validation on POST /fill (default true; set to false only for trusted or local use)
  • API_AUTH_TOKEN: expected token value when auth is enabled
  • API_KEY_HEADER: header name used for the incoming token
  • MAX_UPLOAD_BYTES: maximum accepted PDF size in bytes
  • MAX_PDF_PAGES: maximum accepted page count (default 200)
  • PDF_READ_TIMEOUT_SECONDS: time budget in seconds for PDF parsing and extraction (default 20)
  • RATE_LIMIT_PER_MINUTE: per-client request limit for POST /fill; 0 disables rate limiting (default 60)
  • LOG_LEVEL: process log level for the API service
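
One way to read these variables is a small settings object, sketched below. The Settings class is hypothetical and not the project's loader. The defaults for MAX_PDF_PAGES, PDF_READ_TIMEOUT_SECONDS, RATE_LIMIT_PER_MINUTE, and API_AUTH_ENABLED are the documented ones; the LOG_LEVEL default and the fail-fast check on a missing token are assumptions. MAX_UPLOAD_BYTES and API_KEY_HEADER are left out because their defaults are not stated here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    api_auth_enabled: bool
    api_auth_token: Optional[str]
    max_pdf_pages: int
    pdf_read_timeout_seconds: float
    rate_limit_per_minute: int
    log_level: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        settings = cls(
            api_auth_enabled=_as_bool(env.get("API_AUTH_ENABLED", "true")),
            api_auth_token=env.get("API_AUTH_TOKEN") or None,
            max_pdf_pages=int(env.get("MAX_PDF_PAGES", "200")),
            pdf_read_timeout_seconds=float(env.get("PDF_READ_TIMEOUT_SECONDS", "20")),
            rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "60")),
            log_level=env.get("LOG_LEVEL", "INFO").upper(),  # assumed default
        )
        # Assumed behavior: refuse to start with auth on but no token set.
        if settings.api_auth_enabled and not settings.api_auth_token:
            raise ValueError("API_AUTH_TOKEN must be set when API_AUTH_ENABLED is true")
        return settings
```

Passing a plain dict instead of os.environ, for example Settings.from_env({"API_AUTH_ENABLED": "false"}), makes the loader easy to exercise in tests.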

Architecture

Core code lives in src/pdf_autofiller/ and is intentionally split by responsibility:

  • pdf_reader.py: extraction only
  • field_semantics.py: provider client wrapper and response normalization
  • mapping.py: deterministic matching and controlled fallback mapping
  • pdf_writer.py: output writing and required-field enforcement
  • api_service.py: HTTP boundary, auth, request validation, and temp-file lifecycle

The detailed system breakdown is in docs/ARCHITECTURE.md.

Quality

  • ruff, mypy, pip-audit, and pytest are enforced in CI
  • Coverage floor is 85%
  • API error responses use stable machine-readable error codes
  • Smoke-check and demo scripts are kept separate from the test suite
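
To make "stable machine-readable error codes" concrete, an error payload could be shaped as in the sketch below. The code values and the JSON layout are purely illustrative; the actual contract is whatever docs/API.md specifies.

```python
import json
from enum import Enum


class ErrorCode(str, Enum):
    # Hypothetical codes, not taken from the service.
    UNAUTHORIZED = "unauthorized"
    PDF_TOO_LARGE = "pdf_too_large"
    UNRESOLVED_REQUIRED_FIELDS = "unresolved_required_fields"


def error_body(code: ErrorCode, message: str) -> str:
    """Serialize an error so clients can branch on code, not on message text."""
    return json.dumps({"error": {"code": code.value, "message": message}})
```

Clients would then match on the code string, which can stay fixed even if the human-readable message is reworded.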

Scope

  • The current pipeline targets fillable AcroForm PDFs
  • OCR and scanned-document workflows are intentionally out of scope
  • Frontend, persistence, and deployment infrastructure are not part of this repository
  • If optional provider-backed features are enabled, field metadata and nearby page text may be sent to an external service

Documentation

  • docs/API.md: endpoint contracts and example requests
  • docs/ARCHITECTURE.md: module boundaries and data flow
  • docs/OPERATIONS.md: runtime configuration and deployment assumptions
  • docs/TESTING.md: local validation workflow
  • docs/PURPOSE.md: problem statement and intended usage
  • CONTRIBUTING.md: contributor expectations
  • SECURITY.md: vulnerability reporting and data-handling notes

License

MIT. See LICENSE.