bjornmelin

docmind-ai-llm

# 🧠 DocMind AI: Local LLM for AI-Powered Document Analysis

![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B?style=for-the-badge&logo=streamlit&logoColor=white)
![LlamaIndex](https://img.shields.io/badge/LlamaIndex-7C3AED?style=for-the-badge)
![LangGraph](https://img.shields.io/badge/🔗_LangGraph-4A90E2?style=for-the-badge)
![Qdrant](https://img.shields.io/badge/Qdrant-DC244C?style=for-the-badge&logo=qdrant&logoColor=white)
![spaCy](https://img.shields.io/badge/spaCy-09A3D5?style=for-the-badge&logo=spacy&logoColor=white)
![Docker](https://img.shields.io/badge/Docker-2496ED?style=for-the-badge&logo=docker&logoColor=white)
![Ollama](https://img.shields.io/badge/Ollama-000000?style=for-the-badge)

[![MIT License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](https://choosealicense.com/licenses/mit/)

**DocMind AI** provides local document analysis with zero cloud dependency. It combines hybrid retrieval (dense + sparse), optional knowledge graph extraction (GraphRAG), and a 5-agent coordinator to analyze PDFs, Office docs, HTML/Markdown, and image-rich PDFs. Built on LlamaIndex pipelines with LangGraph supervisor orchestration, the default vLLM profile targets Qwen/Qwen3-4B-Instruct-2507-FP8 (128 K context window) and runs entirely on your hardware with optional GPU acceleration.

**Why DocMind AI**: Traditional document analysis tools either send your data to the cloud (privacy risk) or provide basic keyword search (limited intelligence). DocMind AI keeps everything local while still supporting complex, multi-step queries, entity/relationship extraction (GraphRAG), and agent-coordinated synthesis.

Design goals:

- Privacy by default: remote endpoints are blocked unless explicitly allowed.
- Reproducibility: deterministic ingestion caching and snapshot manifests.
- Extensibility: RouterQueryEngine routes across vector, hybrid, and optional graph retrieval.

## ✨ Features of DocMind AI

- **Privacy-focused, local-first:** Remote LLM endpoints are blocked by default; enable explicitly when needed.
- **Library-first ingestion pipeline:** LlamaIndex `IngestionPipeline` with `UnstructuredReader` when installed (fallback to plain-text), `TokenTextSplitter`, optional spaCy enrichment, and optional `TitleExtractor`.
- **Multi-format parsing:** Unstructured covers common formats (PDF, DOCX, PPTX, XLSX, HTML, Markdown, TXT, CSV, JSON, RTF, MSG, ODT, EPUB) when supported; unsupported types fall back to plain-text when possible.
- **Hybrid retrieval with routing:** RouterQueryEngine with `semantic_search`, optional `hybrid_search` (Qdrant server-side fusion), and optional `knowledge_graph` (GraphRAG).
- **Qdrant server-side fusion:** Query API RRF (default) or DBSF over named vectors `text-dense` and `text-sparse`; sparse queries use FastEmbed BM42/BM25 when available.
- **Reranking and multimodal:** Text rerank via BGE cross-encoder; SigLIP visual rerank runs when visual nodes are present; ColPali optional via the `multimodal` extra.
- **Multi-agent coordination:** LangGraph supervisor orchestrates five agents (router, planner, retrieval, synthesis, validation).
- **Snapshots and reproducibility:** DuckDB KV cache plus snapshot manifests with corpus/config hashes; graph exports as JSONL/Parquet (Parquet requires PyArrow).
- **PDF page images:** PyMuPDF renders page images to WebP/JPEG; optional AES-GCM encryption with `.enc` outputs and just-in-time decryption for visual scoring.
- **ArtifactStore (multimodal durability):** Page images/thumbnails are stored as content-addressed `ArtifactRef(sha256, suffix)` (no base64 blobs or host paths in durable stores).
- **Multimodal UX:** Chat renders image sources and supports query-by-image “Visual search” (SigLIP) for image-rich PDFs.
- **Offline-first design:** Runs fully offline once models are present; remote endpoints must be explicitly enabled.
- **GPU acceleration:** Optional GPU extras for local embedding/reranking acceleration (vLLM runs as an external OpenAI-compatible server).
- **Robust retries and logging:** Tenacity-backed retries for LLM calls and structured logging via Loguru.
- **Observability and operations:** Optional OTLP tracing/metrics plus JSONL telemetry; Docker and Compose included for local deployments.

## Table of Contents

- [🧠 DocMind AI: Local LLM for AI-Powered Document Analysis](#-docmind-ai-local-llm-for-ai-powered-document-analysis)
  - [✨ Features of DocMind AI](#-features-of-docmind-ai)
  - [Table of Contents](#table-of-contents)
  - [Getting Started with DocMind AI](#getting-started-with-docmind-ai)
    - [Prerequisites](#prerequisites)
    - [Installation](#installation)
    - [Running the App](#running-the-app)
  - [Usage](#usage)
    - [Configure LLM Runtime (Settings page)](#configure-llm-runtime-settings-page)
    - [Ingest Documents and Build Snapshots (Documents page)](#ingest-documents-and-build-snapshots-documents-page)
    - [Chat with Documents (Chat page)](#chat-with-documents-chat-page)
    - [Analytics (optional)](#analytics-optional)
  - [API Usage Examples](#api-usage-examples)
    - [Programmatic Ingestion](#programmatic-ingestion)
    - [Programmatic Query (Router + Coordinator)](#programmatic-query-router--coordinator)
    - [Prompt Templates (developer API)](#prompt-templates-developer-api)
    - [Custom Configuration](#custom-configuration)
    - [Batch Document Processing](#batch-document-processing)
  - [Architecture](#architecture)
  - [Implementation Details](#implementation-details)
    - [Document Processing Pipeline](#document-processing-pipeline)
    - [Hybrid Retrieval Architecture](#hybrid-retrieval-architecture)
    - [Multi-Agent Coordination](#multi-agent-coordination)
    - [Performance Optimizations](#performance-optimizations)
  - [Configuration](#configuration)
    - [Why `DOCMIND_*` and not provider env vars?](#why-docmind_-and-not-provider-env-vars)
    - [Configuration Philosophy](#configuration-philosophy)
    - [Environment Variables](#environment-variables)
    - [Enable DSPy Optimization (optional)](#enable-dspy-optimization-optional)
    - [Additional Configuration](#additional-configuration)
  - [Performance Defaults and Monitoring](#performance-defaults-and-monitoring)
    - [Configured Defaults](#configured-defaults)
    - [Measure Locally](#measure-locally)
    - [Retrieval \& Reranking Defaults](#retrieval--reranking-defaults)
      - [Operational Flags (local-first)](#operational-flags-local-first)
  - [Offline Operation](#offline-operation)
    - [Prerequisites for Offline Use](#prerequisites-for-offline-use)
    - [Prefetch Model Weights](#prefetch-model-weights)
    - [Snapshots \& Staleness](#snapshots--staleness)
    - [GraphRAG Exports \& Seeds](#graphrag-exports--seeds)
    - [Model Requirements](#model-requirements)
  - [Troubleshooting](#troubleshooting)
    - [Common Issues](#common-issues)
      - [1. Ollama Connection Errors](#1-ollama-connection-errors)
      - [2. GPU Not Detected](#2-gpu-not-detected)
      - [3. Model Download Issues](#3-model-download-issues)
      - [4. Memory Issues](#4-memory-issues)
      - [5. Document Processing Errors](#5-document-processing-errors)
      - [6. vLLM Server Connectivity Issues](#6-vllm-server-connectivity-issues)
      - [7. PyTorch Compatibility Issues](#7-pytorch-compatibility-issues)
      - [8. GPU Memory Issues (16 GB RTX 4090)](#8-gpu-memory-issues-16-gb-rtx-4090)
      - [9. Performance Validation](#9-performance-validation)
    - [Performance Optimization](#performance-optimization)
    - [Getting Help](#getting-help)
  - [How to Cite](#how-to-cite)
  - [Contributing](#contributing)
    - [Development Guidelines](#development-guidelines)
    - [Tests and CI](#tests-and-ci)
  - [License](#license)
  - [Observability](#observability)

## Getting Started with DocMind AI

### Prerequisites

- One supported LLM backend running locally: [Ollama](https://ollama.com/) (default), vLLM OpenAI-compatible server, LM Studio, or a llama.cpp server.
- Python 3.12.13 (see `pyproject.toml`)
- (Optional) Docker and Docker Compose for containerized deployment.
- (Optional) NVIDIA GPU (e.g., RTX 4090 Laptop) with at least 16 GB VRAM for 128 K context (vLLM) and accelerated performance.

### Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/BjornMelin/docmind-ai-llm.git
   cd docmind-ai-llm
   ```

2. **Install dependencies:**

   ```bash
   uv sync
   ```

   _Need LlamaIndex OpenTelemetry instrumentation?_ Install the optional observability extras as well:

   ```bash
   uv sync --frozen --extra observability
   ```

   _Need GraphRAG adapters or ColPali visual reranking?_ Install the optional extras:

   ```bash
   uv sync --frozen --extra graph
   uv sync --frozen --extra multimodal
   ```

   **Key Dependencies Included:**

   - **LlamaIndex (>=0.14.12,<0.15.0)**: Retrieval, RouterQueryEngine, IngestionPipeline, PropertyGraphIndex
   - **LangGraph (>=1.0.10,<2.0.0)**: 5-agent supervisor orchestration (graph-native `StateGraph`, no external supervisor wrapper)
   - **Streamlit (>=1.52.2,<2.0.0)**: Web interface framework
   - **Ollama (0.6.1)**: Local LLM integration
   - **Qdrant Client (>=1.15.1,<2.0.0)**: Vector database operations
   - **Unstructured (>=0.18.26,<0.19.0)**: Multi-format parsing (PDF/DOCX/PPTX/XLSX, etc.)
   - **LlamaIndex Embeddings FastEmbed (>=0.5.0,<0.6.0)**: Sparse query encoding (optional fastembed-gpu >=0.8.0,<0.9.0)
   - **Tenacity (>=9.1.2,<10.0.0)**: Retry strategies with exponential backoff
   - **Loguru (>=0.7.3,<1.0.0)**: Structured logging
   - **Pydantic (2.12.5)**: Data validation and settings.

3. **Install spaCy language model:**

   spaCy is bundled for optional **NLP enrichment** (sentence segmentation + entity extraction during ingestion). Install a language model if you plan to use enrichment:

   ```bash
   # Install the small English model (recommended, ~15MB)
   uv run python -m spacy download en_core_web_sm

   # Optional: Install larger models for better accuracy
   # Medium model (~50MB): uv run python -m spacy download en_core_web_md
   # Large model (~560MB): uv run python -m spacy download en_core_web_lg
   ```

   **Note:** spaCy models are downloaded and cached locally. The app does not auto-download models; install them explicitly for offline use.

   Optional configuration (defaults shown):

   ```bash
   # Enable/disable enrichment
   DOCMIND_SPACY__ENABLED=true
   # Pipeline name or path (blank fallback when missing)
   DOCMIND_SPACY__MODEL=en_core_web_sm
   # cpu|cuda|apple|auto (auto prefers CUDA, then Apple, else CPU)
   DOCMIND_SPACY__DEVICE=auto
   DOCMIND_SPACY__GPU_ID=0
   ```

   Cross-platform acceleration:

   - NVIDIA CUDA (Linux/Windows): `uv sync --frozen --extra gpu` and set `DOCMIND_SPACY__DEVICE=auto|cuda`
   - Apple Silicon (macOS arm64): `uv sync --frozen --extra apple` and set `DOCMIND_SPACY__DEVICE=auto|apple`

   See `docs/specs/spec-015-nlp-enrichment-spacy.md` and `docs/developers/gpu-setup.md`.

4. **Set up environment configuration:**

   Copy the example environment file and configure your settings:

   ```bash
   cp .env.example .env
   # Edit .env with your preferred settings

   # Model names are backend-specific:
   # - Ollama: use the local tag (e.g., qwen3-4b-instruct-2507)
   # - vLLM/LM Studio/llama.cpp: use the served model name
   # NOTE: DOCMIND_MODEL (top-level) overrides backend-specific model vars such as DOCMIND_VLLM__MODEL at runtime.

   # Example - Ollama (local, default):
   # DOCMIND_LLM_BACKEND=ollama
   # DOCMIND_OLLAMA_BASE_URL=http://localhost:11434
   # DOCMIND_MODEL=qwen3-4b-instruct-2507

   # Example - LM Studio (local, OpenAI-compatible):
   # DOCMIND_LLM_BACKEND=lmstudio
   # DOCMIND_OPENAI__BASE_URL=http://localhost:1234/v1
   # DOCMIND_OPENAI__API_KEY=not-needed
   # DOCMIND_MODEL=your-model-name

   # Example - vLLM OpenAI-compatible server:
   # DOCMIND_LLM_BACKEND=vllm
   # DOCMIND_OPENAI__BASE_URL=http://localhost:8000/v1
   # DOCMIND_OPENAI__API_KEY=not-needed
   # DOCMIND_VLLM__MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8

   # Example - llama.cpp server:
   # DOCMIND_LLM_BACKEND=llamacpp
   # DOCMIND_LLAMACPP_BASE_URL=http://localhost:8080/v1
   # DOCMIND_OPENAI__API_KEY=not-needed
   # DOCMIND_MODEL=local-gguf

   # Offline-first recommended:
   # HF_HUB_OFFLINE=1
   # TRANSFORMERS_OFFLINE=1

   # Optional - OpenAI-compatible cloud / gateway (breaks strict offline):
   # DOCMIND_LLM_BACKEND=openai_compatible
   # DOCMIND_OPENAI__BASE_URL=https://api.openai.com/v1
   # DOCMIND_OPENAI__API_KEY=sk-...
   # DOCMIND_OPENAI__API_MODE=responses
   # DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=true
   ```

   Start llama.cpp as an OpenAI-compatible GGUF server:

   ```bash
   # CPU / portable baseline
   llama-server -m ./models/model.gguf --alias local-gguf \
     --ctx-size 8192 --host 127.0.0.1 --port 8080

   # CUDA or other GPU backends
   llama-server -m ./models/model.gguf --alias local-gguf \
     --ctx-size 8192 -ngl 999 -fa --host 127.0.0.1 --port 8080
   ```

   Use the `--alias` value as `DOCMIND_MODEL`, keep `/v1` in `DOCMIND_LLAMACPP_BASE_URL`, and bind to loopback unless remote access is explicitly required. For remote access, start `llama-server` with `--api-key` and configure `DOCMIND_OPENAI__API_KEY`.

   For a complete overview, see `docs/developers/configuration.md`. The relevant section is `LLM Backend Selection`.

5. **(Optional) Install GPU support (embeddings/reranking acceleration):**

   Install the repo’s GPU extras and the CUDA wheel index for PyTorch. The `unsafe-best-match` strategy is intentional for this CUDA-only command so uv can select CUDA 12.8 wheels from the PyTorch index instead of CPU wheels from the default index:

   ```bash
   nvidia-smi

   uv sync --frozen --extra gpu --index https://download.pytorch.org/whl/cu128 --index-strategy=unsafe-best-match

   uv run python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
   ```

   **Hardware Guidance:**

   - CUDA-capable GPU (16 GB VRAM recommended for 128 K context)
   - CUDA Toolkit 12.8+
   - Driver compatible with CUDA 12.8

   **Notes:**

   - vLLM is supported via an external OpenAI-compatible server (see Troubleshooting section 6 for connectivity checks).
   - Measure performance on your hardware with `uv run python scripts/performance_monitor.py`.

   See [GPU Setup Guide](docs/developers/gpu-setup.md) (installation) and [Hardware Policy](docs/developers/hardware_policy.md) (hardware/VRAM guidance).
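Step 4 asks you to keep `/v1` in OpenAI-compatible base URLs and to bind local servers to loopback. The sketch below shows one way to sanity-check both before writing a value into `.env`. It uses only the Python standard library; the helper names `ensure_v1_suffix` and `is_loopback_host` are illustrative assumptions and are not part of DocMind AI, whose own validation may differ.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def ensure_v1_suffix(base_url: str) -> str:
    """Return base_url with its path ending in exactly one /v1 segment."""
    parts = urlsplit(base_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path = f"{path}/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def is_loopback_host(base_url: str) -> bool:
    """True when the URL host is 'localhost' or a loopback IP literal."""
    host = urlsplit(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treated as non-loopback
        # in this sketch, with no DNS resolution attempted.
        return False


if __name__ == "__main__":
    url = ensure_v1_suffix("http://127.0.0.1:8080")
    print(url, is_loopback_host(url))
```

A URL that fails the loopback check is not necessarily wrong; it only means the remote-endpoint settings described in the Usage section would apply to it.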
### Running the App

**Locally:**

```bash
uv run streamlit run app.py
```

To honor `DOCMIND_UI__STREAMLIT_PORT`, use:

```bash
./scripts/run_app.sh
```

**With Docker:**

```bash
docker compose up --build
```

Access the app at `http://localhost:8501`.

Note: GPU reservations in `docker-compose.yml` require Docker Engine with Compose V2 (Docker Compose plugin). The `deploy.resources.reservations.devices` block is ignored on older Compose versions and in swarm mode.

## Usage

### Configure LLM Runtime (Settings page)

- Select the active provider (`ollama`, `vllm`, `lmstudio`, `llamacpp`).
- Set model, context window, timeout, and GPU acceleration toggle.
- Model IDs are backend-specific (Ollama tags vs OpenAI-compatible model names).
- OpenAI-compatible base URLs are normalized to include `/v1` (LM Studio enforces `/v1`).
- When `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false` (default), loopback hosts are always allowed, but non-loopback hosts must be allowlisted via `DOCMIND_SECURITY__ENDPOINT_ALLOWLIST` and must DNS-resolve to public IPs (private/link-local/reserved ranges are rejected).
- Set `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=true` to opt out (required for private/internal endpoints like Docker service hostnames).

### Ingest Documents and Build Snapshots (Documents page)

- Upload files in the Documents page.
- Optional toggles:
  - **Build GraphRAG (beta)** to create a PropertyGraphIndex when enabled.
  - **Encrypt page images (AES-GCM)** to store rendered PDF images as `.enc`.
- GraphRAG requires optional graph dependencies; the Settings page shows adapter status.
- Ingestion builds a vector index (Qdrant) and optional graph index, then writes a snapshot to `data/storage/`.
- Graph exports (JSONL/Parquet) are available when a graph index exists.

### Chat with Documents (Chat page)

- The Chat page autoloads the latest snapshot per `graphrag_cfg.autoload_policy`.
- Stale snapshots trigger a warning; rebuild from the Documents page.
- Responses are generated via `MultiAgentCoordinator` and the router engine; the UI streams chunks for readability.

### Analytics (optional)

- Enable `DOCMIND_ANALYTICS_ENABLED=true` to use the Analytics page.
- Charts read from `data/analytics/analytics.duckdb` when query metrics are present.

## API Usage Examples

### Programmatic Ingestion

```python
from pathlib import Path

from src.models.processing import IngestionConfig, IngestionInput
from src.processing.ingestion_pipeline import ingest_documents_sync

cfg = IngestionConfig(cache_dir=Path("./cache/ingestion"))
inputs = [
    IngestionInput(
        document_id="doc-1",
        source_path=Path("path/to/document.pdf"),
        metadata={"source": "local"},
    )
]
result = ingest_documents_sync(cfg, inputs)
print(result.manifest.model_dump())
```

### Programmatic Query (Router + Coordinator)

```python
from llama_index.core import StorageContext, VectorStoreIndex

from src.agents.coordinator import MultiAgentCoordinator
from src.config import settings
from src.retrieval.router_factory import build_router_engine
from src.utils.storage import create_vector_store

# Requires Qdrant running and embeddings configured.
# Uses `result.nodes` from the ingestion example above.
store = create_vector_store(
    settings.database.qdrant_collection,
    enable_hybrid=settings.retrieval.enable_server_hybrid,
)
storage_context = StorageContext.from_defaults(vector_store=store)
vector_index = VectorStoreIndex(result.nodes, storage_context=storage_context, show_progress=False)

router = build_router_engine(vector_index, pg_index=None, settings=settings)
coord = MultiAgentCoordinator()
resp = coord.process_query(
    "Summarize the key findings and action items",
    context=None,
    settings_override={"router_engine": router, "vector": vector_index},
)
print(resp.content)
```

### Prompt Templates (developer API)

```python
from src.prompting import list_presets, list_templates, render_prompt

tpl = next(t for t in list_templates() if t.id == "comprehensive-analysis")
tones = list_presets("tones")
roles = list_presets("roles")

ctx = {
    "context": "Example context",
    "tone": tones["professional"],
    "role": roles["assistant"],
}
prompt = render_prompt(tpl.id, ctx)
print(prompt)
```

Templates live in `templates/prompts/*.prompt.md`. Presets are in `templates/presets/*.yaml`.
### Custom Configuration

```python
import os

from src.config.settings import DocMindSettings

os.environ["DOCMIND_LLM_BACKEND"] = "vllm"
os.environ["DOCMIND_VLLM__MODEL"] = "Qwen/Qwen3-4B-Instruct-2507-FP8"
os.environ["DOCMIND_VLLM__CONTEXT_WINDOW"] = "131072"
os.environ["DOCMIND_ENABLE_GPU_ACCELERATION"] = "true"

settings = DocMindSettings()
print(settings.llm_backend, settings.vllm.model, settings.effective_context_window)
```

### Batch Document Processing

```python
import hashlib
from pathlib import Path

from src.models.processing import IngestionConfig, IngestionInput
from src.processing.ingestion_pipeline import ingest_documents_sync

folder = Path("/path/to/documents")
extensions = {".pdf", ".docx", ".txt", ".md", ".pptx", ".xlsx"}
paths = [p for p in folder.rglob("*") if p.suffix.lower() in extensions]

inputs = []
for path in paths:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    inputs.append(
        IngestionInput(
            document_id=f"doc-{digest[:16]}",
            source_path=path,
            metadata={"source": path.name},
        )
    )

result = ingest_documents_sync(IngestionConfig(cache_dir=Path("./cache/ingestion")), inputs)
print(f"Processed {len(result.nodes)} nodes from {len(inputs)} files")
```

## Architecture

```mermaid
flowchart TD
    A["Documents page Upload files"] --> B["Ingestion pipeline UnstructuredReader or text fallback"]
    B --> C["TokenTextSplitter, spaCy enrichment (optional), TitleExtractor (optional) LlamaIndex IngestionPipeline"]
    C --> D["Nodes and metadata"]
    D --> E["VectorStoreIndex Qdrant named vectors"]
    C --> F["PDF page image exports PyMuPDF, optional AES-GCM"]
    D --> G["PropertyGraphIndex optional"]
    E --> H["RouterQueryEngine semantic_search / hybrid_search knowledge_graph"]
    G --> H
    H --> I["MultiAgentCoordinator LangGraph supervisor - 5 agents"]
    I --> J["Chat page Responses"]
    K["Snapshot manager data/storage"] <--> E
    K <--> G
    L["Ingestion cache DuckDB KV"] <--> C
```

## Implementation Details

### Document Processing Pipeline

- **Parsing:** Uses LlamaIndex `UnstructuredReader` when available; falls back to plain-text for unsupported inputs.
- **Chunking:** `TokenTextSplitter` with configurable `chunk_size`/`chunk_overlap`; `TitleExtractor` is optional.
- **NLP enrichment (optional):** spaCy sentence segmentation + entity extraction during ingestion; outputs are stored as safe node metadata (`docmind_nlp`). See `docs/specs/spec-015-nlp-enrichment-spacy.md`.
- **Caching:** DuckDB KV ingestion cache with optional docstore persistence.
- **PDF page images:** PyMuPDF renders page images; optional AES-GCM encryption and `.enc` handling.
- **Observability:** OpenTelemetry spans are recorded when observability is enabled.

### Hybrid Retrieval Architecture

- **Unified Text Embeddings:** BGE-M3 (BAAI/bge-m3) via LlamaIndex for dense vectors (1024D); sparse query vectors via FastEmbed BM42/BM25 when available.
- **Multimodal:** SigLIP visual scoring by default via the shared pinned `src/utils/vision_siglip.py` loader; OpenCLIP optional. ColPali visual reranking is optional (multimodal extra).
- **Multimodal retrieval (PDF images):** `multimodal_search` fuses text hybrid with SigLIP text→image retrieval over a dedicated Qdrant image collection and returns image-bearing sources for rendering.
- **Fusion:** Server-side RRF via Qdrant Query API when `DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=true` (DBSF optional).
- **Deduplication:** Configurable key via `DOCMIND_RETRIEVAL__DEDUP_KEY` (page_id|doc_id); default = `page_id`.
- **Router composition:** See `src/retrieval/router_factory.py` (tools: `semantic_search`, `hybrid_search`, `knowledge_graph`). Selector preference: `PydanticSingleSelector` (preferred) → `LLMSingleSelector` fallback. The `knowledge_graph` tool is activated only when a PropertyGraphIndex is present and healthy; otherwise the router uses vector/hybrid only.
- **Storage:** Qdrant vector database with metadata filtering and concurrent access

### Multi-Agent Coordination

- **Supervisor Pattern:** LangGraph `StateGraph` supervisor (repo-local implementation in `src/agents/supervisor_graph.py`) with checkpoint/store support
- **5 Specialized Agents:**
  - **Query Router:** Analyzes query complexity and determines optimal retrieval strategy
  - **Query Planner:** Decomposes complex queries into manageable sub-tasks for better processing
  - **Retrieval Expert:** Executes optimized retrieval with server-side hybrid (Qdrant) and optional GraphRAG; supports optional DSPy query optimization when enabled
  - **Result Synthesizer:** Combines and reconciles results from multiple retrieval passes with deduplication
  - **Response Validator:** Validates response quality, accuracy, and completeness before final output
- **Enhanced Capabilities:** Optional GraphRAG for multi-hop reasoning and optional DSPy query optimization for query rewriting
- **Workflow Coordination:** Supervisor routes between agents; coordination overhead is tracked with a 200ms target threshold.
- **Session State:** Streamlit session state holds chat history; snapshots persist retrieval artifacts to disk.
- **Async Execution:** Concurrent agent operations with automatic resource management and fallback mechanisms

### Performance Optimizations

- **GPU Acceleration:** Optional GPU extras for embeddings/reranking; vLLM runs as an external OpenAI-compatible server.
- **Async processing:** Asynchronous ingestion is supported; retrieval/rerank stages use bounded timeouts and fail open.
- **Reranking:** Text cross-encoder + SigLIP visual stage with rank-level RRF merge; ColPali optional.
- **Memory Management:** Device selection and VRAM checks are centralized in `src/utils/core.py`.

## Configuration

DocMind AI uses a unified Pydantic Settings model (`src/config/settings.py`). Environment variables use the `DOCMIND_` prefix with `__` for nested fields.
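To make the naming rule concrete, the sketch below groups `DOCMIND_`-prefixed variables into nested keys by splitting on `__`. It is a standard-library illustration of the convention only: the function name `nest_docmind_env` is hypothetical, and the project's Pydantic Settings model does the real parsing, including typing and validation that this sketch omits.

```python
from typing import Any, Mapping

PREFIX = "DOCMIND_"
NESTED_DELIMITER = "__"


def nest_docmind_env(environ: Mapping[str, str]) -> dict[str, Any]:
    """Group DOCMIND_* variables into a nested dict keyed by lowercase names."""
    nested: dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # provider variables such as OLLAMA_* are left alone
        path = name[len(PREFIX):].lower().split(NESTED_DELIMITER)
        node = nested
        for key in path[:-1]:
            # Descend into (or create) the sub-dict for each nested field.
            node = node.setdefault(key, {})
        node[path[-1]] = value  # values stay as raw strings in this sketch
    return nested


if __name__ == "__main__":
    example = {
        "DOCMIND_LLM_BACKEND": "vllm",
        "DOCMIND_VLLM__MODEL": "Qwen/Qwen3-4B-Instruct-2507-FP8",
        "DOCMIND_RETRIEVAL__FUSION_MODE": "rrf",
        "OLLAMA_HOST": "127.0.0.1",
    }
    print(nest_docmind_env(example))
```

In this example `DOCMIND_VLLM__MODEL` lands under `vllm.model`, while `OLLAMA_HOST` is skipped because it lacks the application prefix.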
The Streamlit entrypoint calls `bootstrap_settings()` to load `.env` (no import-time `.env` IO).

### Why `DOCMIND_*` and not provider env vars?

DocMind’s `DOCMIND_*` variables configure the **application** (routing, security, and provider selection) and are intentionally separate from provider/server variables such as `OLLAMA_*`, `OPENAI_*`, or `VLLM_*` that control those services directly. Keeping a single, app-scoped config surface:

- avoids collisions with provider/daemon env vars on the same machine,
- keeps security policy (remote endpoint allowlisting) centralized, and
- ensures consistent behavior across backends.

Use `DOCMIND_OLLAMA_API_KEY` for Ollama Cloud access; `OLLAMA_*` remains reserved for the Ollama server/CLI itself.

### Configuration Philosophy

Configuration is centralized and strongly typed. Prefer `.env` overrides and keep runtime toggles in one place for repeatable local runs.

### Environment Variables

DocMind AI uses environment variables for configuration. Copy the example file and customize:

```bash
cp .env.example .env
```

Key configuration options in `.env`:

```bash
# LLM backend
DOCMIND_LLM_BACKEND=ollama
DOCMIND_OLLAMA_BASE_URL=http://localhost:11434
# Optional (Ollama Cloud / web search)
# DOCMIND_OLLAMA_API_KEY=
# DOCMIND_OLLAMA_ENABLE_WEB_SEARCH=false
# DOCMIND_OLLAMA_EMBED_DIMENSIONS=
# DOCMIND_OLLAMA_ENABLE_LOGPROBS=false
# DOCMIND_OLLAMA_TOP_LOGPROBS=0
# DOCMIND_LLM_BACKEND=vllm
# DOCMIND_MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8  # top-level override
# DOCMIND_VLLM__VLLM_BASE_URL=http://localhost:8000
# DOCMIND_VLLM__MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8
# DOCMIND_VLLM__CONTEXT_WINDOW=131072

# Embeddings
DOCMIND_EMBEDDING__MODEL_NAME=BAAI/bge-m3
# Optional: only set when pinning a custom SigLIP model to a matching revision.
# The default SigLIP model uses DocMind's curated revision automatically.
# DOCMIND_EMBEDDING__SIGLIP_MODEL_REVISION=7fd15f0689c79d79e38b1c2e2e2370a7bf2761ed

# Retrieval / reranking
DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=false
DOCMIND_RETRIEVAL__FUSION_MODE=rrf
DOCMIND_RETRIEVAL__USE_RERANKING=true
DOCMIND_RETRIEVAL__RERANKING_TOP_K=5

# Cache
DOCMIND_CACHE__DIR=./cache
DOCMIND_CACHE__FILENAME=docmind.duckdb
DOCMIND_CACHE__MAX_SIZE_MB=1000

# GraphRAG (requires both flags)
DOCMIND_ENABLE_GRAPHRAG=false
DOCMIND_GRAPHRAG_CFG__ENABLED=false

# GPU and security toggles
DOCMIND_ENABLE_GPU_ACCELERATION=true
DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false
```

See the complete [.env.example](.env.example) file for all available configuration options.

### Enable DSPy Optimization (optional)

To turn on query optimization via DSPy, enable the feature flag in your `.env`:

```bash
DOCMIND_ENABLE_DSPY_OPTIMIZATION=true
```

Optional tuning (defaults are sensible):

```bash
DOCMIND_DSPY_OPTIMIZATION_ITERATIONS=10
DOCMIND_DSPY_OPTIMIZATION_SAMPLES=20
DOCMIND_DSPY_MAX_RETRIES=3
DOCMIND_DSPY_TEMPERATURE=0.1
DOCMIND_DSPY_METRIC_THRESHOLD=0.8
DOCMIND_ENABLE_DSPY_BOOTSTRAPPING=true
```

Notes:

- DSPy runs in the agents layer and augments retrieval by refining the query; retrieval remains library-first (server-side hybrid via Qdrant + reranking).
- `DOCMIND_ENABLE_DSPY_OPTIMIZATION=true` only takes effect when DSPy is installed. DSPy is not installed by default while its dependency chain conflicts with the project security floors, so on the supported baseline the flag falls back gracefully to standard retrieval unless you install DSPy separately.

### Additional Configuration

**Streamlit UI Configuration** (optional): Create `.streamlit/config.toml` if you want to override Streamlit defaults:

```toml
[theme]
base = "light"
primaryColor = "#FF4B4B"

[server]
maxUploadSize = 200
```

**Cache Configuration**:

- Ingestion cache: DuckDB KV store under `./cache/docmind.duckdb` (see `DOCMIND_CACHE__DIR` and `DOCMIND_CACHE__FILENAME`).
- PDF page images: rendered under `./cache/page_images/` and stored durably as content-addressed artifacts under `./data/artifacts/` by default.
- Model weights: cached via Hugging Face defaults (`~/.cache/huggingface`).

## Performance Defaults and Monitoring

> **Note**: Performance depends on hardware, model size, and corpus size. Use the scripts below to measure on your machine.

### Configured Defaults

- Rerank timeouts: text 250ms, SigLIP 150ms, ColPali 400ms, total budget 800ms (`DOCMIND_RETRIEVAL__*`).
- Coordination overhead target: 200ms (`COORDINATION_OVERHEAD_THRESHOLD` in `src/agents/coordinator.py`).
- Context cap: 131072 by default, max 200000 (`DOCMIND_LLM_CONTEXT_WINDOW_MAX`).
- Monitoring thresholds: `DOCMIND_MONITORING__MAX_QUERY_LATENCY_MS`, `DOCMIND_MONITORING__MAX_MEMORY_GB`, `DOCMIND_MONITORING__MAX_VRAM_GB`.

### Measure Locally

- `uv run python scripts/performance_monitor.py --run-tests --check-regressions --report`
- `uv run python scripts/test_gpu.py --quick`

### Retrieval & Reranking Defaults

- Hybrid retrieval uses Qdrant named vectors `text-dense` (1024D COSINE; BGE-M3) and `text-sparse` (FastEmbed BM42/BM25 + IDF) when `DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=true`.
- Default fusion is RRF; DBSF is available with `DOCMIND_RETRIEVAL__FUSION_MODE=dbsf`.
- Prefetch defaults: dense 200, sparse 400; `fused_top_k=60`; `page_id` de-dup.
- Reranking is enabled by default: BGE v2-m3 (text) + SigLIP (visual), with optional ColPali; timeouts are enforced and fail open.
- Feature flags (hybrid, reranking) are env-only; RRF K and timeouts are adjustable in the Settings page.
- Router parity: RouterQueryEngine tools (vector/hybrid/KG) apply the same reranking policy via `node_postprocessors` behind `DOCMIND_RETRIEVAL__USE_RERANKING`.

#### Operational Flags (local-first)

- `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` to disable network egress (after predownload).
- `DOCMIND_RETRIEVAL__FUSION_MODE=rrf|dbsf` to control Qdrant fusion.
- `DOCMIND_RETRIEVAL__USE_RERANKING=true|false` (canonical env override).
- LLM base URLs are validated when `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false`: loopback is always allowed; allowlisted non-loopback hosts are DNS-resolved and rejected if they map to private/link-local/reserved ranges.

## Offline Operation

DocMind AI is designed for complete offline operation:

### Prerequisites for Offline Use

1. **Install Ollama locally:**

   ```bash
   # Download from https://ollama.com/download
   ollama serve  # Start the service
   ```

2. **Pull required models:**

   ```bash
   ollama pull qwen3-4b-instruct-2507  # Recommended for 128 K context
   ollama pull qwen2:7b  # Alternative model
   ```

3. **Verify GPU setup (optional):**

   ```bash
   nvidia-smi  # Check GPU availability
   uv run python scripts/test_gpu.py --quick  # Validate CUDA setup
   ```

### Prefetch Model Weights

Run once (online) to predownload required models for offline use:

```bash
uv run python tools/models/pull.py --all --cache_dir ./models_cache
```

### Snapshots & Staleness

DocMind snapshots persist indices atomically for reproducible retrieval.

- `manifest.meta.json` fields include `schema_version`, `persist_format_version`, `complete`, `created_at`, `index_id`, `graph_store_type`, `vector_store_type`, `corpus_hash`, `config_hash`, `versions`, and `graph_exports` when present.
- Hashing: `corpus_hash` computed with POSIX relpaths relative to a stable base dir (the Documents UI uses `uploads/`) for OS-agnostic stability.
- Chat autoload: the Chat page loads the latest non-stale snapshot when available; otherwise it shows a staleness badge and offers to rebuild.

### GraphRAG Exports & Seeds

- Graph exports preserve relation labels when provided by `get_rel_map` (fallback label `related`). Exports: JSONL baseline (portable) and Parquet (optional, requires PyArrow). Export seeding follows a retriever-first policy: graph -> vector -> deterministic fallback.
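As a rough illustration of the JSONL baseline and the `related` fallback label described above, the sketch below writes relation triples one JSON object per line. The function name `write_relations_jsonl` and the row fields (`subject`, `relation`, `object`) are assumptions made for this example; the project's actual export schema may differ.

```python
import io
import json
from typing import Iterable, Optional, TextIO

FALLBACK_LABEL = "related"  # label used when a relation has no name


def write_relations_jsonl(
    triples: Iterable[tuple[str, Optional[str], str]], out: TextIO
) -> int:
    """Write (subject, label, object) triples as JSON Lines; return the row count."""
    count = 0
    for subject, label, obj in triples:
        row = {
            "subject": subject,
            # Keep the provided label; fall back only when it is missing or empty.
            "relation": label or FALLBACK_LABEL,
            "object": obj,
        }
        out.write(json.dumps(row, ensure_ascii=False, sort_keys=True) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    write_relations_jsonl(
        [("Acme", "acquired", "Widgets Inc"), ("Alice", None, "Bob")], buffer
    )
    print(buffer.getvalue(), end="")
```

Writing one self-contained object per line keeps the export portable and streamable without extra dependencies, which is why JSONL works as the baseline while Parquet stays optional.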
Set env for offline operation: ```bash export HF_HUB_OFFLINE=1 export TRANSFORMERS_OFFLINE=1 ``` ### Model Requirements Model sizing depends on your hardware and chosen backend. See [Hardware Policy](docs/developers/hardware_policy.md) for device and VRAM guidance. ## Troubleshooting ### Common Issues #### 1. Ollama Connection Errors ```bash # Check if Ollama is running curl http://localhost:11434/api/version # If not running, start it ollama serve ``` #### 2. GPU Not Detected ```bash # Install GPU dependencies uv sync --frozen --extra gpu --index https://download.pytorch.org/whl/cu128 --index-strategy=unsafe-best-match # Verify CUDA installation nvidia-smi uv run python -c "import torch; print(torch.cuda.is_available())" ``` The `unsafe-best-match` strategy is intentional for this CUDA-only command so uv can select CUDA 12.8 wheels from the PyTorch index instead of CPU wheels from the default index. #### 3. Model Download Issues ```bash # Pull models manually ollama pull qwen3-4b-instruct-2507 # For 128 K context ollama pull qwen2:7b # Alternative ollama list # Verify installation ``` #### 4. Memory Issues - Reduce context size in Settings (131072 → 65536 → 32768 → 4096) - Use smaller models (4B instead of 7B/14B for lower VRAM) - Adjust chunking via `DOCMIND_PROCESSING__CHUNK_SIZE` and `DOCMIND_PROCESSING__CHUNK_OVERLAP` - Close other applications to free RAM #### 5. 
Document Processing Errors

```bash
# Smoke test ingestion (no external services)
uv run python scripts/run_ingestion_demo.py

# If a specific file fails in the UI, reproduce via a targeted ingest:
uv run python -c "from pathlib import Path; from src.models.processing import IngestionConfig, IngestionInput; from src.processing.ingestion_pipeline import ingest_documents_sync; p=Path('path/to/problem-file.pdf'); r=ingest_documents_sync(IngestionConfig(cache_dir=Path('./cache/ingestion-debug')), [IngestionInput(document_id='debug', source_path=p, metadata={'source': p.name})]); print(f'nodes={len(r.nodes)} exports={len(r.exports)}')"
```

#### 6. vLLM Server Connectivity Issues

```bash
# Confirm the app is pointing at the right server
echo "$DOCMIND_LLM_BACKEND"
echo "$DOCMIND_OPENAI__BASE_URL"

# vLLM is OpenAI-compatible; this should return JSON.
curl --fail --silent "$DOCMIND_OPENAI__BASE_URL/models" | head
```

Notes:

- vLLM does not support Windows natively; use WSL2 or run vLLM on a Linux host.
- vLLM performance features (FlashInfer, FP8 KV cache) are configured on the vLLM server process, not inside this app.

#### 7. PyTorch Compatibility Issues

This repo pins **PyTorch 2.8.0** for reproducibility. If you need CUDA wheels, install with the CUDA index:

```bash
uv pip install torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cu128
uv run python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
```

#### 8. GPU Memory Issues (16 GB RTX 4090 Laptop GPU)

```bash
# Reduce GPU memory utilization (or set the same key in .env)
export DOCMIND_VLLM__GPU_MEMORY_UTILIZATION=0.75  # Reduce from 0.85

# Monitor GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=1

# Clear the PyTorch CUDA cache. This only affects the process that calls it,
# so restart the app or the vLLM server to release memory they are holding.
uv run python -c "import torch; torch.cuda.empty_cache()"
```

#### 9.
Performance Validation ```bash # Run performance validation script uv run python scripts/performance_monitor.py --run-tests --check-regressions --report ``` ### Performance Optimization 1. **Enable GPU acceleration** in the Settings page 2. **Use appropriate model sizes** for your hardware 3. **Enable caching** to speed up repeat analysis 4. **Adjust chunk sizes** based on document complexity 5. **Use hybrid search** for better retrieval quality ### Getting Help - Check logs in `logs/` directory for detailed errors - Review [troubleshooting FAQ](docs/user/troubleshooting-faq.md) - Search existing [GitHub Issues](https://github.com/BjornMelin/docmind-ai-llm/issues) - Open a new issue with: steps to reproduce, error logs, system info ## How to Cite If you use DocMind AI in your research or work, please cite it as follows: ```bibtex @software{melin_docmind_ai_2025, author = {Melin, Bjorn}, title = {DocMind AI: Local LLM for AI-Powered Document Analysis}, url = {https://github.com/BjornMelin/docmind-ai-llm}, version = {0.1.0}, year = {2025} } ``` ## Contributing Contributions are welcome! Please follow these steps: 1. **Fork the repository** and create a feature branch 2. **Set up development environment:** ```bash git clone https://github.com/your-username/docmind-ai-llm.git cd docmind-ai-llm uv sync --group dev ``` 3. **Make your changes** following the established patterns 4. **Run tests and linting:** ```bash # Lint & format uv run ruff format . uv run ruff check . --fix uv run pyright --threads 4 # Fast tiered validation (unit + integration) uv run python scripts/run_tests.py --fast # Coverage gate uv run python scripts/run_tests.py --coverage # Quality gates (CI-style report) uv run python scripts/run_quality_gates.py --ci --report ``` 5. 
**Submit a pull request** with clear description of changes ### Development Guidelines - Follow PEP 8 style guide (enforced by Ruff) - Add type hints for all functions - Include docstrings for public APIs - Write tests for new functionality - Update documentation as needed #### Tests and CI We use a tiered test strategy and keep everything offline by default: - Unit (fast, offline): mocks only; no network/GPU. - Integration (offline): component interactions; router uses a session-autouse MockLLM fixture in `tests/integration/conftest.py`, preventing any Ollama/remote calls. - System/E2E (optional): heavier flows beyond the PR quality gates. Quick local commands: ```bash # Fast unit + integration sweep (offline) uv run python scripts/run_tests.py --fast # Extras (multimodal) lane - skips automatically when optional deps missing uv run python scripts/run_tests.py --extras # Full coverage gate (unit + integration) uv run python scripts/run_tests.py --coverage # Targeted module or pattern uv run python scripts/run_tests.py tests/unit/persistence/test_snapshot_manager.py ``` Default `pytest` invocations now run without implicit coverage gates. Use the scripted `--coverage` workflow (or run `coverage report`) when you need HTML, XML, or JSON artifacts for CI or local analysis. CI pipeline mirrors this flow using `uv run python scripts/run_tests.py --fast` as a quick gate followed by `--coverage` for the full report. This keeps coverage thresholds stable while still surfacing integration regressions early. See ADR-014 for quality gates/validation and ADR-029 for the boundary-first testing strategy. See the [Developer Handbook](docs/developers/developer-handbook.md) for detailed guidelines. For an overview of the unit test layout and fixture strategy, see tests/README.md. ## License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 
## Observability DocMind AI configures OpenTelemetry tracing and metrics via `configure_observability` (see SPEC-012). - Observability is disabled by default; enable with `DOCMIND_OBSERVABILITY__ENABLED=true`. - OTLP exporters are used when enabled; set `DOCMIND_OBSERVABILITY__ENDPOINT` and `DOCMIND_OBSERVABILITY__PROTOCOL` as needed. - LlamaIndex instrumentation requires the `observability` extra (`uv sync --frozen --extra observability`). - Core spans cover ingestion runs, snapshot promotion, GraphRAG exports, router selection, and UI actions. - Telemetry events (`router_selected`, `export_performed`, `snapshot_stale_detected`) are persisted as JSONL for local audits. For a local metrics smoke test, run: ```bash uv run python scripts/demo_metrics_console.py ``` Use `tests/unit/telemetry/test_observability_config.py` as a reference for wiring custom exporters in extensions. --- <div align="center"> Built by [Bjorn Melin](https://bjornmelin.io) </div>

Knowledge Bases & RAG Document Management

129 GitHub Stars

Open Source

ai-job-scraper

# 🕵️‍♂️ AI Job Scraper: Local-First, Privacy-Focused Job Search ![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B?style=for-the-badge&logo=streamlit&logoColor=white)![vLLM](https://img.shields.io/badge/vLLM-2C2C2C?style=for-the-badge)![Docker](https://img.shields.io/badge/Docker-2496ED?style=for-the-badge&logo=docker&logoColor=white)![SQLite](https://img.shields.io/badge/SQLite-003B57?style=for-the-badge&logo=sqlite&logoColor=white) [![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://choosealicense.com/licenses/mit/) [![GitHub](https://img.shields.io/badge/GitHub-BjornMelin-181717?logo=github)](https://github.com/BjornMelin) [![LinkedIn](https://img.shields.io/badge/LinkedIn-BjornMelin-0077B5?logo=linkedin)](https://www.linkedin.com/in/bjorn-melin/) AI Job Scraper is a modern, open-source Python application designed to automate and streamline your job search for roles in the AI and Machine Learning industry. Built with **local-first AI processing** using Qwen/Qwen3-4B-Instruct-2507-FP8, it automatically scrapes job postings from top AI companies and provides a powerful Streamlit interface—all while ensuring your data remains completely private and local. 
## ✨ Key Features * **🤖 Local-First AI Processing:** Utilizes Qwen/Qwen3-4B-Instruct-2507-FP8 with FP8 quantization on RTX 4090 for fast, private job analysis * **⚡ 2-Tier Scraping Strategy:** Combines `ScrapeGraphAI` for intelligent company page scraping with `JobSpy` for structured job board data * **🔍 SQLite FTS5 Search:** Full-text search with porter stemming, <10ms response times scaling to 500K+ records * **🎨 Streamlit Native UI:** Modern card-based interface with real-time updates via st.session_state and threading * **🚀 Non-Blocking Background Tasks:** Real-time progress tracking with st.status while maintaining UI responsiveness * **⚡ High-Performance Caching:** st.cache_data for <100ms filter operations on 5000+ job records * **🏢 Smart Database Sync:** Content hash-based synchronization engine that preserves user data during updates * **📊 DuckDB Analytics:** Zero-ETL analytics via sqlite_scanner - no separate database maintenance * **🛡️ Privacy-First Architecture:** All processing happens locally - no personal data leaves your machine * **🐳 Docker Ready:** Complete containerization with GPU support for one-command deployment ## 🏗️ Architecture ### **Data Collection** * **Multi-Source Scraping:** JobSpy for major job boards + ScrapeGraphAI for company pages * **Proxy Integration:** Residential proxy integration with rotation * **Background Processing:** Non-blocking scraping with real-time progress updates * **AI Extraction:** AI-powered parsing for unstructured job postings ### **Local-First AI Processing** * **Local LLM:** Qwen/Qwen3-4B-Instruct-2507-FP8 with FP8 quantization * **Cloud Fallback:** GPT-4o-mini for complex tasks (>8K tokens) * **Hardware:** RTX 4090 Laptop GPU (16GB VRAM) with 90% utilization * **Inference:** vLLM >=0.6.2 with CUDA >=12.1 support * **Unified Client:** LiteLLM for seamless local/cloud routing ### **Technology Stack** * **Backend:** Python 3.12+, SQLModel ORM, threading-based background tasks * **Frontend:** Streamlit with 
native caching (st.cache_data), fragments, and real-time updates * **Database:** SQLite 3.38+ with WAL mode, FTS5 search, DuckDB 0.9.0+ sqlite_scanner analytics * **AI Processing:** LiteLLM unified client + Instructor + vLLM >=0.6.2 with FP8 support * **Analytics:** DuckDB sqlite_scanner for zero-ETL analytics, SQLModel cost tracking * **Deployment:** Docker + Docker Compose with GPU support, uv package management ### **Performance Characteristics** * **Search:** 5-15ms FTS5 queries (1K jobs), 50-300ms (500K jobs) with BM25 ranking * **AI Processing:** <2s local vLLM inference, 98% local processing rate, 8K token routing threshold * **GPU Utilization:** 90% efficiency with RTX 4090 FP8 quantization and continuous batching * **UI Rendering:** <100ms filter operations via st.cache_data, <200ms job card display * **Scalability:** Tested capacity 500K job records (1.3GB database), single-user architecture * **Analytics:** DuckDB sqlite_scanner for direct SQLite analytics queries * **Cost:** $25-30/month operational cost breakdown: AI $2.50, proxies $20, misc $5 * **Memory:** FP8 quantization for optimal 16GB VRAM utilization ```mermaid graph TD subgraph "UI Layer - Streamlit Native" UI_APP[Streamlit App] UI_CARDS[Mobile-First Card Interface] UI_SEARCH[FTS5 Search with BM25] UI_STATUS[Visual Status Indicators] UI_FRAGMENTS[Auto-refresh Fragments] UI_ANALYTICS[Analytics Dashboard] end subgraph "Search & Analytics" SEARCH_FTS5[SQLite FTS5 + Porter Stemming] SEARCH_UTILS[sqlite-utils Integration] ANALYTICS_SMART[Automatic Method Selection] ANALYTICS_DUCK[DuckDB sqlite_scanner] ANALYTICS_CACHE[Streamlit Native Caching] ANALYTICS_COST[Real-time Cost Tracking] end subgraph "AI Processing Layer" AI_LITELLM[LiteLLM Unified Client] AI_LOCAL[Qwen3-4B Local] AI_CLOUD[GPT-4o-mini Cloud] AI_INSTRUCTOR[Instructor Validation] end subgraph "Data Collection" SCRAPE_JOBSPY[JobSpy - 90% Coverage] SCRAPE_AI[ScrapeGraphAI - 10% Coverage] PROXY_IPROYAL[IPRoyal Residential Proxies] end 
subgraph "Database Layer" DB_SQLITE[SQLite + SQLModel] DB_SYNC[Database Sync Engine] DB_CACHE[Content Hash Detection] end UI_APP --> UI_CARDS UI_CARDS --> UI_SEARCH UI_SEARCH --> SEARCH_FTS5 UI_STATUS --> ANALYTICS_SMART ANALYTICS_SMART --> ANALYTICS_CACHE SEARCH_FTS5 --> SEARCH_UTILS SEARCH_UTILS --> DB_SQLITE ANALYTICS_SMART --> DB_SQLITE ANALYTICS_SMART --> ANALYTICS_DUCK SCRAPE_JOBSPY --> AI_LITELLM SCRAPE_AI --> AI_LITELLM AI_LITELLM --> AI_LOCAL AI_LITELLM --> AI_CLOUD AI_INSTRUCTOR --> DB_SYNC DB_SYNC --> DB_SQLITE DB_CACHE --> DB_SQLITE SCRAPE_JOBSPY --> PROXY_IPROYAL style UI_APP fill:#e1f5fe style SEARCH_FTS5 fill:#e8f5e8 style AI_LITELLM fill:#f3e5f5 style DB_SQLITE fill:#fff3e0 ``` ## 🚀 Installation ### **Requirements** * **GPU:** RTX 4090 Laptop GPU with 16GB VRAM * **Software:** CUDA >=12.1, Python 3.12+ * **Tools:** Docker + Docker Compose, uv package manager ### **Installation** 1. **Clone the repository:** ```bash git clone https://github.com/BjornMelin/ai-job-scraper.git cd ai-job-scraper ``` 2. **Install dependencies with uv:** ```bash uv sync ``` 3. **Set up environment variables:** ```bash cp .env.example .env # Edit .env with your API keys (optional for local-only mode) ``` 4. **Initialize the database:** ```bash uv run python -m src.seed seed ``` 5. **Start the application:** ```bash uv run streamlit run src/app.py ``` 6. 
**Open your browser** and navigate to `http://localhost:8501` ### **Docker Deployment** For containerized deployment with GPU support: ```bash # Build and run with Docker Compose docker-compose up --build # Or run with GPU support docker run --gpus all -p 8501:8501 ai-job-scraper ``` ## 📊 Performance Our architecture delivers production-ready performance for personal-scale usage: * **Search Speed:** 5-300ms SQLite FTS5 queries (scales with dataset size: 1K-500K records) * **AI Processing:** Local processing <2s response time, 98% local processing rate * **UI Operations:** <100ms filter operations via Streamlit native caching * **Real-time Updates:** Non-blocking progress with st.rerun() + session_state during background scraping * **GPU Efficiency:** 90% utilization with FP8 quantization on RTX 4090 (16GB VRAM) * **Database Performance:** SQLite handles 500K+ records, DuckDB analytics via sqlite_scanner * **Cost Control:** $25-30/month operational costs with real-time budget monitoring * **Memory Management:** FP8 quantization for optimal VRAM utilization with continuous batching ## 🔧 Configuration ### **AI Processing Configuration** The application uses a hybrid local/cloud approach: * **Local Model:** Qwen/Qwen3-4B-Instruct-2507-FP8 with automatic model download * **Inference:** vLLM >=0.6.2 with FP8 quantization for RTX 4090 optimization * **Token Routing:** 8K context window threshold measured via tiktoken * **Cloud Fallback:** LiteLLM unified client with GPT-4o-mini for complex tasks (>8K tokens) * **Memory:** 16GB VRAM with 90% utilization and continuous batching * **Processing Rate:** 98% local processing, <2% cloud fallback ### **Data Sources & Collection** * **Structured Sources:** JobSpy for LinkedIn, Indeed, Glassdoor (90% coverage) * **Unstructured Sources:** ScrapeGraphAI for company career pages (10% coverage) * **Proxy Integration:** IPRoyal residential proxies with native JobSpy compatibility * **Rate Limiting:** Respectful scraping with 
configurable delays and user-agent rotation * **Resilience:** Native HTTPX transport retries, eliminates custom retry logic * **Background Tasks:** Python threading.Thread with Streamlit st.status integration ### **Analytics & Monitoring Configuration** Built-in analytics and cost tracking: * **Search:** SQLite FTS5 handles 500K+ records with porter stemming * **Analytics:** DuckDB sqlite_scanner for zero-ETL analytics queries * **Database:** SQLite primary storage with WAL mode, DuckDB analytics via direct scanning * **Caching:** Session-based st.cache_data → Persistent cache layers with configurable TTL * **UI:** Streamlit fragments for auto-refresh, modern card-based interface * **Cost Control:** Real-time $50 budget monitoring with automated alerts at 80% and 100% ## 📚 Documentation * **[Product Requirements Document (PRD)](./docs/PRD.md):** Complete feature specifications and technical requirements * **[User Guide](./docs/user/user-guide.md):** Learn how to use all application features * **[Developer Guide](./docs/developers/developer-guide.md):** Architecture overview and contribution guidelines * **[Deployment Guide](./docs/developers/deployment.md):** Production deployment instructions ## 🛠️ Development Built with modern Python practices: * **Package Management:** uv (not pip) * **Code Quality:** ruff for linting and formatting * **Testing:** pytest with >80% coverage target * **Architecture:** KISS > DRY > YAGNI principles * **Timeline:** 1-week deployment target achieved ### **Development Setup** ```bash # Install dependencies uv sync # Run linting and formatting ruff check . --fix ruff format . # Run tests uv run pytest ``` ## 🤝 Contributing Contributions are welcome! 
Our development philosophy prioritizes: * **Library-first approaches** over custom implementations * **Simplicity and maintainability** over complex abstractions * **Local-first processing** for privacy and performance * **Modern Python patterns** with comprehensive type hints Please fork the repository, create a feature branch, and open a pull request. See the [Developer Guide](./docs/developers/developer-guide.md) for detailed contribution guidelines. ## 📃 License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. --- > **Built with ❤️ for the AI/ML community | Privacy-first | Local-first | Open source**

AI Agents Browser Automation

33 GitHub Stars

Open Source

openai-agents-travel-graph

# OpenAI Agents Travel Graph 🧳 ✈️ 🗺️ A state-of-the-art multi-agent travel planning system powered by OpenAI Agents SDK and LangGraph orchestration. This system autonomously researches and plans comprehensive trips with optimized budgets, personalized recommendations, and real-time data through intelligent browser automation. ## Table of Contents - [OpenAI Agents Travel Graph 🧳 ✈️ 🗺️](#openai-agents-travel-graph--️-️) - [Table of Contents](#table-of-contents) - [Overview](#overview) - [Key Features](#key-features) - [Technology Stack](#technology-stack) - [System Architecture](#system-architecture) - [Installation](#installation) - [Usage](#usage) - [Development](#development) - [Contributing](#contributing) - [License](#license) - [How to Cite](#how-to-cite) ## Overview OpenAI Agents Travel Graph is an advanced AI-powered travel planning system that leverages the latest in multi-agent technology to automate the entire travel planning process. The system orchestrates specialized agents to handle different aspects of travel planning, from destination research and flight bookings to accommodation selection and activity planning. By combining the power of the OpenAI Agents SDK with graph-based orchestration through LangGraph, the system can maintain complex workflows while providing personalized travel recommendations that meet user preferences and budget constraints. 
## Key Features ```mermaid mindmap root((Travel Planning System)) (Multi-Agent Architecture) [Destination Research] [Flight Search] [Accommodation] [Transportation] [Activities] (Browser Automation) [Self-healing] [Parallel execution] [Data extraction] (Budget Management) [Cost tracking] [Value optimization] [Alternative options] (Personalization) [Preference analysis] [Custom recommendations] (Knowledge Storage) [Persistent memory] [Entity relationships] ``` - 🤖 **Multi-Agent Architecture** - Specialized agents for different travel planning aspects - 💰 **Budget Optimization** - Intelligent allocation of budget across travel components - 🔍 **Real-time Research** - Autonomous web research for current travel information - 🌐 **Browser Automation** - Intelligent interaction with travel websites - 📋 **Detailed Itineraries** - Day-by-day schedules with activities and logistics - 💼 **Personalization** - Tailored recommendations based on user preferences - 🔄 **Alternative Suggestions** - Multiple options with comparisons - 📊 **Budget Breakdowns** - Transparent cost allocation and justification ## Technology Stack - **Primary Framework**: [OpenAI Agents SDK](https://github.com/openai/openai-agents-python) (Latest 2025 Release) - Core agent framework - **Orchestration**: [LangGraph v0.4+](https://github.com/langchain-ai/langgraph) - Multi-agent workflow management - **Browser Automation**: [Stagehand v2.0+](https://github.com/browserbase/stagehand) - AI-enhanced browser control - **Data Persistence**: [Supabase](https://supabase.com/) - Database and storage - **Research Tools**: - [Firecrawl](https://firecrawl.dev/) - Web content extraction - [Tavily API](https://tavily.com/) - Intelligent search - [Context7](https://context7.com/) - Documentation access - **Memory Management**: Memory MCP Server - Persistent context across sessions For the full technology stack and detailed system architecture, see [Architecture & Requirements](docs/architecture-requirements.md). 
## System Architecture The system follows a comprehensive multi-layered architecture with specialized agents coordinated through LangGraph: ```mermaid flowchart TD %% Main system layers UI[User Interface Layer] --> OL %% Orchestration layer subgraph OL[Orchestration Layer - LangGraph] SM[State Management] AWM[Agent Workflow Management] CP[Context Preservation] end %% Specialized agents OL --> DRA[Destination Research Agent] OL --> FSA[Flight Search Agent] OL --> ASA[Accommodation Search Agent] OL --> TPA[Transportation Planning Agent] OL --> APA[Activity Planning Agent] %% Browser automation & budget management DRA & FSA & ASA & TPA & APA --> BAL[Browser Automation Layer] BAL --> BMA[Budget Management Agent] BMA --> KML[Knowledge & Memory Layer] KML --> PS[Persistent Storage - Supabase] %% Styling classDef systemLayer fill:#f9f9f9,stroke:#333,stroke-width:2px classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px class UI,OL,BAL,KML,PS systemLayer class DRA,FSA,ASA,TPA,APA,BMA agent ``` Each specialized agent uses a combination of LLM capabilities and domain-specific tools to perform its tasks, with the orchestration layer maintaining state and ensuring proper handoffs between agents. 
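The coordination pattern described here (shared state, a research step, a fan-out to independent agents, then a budget review) can be sketched with plain `asyncio`. This is a minimal illustration only: it uses neither LangGraph nor the OpenAI Agents SDK, and `TripState`, `run_agent`, and `plan_trip` are hypothetical names invented for the example rather than parts of this project.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TripState:
    """Shared state the orchestrator hands from agent to agent (hypothetical)."""

    request: str
    results: dict[str, str] = field(default_factory=dict)


async def run_agent(name: str, state: TripState) -> tuple[str, str]:
    """Stand-in for a real agent call (LLM plus domain tools)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return name, f"{name} options for: {state.request}"


async def plan_trip(request: str) -> TripState:
    state = TripState(request=request)

    # Step 1: destination research runs first; later agents depend on it.
    name, output = await run_agent("destination", state)
    state.results[name] = output

    # Step 2: independent searches fan out concurrently.
    parallel = ("flight", "accommodation", "transportation", "activities")
    gathered = await asyncio.gather(*(run_agent(n, state) for n in parallel))
    for name, output in gathered:
        state.results[name] = output

    # Step 3: the budget review sees every earlier result.
    name, output = await run_agent("budget", state)
    state.results[name] = output
    return state


if __name__ == "__main__":
    final = asyncio.run(plan_trip("Tokyo for a week in October"))
    for agent, summary in final.results.items():
        print(f"{agent}: {summary}")
```

Keeping all results on a single state object is what lets a later step, such as the budget review, see the outputs of every earlier agent; a graph framework adds persistence and interruption points on top of the same idea.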
### Agent Interaction Flow

```mermaid
sequenceDiagram
    participant User
    participant Orchestrator as Orchestration Agent
    participant Destination as Destination Agent
    participant Flight as Flight Agent
    participant Hotel as Hotel Agent

    User->>Orchestrator: Travel Request
    Orchestrator->>Destination: Research Request
    Destination-->>Orchestrator: Destination Options
    par Flight & Hotel Search
        Orchestrator->>Flight: Search Flights
        Orchestrator->>Hotel: Search Accommodations
    end
    Flight-->>Orchestrator: Flight Options
    Hotel-->>Orchestrator: Accommodation Options
    Orchestrator->>User: Complete Itinerary
    Note over User,Hotel: Human feedback loop can interrupt at any stage
```

For the complete detailed architecture diagram and component descriptions, see [Architecture & Requirements](docs/architecture-requirements.md).

## Installation

```bash
# Clone the repository
git clone https://github.com/BjornMelin/openai-agents-travel-graph.git
cd openai-agents-travel-graph

# Set up a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv pip install -e ".[dev]"

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys and configuration

# Set up Supabase database
python supabase_setup.py init
```

### Supabase Setup

This project uses Supabase for data persistence. You need to set up a Supabase project and configure it with the database schema.
The `supabase_setup.py` script automates this process:

```bash
# Initialize the database with required tables and indexes
python supabase_setup.py init

# Check the status of your Supabase database
python supabase_setup.py status

# Reset the database (WARNING: This will delete all data)
python supabase_setup.py reset

# Create test data for development
python supabase_setup.py init --test-data
```

For more information about the database schema, see [Migrations README](travel_planner/data/migrations/README.md).

## Usage

Run the travel planner in interactive mode:

```bash
python -m travel_planner.main
```

Or provide a query directly:

```bash
python -m travel_planner.main --query "I want to visit Tokyo for a week in October" --origin "New York" --budget "3000-5000"
```

For more options:

```bash
python -m travel_planner.main --help
```

## Development

If you'd like to contribute to the development of this project, please follow these guidelines:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for your changes
5. Run the test suite with `uv run pytest`
6. Submit a pull request

This project uses [uv](https://github.com/astral-sh/uv) as its Python package manager. For all Python-related commands, use `uv run` (e.g., `uv run pytest`, `uv run python script.py`).

### Package Management

This project now uses `pyproject.toml` for dependency management. The legacy `requirements.txt` files are kept only for backward compatibility.

To set up your development environment with uv:

```bash
# Create a virtual environment and install all dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

# Run linting and type checking
uv run ruff check .
uv run mypy .

# Format code
uv run black .
uv run isort .
```

See the [CONTRIBUTING.md](CONTRIBUTING.md) file for more details.

## Contributing

Contributions are welcome!
Please see our [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to this project. ## License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## How to Cite If you use this project in your research or work, please cite it as: ```plaintext Melin, B. (2025). OpenAI Agents Travel Graph: A multi-agent system for autonomous travel planning. GitHub repository. https://github.com/BjornMelin/openai-agents-travel-graph ``` **BibTeX:** ```bibtex @misc{openai-agents-travel-graph, author = {Melin, Bjorn}, title = {OpenAI Agents Travel Graph: A multi-agent system for autonomous travel planning}, year = {2025}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/BjornMelin/openai-agents-travel-graph}} } ```

AI Agents Browser Automation

15 GitHub Stars