searcharvester

About searcharvester

Self-hosted search + markdown harvester for AI agents. SearXNG (100+ engines) + FastAPI + trafilatura. Tavily-compatible /search plus /extract with size presets and pagination. One-command Docker Compose.

Platforms

Web Self-hosted Docker

Languages

Python

Searcharvester 🌾

Self-hosted search + extract + deep research for AI agents

📖 Docs: English · Русский · 中文

Three composable HTTP services in a single docker compose up:

  • /search: Tavily-compatible search via SearXNG (100+ engines)
  • /extract: URL → clean markdown via trafilatura, with size presets and pagination
  • /research: deep research agent that takes a question and returns a cited markdown report

No search API keys, no quotas, fully self-hosted; only /research needs credentials, for an LLM endpoint of your choice. Pre-built image on GHCR.

🚀 Quick start

# 1. Clone
git clone git@github.com:vakovalskii/searcharvester.git
cd searcharvester

# 2. Config
cp config.example.yaml config.yaml
# Change server.secret_key (32+ chars)

# 3. (Optional) LLM credentials for /research: any OpenAI-compatible endpoint
cat > .env <<EOF
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
EOF

# 4. Start (pulls ghcr.io/vakovalskii/searcharvester)
docker compose up -d

# 5. Test search
curl -X POST localhost:8000/search -H 'Content-Type: application/json' \
  -d '{"query":"bitcoin price","max_results":3}'

# 6. Test extract (URL → markdown)
curl -X POST localhost:8000/extract -H 'Content-Type: application/json' \
  -d '{"url":"https://en.wikipedia.org/wiki/Docker_(software)","size":"m"}'

# 7. Test deep research (needs LLM creds from step 3)
curl -X POST localhost:8000/research -H 'Content-Type: application/json' \
  -d '{"query":"What is trafilatura? One paragraph with source."}'
# → {"job_id":"...","status":"queued"}
# Poll GET /research/{job_id} until status=completed, grab the report.

🧱 Three services, one API

1️⃣ POST /search: Tavily-compatible search

Drop-in replacement for the Tavily API:

from tavily import TavilyClient
client = TavilyClient(api_key="ignored", base_url="http://localhost:8000")
response = client.search(query="...", max_results=5, include_raw_content=True)

Request body:

{
  "query": "...",
  "max_results": 10,
  "include_raw_content": false,
  "engines": "google,duckduckgo,brave",
  "categories": "general"
}

Response: Tavily schema (see docs/en/api.md).
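If you would rather not depend on the Tavily client, the same request can be sent with the Python standard library. This is a minimal sketch, assuming the adapter listens on localhost:8000 as in the quick start; `build_search_request` and `search` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

ADAPTER_URL = "http://localhost:8000"  # assumption: default port from the quick start


def build_search_request(query, max_results=10, include_raw_content=False,
                         engines=None, categories="general"):
    """Build a POST /search request using the body fields listed above."""
    body = {
        "query": query,
        "max_results": max_results,
        "include_raw_content": include_raw_content,
        "categories": categories,
    }
    if engines:
        # Engines travel as one comma-separated string, as in the example body.
        body["engines"] = ",".join(engines)
    return urllib.request.Request(
        f"{ADAPTER_URL}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **options):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, **options)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Calling `search("bitcoin price", max_results=3)` against a running stack should return a Tavily-shaped JSON object; the exact response fields are documented in docs/en/api.md rather than here.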

2️⃣ POST /extract: URL → clean markdown

Takes a URL, fetches the HTML, runs trafilatura for main-content extraction (strips nav/footer/ads, preserves headings, lists, tables, links), returns ready-to-use markdown.

Size presets for different context windows:

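| Size | Chars  | Use case                                                          |
|------|--------|-------------------------------------------------------------------|
| s    | 5 000  | Quick summary, small-context LLMs                                 |
| m    | 10 000 | Default agent reading                                             |
| l    | 25 000 | Deep single-page read                                             |
| f    | full   | Paginated in 25 000-character pages; read long docs piece by piece |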
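One way to read the presets is as character budgets applied to the extracted markdown. The sketch below illustrates that reading only; it assumes pages are plain character slices, which the README does not confirm, and `apply_size` is a hypothetical name.

```python
import math

# Character budgets from the table above; "f" means full text, paginated.
SIZE_LIMITS = {"s": 5_000, "m": 10_000, "l": 25_000}
PAGE_CHARS = 25_000


def apply_size(markdown, size="m", page=1):
    """Return (content, current_page, total_pages) for a size preset."""
    if size in SIZE_LIMITS:
        # Fixed presets cut the document at the character budget.
        return markdown[:SIZE_LIMITS[size]], 1, 1
    if size != "f":
        raise ValueError(f"unknown size preset: {size!r}")
    total = max(1, math.ceil(len(markdown) / PAGE_CHARS))
    if not 1 <= page <= total:
        raise ValueError(f"page {page} out of range 1..{total}")
    start = (page - 1) * PAGE_CHARS
    return markdown[start:start + PAGE_CHARS], page, total
```

Under this assumption a 90 000-character document with size "f" yields four pages, the last one holding the remaining 15 000 characters.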

Pagination via cache:

# Get id + page 1
curl -X POST localhost:8000/extract -H 'Content-Type: application/json' \
  -d '{"url":"...","size":"f"}'
# → {"id":"abc123","content":"...","pages":{"current":1,"total":4,"next":"/extract/abc123/2"}}

# Next pages are served from the cache, no re-download
curl localhost:8000/extract/abc123/2

Cache keyed by md5(url)[:16], TTL 30 minutes. Cold fetch: 1-3 s; cached page: <50 ms.
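The key derivation and expiry rule can be sketched in a few lines. This is an in-memory illustration, not the project's cache: it assumes the key is the first 16 hex characters of the MD5 digest of the UTF-8 encoded URL, and `PageCache` is a hypothetical class.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 30 * 60  # TTL stated above


def cache_key(url):
    """First 16 hex characters of the URL's MD5 digest (assumed: UTF-8 input)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16]


class PageCache:
    """In-memory stand-in that stores extracted markdown per URL."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def put(self, url, markdown):
        key = cache_key(url)
        self._entries[key] = (self._clock() + self._ttl, markdown)
        return key

    def get(self, key):
        expires_at, markdown = self._entries.get(key, (0.0, None))
        if markdown is None or self._clock() >= expires_at:
            # Expired or unknown: drop the entry so the caller re-fetches.
            self._entries.pop(key, None)
            return None
        return markdown
```

Keying on a hash of the URL keeps the id short enough to embed in paths such as /extract/abc123/2, while the TTL bounds how stale a paginated read can become.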

Useful as a standalone service, not just for the agent: plug it into any LLM pipeline that needs clean page text.

3️⃣ POST /research: deep research agent

{query} → orchestrator spawns an ephemeral Hermes Agent container with three skills:

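| Skill                        | Role                                                                                 |
|------------------------------|--------------------------------------------------------------------------------------|
| searcharvester-search        | Tool: calls our /search                                                              |
| searcharvester-extract       | Tool: calls our /extract                                                             |
| searcharvester-deep-research | Methodology (markdown only, no code): plan → gather → gap-check → synthesise → verify |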

The agent reads the methodology, plans sub-queries, loops search→extract, synthesises a markdown report with [1][2] citations, saves it to /workspace/report.md. The orchestrator watches for the REPORT_SAVED: marker and returns the file to the client.
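The marker check described above could look roughly like the following. It is a sketch under assumptions: the function name `collect_report` is hypothetical, and the real orchestrator may inspect logs and files differently.

```python
from pathlib import Path

REPORT_MARKER = "REPORT_SAVED:"  # marker named in the text above


def collect_report(log_text, workspace):
    """Return the report text if the agent signalled success, else None."""
    marker_seen = any(REPORT_MARKER in line for line in log_text.splitlines())
    report_path = Path(workspace) / "report.md"
    if not marker_seen or not report_path.is_file():
        # Either the agent never announced the report or the file is missing.
        return None
    return report_path.read_text(encoding="utf-8")
```

Requiring both the marker and the file guards against a half-written report left behind by an agent that crashed before finishing.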

LLM-agnostic: works with any OpenAI-compatible endpoint, including OpenAI, OpenRouter, Anthropic (via LiteLLM), vLLM, Ollama, and LM Studio.

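# Async flow: submit the job, then poll until it leaves the queued/running states
JOB=$(curl -sX POST localhost:8000/research -H 'Content-Type: application/json' \
  -d '{"query":"compare vLLM vs SGLang"}' | jq -r .job_id)
while true; do
  R=$(curl -s localhost:8000/research/$JOB)
  STATUS=$(echo "$R" | jq -r .status)
  # A new job starts as "queued", so wait on that status as well as "running"
  case "$STATUS" in queued|running) sleep 5; continue ;; esac
  echo "$R" | jq -r .report
  break
done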
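The same polling loop can be written in Python. This is a sketch with assumptions: `fetch_job` and `wait_for_report` are hypothetical names, the only statuses taken from this README are queued, running and completed, and any other status is treated as final and returned to the caller as-is.

```python
import json
import time
import urllib.request

PENDING = {"queued", "running"}  # statuses that mean "keep waiting"


def fetch_job(job_id, base_url="http://localhost:8000"):
    """GET /research/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/research/{job_id}", timeout=30) as resp:
        return json.load(resp)


def wait_for_report(job_id, fetch=fetch_job, interval=5.0, timeout=600.0,
                    sleep=time.sleep):
    """Poll until the job leaves the pending states, then return the payload."""
    waited = 0.0
    while True:
        job = fetch(job_id)
        if job.get("status") not in PENDING:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
        waited += interval
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running stack.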

🧱 Stack: how the services are wired

Four always-running containers + one ephemeral per research job.

                        HOST (Mac / Linux server)
╔══════════════════════════════════════════════════════════════════════════════╗
║                                                                              ║
║    Files on disk (bind-mounted into containers):                             ║
║    ┌──────────────┐  ┌─────────────────┐  ┌────────────────────────────┐     ║
║    │ config.yaml  │  │ hermes-data/    │  │ jobs/{job_id}/             │     ║
║    │ (SearXNG +   │  │  skills/        │  │  plan.md                   │     ║
║    │  adapter)    │  │   searcharv-*   │  │  notes.md                  │     ║
║    │              │  │  config.yaml    │  │  report.md                 │     ║
║    │              │  │  sessions/ ...  │  │  hermes.log                │     ║
║    └──────┬───────┘  └──────┬──────────┘  └──────────┬─────────────────┘     ║
║           │ ro (bind)       │ rw (bind)              │ rw (bind)             ║
║           ▼                 ▼                        ▼                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  DOCKER ENGINE                                                         │  ║
║  │                                                                        │  ║
║  │  ┌──────────────────── network: searxng (bridge) ──────────────────┐   │  ║
║  │  │                                                                 │   │  ║
║  │  │  ┌────────────────┐    internal HTTP   ┌───────────────────┐    │   │  ║
║  │  │  │ tavily-adapter │◀──────────────────▶│ searxng           │    │   │  ║
║  │  │  │ :8000 (exposed)│   /search?format=  │ :8080             │    │   │  ║
║  │  │  │                │      json          │ (:8999 exposed)   │    │   │  ║
║  │  │  │ FastAPI:       │                    └────────┬──────────┘    │   │  ║
║  │  │  │  /search       │                             │ RESP          │   │  ║
║  │  │  │  /extract      │                             ▼               │   │  ║
║  │  │  │  /research     │                    ┌─────────────┐          │   │  ║
║  │  │  │  /health       │                    │ valkey      │          │   │  ║
║  │  │  │                │                    │ (redis)     │          │   │  ║
║  │  │  │ + trafilatura  │                    │ (cache)     │          │   │  ║
║  │  │  │ + orchestrator │                    └─────────────┘          │   │  ║
║  │  │  └───────┬────────┘                                             │   │  ║
║  │  │          │ Docker HTTP API                                      │   │  ║
║  │  │          │ (create / start /                                    │   │  ║
║  │  │          │  kill / rm / logs / wait)                            │   │  ║
║  │  │          ▼                                                      │   │  ║
║  │  │  ┌──────────────────────┐                                       │   │  ║
║  │  │  │ docker-socket-proxy  │  Whitelist:                           │   │  ║
║  │  │  │ :2375                │   CONTAINERS=1 POST=1 IMAGES=1        │   │  ║
║  │  │  │                      │   (everything else denied)            │   │  ║
║  │  │  └──────────┬───────────┘                                       │   │  ║
║  │  └─────────────│───────────────────────────────────────────────────┘   │  ║
║  │                │                                                       │  ║
║  │                │ reads (ro) /var/run/docker.sock                       │  ║
║  │                │  - adapter itself never touches it                    │  ║
║  │                │                                                       │  ║
║  │                ▼                                                       │  ║
║  │        (host docker daemon)                                            │  ║
║  │                │                                                       │  ║
║  │                │ spawns ephemeral container                            │  ║
║  │                ▼                                                       │  ║
║  │        ┌───────────────────────────────────────────────┐               │  ║
║  │        │ hermes-agent  (EPHEMERAL, one per /research)  │               │  ║
║  │        │                                               │               │  ║
║  │        │   /opt/data   ← hermes-data bind mount        │               │  ║
║  │        │   /workspace  ← jobs/{job_id} bind mount      │               │  ║
║  │        │                                               │               │  ║
║  │        │   Env: OPENAI_API_KEY, OPENAI_BASE_URL,       │               │  ║
║  │        │        SEARCHARVESTER_URL                     │               │  ║
║  │        │                                               │               │  ║
║  │        │   Skills loaded at startup:                   │               │  ║
║  │        │     - searcharvester-deep-research            │               │  ║
║  │        │     - searcharvester-search                   │               │  ║
║  │        │     - searcharvester-extract                  │               │  ║
║  │        │                                               │               │  ║
║  │        │   Exits 0 → container --rm                    │               │  ║
║  │        └──┬────────────────────┬───────────────────────┘               │  ║
║  │           │                    │                                       │  ║
║  │           │                    │ HTTP via host.docker.internal:8000    │  ║
║  │           │                    └─────────▶ tavily-adapter above        │  ║
║  │           │                      (calls our /search and /extract)      │  ║
║  │           │                                                            │  ║
║  │           │                                                            │  ║
║  └───────────│────────────────────────────────────────────────────────────┘  ║
║              │                                                               ║
╚══════════════│═══════════════════════════════════════════════════════════════╝
               │ HTTPS
               ▼
      ┌─────────────────────────────────────────────┐
      │  EXTERNAL SERVICES                          │
      │                                             │
      │  • LLM endpoint                             │
      │    (OpenAI, OpenRouter, Anthropic,          │
      │     vLLM, Ollama, or any                    │
      │     OpenAI-compatible API)                  │
      │                                             │
      │  • Search engines                           │
      │    (Google, DuckDuckGo, Brave, ...          │
      │     ← queried by searxng)                   │
      │                                             │
      │  • Target websites                          │
      │    (← scraped by tavily-adapter /extract    │
      │       and by /search with raw_content=true) │
      └─────────────────────────────────────────────┘

Key points:

  • tavily-adapter sees the Docker API only through docker-socket-proxy, never /var/run/docker.sock directly. If the adapter is ever compromised, the attacker gets the whitelisted container operations and nothing else.
  • Every /research call = a fresh, short-lived Hermes container. After the agent exits, --rm wipes it. No cross-session state leakage.
  • The spawned Hermes container reaches back to tavily-adapter via host.docker.internal:8000 (using extra_hosts=host-gateway). It is not on the searxng network.
  • /workspace inside Hermes = jobs/{job_id}/ on the host. Everything the agent writes there (plan, notes, report, log) is readable by the adapter after the job finishes.
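To make the points above concrete, here is a sketch of the container-create body an orchestrator might send through the socket proxy. The field names (Image, Env, HostConfig.Binds, HostConfig.ExtraHosts, HostConfig.AutoRemove) come from the Docker Engine API; the function name, the image name and the host paths are placeholders, since this README does not state them.

```python
def hermes_container_spec(job_id, jobs_root, hermes_data, env, image):
    """Build a Docker Engine API container-create body for one research job."""
    return {
        "Image": image,  # hypothetical: pass whatever Hermes image you run
        # Docker expects environment variables as "KEY=value" strings.
        "Env": [f"{key}={value}" for key, value in sorted(env.items())],
        "HostConfig": {
            "Binds": [
                f"{hermes_data}:/opt/data",          # skills and agent config
                f"{jobs_root}/{job_id}:/workspace",  # plan, notes, report, log
            ],
            # Lets the container reach the adapter on the host without
            # joining the searxng network.
            "ExtraHosts": ["host.docker.internal:host-gateway"],
            # Equivalent of `docker run --rm`: remove the container on exit.
            "AutoRemove": True,
        },
    }
```

Because each job gets its own jobs/{job_id} bind mount, the adapter can read the agent's output from the host after the container has been removed.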

/research flow (sequence)

Client                tavily-adapter           socket-proxy      hermes (ephemeral)    LLM / web
  │                         │                       │                   │                  │
  │──POST /research────────▶│                       │                   │                  │
  │   {query}               │─ generate job_id      │                   │                  │
  │                         │─ mkdir jobs/{id}      │                   │                  │
  │                         │──create container────▶│──docker daemon──▶ (spawn)            │
  │                         │──start container─────▶│                   │                  │
  │◀─202 {job_id, queued}───│                       │                   │                  │
  │                         │                       │                   │──load skills     │
  │                         │                       │                   │──chat with LLM──▶│
  │                         │                       │                   │◀──tool_call──────│
  │                         │                       │                   │    "search(...)" │
  │                         │                       │                   │                  │
  │                         │◀──HTTP /search────────────────────────────│                  │
  │                         │──SearXNG query                            │                  │
  │                         │──results JSON────────────────────────────▶│                  │
  │                         │                                           │                  │
  │                         │                       │                   │──chat───────────▶│
  │                         │                       │                   │◀──tool_call──────│
  │                         │                       │                   │    "extract(url)"│
  │                         │◀──HTTP /extract───────────────────────────│                  │
  │                         │──trafilatura → md────────────────────────▶│                  │
  │                         │                                           │                  │
  │                         │                       │                   │──chat───────────▶│
  │                         │                       │                   │◀──tool: bash──   │
  │                         │                       │                   │    "cat > /workspace/report.md"
  │                         │                       │                   │     + print "REPORT_SAVED:"
  │                         │                       │                   │──exit 0 (--rm)   │
  │                         │◀─container done───────│                   │                  │
  │                         │──read logs + report.md                    │                  │
  │                         │─ check REPORT_SAVED marker                │                  │
  │                         │─ status = completed                       │                  │
  │                                                                                        │
  │  (polling in parallel)                                                                 │
  │──GET /research/{id}────▶│                                                              │
  │◀─200 {completed, report}│                                                              │

For C4 diagrams in Mermaid (Context / Container / Component + Deployment), see docs/en/architecture.md.

🧪 Tests

Written TDD-style (tests first, then implementation):

  • 12 unit tests for the orchestrator with a fake Docker client
  • 7 FastAPI route tests with mocked orchestrator
  • 1 E2E test (real Hermes + real LLM)

docker compose exec tavily-adapter pytest tests/test_orchestrator.py tests/test_research_api.py -q
# 19 passed in ~3s
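The fake-client approach mentioned above can be illustrated with a toy example. Everything here is hypothetical: `FakeDockerClient`, `run_job` and the "failed" status are stand-ins invented for the sketch and are not taken from the project's test suite.

```python
class FakeDockerClient:
    """Records calls instead of talking to a Docker daemon."""

    def __init__(self, exit_code=0, log_text=""):
        self.calls = []
        self._exit_code = exit_code
        self._log_text = log_text

    def create(self, spec):
        self.calls.append(("create", spec["Image"]))
        return "fake-container-id"

    def start(self, container_id):
        self.calls.append(("start", container_id))

    def wait(self, container_id):
        self.calls.append(("wait", container_id))
        return self._exit_code

    def logs(self, container_id):
        self.calls.append(("logs", container_id))
        return self._log_text


def run_job(client, spec):
    """Toy orchestrator step: create, start, wait, then classify the outcome."""
    container_id = client.create(spec)
    client.start(container_id)
    exit_code = client.wait(container_id)
    succeeded = exit_code == 0 and "REPORT_SAVED:" in client.logs(container_id)
    return "completed" if succeeded else "failed"


def test_completed_job_follows_expected_call_order():
    client = FakeDockerClient(log_text="REPORT_SAVED: /workspace/report.md")
    assert run_job(client, {"Image": "example/agent"}) == "completed"
    assert [name for name, _ in client.calls] == ["create", "start", "wait", "logs"]
```

Injecting the client keeps such tests fast and independent of a Docker daemon, which fits the roughly three-second runtime quoted above.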

🎯 SimpleQA smoke bench

Stratified sample of 20 questions from OpenAI's SimpleQA:

  • 6/6 correct on the first six questions; the run was interrupted before the remaining 14. The next benchmark round will run in parallel and use an LLM judge.
  • 30–120 s/question on gpt-oss-120b via an external vLLM

Harness in bench/.

🎯 Why this vs. hosted services

|                   | Tavily / Exa / You.com | Searcharvester          |
|-------------------|------------------------|-------------------------|
| 💰 Cost           | Paid                   | Free (compute only)     |
| 🔑 Keys           | Required               | None                    |
| 📊 Quotas         | Yes                    | None                    |
| 🏒 Data location  | External               | Your host               |
| 🎛 Search sources | Opaque                 | You control the engines |
| 🤖 Deep research  | Add-on product         | Built-in via /research  |

βš™οΈ Configuration

config.yaml is a single file shared by SearXNG and the adapter. See CONFIG_SETUP.md and docs/en/getting-started.md.

LLM credentials for /research go in .env (or in the environment of whoever runs docker compose up). They are passed through only to the spawned Hermes container.

🐳 Pre-built image

Published to GitHub Container Registry (public):

  • ghcr.io/vakovalskii/searcharvester:latest
  • ghcr.io/vakovalskii/searcharvester:2.1.0

docker-compose.yaml uses image: by default, so no build is needed. For local dev: docker compose up --build.

🔧 Development

# Adapter: rebuild after any change for fast iteration
cd simple_tavily_adapter
docker compose build tavily-adapter && docker compose up -d

# Run tests
docker compose exec tavily-adapter pytest -q

# Tail logs
docker compose logs -f tavily-adapter

📜 License

MIT on our code. AGPL on upstream SearXNG artifacts (Caddyfile, limiter.toml).

🔗 https://github.com/vakovalskii/searcharvester