Searcharvester
Self-hosted search + extract + deep research for AI agents
Docs: English · Russian · Chinese
Three composable HTTP services in a single docker compose up:
- /search: Tavily-compatible search via SearXNG (100+ engines)
- /extract: URL → clean markdown via trafilatura, with size presets and pagination
- /research: deep research agent; give it a question, get back a cited markdown report
No API keys, no quotas, fully self-hosted. Pre-built image on GHCR.
Quick start
# 1. Clone
git clone git@github.com:vakovalskii/searcharvester.git
cd searcharvester
# 2. Config
cp config.example.yaml config.yaml
# Change server.secret_key (32+ chars)
# 3. (Optional) LLM credentials for /research: any OpenAI-compatible endpoint
cat > .env <<EOF
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
EOF
# 4. Start (pulls ghcr.io/vakovalskii/searcharvester)
docker compose up -d
# 5. Test search
curl -X POST localhost:8000/search -H 'Content-Type: application/json' \
-d '{"query":"bitcoin price","max_results":3}'
# 6. Test extract (URL → markdown)
curl -X POST localhost:8000/extract -H 'Content-Type: application/json' \
-d '{"url":"https://en.wikipedia.org/wiki/Docker_(software)","size":"m"}'
# 7. Test deep research (needs LLM creds from step 3)
curl -X POST localhost:8000/research -H 'Content-Type: application/json' \
-d '{"query":"What is trafilatura? One paragraph with source."}'
# → {"job_id":"...","status":"queued"}
# Poll GET /research/{job_id} until status=completed, grab the report.
Three services, one API
1. POST /search: Tavily-compatible search
Drop-in replacement for the Tavily API:
from tavily import TavilyClient
client = TavilyClient(api_key="ignored", base_url="http://localhost:8000")
response = client.search(query="...", max_results=5, include_raw_content=True)
Request body:
{
"query": "...",
"max_results": 10,
"include_raw_content": false,
"engines": "google,duckduckgo,brave",
"categories": "general"
}
Response: Tavily schema (see docs/en/api.md).
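For callers that do not use the Tavily SDK, the request body above can be assembled and posted with the standard library alone. This is a minimal sketch: the helper names `build_search_payload` and `search` are hypothetical, and treating `engines` as optional is an assumption, not something the API reference here confirms.

```python
import json
import urllib.request


def build_search_payload(query, max_results=10, include_raw_content=False,
                         engines=None, categories="general"):
    """Assemble the JSON body for POST /search from the fields listed above."""
    payload = {
        "query": query,
        "max_results": max_results,
        "include_raw_content": include_raw_content,
        "categories": categories,
    }
    if engines:
        # The request body carries engines as one comma-separated string.
        payload["engines"] = ",".join(engines)
    return payload


def search(base_url, **kwargs):
    """POST the payload to {base_url}/search and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(build_search_payload(**kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With the stack running, `search("http://localhost:8000", query="bitcoin price", max_results=3)` should return a dictionary in the Tavily schema.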
2. POST /extract: URL → clean markdown
Takes a URL, fetches the HTML, runs trafilatura for main-content extraction (strips nav/footer/ads, preserves headings, lists, tables, links), returns ready-to-use markdown.
Size presets for different context windows:
| Size | Chars | Use case |
|---|---|---|
| s | 5 000 | Quick summary, small-context LLMs |
| m | 10 000 | Default agent reading |
| l | 25 000 | Deep single-page read |
| f | full | Paginated by 25 000: read long docs piece by piece |
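The presets can be read as simple character budgets. The sketch below assumes plain character slicing; the real service may cut at paragraph boundaries or behave differently, and `apply_size` is a hypothetical name.

```python
PRESETS = {"s": 5_000, "m": 10_000, "l": 25_000}
PAGE_SIZE = 25_000  # size "f" is served in pages of this many characters


def apply_size(markdown: str, size: str = "m", page: int = 1):
    """Return (content, total_pages) for one preset; pages are 1-indexed."""
    if size in PRESETS:
        # Fixed presets truncate to the budget and always fit in one page.
        return markdown[:PRESETS[size]], 1
    if size != "f":
        raise ValueError(f"unknown size preset: {size!r}")
    total = max(1, -(-len(markdown) // PAGE_SIZE))  # ceiling division
    if not 1 <= page <= total:
        raise ValueError(f"page {page} out of range 1..{total}")
    start = (page - 1) * PAGE_SIZE
    return markdown[start:start + PAGE_SIZE], total
```

Under these assumptions a 60 000-character document yields three pages with size "f", the last one holding the remaining 10 000 characters.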
Pagination via cache:
# Get id + page 1
curl -X POST localhost:8000/extract -H 'Content-Type: application/json' \
-d '{"url":"...","size":"f"}'
# → {"id":"abc123","content":"...","pages":{"current":1,"total":4,"next":"/extract/abc123/2"}}
# Next pages: no re-download
curl localhost:8000/extract/abc123/2
Cache keyed by md5(url)[:16], TTL 30 minutes. Cold fetch: 1-3 s; cached page: <50 ms.
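The cache key and expiry described above can be sketched as follows. It assumes that `md5(url)[:16]` means the first 16 hex characters of the digest, and it uses an in-memory dictionary as a stand-in for whatever store the service really uses; `TTLCache` is a hypothetical class.

```python
import hashlib
import time

TTL_SECONDS = 30 * 60  # 30-minute TTL, as stated above


def cache_key(url: str) -> str:
    """First 16 hex characters of the URL's MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16]


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._items = {}

    def put(self, url, markdown):
        """Store extracted markdown and return the id used for later pages."""
        key = cache_key(url)
        self._items[key] = (self._clock() + self._ttl, markdown)
        return key

    def get(self, key):
        """Return cached markdown, or None if missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, markdown = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return markdown
```

Because the key depends only on the URL, requesting the same page again within the TTL maps to the same id, which is what lets later pages skip the download.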
Useful as a standalone service, not just for the agent: plug it into any LLM pipeline that needs clean page text.
3. POST /research: deep research agent
{query} → orchestrator spawns an ephemeral Hermes Agent container with three skills:
| Skill | Role |
|---|---|
| searcharvester-search | Tool: calls our /search |
| searcharvester-extract | Tool: calls our /extract |
| searcharvester-deep-research | Methodology (markdown only, no code): plan → gather → gap-check → synthesise → verify |
The agent reads the methodology, plans sub-queries, loops search → extract, synthesises a markdown report with [1][2] citations, and saves it to /workspace/report.md. The orchestrator watches for the REPORT_SAVED: marker and returns the file to the client.
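Detecting the marker in the container logs could look like the sketch below. What follows `REPORT_SAVED:` on the line is not specified here, so returning the remainder of the line is an assumption, and `find_report_marker` is a hypothetical helper rather than the orchestrator's actual code.

```python
from typing import Optional

MARKER = "REPORT_SAVED:"


def find_report_marker(logs: str) -> Optional[str]:
    """Return the text after the last REPORT_SAVED: marker, or None if absent."""
    for line in reversed(logs.splitlines()):
        idx = line.find(MARKER)
        if idx != -1:
            # Assumption: anything after the marker is treated as its payload.
            return line[idx + len(MARKER):].strip()
    return None
```

Scanning from the end means a later marker wins if the agent happens to print more than one.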
LLM-agnostic: works with any OpenAI-compatible endpoint: OpenAI, OpenRouter, Anthropic (via LiteLLM), vLLM, Ollama, LM Studio.
# Async flow
JOB=$(curl -sX POST localhost:8000/research -H 'Content-Type: application/json' \
-d '{"query":"compare vLLM vs SGLang"}' | jq -r .job_id)
while true; do
R=$(curl -s localhost:8000/research/$JOB)
STATUS=$(echo "$R" | jq -r .status)
# Keep polling while the job is still queued or running
case "$STATUS" in queued|running) sleep 5; continue ;; esac
echo "$R" | jq -r .report
break
done
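The same polling loop can be sketched in Python. It treats both `queued` and `running` as pending and stops on any other status; the function names, the timeout, and the injectable `fetch` callable are assumptions made for illustration and testing.

```python
import json
import time
import urllib.request

PENDING = {"queued", "running"}


def fetch_job(base_url, job_id):
    """GET /research/{job_id} and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/research/{job_id}", timeout=30) as resp:
        return json.load(resp)


def wait_for_report(job_id, fetch, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch(job_id) until the status leaves the pending set, then return the job."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") not in PENDING:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
```

A typical call would be `wait_for_report(job_id, lambda j: fetch_job("http://localhost:8000", j))`, after which the caller reads the `report` field once the status is `completed`.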
Stack: how the services are wired
Four always-running containers + one ephemeral per research job.
HOST (Mac / Linux server)

Files on disk (bind-mounted into containers):
+--------------+   +-----------------+   +--------------------+
| config.yaml  |   | hermes-data/    |   | jobs/{job_id}/     |
| (SearXNG +   |   |   skills/       |   |   plan.md          |
|  adapter)    |   |     searcharv-* |   |   notes.md         |
|              |   |   config.yaml   |   |   report.md        |
|              |   |   sessions/ ... |   |   hermes.log       |
+------+-------+   +--------+--------+   +---------+----------+
       | ro (bind)          | rw (bind)            | rw (bind)
       v                    v                      v
+------------------------------------------------------------------------+
| DOCKER ENGINE                                                          |
|                                                                        |
| +-------------------- network: searxng (bridge) ---------------------+ |
| |                                                                    | |
| | +------------------+   internal HTTP      +------------------+     | |
| | | tavily-adapter   |--------------------->| searxng          |     | |
| | | :8000 (exposed)  |  /search?format=     | :8080            |     | |
| | |                  |       json           | (:8999 exposed)  |     | |
| | | FastAPI:         |                      +--------+---------+     | |
| | |   /search        |                               | RESP          | |
| | |   /extract       |                               v               | |
| | |   /research      |                      +------------------+     | |
| | |   /health        |                      | valkey (redis)   |     | |
| | |                  |                      | (cache)          |     | |
| | | + trafilatura    |                      +------------------+     | |
| | | + orchestrator   |                                               | |
| | +--------+---------+                                               | |
| |          | Docker HTTP API                                         | |
| |          | (create / start / kill / rm / logs / wait)              | |
| |          v                                                         | |
| | +----------------------+                                           | |
| | | docker-socket-proxy  |  Whitelist:                               | |
| | | :2375                |  CONTAINERS=1 POST=1 IMAGES=1             | |
| | |                      |  (everything else denied)                 | |
| | +----------+-----------+                                           | |
| +------------+-------------------------------------------------------+ |
|              |                                                         |
|              | reads (ro) /var/run/docker.sock                         |
|              |   (adapter itself never touches it)                     |
|              v                                                         |
|     (host docker daemon)                                               |
|              |                                                         |
|              | spawns ephemeral container                              |
|              v                                                         |
| +-----------------------------------------------+                      |
| | hermes-agent (EPHEMERAL, one per /research)   |                      |
| |                                               |                      |
| | /opt/data  <- hermes-data bind mount          |                      |
| | /workspace <- jobs/{job_id} bind mount        |                      |
| |                                               |                      |
| | Env: OPENAI_API_KEY, OPENAI_BASE_URL,         |                      |
| |      SEARCHARVESTER_URL                       |                      |
| |                                               |                      |
| | Skills loaded at startup:                     |                      |
| |   - searcharvester-deep-research              |                      |
| |   - searcharvester-search                     |                      |
| |   - searcharvester-extract                    |                      |
| |                                               |                      |
| | Exits 0 -> container --rm                     |                      |
| +--+--------------------+-----------------------+                      |
|    |                    |                                              |
|    |                    | HTTP via host.docker.internal:8000           |
|    |                    +----------> tavily-adapter above              |
|    |                                 (calls our /search and /extract)  |
|    |                                                                   |
+----+-------------------------------------------------------------------+
     | HTTPS
     v
+---------------------------------------------+
| EXTERNAL SERVICES                           |
|                                             |
| * LLM endpoint                              |
|   (OpenAI, OpenRouter, Anthropic,           |
|    vLLM, Ollama, or any other               |
|    OpenAI-compatible API)                   |
|                                             |
| * Search engines                            |
|   (Google, DuckDuckGo, Brave, ...;          |
|    queried by searxng)                      |
|                                             |
| * Target websites                           |
|   (scraped by tavily-adapter /extract       |
|    and by /search with raw_content=true)    |
+---------------------------------------------+
Key points:
- tavily-adapter sees the Docker API only through docker-socket-proxy, never /var/run/docker.sock directly. If the adapter is ever compromised, the attacker gets whitelisted container ops and nothing else.
- Every /research call = a fresh, short-lived Hermes container. After the agent exits, --rm wipes it. No cross-session state leakage.
- The spawned Hermes container reaches back to tavily-adapter via host.docker.internal:8000 (uses extra_hosts=host-gateway). It's not on the searxng network.
- /workspace inside Hermes = jobs/{job_id}/ on the host. Everything the agent writes there (plan, notes, report, log) is readable by the adapter after the job finishes.
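To make the key points concrete, here is a rough sketch of a container-create body that an orchestrator could send through the socket proxy. The field names (`Image`, `Env`, `HostConfig.Binds`, `HostConfig.ExtraHosts`, `HostConfig.AutoRemove`) come from the Docker Engine API; the function name, image tag, and host paths are hypothetical, and the real orchestrator may build its request differently.

```python
import json


def hermes_container_spec(job_id, image, host_root, env):
    """Build a Docker Engine API 'create container' body for one research job."""
    return {
        "Image": image,  # hypothetical image name supplied by the caller
        "Env": [f"{k}={v}" for k, v in sorted(env.items())],
        "HostConfig": {
            "Binds": [
                f"{host_root}/hermes-data:/opt/data:rw",
                f"{host_root}/jobs/{job_id}:/workspace:rw",
            ],
            # Lets the container reach the adapter on the host's port 8000.
            "ExtraHosts": ["host.docker.internal:host-gateway"],
            # Equivalent of `docker run --rm`: remove the container on exit.
            "AutoRemove": True,
        },
    }


if __name__ == "__main__":
    spec = hermes_container_spec(
        "job-123",
        "example/hermes-agent:latest",  # placeholder, not the project's image
        "/srv/searcharvester",          # placeholder host directory
        {"SEARCHARVESTER_URL": "http://host.docker.internal:8000"},
    )
    print(json.dumps(spec, indent=2))
```

Only the job directory and hermes-data are mounted, so whatever the agent writes outside those paths disappears with the container.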
/research flow (sequence)
Participants: Client, tavily-adapter, socket-proxy, hermes (ephemeral), LLM / web

 1. Client         -> tavily-adapter : POST /research {query}
 2. tavily-adapter                   : generate job_id, mkdir jobs/{id}
 3. tavily-adapter -> socket-proxy   : create container (docker daemon spawns hermes)
 4. tavily-adapter -> socket-proxy   : start container
 5. tavily-adapter -> Client         : 202 {job_id, queued}
 6. hermes                           : load skills
 7. hermes         -> LLM            : chat
 8. LLM            -> hermes         : tool_call "search(...)"
 9. hermes         -> tavily-adapter : HTTP /search
10. tavily-adapter                   : SearXNG query
11. tavily-adapter -> hermes         : results JSON
12. hermes         -> LLM            : chat
13. LLM            -> hermes         : tool_call "extract(url)"
14. hermes         -> tavily-adapter : HTTP /extract
15. tavily-adapter -> hermes         : trafilatura -> md
16. hermes         -> LLM            : chat
17. LLM            -> hermes         : tool: bash "cat > /workspace/report.md" + print "REPORT_SAVED:"
18. hermes                           : exit 0 (--rm)
19. socket-proxy   -> tavily-adapter : container done
20. tavily-adapter                   : read logs + report.md, check REPORT_SAVED marker, status = completed

(polling in parallel)
    Client         -> tavily-adapter : GET /research/{id}
    tavily-adapter -> Client         : 200 {completed, report}
For C4 diagrams in Mermaid (Context / Container / Component + Deployment), see docs/en/architecture.md.
Tests
Written TDD-style (tests first, then implementation):
- 12 unit tests for the orchestrator with a fake Docker client
- 7 FastAPI route tests with mocked orchestrator
- 1 E2E test (real Hermes + real LLM)
docker compose exec tavily-adapter pytest tests/test_orchestrator.py tests/test_research_api.py -q
# 19 passed in ~3s
SimpleQA smoke bench
Stratified sample of 20 questions from OpenAI's SimpleQA:
- 6/6 correct on the first six (the rest of the run was interrupted; the next benchmark round is parallel + LLM-judge)
- 30-120 s/question on gpt-oss-120b via an external vLLM
Harness in bench/.
Why this vs. hosted services
| | Tavily / Exa / You.com | Searcharvester |
|---|---|---|
| Cost | Paid | Free (compute only) |
| Keys | Required | None |
| Quotas | Yes | None |
| Data location | External | Your host |
| Search sources | Opaque | You control the engines |
| Deep research | Add-on product | Built-in via /research |
Configuration
config.yaml: a single file shared by SearXNG and the adapter. See CONFIG_SETUP.md and docs/en/getting-started.md.
LLM credentials for /research go in .env (or the environment of whoever runs docker compose up); they are only passed through to the spawned Hermes container.
Pre-built image
Published to GitHub Container Registry (public):
- ghcr.io/vakovalskii/searcharvester:latest
- ghcr.io/vakovalskii/searcharvester:2.1.0
docker-compose.yaml uses image: by default, so no build is needed. For local dev: docker compose up --build.
Development
# Adapter: rebuild after any change for fast iteration
cd simple_tavily_adapter
docker compose build tavily-adapter && docker compose up -d
# Run tests
docker compose exec tavily-adapter pytest -q
# Tail logs
docker compose logs -f tavily-adapter
License
MIT on our code. AGPL on upstream SearXNG artifacts (Caddyfile, limiter.toml).