Samuraizer — Cyber‑Security Knowledge Base Engine

NotebookLM on steroids — purpose-built for security researchers.


💡 Why Samuraizer?

Every security researcher knows the feeling — you find an interesting GitHub repo, a fresh CVE writeup, a blog post about a new exploitation technique. You forward it to yourself on WhatsApp. It immediately drowns in the chat. Weeks later you actually need that article — and it's gone.

Stop sending yourself links you'll never find again — send them to Samuraizer once, and they're summarized, tagged, and searchable forever.

Before 😵	After 🗡️
_{Scattered links drowning in chat history}	_{Analyzed, tagged, and ready to be found}

🔐 Local-first privacy (local-only feature)

🎯 Extra privacy mode (local only): run with Ollama, and all data stays on your machine.

🗄️ Local vector storage in samuraizer.db + local AI embeddings via Ollama.
⚡ Works fully offline once models are pulled (ollama pull qwen3:4b, ollama pull qwen3-embedding:8b).
✅ Great for security teams and researchers who require air-gapped/isolated setups.

🧩 What you get (at-a-glance)

🔍 Analyze — Paste URLs, watch results stream

📝 Paste one or more URLs (GitHub repos, CVE writeups, blog posts, YouTube videos) — results stream back in real time
📄 Upload files directly from the browser or Telegram — full text extracted, analyzed, stored, and viewable in the UI
- supported formats: .pdf, .docx, .pptx, .txt, .md
🗞️ Blog scanner: paste a blog homepage and extract all article links for batch analysis in one click
✨ Suggested Read: a relevant unread entry is surfaced on the Analyze tab each session to keep your queue moving
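The blog scanner's link extraction can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the project's actual code: the names `LinkCollector` and `extract_article_links` are hypothetical, and the rule "keep same-host links, drop the homepage itself" is an assumption about what counts as an article link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_article_links(homepage_url, html):
    """Return same-host links from html, made absolute and de-duplicated."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(homepage_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop fragments such as #comments
        absolute = urljoin(homepage_url, href).split("#")[0]
        if urlparse(absolute).netloc != host:
            continue  # assumption: external links are not articles
        if absolute.rstrip("/") == homepage_url.rstrip("/"):
            continue  # skip links back to the homepage itself
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

The resulting list could then be submitted for batch analysis one URL at a time.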

🗂️ Knowledge Base

✏️ Inline tag editing (add/remove tags on entries, feeds, and list items)
🔎 Semantic search (vector search over Gemini or local Ollama embeddings) + classic full-text search
🧩 Tag cloud + multi-filtering (by tag, category, source, list, read/useful)
📚 List management — group entries into manual lists, RSS lists, or channel lists
👁️‍🗨️ Hover preview (summary cards) and quick copy buttons
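For readers new to vector search, the idea behind semantic search can be sketched in a few lines: embed the query, then rank stored entries by cosine similarity. The function names and the in-memory list of `(entry_id, embedding)` pairs are illustrative assumptions; how Samuraizer actually stores and ranks vectors in samuraizer.db is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, entries, top_k=3):
    """Rank (entry_id, embedding) pairs against query_vec, best first."""
    scored = [(eid, cosine_similarity(query_vec, vec)) for eid, vec in entries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The query and the stored entries must be embedded with the same model, since vectors from different models are not comparable.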

🗺️ Knowledge Graph

Visualize your entire knowledge base as an interactive force-directed graph
Entries and tags are nodes — edges show which tags link to which articles
Click to preview an entry; double-click to open the original URL
Color-coded by category (CVE, article, tool, video, blog, etc.)
Search tags to highlight related clusters across the graph
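The node-and-edge model described above can be sketched as a small transformation from entries to graph data. The field names (`id`, `category`, `tags`) and the `entry:` / `tag:` id prefixes are assumptions made for illustration, not the project's actual schema.

```python
def build_graph(entries):
    """Turn entries into (nodes, edges) for a force-directed graph.

    entries: list of dicts with 'id', 'category', and 'tags' (assumed shape).
    """
    nodes, edges, seen_tags = [], [], set()
    for entry in entries:
        entry_node = f"entry:{entry['id']}"
        # The category is kept on the node so the UI can color-code it
        nodes.append({"id": entry_node, "kind": "entry", "category": entry["category"]})
        for tag in entry["tags"]:
            tag_node = f"tag:{tag}"
            if tag not in seen_tags:
                seen_tags.add(tag)
                nodes.append({"id": tag_node, "kind": "tag"})
            # One edge per (tag, entry) pair
            edges.append((tag_node, entry_node))
    return nodes, edges
```

Entries that share a tag end up connected through the same tag node, which is what produces the visible clusters.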

📡 RSS Feeds & YouTube Subscriptions

Add RSS/Atom feeds — the server polls hourly and auto-ingests and summarizes new posts
New posts are automatically added to the Knowledge Base
Each feed becomes its own list, making it easy to batch-review
Feed items show source metadata and can be tagged/filtered like any entry
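The core of a polling cycle is deciding which feed items are new. The sketch below illustrates that step with the standard library's XML parser so it stays self-contained; the project itself lists feedparser in its stack, and the name `new_items` and the "link as identity" rule are assumptions. It handles plain RSS 2.0 `<item>` elements only, not namespaced Atom entries.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen_links):
    """Return RSS items whose link is not in seen_links, and record them.

    feed_xml: RSS 2.0 document as a string.
    seen_links: set of links already ingested (mutated in place).
    """
    # Note: untrusted feeds may warrant a hardened XML parser.
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": title, "link": link})
    return fresh
```

Calling this once per hour for each feed, and sending every returned link through the normal analysis path, would reproduce the behavior described above.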

🎥 YouTube Channel Subscriptions

Subscribe to YouTube channels via URL (e.g. https://www.youtube.com/@handle, /channel/UCxxx)
Preview latest videos before subscribing and select which videos to analyze
On subscribe, selected videos are analyzed immediately; future uploads are auto-polled hourly
Runs via the /yt-channels API and appears in the UI under the RSS/YT sections

🤖 Telegram Bot (Optional)

Send any URL to the bot — it analyzes it through the same backend and returns a formatted card
Send a PDF file — it downloads, analyzes, and returns a result card with a link to view/download the file
Live progress updates streamed as the analysis runs
Receive Suggested Read notifications: the bot proactively surfaces unread entries

_{Analyzing a URL}

_{Daily Suggested Read}

💬 Chat (RAG + streaming + pinned context)

Ask questions over your knowledge base — answers are cited from the best matching entries
⚡ Streaming responses with live typing and per-source relevance scores
🗂️ Multiple chat sessions with saved history and model selection
📌 Pin specific articles as context — type @ for autocomplete or use the @ browse button
- When entries are pinned, the model answers only from those articles, with no RAG noise
- Pinned entries appear as chips above the input; sources show a 📌 badge instead of a score
- Perfect for deep-diving a specific PDF, writeup, or CVE
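The difference between pinned context and ordinary RAG retrieval comes down to one branch. The following sketch shows that decision under assumed data shapes (`pinned` as a list of entry ids, `retrieved` as `(entry_id, score)` pairs); the function name `select_context` is hypothetical.

```python
def select_context(pinned, retrieved, top_k=5):
    """Choose which entries are passed to the model as context.

    pinned: entry ids the user pinned with @ (may be empty).
    retrieved: (entry_id, relevance_score) pairs from vector search.
    """
    if pinned:
        # Pinned entries bypass retrieval entirely, so they carry no score
        return [{"id": eid, "pinned": True, "score": None} for eid in pinned]
    best = sorted(retrieved, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [{"id": eid, "pinned": False, "score": score} for eid, score in best]
```

This mirrors the UI behavior: pinned sources show a badge, retrieved sources show a relevance score.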

_{RAG chat with source scores}

_{Pinned-context chat}

[!IMPORTANT] Shape the Future of Samuraizer! We are currently voting on new features like Local LLMs and Obsidian export. Cast your vote here!

🏗 Architecture (high-level)

flowchart LR
  Browser[Browser UI] -->|HTTP| Frontend[React Frontend]
  Frontend -->|REST/NDJSON| Backend[Flask API]
  Backend -->|SQL| SQLite[(samuraizer.db)]
  Backend -->|API| Gemini[Gemini 2.5 Flash]
  Backend -->|local API| Ollama[Ollama - Local LLM]
  Backend -->|GitHub API| GitHub[GitHub]
  Backend -->|RSS| RSS[RSS feeds]
  Telegram[Telegram Bot] -->|HTTP| Backend

🧠 How it works (end-to-end)

Submit a URL or PDF via the web UI (or Telegram bot).
Backend determines the type (GitHub repo, blog post, RSS feed, PDF, etc.) and fetches/extracts content.
Content is sent to Gemini 2.5 Flash (or to the configured Ollama model in local mode) to generate:
- A concise summary
- A category and tags
- (Optionally) embeddings used for semantic search
Results are stored in samuraizer.db and surfaced in the frontend.
The frontend lets you:
- Filter by tags, category, source, list, read/useful flags
- Edit tags inline (updates persisted via PATCH /entries/<id>)
- Use semantic search (vector search over the stored embeddings)
RSS feeds are polled periodically; new posts are automatically ingested.
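The type-detection step in the pipeline above can be sketched as a simple URL classifier. The rules and category names here are illustrative assumptions rather than the backend's real logic, which may also inspect response headers or page content.

```python
from urllib.parse import urlparse

YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(url):
    """Guess the content type of a URL from its host and path (hypothetical rules)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()
    if host == "github.com":
        return "github"
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".xml", ".rss", ".atom")) or "/feed" in path:
        return "rss"
    return "article"  # default: treat as a blog post or writeup
```

Each label would then select a matching fetcher before the content is handed to the LLM.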

🧰 Tech Stack

Layer	Tech / Libraries
Backend	Python, Flask, SQLite, feedparser, PyMuPDF
LLM	Gemini 2.5 Flash (Gemini API), Ollama (optional, local)
Frontend	React 18, Vite, Tailwind CSS
Bot	python-telegram-bot v20
Transcripts	transcriptapi.com

🚀 Setup

Choose your preferred setup method:

⚙️ Local Setup (manual install)

0) Clone the repo 📥

git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer

1) Config 🔐

Copy .env.example to .env and fill in your values. You can also open the web UI, go to the Settings tab, and adjust the provider, API keys, Ollama model names, and embedding model names there before saving.

cp .env.example .env

Variable	Required	Where to get it
`LLM_PROVIDER`	No	`gemini` (default, cloud) or `ollama` (local)
`GEMINI_API_KEY`	When `gemini`	Google AI Studio → Get API key
`OLLAMA_URL`	When `ollama`	Ollama API URL (default: `http://localhost:11434`)
`OLLAMA_MODEL`	When `ollama`	Reasoning model (default: `qwen3:14b`)
`OLLAMA_EMBED_MODEL`	When `ollama`	Embedding model (default: `qwen3-embedding:8b`)
`TELEGRAM_BOT_TOKEN`	No	Create a bot with @BotFather on Telegram
`GITHUB_TOKEN`	No	GitHub → Settings → Developer settings → Personal access tokens — raises API rate limit from 60 to 5,000 req/hr
`TRANSCRIPTAPI`	No	transcriptapi.com/dashboard/api-keys — required for YouTube transcript fetching
`SAMURAIZER_URL`	No	URL of your backend (default: `http://localhost:8000`), used by the Telegram bot
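To make the "Required" column concrete, here is a rough sketch of how these variables could be read and validated at startup. The defaults are copied from the table above; the function name `load_config` and the validation behavior are assumptions, not the server's actual code.

```python
import os

# Defaults taken from the table above
DEFAULTS = {
    "LLM_PROVIDER": "gemini",
    "OLLAMA_URL": "http://localhost:11434",
    "OLLAMA_MODEL": "qwen3:14b",
    "OLLAMA_EMBED_MODEL": "qwen3-embedding:8b",
    "SAMURAIZER_URL": "http://localhost:8000",
}
OPTIONAL_KEYS = ("GEMINI_API_KEY", "TELEGRAM_BOT_TOKEN", "GITHUB_TOKEN", "TRANSCRIPTAPI")


def load_config(env=None):
    """Build a config dict from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    config = {key: env.get(key) or default for key, default in DEFAULTS.items()}
    for key in OPTIONAL_KEYS:
        config[key] = env.get(key) or None
    provider = config["LLM_PROVIDER"]
    if provider not in ("gemini", "ollama"):
        raise ValueError(f"LLM_PROVIDER must be 'gemini' or 'ollama', got {provider!r}")
    if provider == "gemini" and not config["GEMINI_API_KEY"]:
        raise ValueError("GEMINI_API_KEY is required when LLM_PROVIDER=gemini")
    return config
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise without touching the real environment.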

🏠 Local Mode (Ollama)

Run Samuraizer fully offline with Ollama:

# Install Ollama CLI (platform-dependent)
# macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
irm https://ollama.com/install.ps1 | iex

# Linux (Ubuntu/Debian)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

# Install models
ollama pull qwen3:4b            # reasoning
ollama pull qwen3-embedding:8b  # embeddings

# Set provider in .env
LLM_PROVIDER=ollama

If you already have Gemini embeddings, switching to Ollama will automatically wipe them on next startup (dimension mismatch). Re-embed via the UI or POST /entries/embed-all.
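The dimension mismatch arises because each embedding model emits vectors of a fixed length, and vectors of different lengths cannot be compared. A minimal sketch of such a check, with a hypothetical function name and toy dimensions:

```python
def stale_embedding_ids(stored, expected_dim):
    """Return ids of entries whose stored vector length differs from expected_dim.

    stored: dict mapping entry id to its embedding list (assumed shape).
    expected_dim: vector length produced by the currently configured model.
    """
    return sorted(eid for eid, vec in stored.items() if len(vec) != expected_dim)
```

Any ids returned would need to be re-embedded with the current model before semantic search can use them again.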

2) Install dependencies 📦

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

pip install -r requirements.txt

cd frontend
npm install
cd ..

3) Run backend ▶️

python server.py

4) Run frontend 🌐 (new terminal)

cd frontend
npm run dev

5) (Optional) Run Telegram bot 🤖 (new terminal)

python telegram_bot.py

🐳 Docker Setup (recommended)

The easiest way to run the full stack — one command builds and starts everything.

0) Clone the repo 📥

git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer

1) Config 🔐

cp .env.example .env
# Edit .env and fill in GEMINI_API_KEY (and any other values you need)

2) Build & run ▶️

All Docker files live in the docker/ folder. Run commands from the project root:

Gemini (cloud) — default:

docker compose -f docker/docker-compose.yml up -d

Ollama on CPU:

docker compose -f docker/docker-compose.yml --profile ollama up -d

Ollama on NVIDIA GPU:

docker compose -f docker/docker-compose.yml --profile ollama-nvidia up -d

Prerequisites for NVIDIA GPU

Windows (WSL 2):

Install the latest NVIDIA Game Ready / Studio drivers on Windows — no separate Linux driver needed inside WSL.
Install WSL 2: open PowerShell as Administrator and run:
```
wsl --install
wsl --update
```
Install Docker Desktop for Windows and in its settings:
- General → enable "Use the WSL 2 based engine"
- Resources → WSL Integration → enable for your distro

Docker Desktop for Windows bundles the NVIDIA container runtime automatically — no extra steps needed.

Linux:

Install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Ollama on AMD GPU (requires ROCm-capable GPU + amdgpu driver on Linux):

docker compose -f docker/docker-compose.yml --profile ollama-amd up -d

That's it. The frontend is at http://localhost and the API at http://localhost:8000.

Tip: When using any Ollama profile, set LLM_PROVIDER=ollama in your .env. The OLLAMA_URL is automatically set to http://ollama:11434 inside the containers — you don't need to change it.

After first start with Ollama, pull the models (use the container name matching the profile you started):

# --profile ollama
docker exec -it samuraizer-ollama-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-1 ollama pull qwen3-embedding:8b

# --profile ollama-nvidia
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3-embedding:8b

# --profile ollama-amd
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3-embedding:8b

Backend and frontend restart automatically on failure.
Docker stores the database and log file under the project's data/ directory and backups under db_backups/.
Ollama model files are stored in the ollama_data Docker volume (persists across restarts).
The Telegram bot is disabled by default. Combine profiles to enable it alongside Ollama:

docker compose -f docker/docker-compose.yml --profile ollama-nvidia --profile bot up -d

Updating

git pull
docker compose -f docker/docker-compose.yml up -d --build

Stopping

docker compose -f docker/docker-compose.yml down

📺 YouTube Transcript Fetching

Why not `youtube-transcript-api`?

The original implementation used the open-source youtube-transcript-api Python library. It works well locally but has a critical limitation in practice: YouTube aggressively blocks IP addresses that make automated transcript requests, especially:

IPs belonging to cloud providers / VPS hosts (AWS, GCP, Azure, Hetzner, etc.)
IPs that hit the transcript endpoint too frequently

This meant that after analyzing just a handful of videos, the whole server would get blocked and every subsequent transcript fetch would fail with an IPBlocked / RequestBlocked error — completely breaking YouTube video analysis.

Current solution: `transcriptapi.com`

Samuraizer now uses transcriptapi.com, a third-party paid API that fetches YouTube transcripts on its own infrastructure, which isn't blocked.

Pros:

No IP blocks — they manage the anti-bot problem for you
Simple REST API (GET /api/v2/youtube/transcript)
Free tier available; credits only charged on success (HTTP 200)
Retryable error codes (408 / 503) with clear semantics

Cons:

Not free beyond the free tier (credit-based billing)
External dependency — if their service is down, transcript fetching fails
Data goes through a third party

Setup: Add to .env:

TRANSCRIPTAPI=your_key_here

Get a key at transcriptapi.com/dashboard/api-keys.
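Because 408 and 503 are documented as retryable, a client can wrap the transcript call in a small retry loop. The sketch below is one way to do that under assumptions: `fetch` is a hypothetical zero-argument callable returning `(status_code, body)`, and the exponential backoff schedule is an illustrative choice rather than anything the API prescribes.

```python
import time

RETRYABLE = {408, 503}  # status codes described above as retryable


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns HTTP 200 or attempts run out.

    fetch: zero-argument callable returning (status_code, body).
    sleep: injectable for testing; defaults to time.sleep.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(f"transcript fetch failed with HTTP {status}")
        # Exponential backoff: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

Non-retryable errors fail immediately, so a bad key or an unknown video does not waste further attempts.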

Alternatives worth considering

Option	How it works	IP block risk	Cost
transcriptapi.com (current)	Managed REST API	None (their problem)	Credit-based
yt-dlp	Downloads subtitles via `--write-sub --skip-download`	Low (mimics browser)	Free, self-hosted
`youtube-transcript-api` + cookies	Pass a Netscape cookies.txt from a logged-in browser session	Medium (burner account risk)	Free
YouTube Data API v3	Official Google API, no scraping	None	Free quota, then paid
Supadata	Similar managed REST API	None	Free tier (100 req/day)

Best free alternative: yt-dlp. It is actively maintained, mimics real browser requests, and tends to get blocked less quickly than plain HTTP requests. To switch, rewrite _fetch_youtube_content to shell out to yt-dlp --write-auto-sub --sub-format vtt --skip-download and parse the resulting .vtt file.
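A yt-dlp based replacement has two parts: building the command and flattening the .vtt output into plain text. The sketch below shows both as a starting point. The helper names are hypothetical, the `-o` output template is an added assumption, and the parser covers only the common cue layout (timing lines, inline tags, repeated lines), not the full WebVTT format.

```python
import re

# Cue timing line, e.g. "00:00:01.000 --> 00:00:03.000" (hours optional)
TIMING = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3} --> ")
TAG = re.compile(r"<[^>]+>")  # inline tags such as <c> or <00:00:01.500>


def build_ytdlp_command(url, out_template="%(id)s"):
    """Argument list for subprocess.run; flags as named in the text above."""
    return ["yt-dlp", "--write-auto-sub", "--sub-format", "vtt",
            "--skip-download", "-o", out_template, url]


def vtt_to_text(vtt):
    """Flatten a WebVTT document into one plain-text string."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line or line.startswith("WEBVTT") or TIMING.match(line):
            continue
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue  # header metadata and comments
        line = TAG.sub("", line).strip()
        # Auto-generated captions often repeat the previous line; drop repeats
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)
```

The command list can be passed to `subprocess.run(..., check=True)`, after which the generated .vtt file is read and handed to `vtt_to_text`.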

📦 API Endpoints

Analyze a URL

POST /analyze

Body:

{ "url": "https://github.com/owner/repo" }

Analyze a PDF

POST /analyze-pdf

Body: multipart/form-data with a file field containing a .pdf file. Streams NDJSON events in the same shape as /analyze (using the filename as the url key).
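Since both /analyze and /analyze-pdf stream NDJSON, a client reads the response line by line and decodes each line as its own JSON object. The sketch below shows this for POST /analyze using only the standard library. The function names are hypothetical, the base URL is the documented default, and the fields inside each event are not specified here, so the events are yielded unmodified.

```python
import json
import urllib.request


def iter_ndjson(lines):
    """Yield one decoded object per non-empty NDJSON line (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line:
            yield json.loads(line)


def analyze(url, base_url="http://localhost:8000"):
    """POST a URL to /analyze and yield streamed events as they arrive."""
    request = urllib.request.Request(
        f"{base_url}/analyze",
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The response object iterates over lines as bytes
        yield from iter_ndjson(response)
```

With a running backend, `for event in analyze("https://github.com/owner/repo"): print(event)` would print each progress event in order.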

Retrieve a stored PDF

GET /entries/<id>/pdf serves the PDF inline in the browser.
GET /entries/<id>/pdf?dl=1 serves the PDF as a file download.

List entries

GET /entries (supports filters: search, category, tag, source, list_id, read, useful)
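As a rough illustration, the filters map naturally onto query parameters. The helper below builds such a URL and rejects filter names not listed above. Its name is hypothetical, and how the server expects boolean filters like read or useful to be encoded is not documented here, so values are passed through as given.

```python
from urllib.parse import urlencode

# Filter names listed for GET /entries
ALLOWED_FILTERS = ("search", "category", "tag", "source", "list_id", "read", "useful")


def entries_url(base_url="http://localhost:8000", **filters):
    """Build a GET /entries URL from keyword filters (hypothetical helper)."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    return f"{base_url}/entries" + (f"?{query}" if query else "")
```

For example, `entries_url(tag="cve", search="heap overflow")` produces a URL with both filters properly percent-encoded.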

YouTube channel subscriptions

GET /yt-channels — list subscribed channels (id, channel_id, channel_url, name, last_checked)
POST /yt-channels/preview — body { "url": "https://www.youtube.com/@handle" }, returns channel info + latest videos (url/title/published)
POST /yt-channels — body { "url": "...", "name": "optional", "analyze_urls": ["https://...", ...] }; create subscription and optionally analyze selected videos
POST /yt-channels/<id>/poll — immediate manual poll for a channel
DELETE /yt-channels/<id> — remove subscription

Manage tags

Tag edits happen via PATCH /entries/<id> with JSON { "tags": ["tag1","tag2"] }
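A tag update is a single PATCH request carrying the full tag list. The sketch below builds that request with the standard library; the helper name is hypothetical, and trimming whitespace and dropping duplicate tags before sending is an assumption about sensible client behavior, not something the API requires.

```python
import json
import urllib.request


def build_tag_update(entry_id, tags, base_url="http://localhost:8000"):
    """Build a PATCH /entries/<id> request that replaces the entry's tags."""
    cleaned = []
    for tag in tags:
        tag = tag.strip()
        if tag and tag not in cleaned:
            cleaned.append(tag)  # keep first occurrence, preserve order
    return urllib.request.Request(
        f"{base_url}/entries/{entry_id}",
        data=json.dumps({"tags": cleaned}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_tag_update(42, ["cve", "rce"]))` against a running backend.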

🙌 Contributing

Fork
Create a branch
Make changes
Submit a PR

⚖️ License

MIT
