Samuraizer: Cyber-Security Knowledge Base Engine
Why Samuraizer?
Every security researcher knows the feeling: you find an interesting GitHub repo, a fresh CVE writeup, a blog post about a new exploitation technique. You forward it to yourself on WhatsApp. It immediately drowns in the chat. Weeks later you actually need that article, and it's gone.
Stop sending yourself links you'll never find again. Send them to Samuraizer once, and they're summarized, tagged, and searchable forever.
| Before | After |
|---|---|
| Scattered links drowning in chat history | Analyzed, tagged, and ready to be found |
Local-first privacy (local-only feature)
Extra privacy mode (local only): run with Ollama, and all data stays on your machine.
- Local vector storage in `samuraizer.db` + local AI embeddings via Ollama.
- Works fully offline once models are pulled (`ollama pull qwen3:4b`, `ollama pull qwen3-embedding:8b`).
- Great for security teams and researchers who require air-gapped/isolated setups.
What you get (at-a-glance)
Analyze: Paste URLs, watch results stream
- Paste one or more URLs (GitHub repos, CVE writeups, blog posts, YouTube videos); results stream back in real time
- Upload files directly from the browser or Telegram; full text is extracted, analyzed, stored, and viewable in the UI
  - Supported formats: `.pdf`, `.docx`, `.pptx`, `.txt`, `.md`
- Blog scanner: paste a blog homepage and extract all article links for batch analysis in one click
- Suggested Read: a relevant unread entry is surfaced on the Analyze tab each session to keep your queue moving

Knowledge Base
- Inline tag editing (add/remove tags on entries, feeds, and list items)
- Semantic search (vector search via Gemini embeddings) + classic full-text search
- Tag cloud + multi-filtering (by tag, category, source, list, read/useful)
- List management: group entries into manual lists, RSS lists, or channel lists
- Hover preview (summary cards) and quick copy buttons

Knowledge Graph
- Visualize your entire knowledge base as an interactive force-directed graph
- Entries and tags are nodes; edges show which tags link to which articles
- Click to preview an entry; double-click to open the original URL
- Color-coded by category (CVE, article, tool, video, blog, etc.)
- Search tags to highlight related clusters across the graph

RSS Feeds & YouTube Subscriptions
- Add RSS/Atom feeds; the server polls hourly, then auto-ingests and summarizes new posts
- New posts are automatically added to the Knowledge Base
- Each feed becomes its own list, making it easy to batch-review
- Feed items show source metadata and can be tagged/filtered like any entry
YouTube Channel Subscriptions
- Subscribe to YouTube channels via URL (e.g. https://www.youtube.com/@handle, /channel/UCxxx)
- Preview latest videos before subscribing and select which videos to analyze
- On subscribe, selected videos are analyzed immediately; future uploads are auto-polled hourly
- Runs via the `/yt-channels` API and appears in the UI under RSS/YT sections

Telegram Bot (Optional)
- Send any URL to the bot; it analyzes it through the same backend and returns a formatted card
- Send a PDF file; the bot downloads and analyzes it, then returns a result card with a link to view/download the file
- Live progress updates are streamed as the analysis runs
- Receive Suggested Read notifications: the bot proactively surfaces unread entries
Screenshots: Analyzing a URL | Daily Suggested Read
Chat (RAG + streaming + pinned context)
- Ask questions over your knowledge base; answers cite the best-matching entries
- Streaming responses with live typing and per-source relevance scores
- Multiple chat sessions with saved history and model selection
- Pin specific articles as context: type `@` for autocomplete or use the `@` browse button
  - When entries are pinned, Gemini answers only from those articles, with no RAG noise
  - Pinned entries appear as chips above the input; sources show a pin badge instead of a score
  - Perfect for deep-diving a specific PDF, writeup, or CVE
Screenshots: RAG chat with source scores | Pinned-context chat
[!IMPORTANT] Shape the Future of Samuraizer! We are currently voting on new features like Local LLMs and Obsidian export. Cast your vote here!
Architecture (high-level)
flowchart LR
Browser[Browser UI] -->|HTTP| Frontend[React Frontend]
Frontend -->|REST/NDJSON| Backend[Flask API]
Backend -->|SQL| SQLite[(samuraizer.db)]
Backend -->|API| Gemini[Gemini 2.5 Flash]
Backend -->|local API| Ollama[Ollama - Local LLM]
Backend -->|GitHub API| GitHub[GitHub]
Backend -->|RSS| RSS[RSS feeds]
Telegram[Telegram Bot] -->|HTTP| Backend
How it works (end-to-end)
- Submit a URL or PDF via the web UI (or Telegram bot).
- Backend determines the type (GitHub repo, blog post, RSS feed, PDF, etc.) and fetches/extracts content.
- Content is sent to Gemini 2.5 Flash to generate:
  - A concise summary
  - A category and tags
  - (Optionally) embeddings used for semantic search
- Results are stored in `samuraizer.db` and surfaced in the frontend.
- The frontend lets you:
  - Filter by tags, category, source, list, read/useful flags
  - Edit tags inline (updates persisted via `PATCH /entries/<id>`)
  - Use semantic search (vector search over Gemini embeddings)
- RSS feeds are polled periodically; new posts are automatically ingested.
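The type-detection step in the flow above could look roughly like the following. This is a minimal sketch under stated assumptions: `classify_url`, its return labels, and the matching rules are hypothetical and do not reflect how the backend actually decides; a real implementation would likely also inspect the response's content type.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Guess a coarse content type from the URL alone (hypothetical rules)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()

    # github.com/<owner>/<repo>[/...] is treated as a repository.
    if host == "github.com" and len([part for part in path.split("/") if part]) >= 2:
        return "github_repo"
    if host in {"youtube.com", "m.youtube.com", "youtu.be"}:
        return "youtube"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".rss", ".atom", ".xml")) or path.rstrip("/").endswith("/feed"):
        return "rss_feed"
    # Anything else falls through to generic article extraction.
    return "article"
```

Each label would then select a fetcher (GitHub API, transcript service, PDF text extraction, feed parser, or HTML extraction) before the content is handed to the LLM.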
Tech Stack
| Layer | Tech / Libraries |
|---|---|
| Backend | Python, Flask, SQLite, feedparser, PyMuPDF |
| LLM | Gemini 2.5 Flash (Gemini API), Ollama (optional, local) |
| Frontend | React 18, Vite, Tailwind CSS |
| Bot | python-telegram-bot v20 |
| Transcripts | transcriptapi.com |
Setup
Choose your preferred setup method:
Local Setup (manual install)
0) Clone the repo
git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer
1) Config
Copy .env.example to .env and fill in your values.
You can also open the web UI, go to the Settings tab, and adjust settings there (provider, API keys, Ollama model names, embedding model names) before saving.
cp .env.example .env
| Variable | Required | Where to get it |
|---|---|---|
| `LLM_PROVIDER` | No | `gemini` (default, cloud) or `ollama` (local) |
| `GEMINI_API_KEY` | When `gemini` | Google AI Studio → Get API key |
| `OLLAMA_URL` | When `ollama` | Ollama API URL (default: `http://localhost:11434`) |
| `OLLAMA_MODEL` | When `ollama` | Reasoning model (default: `qwen3:14b`) |
| `OLLAMA_EMBED_MODEL` | When `ollama` | Embedding model (default: `qwen3-embedding:8b`) |
| `TELEGRAM_BOT_TOKEN` | No | Create a bot with @BotFather on Telegram |
| `GITHUB_TOKEN` | No | GitHub → Settings → Developer settings → Personal access tokens; raises API rate limit from 60 to 5,000 req/hr |
| `TRANSCRIPTAPI` | No | transcriptapi.com/dashboard/api-keys; required for YouTube transcript fetching |
| `SAMURAIZER_URL` | No | URL of your backend (default: `http://localhost:8000`), used by the Telegram bot |
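To make the table's rules concrete, here is a small sketch that applies the documented defaults and checks the one hard requirement (a Gemini key when the provider is `gemini`). The `resolve_settings` helper and `DEFAULTS` dict are hypothetical and only illustrate the rules; they are not how the server loads its configuration.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "LLM_PROVIDER": "gemini",
    "OLLAMA_URL": "http://localhost:11434",
    "OLLAMA_MODEL": "qwen3:14b",
    "OLLAMA_EMBED_MODEL": "qwen3-embedding:8b",
    "SAMURAIZER_URL": "http://localhost:8000",
}


def resolve_settings(env: dict) -> dict:
    """Merge env over the defaults and validate the provider-specific requirement."""
    # Empty strings count as unset so that blank .env lines fall back to defaults.
    settings = {**DEFAULTS, **{key: value for key, value in env.items() if value}}
    provider = settings["LLM_PROVIDER"].strip().lower()
    if provider not in ("gemini", "ollama"):
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    if provider == "gemini" and not settings.get("GEMINI_API_KEY"):
        raise ValueError("GEMINI_API_KEY is required when LLM_PROVIDER=gemini")
    settings["LLM_PROVIDER"] = provider
    return settings
```

Calling `resolve_settings(dict(os.environ))` at startup (with `import os`) would surface a missing key immediately instead of on the first analysis request.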
Local Mode (Ollama)
Run Samuraizer fully offline with Ollama:
# Install Ollama CLI (platform-dependent)
# macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
irm https://ollama.com/install.ps1 | iex
# Linux (Ubuntu/Debian)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Install models
ollama pull qwen3:4b # reasoning
ollama pull qwen3-embedding:8b # embeddings
# Set provider in .env
LLM_PROVIDER=ollama
If you already have Gemini embeddings, switching to Ollama will automatically wipe them on next startup (dimension mismatch). Re-embed via the UI or POST /entries/embed-all.
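The wipe is needed because vectors from different embedding models generally have different lengths and cannot be compared with each other. A check along these lines would detect the mismatch; `needs_reembedding` is a hypothetical helper for illustration, not the server's actual startup logic.

```python
from typing import Iterable, Sequence


def needs_reembedding(
    stored_vectors: Iterable[Sequence[float]],
    probe_vector: Sequence[float],
) -> bool:
    """Return True when any stored vector's length differs from the active model's output.

    probe_vector is assumed to be one embedding produced by the currently
    configured model (for example, the embedding of a short test string).
    """
    expected = len(probe_vector)
    return any(len(vec) != expected for vec in stored_vectors)
```

When this returns `True`, the stored vectors are unusable for search until they are regenerated, which is what the `POST /entries/embed-all` call mentioned above is for.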
2) Install dependencies
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
cd frontend
npm install
cd ..
3) Run backend
python server.py
4) Run frontend (new terminal)
cd frontend
npm run dev
5) (Optional) Run Telegram bot (new terminal)
python telegram_bot.py
Docker Setup (recommended)
The easiest way to run the full stack: one command builds and starts everything.
0) Clone the repo
git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer
1) Config
cp .env.example .env
# Edit .env and fill in GEMINI_API_KEY (and any other values you need)
2) Build & run
All Docker files live in the docker/ folder. Run commands from the project root:
Gemini (cloud), the default:
docker compose -f docker/docker-compose.yml up -d
Ollama on CPU:
docker compose -f docker/docker-compose.yml --profile ollama up -d
Ollama on NVIDIA GPU:
docker compose -f docker/docker-compose.yml --profile ollama-nvidia up -d
Prerequisites for NVIDIA GPU
Windows (WSL 2):
- Install the latest NVIDIA Game Ready / Studio drivers on Windows; no separate Linux driver is needed inside WSL.
- Install WSL 2: open PowerShell as Administrator and run `wsl --install`, then `wsl --update`.
- Install Docker Desktop for Windows and in its settings:
  - General → enable "Use the WSL 2 based engine"
  - Resources → WSL Integration → enable for your distro

Docker Desktop for Windows bundles the NVIDIA container runtime automatically; no extra steps are needed.
Linux:
Install the NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Ollama on AMD GPU (requires ROCm-capable GPU + amdgpu driver on Linux):
docker compose -f docker/docker-compose.yml --profile ollama-amd up -d
That's it. The frontend is at http://localhost and the API at http://localhost:8000.
Tip: When using any Ollama profile, set `LLM_PROVIDER=ollama` in your `.env`. The `OLLAMA_URL` is automatically set to `http://ollama:11434` inside the containers, so you don't need to change it.
After first start with Ollama, pull the models (use the container name matching the profile you started):
# --profile ollama
docker exec -it samuraizer-ollama-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-1 ollama pull qwen3-embedding:8b
# --profile ollama-nvidia
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3-embedding:8b
# --profile ollama-amd
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3-embedding:8b
- Backend and frontend restart automatically on failure.
- Docker stores the database and log file under the project's `data/` directory and backups under `db_backups/`.
- Ollama model files are stored in the `ollama_data` Docker volume (persists across restarts).
- The Telegram bot is disabled by default. Combine profiles to enable it alongside Ollama:
docker compose -f docker/docker-compose.yml --profile ollama-nvidia --profile bot up -d
Updating
git pull
docker compose -f docker/docker-compose.yml up -d --build
Stopping
docker compose -f docker/docker-compose.yml down
YouTube Transcript Fetching
Why not youtube-transcript-api?
The original implementation used the open-source youtube-transcript-api Python library. It works well locally but has a critical limitation in practice: YouTube aggressively blocks IP addresses that make automated transcript requests, especially:
- IPs belonging to cloud providers / VPS hosts (AWS, GCP, Azure, Hetzner, etc.)
- IPs that hit the transcript endpoint too frequently
This meant that after analyzing just a handful of videos, the whole server would get blocked and every subsequent transcript fetch would fail with an IPBlocked / RequestBlocked error, completely breaking YouTube video analysis.
Current solution: transcriptapi.com
Samuraizer now uses transcriptapi.com, a third-party paid API that handles the YouTube transcript fetching on their end, routing through infrastructure that isn't blocked.
Pros:
- No IP blocks: they manage the anti-bot problem for you
- Simple REST API (`GET /api/v2/youtube/transcript`)
- Free tier available; credits only charged on success (HTTP 200)
- Retryable error codes (408 / 503) with clear semantics
Cons:
- Not free beyond the free tier (credit-based billing)
- External dependency: if their service is down, transcript fetching fails
- Data goes through a third party
Setup: Add to .env:
TRANSCRIPTAPI=your_key_here
Get a key at transcriptapi.com/dashboard/api-keys.
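Since 408 and 503 are described as retryable, a client would typically wrap the transcript call in a bounded retry with backoff. The sketch below is transport-agnostic on purpose: `fetch` is any zero-argument callable returning `(status_code, body)`, so no request parameters or headers of the transcriptapi.com API are assumed here. The function name, attempt count, and delays are illustrative choices.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {408, 503}  # the retryable codes listed under Pros


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns HTTP 200, retrying only on retryable statuses."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            raise RuntimeError(f"transcript fetch failed with HTTP {status}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Because credits are only charged on HTTP 200, retrying a 408 or 503 should not cost anything extra.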
Alternatives worth considering
| Option | How it works | IP block risk | Cost |
|---|---|---|---|
| transcriptapi.com (current) | Managed REST API | None (their problem) | Credit-based |
| yt-dlp | Downloads subtitles via `--write-sub --skip-download` | Low (mimics browser) | Free, self-hosted |
| youtube-transcript-api + cookies | Pass a Netscape `cookies.txt` from a logged-in browser session | Medium (burner account risk) | Free |
| YouTube Data API v3 | Official Google API, no scraping | None | Free quota, then paid |
| Supadata | Similar managed REST API | None | Free tier (100 req/day) |
Best free alternative: yt-dlp. It is actively maintained, mimics real browser requests, and tends to get blocked less quickly than a plain HTTP request. To switch, rewrite `_fetch_youtube_content` to shell out to `yt-dlp --write-auto-sub --sub-format vtt --skip-download` and parse the resulting `.vtt` file.
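The parsing half of that switch could be sketched as follows. `vtt_to_text` is a hypothetical helper, not existing project code; it handles the common shape of YouTube's auto-generated WebVTT (header lines, cue timings, inline tags, rolling repeated lines) and deliberately ignores rarer features such as multi-line NOTE blocks and cue identifiers.

```python
import re

# Cue timing line, e.g. "00:00:01.000 --> 00:00:03.000 align:start position:0%"
_TIMING = re.compile(r"^\d{2}:\d{2}(:\d{2})?\.\d{3}\s+-->\s+")
# Inline markup such as <c>...</c> or <00:00:01.500>
_TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Extract plain caption text, dropping headers, timings, tags, and repeated lines."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line or line == "WEBVTT":
            continue
        if line.startswith(("WEBVTT ", "NOTE", "Kind:", "Language:")):
            continue
        if _TIMING.match(line):
            continue
        text = _TAG.sub("", line).strip()
        # Auto-generated captions repeat the previous line in rolling cues; keep one copy.
        if text and (not lines or lines[-1] != text):
            lines.append(text)
    return " ".join(lines)
```

The resulting string can then be passed to the summarizer the same way a transcript from the managed API would be.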
API Endpoints
Analyze a URL
POST /analyze
Body:
{ "url": "https://github.com/owner/repo" }
Analyze a PDF
POST /analyze-pdf
Body: multipart/form-data with a file field containing a .pdf file.
Streams NDJSON events in the same shape as /analyze (using the filename as the url key).
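Because both analyze endpoints stream NDJSON (one JSON object per line), a client can process events as they arrive instead of waiting for the full response. The sketch below uses only the standard library. The base URL default matches the documented `http://localhost:8000`; the function names are illustrative, and the fields inside each event are not assumed because the event schema is not documented here.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of an NDJSON stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if line:
            yield json.loads(line)


def analyze(url: str, base: str = "http://localhost:8000") -> Iterator[dict]:
    """POST a URL to /analyze and yield streamed events as they arrive."""
    request = urllib.request.Request(
        f"{base}/analyze",
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP response object iterates line by line, which matches NDJSON framing.
    with urllib.request.urlopen(request) as response:
        yield from iter_ndjson(response)
```

With a running backend, `for event in analyze("https://github.com/owner/repo"): print(event)` would print each progress or result event as soon as the server emits it.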
Retrieve a stored PDF
`GET /entries/<id>/pdf`: serves the PDF inline in the browser.
`GET /entries/<id>/pdf?dl=1`: serves the PDF as a file download.
List entries
GET /entries (supports filters: search, category, tag, source, list_id, read, useful)
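A small helper can build the filtered URL and reject filter names that the endpoint does not list. This is a sketch: `entries_url` is a hypothetical helper, and how the server expects boolean filters such as `read` and `useful` to be encoded is not specified above, so values are passed through as given.

```python
from urllib.parse import urlencode

# Filter names listed for GET /entries.
ALLOWED_FILTERS = ("search", "category", "tag", "source", "list_id", "read", "useful")


def entries_url(base: str = "http://localhost:8000", **filters) -> str:
    """Build a GET /entries URL from keyword filters, skipping those set to None."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    return f"{base}/entries" + (f"?{query}" if query else "")
```

For example, `entries_url(tag="rce", category="cve")` yields a URL that can be fetched with `urllib.request.urlopen`.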
YouTube channel subscriptions
- `GET /yt-channels`: list subscribed channels (id, channel_id, channel_url, name, last_checked)
- `POST /yt-channels/preview`: body `{ "url": "https://www.youtube.com/@handle" }`, returns channel info + latest videos (url/title/published)
- `POST /yt-channels`: body `{ "url": "...", "name": "optional", "analyze_urls": ["https://...", ...] }`; creates the subscription and optionally analyzes selected videos
- `POST /yt-channels/<id>/poll`: immediate manual poll for a channel
- `DELETE /yt-channels/<id>`: remove subscription
Manage tags
- Tag edits happen via `PATCH /entries/<id>` with JSON `{ "tags": ["tag1","tag2"] }`
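A tag edit from a script might be prepared like this. The sketch assumes the endpoint accepts the complete tag list in one request, as the JSON shape above suggests. `build_tag_patch` and its client-side normalization (trimming, de-duplication) are illustrative additions, not behavior the server is known to require.

```python
import json
import urllib.request
from typing import List


def build_tag_patch(
    entry_id: int,
    tags: List[str],
    base: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Prepare a PATCH /entries/<id> request carrying the given tag list."""
    # Trim whitespace, drop empties and duplicates, keep first-seen order.
    cleaned = list(dict.fromkeys(tag.strip() for tag in tags if tag.strip()))
    return urllib.request.Request(
        f"{base}/entries/{entry_id}",
        data=json.dumps({"tags": cleaned}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Sending it is then a single call: `urllib.request.urlopen(build_tag_patch(42, ["rce", "linux"]))`.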
Contributing
- Fork
- Create a branch
- Make changes
- Submit a PR
License
MIT





.png)