Samuraizer: Cyber-Security Knowledge Base Engine

NotebookLM on steroids, purpose-built for security researchers.


Why Samuraizer?

Every security researcher knows the feeling: you find an interesting GitHub repo, a fresh CVE writeup, a blog post about a new exploitation technique. You forward it to yourself on WhatsApp. It immediately drowns in the chat. Weeks later you actually need that article, and it's gone.

Stop sending yourself links you'll never find again. Send them to Samuraizer once, and they're summarized, tagged, and searchable forever.

| Before | After |
|---|---|
| Scattered links drowning in chat history | Analyzed, tagged, and ready to be found |

Local-first privacy (local-only feature)

🎯 Extra privacy mode (local only): run with Ollama, and all data stays on your machine.

  • Local vector storage in samuraizer.db + local AI embeddings via Ollama.
  • Works fully offline once models are pulled (ollama pull qwen3:4b, ollama pull qwen3-embedding:8b).
  • Great for security teams and researchers who require air-gapped/isolated setups.

🧩 What you get (at-a-glance)

πŸ” Analyze β€” Paste URLs, watch results stream

  • πŸ“ Paste one or more URLs (GitHub repos, CVE writeups, blog posts, YouTube videos) β€” results stream back in real time
  • πŸ“„ Upload files directly from the browser or Telegram β€” full text extracted, analyzed, stored, and viewable in the UI
    • supported formats: .pdf, .docx, .pptx, .txt, .md
  • πŸ—žοΈ Blog scanner: paste a blog homepage and extract all article links for batch analysis in one click
  • ✨ Suggested Read: a relevant unread entry is surfaced on the Analyze tab each session to keep your queue moving

Knowledge Base

  • ✏️ Inline tag editing (add/remove tags on entries, feeds, and list items)
  • Semantic search (vector search via Gemini embeddings) + classic full-text search
  • Tag cloud + multi-filtering (by tag, category, source, list, read/useful)
  • List management: group entries into manual lists, RSS lists, or channel lists
  • Hover preview (summary cards) and quick copy buttons
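
The semantic search bullet above comes down to ranking stored embedding vectors by their similarity to a query vector. The following is a minimal sketch, assuming cosine similarity and an in-memory list of (entry_id, vector) pairs. The function names and data layout are hypothetical and are not taken from Samuraizer's code:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, entries, k=3):
    """Rank (entry_id, vector) pairs by similarity to the query, best first."""
    scored = [(entry_id, cosine_similarity(query, vec)) for entry_id, vec in entries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is usually adequate for a personal knowledge base; a dedicated vector index only becomes worthwhile at much larger sizes.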

Knowledge Graph

  • Visualize your entire knowledge base as an interactive force-directed graph
  • Entries and tags are nodes; edges show which tags link to which articles
  • Click to preview an entry; double-click to open the original URL
  • Color-coded by category (CVE, article, tool, video, blog, etc.)
  • Search tags to highlight related clusters across the graph
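
The node and edge model described above can be sketched as follows. This is only an illustration under assumptions: the entry keys (id, title, category, tags) and the "entry:" / "tag:" node id prefixes are hypothetical, not Samuraizer's actual schema:

```python
def build_graph(entries):
    """Turn entries into node and edge collections for a force-directed graph.

    `entries` is a list of dicts with the hypothetical keys
    id, title, category, and tags.
    """
    nodes = {}
    edges = []
    for entry in entries:
        entry_node = f"entry:{entry['id']}"
        nodes[entry_node] = {
            "type": "entry",
            "label": entry["title"],
            "category": entry["category"],  # could drive the color coding
        }
        for tag in entry["tags"]:
            tag_node = f"tag:{tag}"
            # A tag node is shared by every entry that carries the tag.
            nodes.setdefault(tag_node, {"type": "tag", "label": tag})
            edges.append((tag_node, entry_node))
    return nodes, edges
```

Because tag nodes are shared, entries with common tags end up pulled into the same cluster by the layout.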

RSS Feeds & YouTube Subscriptions

  • Add RSS/Atom feeds: the server polls hourly and auto-ingests and summarizes new posts
  • New posts are automatically added to the Knowledge Base
  • Each feed becomes its own list, making it easy to batch-review
  • Feed items show source metadata and can be tagged/filtered like any entry
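
Hourly polling only works if each poll ingests posts that have not been seen before. Here is a minimal sketch of that step, assuming feed items are dicts with a link key (as feedparser entries expose) and that previously ingested URLs are available as a set. The function name is hypothetical:

```python
def select_new_items(feed_items, seen_urls):
    """Return feed items whose link has not been ingested yet, preserving order."""
    seen = set(seen_urls)
    new_items = []
    for item in feed_items:
        link = item.get("link")
        if not link or link in seen:
            continue  # skip items without a link and anything already known
        seen.add(link)  # also guards against duplicates inside one feed
        new_items.append(item)
    return new_items
```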

YouTube Channel Subscriptions

  • Subscribe to YouTube channels via URL (e.g. https://www.youtube.com/@handle, /channel/UCxxx)
  • Preview latest videos before subscribing and select which videos to analyze
  • On subscribe, selected videos are analyzed immediately; future uploads are auto-polled hourly
  • Runs via /yt-channels API and appears in the UI under RSS/YT sections

Telegram Bot (Optional)

  • Send any URL to the bot: it analyzes it through the same backend and returns a formatted card
  • Send a PDF file: the bot downloads and analyzes it, then returns a result card with a link to view/download the file
  • Live progress updates are streamed as the analysis runs
  • Receive Suggested Read notifications: the bot proactively surfaces unread entries

[Image: Analyzing a URL]

[Image: Daily Suggested Read]

Chat (RAG + streaming + pinned context)

  • Ask questions over your knowledge base; answers cite the best-matching entries
  • Streaming responses with live typing and per-source relevance scores
  • Multiple chat sessions with saved history and model selection
  • Pin specific articles as context: type @ for autocomplete or use the @ browse button
    • When entries are pinned, Gemini answers only from those articles, with no RAG noise
    • Pinned entries appear as chips above the input; sources show a 📌 badge instead of a score
    • Perfect for deep-diving a specific PDF, writeup, or CVE
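
The difference between normal RAG and pinned context can be sketched as a single selection step. This is an illustration only: the function name, the (entry_id, score) input shape, and the default k are assumptions, not the project's implementation:

```python
def select_context(scored_entries, pinned_ids, k=5):
    """Pick the entries that are passed to the model as context.

    scored_entries: list of (entry_id, relevance_score) pairs from retrieval.
    pinned_ids: entry ids the user pinned with @; may be empty.
    """
    if pinned_ids:
        # Pinned mode: use exactly the pinned entries, with no retrieval score.
        return [{"id": entry_id, "pinned": True, "score": None} for entry_id in pinned_ids]
    # RAG mode: keep the k highest-scoring entries.
    ranked = sorted(scored_entries, key=lambda pair: pair[1], reverse=True)
    return [{"id": entry_id, "pinned": False, "score": score} for entry_id, score in ranked[:k]]
```

Returning score as None in pinned mode mirrors the UI behavior described above, where pinned sources show a badge instead of a relevance score.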

[Image: RAG chat with source scores]

[Image: Pinned-context chat]

[!IMPORTANT] Shape the Future of Samuraizer! We are currently voting on new features like Local LLMs and Obsidian export. Cast your vote here!


Architecture (high-level)

flowchart LR
  Browser[Browser UI] -->|HTTP| Frontend[React Frontend]
  Frontend -->|REST/NDJSON| Backend[Flask API]
  Backend -->|SQL| SQLite[(samuraizer.db)]
  Backend -->|API| Gemini[Gemini 2.5 Flash]
  Backend -->|local API| Ollama[Ollama - Local LLM]
  Backend -->|GitHub API| GitHub[GitHub]
  Backend -->|RSS| RSS[RSS feeds]
  Telegram[Telegram Bot] -->|HTTP| Backend

🧠 How it works (end-to-end)

  1. Submit a URL or PDF via the web UI (or Telegram bot).
  2. Backend determines the type (GitHub repo, blog post, RSS feed, PDF, etc.) and fetches/extracts content.
  3. Content is sent to Gemini 2.5 Flash to generate:
    • A concise summary
    • A category and tags
    • (Optionally) embeddings used for semantic search
  4. Results are stored in samuraizer.db and surfaced in the frontend.
  5. The frontend lets you:
    • Filter by tags, category, source, list, read/useful flags
    • Edit tags inline (updates persisted via PATCH /entries/<id>)
    • Use semantic search (vector search over Gemini embeddings)
  6. RSS feeds are polled periodically; new posts are automatically ingested.
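
Step 2 (determining the content type) could look roughly like the sketch below. The heuristics and the returned labels are assumptions chosen for illustration; the real backend may use different rules, such as inspecting the response Content-Type:

```python
from urllib.parse import urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(url: str) -> str:
    """Guess the kind of source from the URL alone (hypothetical heuristics)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if host in {"github.com", "www.github.com"}:
        return "github"
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".xml", ".rss", ".atom")) or path.rstrip("/").endswith("/feed"):
        return "rss"
    return "article"  # fall back to treating the page as a blog post or article
```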

🧰 Tech Stack

Layer Tech / Libraries
Backend Python, Flask, SQLite, feedparser, PyMuPDF
LLM Gemini 2.5 Flash (Gemini API), Ollama (local optionally)
Frontend React 18, Vite, Tailwind CSS
Bot python-telegram-bot v20
Transcripts transcriptapi.com

Setup

Choose your preferred setup method:

Local Setup (manual install)

0) Clone the repo

git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer

1) Config

Copy .env.example to .env and fill in your values. You can also open the web UI, go to the Settings tab, and adjust settings there (provider, API keys, Ollama model names, embedding model names) before saving.

cp .env.example .env

| Variable | Required | Where to get it |
|---|---|---|
| LLM_PROVIDER | No | gemini (default, cloud) or ollama (local) |
| GEMINI_API_KEY | When gemini | Google AI Studio → Get API key |
| OLLAMA_URL | When ollama | Ollama API URL (default: http://localhost:11434) |
| OLLAMA_MODEL | When ollama | Reasoning model (default: qwen3:14b) |
| OLLAMA_EMBED_MODEL | When ollama | Embedding model (default: qwen3-embedding:8b) |
| TELEGRAM_BOT_TOKEN | No | Create a bot with @BotFather on Telegram |
| GITHUB_TOKEN | No | GitHub → Settings → Developer settings → Personal access tokens; raises API rate limit from 60 to 5,000 req/hr |
| TRANSCRIPTAPI | No | transcriptapi.com/dashboard/api-keys; required for YouTube transcript fetching |
| SAMURAIZER_URL | No | URL of your backend (default: http://localhost:8000), used by the Telegram bot |

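
To make the required/optional rules in the table concrete, here is a hedged sketch of how such settings could be resolved. The variable names and defaults come from the table; the load_settings function itself and its validation behavior are hypothetical and may differ from what server.py does:

```python
import os

# Defaults as listed in the configuration table.
DEFAULTS = {
    "LLM_PROVIDER": "gemini",
    "OLLAMA_URL": "http://localhost:11434",
    "OLLAMA_MODEL": "qwen3:14b",
    "OLLAMA_EMBED_MODEL": "qwen3-embedding:8b",
    "SAMURAIZER_URL": "http://localhost:8000",
}


def load_settings(env=None):
    """Resolve settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    settings = {key: env.get(key) or default for key, default in DEFAULTS.items()}
    settings["GEMINI_API_KEY"] = env.get("GEMINI_API_KEY")

    provider = settings["LLM_PROVIDER"]
    if provider not in {"gemini", "ollama"}:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    if provider == "gemini" and not settings["GEMINI_API_KEY"]:
        raise ValueError("GEMINI_API_KEY is required when LLM_PROVIDER=gemini")
    return settings
```

Passing a plain dict instead of os.environ keeps the function easy to test, for example load_settings({"LLM_PROVIDER": "ollama"}).
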
🏠 Local Mode (Ollama)

Run Samuraizer fully offline with Ollama:

# Install Ollama CLI (platform-dependent)
# macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
irm https://ollama.com/install.ps1 | iex

# Linux (Ubuntu/Debian)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve

# Install models
ollama pull qwen3:4b            # reasoning
ollama pull qwen3-embedding:8b  # embeddings

# Set provider in .env
LLM_PROVIDER=ollama

If you already have Gemini embeddings, switching to Ollama will automatically wipe them on next startup (dimension mismatch). Re-embed via the UI or POST /entries/embed-all.
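
The wipe is needed because vectors produced by different embedding models generally have different lengths and cannot be compared with each other. A minimal sketch of detecting such entries, assuming embeddings are held as a mapping from entry id to vector (the function name and data layout are hypothetical):

```python
def stale_embedding_ids(stored, expected_dim):
    """Return ids of entries whose stored vector length differs from expected_dim.

    stored: dict mapping entry id -> embedding vector (list of floats).
    expected_dim: vector length produced by the currently configured model.
    """
    return [entry_id for entry_id, vector in stored.items() if len(vector) != expected_dim]
```

Any id returned here would need to be re-embedded with the active model before it can take part in semantic search again.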

2) Install dependencies

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

pip install -r requirements.txt

cd frontend
npm install
cd ..

3) Run backend

python server.py

4) Run frontend 🌐 (new terminal)

cd frontend
npm run dev

5) (Optional) Run Telegram bot (new terminal)

python telegram_bot.py
🐳 Docker Setup (recommended)

The easiest way to run the full stack: one command builds and starts everything.

0) Clone the repo

git clone https://github.com/zomry1/Samuraizer.git
cd Samuraizer

1) Config

cp .env.example .env
# Edit .env and fill in GEMINI_API_KEY (and any other values you need)

2) Build & run

All Docker files live in the docker/ folder. Run commands from the project root:

Gemini (cloud, default):

docker compose -f docker/docker-compose.yml up -d

Ollama on CPU:

docker compose -f docker/docker-compose.yml --profile ollama up -d

Ollama on NVIDIA GPU:

docker compose -f docker/docker-compose.yml --profile ollama-nvidia up -d
Prerequisites for NVIDIA GPU

Windows (WSL 2):

  1. Install the latest NVIDIA Game Ready / Studio drivers on Windows. No separate Linux driver is needed inside WSL.
  2. Install WSL 2: open PowerShell as Administrator and run:
    wsl --install
    wsl --update
  3. Install Docker Desktop for Windows and in its settings:
    • General → enable "Use the WSL 2 based engine"
    • Resources → WSL Integration → enable for your distro

Docker Desktop for Windows bundles the NVIDIA container runtime automatically, so no extra steps are needed.

Linux:

Install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Ollama on AMD GPU (requires ROCm-capable GPU + amdgpu driver on Linux):

docker compose -f docker/docker-compose.yml --profile ollama-amd up -d

That's it. The frontend is at http://localhost and the API at http://localhost:8000.

Tip: When using any Ollama profile, set LLM_PROVIDER=ollama in your .env. The OLLAMA_URL is automatically set to http://ollama:11434 inside the containers, so you don't need to change it.

After first start with Ollama, pull the models (use the container name matching the profile you started):

# --profile ollama
docker exec -it samuraizer-ollama-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-1 ollama pull qwen3-embedding:8b

# --profile ollama-nvidia
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-nvidia-1 ollama pull qwen3-embedding:8b

# --profile ollama-amd
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3:4b
docker exec -it samuraizer-ollama-amd-1 ollama pull qwen3-embedding:8b
  • Backend and frontend restart automatically on failure.
  • Docker stores the database and log file under the project's data/ directory and backups under db_backups/.
  • Ollama model files are stored in the ollama_data Docker volume (persists across restarts).
  • The Telegram bot is disabled by default. Combine profiles to enable it alongside Ollama:
docker compose -f docker/docker-compose.yml --profile ollama-nvidia --profile bot up -d

Updating

git pull
docker compose -f docker/docker-compose.yml up -d --build

Stopping

docker compose -f docker/docker-compose.yml down

YouTube Transcript Fetching

Why not youtube-transcript-api?

The original implementation used the open-source youtube-transcript-api Python library. It works well locally but has a critical limitation in practice: YouTube aggressively blocks IP addresses that make automated transcript requests, especially:

  • IPs belonging to cloud providers / VPS hosts (AWS, GCP, Azure, Hetzner, etc.)
  • IPs that hit the transcript endpoint too frequently

This meant that after analyzing just a handful of videos, the whole server would get blocked and every subsequent transcript fetch would fail with an IPBlocked / RequestBlocked error, completely breaking YouTube video analysis.

Current solution: transcriptapi.com

Samuraizer now uses transcriptapi.com, a third-party paid API that handles the YouTube transcript fetching on their end, routing through infrastructure that isn't blocked.

Pros:

  • No IP blocks: they manage the anti-bot problem for you
  • Simple REST API (GET /api/v2/youtube/transcript)
  • Free tier available; credits only charged on success (HTTP 200)
  • Retryable error codes (408 / 503) with clear semantics

Cons:

  • Not free beyond the free tier (credit-based billing)
  • External dependency: if their service is down, transcript fetching fails
  • Data goes through a third party

Setup: Add to .env:

TRANSCRIPTAPI=your_key_here

Get a key at transcriptapi.com/dashboard/api-keys.
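
The retryable status codes mentioned under Pros (408 / 503) lend themselves to a small retry wrapper. The sketch below is an assumption about how a client might use them; it is not Samuraizer's code, and the fetch callable, attempt count, and backoff schedule are all hypothetical:

```python
import time

RETRYABLE_STATUS = {408, 503}


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is any zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            break
        # Exponential backoff between attempts: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
    return status, body
```

Injecting sleep as a parameter keeps the wrapper testable without real waiting. Since credits are only charged on HTTP 200, retrying a 408 or 503 should not cost extra.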

Alternatives worth considering

| Option | How it works | IP block risk | Cost |
|---|---|---|---|
| transcriptapi.com (current) | Managed REST API | None (their problem) | Credit-based |
| yt-dlp | Downloads subtitles via --write-sub --skip-download | Low (mimics browser) | Free, self-hosted |
| youtube-transcript-api + cookies | Pass a Netscape cookies.txt from a logged-in browser session | Medium (burner account risk) | Free |
| YouTube Data API v3 | Official Google API, no scraping | None | Free quota, then paid |
| Supadata | Similar managed REST API | None | Free tier (100 req/day) |

Best free alternative: yt-dlp. It is actively maintained, mimics real browser requests, and is less likely to be blocked than a plain HTTP request. To switch, rewrite _fetch_youtube_content to shell out to yt-dlp --write-auto-sub --sub-format vtt --skip-download and parse the resulting .vtt file.
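
The "parse the resulting .vtt file" step could be sketched as below. This is a simplified, hypothetical parser (vtt_to_text is not an existing function in the project): it assumes caption text is everything that is not a header, a timing line, or a NOTE line, and it drops inline tags and consecutive repeated lines, which auto-generated captions tend to contain:

```python
import re

# Cue timing line, e.g. "00:00:01.000 --> 00:00:04.000" (hours are optional).
TIMING_LINE = re.compile(r"^\d{2}:\d{2}(:\d{2})?\.\d{3}\s+-->\s+")
# Inline markup such as <c>, </c>, or <00:00:00.500> word timestamps.
TAG = re.compile(r"<[^>]+>")
HEADER_PREFIXES = ("WEBVTT", "NOTE", "Kind:", "Language:")


def vtt_to_text(vtt: str) -> str:
    """Extract plain caption text from WebVTT content."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line or line.startswith(HEADER_PREFIXES) or TIMING_LINE.match(line):
            continue
        text = TAG.sub("", line).strip()
        # Skip a line that merely repeats the previous caption line.
        if text and (not lines or lines[-1] != text):
            lines.append(text)
    return " ".join(lines)
```

Multi-line NOTE or STYLE blocks are not handled here; a production parser would need to cover them.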

API Endpoints

Analyze a URL

POST /analyze

Body:

{ "url": "https://github.com/owner/repo" }
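
A client for this endpoint needs to send the JSON body and then read the NDJSON stream line by line. The sketch below uses only the standard library. The base URL is the default from the configuration table, the helper names are hypothetical, and the fields inside each streamed event are not specified here, so the code treats every event as an opaque JSON object:

```python
import json
import urllib.request


def build_analyze_request(base_url: str, url: str) -> urllib.request.Request:
    """Build the POST /analyze request for a single URL."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_ndjson(lines):
    """Yield one decoded JSON object per non-empty line (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line:
            yield json.loads(line)
```

To consume a live stream, open the request with urllib.request.urlopen(build_analyze_request("http://localhost:8000", target_url)) and pass the response object to iter_ndjson; the response is iterable line by line.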

Analyze a PDF

POST /analyze-pdf

Body: multipart/form-data with a file field containing a .pdf file. Streams NDJSON events in the same shape as /analyze (using the filename as the url key).

Retrieve a stored PDF

  • GET /entries/<id>/pdf - serves the PDF inline in the browser.
  • GET /entries/<id>/pdf?dl=1 - serves the PDF as a file download.

List entries

GET /entries (supports filters: search, category, tag, source, list_id, read, useful)
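
The filters are passed as query parameters. Below is a small sketch of building such a URL. The filter names come from the line above; the helper itself is hypothetical, and the value format the server expects for read and useful (for example 1 versus true) is an assumption left to the caller:

```python
from urllib.parse import urlencode

ALLOWED_FILTERS = ("search", "category", "tag", "source", "list_id", "read", "useful")


def entries_url(base_url, **filters):
    """Build a GET /entries URL, rejecting filter names the API does not list."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    # Drop filters explicitly set to None so they do not appear in the query.
    query = urlencode({key: value for key, value in filters.items() if value is not None})
    url = f"{base_url.rstrip('/')}/entries"
    return f"{url}?{query}" if query else url
```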

YouTube channel subscriptions

  • GET /yt-channels - list subscribed channels (id, channel_id, channel_url, name, last_checked)
  • POST /yt-channels/preview - body { "url": "https://www.youtube.com/@handle" }, returns channel info + latest videos (url/title/published)
  • POST /yt-channels - body { "url": "...", "name": "optional", "analyze_urls": ["https://...", ...] }; creates the subscription and optionally analyzes selected videos
  • POST /yt-channels/<id>/poll - immediate manual poll for a channel
  • DELETE /yt-channels/<id> - remove subscription

Manage tags

  • Tag edits happen via PATCH /entries/<id> with JSON { "tags": ["tag1","tag2"] }
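
A tag edit from a script could be prepared as in the sketch below. The endpoint and body shape come from the bullet above. The helper name and the client-side clean-up (trimming whitespace, dropping empty and duplicate tags) are assumptions made for illustration, and the server may normalize tags differently:

```python
import json
import urllib.request


def build_tag_update(base_url, entry_id, tags):
    """Build a PATCH /entries/<id> request that replaces the entry's tags."""
    # Trim whitespace, drop empty strings, and de-duplicate while keeping order.
    cleaned = list(dict.fromkeys(tag.strip() for tag in tags if tag.strip()))
    body = json.dumps({"tags": cleaned}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/entries/{entry_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

The request can then be sent with urllib.request.urlopen(request) against a running backend.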

Contributing

  1. Fork
  2. Create a branch
  3. Make changes
  4. Submit a PR

License

MIT