EpsteinFiles-RAG

Open source · MIT · Python · 377 stars · 63 forks · 2 issues · 2 watchers · last commit 4 months ago

About EpsteinFiles-RAG

A RAG pipeline implementation built on the 'Epstein Files 20K' dataset from Hugging Face (Teyler).

Platforms

Web Self-hosted

Languages

Python

Dataset source

👉 https://huggingface.co/datasets/teyler/epstein-files-20k

Precomputed Embeddings

👉 https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB


⚡ Quick Demo

Process 2M+ document lines → Get accurate, grounded answers in seconds

What it does:

  • Automatically cleans and reconstructs fragmented documents
  • Intelligently chunks documents while preserving context
  • Embeds everything into a searchable vector database
  • Retrieves diverse, relevant information using MMR algorithm
  • Generates answers grounded solely in the retrieved context

🎯 Key Features

✅ No Hallucinations - Answers only from source documents
✅ Intelligent Retrieval - MMR algorithm for diverse results
✅ Fast Processing - ~1 second end-to-end query response
✅ Semantic Understanding - Context-aware document chunking
✅ REST API - Easy integration with other systems
✅ Interactive UI - Streamlit web interface included
✅ Scalable - Handles 100K+ document chunks
✅ Production-Ready - Async support, error handling, logging


🏗️ How It Works

Three Simple Stages

Stage 1: Data Preparation

Raw Documents (2.5M lines)
    ↓
Clean & Reconstruct
    ↓
Smart Chunking
    ↓
Vector Embeddings
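
To make the first two boxes of Stage 1 concrete, here is a minimal sketch, not the project's actual code. It assumes the raw data arrives as `(source_file, line_text)` rows (a hypothetical shape) and it swaps in a plain overlapping word window for the semantic chunking the project describes. The function names and default sizes are illustrative only.

```python
from collections import defaultdict


def reconstruct_documents(rows):
    """Group (source_file, line_text) rows back into one text per file.

    The row shape is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for source_file, line_text in rows:
        cleaned = " ".join(line_text.split())  # collapse runs of whitespace
        if cleaned:  # skip lines that are empty after cleaning
            grouped[source_file].append(cleaned)
    return {name: " ".join(lines) for name, lines in grouped.items()}


def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word windows; each shares `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap is what keeps some context across chunk boundaries: a sentence cut at the end of one window reappears at the start of the next.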

Stage 2: Intelligent Retrieval

User Question
    ↓
Find Similar Context (MMR)
    ↓
Return Top Chunks

Stage 3: Grounded Answer

Context + Question
    ↓
LLaMA 3.3 LLM
    ↓
Grounded Answer (with sources)
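
One way to keep answers tied to the retrieved context is to put the restriction in the prompt itself. The sketch below shows the idea under assumptions: the `(source, text)` chunk shape, the function name, and the instruction wording are all illustrative and not taken from the repository.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks.

    `chunks` is a list of (source, text) pairs; that shape is an assumption.
    """
    # Number each passage so the model can cite it in the answer.
    context = "\n\n".join(
        f"[{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed numbers of the passages you used. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string would be sent to the LLM as the user message. The instruction to say so when the context is insufficient reduces the chance of unsupported answers, though a prompt alone cannot guarantee it.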

Why MMR Instead of Similarity?

Previous Approach: Pure semantic similarity
→ Returned redundant chunks from same document

Current Approach: Maximal Marginal Relevance (MMR)
→ Balances relevance + diversity for comprehensive context
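
The trade-off can be illustrated with a small, dependency-free sketch of MMR. It is a textbook-style version written for this explanation, not the retrieval code the project runs. The names `mmr_select` and `lambda_mult` are chosen here for illustration. At each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by MMR, in selection order."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        # Redundancy: similarity to the closest already-selected document.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For the query `[1.0, 0.5]` and documents `[[1, 0], [1, 0], [0, 1]]`, `mmr_select(..., k=2, lambda_mult=0.5)` returns `[0, 2]`: the duplicate at index 1 is skipped in favor of the less similar document. With `lambda_mult=1.0` the redundancy term vanishes and the result is `[0, 1]`, which is the pure-similarity behavior described above.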


📦 Installation

Requirements

Setup (5 minutes)

1. Clone repository

git clone https://github.com/AnkitNayak-eth/EpsteinFiles-RAG.git
cd EpsteinFiles-RAG

2. Create virtual environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment

Create .env file:

GROQ_API_KEY=your_api_key_here
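
To fail fast when the key is missing, a script could check it at startup. The following is a standard-library sketch. The helper names `load_env_file` and `require_groq_key` are hypothetical, and the project itself may load `.env` differently (for example through a dotenv package).

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Values already present in the environment are left untouched.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def require_groq_key():
    """Return GROQ_API_KEY, or raise a clear error if it is missing or a placeholder."""
    key = os.environ.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env")
    return key
```

Calling `load_env_file()` followed by `require_groq_key()` at the top of a script surfaces a missing or placeholder key before any long-running step starts.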

🚀 Getting Started

Run Complete Pipeline (First Time)

This processes data and prepares the system for queries:

# Step 1: Download raw data (~5-15 min)
python ingest/download_dataset.py

# Step 2: Clean and reconstruct documents (~3-8 min)
python ingest/clean_dataset.py

# Step 3: Create semantic chunks (~5-12 min)
python ingest/chunk_dataset.py

# Step 4: Generate embeddings (~20-45 min)
python ingest/embed_chunks.py
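
If you would rather start the four scripts with one command, a small wrapper can run them in order and stop at the first failure. This is a convenience sketch, not something shipped with the repository. The `run_pipeline` name and the injectable `runner` parameter are assumptions made here so the function can be exercised without running the real scripts.

```python
import subprocess
import sys

# Script paths as listed in the commands above, in execution order.
PIPELINE_STEPS = [
    "ingest/download_dataset.py",
    "ingest/clean_dataset.py",
    "ingest/chunk_dataset.py",
    "ingest/embed_chunks.py",
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each script with the current interpreter; stop on the first failure."""
    for script in steps:
        print(f"Running {script} ...")
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner([sys.executable, script], check=True)
```

Call `run_pipeline()` from the repository root with the virtual environment active. Using `sys.executable` ensures the scripts run with the same interpreter, and therefore the same installed dependencies, as the wrapper.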

Start Using the System

Terminal 1 - Start API Server

uvicorn api.main:app --reload

API runs at: http://127.0.0.1:8000

Terminal 2 - Start Web UI

streamlit run app.py

UI opens at: http://localhost:8501

That's it! You can now query through the web interface or API.
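
For programmatic access, a client only needs to send a question to the running API. The route and payload in this sketch are assumptions: `/ask` and the `{"question": ...}` body are hypothetical placeholders, so check `api/main.py` or the interactive docs that FastAPI serves at `/docs` by default for the actual endpoint.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # address given above
ASK_PATH = "/ask"  # hypothetical route; confirm the real one in api/main.py


def build_request(question, base=API_BASE, path=ASK_PATH):
    """Build a JSON POST request; the payload shape is an assumption."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        base + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=30):
    """Send the question to the running API and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(question), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server from Terminal 1 running, `ask("your question")` would return whatever JSON the endpoint produces. Adjust `ASK_PATH` and the payload keys once you know the real schema.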


📚 Project Structure

EpsteinFiles-RAG/
├── ingest/                    # Data processing pipeline
│   ├── download_dataset.py    # Download from Hugging Face
│   ├── clean_dataset.py       # Clean & reconstruct docs
│   ├── chunk_dataset.py       # Semantic chunking
│   └── embed_chunks.py        # Embed & index
├── api/                       # FastAPI backend
│   ├── main.py                # API routes
│   └── models.py              # Data models
├── app.py                     # Streamlit UI
├── requirements.txt           # Python dependencies
├── .env.example               # Environment template
└── README.md                  # This file

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📞 Support

Built by: Ankit Kumar Nayak
Full-Stack Developer | AI & RAG Systems

Get Help:


⚠️ Disclaimer

This project is built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines when using this system.