EpsteinFiles-RAG
A retrieval-augmented generation (RAG) pipeline built on the 'Epstein Files 20K' dataset published by Teyler on Hugging Face.
Dataset source
https://huggingface.co/datasets/teyler/epstein-files-20k
Precomputed Embeddings
https://huggingface.co/datasets/devankit7873/EpsteinFiles-Vector-Embeddings-ChromaDB
Quick Demo
Process 2M+ document lines → get accurate, grounded answers in seconds
What it does:
- Automatically cleans and reconstructs fragmented documents
- Intelligently chunks documents while preserving context
- Embeds everything into a searchable vector database
- Retrieves diverse, relevant information using the Maximal Marginal Relevance (MMR) algorithm
- Generates answers grounded solely in the retrieved context
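The cleaning and reconstruction step in the list above could look roughly like the sketch below. It assumes each raw row carries a `filename` and a `text` field and that rows arrive in document order; both are assumptions about the dataset schema, not something this README confirms. `clean_line` and `reconstruct_documents` are hypothetical helper names, not functions from this repository.

```python
import re
from collections import defaultdict


def clean_line(text: str) -> str:
    """Drop non-printable characters and collapse runs of whitespace."""
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def reconstruct_documents(rows):
    """Group cleaned lines by source file, preserving their original order.

    `rows` is an iterable of dicts with the assumed keys "filename" and "text".
    """
    grouped = defaultdict(list)
    for row in rows:
        line = clean_line(row["text"])
        if line:  # skip lines that are empty after cleaning
            grouped[row["filename"]].append(line)
    return {name: "\n".join(lines) for name, lines in grouped.items()}


rows = [
    {"filename": "doc_a.txt", "text": "  Page 1   of 3 "},
    {"filename": "doc_b.txt", "text": "Other file"},
    {"filename": "doc_a.txt", "text": "\x00continued text"},
    {"filename": "doc_a.txt", "text": "   "},
]
docs = reconstruct_documents(rows)
```

Here the three fragments of `doc_a.txt` are stitched back into one document even though a row from another file sits between them.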
Key Features
- No Hallucinations - Answers only from source documents
- Intelligent Retrieval - MMR algorithm for diverse results
- Fast Processing - ~1 second end-to-end query response
- Semantic Understanding - Context-aware document chunking
- REST API - Easy integration with other systems
- Interactive UI - Streamlit web interface included
- Scalable - Handles 100K+ document chunks
- Production-Ready - Async support, error handling, logging
How It Works
Three Simple Stages
Stage 1: Data Preparation
Raw Documents (2.5M lines)
↓
Clean & Reconstruct
↓
Smart Chunking
↓
Vector Embeddings
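The chunking strategy is not spelled out here, so the following is only a minimal sketch of one common approach: fixed-size word windows that overlap, so text cut at a boundary still appears with some context in the next chunk. `chunk_text`, `chunk_size`, and `overlap` are hypothetical names and values; the repository's `ingest/chunk_dataset.py` may use a different splitter entirely.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
# chunks == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Each chunk repeats the last word of the previous one, which is the overlap doing its job at a small scale.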
Stage 2: Intelligent Retrieval
User Question
↓
Find Similar Context (MMR)
↓
Return Top Chunks
Stage 3: Grounded Answer
Context + Question
↓
LLaMA 3.3 LLM
↓
Grounded Answer (with sources)
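To illustrate how Stage 3 can keep answers tied to the retrieved context, here is a small sketch of prompt assembly. The wording of the instructions, the function name `build_grounded_prompt`, and the chunk keys `source` and `text` are all assumptions made for illustration; the prompt actually used by this project may differ.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context.

    Each chunk is a dict with the assumed keys "source" and "text".
    """
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you do not know. "
        "Cite the bracketed source numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "Who signed the letter?",
    [{"source": "doc_a.txt", "text": "The letter was signed by the clerk."}],
)
```

Numbering the chunks gives the model something concrete to cite, which is one simple way to return an answer "with sources".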
Why MMR Instead of Similarity?
Previous Approach: Pure semantic similarity
→ Returned redundant chunks from the same document
Current Approach: Maximal Marginal Relevance (MMR)
→ Balances relevance and diversity for more comprehensive context
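The difference between the two approaches is easiest to see on toy vectors. The sketch below is a plain-Python rendering of the standard MMR selection rule, not this project's retrieval code; `mmr`, `lambda_mult`, and the example vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    Each round picks the candidate that maximises
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, selected).
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [0.9, 0.4],   # most relevant
    [0.9, 0.5],   # near-duplicate of the first
    [0.8, -0.6],  # slightly less relevant, but covers a different direction
]
top_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
top_mmr = mmr(query, docs, k=2, lambda_mult=0.5)
```

Pure similarity returns the two near-duplicates (indices 0 and 1), while MMR keeps the best match and swaps the duplicate for the distinct third vector (indices 0 and 2). Raising `lambda_mult` toward 1 moves the behavior back toward pure similarity.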
Installation
Requirements
- Python 3.11+
- 16GB RAM recommended (8GB minimum)
- Groq API key (free at console.groq.com)
Setup (5 minutes)
1. Clone repository
git clone https://github.com/AnkitNayak-eth/EpsteinFiles-RAG.git
cd EpsteinFiles-RAG
2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
3. Install dependencies
pip install -r requirements.txt
4. Configure environment
Create a .env file:
GROQ_API_KEY=your_api_key_here
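It can help to fail early when the key is missing or still set to the placeholder. The sketch below uses only the standard library to parse `KEY=VALUE` lines and validate the result; `parse_env` and `require_groq_key` are hypothetical helpers, and the project itself may load `.env` through a dedicated library instead.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_groq_key(env: dict[str, str]) -> str:
    """Return the Groq key, or raise with a clear message if it is unusable."""
    key = env.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GROQ_API_KEY is missing; set it in .env")
    return key


# Example content only; in practice this text would be read from the .env file.
env = parse_env("# local settings\nGROQ_API_KEY=gsk_example_123\n")
api_key = require_groq_key(env)
```

The placeholder check catches the common mistake of copying the template line without replacing the value.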
Getting Started
Run Complete Pipeline (First Time)
This processes data and prepares the system for queries:
# Step 1: Download raw data (~5-15 min)
python ingest/download_dataset.py
# Step 2: Clean and reconstruct documents (~3-8 min)
python ingest/clean_dataset.py
# Step 3: Create semantic chunks (~5-12 min)
python ingest/chunk_dataset.py
# Step 4: Generate embeddings (~20-45 min)
python ingest/embed_chunks.py
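If you would rather not launch the four scripts by hand, a small wrapper can run them in order and stop at the first failure. This is a convenience sketch, not a file that ships with the repository; `run_pipeline` and its `runner` parameter are hypothetical, and the script paths are the ones listed above.

```python
import subprocess
import sys

PIPELINE_STEPS = [
    "ingest/download_dataset.py",
    "ingest/clean_dataset.py",
    "ingest/chunk_dataset.py",
    "ingest/embed_chunks.py",
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each script in order with the current interpreter.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so later steps never run on top of a failed earlier step.
    """
    executed = []
    for script in steps:
        runner([sys.executable, script], check=True)
        executed.append(script)
    return executed
```

Calling `run_pipeline()` from the repository root inside the activated virtual environment would then execute all four steps. The `runner` argument exists only so the function can be exercised without actually starting the scripts.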
Start Using the System
Terminal 1 - Start API Server
uvicorn api.main:app --reload
API runs at: http://127.0.0.1:8000
Terminal 2 - Start Web UI
streamlit run app.py
UI opens at: http://localhost:8501
That's it! You can now query through the web interface or API.
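For programmatic access, a request to the API can be built with the standard library alone. Note that the route and payload below are placeholders: this README does not document the endpoint, so the `/ask` path and the `{"question": ...}` body are assumptions. Check `api/main.py` (or the interactive docs FastAPI serves at `/docs` by default) for the real route and schema.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"


def build_query_request(question: str, path: str = "/ask") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the local API.

    The "/ask" path and the {"question": ...} body are hypothetical.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("Which documents mention a deposition?")

# To send it once the API server is running:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to adjust the path or payload once you have confirmed the real schema.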
Project Structure
EpsteinFiles-RAG/
├── ingest/                  # Data processing pipeline
│   ├── download_dataset.py  # Download from Hugging Face
│   ├── clean_dataset.py     # Clean & reconstruct docs
│   ├── chunk_dataset.py     # Semantic chunking
│   └── embed_chunks.py      # Embed & index
├── api/                     # FastAPI backend
│   ├── main.py              # API routes
│   └── models.py            # Data models
├── app.py                   # Streamlit UI
├── requirements.txt         # Python dependencies
├── .env.example             # Environment template
└── README.md                # This file
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Dataset: Teyler/Epstein Files 20K on Hugging Face
- Embeddings: Sentence Transformers
- Vector DB: Chroma
- LLM Inference: Groq Cloud
- Framework: LangChain
- UI: Streamlit
Support
Built by: Ankit Kumar Nayak
Full-Stack Developer | AI & RAG Systems
Get Help:
- Open an Issue
- Start a Discussion
Disclaimer
This project is built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines when using this system.