Enterprise Agentic RAG Platform
Table of Contents
- System Overview
- RAG Methodologies & Agentic Logic
- Prerequisites & Tooling
- Phase 1: Infrastructure Initialization (Terraform)
- Phase 2: Cluster Bootstrapping (Kubernetes)
- Phase 3: The Data Plane (Ray & Databases)
- Phase 4: The Control Plane (API Deployment)
- Phase 5: Data Ingestion Pipeline
- Validation & Testing
- Cost Optimization & Scaling
- Troubleshooting
1. System Overview
This repository contains the source code and Infrastructure-as-Code (IaC) definitions for a production-grade Retrieval-Augmented Generation (RAG) system. Unlike standard RAG implementations, this platform utilizes an Agentic Architecture (via LangGraph) to perform multi-step reasoning, query expansion, and hybrid retrieval (Vector + Knowledge Graph).
High-Level Architecture
The system is decoupled into two primary processing planes:
- Control Plane (The Brain): Handles HTTP requests, state management, agent orchestration, and business logic. Runs on low-cost CPU nodes.
- Data Plane (The Muscle): Handles heavy compute tasks including LLM Inference, Embedding generation, and Graph Extraction. Runs on autoscaling GPU nodes via Ray.
2. RAG Methodologies & Agentic Logic
This platform implements advanced RAG techniques to solve common failure modes (hallucination, retrieval misses).
2.1. The Planning Agent (services/api/app/agents/)
Instead of a linear chain, we use LangGraph to model the RAG process as a state machine.
- Planner Node: Analyzes user intent. Decides whether to perform a direct answer, a retrieval, or use a tool (Code Interpreter).
- Query Rewriter: Uses an LLM to rewrite the user's query, resolving coreferences (e.g., changing "How much does it cost?" to "How much does Kubernetes cost?").
- HyDE (Hypothetical Document Embeddings): Generates a fake "ideal" answer, embeds it, and uses that vector to find real documents. This bridges the semantic gap between a question and a declarative statement.
2.2. Hybrid Retrieval
- Vector Search (Qdrant): Uses BGE-M3 embeddings (dense retrieval) to find semantically similar text chunks.
- Graph Search (Neo4j): Executes Cypher queries to find entities and their relationships (e.g.,
(Entity A)-[RELATED_TO]->(Entity B)). This captures structural knowledge that vector search misses.
3. Prerequisites & Tooling
Ensure the following tools are installed on your workstation (Bastion Host):
- AWS CLI (
v2.x): Configured with AdministratorAccess. - Terraform (
v1.5+): For infrastructure provisioning. - Kubectl (
v1.29+): For Kubernetes interaction. - Helm (
v3.x): For chart management. - Python (
3.10+): For local scripting. - Docker: For building container images.
4. Phase 1: Infrastructure Initialization (Terraform)
We use Terraform to provision the "Hardware" layer: VPC, EKS Control Plane, S3, and RDS.
4.1. Remote State Setup (Manual Step)
Terraform requires a backend to store the state file safely.
- Log in to the AWS Console.
- Navigate to S3 and create a bucket named:
rag-platform-terraform-state-prod-001(Must be unique globally). - Navigate to DynamoDB and create a table named
terraform-state-lock.- Partition Key:
LockID(String).
- Partition Key:
4.2. Provisioning Resources
Navigate to the infrastructure directory:
cd infra/terraform
Initialize the backend and providers:
terraform init
Review the execution plan. This will show creation of:
- VPC: 10.0.0.0/16 with 3 Public, 3 Private, and 3 Database subnets.
- EKS: Cluster named
rag-platform-cluster(Version 1.29). - RDS: Aurora Postgres Serverless v2.
- IAM: OIDC Providers and IRSA roles.
terraform plan -var="db_password=YourStrongPassword#123" -out=tfplan
Apply the infrastructure (Estimated time: 20 minutes):
terraform apply tfplan
4.3. Connection Configuration
Once Terraform completes, configure kubectl to communicate with the new cluster:
aws eks update-kubeconfig --region us-east-1 --name rag-platform-cluster
Verify connectivity:
kubectl get nodes
# Expected Output: ip-10-0-x-x.ec2.internal Ready <none> m6i.large
5. Phase 2: Cluster Bootstrapping (Kubernetes)
The EKS cluster is currently empty. We need to install the core system controllers.
5.1. Run Bootstrap Script
Execute the helper script to install Karpenter (Autoscaler), KubeRay Operator, External Secrets, and Ingress Controller.
cd scripts
chmod +x bootstrap_cluster.sh
./bootstrap_cluster.sh
5.2. Configure Karpenter Provisioners
Karpenter is responsible for analyzing unschedulable pods and spinning up EC2 instances dynamically.
Apply the CPU Provisioner (For API & System pods):
kubectl apply -f infra/karpenter/provisioner-cpu.yaml
- Technical Detail: This targets
m6i,c6iinstances and uses Spot pricing where available.
Apply the GPU Provisioner (For AI Inference):
kubectl apply -f infra/karpenter/provisioner-gpu.yaml
- Technical Detail: This targets
g5(Nvidia A10G) instances. It creates a taintnvidia.com/gpu=true:NoScheduleto prevent non-AI pods from accidentally using expensive nodes.
6. Phase 3: The Data Plane (Ray & Databases)
We now deploy the "Muscle" of the system.
6.1. Deploy Vector & Graph Databases
In a full production environment, you might use Terraform managed services (AWS Neptune / Qdrant Cloud), but for this setup, we deploy HA clusters inside K8s.
# Deploy Qdrant
helm upgrade --install qdrant deploy/helm/qdrant --namespace default
# Deploy Neo4j
helm upgrade --install neo4j deploy/helm/neo4j --namespace default
6.2. Deploy Ray Cluster
The Ray Cluster consists of a Head Node (orchestrator) and Worker Groups.
kubectl apply -f deploy/ray/ray-cluster.yaml
Verification:
kubectl get pods -l ray.io/cluster=rag-ray-cluster
# Wait until the 'ray-head' pod is Running.
6.3. Deploy Model Services (Ray Serve)
We deploy two separate Ray Services. These utilize the ServeConfigV2 specification.
A. Embedding Service (BGE-M3):
kubectl apply -f deploy/ray/ray-serve-embed.yaml
B. LLM Service (vLLM / Llama-3-70B): This is the most resource-intensive step.
kubectl apply -f deploy/ray/ray-serve-llm.yaml
What happens technically:
- The
RayServiceCRD submits a request to the Ray Head. - Ray realizes it needs 1 GPU (
nvidia.com/gpu: 1resource request). - The Ray Worker pod goes into
Pendingstate. - Karpenter detects the pending pod, calls AWS Fleet API, and provisions a
g5.xlargeinstance. - Once the node joins (approx. 90s), the pod starts, downloads the weights (approx. 40GB) from HuggingFace, and initializes the vLLM engine (PagedAttention).
7. Phase 4: The Control Plane (API Deployment)
7.1. Secret Management
Create the Kubernetes Secret containing database credentials and keys.
kubectl create secret generic app-env-secret \
--from-literal=DATABASE_URL="postgresql+asyncpg://ragadmin:YourStrongPassword#123@rag-platform-cluster-postgres.cluster-xxxx.us-east-1.rds.amazonaws.com:5432/rag_db" \
--from-literal=REDIS_URL="redis://rag-redis-prod.xxxx.ng.0001.use1.cache.amazonaws.com:6379/0" \
--from-literal=NEO4J_PASSWORD="password" \
--from-literal=JWT_SECRET_KEY="$(openssl rand -hex 32)" \
--from-literal=QDRANT_HOST="qdrant" \
--from-literal=RAY_LLM_ENDPOINT="http://llm-service:8000/llm" \
--from-literal=RAY_EMBED_ENDPOINT="http://embed-service:8000/embed"
7.2. Deploy the API
Deploy the FastAPI application using Helm.
helm upgrade --install api deploy/helm/api
7.3. Apply Ingress
Configure the Load Balancer to route traffic to the API.
kubectl apply -f deploy/ingress/nginx.yaml
- Note: Get your Load Balancer DNS name via
kubectl get ingress. Map your domain (CNAME) to this DNS.
8. Phase 5: Data Ingestion Pipeline
The system requires data to function. The ingestion pipeline is an asynchronous, distributed Ray Job.
8.1. Bulk Upload to S3
Upload your dataset (PDF, DOCX, HTML) to the S3 bucket created by Terraform.
# Retrieve bucket name
BUCKET_NAME=$(cd infra/terraform && terraform output -raw s3_documents_bucket_name)
# Run bulk uploader script
python scripts/bulk_upload_s3.py ./data/finance_reports $BUCKET_NAME
8.2. Trigger Ingestion Job
Normally triggered by S3 Events, we can manually submit the job to the Ray Cluster.
-
Port Forward Ray Dashboard:
kubectl port-forward service/rag-ray-cluster-head-svc 8265:8265 -
Submit Job via Python SDK:
python -m pipelines.jobs.s3_event_handler
Technical Workflow:
- Ray Data reads binaries from S3 lazily.
- MapBatches (CPU):
unstructuredlibrary parses PDFs (OCR via Tesseract if needed) and chunks text (512 tokens). - MapBatches (GPU - Embed): Chunks are sent to the
embed-serviceActor. - MapBatches (GPU - Graph): Chunks are sent to the
llm-serviceto extract(Subject, Predicate, Object)tuples. - Write:
- Vectors -> Qdrant (Upsert).
- Nodes/Edges -> Neo4j (MERGE queries).
9. Validation & Testing
9.1. Health Checks
Verify the API connects to all subsystems.
curl https://<YOUR_ALB_DNS>/health/readiness
# Expected: {"redis": "up", "neo4j": "up"}
9.2. End-to-End Chat Test
Perform a request to verify the Agentic flow (Authentication required).
- Obtain Token (Dev Mode): Use the
jwt.pyutility or disable auth inconfig.pytemporarily for testing. - Curl Request:
curl -X POST https://<YOUR_ALB_DNS>/api/v1/chat/stream \ -H "Content-Type: application/json" \ -d '{ "message": "Analyze the financial risks mentioned in the Q3 report.", "session_id": "test-session-1" }'
10. Cost Optimization & Scaling
The system uses aggressive scaling policies to minimize costs:
- Spot Instances:
provisioner-gpu.yamlis configured to request Spot instances (karpenter.sh/capacity-type: spot). This reduces GPU costs by ~70%. - Scale-to-Zero:
- The Ray Autoscaler is configured in
ray-serve-llm.yamlwithmin_replicas: 1(can be 0 for dev). - If
min_replicasis 0 and no requests arrive, Ray kills the Pod. - Karpenter sees the node is empty (TTL 30s) and terminates the EC2 instance.
- The Ray Autoscaler is configured in
11. Troubleshooting
- Pod Pending (Insufficient CPU/Mem): Check
kubectl describe pod <pod_name>. If it saysFailedScheduling, check if Karpenter logs showlaunching node. - Ray Actor Death: Check Ray Dashboard
http://localhost:8265. Common issue is OOM (Out Of Memory) on the GPU. Decreasemax_num_seqsinllama-70b.yaml. - Database Connection Refused: Ensure Security Groups in
infra/terraform/vpc.tfallow traffic on ports 5432 (Postgres), 6333 (Qdrant), and 7687 (Neo4j) from the EKS Subnet CIDR.
12. Contributing
- Create a feature branch (
git checkout -b feature/amazing-feature). - Commit your changes.
- Run tests (
make test). - Push to the branch.
- Open a Pull Request.
License
Distributed under the MIT License. See LICENSE for more information.