AI System Design Guide
The Complete Interview & Production Reference
If this guide helps you, follow @ombharatiya on GitHub, X, and LinkedIn to get notified when new chapters, model refreshes, and interview prompts ship.
The living reference for production AI systems. Continuously updated. Interview-ready depth.
A practical, continuously updated guide to AI system design, RAG architectures, LLM engineering, agentic AI, MCP and A2A protocols, and AI engineering interview preparation. Covers production patterns, model selection, evaluation, and real-world case studies from staff-level interviews.
New here? Jump to the 116-question Interview Bank, the RAG Fundamentals chapter, or pick the right LLM for production.
Quick Navigation
| I want to... | Start here |
|---|---|
| Prepare for interviews | Question Bank → Answer Frameworks |
| Learn AI systems fast | LLM Internals → RAG Fundamentals |
| Build production RAG | Chunking → Vector DBs → Reranking → Production RAG |
| Advanced retrieval | Contextual Retrieval → ColBERT → Multi-modal RAG |
| Design multi-tenant AI | Isolation Patterns → Case Study |
| Build agents | Agent Fundamentals → MCP & A2A → LangGraph |
| Tool-use & computer agents | Landscape → OpenClaw → Safety |
| Autonomous coding agents | Claude Code → OpenCoder Landscape |
| Pick the right model (2026) | Model Taxonomy → Pricing |
| Evaluate AI in production | AI Evals Guide (Phoenix/Langfuse) → AI Evals Guide (LangWatch/Langfuse) |
| Find the best courses to learn AI | Recommended Courses & Learning Paths |
| Transition from my current role to AI | Role Transition Guide |
| Understand the 2026 AI job market | Job Market Trends - June 2026 |
| Get a quick answer to a common question | FAQ (RAG, agents, models, eval, inference, memory, security) |
| Look up a term | Glossary (every term defined) |
Pick a path
flowchart TD
A[New visitor] --> B{Your goal}
B -->|Interview prep| C[Question Bank]
B -->|Build RAG| D[RAG Fundamentals]
B -->|Build agents| E[Agent Fundamentals]
B -->|Pick a model| F[Model Taxonomy]
B -->|Evaluate AI| G[AI Evals Guide]
C --> H[Answer Frameworks]
D --> I[Chunking + Vector DBs]
E --> J[MCP and Tool Use]
F --> K[Pricing 2026]
G --> L[Phoenix or LangWatch]
Why This Guide
Printed books are outdated before they ship. This is a living document: it is updated when new models are released and when patterns evolve.
| This Guide | Printed Books |
|---|---|
| June 2026 models (Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4 Pro, Llama 4, Kimi K2.6, Qwen 3.6, Mistral Medium 3.5, Gemma 4) | Stuck on GPT-4 |
| MCP 2.0, A2A v1.0, OpenClaw, Computer Use, Agentic RAG, ColBERT, latent reasoning, MoE serving | Does not exist |
| Real pricing with June 2026 verification dates | Already wrong |
| Staff-level interview Q&A (116 questions through June 2026) + Job Market Trends | Generic questions |
Quick model picker (June 2026): Claude Fable 5 for the capability ceiling ($10/$50 per 1M), Claude Opus 4.8 for tool-use and long-horizon agentic coding, GPT-5.5 for general production, Gemini 3.1 Pro for multimodal, DeepSeek V4 Flash ($0.14/$0.28 per 1M) or V4 Pro ($0.435/$0.87) for cheap frontier-class output, Llama 4 for self-hosted. Full breakdown in Model Taxonomy.
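To make the price gap in the quick model picker concrete, here is a minimal cost sketch. The per-1M prices are copied from the picker above; the model keys and the workload (1M requests per month, 2,000 input and 500 output tokens each) are hypothetical and only illustrate the arithmetic.

```python
# USD per 1M tokens as (input, output), copied from the quick model picker.
# The dictionary keys are illustrative labels, not official API model IDs.
PRICES_PER_1M = {
    "claude-fable-5": (10.00, 50.00),
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (0.435, 0.87),
}


def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend in USD for a fixed per-request token profile."""
    input_price, output_price = PRICES_PER_1M[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return requests * per_request
```

For the assumed workload this works out to roughly $45,000 per month on Claude Fable 5, $1,305 on DeepSeek V4 Pro, and $420 on DeepSeek V4 Flash. The estimate ignores prompt caching, batch discounts, and retries, all of which shift real bills.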
What This Guide Is (and Is Not)
This guide IS:
- A staff-level reference for designing production AI systems (RAG, agents, MCP, eval pipelines, multi-tenant isolation).
- An interview-prep companion with 116 real questions, answer frameworks with a worked mock transcript, and nine whiteboard exercises through June 2026.
- A living document tracking new model releases, protocol changes, and emerging patterns as they ship.
- Opinionated about tradeoffs: latency vs cost, accuracy vs faithfulness, single-agent vs multi-agent.
- Free, MIT-licensed, and open to PRs from practitioners.
This guide is NOT:
- A tutorial on Python, PyTorch, or basic ML fundamentals (start with a course; see COURSES.md).
- A vendor-neutral hedge; it names specific models, prices, and frameworks because real systems require real choices.
- A replacement for hands-on building; read it alongside a project, not instead of one.
- A research paper digest; it cites papers when they change practice, not for completeness.
Guide Structure
├── 00-interview-prep/ # Questions (116), frameworks, exercises, job-market trends (June 2026)
├── 01-foundations/ # Transformers, attention, embeddings
├── 02-model-landscape/ # Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1, DeepSeek V4, Llama 4, Kimi K2.6, Qwen 3.6
├── 03-training-and-adaptation/ # Fine-tuning, LoRA, DPO, distillation
├── 04-inference-optimization/ # KV cache, PagedAttention, vLLM
├── 05-prompting-and-context/ # Prompt engineering, CoT, Extended Thinking, DSPy, prompt injection
├── 06-retrieval-systems/ # RAG, chunking, GraphRAG, Agentic RAG, ColBERT, Contextual Retrieval
├── 07-agentic-systems/ # MCP 2.0, A2A protocol, multi-agent, computer-use
├── 08-memory-and-state/ # L1-L3 memory tiers, Mem0, caching
├── 09-frameworks-and-tools/ # LangGraph, DSPy, LlamaIndex, Claude Code, OpenCoder
├── 10-document-processing/ # Vision-LLM OCR, multimodal parsing
├── 11-infrastructure-and-mlops/ # GPU clusters, LLMOps, cost management
├── 12-security-and-access/ # RBAC, ABAC, multi-tenant isolation
├── 13-reliability-and-safety/ # Guardrails, red-teaming
├── 14-evaluation-and-observability/ # RAGAS, LangSmith, drift detection
├── 15-ai-design-patterns/ # Pattern catalog, anti-patterns
├── 16-case-studies/ # Real-world architectures with diagrams
├── 17-tool-use-and-computer-agents/ # OpenClaw, Computer Use, tool agents, safety
├── GLOSSARY.md # Every term defined
│
├── ai_evals_comprehensive_study_guide.md # Deep-dive: AI Evals (Phoenix + Langfuse)
├── ai_evals_complete_guide_langwatch_langfuse.md # Deep-dive: AI Evals (LangWatch + Langfuse)
├── COURSES.md # Recommended courses & learning paths
└── TRANSITION_GUIDE.md # Transition from Backend/QA/PM/EM to AI roles
Chapters by AI System Lifecycle Stage
mindmap
  root((AI System Design Guide))
    Foundations
      LLM Internals
      Model Landscape
      Training and Adaptation
    Build
      Prompting and Context
      Retrieval Systems
      Agentic Systems
      Tool Use and Computer Agents
    Operate
      Inference Optimization
      Memory and State
      Frameworks and Tools
      Infrastructure and MLOps
    Govern
      Security and Access
      Reliability and Safety
      Evaluation and Observability
    Apply
      Design Patterns
      Case Studies
      Interview Prep
Featured Case Studies
Real interview problems with complete solutions and diagrams:
| Case Study | Problem | Key Patterns |
|---|---|---|
| Real-Time Search | 5-minute data freshness at scale | Streaming + Hybrid Search |
| Coding Agent | Autonomous multi-file changes | Sandboxing + Self-Correction |
| Multi-Tenant SaaS | Coca-Cola and Pepsi on same infra | Defense-in-Depth Isolation |
| Customer Support | 60% auto-resolution rate | Tiered Routing + Escalation |
| Document Intelligence | 50K contracts/month extraction | Vision-LLM + Parallel Extractors |
| Recommendation Engine | Personalized explanations at 50M users | ML Ranking + LLM Explanations |
| Compliance Automation | FDA regulation pre-screening | Claim Extraction + Precedent DB |
| Voice Healthcare | Real-time clinical note generation | On-Prem ASR + HIPAA |
| Fraud Detection | 100ms decision with explainability | ML + Rules Hybrid |
| Knowledge Management | 2M docs with access control | Permission-Aware RAG |
| Computer-Use Agent | Expense-report automation across 3 legacy UIs | Firecracker VMs + Action Gate + IPI Defense |
| Multi-Tenant Fine-Tuning | 280 tenants on shared base + per-tenant LoRA | LoRA Hot-Swap + Eval-as-PRD per Tenant |
| Eval-Gated CI/CD | Block PRs that regress AI quality | Golden Sets + LLM Judges + Statistical Correction |
| Customer Distillation | Cut $50K/mo frontier spend to $6K with 3-mo payback | Trace-Based Distillation + Canary Rollout |
| MCP Knowledge Agent | Cross-system answers from Snowflake/Confluence/Jira/Slack | MCP + OAuth Resource Server + Capability Gating |
Bonus Deep-Dive Guides
Two companion guides (3,000+ lines each) cover AI evaluation end to end for engineers, PMs, and QA:
| Guide | Platforms Covered | What's Inside |
|---|---|---|
| AI Evals: Comprehensive Study Guide | Arize Phoenix + Langfuse | LLM-as-a-Judge, RAG eval, multi-turn eval, production safety, statistical correction with judgy, 30-day learning path |
| AI Evals: LangWatch + Langfuse Guide | LangWatch + Langfuse | Same syllabus with LangWatch's 40+ built-in evaluators, side-by-side platform comparisons, platform choice guidance |
Topics covered across both guides:
- Tracing and observability setup (Phoenix, LangWatch, Langfuse)
- Error analysis: open coding → axial coding → failure mode taxonomy
- Building LLM judges with Train/Dev/Test split and ground truth calibration
- Code-based evaluators (regex, JSON schema, format validators)
- RAG-specific evals: faithfulness, context recall, answer relevance
- Multi-step pipeline evaluation and multi-turn conversation eval
- Production guardrails, safety monitoring, real-time drift detection
- Statistical correction with the `judgy` library
- Human annotation best practices and inter-rater reliability
- Cost/latency optimization for eval pipelines at scale
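As an illustration of the code-based evaluators item in the list above, here is a minimal sketch using only the Python standard library. The required keys (`answer`, `sources`) and the bracketed citation format are assumptions made for the example, not a format the companion guides prescribe.

```python
import json
import re

# Hypothetical citation format for this example: [1], [2], ...
CITATION_PATTERN = re.compile(r"\[\d+\]")


def is_valid_json_object(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object that contains every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def has_citation(output: str) -> bool:
    """Pass if the output contains at least one bracketed numeric citation."""
    return CITATION_PATTERN.search(output) is not None


def run_checks(output: str) -> dict[str, bool]:
    """Run every deterministic check and report pass/fail per check."""
    return {
        "valid_json": is_valid_json_object(output, {"answer", "sources"}),
        "has_citation": has_citation(output),
    }
```

Checks like these are cheap and deterministic, so they usually run before any LLM judge and catch format regressions without spending tokens.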
For Interview Prep
AI engineering and system design interviews ask questions like:
"Design a multi-tenant RAG system where competitors cannot see each other's data."
"Your agent takes 15 steps for a 3-step task. How do you debug it?"
This guide gives you concrete patterns, real tradeoffs, and production failure modes: the depth interviewers expect at senior levels.
Start with Interview Prep
Frequently Asked Questions
What is AI system design?
AI system design is the discipline of architecting production-grade systems built around LLMs, retrieval, agents, and evaluation. It covers model selection, RAG pipelines, agent orchestration, memory, observability, and safety. See LLM Internals and AI Design Patterns to get oriented.
How do I prepare for an AI engineering interview?
Start with the Question Bank (116 questions through June 2026), then practice with Answer Frameworks and Whiteboard Exercises. Most senior interviews test RAG design, agent debugging, multi-tenant isolation, and cost/latency tradeoffs, all covered in the Case Studies.
What is RAG (Retrieval-Augmented Generation)?
RAG is a pattern where an LLM retrieves relevant context from an external knowledge source (vector DB, search index, graph) before generating an answer, reducing hallucinations and grounding responses in your data. The full pipeline is covered in RAG Fundamentals and scaled in Production RAG at Scale.
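The retrieve-then-generate flow can be sketched in a few lines. This is a toy under stated assumptions: a bag-of-words counter stands in for a real embedding model, an in-memory list stands in for the vector DB, and the final LLM call is left out, so the sketch stops at the grounded prompt. All function names and documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Ground the question in retrieved context before it is sent to an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"


DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require an order number.",
]
```

With these documents, `retrieve("How long do refunds take?", DOCS, k=1)` selects the first document because it is the only one that shares a token with the query. The third document is missed because "refund" and "refunds" are different tokens, which is the kind of lexical gap that dense embeddings and hybrid search are meant to close.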
What are AI agents and how are they different from chatbots?
AI agents are LLM-driven systems that plan, call tools, and act over multiple steps to accomplish goals, whereas chatbots typically respond in a single turn. Agents introduce loops, memory, error recovery, and tool-use via protocols like MCP. Start with Agent Fundamentals.
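The loop that separates an agent from a single-turn chatbot can be sketched as follows. The tools and the scripted policy are hypothetical stand-ins: in a real agent the policy is an LLM call that chooses the next action from the goal and the history so far.

```python
from typing import Callable

# Hypothetical tools for illustration only.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
}

Step = tuple[str, str, str]  # (action, argument, observation)


def scripted_policy(goal: str, history: list[Step]) -> tuple[str, str]:
    """Stand-in for the LLM: look the order up once, then finish with the observation."""
    if not history:
        return ("lookup_order", goal)
    return ("finish", history[-1][2])


def run_agent(goal: str, policy: Callable[[str, list[Step]], tuple[str, str]],
              max_steps: int = 5) -> tuple[str, list[Step]]:
    """Plan, act, observe, repeat, with a hard step budget."""
    history: list[Step] = []
    for _ in range(max_steps):
        action, arg = policy(goal, history)
        if action == "finish":
            return arg, history
        tool = TOOLS.get(action)
        # Unknown tools become observations so the policy can recover on the next step.
        observation = tool(arg) if tool else f"error: unknown tool {action!r}"
        history.append((action, arg, observation))
    return "stopped: step budget exhausted", history
```

The step budget and the recorded history are the two pieces that matter most when debugging: the budget bounds runaway loops, and the history shows which step the agent kept repeating.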
What is MCP (Model Context Protocol) and how does it compare to A2A?
MCP is an open protocol that lets LLMs discover and call external tools and data sources in a standardized way. A2A (Agent-to-Agent) is a complementary protocol for inter-agent communication. They solve different layers: MCP is the tool boundary, A2A is the agent boundary. See Tool Use and MCP.
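To show what "discover and call" looks like on the wire, here is a minimal sketch of the two client requests. It assumes the JSON-RPC 2.0 framing and the `tools/list` and `tools/call` method names from the published MCP specification; the MCP 2.0 revision this guide refers to may differ in detail, and the tool name `search_docs` is hypothetical. No transport or server is included.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def list_tools_request() -> str:
    """Discovery: ask the server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool_request(name: str, arguments: dict) -> str:
    """Invocation: call one tool by name with JSON arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})
```

For example, `call_tool_request("search_docs", {"query": "refund policy"})` produces a message whose `method` is `tools/call` and whose `params` hold the tool name and arguments. A2A messages between agents sit at a different layer and are not modeled here.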
Which LLM should I use in production: Claude, GPT, Gemini, or open-source?
It depends on latency budget, context length, cost per million tokens, tool-use quality, and data residency. The Model Taxonomy and Pricing chapters give a head-to-head for Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4, Llama 4, and others as of June 2026.
How do I evaluate an LLM or RAG system in production?
Combine offline evals (LLM-as-a-judge with ground-truth calibration), online metrics (faithfulness, context recall, answer relevance), and continuous tracing. The companion deep-dives AI Evals: Phoenix + Langfuse and AI Evals: LangWatch + Langfuse walk through this end-to-end.
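Ground-truth calibration matters because an imperfect judge biases the pass rate it reports. Below is a minimal sketch of one standard correction (the Rogan-Gladen estimator), which uses the judge's true-positive and true-negative rates measured on a human-labeled set. It is an illustration only and does not use or reproduce the API of the `judgy` library or of either eval platform.

```python
def judge_rates(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) of the judge measured against human labels."""
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both passing and failing human-labeled examples")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    return tp / positives, tn / negatives


def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate."""
    denominator = tpr + tnr - 1.0
    if denominator <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    estimate = (observed + tnr - 1.0) / denominator
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion
```

As a worked example, a judge with TPR 0.8 and TNR 0.8 that reports a 62% pass rate on production traffic implies a true pass rate of about 70%, since 0.7 × 0.8 + 0.3 × 0.2 = 0.62. A point estimate like this should be paired with a confidence interval, for example from bootstrapping the labeled set.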
How do I build a multi-tenant RAG system safely?
Use defense-in-depth: per-tenant indexes or namespaces, query-time access checks, and prompt-layer guards. The Multi-Tenant RAG Isolation chapter and Multi-Tenant SaaS Case Study cover the patterns that hold up in interviews and production.
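A minimal sketch of the first two layers, assuming a hypothetical in-memory index with one namespace per tenant and naive keyword matching in place of vector search. The prompt-layer guard is omitted because it depends on the model and prompt format in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str


class TenantIndex:
    """Hypothetical in-memory store with one namespace per tenant."""

    def __init__(self) -> None:
        self._namespaces: dict[str, list[Chunk]] = {}

    def add(self, chunk: Chunk) -> None:
        self._namespaces.setdefault(chunk.tenant_id, []).append(chunk)

    def search(self, tenant_id: str, query: str) -> list[Chunk]:
        # Layer 1: only the caller's namespace is ever scanned.
        terms = query.lower().split()
        return [c for c in self._namespaces.get(tenant_id, [])
                if any(t in c.text.lower() for t in terms)]


def retrieve_for_tenant(index: TenantIndex, tenant_id: str, query: str) -> list[Chunk]:
    """Search, then re-verify ownership before anything reaches the prompt."""
    results = index.search(tenant_id, query)
    # Layer 2: query-time access check, in case the index is misconfigured.
    if any(c.tenant_id != tenant_id for c in results):
        raise PermissionError(f"cross-tenant chunk returned for {tenant_id}")
    return results
```

The second check looks redundant when the index is correct, which is the point of defense-in-depth: it fails closed when a namespace bug or a bad migration breaks the first layer.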
What is agentic RAG?
Agentic RAG combines retrieval with an agent loop that can decide what to search, when to re-query, and when to escalate, instead of running a single fixed retrieve-then-generate pass. See Agentic RAG for the architectures and tradeoffs.
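The re-query and escalate decisions can be sketched as a small control loop. Everything here is a hypothetical stand-in: the knowledge base is a two-entry dictionary, the rewriter is a fixed synonym table instead of an LLM, and "sufficient context" simply means that at least one passage came back.

```python
KNOWLEDGE = {
    "refund policy": "Refunds are processed within 5 business days.",
    "shipping times": "Standard shipping takes 3 to 7 days.",
}

SYNONYMS = {"reimbursement": "refund", "delivery": "shipping"}


def search(query: str) -> list[str]:
    """Toy retriever: return passages whose key shares a word with the query."""
    words = set(query.lower().split())
    return [text for key, text in KNOWLEDGE.items() if words & set(key.split())]


def rewrite(query: str) -> str:
    """Stand-in for an LLM query rewriter."""
    return " ".join(SYNONYMS.get(w, w) for w in query.lower().split())


def agentic_retrieve(query: str, max_attempts: int = 3) -> dict:
    """Search, check sufficiency, re-query if needed, and escalate when out of options."""
    current, attempts = query, 0
    while attempts < max_attempts:
        attempts += 1
        passages = search(current)
        if passages:  # sufficiency check; a real system might grade context with an LLM
            return {"status": "answerable", "query": current,
                    "passages": passages, "attempts": attempts}
        rewritten = rewrite(current)
        if rewritten == current:
            break  # nothing new to try
        current = rewritten
    return {"status": "escalate", "query": current, "passages": [], "attempts": attempts}
```

Here `agentic_retrieve("reimbursement rules")` fails on the first pass, rewrites the query to "refund rules", and succeeds on the second, while `agentic_retrieve("warranty terms")` escalates after one attempt because no rewrite is available. A fixed retrieve-then-generate pass would have answered both from empty context.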
Is this guide free? Can I contribute?
Yes, MIT-licensed and free. PRs are welcome; see Contributing Guide. If you have production failure modes, new model benchmarks, or interview questions to add, open a PR.
How often is this guide updated?
Continuously. New model releases, protocol changes (MCP, A2A), and emerging patterns are added as they ship. Recent additions include Tool-Use and Computer Agents and the June 2026 Job Market Trends.
Can I use this guide if I am transitioning from backend, QA, PM, or EM into AI?
Yes. The Role Transition Guide maps existing skills to AI engineering, MLE, and AI architect tracks, with reading paths per role. Pair it with COURSES.md for curated learning resources.
Living Book
This guide tracks:
- New model releases and real-world performance
- Emerging patterns (MCP, Agentic RAG, Flow Engineering)
- Updated pricing and rate limits
- Deprecations and best practice changes
Star and Watch the repo to get notified when updates are pushed.
Contributing
Found outdated info? Have production experience to share? PRs welcome. See Contributing Guide.
Stay Connected
If this guide helps you, the easiest way to support it is to follow along where new chapters and refreshes get announced first:
- GitHub: @ombharatiya - follow for the repo, star the project, and watch for new releases.
- X / Twitter: @ombharatiya - short takes on model releases, MCP, agents, and interviews.
- LinkedIn: ombharatiya - deeper writeups and interview prep tips for senior AI roles.
License
MIT License. See LICENSE.
Built and maintained by Om Bharatiya · GitHub · Twitter · LinkedIn