ai-system-design-guide

About ai-system-design-guide

AI system design guide for engineers building production AI systems and evals.

🧠 AI System Design Guide

The Complete Interview & Production Reference

Follow on GitHub Follow on Twitter Connect on LinkedIn

If this guide helps you, follow @ombharatiya on GitHub, X, and LinkedIn to get notified when new chapters, model refreshes, and interview prompts ship.

The living reference for production AI systems. Continuously updated. Interview-ready depth.

A practical, continuously updated guide to AI system design, RAG architectures, LLM engineering, agentic AI, MCP and A2A protocols, and AI engineering interview preparation. Covers production patterns, model selection, evaluation, and real-world case studies from staff-level interviews.

New here? Jump to the 116-question Interview Bank, the RAG Fundamentals chapter, or pick the right LLM for production.


📚 Quick Navigation

| I want to... | Start here |
| --- | --- |
| Prepare for interviews | Question Bank → Answer Frameworks |
| Learn AI systems fast | LLM Internals → RAG Fundamentals |
| Build production RAG | Chunking → Vector DBs → Reranking → Production RAG |
| Advanced retrieval | Contextual Retrieval → ColBERT → Multi-modal RAG |
| Design multi-tenant AI | Isolation Patterns → Case Study |
| Build agents | Agent Fundamentals → MCP & A2A → LangGraph |
| Tool-use & computer agents | Landscape → OpenClaw → Safety |
| Autonomous coding agents | Claude Code → OpenCoder Landscape |
| Pick the right model (2026) | Model Taxonomy → Pricing |
| Evaluate AI in production | AI Evals Guide (Phoenix/Langfuse) → AI Evals Guide (LangWatch/Langfuse) |
| Find the best courses to learn AI | Recommended Courses & Learning Paths |
| Transition from my current role to AI | Role Transition Guide |
| Understand the 2026 AI job market | Job Market Trends - June 2026 |
| Get a quick answer to a common question | FAQ (RAG, agents, models, eval, inference, memory, security) |
| Look up a term | Glossary (every term defined) |

Pick a path

flowchart TD
    A[New visitor] --> B{Your goal}
    B -->|Interview prep| C[Question Bank]
    B -->|Build RAG| D[RAG Fundamentals]
    B -->|Build agents| E[Agent Fundamentals]
    B -->|Pick a model| F[Model Taxonomy]
    B -->|Evaluate AI| G[AI Evals Guide]
    C --> H[Answer Frameworks]
    D --> I[Chunking + Vector DBs]
    E --> J[MCP and Tool Use]
    F --> K[Pricing 2026]
    G --> L[Phoenix or LangWatch]

🎯 Why This Guide

Traditional books are outdated before they ship. This is a living document: when new models are released or patterns evolve, the guide is updated.

| This Guide | Printed Books |
| --- | --- |
| June 2026 models (Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4 Pro, Llama 4, Kimi K2.6, Qwen 3.6, Mistral Medium 3.5, Gemma 4) | Stuck on GPT-4 |
| MCP 2.0, A2A v1.0, OpenClaw, Computer Use, Agentic RAG, ColBERT, latent reasoning, MoE serving | Does not exist |
| Real pricing with June 2026 verification dates | Already wrong |
| Staff-level interview Q&A (116 questions through June 2026) + Job Market Trends | Generic questions |

Quick model picker (June 2026): Claude Fable 5 for the capability ceiling ($10/$50 per 1M), Claude Opus 4.8 for tool-use and long-horizon agentic coding, GPT-5.5 for general production, Gemini 3.1 Pro for multimodal, DeepSeek V4 Flash ($0.14/$0.28 per 1M) or V4 Pro ($0.435/$0.87) for cheap frontier-class output, Llama 4 for self-hosted. Full breakdown in Model Taxonomy.
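
To make the per-token prices above concrete, here is a rough sketch of a monthly cost estimate. It assumes each price pair is (input, output) in USD per 1M tokens, which is the usual convention but is not stated explicitly above. The model keys, the `monthly_cost` helper, and the workload numbers are hypothetical.

```python
# Prices in USD per 1M tokens, copied from the model picker above.
# ASSUMPTION: each pair is (input price, output price).
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "claude-fable-5": (10.00, 50.00),
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (0.435, 0.87),
}


def monthly_cost(model: str, requests: int, in_tokens: int, out_tokens: int) -> float:
    """Estimate monthly spend for `requests` calls with a fixed token shape."""
    in_price, out_price = PRICES[model]
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return requests * per_request


# Hypothetical workload: 1M requests/month, 2,000 input and 500 output tokens each.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 1_000_000, 2_000, 500):,.2f}")
```

Under these assumptions the same workload comes to about $45,000 per month on Claude Fable 5, about $1,305 on DeepSeek V4 Pro, and about $420 on DeepSeek V4 Flash, which is why the picker separates the capability ceiling from cheap frontier-class output.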


🎯 What This Guide Is (and Is Not)

This guide IS:

  • A staff-level reference for designing production AI systems (RAG, agents, MCP, eval pipelines, multi-tenant isolation).
  • An interview-prep companion with 116 real questions, answer frameworks with a worked mock transcript, and nine whiteboard exercises through June 2026.
  • A living document tracking new model releases, protocol changes, and emerging patterns as they ship.
  • Opinionated about tradeoffs: latency vs cost, accuracy vs faithfulness, single-agent vs multi-agent.
  • Free, MIT-licensed, and open to PRs from practitioners.

This guide is NOT:

  • A tutorial on Python, PyTorch, or basic ML fundamentals (start with a course; see COURSES.md).
  • A vendor-neutral hedge; it names specific models, prices, and frameworks because real systems require real choices.
  • A replacement for hands-on building; read it alongside a project, not instead of one.
  • A research paper digest; it cites papers when they change practice, not for completeness.

📖 Guide Structure

├── 00-interview-prep/           # Questions (116), frameworks, exercises, job-market trends (June 2026)
├── 01-foundations/              # Transformers, attention, embeddings
├── 02-model-landscape/          # Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1, DeepSeek V4, Llama 4, Kimi K2.6, Qwen 3.6
├── 03-training-and-adaptation/  # Fine-tuning, LoRA, DPO, distillation
├── 04-inference-optimization/   # KV cache, PagedAttention, vLLM
├── 05-prompting-and-context/    # Prompt engineering, CoT, Extended Thinking, DSPy, prompt injection
├── 06-retrieval-systems/        # RAG, chunking, GraphRAG, Agentic RAG, ColBERT, Contextual Retrieval
├── 07-agentic-systems/          # MCP 2.0, A2A protocol, multi-agent, computer-use
├── 08-memory-and-state/         # L1-L3 memory tiers, Mem0, caching
├── 09-frameworks-and-tools/     # LangGraph, DSPy, LlamaIndex, Claude Code, OpenCoder
├── 10-document-processing/      # Vision-LLM OCR, multimodal parsing
├── 11-infrastructure-and-mlops/ # GPU clusters, LLMOps, cost management
├── 12-security-and-access/      # RBAC, ABAC, multi-tenant isolation
├── 13-reliability-and-safety/   # Guardrails, red-teaming
├── 14-evaluation-and-observability/ # RAGAS, LangSmith, drift detection
├── 15-ai-design-patterns/       # Pattern catalog, anti-patterns
├── 16-case-studies/             # Real-world architectures with diagrams
├── 17-tool-use-and-computer-agents/ # OpenClaw, Computer Use, tool agents, safety
├── GLOSSARY.md                  # Every term defined
│
├── ai_evals_comprehensive_study_guide.md      # 🔬 Deep-dive: AI Evals (Phoenix + Langfuse)
├── ai_evals_complete_guide_langwatch_langfuse.md  # 🔬 Deep-dive: AI Evals (LangWatch + Langfuse)
├── COURSES.md                   # 🎓 Recommended courses & learning paths
└── TRANSITION_GUIDE.md          # 🔄 Transition from Backend/QA/PM/EM to AI roles

Chapters by AI System Lifecycle Stage

mindmap
  root((AI System Design Guide))
    Foundations
      LLM Internals
      Model Landscape
      Training and Adaptation
    Build
      Prompting and Context
      Retrieval Systems
      Agentic Systems
      Tool Use and Computer Agents
    Operate
      Inference Optimization
      Memory and State
      Frameworks and Tools
      Infrastructure and MLOps
    Govern
      Security and Access
      Reliability and Safety
      Evaluation and Observability
    Apply
      Design Patterns
      Case Studies
      Interview Prep

🔥 Featured Case Studies

Real interview problems with complete solutions and diagrams:

| Case Study | Problem | Key Patterns |
| --- | --- | --- |
| Real-Time Search | 5-minute data freshness at scale | Streaming + Hybrid Search |
| Coding Agent | Autonomous multi-file changes | Sandboxing + Self-Correction |
| Multi-Tenant SaaS | Coca-Cola and Pepsi on same infra | Defense-in-Depth Isolation |
| Customer Support | 60% auto-resolution rate | Tiered Routing + Escalation |
| Document Intelligence | 50K contracts/month extraction | Vision-LLM + Parallel Extractors |
| Recommendation Engine | Personalized explanations at 50M users | ML Ranking + LLM Explanations |
| Compliance Automation | FDA regulation pre-screening | Claim Extraction + Precedent DB |
| Voice Healthcare | Real-time clinical note generation | On-Prem ASR + HIPAA |
| Fraud Detection | 100ms decision with explainability | ML + Rules Hybrid |
| Knowledge Management | 2M docs with access control | Permission-Aware RAG |
| Computer-Use Agent | Expense-report automation across 3 legacy UIs | Firecracker VMs + Action Gate + IPI Defense |
| Multi-Tenant Fine-Tuning | 280 tenants on shared base + per-tenant LoRA | LoRA Hot-Swap + Eval-as-PRD per Tenant |
| Eval-Gated CI/CD | Block PRs that regress AI quality | Golden Sets + LLM Judges + Statistical Correction |
| Customer Distillation | Cut $50K/mo frontier spend to $6K with 3-mo payback | Trace-Based Distillation + Canary Rollout |
| MCP Knowledge Agent | Cross-system answers from Snowflake/Confluence/Jira/Slack | MCP + OAuth Resource Server + Capability Gating |

🔬 Bonus Deep-Dive Guides

Two companion guides (3,000+ lines each) cover AI evaluation end to end, written for engineers, PMs, and QA engineers:

| Guide | Platforms Covered | What's Inside |
| --- | --- | --- |
| AI Evals: Comprehensive Study Guide | Arize Phoenix + Langfuse | LLM-as-a-Judge, RAG eval, multi-turn eval, production safety, statistical correction with judgy, 30-day learning path |
| AI Evals: LangWatch + Langfuse Guide | LangWatch + Langfuse | Same syllabus with LangWatch's 40+ built-in evaluators, side-by-side platform comparisons, platform choice guidance |

Topics covered across both guides:

  • Tracing and observability setup (Phoenix, LangWatch, Langfuse)
  • Error analysis: open coding → axial coding → failure mode taxonomy
  • Building LLM judges with Train/Dev/Test split and ground truth calibration
  • Code-based evaluators (regex, JSON schema, format validators)
  • RAG-specific evals: faithfulness, context recall, answer relevance
  • Multi-step pipeline evaluation and multi-turn conversation eval
  • Production guardrails, safety monitoring, real-time drift detection
  • Statistical correction with the judgy library
  • Human annotation best practices and inter-rater reliability
  • Cost/latency optimization for eval pipelines at scale
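
As an illustration of the code-based evaluators in the list above, here is a minimal sketch using only the Python standard library. The `eval_output` helper and the order-ID format are hypothetical, and the required-keys check is a simple stand-in for full JSON Schema validation rather than an implementation of it.

```python
import json
import re

# Hypothetical format rule: an order ID is "ORD-" followed by exactly 6 digits.
ORDER_ID = re.compile(r"ORD-\d{6}")


def eval_output(raw: str, required_keys: set[str]) -> dict[str, bool]:
    """Run cheap, deterministic checks on a single model output."""
    results = {"valid_json": False, "has_required_keys": False, "order_id_format": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results  # the remaining checks are meaningless without valid JSON
    results["valid_json"] = True
    if not isinstance(data, dict):
        return results
    # Stand-in for schema validation: every required key must be present.
    results["has_required_keys"] = required_keys.issubset(data)
    # Regex check on one field; fullmatch rejects trailing garbage.
    results["order_id_format"] = bool(ORDER_ID.fullmatch(str(data.get("order_id", ""))))
    return results


print(eval_output('{"order_id": "ORD-123456", "status": "shipped"}', {"order_id", "status"}))
```

Checks like these run in microseconds and never disagree with themselves, so they are usually worth running before any LLM judge is invoked.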

🎓 For Interview Prep

AI engineering and system design interviews ask questions like:

"Design a multi-tenant RAG system where competitors cannot see each other's data."

"Your agent takes 15 steps for a 3-step task. How do you debug it?"

This guide gives you concrete patterns, real tradeoffs, and production failure modes: the depth interviewers expect at senior levels.

➑️ Start with Interview Prep


❓ Frequently Asked Questions

What is AI system design?

AI system design is the discipline of architecting production-grade systems built around LLMs, retrieval, agents, and evaluation. It covers model selection, RAG pipelines, agent orchestration, memory, observability, and safety. See LLM Internals and AI Design Patterns to get oriented.

How do I prepare for an AI engineering interview?

Start with the Question Bank (116 questions through June 2026), then practice with Answer Frameworks and Whiteboard Exercises. Most senior interviews test RAG design, agent debugging, multi-tenant isolation, and cost/latency tradeoffs, all covered in the Case Studies.

What is RAG (Retrieval-Augmented Generation)?

RAG is a pattern where an LLM retrieves relevant context from an external knowledge source (vector DB, search index, graph) before generating an answer, reducing hallucinations and grounding responses in your data. The full pipeline is covered in RAG Fundamentals and scaled in Production RAG at Scale.
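
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy under stated assumptions: bag-of-words cosine similarity stands in for an embedding model and vector DB, the documents are made up, and the final LLM call is left out so the example only builds the grounded prompt.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    return sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Ground the generation step in the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"


# Hypothetical knowledge base.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 7 to 10 days.",
]
question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, DOCS, k=1))
print(prompt)
```

In a production pipeline each function becomes its own component (embedding model, vector index, prompt template), but the shape of the data flow stays the same.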

What are AI agents and how are they different from chatbots?

AI agents are LLM-driven systems that plan, call tools, and act over multiple steps to accomplish goals, whereas chatbots typically respond in a single turn. Agents introduce loops, memory, error recovery, and tool-use via protocols like MCP. Start with Agent Fundamentals.
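
The loop, tool calls, and error recovery mentioned above can be sketched as follows. Everything here is hypothetical: `scripted_policy` is a hard-coded stand-in for the LLM's planning step, and the two tools are stubs rather than real integrations.

```python
from typing import Callable

# Hypothetical tools; a real agent might discover these through a protocol such as MCP.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "send_email": lambda body: "email queued",
}


def scripted_policy(goal: str, history: list[tuple[str, str, str]]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (tool_name, argument) or ("finish", answer)."""
    if not history:
        return "lookup_order", "A17"
    if len(history) == 1:
        return "send_email", f"Status update: {history[0][2]}"
    return "finish", "Customer notified that order A17 has shipped."


def run_agent(goal: str, policy=scripted_policy, max_steps: int = 5) -> tuple[str, list]:
    history: list[tuple[str, str, str]] = []  # (tool, argument, observation)
    for _ in range(max_steps):
        action, arg = policy(goal, history)
        if action == "finish":
            return arg, history
        tool = TOOLS.get(action)
        # Error recovery: report unknown tools back to the policy instead of crashing.
        observation = tool(arg) if tool else f"error: unknown tool {action!r}"
        history.append((action, arg, observation))
    return "stopped: step budget exhausted", history


answer, trace = run_agent("Tell the customer where order A17 is")
print(answer)
```

The `max_steps` budget and the recorded `history` are the two pieces that make a runaway agent debuggable: the budget bounds the damage, and the trace shows which step went wrong.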

What is MCP (Model Context Protocol) and how does it compare to A2A?

MCP is an open protocol that lets LLMs discover and call external tools and data sources in a standardized way. A2A (Agent-to-Agent) is a complementary protocol for inter-agent communication. They solve different layers: MCP is the tool boundary, A2A is the agent boundary. See Tool Use and MCP.

Which LLM should I use in production: Claude, GPT, Gemini, or open-source?

It depends on latency budget, context length, cost per million tokens, tool-use quality, and data residency. The Model Taxonomy and Pricing chapters give a head-to-head comparison of Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4, Llama 4, and others as of June 2026.

How do I evaluate an LLM or RAG system in production?

Combine offline evals (LLM-as-a-judge with ground-truth calibration), online metrics (faithfulness, context recall, answer relevance), and continuous tracing. The companion deep-dives AI Evals: Phoenix + Langfuse and AI Evals: LangWatch + Langfuse walk through this end-to-end.

How do I build a multi-tenant RAG system safely?

Use defense-in-depth: per-tenant indexes or namespaces, query-time access checks, and prompt-layer guards. The Multi-Tenant RAG Isolation chapter and Multi-Tenant SaaS Case Study cover the patterns that hold up in interviews and production.
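
A minimal sketch of the three layers named above, assuming an in-memory store and substring matching in place of a real vector DB. The `TenantIndex` and `build_context` names are hypothetical and do not come from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str


class TenantIndex:
    """Toy store with one namespace per tenant (layer 1: separate namespaces)."""

    def __init__(self) -> None:
        self._namespaces: dict[str, list[Chunk]] = {}

    def add(self, chunk: Chunk) -> None:
        self._namespaces.setdefault(chunk.tenant_id, []).append(chunk)

    def search(self, caller_tenant: str, query: str) -> list[Chunk]:
        # Only the caller's namespace is ever read.
        candidates = self._namespaces.get(caller_tenant, [])
        hits = [c for c in candidates if query.lower() in c.text.lower()]
        # Layer 2: query-time access check, redundant by design.
        return [c for c in hits if c.tenant_id == caller_tenant]


def build_context(caller_tenant: str, chunks: list[Chunk]) -> str:
    # Layer 3: prompt-layer guard; refuse to assemble a cross-tenant prompt.
    for c in chunks:
        if c.tenant_id != caller_tenant:
            raise PermissionError("cross-tenant chunk reached prompt assembly")
    return "\n".join(c.text for c in chunks)


index = TenantIndex()
index.add(Chunk("tenant_a", "Q3 pricing strategy"))
index.add(Chunk("tenant_b", "Q3 pricing rollout"))
hits = index.search("tenant_a", "pricing")
print(build_context("tenant_a", hits))
```

Each layer looks redundant when the others work. The point of defense-in-depth is that a bug in any single layer does not, on its own, leak another tenant's data.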

What is agentic RAG?

Agentic RAG combines retrieval with an agent loop that can decide what to search, when to re-query, and when to escalate, instead of running a single fixed retrieve-then-generate pass. See Agentic RAG for the architectures and tradeoffs.
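
The difference from a fixed retrieve-then-generate pass can be shown with a small loop. This is a sketch under heavy assumptions: keyword lookup stands in for retrieval, a non-empty result stands in for an LLM sufficiency check, and the `rewrites` list stands in for LLM-generated query rewrites.

```python
def search(query: str, corpus: dict[str, str]) -> list[str]:
    """Toy keyword search: return passages whose key appears in the query."""
    return [text for key, text in corpus.items() if key in query.lower()]


def agentic_rag(question: str, corpus: dict[str, str], rewrites: list[str], max_rounds: int = 3) -> str:
    query = question
    for round_no in range(max_rounds):
        passages = search(query, corpus)
        if passages:  # stand-in for an LLM "is this context sufficient?" check
            return f"ANSWER from {len(passages)} passage(s): {passages[0]}"
        if round_no < len(rewrites):
            query = rewrites[round_no]  # stand-in for an LLM query rewrite
        else:
            break  # nothing left to try
    return "ESCALATE: no supporting context found"


# Hypothetical corpus keyed by topic keyword.
corpus = {"refund": "Refunds take 5 business days."}
print(agentic_rag("When do I get my money back?", corpus, rewrites=["refund timeline"]))
```

Here the first search finds nothing, the rewritten query succeeds on the second round, and a question with no usable rewrite falls through to escalation instead of producing an ungrounded answer.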

Is this guide free? Can I contribute?

Yes, MIT-licensed and free. PRs are welcome; see Contributing Guide. If you have production failure modes, new model benchmarks, or interview questions to add, open a PR.

How often is this guide updated?

Continuously. New model releases, protocol changes (MCP, A2A), and emerging patterns are added as they ship. Recent additions include Tool-Use and Computer Agents and the June 2026 Job Market Trends.

Can I use this guide if I am transitioning from backend, QA, PM, or EM into AI?

Yes. The Role Transition Guide maps existing skills to AI engineering, MLE, and AI architect tracks, with reading paths per role. Pair it with COURSES.md for curated learning resources.


🔄 Living Book

This guide tracks:

  • New model releases and real-world performance
  • Emerging patterns (MCP, Agentic RAG, Flow Engineering)
  • Updated pricing and rate limits
  • Deprecations and best practice changes

⭐ Star and Watch the repo to get notified when updates are pushed.


🀝 Contributing

Found outdated info? Have production experience to share? PRs welcome. See Contributing Guide.


👋 Stay Connected

If this guide helps you, the easiest way to support it is to follow along where new chapters and refreshes get announced first:

  • GitHub: @ombharatiya - follow for the repo, star the project, and watch for new releases.
  • X / Twitter: @ombharatiya - short takes on model releases, MCP, agents, and interviews.
  • LinkedIn: ombharatiya - deeper writeups and interview prep tips for senior AI roles.


📄 License

MIT License. See LICENSE.


Built and maintained by Om Bharatiya · GitHub · Twitter · LinkedIn