# AI Infrastructure Engineer - Learning Path

<div align="center">

*Master AI Infrastructure Engineering through hands-on projects and practical learning*

[Prerequisites](./PREREQUISITES.md) • [Getting Started](#-getting-started) • [Curriculum](#-curriculum-overview) • [Projects](#-projects) • [Resources](#-resources)

</div>

---

## 🎯 Overview

This repository contains a **complete, production-ready learning path** for becoming an **AI Infrastructure Engineer**. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale.

**Repository Status:** ✅ All 10 modules and 3 projects are ready for learning; exercises are still being filled in (see What's New below).

### What You'll Master

- ✅ **Build ML Infrastructure** from scratch (Docker, Kubernetes, cloud platforms)
- ✅ **Deploy Production ML Systems** with auto-scaling and comprehensive monitoring
- ✅ **Implement End-to-End MLOps** pipelines (Airflow, MLflow, DVC)
- ✅ **Deploy Cutting-Edge LLM Infrastructure** (vLLM, RAG, vector databases)
- ✅ **Scale Training** with distributed systems and GPU clusters
- ✅ **Monitor and Troubleshoot** complex ML systems in production
- ✅ **Optimize Costs** across cloud providers (60-80% savings possible)

### Why This Learning Path?

- 🎓 **Industry-Aligned**: Based on actual job requirements from FAANG and top tech companies
- 💻 **Hands-On**: Code stubs with TODO comments guide you through real implementations
- 🏗️ **Production-Ready**: Learn patterns used at Netflix, Uber, Airbnb, OpenAI
- 📈 **Career-Focused**: Maps directly to AI Infrastructure Engineer roles paying $120k-$180k
- 📚 **Progressive**: 10 modules building from basics to advanced LLM infrastructure
- 🔥 **Modern Stack**: 2024-2025 technologies (vLLM, RAG, GPU optimization)

---

## ✨ What's New

**2026-05-27 - Layout standardisation:**

- 🧹 **Removed 10 empty root-level `mod-XXX-*/` placeholder directories.** They were vestiges from a pre-refactor layout; all canonical module content has lived under `lessons/mod-XXX-*/` for some time. The repo now matches the layout expected by the curriculum-runner audit (`lessons/` for learning content, `modules/` in the paired solutions repo).
- 🧹 **Removed orphan `lessons/mod-101-foundations/exercises/solutions/`** (a duplicate single-file index). Reference solutions live in the paired [`ai-infra-engineer-solutions`](https://github.com/ai-infra-curriculum/ai-infra-engineer-solutions) repo; inline pointers throughout the lessons now link there directly.

**May 2026 Update:**

- 🧪 **All 62 promised labs authored** across all 10 modules (foundations → LLM infrastructure). Each lab is a substantive, runnable walkthrough with objectives, prerequisites, numbered steps, validation checklist, cleanup, and troubleshooting.
- 📚 **Two new reading lists:** `advanced-engineer-path.md` and `staff-engineer-path.md` (9-18 months and 2-5 years respectively).
- 🧹 **Structural cleanup:** mod-101 lecture duplicates resolved, quiz placement consolidated, empty Makefile/pyproject populated with real content, CURRICULUM.md self-claim corrected to reflect actual completion state.
- 🔍 **Honesty pass on CURRICULUM.md:** the prior "100% Complete" claim has been replaced with a per-module exercise/lab accounting.
  Lectures and projects are excellent; 32 of the 119 promised exercises are done, and the rest are being filled in over subsequent content drops.

**Earlier:**

- 📝 **Comprehensive Quizzes** for modules 102-110 (265+ questions)
  - Module 102: Cloud Computing (mid-module + final, 50 questions)
  - Module 103: Containerization (25 questions)
  - Module 104: Kubernetes (30 questions)
  - Module 105: Data Pipelines (25 questions)
  - Module 106: MLOps (30 questions)
  - Module 107: GPU Computing (25 questions)
  - Module 108: Monitoring (25 questions)
  - Module 109: IaC (25 questions)
  - Module 110: LLM Infrastructure (30 questions)
- 📋 **Technology Versions Guide** - Complete specifications for 100+ tools
- 🗺️ **Curriculum Cross-Reference** - Mapping to Junior track
- 📈 **Career Progression Guide** - Engineer to Principal roadmap

---

## 📦 What's Included

### 10 Complete Learning Modules (130 Files)

| Module | Topic | Hours | Status | Quiz |
|--------|-------|-------|--------|------|
| 01 | **Foundations** | 50h | ✅ Complete (15 files) | ✅ 30Q |
| 02 | **Cloud Computing** | 50h | ✅ Complete (11 files) | ✨ **+50Q** |
| 03 | **Containerization** | 50h | ✅ Complete (14 files) | ✨ **+25Q** |
| 04 | **Kubernetes** | 50h | ✅ Complete (13 files) | ✨ **+30Q** |
| 05 | **Data Pipelines** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 06 | **MLOps** | 50h | ✅ Complete (12 files) | ✨ **+30Q** |
| 07 | **GPU Computing** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 08 | **Monitoring & Observability** | 50h | ✅ Complete (11 files) | ✨ **+25Q** |
| 09 | **Infrastructure as Code** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 10 | **LLM Infrastructure** | 50h | ✅ Complete (12 files) | ✨ **+30Q** |

### 3 Production-Grade Projects (77 Files)

| Project | Technologies | Duration | Files | Status |
|---------|-------------|----------|-------|--------|
| **01: Basic Model Serving** | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete |
| **02: MLOps Pipeline** | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete |
| **03: LLM Deployment** | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete |

**Total Repository:** 207 files | 95,000+ lines of code | 500+ hours of learning content

---

## 🎓 Prerequisites

### Option 1: Complete Junior Curriculum (RECOMMENDED)

If you've completed the [**Junior AI Infrastructure Engineer**](https://github.com/ai-infra-curriculum/ai-infra-junior-engineer-learning) curriculum, you have **ALL** required prerequisites! ✅

The Junior curriculum covers:

- ✅ Python fundamentals & advanced concepts
- ✅ Linux/Unix command line mastery
- ✅ Git & version control workflows
- ✅ ML basics (PyTorch, TensorFlow)
- ✅ Docker & containerization
- ✅ Kubernetes introduction
- ✅ API development & databases
- ✅ Monitoring & cloud platforms

**Duration**: 440 hours (22 weeks part-time, 11 weeks full-time)

### Option 2: Self-Assessment

**Haven't completed the Junior curriculum?** Use our comprehensive [**Prerequisites Guide**](./PREREQUISITES.md) to:

- Check your readiness with detailed skill checklists
- Identify knowledge gaps
- Get personalized learning recommendations
- Run automated skill assessment

### Minimum Requirements

If self-studying, you must have:

- **Python 3.9+** (intermediate level: OOP, async, testing, type hints)
- **Linux/Unix CLI** (bash scripting, processes, debugging)
- **Git fundamentals** (branching, merging, collaboration)
- **ML basics** (PyTorch/TensorFlow, training, inference, evaluation)
- **Docker basics** (images, containers, Compose)
- **Kubernetes intro** (pods, deployments, services)

**📋 Not sure if you're ready?** Read the [**Prerequisites Guide**](./PREREQUISITES.md) for detailed assessment.

---

## 🚀 Getting Started

### Quick Start

```bash
# 1. Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git
cd ai-infra-engineer-learning

# 2. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start with Module 01
cd lessons/mod-101-foundations
cat README.md
```

### Learning Path

1. **Modules 01-02 (Foundations)** - Start here if new to ML infrastructure
2. **Modules 03-04 (Core Infrastructure)** - Docker and Kubernetes mastery
3. **Modules 05-06 (MLOps)** - Data pipelines and ML operations
4. **Modules 07-08 (Advanced)** - GPU computing and monitoring
5. **Modules 09-10 (Modern Stack)** - IaC and LLM infrastructure

**Detailed guide:** [GETTING_STARTED.md](./GETTING_STARTED.md)

---

## 📖 Curriculum Overview

### Module 01: Foundations ✅

**50 hours | 15 files**

Build your foundation in ML infrastructure:

- ML infrastructure landscape and career paths
- Python environment setup and best practices
- ML frameworks (PyTorch, TensorFlow)
- Docker fundamentals and containerization
- REST API development with FastAPI

[View Module 01 →](./lessons/mod-101-foundations/README.md)

---

### Module 02: Cloud Computing ✅

**50 hours | 11 files**

Master cloud platforms for ML:

- Cloud architecture for ML workloads
- AWS (EC2, S3, EKS, SageMaker)
- GCP (Compute Engine, GCS, GKE, Vertex AI)
- Azure (VMs, Blob Storage, AKS, Azure ML)
- Multi-cloud strategies and cost optimization (60-80% savings)

[View Module 02 →](./lessons/mod-102-cloud-computing/README.md)

---

### Module 03: Containerization ✅

**50 hours | 14 files**

Deep dive into containers:

- Docker architecture and best practices
- Multi-stage builds and optimization
- Docker Compose for multi-service applications
- Container registries and image management
- Security and vulnerability scanning

[View Module 03 →](./lessons/mod-103-containerization/README.md)

---

### Module 04: Kubernetes ✅

**50 hours | 13 files**

Master Kubernetes for ML:

- Kubernetes architecture and components
- Deployments, Services, ConfigMaps, Secrets
- GPU resource management and scheduling
- Autoscaling (HPA, VPA, Cluster Autoscaler)
- Helm charts and GitOps with ArgoCD

[View Module 04 →](./lessons/mod-104-kubernetes/README.md)

---

### Module 05: Data Pipelines ✅

**50 hours | 12 files**

Build robust data pipelines:

- Apache Airflow for workflow orchestration
- Data processing with Apache Spark
- Streaming data with Apache Kafka
- Data version control with DVC
- Data quality validation and monitoring

[View Module 05 →](./lessons/mod-105-data-pipelines/README.md)

---

### Module 06: MLOps ✅

**50 hours | 12 files**

Implement MLOps best practices:

- Experiment tracking with MLflow
- Model registry and versioning
- Feature stores and engineering
- CI/CD for ML models
- A/B testing and experimentation
- ML governance and best practices

[View Module 06 →](./lessons/mod-106-mlops/README.md)

---

### Module 07: GPU Computing & Distributed Training ✅

**50 hours | 12 files**

Harness GPU power:

- CUDA programming fundamentals
- PyTorch GPU acceleration
- Distributed training (DDP, FSDP)
- Multi-GPU and multi-node training
- Model and pipeline parallelism
- GPU memory optimization

[View Module 07 →](./lessons/mod-107-gpu-computing/README.md)

---

### Module 08: Monitoring & Observability ✅

**50 hours | 11 files**

Build comprehensive observability:

- Prometheus and Grafana
- Metrics, logs, and traces (OpenTelemetry)
- Distributed tracing with Jaeger
- Alerting and incident response
- Model performance monitoring
- SLIs, SLOs, and SLAs

[View Module 08 →](./lessons/mod-108-monitoring-observability/README.md)

---

### Module 09: Infrastructure as Code ✅

**50 hours | 12 files**

Automate infrastructure:

- Terraform fundamentals and best practices
- Pulumi for multi-language IaC
- CloudFormation for AWS
- State management and modules
- Multi-environment deployments
- GitOps workflows

[View Module 09 →](./lessons/mod-109-infrastructure-as-code/README.md)

---

### Module 10: LLM Infrastructure ✅

**50 hours | 12 files**

Master cutting-edge LLM infrastructure (2024-2025):

- LLM serving with vLLM and TensorRT-LLM
- RAG (Retrieval-Augmented Generation)
- Vector databases (Pinecone, Weaviate, Milvus)
- Model quantization (FP16, INT8)
- GPU optimization for inference
- Cost tracking and optimization

[View Module 10 →](./lessons/mod-110-llm-infrastructure/README.md)

---

## 🛠️ Projects

### Project 01: Basic Model Serving System ✅

**⭐ Beginner | 30 hours | ~30 files**

Build a complete model serving system:

- FastAPI REST API for image classification
- Docker containerization with optimization
- Kubernetes deployment with monitoring
- Prometheus and Grafana dashboards
- CI/CD pipeline with GitHub Actions

**Technologies:** FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana

[View Project 01 →](./projects/project-101-basic-model-serving/README.md)

---

### Project 02: End-to-End MLOps Pipeline ✅

**⭐⭐ Intermediate | 40 hours | 30 files**

Create a production MLOps pipeline:

- Apache Airflow DAGs (data, training, deployment)
- MLflow experiment tracking and model registry
- DVC for data versioning
- Automated model deployment to Kubernetes
- Comprehensive monitoring and alerting
- CI/CD with automated testing

**Technologies:** Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes

[View Project 02 →](./projects/project-102-mlops-pipeline/README.md)

---

### Project 03: LLM Deployment Platform ✅

**⭐⭐⭐ Advanced | 50 hours | 47 files**

Deploy cutting-edge LLM infrastructure:

- vLLM/TensorRT-LLM for optimized serving
- RAG system with vector database (Pinecone/ChromaDB/Milvus)
- Document ingestion pipeline (PDF, TXT, web)
- FastAPI with Server-Sent Events streaming
- Kubernetes with GPU support
- Cost tracking and optimization
- Comprehensive monitoring

**Technologies:** vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers

[View Project 03 →](./projects/project-103-llm-deployment/README.md)

---

## 💰 Cost Considerations

### Cloud Costs

Most of the learning material can be completed within **free-tier limits or trial credits**:

- **AWS**: 750 hours/month t2.micro + $300 credits (varies)
- **GCP**: $300 credit (90 days)
- **Azure**: $200 credit (30 days)

**GPU costs** (optional, for advanced projects):

- On-demand: $1-3/hour
- Spot instances: $0.30-1/hour (about 70% savings)
- Estimated total: $50-150 for complete curriculum

### Optimization Tips

- Use spot instances for training (60-90% savings)
- Leverage free tiers across multiple cloud providers
- Delete resources when not in use
- Use local development where possible

---

## 📚 Resources

### Included Documentation

- Comprehensive lesson materials with examples
- Code stubs with TODO comments for guided implementation
- Complete project specifications with architecture diagrams
- Quizzes and assessments for each module
- Best practices and design patterns

### External Resources

- 📚 **Reading Lists**: [resources/reading-lists/](./resources/reading-lists/) - advanced + staff-engineer paths
- 🛠️ **Cheat Sheets**: [resources/cheat-sheets/](./resources/cheat-sheets/) - docker, kubernetes, git, linux, python infrastructure
- ❓ **FAQ**: [resources/faq.md](./resources/faq.md)

### Curriculum Documentation

- 📋 **[Technology Versions Guide](VERSIONS.md)** - Recommended versions for all tools and frameworks
- 🗺️ **[Curriculum Cross-Reference](https://github.com/ai-infra-curriculum/.github/blob/main/CURRICULUM_CROSS_REFERENCE.md)** - Mapping between Junior and Engineer tracks
- 📈 **[Career Progression Guide](https://github.com/ai-infra-curriculum/.github/blob/main/CAREER_PROGRESSION.md)** - Complete career ladder from Junior to Principal

---

## 🎯 Learning Outcomes & Career Impact

### After Completion, You'll Be Qualified For:

**AI Infrastructure Engineer**

- 💰 Salary: $120,000 - $180,000
- 🏢 Companies: Tech companies, AI startups, ML-focused organizations
- 📈 Demand: Very high (growing 35% year-over-year)

**ML Platform Engineer**

- 💰 Salary: $130,000 - $190,000
- 🏢 Companies: Large tech firms, enterprises with ML teams
- 📈 Demand: High (specialized role)

**MLOps Engineer**

- 💰 Salary: $110,000 - $170,000
- 🏢 Companies: All organizations doing ML at scale
- 📈 Demand: Very high (fastest growing ML role)

### Skills You'll Demonstrate

- ✅ Kubernetes expertise with GPU scheduling
- ✅ End-to-end MLOps pipeline implementation
- ✅ LLM infrastructure and RAG systems
- ✅ Distributed training and GPU optimization
- ✅ Production monitoring and observability
- ✅ Cloud platform mastery (AWS, GCP, Azure)
- ✅ Infrastructure as Code with Terraform
- ✅ Cost optimization strategies

---

## 📊 Repository Statistics

- **Total Files:** 207
- **Estimated Lines:** 95,000+
- **Modules:** 10 (all complete)
- **Projects:** 3 (all complete)
- **Learning Hours:** 500+
- **Technologies:** 50+

### Technology Stack Covered

**Core Infrastructure:** Docker, Kubernetes, Terraform, Helm, ArgoCD

**ML & Data:** PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC

**MLOps:** MLflow, Feature Stores, Model Registry, CI/CD

**LLM Infrastructure:** vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB)

**Cloud Platforms:** AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML)

**Monitoring:** Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack

**GPU Computing:** CUDA, NCCL, Multi-GPU training, Distributed training

---

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for:

- Bug reports and fixes
- Documentation improvements
- New exercises and examples
- Updated best practices

---

## 🆘 Getting Help

- 📖 **Documentation**: Start with [GETTING_STARTED.md](./GETTING_STARTED.md)
- 💬 **GitHub Discussions**: [Ask questions](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/discussions)
- 🐛 **Issues**: [Report bugs](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/issues)
- 📧 **Contact**: [email protected]

---

## 📄 License

This project is licensed under the **MIT License** - see [LICENSE](./LICENSE) for details.


---

## 🏆 Success Metrics

Upon completion, you should be able to:

- [ ] Deploy ML models to production with confidence
- [ ] Build complete MLOps pipelines from scratch
- [ ] Implement LLM infrastructure with RAG
- [ ] Optimize cloud costs by 60-80%
- [ ] Debug complex distributed systems
- [ ] Pass technical interviews for AI Infrastructure roles
- [ ] Confidently discuss trade-offs in system design
- [ ] Lead infrastructure projects at your organization

---

## 🚀 Next Steps After Completion

This curriculum prepares you for **AI Infrastructure Engineer** roles. For career progression:

1. **Gain Experience** (1-2 years)
   - Work on production ML systems
   - Handle incidents and on-call rotations
   - Contribute to open-source ML infrastructure projects
2. **Advance to Senior Engineer** (2-3 years total)
   - Continue with the [Senior Engineer track](https://github.com/ai-infra-curriculum/ai-infra-senior-engineer-learning)
   - Lead larger projects and mentor juniors
   - Design complex systems
3. **Become an Architect** (4-6 years total)
   - Continue with the [Architect track](https://github.com/ai-infra-curriculum/ai-infra-architect-learning)
   - Design enterprise ML platforms
   - Strategic technical leadership

---

<div align="center">

## Ready to Master AI Infrastructure Engineering?

**Start your journey today!**

[🚀 Get Started](./GETTING_STARTED.md) | [📚 View Full Curriculum](./CURRICULUM.md) | [🎓 Start Module 01](./lessons/mod-101-foundations/README.md)

---

⭐ **Star this repository** if you find it valuable!

**Share with others** learning AI Infrastructure Engineering!

---

*Contact: [email protected]*

**Happy Learning!** 🚀🎓

</div>

---

<!-- aicg:maintained-by -->
Maintained by [VeriSwarm.ai](https://veriswarm.ai)