ai-infra-curriculum

Open Source

ai-infra-engineer-learning

# AI Infrastructure Engineer - Learning Path <div align="center"> ![License](https://img.shields.io/badge/license-MIT-blue.svg) ![Progress](https://img.shields.io/badge/modules-10/10_complete-brightgreen.svg) ![Projects](https://img.shields.io/badge/projects-3/3_complete-brightgreen.svg) ![Duration](https://img.shields.io/badge/duration-500+_hours-red.svg) *Master AI Infrastructure Engineering through hands-on projects and practical learning* [Prerequisites](./PREREQUISITES.md) • [Getting Started](#-getting-started) • [Curriculum](#-curriculum-overview) • [Projects](#-projects) • [Resources](#-resources) </div> --- ## 🎯 Overview This repository contains a **complete, production-ready learning path** for becoming an **AI Infrastructure Engineer**. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale. **Repository Status:** ✅ **100% COMPLETE** - All modules and projects ready for learning! ### What You'll Master - ✅ **Build ML Infrastructure** from scratch (Docker, Kubernetes, cloud platforms) - ✅ **Deploy Production ML Systems** with auto-scaling and comprehensive monitoring - ✅ **Implement End-to-End MLOps** pipelines (Airflow, MLflow, DVC) - ✅ **Deploy Cutting-Edge LLM Infrastructure** (vLLM, RAG, vector databases) - ✅ **Scale Training** with distributed systems and GPU clusters - ✅ **Monitor and Troubleshoot** complex ML systems in production - ✅ **Optimize Costs** across cloud providers (60-80% savings possible) ### Why This Learning Path? - 🎓 **Industry-Aligned**: Based on actual job requirements from FAANG and top tech companies - 💻 **Hands-On**: Code stubs with TODO comments guide you through real implementations - 🏗️ **Production-Ready**: Learn patterns used at Netflix, Uber, Airbnb, OpenAI - 📊 **Career-Focused**: Directly maps to $120k-$180k AI Infrastructure Engineer roles - 🚀 **Progressive**: 10 modules building from basics to advanced LLM infrastructure - 🔥 **Modern Stack**: 2024-2025 technologies (vLLM, RAG, GPU optimization) --- ## ✨ What's New **2026-05-27 — Layout standardisation:** - 🧹 **Removed 10 empty root-level `mod-XXX-*/` placeholder directories.** They were vestiges from a pre-refactor layout; all canonical module content has lived under `lessons/mod-XXX-*/` for some time. The repo now matches the layout expected by the curriculum-runner audit (`lessons/` for learning content, `modules/` in the paired solutions repo). - 🧹 **Removed orphan `lessons/mod-101-foundations/exercises/solutions/`** (a duplicate single-file index). Reference solutions live in the paired [`ai-infra-engineer-solutions`](https://github.com/ai-infra-curriculum/ai-infra-engineer-solutions) repo; inline pointers throughout the lessons now link there directly. **May 2026 Update:** - 🧪 **All 62 promised labs authored** across all 10 modules (foundations → LLM infrastructure). Each lab is a substantive, runnable walkthrough with objectives, prerequisites, numbered steps, validation checklist, cleanup, and troubleshooting. - 📒 **Two new reading lists:** `advanced-engineer-path.md` and `staff-engineer-path.md` (9–18 months and 2–5 years respectively). - 🧹 **Structural cleanup:** mod-101 lecture duplicates resolved, quiz placement consolidated, empty Makefile/pyproject populated with real content, CURRICULUM.md self-claim corrected to reflect actual completion state. - 🔍 **Honesty pass on CURRICULUM.md:** the prior "100% Complete" claim has been replaced with a per-module exercise/lab accounting. Lectures and projects are excellent; exercises are 32 of 119 promised and being filled in over subsequent content drops. **Earlier:** - 📝 **Comprehensive Quizzes** for modules 102-110 (265+ questions) - Module 102: Cloud Computing (mid-module + final, 50 questions) - Module 103: Containerization (25 questions) - Module 104: Kubernetes (30 questions) - Module 105: Data Pipelines (25 questions) - Module 106: MLOps (30 questions) - Module 107: GPU Computing (25 questions) - Module 108: Monitoring (25 questions) - Module 109: IaC (25 questions) - Module 110: LLM Infrastructure (30 questions) - 📋 **Technology Versions Guide** - Complete specifications for 100+ tools - 🗺️ **Curriculum Cross-Reference** - Mapping to Junior track - 📈 **Career Progression Guide** - Engineer to Principal roadmap --- ## 📊 What's Included ### 10 Complete Learning Modules (130 Files) | Module | Topic | Hours | Status | Quiz | |--------|-------|-------|--------|------| | 01 | **Foundations** | 50h | ✅ Complete (15 files) | ✅ 30Q | | 02 | **Cloud Computing** | 50h | ✅ Complete (11 files) | ✨ **+50Q** | | 03 | **Containerization** | 50h | ✅ Complete (14 files) | ✨ **+25Q** | | 04 | **Kubernetes** | 50h | ✅ Complete (13 files) | ✨ **+30Q** | | 05 | **Data Pipelines** | 50h | ✅ Complete (12 files) | ✨ **+25Q** | | 06 | **MLOps** | 50h | ✅ Complete (12 files) | ✨ **+30Q** | | 07 | **GPU Computing** | 50h | ✅ Complete (12 files) | ✨ **+25Q** | | 08 | **Monitoring & Observability** | 50h | ✅ Complete (11 files) | ✨ **+25Q** | | 09 | **Infrastructure as Code** | 50h | ✅ Complete (12 files) | ✨ **+25Q** | | 10 | **LLM Infrastructure** | 50h | ✅ Complete (12 files) | ✨ **+30Q** | ### 3 Production-Grade Projects (77 Files) | Project | Technologies | Duration | Files | Status | |---------|-------------|----------|-------|--------| | **01: Basic Model Serving** | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete | | **02: MLOps Pipeline** | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete | | **03: LLM Deployment** | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete | **Total Repository:** 207 files | ~95,000+ lines of code | 500+ hours of learning content --- ## 🎓 Prerequisites ### Option 1: Complete Junior Curriculum (RECOMMENDED) If you've completed the [**Junior AI Infrastructure Engineer**](https://github.com/ai-infra-curriculum/ai-infra-junior-engineer-learning) curriculum, you have **ALL** required prerequisites! ✅ The Junior curriculum covers: - ✅ Python fundamentals & advanced concepts - ✅ Linux/Unix command line mastery - ✅ Git & version control workflows - ✅ ML basics (PyTorch, TensorFlow) - ✅ Docker & containerization - ✅ Kubernetes introduction - ✅ API development & databases - ✅ Monitoring & cloud platforms **Duration**: 440 hours (22 weeks part-time, 11 weeks full-time) ### Option 2: Self-Assessment **Haven't completed Junior curriculum?** Use our comprehensive [**Prerequisites Guide**](./PREREQUISITES.md) to: - Check your readiness with detailed skill checklists - Identify knowledge gaps - Get personalized learning recommendations - Run automated skill assessment ### Minimum Requirements If self-studying, you must have: - **Python 3.9+** (intermediate level: OOP, async, testing, type hints) - **Linux/Unix CLI** (bash scripting, processes, debugging) - **Git fundamentals** (branching, merging, collaboration) - **ML basics** (PyTorch/TensorFlow, training, inference, evaluation) - **Docker basics** (images, containers, Compose) - **Kubernetes intro** (pods, deployments, services) **👉 Not sure if you're ready?** Read the [**Prerequisites Guide**](./PREREQUISITES.md) for detailed assessment. --- ## 🚀 Getting Started ### Quick Start ```bash # 1. Clone repository git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git cd ai-infra-engineer-learning # 2. Create virtual environment python3.11 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate # 3. Install dependencies pip install -r requirements.txt # 4. Start with Module 01 cd lessons/mod-101-foundations cat README.md ``` ### Learning Path 1. **Modules 01-02 (Foundations)** - Start here if new to ML infrastructure 2. **Modules 03-04 (Core Infrastructure)** - Docker and Kubernetes mastery 3. **Modules 05-06 (MLOps)** - Data pipelines and ML operations 4. **Modules 07-08 (Advanced)** - GPU computing and monitoring 5. **Modules 09-10 (Modern Stack)** - IaC and LLM infrastructure **Detailed guide:** [GETTING_STARTED.md](./GETTING_STARTED.md) --- ## 📖 Curriculum Overview ### Module 01: Foundations ✅ **50 hours | 15 files** Build your foundation in ML infrastructure: - ML infrastructure landscape and career paths - Python environment setup and best practices - ML frameworks (PyTorch, TensorFlow) - Docker fundamentals and containerization - REST API development with FastAPI [View Module 01 →](./lessons/mod-101-foundations/README.md) --- ### Module 02: Cloud Computing ✅ **50 hours | 11 files** Master cloud platforms for ML: - Cloud architecture for ML workloads - AWS (EC2, S3, EKS, SageMaker) - GCP (Compute Engine, GCS, GKE, Vertex AI) - Azure (VMs, Blob Storage, AKS, Azure ML) - Multi-cloud strategies and cost optimization (60-80% savings) [View Module 02 →](./lessons/mod-102-cloud-computing/README.md) --- ### Module 03: Containerization ✅ **50 hours | 14 files** Deep dive into containers: - Docker architecture and best practices - Multi-stage builds and optimization - Docker Compose for multi-service applications - Container registries and image management - Security and vulnerability scanning [View Module 03 →](./lessons/mod-103-containerization/README.md) --- ### Module 04: Kubernetes ✅ **50 hours | 13 files** Master Kubernetes for ML: - Kubernetes architecture and components - Deployments, Services, ConfigMaps, Secrets - GPU resource management and scheduling - Autoscaling (HPA, VPA, Cluster Autoscaler) - Helm charts and GitOps with ArgoCD [View Module 04 →](./lessons/mod-104-kubernetes/README.md) --- ### Module 05: Data Pipelines ✅ **50 hours | 12 files** Build robust data pipelines: - Apache Airflow for workflow orchestration - Data processing with Apache Spark - Streaming data with Apache Kafka - Data version control with DVC - Data quality validation and monitoring [View Module 05 →](./lessons/mod-105-data-pipelines/README.md) --- ### Module 06: MLOps ✅ **50 hours | 12 files** Implement MLOps best practices: - Experiment tracking with MLflow - Model registry and versioning - Feature stores and engineering - CI/CD for ML models - A/B testing and experimentation - ML governance and best practices [View Module 06 →](./lessons/mod-106-mlops/README.md) --- ### Module 07: GPU Computing & Distributed Training ✅ **50 hours | 12 files** Harness GPU power: - CUDA programming fundamentals - PyTorch GPU acceleration - Distributed training (DDP, FSDP) - Multi-GPU and multi-node training - Model and pipeline parallelism - GPU memory optimization [View Module 07 →](./lessons/mod-107-gpu-computing/README.md) --- ### Module 08: Monitoring & Observability ✅ **50 hours | 11 files** Build comprehensive observability: - Prometheus and Grafana - Metrics, logs, and traces (OpenTelemetry) - Distributed tracing with Jaeger - Alerting and incident response - Model performance monitoring - SLIs, SLOs, and SLAs [View Module 08 →](./lessons/mod-108-monitoring-observability/README.md) --- ### Module 09: Infrastructure as Code ✅ **50 hours | 12 files** Automate infrastructure: - Terraform fundamentals and best practices - Pulumi for multi-language IaC - CloudFormation for AWS - State management and modules - Multi-environment deployments - GitOps workflows [View Module 09 →](./lessons/mod-109-infrastructure-as-code/README.md) --- ### Module 10: LLM Infrastructure ✅ **50 hours | 12 files** Master cutting-edge LLM infrastructure (2024-2025): - LLM serving with vLLM and TensorRT-LLM - RAG (Retrieval-Augmented Generation) - Vector databases (Pinecone, Weaviate, Milvus) - Model quantization (FP16, INT8) - GPU optimization for inference - Cost tracking and optimization [View Module 10 →](./lessons/mod-110-llm-infrastructure/README.md) --- ## 🛠️ Projects ### Project 01: Basic Model Serving System ✅ **⭐ Beginner | 30 hours | ~30 files** Build a complete model serving system: - FastAPI REST API for image classification - Docker containerization with optimization - Kubernetes deployment with monitoring - Prometheus and Grafana dashboards - CI/CD pipeline with GitHub Actions **Technologies:** FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana [View Project 01 →](./projects/project-101-basic-model-serving/README.md) --- ### Project 02: End-to-End MLOps Pipeline ✅ **⭐⭐ Intermediate | 40 hours | 30 files** Create a production MLOps pipeline: - Apache Airflow DAGs (data, training, deployment) - MLflow experiment tracking and model registry - DVC for data versioning - Automated model deployment to Kubernetes - Comprehensive monitoring and alerting - CI/CD with automated testing **Technologies:** Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes [View Project 02 →](./projects/project-102-mlops-pipeline/README.md) --- ### Project 03: LLM Deployment Platform ✅ **⭐⭐⭐ Advanced | 50 hours | 47 files** Deploy cutting-edge LLM infrastructure: - vLLM/TensorRT-LLM for optimized serving - RAG system with vector database (Pinecone/ChromaDB/Milvus) - Document ingestion pipeline (PDF, TXT, web) - FastAPI with Server-Sent Events streaming - Kubernetes with GPU support - Cost tracking and optimization - Comprehensive monitoring **Technologies:** vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers [View Project 03 →](./projects/project-103-llm-deployment/README.md) --- ## 💰 Cost Considerations ### Cloud Costs All learning materials can be completed within **free tier limits**: - **AWS**: 750 hours/month t2.micro + $300 credits (varies) - **GCP**: $300 credit (90 days) - **Azure**: $200 credit (30 days) **GPU costs** (optional, for advanced projects): - On-demand: $1-3/hour - Spot instances: $0.30-1/hour (70% savings) - Estimated total: $50-150 for complete curriculum ### Optimization Tips - Use spot instances for training (60-90% savings) - Leverage free tiers across multiple cloud providers - Delete resources when not in use - Use local development where possible --- ## 📚 Resources ### Included Documentation - Comprehensive lesson materials with examples - Code stubs with TODO comments for guided implementation - Complete project specifications with architecture diagrams - Quizzes and assessments for each module - Best practices and design patterns ### External Resources - 📖 **Reading Lists**: [resources/reading-lists/](./resources/reading-lists/) — advanced + staff-engineer paths - 🛠️ **Cheat Sheets**: [resources/cheat-sheets/](./resources/cheat-sheets/) — docker, kubernetes, git, linux, python infrastructure - ❓ **FAQ**: [resources/faq.md](./resources/faq.md) ### Curriculum Documentation - 📋 **[Technology Versions Guide](VERSIONS.md)** - Recommended versions for all tools and frameworks - 🗺️ **[Curriculum Cross-Reference](https://github.com/ai-infra-curriculum/.github/blob/main/CURRICULUM_CROSS_REFERENCE.md)** - Mapping between Junior and Engineer tracks - 📈 **[Career Progression Guide](https://github.com/ai-infra-curriculum/.github/blob/main/CAREER_PROGRESSION.md)** - Complete career ladder from Junior to Principal --- ## 🎯 Learning Outcomes & Career Impact ### After Completion, You'll Be Qualified For: **AI Infrastructure Engineer** - 💰 Salary: $120,000 - $180,000 - 🏢 Companies: Tech companies, AI startups, ML-focused organizations - 📈 Demand: Very high (growing 35% year-over-year) **ML Platform Engineer** - 💰 Salary: $130,000 - $190,000 - 🏢 Companies: Large tech firms, enterprises with ML teams - 📈 Demand: High (specialized role) **MLOps Engineer** - 💰 Salary: $110,000 - $170,000 - 🏢 Companies: All organizations doing ML at scale - 📈 Demand: Very high (fastest growing ML role) ### Skills You'll Demonstrate ✅ Kubernetes expertise with GPU scheduling ✅ End-to-end MLOps pipeline implementation ✅ LLM infrastructure and RAG systems ✅ Distributed training and GPU optimization ✅ Production monitoring and observability ✅ Cloud platform mastery (AWS, GCP, Azure) ✅ Infrastructure as Code with Terraform ✅ Cost optimization strategies --- ## 📊 Repository Statistics - **Total Files:** 207 - **Estimated Lines:** ~95,000+ - **Modules:** 10 (all complete) - **Projects:** 3 (all complete) - **Learning Hours:** 500+ - **Technologies:** 50+ ### Technology Stack Covered **Core Infrastructure:** Docker, Kubernetes, Terraform, Helm, ArgoCD **ML & Data:** PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC **MLOps:** MLflow, Feature Stores, Model Registry, CI/CD **LLM Infrastructure:** vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB) **Cloud Platforms:** AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML) **Monitoring:** Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack **GPU Computing:** CUDA, NCCL, Multi-GPU training, Distributed training --- ## 🤝 Contributing We welcome contributions! Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for: - Bug reports and fixes - Documentation improvements - New exercises and examples - Updated best practices --- ## 🆘 Getting Help - 📖 **Documentation**: Start with [GETTING_STARTED.md](./GETTING_STARTED.md) - 💬 **GitHub Discussions**: [Ask questions](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/discussions) - 🐛 **Issues**: [Report bugs](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/issues) - 📧 **Contact**: [email protected] --- ## 📜 License This project is licensed under the **MIT License** - see [LICENSE](./LICENSE) for details. --- ## 🌟 Success Metrics Upon completion, you should be able to: - [ ] Deploy ML models to production with confidence - [ ] Build complete MLOps pipelines from scratch - [ ] Implement LLM infrastructure with RAG - [ ] Optimize cloud costs by 60-80% - [ ] Debug complex distributed systems - [ ] Pass technical interviews for AI Infrastructure roles - [ ] Confidently discuss trade-offs in system design - [ ] Lead infrastructure projects at your organization --- ## 🚀 Next Steps After Completion This curriculum prepares you for **AI Infrastructure Engineer** roles. For career progression: 1. **Gain Experience** (1-2 years) - Work on production ML systems - Handle incidents and on-call rotations - Contribute to open-source ML infrastructure projects 2. **Advance to Senior Engineer** (2-3 years total) - Continue with the [Senior Engineer track](https://github.com/ai-infra-curriculum/ai-infra-senior-engineer-learning) - Lead larger projects and mentor juniors - Design complex systems 3. **Become an Architect** (4-6 years total) - Continue with the [Architect track](https://github.com/ai-infra-curriculum/ai-infra-architect-learning) - Design enterprise ML platforms - Strategic technical leadership --- <div align="center"> ## Ready to Master AI Infrastructure Engineering? **Start your journey today!** [📘 Get Started](./GETTING_STARTED.md) | [📚 View Full Curriculum](./CURRICULUM.md) | [🚀 Start Module 01](./lessons/mod-101-foundations/README.md) --- ⭐ **Star this repository** if you find it valuable! **Share with others** learning AI Infrastructure Engineering! --- *Contact: [email protected]* **Happy Learning!** 🎓🚀 </div> ---  Maintained by [VeriSwarm.ai](https://veriswarm.ai)

Education & Learning Mobile Development

1.1K Github Stars

Software by ai-infra-curriculum

ai-infra-engineer-learning