
ai-infra-curriculum

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Total Products
1

Software by ai-infra-curriculum

ai-infra-engineer-learning
Open Source


# AI Infrastructure Engineer - Learning Path

<div align="center">

![License](https://img.shields.io/badge/license-MIT-blue.svg)
![Progress](https://img.shields.io/badge/modules-10/10_complete-brightgreen.svg)
![Projects](https://img.shields.io/badge/projects-3/3_complete-brightgreen.svg)
![Duration](https://img.shields.io/badge/duration-500+_hours-red.svg)

*Master AI Infrastructure Engineering through hands-on projects and practical learning*

[Prerequisites](./PREREQUISITES.md) • [Getting Started](#-getting-started) • [Curriculum](#-curriculum-overview) • [Projects](#-projects) • [Resources](#-resources)

</div>

---

## 🎯 Overview

This repository contains a **complete, production-ready learning path** for becoming an **AI Infrastructure Engineer**. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale.

**Repository Status:** ✅ **100% COMPLETE** - All modules and projects ready for learning!

### What You'll Master

- ✅ **Build ML Infrastructure** from scratch (Docker, Kubernetes, cloud platforms)
- ✅ **Deploy Production ML Systems** with auto-scaling and comprehensive monitoring
- ✅ **Implement End-to-End MLOps** pipelines (Airflow, MLflow, DVC)
- ✅ **Deploy Cutting-Edge LLM Infrastructure** (vLLM, RAG, vector databases)
- ✅ **Scale Training** with distributed systems and GPU clusters
- ✅ **Monitor and Troubleshoot** complex ML systems in production
- ✅ **Optimize Costs** across cloud providers (60-80% savings possible)

### Why This Learning Path?

- 🎓 **Industry-Aligned**: Based on actual job requirements from FAANG and top tech companies
- 💻 **Hands-On**: Code stubs with TODO comments guide you through real implementations
- 🏗️ **Production-Ready**: Learn patterns used at Netflix, Uber, Airbnb, OpenAI
- 📊 **Career-Focused**: Directly maps to $120k-$180k AI Infrastructure Engineer roles
- 🚀 **Progressive**: 10 modules building from basics to advanced LLM infrastructure
- 🔥 **Modern Stack**: 2024-2025 technologies (vLLM, RAG, GPU optimization)

---

## ✨ What's New

**2026-05-27 - Layout standardisation:**

- 🧹 **Removed 10 empty root-level `mod-XXX-*/` placeholder directories.** They were vestiges from a pre-refactor layout; all canonical module content has lived under `lessons/mod-XXX-*/` for some time. The repo now matches the layout expected by the curriculum-runner audit (`lessons/` for learning content, `modules/` in the paired solutions repo).
- 🧹 **Removed orphan `lessons/mod-101-foundations/exercises/solutions/`** (a duplicate single-file index). Reference solutions live in the paired [`ai-infra-engineer-solutions`](https://github.com/ai-infra-curriculum/ai-infra-engineer-solutions) repo; inline pointers throughout the lessons now link there directly.

**May 2026 Update:**

- 🧪 **All 62 promised labs authored** across all 10 modules (foundations → LLM infrastructure). Each lab is a substantive, runnable walkthrough with objectives, prerequisites, numbered steps, validation checklist, cleanup, and troubleshooting.
- 📒 **Two new reading lists:** `advanced-engineer-path.md` and `staff-engineer-path.md` (9–18 months and 2–5 years respectively).
- 🧹 **Structural cleanup:** mod-101 lecture duplicates resolved, quiz placement consolidated, empty Makefile/pyproject populated with real content, CURRICULUM.md self-claim corrected to reflect actual completion state.
- 🔍 **Honesty pass on CURRICULUM.md:** the prior "100% Complete" claim has been replaced with a per-module exercise/lab accounting. Lectures and projects are excellent; exercises are 32 of 119 promised and being filled in over subsequent content drops.

**Earlier:**

- 📝 **Comprehensive Quizzes** for modules 102-110 (265+ questions)
  - Module 102: Cloud Computing (mid-module + final, 50 questions)
  - Module 103: Containerization (25 questions)
  - Module 104: Kubernetes (30 questions)
  - Module 105: Data Pipelines (25 questions)
  - Module 106: MLOps (30 questions)
  - Module 107: GPU Computing (25 questions)
  - Module 108: Monitoring (25 questions)
  - Module 109: IaC (25 questions)
  - Module 110: LLM Infrastructure (30 questions)
- 📋 **Technology Versions Guide** - Complete specifications for 100+ tools
- 🗺️ **Curriculum Cross-Reference** - Mapping to Junior track
- 📈 **Career Progression Guide** - Engineer to Principal roadmap

---

## 📊 What's Included

### 10 Complete Learning Modules (130 Files)

| Module | Topic | Hours | Status | Quiz |
|--------|-------|-------|--------|------|
| 01 | **Foundations** | 50h | ✅ Complete (15 files) | ✅ 30Q |
| 02 | **Cloud Computing** | 50h | ✅ Complete (11 files) | ✨ **+50Q** |
| 03 | **Containerization** | 50h | ✅ Complete (14 files) | ✨ **+25Q** |
| 04 | **Kubernetes** | 50h | ✅ Complete (13 files) | ✨ **+30Q** |
| 05 | **Data Pipelines** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 06 | **MLOps** | 50h | ✅ Complete (12 files) | ✨ **+30Q** |
| 07 | **GPU Computing** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 08 | **Monitoring & Observability** | 50h | ✅ Complete (11 files) | ✨ **+25Q** |
| 09 | **Infrastructure as Code** | 50h | ✅ Complete (12 files) | ✨ **+25Q** |
| 10 | **LLM Infrastructure** | 50h | ✅ Complete (12 files) | ✨ **+30Q** |

### 3 Production-Grade Projects (77 Files)

| Project | Technologies | Duration | Files | Status |
|---------|-------------|----------|-------|--------|
| **01: Basic Model Serving** | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete |
| **02: MLOps Pipeline** | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete |
| **03: LLM Deployment** | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete |

**Total Repository:** 207 files | ~95,000+ lines of code | 500+ hours of learning content

---

## 🎓 Prerequisites

### Option 1: Complete Junior Curriculum (RECOMMENDED)

If you've completed the [**Junior AI Infrastructure Engineer**](https://github.com/ai-infra-curriculum/ai-infra-junior-engineer-learning) curriculum, you have **ALL** required prerequisites! ✅

The Junior curriculum covers:

- ✅ Python fundamentals & advanced concepts
- ✅ Linux/Unix command line mastery
- ✅ Git & version control workflows
- ✅ ML basics (PyTorch, TensorFlow)
- ✅ Docker & containerization
- ✅ Kubernetes introduction
- ✅ API development & databases
- ✅ Monitoring & cloud platforms

**Duration**: 440 hours (22 weeks part-time, 11 weeks full-time)

### Option 2: Self-Assessment

**Haven't completed Junior curriculum?** Use our comprehensive [**Prerequisites Guide**](./PREREQUISITES.md) to:

- Check your readiness with detailed skill checklists
- Identify knowledge gaps
- Get personalized learning recommendations
- Run automated skill assessment

### Minimum Requirements

If self-studying, you must have:

- **Python 3.9+** (intermediate level: OOP, async, testing, type hints)
- **Linux/Unix CLI** (bash scripting, processes, debugging)
- **Git fundamentals** (branching, merging, collaboration)
- **ML basics** (PyTorch/TensorFlow, training, inference, evaluation)
- **Docker basics** (images, containers, Compose)
- **Kubernetes intro** (pods, deployments, services)

**👉 Not sure if you're ready?** Read the [**Prerequisites Guide**](./PREREQUISITES.md) for detailed assessment.

---

## 🚀 Getting Started

### Quick Start

```bash
# 1. Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git
cd ai-infra-engineer-learning

# 2. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start with Module 01
cd lessons/mod-101-foundations
cat README.md
```

### Learning Path

1. **Modules 01-02 (Foundations)** - Start here if new to ML infrastructure
2. **Modules 03-04 (Core Infrastructure)** - Docker and Kubernetes mastery
3. **Modules 05-06 (MLOps)** - Data pipelines and ML operations
4. **Modules 07-08 (Advanced)** - GPU computing and monitoring
5. **Modules 09-10 (Modern Stack)** - IaC and LLM infrastructure

**Detailed guide:** [GETTING_STARTED.md](./GETTING_STARTED.md)

---

## 📖 Curriculum Overview

### Module 01: Foundations ✅

**50 hours | 15 files**

Build your foundation in ML infrastructure:

- ML infrastructure landscape and career paths
- Python environment setup and best practices
- ML frameworks (PyTorch, TensorFlow)
- Docker fundamentals and containerization
- REST API development with FastAPI

[View Module 01 →](./lessons/mod-101-foundations/README.md)

---

### Module 02: Cloud Computing ✅

**50 hours | 11 files**

Master cloud platforms for ML:

- Cloud architecture for ML workloads
- AWS (EC2, S3, EKS, SageMaker)
- GCP (Compute Engine, GCS, GKE, Vertex AI)
- Azure (VMs, Blob Storage, AKS, Azure ML)
- Multi-cloud strategies and cost optimization (60-80% savings)

[View Module 02 →](./lessons/mod-102-cloud-computing/README.md)

---

### Module 03: Containerization ✅

**50 hours | 14 files**

Deep dive into containers:

- Docker architecture and best practices
- Multi-stage builds and optimization
- Docker Compose for multi-service applications
- Container registries and image management
- Security and vulnerability scanning

[View Module 03 →](./lessons/mod-103-containerization/README.md)

---

### Module 04: Kubernetes ✅

**50 hours | 13 files**

Master Kubernetes for ML:

- Kubernetes architecture and components
- Deployments, Services, ConfigMaps, Secrets
- GPU resource management and scheduling
- Autoscaling (HPA, VPA, Cluster Autoscaler)
- Helm charts and GitOps with ArgoCD

[View Module 04 →](./lessons/mod-104-kubernetes/README.md)

---

### Module 05: Data Pipelines ✅

**50 hours | 12 files**

Build robust data pipelines:

- Apache Airflow for workflow orchestration
- Data processing with Apache Spark
- Streaming data with Apache Kafka
- Data version control with DVC
- Data quality validation and monitoring

[View Module 05 →](./lessons/mod-105-data-pipelines/README.md)

---

### Module 06: MLOps ✅

**50 hours | 12 files**

Implement MLOps best practices:

- Experiment tracking with MLflow
- Model registry and versioning
- Feature stores and engineering
- CI/CD for ML models
- A/B testing and experimentation
- ML governance and best practices

[View Module 06 →](./lessons/mod-106-mlops/README.md)

---

### Module 07: GPU Computing & Distributed Training ✅

**50 hours | 12 files**

Harness GPU power:

- CUDA programming fundamentals
- PyTorch GPU acceleration
- Distributed training (DDP, FSDP)
- Multi-GPU and multi-node training
- Model and pipeline parallelism
- GPU memory optimization

[View Module 07 →](./lessons/mod-107-gpu-computing/README.md)

---

### Module 08: Monitoring & Observability ✅

**50 hours | 11 files**

Build comprehensive observability:

- Prometheus and Grafana
- Metrics, logs, and traces (OpenTelemetry)
- Distributed tracing with Jaeger
- Alerting and incident response
- Model performance monitoring
- SLIs, SLOs, and SLAs

[View Module 08 →](./lessons/mod-108-monitoring-observability/README.md)

---

### Module 09: Infrastructure as Code ✅

**50 hours | 12 files**

Automate infrastructure:

- Terraform fundamentals and best practices
- Pulumi for multi-language IaC
- CloudFormation for AWS
- State management and modules
- Multi-environment deployments
- GitOps workflows

[View Module 09 →](./lessons/mod-109-infrastructure-as-code/README.md)

---

### Module 10: LLM Infrastructure ✅

**50 hours | 12 files**

Master cutting-edge LLM infrastructure (2024-2025):

- LLM serving with vLLM and TensorRT-LLM
- RAG (Retrieval-Augmented Generation)
- Vector databases (Pinecone, Weaviate, Milvus)
- Model quantization (FP16, INT8)
- GPU optimization for inference
- Cost tracking and optimization

[View Module 10 →](./lessons/mod-110-llm-infrastructure/README.md)

---

## 🛠️ Projects

### Project 01: Basic Model Serving System ✅

**⭐ Beginner | 30 hours | ~30 files**

Build a complete model serving system:

- FastAPI REST API for image classification
- Docker containerization with optimization
- Kubernetes deployment with monitoring
- Prometheus and Grafana dashboards
- CI/CD pipeline with GitHub Actions

**Technologies:** FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana

[View Project 01 →](./projects/project-101-basic-model-serving/README.md)

---

### Project 02: End-to-End MLOps Pipeline ✅

**⭐⭐ Intermediate | 40 hours | 30 files**

Create a production MLOps pipeline:

- Apache Airflow DAGs (data, training, deployment)
- MLflow experiment tracking and model registry
- DVC for data versioning
- Automated model deployment to Kubernetes
- Comprehensive monitoring and alerting
- CI/CD with automated testing

**Technologies:** Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes

[View Project 02 →](./projects/project-102-mlops-pipeline/README.md)

---

### Project 03: LLM Deployment Platform ✅

**⭐⭐⭐ Advanced | 50 hours | 47 files**

Deploy cutting-edge LLM infrastructure:

- vLLM/TensorRT-LLM for optimized serving
- RAG system with vector database (Pinecone/ChromaDB/Milvus)
- Document ingestion pipeline (PDF, TXT, web)
- FastAPI with Server-Sent Events streaming
- Kubernetes with GPU support
- Cost tracking and optimization
- Comprehensive monitoring

**Technologies:** vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers

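
The RAG item in Project 03 comes down to two steps: retrieve the documents most similar to a question, then prepend them to the prompt. The sketch below illustrates only that idea and is not the project's code. It assumes a toy bag-of-words embedding in place of a real embedding model and a plain Python list in place of Pinecone, ChromaDB, or Milvus; the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Augment the question with retrieved context before it is sent to an LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    # Hypothetical in-memory corpus standing in for a vector database.
    docs = [
        "vLLM serves large language models with paged attention",
        "Airflow schedules data pipelines as DAGs",
        "Prometheus scrapes metrics from services",
    ]
    print(build_prompt("how does vLLM serve language models", docs, k=1))
```

In a real deployment the list scan would be replaced by a vector database query and `embed` by a learned embedding model; the retrieve-then-augment shape stays the same.
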
[View Project 03 →](./projects/project-103-llm-deployment/README.md)

---

## 💰 Cost Considerations

### Cloud Costs

All learning materials can be completed within **free tier limits**:

- **AWS**: 750 hours/month t2.micro + $300 credits (varies)
- **GCP**: $300 credit (90 days)
- **Azure**: $200 credit (30 days)

**GPU costs** (optional, for advanced projects):

- On-demand: $1-3/hour
- Spot instances: $0.30-1/hour (70% savings)
- Estimated total: $50-150 for complete curriculum

### Optimization Tips

- Use spot instances for training (60-90% savings)
- Leverage free tiers across multiple cloud providers
- Delete resources when not in use
- Use local development where possible

---

## 📚 Resources

### Included Documentation

- Comprehensive lesson materials with examples
- Code stubs with TODO comments for guided implementation
- Complete project specifications with architecture diagrams
- Quizzes and assessments for each module
- Best practices and design patterns

### External Resources

- 📖 **Reading Lists**: [resources/reading-lists/](./resources/reading-lists/) - advanced and staff-engineer paths
- 🛠️ **Cheat Sheets**: [resources/cheat-sheets/](./resources/cheat-sheets/) - docker, kubernetes, git, linux, python infrastructure
- ❓ **FAQ**: [resources/faq.md](./resources/faq.md)

### Curriculum Documentation

- 📋 **[Technology Versions Guide](VERSIONS.md)** - Recommended versions for all tools and frameworks
- 🗺️ **[Curriculum Cross-Reference](https://github.com/ai-infra-curriculum/.github/blob/main/CURRICULUM_CROSS_REFERENCE.md)** - Mapping between Junior and Engineer tracks
- 📈 **[Career Progression Guide](https://github.com/ai-infra-curriculum/.github/blob/main/CAREER_PROGRESSION.md)** - Complete career ladder from Junior to Principal

---

## 🎯 Learning Outcomes & Career Impact

### After Completion, You'll Be Qualified For:

**AI Infrastructure Engineer**

- 💰 Salary: $120,000 - $180,000
- 🏢 Companies: Tech companies, AI startups, ML-focused organizations
- 📈 Demand: Very high (growing 35% year-over-year)

**ML Platform Engineer**

- 💰 Salary: $130,000 - $190,000
- 🏢 Companies: Large tech firms, enterprises with ML teams
- 📈 Demand: High (specialized role)

**MLOps Engineer**

- 💰 Salary: $110,000 - $170,000
- 🏢 Companies: All organizations doing ML at scale
- 📈 Demand: Very high (fastest growing ML role)

### Skills You'll Demonstrate

- ✅ Kubernetes expertise with GPU scheduling
- ✅ End-to-end MLOps pipeline implementation
- ✅ LLM infrastructure and RAG systems
- ✅ Distributed training and GPU optimization
- ✅ Production monitoring and observability
- ✅ Cloud platform mastery (AWS, GCP, Azure)
- ✅ Infrastructure as Code with Terraform
- ✅ Cost optimization strategies

---

## 📊 Repository Statistics

- **Total Files:** 207
- **Estimated Lines:** ~95,000+
- **Modules:** 10 (all complete)
- **Projects:** 3 (all complete)
- **Learning Hours:** 500+
- **Technologies:** 50+

### Technology Stack Covered

**Core Infrastructure:** Docker, Kubernetes, Terraform, Helm, ArgoCD

**ML & Data:** PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC

**MLOps:** MLflow, Feature Stores, Model Registry, CI/CD

**LLM Infrastructure:** vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB)

**Cloud Platforms:** AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML)

**Monitoring:** Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack

**GPU Computing:** CUDA, NCCL, Multi-GPU training, Distributed training

---

## 🤝 Contributing

We welcome contributions!
Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for:

- Bug reports and fixes
- Documentation improvements
- New exercises and examples
- Updated best practices

---

## 🆘 Getting Help

- 📖 **Documentation**: Start with [GETTING_STARTED.md](./GETTING_STARTED.md)
- 💬 **GitHub Discussions**: [Ask questions](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/discussions)
- 🐛 **Issues**: [Report bugs](https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/issues)
- 📧 **Contact**: [email protected]

---

## 📜 License

This project is licensed under the **MIT License** - see [LICENSE](./LICENSE) for details.

---

## 🌟 Success Metrics

Upon completion, you should be able to:

- [ ] Deploy ML models to production with confidence
- [ ] Build complete MLOps pipelines from scratch
- [ ] Implement LLM infrastructure with RAG
- [ ] Optimize cloud costs by 60-80%
- [ ] Debug complex distributed systems
- [ ] Pass technical interviews for AI Infrastructure roles
- [ ] Confidently discuss trade-offs in system design
- [ ] Lead infrastructure projects at your organization

---

## 🚀 Next Steps After Completion

This curriculum prepares you for **AI Infrastructure Engineer** roles. For career progression:

1. **Gain Experience** (1-2 years)
   - Work on production ML systems
   - Handle incidents and on-call rotations
   - Contribute to open-source ML infrastructure projects
2. **Advance to Senior Engineer** (2-3 years total)
   - Continue with the [Senior Engineer track](https://github.com/ai-infra-curriculum/ai-infra-senior-engineer-learning)
   - Lead larger projects and mentor juniors
   - Design complex systems
3. **Become an Architect** (4-6 years total)
   - Continue with the [Architect track](https://github.com/ai-infra-curriculum/ai-infra-architect-learning)
   - Design enterprise ML platforms
   - Strategic technical leadership

---

<div align="center">

## Ready to Master AI Infrastructure Engineering?

**Start your journey today!**

[📘 Get Started](./GETTING_STARTED.md) | [📚 View Full Curriculum](./CURRICULUM.md) | [🚀 Start Module 01](./lessons/mod-101-foundations/README.md)

---

⭐ **Star this repository** if you find it valuable!

**Share with others** learning AI Infrastructure Engineering!

---

*Contact: [email protected]*

**Happy Learning!** 🎓🚀

</div>

---

<!-- aicg:maintained-by -->
Maintained by [VeriSwarm.ai](https://veriswarm.ai)

Education & Learning Mobile Development
1.1K Github Stars