About oreilly-hands-on-gpt-llm

Mastering the Art of Scalable and Efficient AI Model Deployment

Published by sinanuozdemir

Deploying GPT & Large Language Models

This repository contains the code for the O'Reilly Live Online Training course Deploying GPT & LLMs.

This course equips software engineers, data scientists, and machine learning professionals with the skills needed to deploy AI models effectively in production environments. As AI spreads across industries, the ability to deploy, manage, and optimize AI applications at scale matters more and more. The course covers deployment end to end, from tools and formats such as Kubernetes, llama.cpp, and GGUF to cost management, compute optimization, and model quantization.

Base Notebooks

Introduction to LLMs and Prompting

Notebook	Description
Introduction to 3rd Party Providers	Using Together.ai, HuggingFace, and Groq to run LLMs
Prompt Injection Examples	See how three kinds of prompt injection attacks can attempt to jailbreak an LLM

Cleaning Data and Monitoring Drift

Notebook	Description
Cleaning Data using Deep Learning	Using Area Under the Margin (AUM) and cosine similarity to clean data
Combating AI drift	Using Online Learning to combat drift
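As a rough illustration of how cosine similarity can help clean a dataset, the sketch below flags near-duplicate rows by comparing their embedding vectors. It is not taken from the notebook: the function names, the toy vectors, and the 0.95 threshold are all assumptions chosen for the example, and a real pipeline would use embeddings produced by a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy vectors standing in for real sentence embeddings (hypothetical data).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
duplicate_pairs = find_near_duplicates(toy_embeddings)
print(duplicate_pairs)
```

The pairwise loop is quadratic in the number of rows, which is fine for a demo but would likely need an approximate nearest-neighbor index on larger datasets.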

Evaluating Agents

Notebook	Description
Evaluating AI Agents: Task Automation and Tool Integration	A basic case study on tool selection accuracy
Positional Bias on Agent Response Evaluation	Identifying and evaluating positional bias on multiple LLMs
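One common way to probe positional bias, sketched here under stated assumptions, is to show a judge the same two responses in both orders and check whether the same response wins each time. The `judge` callables below are hypothetical stand-ins for an LLM call, and the metric names are made up for this example; the notebook may measure bias differently.

```python
def positional_bias_report(pairs, judge):
    """Run `judge` on each pair in both orders and summarize order sensitivity.

    `judge(first, second)` must return "first" or "second".
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    first_slot_wins = 0
    for a, b in pairs:
        forward = judge(a, b)   # a is shown in the first slot
        backward = judge(b, a)  # b is shown in the first slot
        # Consistent means the same underlying response wins in both orders.
        forward_winner = a if forward == "first" else b
        backward_winner = b if backward == "first" else a
        if forward_winner == backward_winner:
            consistent += 1
        first_slot_wins += (forward == "first") + (backward == "first")
    total = len(pairs)
    return {
        "consistency_rate": consistent / total,
        "first_slot_rate": first_slot_wins / (2 * total),
    }


def always_first_judge(first, second):
    """Hypothetical maximally biased judge: ignores content entirely."""
    return "first"


def longer_is_better_judge(first, second):
    """Hypothetical content-based judge: prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


toy_pairs = [("short", "a much longer answer"), ("detailed reply here", "ok")]
biased = positional_bias_report(toy_pairs, always_first_judge)
unbiased = positional_bias_report(toy_pairs, longer_is_better_judge)
print(biased, unbiased)
```

A judge with no positional bias should score a `first_slot_rate` near 0.5 and a high `consistency_rate`; a judge that drifts toward 1.0 or 0.0 on the first metric is favoring a slot rather than the content.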

LangGraph and Agents

Notebook	Description
From Prompts to Workflows	Why single LLM prompts break on multi-step tasks and how LangGraph provides structure, state, and control flow
LangGraph Basics	Foundational LangGraph primitives: StateGraph, nodes, edges, conditional routing, memory, and visualization
Tools and ReAct Agents	Tool integration and the ReAct pattern with LangChain, manual StateGraph, and MCP

Advanced Deployment Techniques

Notebook	Description
Speculative Decoding	Using an assistant model to aid token decoding
Prompt Caching Llama 3	Replicating prompt caching with HuggingFace tools
Distilling BERT	Distilling models to optimize for speed/memory
Quantizing Llama-3 dynamically	Using bitsandbytes to quantize nearly any LLM on HuggingFace
Working with GGUF (no GPU)	Using llama.cpp to run GGUF models on CPU only
Working with GGUF (with a GPU)	Using llama.cpp to run GGUF models on a GPU
DeepSeek Model on GGUF	Running a DeepSeek Distilled Llama model using llama.cpp
Qwen on GGUF with Llama.cpp	Running Qwen models using llama.cpp
K8s GGUF Demo	Using embedding models and Llama 3 with GGUF on a GPU
vLLM + Gateway on K8s	Production GPU deployment with vLLM, FastAPI gateway, and DigitalOcean K8s
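To give a feel for what quantization does to model weights, here is a minimal sketch of the simplest absmax int8 scheme in plain Python. It is an illustration only and is not how bitsandbytes or GGUF quantization is implemented; those libraries use more elaborate block-wise and mixed-precision formats. The function names and toy weights are assumptions made for this example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        return [0 for _ in weights], 1.0  # all-zero input: any scale works
    scale = absmax / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights standing in for one row of a model's weight matrix.
toy_weights = [0.5, -1.27, 0.013, 1.0]
q, scale = quantize_absmax_int8(toy_weights)
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(toy_weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale per value. A single outlier weight inflates the scale for the whole vector, which is one reason practical schemes quantize in small blocks.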

More

Fine-Tuning LLMs

Notebook	Description
Finetuning app_reviews with OpenAI
Fine-tuning BERT for app_reviews
Model Freezing with BERT

Prompt Engineering

Notebook	Description
Introduction to Prompt Engineering
Advanced Prompt Engineering

Instructor

Sinan Ozdemir

Sinan is a former lecturer of Data Science at Johns Hopkins University and the author of multiple textbooks on data science and machine learning. Additionally, he is the founder of the recently acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a master's degree in Pure Mathematics from Johns Hopkins University and is based in San Francisco, CA.