About Hands-On-AI-Engineering


Published by sumanth077

🚀 Hands-On AI Engineering

A curated collection of practical, production-ready AI projects across multiple modalities, including language models, multimodal models, OCR systems, RAG pipelines, and AI agents. Each project is designed to help you learn, experiment, and build real-world AI applications.

🎯 Why This Repository?

Learn by Doing: Each project includes complete code, setup instructions, and documentation
Production-Ready: Projects follow best practices and are ready to be adapted for real-world use
Diverse Use Cases: From RAG systems to multi-agent workflows and specialized applications
Multiple Model Providers: Projects use OpenAI, Anthropic, Google, and open-source models
Active Community: Regular updates and new project additions

🗂️ Project Categories

🤖 AI Agents

Intelligent AI agents for various automation tasks.

Multi-Agent Financial Analyst — Team of specialized agents for comprehensive financial analysis.
FinAgent — Financial assistant agent for stock market analysis and insights.
Daily AI News Digest — Automated daily digest from 92 Karpathy-curated tech blogs delivered to Telegram every morning. MiniMax M2.7 scores articles from the last 24 hours and surfaces the 3 most significant stories.
Agentic Form Filler — Form-filling agent that uses LandingAI for layout parsing and MiniMax M2.7 for multi-turn data gathering.
AI Travel Planning Agent — Multi-agent travel planner that turns a single natural language request into a complete trip plan with flights, hotels, and a day-by-day itinerary.
Competitive Intelligence Agent — Generates strategic sales battlecards by analyzing competitors through the lens of your own business context.
Multi-Agent Research Assistant (AG2) — Multi-agent research pipeline using AG2 where three specialists collaborate to research any topic and produce a structured report.
Self-Reflective Agentic RAG — LangGraph RAG system that grades retrieved context, rewrites the query if needed, and generates an answer only once the context passes validation.
Agentic SQL Search — Natural language to SQL agent powered by Gemma 4 that writes, executes, and explains queries against an e-commerce database.
Stock Portfolio Analyst — Portfolio analysis agent built with Agno and DeepSeek-V4-Flash. Fetches live market data via YFinance and generates a report covering P&L, concentration risk, and rebalancing recommendations.
Eagle Eye — GitHub PR review agent using OpenClaw and Telegram. Fetches diffs via GitHub MCP, performs structured code review with severity ratings, and posts feedback after user approval.
CartMate — AI Customer Support Agent — Memory-powered e-commerce support agent built with Mem0 and Mistral Small 4 that remembers customers and picks up conversations where they left off.
Multi-Agent Coding Assistant — Four-stage coding pipeline powered by Mistral Small 4 and LangChain. Planner, Coder, and Reviewer agents collaborate to produce a polished final implementation.
Startup Analyst — Startup due-diligence agent powered by MiniMax M2.5. Scrapes a company's site with Firecrawl and produces an investment-grade report covering market position, financials, team, and risks.
Research Team — Multi-agent research system powered by MiniMax M2.5. Seek searches the web, Scout navigates internal documents, and a team leader synthesises findings into a structured report.
GitHub Intelligence Agent — GitHub research agent powered by Gemini 3 Flash and GitHub's official MCP server. Ask anything about repos, contributors, issues, or codebases.
Smolagents Code Agent — Agentic task runner powered by Mistral Small 4 and HuggingFace smolagents. Writes and executes Python code at each step using DuckDuckGo and Wikipedia.
Agent Discovery Agent — Searches and compares AI agents across NANDA, MCP, Virtuals Protocol, A2A, and ERC-8004 through a single natural language interface. Powered by Gemini 3 Flash.
Cal Scheduling Agent — Conversational scheduling assistant that manages Cal.com appointments through natural language. Book, reschedule, cancel, and check availability with automatic timezone handling.
Hacker News Newsletter Agent — Fetches the 10 latest Hacker News stories, scrapes full article content with Trafilatura, generates a structured HTML newsletter with Gemma 4, and delivers it to your inbox via Gmail SMTP.
Hotel Finder Agent — Conversational hotel search agent powered by qwen3.6-flash via Orq.ai and the Trivago MCP Server. Search by location, dates, guest count, price range, star rating, and amenities.
Marketing Strategy Agent — Multi-agent marketing campaign generator. A Market Analyst (with Serper web search), Strategy Officer, and Creative Director run sequentially to produce market research, a full strategy, and creative campaign content. Powered by deepseek-v4-flash via Orq.ai.
Brand Monitor — Monitors brand mentions across Web, YouTube, Twitter/X, and LinkedIn in a single run. Scrapingdog collects platform data and DeepSeek V4 Flash produces a structured intelligence brief per channel.
AI Debate Agent - Two LLM debaters argue opposing sides of any topic you choose. A judge scores each turn and declares a winner.
Browser Automation Agent - Takes a natural language instruction and autonomously navigates the web to complete it using browser-use.
Documentation QnA Agent - Chat with any documentation by URL. Uses Fetch MCP and DeepSeek V4 Flash on NVIDIA NIM.
Job Posting Agent - Generates tailored job postings from a company name and role using DeepSeek V4 Flash on NVIDIA NIM.
LangChain Data Agent - Query the Chinook SQLite database in plain English through a conversational Streamlit chat interface.
Travel Planner Agent - AI trip planning assistant covering weather, budget, packing lists, and day-by-day itineraries from a single request.
Personal Finance Agent - Upload a bank statement CSV, auto-categorize transactions, and ask natural language questions about your spending. Powered by a LangChain tool-calling agent backed by Orq.ai with SQLite persistence.
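
The grade, rewrite, and generate loop described in the Self-Reflective Agentic RAG entry above can be sketched in plain Python. This is a minimal illustration, not the project's code: the project itself is built on LangGraph, and every name here (`self_reflective_answer`, `retrieve`, `grade`, `rewrite`, `generate`, `max_rewrites`) is hypothetical. Returning `None` when the rewrite budget runs out is also an assumption; the entry only says an answer is generated once the context passes validation.

```python
from typing import Callable, Optional


def self_reflective_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, list[str]], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
    max_rewrites: int = 2,
) -> Optional[str]:
    """Answer only when the retrieved context passes the grader.

    All callables are hypothetical stand-ins for a retriever and LLM calls.
    """
    query = question
    for attempt in range(max_rewrites + 1):
        context = retrieve(query)
        # Grade the context against the original question, not the rewrite.
        if grade(question, context):
            return generate(question, context)
        # Context failed validation: rewrite the query if budget remains.
        if attempt < max_rewrites:
            query = rewrite(query)
    # Assumption: give up without answering rather than generate ungrounded text.
    return None
```

In a real pipeline, `grade` and `rewrite` would typically be LLM calls and `retrieve` a vector-store query; keeping them as injected callables makes the control flow easy to test with simple fakes.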

📸 OCR

Extracting structure and meaning from visual data and documents.

Image-to-Structured-Data Extractor — Converts images into validated, structured JSON using Mistral Large 3 and Instructor.
LaTeX Formula OCR — Extracts math formulas from images and PDFs into LaTeX using a local vision-language model.
Medical Prescription Digitizer — Digitizes handwritten or printed prescriptions into structured output using Mistral Large 3, with real-time drug name validation against RxNorm.

🎧 Audio

Projects for audio understanding and analysis.

Music Explorer — Chat with any audio file or YouTube video using Gemini 3 Flash. Ask for transcriptions, emotion analysis, instrument identification, and timestamp-aware breakdowns.
Multilingual Audio Translator — Upload or record audio in any language, get it transcribed with faster-whisper, translated via Gemini, and played back as synthesized speech using Kokoro TTS.

🎬 Multimodal

Projects combining vision, video, and language models.

GLM-OCR Pro — Structured document extraction using GLM-OCR via Ollama, transforming images and PDFs into formatted Markdown locally.
Video Understanding Agent — Summarizes YouTube videos into chapters, key takeaways, and action items using Gemini Flash.
Multimodal Weather App — Upload a map image and get live weather. Mistral Small 4 identifies the city via vision, then fetches real-time conditions through native tool calling.
Multimodal RAG — RAG system that ingests text, URLs, PDFs, images, audio, and video into a shared ChromaDB index. Gemini Embedding 2 handles retrieval and Gemini 3 Flash generates grounded answers, passing actual file URIs for media sources.
Image Question Answering — Upload a PDF, select a page, and ask visual questions answered by Gemma 4 with thinking mode. PyMuPDF renders each page to a full-resolution image for grounded reasoning over charts, tables, and figures.
Medical Document Parser - Extracts a structured clinical profile from medical PDFs and images using Gemma 4 vision.

📚 RAG Applications

Retrieval-Augmented Generation systems for knowledge-enhanced AI applications.

Agentic RAG with O3-Mini & DuckDuckGo — RAG system using O3-Mini with DuckDuckGo for real-time web search.
Agentic RAG with Qwen & FireCrawl — RAG system using Qwen and FireCrawl for web scraping and retrieval.
Vision RAG — Multimodal RAG system for processing and querying visual content.
Clinical RAG with ADE — High-precision clinical RAG using LandingAI ADE for visual-first document parsing and Mistral Large for grounded reasoning.
YouTube Transcript RAG — Chat with any YouTube video using Whisper transcription, ChromaDB retrieval, and Mistral Small 4, with timestamp-linked answers.
GraphRAG Knowledge System — Builds a local knowledge graph from uploaded documents using Mistral Small 4 and NetworkX, supporting both entity-level and thematic queries.
Hybrid RAG System — Indexes documents into a knowledge graph and a vector store in parallel. Mistral Small 4 answers questions with fused context from both retrieval paths.
HyDE RAG — RAG pipeline using Hypothetical Document Embeddings. Gemini 3 Flash generates hypothetical answers, Gemini Embedding 2 embeds and averages them, and the result retrieves more precise chunks from ChromaDB.
Rock Music RAG — Custom rock music knowledge base built from Wikipedia. Add any band, ask questions across all of them, and get sourced answers powered by BM25 retrieval and Gemma 4.
RAG Agent with Database Routing — Routes queries across three specialized Qdrant databases (products, support, financial) using an Agno router agent. Falls back to a LangGraph ReAct web search agent when no relevant documents are found.
Reasoning RAG - Ask questions against any web source and get cited answers with a live, step-by-step reasoning trace via Gradio.
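
The HyDE RAG entry above describes generating hypothetical answers, embedding them, averaging the embeddings, and retrieving with the averaged vector. The sketch below shows that flow with the standard library only. It is an illustration under stated assumptions, not the project's code: `toy_generate` and `toy_embed` are hypothetical stand-ins for the Gemini generation and embedding calls, and an in-memory list with cosine similarity stands in for ChromaDB.

```python
import math
from typing import Callable, Sequence

Vector = list[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    chunks: Sequence[str],
    n_hypotheses: int = 3,
    top_k: int = 2,
) -> list[str]:
    """Rank chunks against the averaged embedding of hypothetical answers."""
    hypotheses = [generate(question) for _ in range(n_hypotheses)]
    query_vector = mean_vector([embed(text) for text in hypotheses])
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical stand-ins so the sketch runs without any API key.
VOCAB = ["guitar", "drums", "tour", "album"]


def toy_embed(text: str) -> Vector:
    """Bag-of-words counts over a tiny fixed vocabulary."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def toy_generate(question: str) -> str:
    """Canned 'hypothetical answer'; a real model would sample varied text."""
    return "The album featured heavy guitar riffs."


CHUNKS = [
    "The band recorded the album with two guitar players.",
    "The tour started with a drums solo.",
    "Ticket prices rose.",
]

best = hyde_retrieve("What did the record sound like?", toy_generate, toy_embed, CHUNKS, top_k=1)
```

The point of the technique is that the query vector comes from answer-shaped text rather than from the question itself, so it tends to land closer to the passages that actually contain the answer.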

🤝 Contributing

We welcome contributions! Whether you're adding new projects, improving existing ones, or fixing bugs, your help makes this repository better for everyone.

How to Contribute

Read the guidelines: Check CONTRIBUTING.md for detailed instructions
Create an issue: Propose your project or improvement
Follow the structure: Use the appropriate category folder
Submit a PR: One project per pull request

Project Structure Requirements

Each project must be in its own folder within the appropriate category
Must include a comprehensive README.md (use our template)
Must include requirements.txt or pyproject.toml
Must include .env.example for required API keys
Follow snake_case naming convention
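
Before opening a pull request, the requirements above can be checked with a short script. This is a minimal sketch rather than an official tool of this repository: the function name `check_project` is hypothetical, and it assumes the snake_case rule applies to the project folder name.

```python
import re
from pathlib import Path

# Assumption: snake_case means lowercase letters and digits joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def check_project(project_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not SNAKE_CASE.match(project_dir.name):
        problems.append(f"folder name '{project_dir.name}' is not snake_case")
    if not (project_dir / "README.md").is_file():
        problems.append("missing README.md")
    # Either dependency file satisfies the requirement.
    dependency_files = ("requirements.txt", "pyproject.toml")
    if not any((project_dir / name).is_file() for name in dependency_files):
        problems.append("missing requirements.txt or pyproject.toml")
    if not (project_dir / ".env.example").is_file():
        problems.append("missing .env.example")
    return problems
```

Calling `check_project(Path("path/to/your_project"))` (a placeholder path) returns an empty list when the folder satisfies all four checks. The script cannot judge whether the README is comprehensive or follows the template, so that part still needs a manual review.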

📜 License

This repository is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

Thank you to all contributors who have helped build this collection of AI engineering projects!

Built with ❤️ by the AI Engineering Community

For sponsorship or collaboration inquiries, reach the maintainer at [email protected].
