About awesome-tinyml

TinyML & Edge AI: On-device inference, model quantization, embedded ML, ultra-low-power AI for microcontrollers and IoT devices.

Published by umitkacar

🚀 AI Edge Computing & TinyML

Comprehensive Guide to State-of-the-Art Edge AI

🌟 Latest Update: January 2025

Production-Ready Python Implementation with modern tooling (Hatch, Ruff, Mypy)

62/62 Tests Passing • 81.76% Coverage • Zero Security Issues

State-of-the-Art Algorithms & Trends for Edge AI and Embedded Systems

🚀 Quick Start & Development

📦 Installation

This project uses modern Python tooling with Hatch for dependency management and development workflows.

# Clone the repository
git clone https://github.com/umitkacar/ai-edge-computing-tiny-embedded.git
cd ai-edge-computing-tiny-embedded

# Install Hatch (it creates the project environment and installs dependencies on first run)
pip install hatch

# Run tests
hatch run test

# Run full CI pipeline
hatch run ci

🛠️ Development Setup

Modern Python Stack:

Build System: Hatch - Modern Python project manager
Linting: Ruff - Ultra-fast Python linter (10-100x faster than Flake8, per the Ruff docs)
Formatting: Black - The uncompromising code formatter
Type Checking: Mypy - Static type checker (strict mode)
Testing: Pytest - Comprehensive test framework
Security: Bandit - Security vulnerability scanner
Pre-commit: Automated quality checks on commit/push

Available Commands:

# Linting & Formatting
hatch run lint          # Run Ruff linter
hatch run format        # Format code with Black
hatch run format-check  # Check formatting without changes

# Type Checking
hatch run type-check    # Run Mypy strict type checking

# Testing
hatch run test                    # Run tests (sequential)
hatch run test-parallel           # Run tests with auto workers
hatch run test-parallel-cov       # Parallel tests with coverage

# Security
hatch run security      # Run Bandit security audit

# Complete CI Pipeline
hatch run ci           # Run all checks (format, lint, type-check, security, test)

📊 Project Structure

ai-edge-computing-tiny-embedded/
├── src/ai_edge_tinyml/          # Source code (src layout)
│   ├── __init__.py              # Package initialization
│   ├── quantization.py          # INT8/INT4/FP16 quantization
│   ├── model_optimizer.py       # Model optimization pipeline
│   ├── utils.py                 # Utility functions
│   └── py.typed                 # PEP 561 marker (typed package)
├── tests/                       # Test suite (62 tests, 81.76% coverage)
│   ├── conftest.py              # Pytest configuration & fixtures
│   ├── test_quantization.py     # Quantization tests (21 tests)
│   ├── test_model_optimizer.py  # Optimizer tests (19 tests)
│   └── test_utils.py            # Utility tests (22 tests)
├── pyproject.toml               # Project configuration (single source of truth)
├── .pre-commit-config.yaml      # Pre-commit hooks configuration
├── CHANGELOG.md                 # Detailed change history
├── LESSONS-LEARNED.md           # Best practices & insights
├── DEVELOPMENT.md               # Development guidelines
└── README.md                    # This file

✅ Quality Assurance

This project maintains production-ready code quality:

Check	Status	Details
Ruff Linting	✅ PASS	50+ rules, zero errors
Black Formatting	✅ PASS	Line length: 100
Mypy Type Check	✅ PASS	Strict mode enabled
Bandit Security	✅ PASS	0 vulnerabilities
Test Suite	✅ PASS	62/62 tests passing
Code Coverage	✅ PASS	81.76% (exceeds 80%)
Pre-commit Hooks	✅ PASS	15+ automated checks

Test Results:

tests/test_quantization.py      21 passed
tests/test_model_optimizer.py   19 passed
tests/test_utils.py             22 passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: 62 passed in 0.50s ✅
Coverage: 81.76% (exceeds 80% threshold) ✅

🔒 Security

Bandit Security Audit: Zero vulnerabilities detected
Type Safety: Full type annotations with mypy strict mode
Dependency Scanning: Automated security checks in CI
Pre-commit Hooks: Security validations before commit

📚 Documentation

CHANGELOG.md - Detailed version history and changes
LESSONS-LEARNED.md - Best practices, insights, and technical decisions
DEVELOPMENT.md - Comprehensive development guidelines
API Documentation: Auto-generated from Google-style docstrings

🎯 Features

Quantization Support:

✅ INT8 Quantization (8-bit integers)
✅ INT4 Quantization (4-bit integers)
✅ FP16 Quantization (16-bit floats)
✅ Dynamic Quantization
✅ Symmetric & Asymmetric modes
✅ Per-tensor & per-channel quantization
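
The difference between the symmetric and asymmetric modes listed above can be sketched in a few lines of plain Python. This is a minimal illustration of the arithmetic and is independent of the `ai_edge_tinyml` package. The function names `quantize_int8` and `dequantize` are hypothetical, and one scale per tensor is assumed.

```python
def quantize_int8(values, symmetric=True):
    """Quantize a list of floats to INT8 with one scale for the whole tensor."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Symmetric: real 0.0 maps to integer 0, range is [-127, 127].
        qmin, qmax = -127, 127
        scale = max(abs(lo), abs(hi)) / qmax or 1.0
        zero_point = 0
    else:
        # Asymmetric: the full [-128, 127] range is stretched over [lo, hi].
        qmin, qmax = -128, 127
        lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Skewed (all non-negative) data: asymmetric mode gets a finer step size.
weights = [0.0, 0.5, 1.0, 2.0]
q_sym, scale_sym, zp_sym = quantize_int8(weights, symmetric=True)
q_asym, scale_asym, zp_asym = quantize_int8(weights, symmetric=False)
```

For data that is not centered on zero, the asymmetric mode spends all 256 codes on the observed range, so its step size is roughly half that of the symmetric mode here.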

Model Optimization:

✅ Weight quantization with 6 different modes
✅ Compression ratio analysis
✅ Model size calculation
✅ Type-safe APIs with full annotations
✅ Comprehensive error handling
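
As a rough guide to what the size and compression-ratio figures mean, the arithmetic can be sketched as follows. The helper names are hypothetical and are not the package's API. The sketch counts weight storage only and ignores the small overhead of scales and zero points.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def model_size_bytes(num_params, dtype):
    """Bytes needed to store the weights alone in the given format."""
    bits = num_params * BITS_PER_WEIGHT[dtype]
    return (bits + 7) // 8  # round up to whole bytes (INT4 packs two per byte)


def compression_ratio(num_params, src="fp32", dst="int8"):
    """How many times smaller the weights become going from src to dst."""
    return model_size_bytes(num_params, src) / model_size_bytes(num_params, dst)


# A 100x100 FP32 weight matrix, as in the usage example in this README.
fp32_size = model_size_bytes(100 * 100, "fp32")
int8_ratio = compression_ratio(100 * 100, dst="int8")
int4_ratio = compression_ratio(100 * 100, dst="int4")
```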

Example Usage:

import numpy as np
from ai_edge_tinyml import Quantizer, QuantizationConfig, QuantizationMode
from ai_edge_tinyml.utils import calculate_compression_ratio

# Create quantization config
config = QuantizationConfig(
    mode=QuantizationMode.INT8,
    symmetric=True,
    per_channel=False
)

# Initialize quantizer
quantizer = Quantizer(config)

# Quantize weights
weights = np.random.randn(100, 100).astype(np.float32)
quantized = quantizer.quantize(weights)

# Dequantize for inference
dequantized = quantizer.dequantize(quantized)

# Calculate compression
ratio = calculate_compression_ratio(weights, quantized)
print(f"Compression ratio: {ratio:.2f}x")

🔥 SOTA Models & Algorithms (2024-2025)

🎯 Object Detection Models

🥇 YOLOv11 (YOLO11)

🚀 State-of-the-art real-time object detection with an attention-augmented convolutional backbone

✨ Key Features:

⚡ Convolutional backbone built from C3k2 blocks
🎯 C2PSA attention block built on Partial Self-Attention (PSA)
🔥 Standard NMS post-processing (NMS-free training with dual label assignment is a YOLOv10 feature)
📉 25-40% lower latency vs YOLOv10
📊 10-15% improvement in mAP
⚡ 60+ FPS processing capability

📚 Resources:

📖 Ultralytics Docs → https://docs.ultralytics.com/models/
📄 YOLO Evolution → https://arxiv.org/html/2510.09653v2

🥈 YOLOv10

⚡ Eliminates NMS for end-to-end real-time detection

📊 Performance Metrics:

🔸 YOLOv10-S: 1.8x faster than RT-DETR-R18 at similar COCO AP
🔸 YOLOv10-B: 46% less latency and 25% fewer parameters than YOLOv9-C at the same accuracy
🔸 COCO AP range: 38.5 - 54.4 across model sizes

📚 Resources:

📄 Paper → https://arxiv.org/pdf/2405.14458
📖 Docs → https://docs.ultralytics.com/models/yolov10/
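
For context on what "eliminates NMS" saves at inference time, here is a minimal sketch of the greedy non-maximum suppression step that conventional detectors run after the network. It is a generic textbook version, not Ultralytics code, and the `(x1, y1, x2, y2)` box format is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```

The pairwise comparison loop is sequential and data-dependent, which is why removing it helps end-to-end latency on small devices.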

🤖 RT-DETR & RT-DETRv2

🎯 First practical real-time detection transformer

Model	COCO AP	FPS	Device
RT-DETR-R50	53.1%	108	NVIDIA T4
RT-DETRv2	>55%	108+	NVIDIA T4

🔗 Resources:

📊 RT-DETR vs YOLO11 Comparison

📱 Efficient Vision Models for Edge

graph LR
    A[🖼️ Input Image] --> B[📱 MobileNetV4]
    A --> C[⚡ EfficientViT]
    B --> D[🎯 87% Accuracy]
    B --> E[🔥 3.8ms Latency]
    C --> F[📲 Edge TPU]
    D --> F
    E --> F
    style A fill:#e1f5ff
    style B fill:#ffe1f5
    style C fill:#f5ffe1
    style D fill:#ffe1e1
    style E fill:#e1ffe1
    style F fill:#ffd700

📱 MobileNetV4

🌐 Universal efficient architecture for mobile ecosystem

🎨 Innovations:

🔹 Universal Inverted Bottleneck (UIB) block
⚡ Mobile MQA attention (39% speedup)
🎯 Optimized NAS recipe
🏆 87% ImageNet accuracy @ 3.8ms (Pixel 8 EdgeTPU)

📚 Resources:

📄 MobileNetV4 Paper (Springer)
🔬 Google Research

⚡ EfficientViT

🧠 Lightweight multi-scale attention for high-resolution tasks

✨ Features:

🔸 Memory-efficient Vision Transformer
🔸 Cascaded group attention (from the separate Microsoft EfficientViT paper of the same name)
🔸 Optimized for dense prediction tasks
🔸 High-resolution image processing

🤖 Small Language Models (SLMs) for Edge

🧠 Microsoft Phi-3

📊 Variants:

Model: Phi-3-mini
Parameters: 3.8B
Context: Up to 128K tokens
Deployment: GPU, CPU, Mobile
Status: ✅ Production Ready

🎯 Optimized For:

💻 GPU acceleration
🖥️ CPU inference
📱 Mobile deployment

🔗 Resources:

Phi-3 Overview

🦙 TinyLlama

📊 Specifications:

Parameters: 1.1B
Target: Mobile/Edge devices
Performance: High for size class
Year: 2024
Status: ✅ Active

✨ Highlights:

🔸 Compact architecture
🔸 Edge-optimized
🔸 Strong performance/size ratio

🌟 Google Gemini Nano

📱 On-device AI for Smartphones

Variants:

📊 Nano-1: 1.8B parameters (lightweight)
📊 Nano-2: 3.25B parameters (standard)

🎯 Capabilities:

✅ Context-aware reasoning
✅ Real-time translation
✅ Text summarization
✅ Edge-optimized for phones/IoT

🦙 Meta Llama 3.2

🖼️ Edge AI & Vision Capabilities

Features:

⚡ Edge deployment optimized
👁️ Vision-language capabilities (11B and 90B models)
📱 Mobile-friendly text-only variants (1B and 3B)
🔥 Latest architecture

🔗 Resources:

Llama 3.2 Announcement

📷 MobileVLM

🎨 Efficient vision-language model for mobile devices

Specifications:

🔹 MobileLLaMA: 2.7B parameters
🔹 Trained from scratch on open datasets
🔹 Fully optimized for mobile deployment
🔹 Vision + Language capabilities

⚡ State Space Models - Efficient Transformer Alternatives

🐍 Mamba

⚡ Linear-time sequence modeling with selective state spaces

🚀 Performance Highlights:

Metric	Performance
Inference throughput	5x higher than Transformers
Scaling	Linear in sequence length
Same-size comparison	Mamba-3B outperforms Transformers of the same size
Larger-model comparison	Mamba-3B matches Transformers twice its size

📊 Advantages:

+ ✅ Linear time complexity
+ ✅ 5x throughput improvement
+ ✅ Efficient long sequences
+ ✅ Lower memory footprint
- ❌ Newer architecture (less tested)
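
The linear-time claim follows from the recurrent form of a state space model: each token updates a fixed-size state once. The sketch below is a toy scalar version with made-up weights (`a`, `w_delta`, `w_b`, `w_c` are all hypothetical). It is not the Mamba reference implementation, which uses vector states and a hardware-aware parallel scan.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """One pass over the sequence with a fixed-size scalar state (O(L) time).

    The step size, input projection, and output projection all depend on
    the current input, which is the "selective" part of the model.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps the step positive
        a_bar = math.exp(delta * a)                # discretized decay, in (0, 1) for a < 0
        b_bar = delta * (w_b * x)                  # simplified (Euler) input term
        h = a_bar * h + b_bar * x                  # constant work per token
        ys.append((w_c * x) * h)
    return ys


outputs = selective_scan([1.0, 2.0, 3.0])
```

Unlike attention, nothing here grows with the number of previous tokens: memory stays constant and time grows linearly with sequence length.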

📱 eMamba

🔧 Edge-optimized Mamba acceleration framework

✨ Features:

Design: End-to-end hardware acceleration
Target: Edge platforms
Complexity: Linear time
Status: 2024 Release

🎯 Optimizations:

🔹 Hardware-aware design
🔹 Edge platform specific
🔹 Leverages linear complexity
🔹 Memory efficient

📚 Resources:

📄 eMamba Paper

🚀 Inference Frameworks & Runtimes

⚡ TensorRT-LLM

🏆 High-performance LLM inference on NVIDIA GPUs

📊 Performance:

+ 70% faster than llama.cpp on RTX 4090
+ State-of-the-art optimizations
+ Quality maintained across precisions

✨ Features:

🔸 Python & C++ API
🔸 Multi-precision support
🔸 Advanced kernel optimization
🔸 Production-grade quality

📄 vLLM

💡 High-throughput LLM serving with PagedAttention

🎯 Innovations:

⚡ PagedAttention memory management
🔸 Optimized KV cache handling
🌐 Multi-platform support
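
The idea behind PagedAttention's memory management is to store each sequence's KV cache in fixed-size blocks that need not be contiguous, tracked by a per-sequence block table. The toy allocator below illustrates only that bookkeeping. The class and method names are hypothetical and this is not vLLM's API.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps logical blocks to physical ones."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:      # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]
slot_b = cache.append_token("b")
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of a whole pre-reserved maximum-length buffer.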

🖥️ Supported Hardware:

AMD: GPU support
Google: TPU support
AWS: Inferentia support
Base: PyTorch

🔗 Resources:

vLLM vs TensorRT-LLM

🦙 ExecuTorch

📱 Efficient LLM execution on edge devices

Features:

🔹 Lightweight edge runtime
🔹 Static memory planning
🔹 Multi-platform support
🔹 TorchAO quantization
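
Static memory planning means tensor offsets are decided ahead of time, so the runtime needs no allocator. The sketch below shows one simple greedy way to do this from tensor lifetimes. It is an illustration under assumed inputs, not ExecuTorch's actual planning algorithm, and `plan_memory` is a hypothetical name.

```python
def plan_memory(tensors):
    """Assign each tensor a fixed offset in one arena, ahead of time.

    tensors: list of (name, size_bytes, first_use, last_use).
    Tensors whose lifetimes do not overlap may share the same bytes.
    Returns ({name: offset}, arena_size).
    """
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Placing the largest tensors first is a common greedy heuristic.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        offset = 0
        for p_off, p_size, p_first, p_last in sorted(placed):
            lifetimes_overlap = first <= p_last and p_first <= last
            bytes_overlap = offset < p_off + p_size and p_off < offset + size
            if lifetimes_overlap and bytes_overlap:
                offset = p_off + p_size   # bump past the conflicting tensor
        placed.append((offset, size, first, last))
        offsets[name] = offset
    arena_size = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena_size


# Three intermediate tensors; "a" and "c" are never live at the same time.
tensors = [("a", 100, 0, 1), ("b", 50, 1, 2), ("c", 100, 2, 3)]
offsets, arena_size = plan_memory(tensors)
```

Here the arena needs 150 bytes rather than the 250 bytes a naive one-buffer-per-tensor layout would use.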

💻 Hardware Support:

✅ CPU
✅ GPU
✅ AI Accelerators
✅ Mobile devices

🔗 Resources:

PyTorch Conference 2024

💻 llama.cpp

⚡ CPU-optimized LLM inference

Advantages:

+ ✅ Lower memory usage
+ ✅ No GPU required
+ ✅ Fast generation
+ ✅ Cross-platform
+ ✅ Wide model support

🔗 Comparison:

vLLM vs Ollama vs llama.cpp vs TGI vs TensorRT-LLM

🔧 Model Compression & Optimization

📉 Advanced Quantization Techniques

🏆 AWQ

Activation-aware Weight Quantization

🎯 MIT HAN Lab Innovation

Key Concept:

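# Not all weights are equal!
# Salient channels are identified from activation statistics. AWQ does not
# skip them (mixed precision is hardware-inefficient); it scales them up
# before quantization and folds the inverse scale into the activations.
if is_salient(channel):
    scale = activation_scale(channel)  # > 1 for salient channels
else:
    scale = 1.0
quantized = quantize_weight(weight * scale) / scale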

Features:

⚡ Protects critical weights
🎯 Activation-aware
🔥 State-of-the-art results
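
In the AWQ paper, salient weights are protected by per-channel scaling rather than by leaving them unquantized. A minimal sketch of that scaling step is shown below, assuming symmetric round-to-nearest quantization per output row and a fixed exponent `alpha`. The real method searches for the best scale on calibration data, and all names here are hypothetical.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one row (returned dequantized)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def awq_style_quantize(weight, act_magnitude, alpha=0.5, bits=4):
    """Scale input channels by activation magnitude, quantize, then undo the scale.

    weight: rows are output channels, columns are input channels.
    act_magnitude: average |activation| per input channel (from calibration data).
    """
    s = [m ** alpha for m in act_magnitude]
    result = []
    for row in weight:
        scaled = [w * sj for w, sj in zip(row, s)]
        q = quantize_row(scaled, bits)
        result.append([qw / sj for qw, sj in zip(q, s)])
    return result


weight = [[0.1, 1.0]]
plain = awq_style_quantize(weight, act_magnitude=[1.0, 1.0])    # no channel is salient
scaled = awq_style_quantize(weight, act_magnitude=[16.0, 1.0])  # channel 0 is salient
```

In this example the small weight on the high-activation channel is reconstructed as about 0.107 with scaling versus about 0.143 without, so the error that gets multiplied by large activations shrinks.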

💎 GPTQ

GPU-Focused Quantization

Features:

🔸 Row-wise quantization
🔸 Hessian optimization
🔸 GPU inference focused
🔸 175B models supported

Achievements:

Models: BLOOM, OPT-175B
Precision: 4-bit
Platform: GPU optimized

🔬 QLoRA

Efficient Fine-tuning

Innovations:

✨ 4-bit NormalFloat (NF4)
✨ Double quantization
✨ LoRA adapters
✨ Single GPU fine-tuning

Capability:

+ Fine-tune a 65B model
+ On a single 48GB GPU
+ Preserve 16-bit fine-tuning quality
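
Why LoRA adapters make single-GPU fine-tuning feasible comes down to parameter counts: only two small low-rank matrices are trained while the base weights stay frozen (and, in QLoRA, stored in 4-bit). The sketch below shows that arithmetic and the adapted forward pass on plain lists. The function names and the layer sizes are illustrative assumptions.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, a, b, scaling=1.0):
    """y = W x + scaling * B (A x), with W frozen and only A and B trained."""
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [y + scaling * u for y, u in zip(base, update)]


# A hypothetical 4096x4096 projection with a rank-16 adapter.
full_params = 4096 * 4096
adapter_params = lora_trainable_params(4096, 4096, rank=16)

# Tiny forward pass: identity base weights plus a rank-1 update.
y = lora_forward([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], a=[[1.0, 1.0]], b=[[1.0], [2.0]])
```

For this layer the adapter trains 128 times fewer parameters than full fine-tuning would.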

🆕 Unsloth Dynamic 4-bit

🔥 Latest quantization innovation

Features:

Built on bitsandbytes
Dynamic parameter quantization
Per-parameter optimization

📚 Comprehensive Guides:

📖 Quantization Comparison
📊 GPTQ vs GGUF vs AWQ

🔬 Neural Architecture Search (NAS)

🤖 Automate neural network architecture design

🎯 Once-for-All (OFA)

Concept: Train once, deploy everywhere

graph TD
    A[🌐 Supernet Training] --> B[📦 Weight Sharing]
    B --> C[📱 Mobile]
    B --> D[💻 Desktop]
    B --> E[⚡ Edge]
    style A fill:#e1f5ff
    style B fill:#ffe1f5
    style C fill:#f5ffe1
    style D fill:#ffe1e1
    style E fill:#ffd700

Features:

🔹 Weight-sharing supernetwork
🔹 Represents any architecture in search space
🔹 Massive computational savings
🔹 Applied to ImageNet with ProxylessNAS & MobileNetV3
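
Weight sharing means a deployed sub-network is cut out of the trained supernet rather than trained separately. The sketch below shows only that slicing idea on nested lists. It assumes square layers and "first N units" selection, whereas OFA itself also varies kernel size and input resolution and ranks channels by importance. `extract_subnet` is a hypothetical name.

```python
def extract_subnet(supernet, depth, width):
    """Take the first `depth` layers and the first `width` units of each.

    supernet: list of layers, each a square weight matrix (list of rows).
    The sub-network reuses the supernet's weights; nothing is retrained here.
    """
    if depth > len(supernet):
        raise ValueError("depth exceeds supernet depth")
    return [[row[:width] for row in layer[:width]] for layer in supernet[:depth]]


supernet = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
edge_subnet = extract_subnet(supernet, depth=1, width=2)     # small target
desktop_subnet = extract_subnet(supernet, depth=2, width=3)  # full-size target
```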

🎓 Knowledge Distillation & Pruning

🔬 TinyBERT

📚 Two-stage distillation approach

Performance Metrics:

Accuracy: 96.8% of BERT-base
Size: 7.5x smaller (4 layers)
Energy: Lowest variability (0.1032 kWh SD)
Stages: Task-agnostic + Task-specific

Advantages:

✅ Dual-stage distillation
✅ Ultra-low energy variability
✅ Compact architecture
✅ High performance retention

📖 DistilBERT

⚡ Single-phase task-agnostic distillation

Performance Metrics:

Accuracy: 97% of BERT
Size Reduction: 40% smaller
Speed: 60% faster
Use Case: General-purpose
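
Distillation trains the small model to match the teacher's softened output distribution. A minimal sketch of that soft-target term is shown below, assuming the common KL-divergence formulation with a temperature. DistilBERT and TinyBERT both combine this kind of term with additional losses that are not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to wrong classes.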

Recent Research (2025):

🔸 32% energy reduction with pruning
🔸 Iterative distillation + adaptive pruning
🔸 Nature Scientific Reports
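
For readers new to pruning, the simplest variant is magnitude pruning: zero out the weights with the smallest absolute values. The sketch below is a generic baseline only. It is not the adaptive method from the research cited above, and `magnitude_prune` is a hypothetical name.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat list of weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)   # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


half_sparse = magnitude_prune([0.5, -0.1, 0.3, -0.8], sparsity=0.5)
```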

🎯 TinyML & MCU-specific Advances

🧠 MCUNet Series - MIT HAN Lab

📱 MCUNetV1

Foundation:

🔸 Neural architecture for MCUs
🔸 Co-designed model + inference engine
🔸 Ultra-low memory footprint

🚀 MCUNetV2

Achievements:

ImageNet: 71.8% accuracy
Visual Wake: >90% (32kB SRAM)
Capability: Object detection
Platform: Tiny devices
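
To see why an SRAM figure such as 32kB is so restrictive, it helps to estimate peak activation memory. The sketch below assumes INT8 activations and plain layer-by-layer execution where one layer's input and output are live together. The shapes are hypothetical, and this is not MCUNet's scheduler.

```python
def peak_activation_bytes(layer_shapes, bytes_per_value=1):
    """Peak of (input + output) activation size across layers, in bytes.

    layer_shapes: at least two (height, width, channels) activation shapes,
    starting with the network input.
    """
    sizes = [h * w * c * bytes_per_value for h, w, c in layer_shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


def fits_in_sram(layer_shapes, sram_bytes, bytes_per_value=1):
    """True if the peak activation footprint fits the given SRAM budget."""
    return peak_activation_bytes(layer_shapes, bytes_per_value) <= sram_bytes


shapes = [(32, 32, 3), (16, 16, 16), (8, 8, 32)]
peak = peak_activation_bytes(shapes)
```

The peak usually occurs in the early, high-resolution layers, so the budget is set by a few layers rather than by the model's total size.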

⚡ MCUNetV3

Latest:

🔸 On-device training under 256KB of memory
🔸 State-of-the-art MCU AI
🔸 Production ready

🎓 Additional MCU Tools

🔧 TinyTL

Tiny transfer learning for MCUs
On-device learning capabilities
Minimal resource overhead

⚙️ PockEngine

Training engine for sparse, efficient on-device fine-tuning
MCU-specific acceleration
Memory-efficient execution

🔬 TinyDL (Tiny Deep Learning)

🎯 Evolution from TinyML to deep learning on edge

Focus Areas:

🔹 Deep learning on ultra-constrained hardware
🔹 Power consumption in mW range
🔹 On-device sensor analytics
🔹 Real-time inference
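
What a milliwatt-range budget means in practice can be estimated with simple duty-cycle arithmetic. The numbers below are hypothetical and chosen only to make the calculation easy to follow.

```python
def average_power_mw(active_mw, sleep_mw, duty_cycle):
    """Average power when the device is active for `duty_cycle` of the time."""
    return active_mw * duty_cycle + sleep_mw * (1.0 - duty_cycle)


def battery_life_hours(capacity_mah, voltage_v, avg_power_mw):
    """Ideal battery life: energy in mWh divided by average draw in mW."""
    return capacity_mah * voltage_v / avg_power_mw


# Hypothetical sensor node: 10 mW while running inference, asleep half the time.
avg_mw = average_power_mw(active_mw=10.0, sleep_mw=0.0, duty_cycle=0.5)
hours = battery_life_hours(capacity_mah=1000, voltage_v=3.0, avg_power_mw=avg_mw)
```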

📄 Resources:

TinyDL Survey

🔩 Hardware Acceleration & Platforms

🖥️ Edge AI Platforms

🟢 NVIDIA Jetson Orin Nano Super

Specifications:

Compute: 67 INT8 TOPS
Performance: 1.7x vs previous Orin
Price: $249
Release: Late 2024
Status: ✅ Available

Features:

⚡ Generative AI optimized
🎯 Edge AI development kit
💰 Affordable price point

🔷 Edge TPU & Neural Accelerators

Hardware Platforms:

Google Pixel EdgeTPU, Coral Dev Board
Apple Neural Engine (A-series chips)
Specialized NPUs (custom ASICs)

📱 Mobile Deployment Targets

Platform	Architecture	Use Case
🔧 ARM CPUs	ARM Cortex	General compute
📡 Mobile DSPs	Qualcomm/MediaTek	Signal processing
🎮 Mobile GPUs	Mali/Adreno	Graphics + AI
🧠 NPUs	Custom ASICs	Neural processing

🛠️ Implementation Resources & Tools

🔷 ONNX Runtime

Cross-platform inference with ONNX models

📚 Documentation & Tutorials

🔧 Compatibility

💻 Example Implementations

📦 Model Repositories

📉 ONNX Runtime Quantization

Tools & Resources:

🎯 YOLO Implementations

🟣 YOLO-NAS with ONNX

💻 YOLO-NAS ONNXRuntime

🟢 YOLO + TensorRT (Detection, Pose, Segmentation)

🔵 YOLO + ONNXRuntime (All Tasks)

🌐 Community Resources

⚡ TensorRT

🚀 NVIDIA's high-performance deep learning inference optimizer

Resources:

🔧 TensorRT Execution Provider
💾 TensorRT Engine Cache

🌐 Edge Deployment Frameworks

🚀 FastDeploy - PaddlePaddle

📦 Easy-to-use deployment toolbox for AI models

Resources:

💻 FastDeploy GitHub
📥 Prebuilt Libraries

💎 DeepSparse & SparseML - Neural Magic

🖥️ CPU-optimized inference with sparsity

Features:

⚡ CPU inference acceleration
🔸 Sparsity-aware optimization
📊 YOLOv5 CPU benchmarks
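
Sparsity helps on CPUs because zero weights can be skipped entirely. The sketch below stores a pruned matrix in compressed sparse row (CSR) form and multiplies it by a vector while touching only the nonzero entries. It illustrates the general idea and is not DeepSparse's engine.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a vector, visiting only nonzero weights."""
    out = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        out.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return out


dense = [[0, 2, 0], [1, 0, 0], [0, 0, 0]]
values, col_idx, row_ptr = to_csr(dense)
result = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

The work is proportional to the number of nonzero weights (2 here) instead of the full matrix size (9).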

📱 NCNN - Tencent

🎯 High-performance neural network inference for mobile

🔧 MACE - Xiaomi

🤖 Mobile AI Compute Engine

Resources:

💻 MACE GitHub

🍎 CoreML - Apple

🎨 Machine learning framework for iOS/macOS

🎨 Model Collections

🛠️ Tools & Documentation

🎨 Stable Diffusion on CoreML

⚙️ Compilers & Low-Level Frameworks

🔧 TVM - Apache

🎯 End-to-end deep learning compiler stack

Resources:

TVM GitHub

🔨 LLVM

⚙️ Compiler infrastructure project

Resources:

LLVM Project

⚡ XNNPack - Google

🚀 High-efficiency floating-point neural network operators

Resources:

XNNPack GitHub

🔷 ARM-NN

💪 Inference engine for ARM platforms

🧠 CMSIS-NN

📱 Efficient neural network kernels for ARM Cortex-M
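
Kernels for Cortex-M class devices typically work on INT8 data with 32-bit accumulation and integer-only rescaling. The sketch below shows that pattern for a single dot product in Python. It is a simplified illustration, not the CMSIS-NN API or its exact requantization scheme, and the parameter names are assumptions.

```python
def int8_dot_requantized(x_q, w_q, bias_q, multiplier, shift, out_zero_point=0):
    """INT8 dot product with wide accumulation and fixed-point rescaling.

    The real-valued rescale factor is approximated as multiplier / 2**shift
    (with shift >= 1) so the whole computation uses integers only.
    """
    acc = bias_q + sum(x * w for x, w in zip(x_q, w_q))          # int32-style accumulator
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift   # round to nearest, ties up
    return max(-128, min(127, rounded + out_zero_point))         # saturate to int8


small = int8_dot_requantized([10, 20], [3, -1], bias_q=5, multiplier=1, shift=1)
saturated = int8_dot_requantized([127, 127], [127, 127], bias_q=0, multiplier=1, shift=1)
```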

Resources:

CMSIS-NN GitHub

📱 Samsung ONE

🔧 On-device Neural Engine compiler

Resources:

ONE GitHub

💼 Industry & Commercial Solutions

🚀 Deeplite

🎯 AI-Driven Optimizer for Deep Neural Networks

Focus:

⚡ Faster Inference
📦 Smaller Models
🔋 Energy Efficient
☁️ Cloud to Edge
🎯 Maintain Accuracy

🔗 Resources:

Deeplite Website

🔧 Utility Frameworks & Tools

👁️ OpenCV

📷 Computer vision library with C++ support

Resources:

📺 OpenCV C++ Playlist
🔨 Build OpenCV C++

🎬 VQRF - Radiance Field Compression

📹 Vector Quantized Radiance Fields

Resources:

VQRF GitHub

🖼️ Additional Model Architectures

🎯 PP-PicoDet

📱 Lightweight real-time object detector for mobile

Resources:

PP-PicoDet Paper

🔬 EtinyNet

🎯 Extremely tiny network for TinyML

Resources:

EtinyNet GitHub

TinyML Architecture

🧠 Computing Architectures & APIs

Mobile & Embedded
Open-Source ISA
NVIDIA GPU
Apple GPU
Cross-Platform
Graphics & Compute

📚 Research Papers & Academic Resources

📖 Foundational Surveys (2024-2025)

🌐 Edge Computing & Deep Learning

🔬 TinyML Specific

📄 From Tiny ML to Tiny DL: A Survey (2024)
📄 EtinyNet: Extremely Tiny Network
📄 Ultra-low Power TinyML System

⚡ State Space Models & Efficient Architectures

👁️ Vision Models

🔧 Model Compression & Optimization

📚 Collections

⭐ Awesome Embedded and Mobile Deep Learning

🎓 Contributing & Community

This repository serves as a comprehensive resource for AI edge computing and TinyML practitioners.

Contributions, updates, and corrections are welcome! 🚀

🏷️ Keywords

TinyML • Edge AI • Embedded ML • Model Compression • Quantization • Neural Architecture Search • YOLO • MobileNet • Transformer • State Space Models • ONNX Runtime • TensorRT • Inference Optimization • MCU • IoT • Real-Time AI

📅 Last Updated

January 2025
