Awesome On-Device AI Systems
A curated list of efficient on-device AI systems, including practical inference engines, benchmarks, and state-of-the-art research papers for mobile and edge devices.
This repository bridges the gap between Systems Research (academic papers) and Practical Deployment (engineering frameworks), focusing on optimizing ML models (e.g., LLMs, VLMs, and ViTs) on resource-constrained hardware.
📂 Table of Contents
- 🚀 Inference Engines
- 📝 Research Papers
🚀 Inference Engines
Frameworks and runtimes designed for deploying models on edge devices.
General ML Workloads
- LiteRT (formerly TensorFlow Lite) - Google's framework for on-device inference.
- ExecuTorch - PyTorch’s end-to-end solution for enabling on-device AI.
- ONNX Runtime - Cross-platform inference engine for ONNX models.
- MNN - Lightweight deep learning framework by Alibaba.
- NCNN - High-performance NN inference framework by Tencent.
Vendor-Specific SDKs
- Qualcomm QNN - Qualcomm AI Stack for Snapdragon NPUs/DSPs.
- Apple Core ML - Framework to integrate ML models into iOS/macOS apps.
- FluidAudio - Local audio AI SDK for Apple platforms with ASR, speaker diarization, VAD, and TTS optimized for Apple Neural Engine.
- NVIDIA TensorRT - SDK for high-performance deep learning inference on NVIDIA GPUs (including Jetson).
- Intel OpenVINO - Toolkit for optimizing and deploying AI inference on Intel hardware (CPU/GPU/NPU).
- MediaTek NeuroPilot - AI ecosystem and SDK for MediaTek NPUs.
LLM & GenAI Specialized
- llama.cpp - LLM inference in C/C++ with minimal dependencies.
- MLC LLM - Universal solution for deploying LLMs on any hardware (based on TVM).
- TensorRT-LLM - NVIDIA GPU-optimized LLM inference library, relevant for Jetson-class edge devices.
- mllm - A fast and lightweight LLM inference engine for mobile and edge devices.
- MLX LM - LLM inference and fine-tuning toolkit built on MLX for Apple silicon.
- OmniInfer - High-performance, on-device VLM inference with hybrid NPU acceleration.
- RunAnywhere - Open-source SDK for running LLMs and multimodal models on-device across iOS, Android, and cross-platform apps.
- Off Grid - Open-source iOS/Android app running LLMs (Llama, Qwen, Gemma, Phi, DeepSeek) entirely on-device via llama.cpp. Includes voice (whisper.cpp), vision, on-device image generation, and tool calling.
📝 Research Papers
Note: Some of these works target inference acceleration on cloud/server infrastructure, which has far more computational resources. I include them here when their techniques could plausibly generalize to on-device inference.
LLM Inference on Mobile SoCs
- [OSDI 2026] Inference in the Shadows: Taming Memory Bandwidth Contention in Mobile LLM Inference with Sereno
- [MobiSys 2026] Agent-X: Full Pipeline Acceleration of On-device AI Agents
- [MLSys 2026] Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE
- [SenSys 2026] LLM as a System Service on Mobile Devices
- [EuroSys 2026] Scaling LLM Test-Time Compute with Mobile NPU on Smartphones
- [SOSP 2025] Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
- [ASPLOS 2025] Neuralink: Fast on-Device LLM Inference with Neuron Co-Activation Linking
- [ASPLOS 2025] Fast On-device LLM Inference with NPUs
- [arXiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Mobile Processor Characterization & Optimization
- [EuroSys 2026] viNPU: Optimizing Vision Transformer Inference on Mobile NPUs
- [ASPLOS 2026] FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
- [ICS 2025] TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations
Compiler-based ML Optimization
- [ASPLOS 2024] SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
- [ASPLOS 2024] SoD2: Statically Optimizing Dynamic Deep Neural Network Execution
- [MICRO 2023] Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training
- [MICRO 2022] GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs
- [PLDI 2021] DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
Attention Acceleration
- [MLSys 2026] IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
- [MobiSys 2026] ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
- [MLSys 2025] TurboAttention: Efficient attention approximation for High Throughputs LLMs
- [ASPLOS 2023] FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Quantization/Sparsity
- [ASPLOS 2026] oFFN: Outlier and Neuron-aware Structured FFN for Fast yet Accurate LLM Inference
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
- [ISCA 2023] OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
Application-centric On-device AI Systems
- [MobiSys 2025] ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality
- [MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices
- [MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices
- [IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators
- [MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices
- [MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge
- [MobiCom 2021] Flexible high-resolution object detection on edge devices with tunable latency
Multi-DNN / Heterogeneous Runtime Scheduling
- [PPoPP 2024] Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous SoCs
- [RTSS 2024] FLEX: Adaptive Task Batch Scheduling with Elastic Fusion in Multi-Modal Multi-View Machine Perception
- [MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
- [ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices
- [MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors
- [MobiSys 2022] CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices
On-device Training, Model Adaptation
- [ASPLOS 2025] Nazar: Monitoring and Adapting ML Models on Mobile Devices
- [SenSys 2024] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments
- [SenSys 2023] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
- [MobiCom 2023] Cost-effective On-device Continual Learning over Memory Hierarchy with Miro
- [MobiCom 2023] AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments
- [MobiSys 2023] ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection
- [SenSys 2023] On-NAS: On-Device Neural Architecture Search on Memory-Constrained Intelligent Embedded Systems
- [MobiCom 2022] Mandheling: mixed-precision on-device DNN training with DSP offloading
- [MobiSys 2022] Memory-efficient DNN training on mobile devices