About Awesome-On-Device-AI-Systems

Awesome On-Device AI Systems is a curated repository focused on efficient on-device AI inference for mobile and edge devices. It bridges academic systems research and practical engineering deployment, covering optimization techniques for machine learning models such as large language models, vision-language models, and vision transformers running on resource-constrained hardware. The repository is organized into two main sections. The first covers inference engines and runtimes, including general-purpose frameworks like LiteRT, ExecuTorch, ONNX Runtime, MNN, and NCNN, along with vendor-specific SDKs for platforms such as Qualcomm Snapdragon NPUs, Apple Core ML, NVIDIA TensorRT, Intel OpenVINO, and MediaTek NeuroPilot. It also features engines specialized for LLMs and generative AI, like llama.cpp, MLC LLM, TensorRT-LLM, mllm, and MLX LM. The second section compiles research papers grouped into topics including LLM inference on mobile SoCs, processor characterization and optimization, compiler-based

Published by

jeho-lee

Awesome On-Device AI Systems

A curated list of efficient on-device AI systems, including practical inference engines, benchmarks, and state-of-the-art research papers for mobile and edge devices.

This repository bridges the gap between Systems Research (academic papers) and Practical Deployment (engineering frameworks), focusing on optimizing ML models (e.g., LLMs/VLMs and ViTs) on resource-constrained hardware.

📂 Table of Contents

🚀 Inference Engines

📝 Research Papers

🚀 Inference Engines

Frameworks and runtimes designed for deploying models on edge devices.

General ML Workloads

LiteRT (formerly TensorFlow Lite) - Google's framework for on-device inference.
ExecuTorch - PyTorch’s end-to-end solution for enabling on-device AI.
ONNX Runtime - Cross-platform inference engine for ONNX models.
MNN - Lightweight deep learning framework by Alibaba.
NCNN - High-performance NN inference framework by Tencent.

Vendor-Specific SDKs

Qualcomm QNN - Qualcomm AI Stack for Snapdragon NPUs/DSPs.
Apple Core ML - Framework to integrate ML models into iOS/macOS apps.
FluidAudio - Local audio AI SDK for Apple platforms with ASR, speaker diarization, VAD, and TTS optimized for Apple Neural Engine.
NVIDIA TensorRT - SDK for high-performance deep learning inference on NVIDIA GPUs (including Jetson).
Intel OpenVINO - Toolkit for optimizing and deploying AI inference on Intel hardware (CPU/GPU/NPU).
MediaTek NeuroPilot - AI ecosystem and SDK for MediaTek NPUs.

LLM & GenAI Specialized

llama.cpp - LLM inference in C/C++ with minimal dependencies.
MLC LLM - Universal solution for deploying LLMs on any hardware (based on TVM).
TensorRT-LLM - NVIDIA GPU-optimized LLM inference library, relevant for Jetson-class edge devices.
mllm - A fast and lightweight LLM inference engine for mobile and edge devices.
MLX LM - LLM inference and fine-tuning toolkit built on MLX for Apple silicon.
OmniInfer - High-performance, on-device VLM inference with hybrid NPU acceleration.
RunAnywhere - Open-source SDK for running LLMs and multimodal models on-device across iOS, Android, and cross-platform apps.
Off Grid - Open-source iOS/Android app running LLMs (Llama, Qwen, Gemma, Phi, DeepSeek) entirely on-device via llama.cpp. Includes voice (whisper.cpp), vision, on-device image generation, and tool calling.

📝 Research Papers

Note: Some of these works target inference acceleration on cloud/server infrastructure, which has far more computational resources. I include them here when they could potentially generalize to on-device inference use cases.

Quantization/Sparsity

Application-centric On-device AI Systems

[MobiSys 2025] ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality
[MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices
[MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices
[IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators
[MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices
[MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge
[MobiCom 2021] Flexible high-resolution object detection on edge devices with tunable latency

Multi-DNN / Heterogeneous Runtime Scheduling

[PPoPP 2024] Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous SoCs
[RTSS 2024] FLEX: Adaptive Task Batch Scheduling with Elastic Fusion in Multi-Modal Multi-View Machine Perception
[MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
[SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
[ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices
[MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors
[MobiSys 2022] CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

On-device Training, Model Adaptation

[ASPLOS 2025] Nazar: Monitoring and Adapting ML Models on Mobile Devices
[SenSys 2024] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments
[SenSys 2023] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
[MobiCom 2023] Cost-effective On-device Continual Learning over Memory Hierarchy with Miro
[MobiCom 2023] AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments
[MobiSys 2023] ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection
[SenSys 2023] On-NAS: On-Device Neural Architecture Search on Memory-Constrained Intelligent Embedded Systems
[MobiCom 2022] Mandheling: mixed-precision on-device DNN training with DSP offloading
[MobiSys 2022] Memory-efficient DNN training on mobile devices

Profilers

[MobiCom 2024] MELTing point: Mobile Evaluation of Language Transformers [code]
[SenSys 2023] nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms
[MobiSys 2021] nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices
