OnnxStream
OnnxStream is a lightweight inference library written in C++, designed to run large ONNX models on devices with extremely limited memory. Where most frameworks prioritize inference speed and accept high RAM usage, OnnxStream minimizes memory consumption by decoupling the inference engine from the component that provides the model weights. As a result, model parameters can be streamed from disk or over HTTP as they are needed instead of being loaded into RAM all at once. The library can run models as complex as Stable Diffusion XL 1.0 on a Raspberry Pi Zero 2 in about 298MB of RAM, and large language models such as Mistral 7B on desktops and servers. It uses XNNPACK for acceleration and runs on ARM, x86, RISC-V, and WebAssembly. Key use cases include browser-based AI with YOLOv8 and Whisper, image generation on single-board computers, and deploying LLMs on resource-constrained hardware. The project offers bindings for Python, C, and JavaScript (WASM), enabling developers to integrate high-performance, low-memory AI