About bitmamba.cpp

Ultra-lightweight C++ inference engine for BitMamba-2 (1.58-bit SSM). Runs 1B models on consumer CPUs at 50+ tok/s using <700MB RAM. No heavy dependencies.

z

Published by

zhayr1

Visit View Profile

README.md

View on GitHub

BitMamba.cpp Library

This library is designed for efficient inference using BitMamba2 models with 255M and 1B parameters. It implements support for quantization and BitNet-optimized architectures.

Requirements and compatibility

⚠️ Hardware Requirements: This C++ implementation utilizes AVX2 SIMD instructions for high-performance inference on x86 CPUs (Intel/AMD).

Supported: Intel Haswell (4th Gen) or newer, AMD Ryzen.

Not currently supported: ARM devices (Raspberry Pi, Apple Silicon, Android) require a port to NEON intrinsics.

Note: The architectural efficiency (250MB RAM usage) makes it theoretically ideal for Edge devices, but this specific demo code is optimized for x86 desktops.

Usage Instructions

1. Exporting the Model

Use the scripts/export_bin.py script to convert your PyTorch/JAX checkpoints to the optimized C++ binary format.

Arguments:

--version: Model version to export (1b or 250m).
--ckpt_path: Path to the checkpoint file (.msgpack).
--output_name: Output binary filename.

Example for 1B version:

python3 scripts/export_bin.py --version 1b --ckpt_path ./bitmamba_1b.msgpack --output_name bitmamba_1b.bin

Example for 250M version:

python3 scripts/export_bin.py --version 250m --ckpt_path ./bitmamba_250m.msgpack --output_name bitmamba_250m.bin

2. Compile the C++ Inference Engine

Option 1: Using CMake (Recommended)

Ensure you have CMake installed (sudo apt install cmake or equivalent).

cmake -B build
cmake --build build

The executable will be located at build/bitmamba.

Target ISA (`BITMAMBA_ISA`)

The matmul kernel compiles a single-instruction int8 dot-product (VPDPBUSD) when the target supports AVX-VNNI, falling back to an AVX2 maddubs path otherwise. Select the feature level with -DBITMAMBA_ISA:

Value	Flags	Use
`native` (default)	`-march=native`	Build == run machine. Picks up AVX-VNNI automatically on Alder Lake+ / Zen4+. Best local performance.
`avx2`	`-march=x86-64-v3`	Portable AVX2+FMA+BMI2 binary (no VNNI). Use when cross-building or shipping one binary to mixed CPUs.
`avxvnni`	`+ -mavxvnni`	Force the VNNI kernel on a portable v3 baseline.
`avx512`	`+ AVX-512 VNNI`	AVX-512 capable servers.

# portable binary that still runs the VNNI kernel where the CPU has it:
cmake -B build -DBITMAMBA_ISA=avxvnni
cmake --build build

Tip: this is single-thread compute that scales with cores. Set OMP_NUM_THREADS to the number of physical cores on the deployment VM (e.g. OMP_NUM_THREADS=2 on a 2-vCPU instance) to avoid hyperthread oversubscription.

Option 2: Quick Build (Manual)

If you prefer g++:

g++ -O3 -march=native -fopenmp -Iinclude -Isrc -o bitmamba examples/main.cpp src/*.cpp

3. Running Inference

3.1 Download Weights (from Hugging Face)

BitMamba-2 1B

wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin

BitMamba-2 0.25B

wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin

Once you have the binary model (.bin) and the compiled executable, use the exported binary to run inference.

Example command:

./build/bitmamba <model.bin> "<prompt_tokens>" <mode> <temp> <repeat_penalty> <top_p> <top_k> <max_tokens>

Practical Example:

CMake Build

Tokenizer mode:

./build/bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

Raw mode:

./build/bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200

Manual Build

Tokenizer mode:

./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

Raw mode:

./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200

⚠️ IMPORTANT: the tokenizer.bin file must be in the same directory as the bitmamba compiled executable.

_This command runs the bitmamba_1b.bin model with a tokenized prompt, temperature 0.7, repetition penalty 1.1, generating 200 tokens._

4. Decoding Tokens

If you use raw mode, you can use the scripts/decoder.py script to convert token IDs back into text.

Usage:

python scripts/decoder.py "tokens"

Example:

python scripts/decoder.py "15496 11 314 716"

TODO

Future Work: Add ARM/NEON support for Raspberry Pi deployment.

API Server (Optional)

For OpenAI-compatible API access, a Python FastAPI server is available:

# Install Python dependencies
cd python
pip install -r requirements.txt

# Start the server
python server.py --model ../bitmamba_1b.bin --host 127.0.0.1 --port 8000

The server provides OpenAI-compatible endpoints:

/v1/chat/completions - Chat completions
/v1/completions - Text completions
/v1/models - List models

See python/README.md for full documentation.

Layer Repetition Scanner (RYS — LLM Neuroanatomy)

This implementation is based on the approach described by David Noel in his blog post on RYS.

The C++ binary supports virtual layer repetition at zero extra weight-memory cost via the --repeat-start, --repeat-end, and --repeat-count flags. The same physical layer is executed multiple times with independent recurrent state, which can improve reasoning on certain prompts depending on the chosen slice.

Use scripts/brain_scanner.py to grid-search the best slice for your model on BoolQ + ARC-Easy:

python3 scripts/brain_scanner.py \
    --binary ./build/bitmamba \
    --model bitmamba_1b.bin \
    --range-start 0 --range-end 31 --min-span 2 \
    --log brain_scan_1b.csv

Then run inference with the chosen slice:

./build/bitmamba --repeat-start 17 --repeat-end 21 \
    bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

brain_scanner.py requires the optional dependencies tiktoken and datasets (already listed in requirements.txt).

Python Inference Evaluation test

Use the scripts/fast_inference.py script to evaluate the models:

Get the weights

Weights for 250M version:

wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/jax_weights/bitmamba_255m.msgpack

Weights for 1B version:

wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/jax_weights/bit_mamba_1b.msgpack

250M Version

python scripts/fast_inference.py --ckpt bitmamba_255m.msgpack --version 250m --eval

1B Version

python scripts/fast_inference.py --ckpt bit_mamba_1b.msgpack --version 1b --eval

bitmamba.cpp

About bitmamba.cpp

Platforms

Languages

Links

README.md

BitMamba.cpp Library

Requirements and compatibility

Usage Instructions

1. Exporting the Model

Example for 1B version:

Example for 250M version:

2. Compile the C++ Inference Engine

Option 1: Using CMake (Recommended)

Target ISA (`BITMAMBA_ISA`)

Option 2: Quick Build (Manual)

3. Running Inference

3.1 Download Weights (from Hugging Face)

Practical Example:

CMake Build

Manual Build

4. Decoding Tokens

TODO

API Server (Optional)

Layer Repetition Scanner (RYS — LLM Neuroanatomy)

Python Inference Evaluation test

Get the weights

250M Version

1B Version

bitmamba.cpp

About bitmamba.cpp

Platforms

Languages

Links

README.md

BitMamba.cpp Library

Requirements and compatibility

Usage Instructions

1. Exporting the Model

Example for 1B version:

Example for 250M version:

2. Compile the C++ Inference Engine

Option 1: Using CMake (Recommended)

Target ISA (BITMAMBA_ISA)

Option 2: Quick Build (Manual)

3. Running Inference

3.1 Download Weights (from Hugging Face)

Practical Example:

CMake Build

Manual Build

4. Decoding Tokens

TODO

API Server (Optional)

Layer Repetition Scanner (RYS — LLM Neuroanatomy)

Python Inference Evaluation test

Get the weights

250M Version

1B Version

Target ISA (`BITMAMBA_ISA`)