BitMamba.cpp Library
This library is designed for efficient inference using BitMamba2 models with 255M and 1B parameters. It implements support for quantization and BitNet-optimized architectures.
Requirements and compatibility
⚠️ Hardware Requirements: This C++ implementation utilizes AVX2 SIMD instructions for high-performance inference on x86 CPUs (Intel/AMD).
Supported: Intel Haswell (4th Gen) or newer, AMD Ryzen.
Not currently supported: ARM devices (Raspberry Pi, Apple Silicon, Android) require a port to NEON intrinsics.
Note: The architectural efficiency (250MB RAM usage) makes it theoretically ideal for Edge devices, but this specific demo code is optimized for x86 desktops.
Usage Instructions
1. Exporting the Model
Use the scripts/export_bin.py script to convert your PyTorch/JAX checkpoints to the optimized C++ binary format.
Arguments:
--version: Model version to export (1bor250m).--ckpt_path: Path to the checkpoint file (.msgpack).--output_name: Output binary filename.
Example for 1B version:
python3 scripts/export_bin.py --version 1b --ckpt_path ./bitmamba_1b.msgpack --output_name bitmamba_1b.bin
Example for 250M version:
python3 scripts/export_bin.py --version 250m --ckpt_path ./bitmamba_250m.msgpack --output_name bitmamba_250m.bin
2. Compile the C++ Inference Engine
Option 1: Using CMake (Recommended)
Ensure you have CMake installed (sudo apt install cmake or equivalent).
cmake -B build
cmake --build build
The executable will be located at build/bitmamba.
Target ISA (BITMAMBA_ISA)
The matmul kernel compiles a single-instruction int8 dot-product (VPDPBUSD)
when the target supports AVX-VNNI, falling back to an AVX2 maddubs path
otherwise. Select the feature level with -DBITMAMBA_ISA:
| Value | Flags | Use |
|---|---|---|
native (default) |
-march=native |
Build == run machine. Picks up AVX-VNNI automatically on Alder Lake+ / Zen4+. Best local performance. |
avx2 |
-march=x86-64-v3 |
Portable AVX2+FMA+BMI2 binary (no VNNI). Use when cross-building or shipping one binary to mixed CPUs. |
avxvnni |
+ -mavxvnni |
Force the VNNI kernel on a portable v3 baseline. |
avx512 |
+ AVX-512 VNNI |
AVX-512 capable servers. |
# portable binary that still runs the VNNI kernel where the CPU has it:
cmake -B build -DBITMAMBA_ISA=avxvnni
cmake --build build
Tip: this is single-thread compute that scales with cores. Set
OMP_NUM_THREADSto the number of physical cores on the deployment VM (e.g.OMP_NUM_THREADS=2on a 2-vCPU instance) to avoid hyperthread oversubscription.
Option 2: Quick Build (Manual)
If you prefer g++:
g++ -O3 -march=native -fopenmp -Iinclude -Isrc -o bitmamba examples/main.cpp src/*.cpp
3. Running Inference
3.1 Download Weights (from Hugging Face)
BitMamba-2 1B
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin
BitMamba-2 0.25B
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin
Once you have the binary model (.bin) and the compiled executable, use the exported binary to run inference.
Example command:
./build/bitmamba <model.bin> "<prompt_tokens>" <mode> <temp> <repeat_penalty> <top_p> <top_k> <max_tokens>
Practical Example:
CMake Build
Tokenizer mode:
./build/bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200
Raw mode:
./build/bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200
Manual Build
Tokenizer mode:
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200
Raw mode:
./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200
⚠️ IMPORTANT: the tokenizer.bin file must be in the same directory as the bitmamba compiled executable.
_This command runs the bitmamba_1b.bin model with a tokenized prompt, temperature 0.7, repetition penalty 1.1, generating 200 tokens._
4. Decoding Tokens
If you use raw mode, you can use the scripts/decoder.py script to convert token IDs back into text.
Usage:
python scripts/decoder.py "tokens"
Example:
python scripts/decoder.py "15496 11 314 716"
TODO
- Future Work: Add ARM/NEON support for Raspberry Pi deployment.
API Server (Optional)
For OpenAI-compatible API access, a Python FastAPI server is available:
# Install Python dependencies
cd python
pip install -r requirements.txt
# Start the server
python server.py --model ../bitmamba_1b.bin --host 127.0.0.1 --port 8000
The server provides OpenAI-compatible endpoints:
/v1/chat/completions- Chat completions/v1/completions- Text completions/v1/models- List models
See python/README.md for full documentation.
Layer Repetition Scanner (RYS — LLM Neuroanatomy)
This implementation is based on the approach described by David Noel in his blog post on RYS.
The C++ binary supports virtual layer repetition at zero extra weight-memory cost via the --repeat-start, --repeat-end, and --repeat-count flags. The same physical layer is executed multiple times with independent recurrent state, which can improve reasoning on certain prompts depending on the chosen slice.
Use scripts/brain_scanner.py to grid-search the best slice for your model on BoolQ + ARC-Easy:
python3 scripts/brain_scanner.py \
--binary ./build/bitmamba \
--model bitmamba_1b.bin \
--range-start 0 --range-end 31 --min-span 2 \
--log brain_scan_1b.csv
Then run inference with the chosen slice:
./build/bitmamba --repeat-start 17 --repeat-end 21 \
bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200
brain_scanner.py requires the optional dependencies tiktoken and datasets (already listed in requirements.txt).
Python Inference Evaluation test
Use the scripts/fast_inference.py script to evaluate the models:
Get the weights
Weights for 250M version:
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/jax_weights/bitmamba_255m.msgpack
Weights for 1B version:
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/jax_weights/bit_mamba_1b.msgpack
250M Version
python scripts/fast_inference.py --ckpt bitmamba_255m.msgpack --version 250m --eval
1B Version
python scripts/fast_inference.py --ckpt bit_mamba_1b.msgpack --version 1b --eval