yuniko-software

Open Source

bge-m3-onnx

# BGE-M3 ONNX

<p align="left">
  <a href="https://github.com/yuniko-software/bge-m3-onnx"> <img alt="Build Status" src="https://github.com/yuniko-software/bge-m3-onnx/actions/workflows/ci-build.yml/badge.svg"> </a>
  <a href="https://huggingface.co/yuniko-software/bge-m3-onnx"> <img alt="HuggingFace Model" src="https://img.shields.io/badge/BGE_M3_ONNX-%F0%9F%A4%97-yellow"> </a>
  <a href="https://github.com/yuniko-software"> <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue"> </a>
</p>

This repository demonstrates how to convert the complete [BGE-M3](https://github.com/FlagOpen/FlagEmbedding) model to [ONNX](https://github.com/microsoft/onnxruntime) format and use it in multiple programming languages with **full multi-vector functionality**.

<img width="1589" height="1180" alt="image" src="https://github.com/user-attachments/assets/c30cf557-4b54-42be-adc6-1c84bb704337" />

## Key Features

- Generate all three BGE-M3 embedding types: dense, sparse, and ColBERT vectors
- Reduced latency with local embedding generation
- Full control over the embedding pipeline with no external dependencies
- Works offline without internet connectivity requirements
- Cross-platform compatibility (C#, Java, Python)
- CUDA GPU acceleration support

## Repository Structure

- `bge-m3-to-onnx.ipynb` - Jupyter notebook demonstrating the BGE-M3 conversion process
- `/samples/dotnet` - C# implementation
- `/samples/java` - Java implementation
- `/samples/python` - Python implementation
- `generate_reference_embeddings.py` - Script to generate reference embeddings for cross-language testing
- `run_tests.sh` and `run_tests.ps1` - Test scripts for Linux/macOS and Windows

## Getting Started

1. Clone this repository:

   ```bash
   git clone https://github.com/yuniko-software/bge-m3-onnx.git
   cd bge-m3-onnx
   ```

2. Get the BGE-M3 ONNX models:
   - Option 1: Download from releases (recommended)
     - Check the repository releases and download `onnx.zip`
     - It already contains the bge-m3 embedding model and its tokenizer
   - Option 2: Generate yourself using the notebook
     - Open and run `bge-m3-to-onnx.ipynb` - this is the most important file in the repository
     - The notebook demonstrates how to convert BGE-M3 from FlagEmbedding to ONNX format
     - This will create `bge_m3_tokenizer.onnx`, `bge_m3_model.onnx`, and `bge_m3_model.onnx_data` in the `/onnx` folder

   > Note: This repository uses [`BAAI/bge-m3`](https://github.com/FlagOpen/FlagEmbedding) as the embedding model with its XLM-RoBERTa tokenizer.

3. Generate reference embeddings (optional):
   - Run `python generate_reference_embeddings.py` to create reference embeddings for testing

   Python dependencies are managed in `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the samples:
   - Once you have the ONNX models in the `/onnx` folder, you can run any sample
   - Try the .NET sample in `/samples/dotnet` or the Java sample in `/samples/java`

5. Verify cross-language embeddings (optional):
   - To ensure that .NET and Java embeddings match the Python-generated embeddings, you can run:
     - On Linux/macOS:

       ```bash
       chmod +x run_tests.sh
       ./run_tests.sh
       ```

     - On Windows:

       ```powershell
       ./run_tests.ps1
       ```

   > Note: These scripts require Python, .NET, Java, and Maven to be installed.

## CUDA Support

This BGE-M3 ONNX model supports CUDA GPU acceleration for improved performance. To enable CUDA support:

### Python

Install the ONNX Runtime with CUDA support:

**Resource**: [ONNX Runtime CUDA Execution Provider Requirements](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements)

This model is compatible with:

- `pip install onnxruntime-gpu[cuda,cudnn]` - packages that include CUDA and cuDNN DLLs
- [PyTorch packages that include CUDA and cuDNN DLLs](https://pytorch.org/get-started/locally/)

### C# and Java

For C# and Java implementations, you need to install CUDA and cuDNN separately:

**CUDA Installation:**

- Linux: [CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux)
- Windows: [CUDA Installation Guide for Windows](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows)

**cuDNN Installation:**

- [cuDNN Backend Installation Guide](https://docs.nvidia.com/deeplearning/cudnn/installation/latest/backend.html)

## Python Example

```python
from bge_m3_embedder import create_cpu_embedder, create_cuda_embedder

# Create CPU-optimized embedder
embedder = create_cpu_embedder("onnx/bge_m3_tokenizer.onnx", "onnx/bge_m3_model.onnx")

# Generate all three embedding types
result = embedder.encode("Hello world!")

print(f"Dense: {len(result['dense_vecs'])} dimensions")
print(f"Sparse: {len(result['lexical_weights'])} tokens")
print(f"ColBERT: {len(result['colbert_vecs'])} vectors")

# Clean up resources
embedder.close()

# For CUDA acceleration
cuda_embedder = create_cuda_embedder("onnx/bge_m3_tokenizer.onnx", "onnx/bge_m3_model.onnx", device_id=0)
result = cuda_embedder.encode("Hello world!")
cuda_embedder.close()

# See full implementation in samples/python
```

## C# Example

```csharp
using BgeM3.Onnx;

// Create CPU-optimized embedder
using var embedder = M3EmbedderFactory.CreateCpuOptimized(tokenizerPath, modelPath);

// Generate all embedding types
var result = embedder.GenerateEmbeddings("Hello world!");

Console.WriteLine($"Dense: {result.DenseEmbedding.Length} dimensions");
Console.WriteLine($"Sparse: {result.SparseWeights.Count} tokens");
Console.WriteLine($"ColBERT: {result.ColBertVectors.Length} vectors");

// For CUDA acceleration
using var cudaEmbedder = M3EmbedderFactory.CreateCudaOptimized(tokenizerPath, modelPath, deviceId: 0);
var cudaResult = cudaEmbedder.GenerateEmbeddings("Hello world!");

// See full implementation in samples/dotnet
```

## Java Example

```java
import com.yunikosoftware.bgem3onnx.*;

// Create CPU-optimized embedder
try (M3Embedder embedder = M3EmbedderFactory.createCpuOptimized(tokenizerPath, modelPath)) {
    // Generate all embedding types
    M3EmbeddingOutput result = embedder.generateEmbeddings("Hello world!");

    System.out.println("Dense: " + result.getDenseEmbedding().length + " dimensions");
    System.out.println("Sparse: " + result.getSparseWeights().size() + " tokens");
    System.out.println("ColBERT: " + result.getColBertVectors().length + " vectors");
}

// For CUDA acceleration
try (M3Embedder cudaEmbedder = M3EmbedderFactory.createCudaOptimized(tokenizerPath, modelPath, 0)) {
    M3EmbeddingOutput result = cudaEmbedder.generateEmbeddings("Hello world!");
    // Process CUDA results
}

// See full implementation in samples/java
```

---

⭐ **If you find this project useful, please consider giving it a star on GitHub!** ⭐

Your support helps make this project more visible to other developers who might benefit from BGE-M3's complete multi-vector functionality.
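As an illustration of what the three embedding types in the examples above can be used for, here is a minimal sketch of how a query and a document might be scored with each of them. It is not taken from the repository's samples: the function names are hypothetical, the sparse output is assumed to be a mapping from token id to weight, and the ColBERT score is the commonly used MaxSim average. The toy vectors stand in for real model output.

```python
def dense_score(query_vec, doc_vec):
    """Dot product of two dense vectors (equals cosine similarity when both are L2-normalized)."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def sparse_score(query_weights, doc_weights):
    """Sum of weight products over the tokens that appear in both sparse maps.

    Assumption: each map goes from a token id to its lexical weight.
    """
    return sum(w * doc_weights[tok] for tok, w in query_weights.items() if tok in doc_weights)


def colbert_score(query_vecs, doc_vecs):
    """MaxSim: for every query token vector, take the best-matching document
    token vector, then average those maxima over the query tokens."""
    if not query_vecs or not doc_vecs:
        return 0.0
    total = 0.0
    for q in query_vecs:
        total += max(dense_score(q, d) for d in doc_vecs)
    return total / len(query_vecs)


# Toy data standing in for real embedder output (hypothetical values).
query = {
    "dense_vecs": [0.6, 0.8],
    "lexical_weights": {"101": 0.5, "2023": 0.2},
    "colbert_vecs": [[1.0, 0.0], [0.0, 1.0]],
}
document = {
    "dense_vecs": [0.8, 0.6],
    "lexical_weights": {"101": 0.4, "7": 0.9},
    "colbert_vecs": [[1.0, 0.0], [0.6, 0.8]],
}

print("dense:  ", dense_score(query["dense_vecs"], document["dense_vecs"]))
print("sparse: ", sparse_score(query["lexical_weights"], document["lexical_weights"]))
print("colbert:", colbert_score(query["colbert_vecs"], document["colbert_vecs"]))
```

The three scores can be used on their own or combined; how they are weighted is an application-level choice that this sketch leaves open.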

ML Frameworks Vector Databases

40 GitHub Stars

tokenizer-to-onnx-model

# Hugging Face Tokenizer to ONNX Model

> ⚠️ **Looking for Full BGE-M3 Functionality?**
>
> This repository demonstrates basic tokenizer conversion and generates only **dense embeddings**. If you need the complete BGE-M3 experience with **all three embedding types** (dense, sparse, and ColBERT vectors), check out our new repository:
>
> **[BGE-M3 ONNX](https://github.com/yuniko-software/bge-m3-onnx)**
>
> The new repository provides:
> - All three BGE-M3 embedding types (dense, sparse, ColBERT)
> - Production-ready implementations in C#, Java, and Python

This repository demonstrates how to convert [Hugging Face](https://github.com/huggingface/transformers) tokenizers to [ONNX](https://github.com/microsoft/onnxruntime) format and use them along with embedding models in multiple programming languages.

## Key Features

- Generate embeddings directly in C#, Java, or Python without third-party APIs or services
- Reduced latency with local embedding generation
- Full control over the embedding pipeline with no external dependencies
- Works offline without internet connectivity requirements
- Cross-platform compatibility

## The Problem

While we can easily download ONNX models from Hugging Face or convert existing PyTorch models to ONNX format for portability, **tokenizers** present a significant challenge. Tokenizers for embedding models are rarely implemented in languages other than Python.

This becomes a major obstacle when trying to use embedding models in languages like C# or Java. Developers face the difficult choice of either implementing complex tokenizers from scratch or relying on Python interop, which adds complexity and dependencies.

## The Solution

This repository uses [ONNX Runtime Extensions](https://github.com/microsoft/onnxruntime-extensions) to convert Hugging Face tokenizers to ONNX format. This gives you the complete embedding pipeline in your preferred programming language without having to implement tokenizers yourself.

ONNX Runtime Extensions are currently supported in:

- C#
- Java
- C++
- Python

## Repository Structure

- `tokenizer_to_onnx_model.ipynb` - Jupyter notebook demonstrating the tokenizer conversion process
- `/samples/dotnet` - C# implementation and tests
- `/samples/java` - Java implementation and tests
- `generate_reference_embeddings.py` - Script to generate reference embeddings for cross-language testing
- `run_tests.sh` and `run_tests.ps1` - Test scripts for Linux/macOS and Windows

## Getting Started

1. Clone this repository:

   ```bash
   git clone https://github.com/yuniko-software/tokenizer-to-onnx-model.git
   cd tokenizer-to-onnx-model
   ```

2. Download the embedding model:
   - Create an `onnx` folder in the repository root
   - Download `model.onnx` and `model.onnx_data` from https://huggingface.co/BAAI/bge-m3/tree/main/onnx
   - Place these files in the `/onnx` folder

   > Note: In this repository, we use [`bge-m3`](https://github.com/FlagOpen/FlagEmbedding) as the embedding model and `XLM-RoBERTa Fast` as the tokenizer.

3. Generate the ONNX tokenizer:
   - Option 1: Run the Jupyter notebook
     - Open and run `tokenizer_to_onnx_model.ipynb` - this is the most important file in the repository
     - The notebook demonstrates how to convert a Hugging Face tokenizer to ONNX format
     - This will create a `tokenizer.onnx` file in the `/onnx` folder
   - Option 2: Download pre-converted files
     - Check the repository releases and download `onnx.zip`
     - It already contains the bge-m3 embedding model and its tokenizer

4. Run the samples:
   - Once you have `tokenizer.onnx`, `model.onnx`, and `model.onnx_data` in the `/onnx` folder, you can run any sample
   - Try the .NET sample in `/samples/dotnet` or the Java sample in `/samples/java`

5. Verify cross-language embeddings (optional):
   - To ensure that .NET and Java embeddings match the HuggingFace-generated embeddings, you can run:
     - On Linux/macOS:

       ```bash
       chmod +x run_tests.sh
       ./run_tests.sh
       ```

     - On Windows:

       ```powershell
       ./run_tests.ps1
       ```

   > Note: These scripts are primarily used for CI in this repository, but you can run them to verify everything works correctly. They require Python, .NET, Java, and Maven to be installed.

## Python Example

```python
import onnxruntime as ort
import numpy as np
from onnxruntime_extensions import get_library_path

def generate_embedding(text, tokenizer_session, model_session):
    tokenizer_outputs = tokenizer_session.run(None, {"inputs": np.array([text])})
    tokens, _, token_indices = tokenizer_outputs

    # Pair each token with its index and sort to restore the original token order
    token_pairs = []
    for i in range(len(tokens)):
        if i < len(token_indices):
            token_pairs.append((token_indices[i], tokens[i]))
    token_pairs.sort()
    ordered_tokens = [pair[1] for pair in token_pairs]

    input_ids = np.array([ordered_tokens], dtype=np.int64)
    attention_mask = np.ones_like(input_ids, dtype=np.int64)

    outputs = model_session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask
    })
    return outputs[1].flatten().tolist()

# Initialize sessions
sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("onnx/tokenizer.onnx", sess_options=sess_options)
model_session = ort.InferenceSession("onnx/model.onnx")

# Generate embedding
embedding = generate_embedding("Hello world!", tokenizer_session, model_session)

# See full implementation in tokenizer_to_onnx_model.ipynb
```

## C# Example

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Initialize sessions
var tokenizerOptions = new SessionOptions();
tokenizerOptions.RegisterOrtExtensions();
var tokenizerSession = new InferenceSession("onnx/tokenizer.onnx", tokenizerOptions);
var modelSession = new InferenceSession("onnx/model.onnx");

// Run tokenizer and model
// See full implementation in samples/dotnet
```

## Java Example

```java
import ai.onnxruntime.*;
import ai.onnxruntime.extensions.OrtxPackage;

// Initialize sessions
OrtEnvironment environment = OrtEnvironment.getEnvironment();
OrtSession.SessionOptions tokenizerOptions = new OrtSession.SessionOptions();
tokenizerOptions.registerCustomOpLibrary(OrtxPackage.getLibraryPath());
OrtSession tokenizerSession = environment.createSession("onnx/tokenizer.onnx", tokenizerOptions);
OrtSession modelSession = environment.createSession("onnx/model.onnx");

// Run tokenizer and model
// See full implementation in samples/java
```

---

⭐ **If you find this project useful, please consider giving it a star on GitHub!** ⭐

Your support helps make this project more visible to other developers who might benefit from it.
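The cross-language verification step above checks that embeddings produced in .NET and Java match the reference embeddings. The test scripts' actual comparison is not shown here, so the following is only a sketch of one way such a check could work: it compares two vectors by cosine similarity against a threshold. The function names and the `0.9999` threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embeddings_match(candidate, reference, threshold=0.9999):
    """Return True when the candidate embedding is close enough to the reference.

    The threshold is an illustrative value, not one taken from the repository.
    """
    return cosine_similarity(candidate, reference) >= threshold


# Toy vectors standing in for a reference embedding and one from another runtime.
reference = [0.12, -0.40, 0.33, 0.85]
candidate = [0.12, -0.40, 0.33, 0.8500001]
print(embeddings_match(candidate, reference))
```

A similarity threshold tolerates the small floating-point differences that different runtimes tend to produce, which an exact element-by-element comparison would not.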

AI Tools ML Frameworks

37 GitHub Stars

bge-m3-qdrant-sample

# BGE-M3 Qdrant sample. Hybrid search & reranking

![image](https://github.com/user-attachments/assets/f59dc6ae-4189-4fd7-8351-6d5c64f6cf92)

This repository contains a [Jupyter notebook](sample.ipynb) that demonstrates how to build an advanced search system using BGE-M3 and Qdrant.

The key feature of this sample is the use of an all-in-one embedding model (BGE-M3) that generates three types of vectors in a single pass:

- **Dense vectors**: For semantic similarity (1024 dimensions)
- **Sparse vectors**: For lexical/keyword matching
- **ColBERT token vectors**: For fine-grained token-level matching

This multi-vector approach provides superior search quality by combining the strengths of different embedding types within a single model.

## Requirements

- Python 3.9+
- Docker (for running Qdrant)
- Jupyter Notebook

## How It Works

The system operates in the following steps:

1. **Data Loading**: Products are loaded from a CSV file
2. **Text Formatting**: Product information is formatted for embedding
3. **Embedding Generation**: BGE-M3 generates all three embedding types in one pass
4. **Vector Database Setup**: Qdrant collection is configured for hybrid search
5. **Data Indexing**: Product data and embeddings are stored in Qdrant
6. **Search**: Queries go through the same embedding process and retrieve results
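
To make the hybrid part of the search step more concrete, here is a small sketch of reciprocal rank fusion (RRF), one common way to merge a dense result list and a sparse result list into a single ranking. This is an assumption for illustration: the notebook may combine results differently, and the function name, the `k=60` constant, and the product ids below are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); ids are returned by descending total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical product ids as returned by a dense search and a sparse search.
dense_hits = ["product-a", "product-b", "product-c"]
sparse_hits = ["product-b", "product-c", "product-a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF uses only rank positions, so the dense and sparse scores do not have to be on comparable scales. A reranking stage, for example with the ColBERT token vectors, could then reorder the fused candidates.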

Knowledge Bases & RAG

32 GitHub Stars
