About AnglE

AnglE is a library for training and running inference with sentence embeddings; models trained with it have reached state-of-the-art results on the STS and MTEB leaderboards. It lets users train BERT-based models and large language models such as LLaMA, Mistral, and Qwen with minimal code. Supported backbones include standard BERT variants, modern LLMs, and bi-directional LLM configurations. The available training losses are AnglE loss, Contrastive loss, CoSENT loss, and Espresso loss. Training runs on both single-GPU and multi-GPU setups. Pretrained models available through the library cover general-purpose English text, code similarity, and the medical domain. AnglE was presented in the ACL 2024 paper "AnglE: Angle-optimized Text Embeddings" and also integrates the NAACL 2024 work on backward dependency enhanced large language models for sentence embeddings (BeLLM).

Published by

seanlee97

AnglE 📐

Sponsored by Mixedbread

For more detailed usage, please read the 📘 document: https://angle.readthedocs.io/en/latest/index.html

📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library is from the paper: AnglE: Angle-optimized Text Embeddings. It allows for training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that can run inference with a variety of transformer-based sentence embedding models.

✨ Features

Loss:

📐 AnglE loss (ACL24)
⚖ Contrastive loss
📏 CoSENT loss
☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)

Backbones:

BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
LLM-based models (LLaMA, Mistral, Qwen, etc.)
Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELM, etc.; refer to: https://github.com/WhereIsAI/BiLLM)

Training:

Single-GPU training
Multi-GPU training

More features will be added in the future.

🏆 Achievements

📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.

📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.

📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!

📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.

📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark (Semantic Textual Similarity)!

🤗 Official Pretrained Models

BERT-based models:

🤗 HF	Max Tokens	Pooling Strategy	Scenario
WhereIsAI/UAE-Large-V1	512	cls	English, General-purpose
WhereIsAI/UAE-Code-Large-V1	512	cls	Code Similarity
WhereIsAI/pubmed-angle-base-en	512	cls	Medical Similarity
WhereIsAI/pubmed-angle-large-en	512	cls	Medical Similarity

LLM-based models:

🤗 HF (lora weight)	Backbone	Max Tokens	Prompts	Pooling Strategy	Scenario
SeanLee97/angle-llama-13b-nli	NousResearch/Llama-2-13b-hf	4096	`Prompts.A`	last token	English, Similarity Measurement
SeanLee97/angle-llama-7b-nli-v2	NousResearch/Llama-2-7b-hf	4096	`Prompts.A`	last token	English, Similarity Measurement

💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.

🚀 Quick Start

⬇️ Installation

Using uv:

uv pip install -U angle-emb

Or using pip:

pip install -U angle-emb

🔍 Inference

1️⃣ BERT-based Models

Option A: With Prompts (for Retrieval Tasks)

Use prompts with {text} as placeholder. Check available prompts via Prompts.list_prompts().

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
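
The `{text}` placeholder can be read as plain string substitution: the query is dropped into the template before it is encoded, while documents are encoded as-is. The sketch below only illustrates that idea. `apply_prompt` is a hypothetical helper, and the template string is an example chosen for illustration rather than a value read from `Prompts`; the library's internal handling may differ.

```python
def apply_prompt(template: str, texts: list[str]) -> list[str]:
    """Substitute each text into the `{text}` placeholder of a prompt template."""
    if "{text}" not in template:
        raise ValueError("prompt template must contain the {text} placeholder")
    # str.replace avoids surprises when the input text itself contains braces.
    return [template.replace("{text}", t) for t in texts]


# Example template (an assumption for illustration, not read from the library).
query_template = "Represent this sentence for searching relevant passages: {text}"

prompted = apply_prompt(query_template, ["what is the weather?"])
print(prompted[0])
```

Calling `Prompts.list_prompts()` remains the way to see which templates the library actually ships.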

Option B: Without Prompts (for Similarity Tasks)

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
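
For readers who want to see what the similarity score measures, here is a minimal pure-Python sketch of cosine similarity over toy vectors. The `cosine` function and the three-dimensional vectors are illustrative stand-ins; they are not the implementation behind `angle_emb.utils.cosine_similarity`.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" standing in for real model output.
vecs = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [1.0, 1.0, 0.0],
    "s3": [0.0, 0.0, 1.0],
}

# Visit every unordered pair exactly once, like the nested loop above.
for (name_a, a), (name_b, b) in combinations(vecs.items(), 2):
    print(name_a, name_b, round(cosine(a, b), 4))
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, so semantically close sentences are expected to score higher than unrelated ones.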

2️⃣ LLM-based Models

For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
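
The `pooling_strategy` argument decides which token representation becomes the sentence vector: `'cls'` for the BERT-based models above, `'last'` for LLMs. The following sketch shows the difference on a toy padded sequence. It assumes right padding and uses a hypothetical `pool` helper; the library's own pooling code may handle padding differently.

```python
def pool(token_vecs, attention_mask, strategy):
    """Reduce per-token vectors to one sentence vector.

    token_vecs: list of per-token vectors for one sentence.
    attention_mask: 1 for real tokens, 0 for padding (right padding assumed).
    """
    if strategy == "cls":
        # First token; BERT-style models place [CLS] here.
        return token_vecs[0]
    if strategy == "last":
        # Last non-padding token; a causal LLM has seen the whole input by then.
        last_index = sum(attention_mask) - 1
        return token_vecs[last_index]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Three real tokens followed by one padding position.
token_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(pool(token_vecs, mask, "cls"))
print(pool(token_vecs, mask, "last"))
```

Last-token pooling suits causal LLMs because only the final position has attended to every earlier token.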

3️⃣ BiLLM-based Models

Enable bidirectional LLMs with apply_billm=True and specify the model class.

import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
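
One way to picture `BiLLM_START_INDEX` is as the layer index from which attention stops being causal. The sketch below encodes that reading with toy attention masks. It is an assumption about the variable's semantics, written with hypothetical helper names, and is not code from BiLLM; the BiLLM repository is the authority on the exact behaviour.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len mask; 1 means position i may attend to position j."""
    if bidirectional:
        return [[1] * seq_len for _ in range(seq_len)]
    # Causal: token i attends only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def layer_masks(num_layers, seq_len, start_index):
    """Assumed reading: layers with index >= start_index are bidirectional."""
    return [
        attention_mask(seq_len, bidirectional=(start_index >= 0 and layer >= start_index))
        for layer in range(num_layers)
    ]


# Toy model with 4 layers; only the last layer (index 3) is bidirectional.
masks = layer_masks(num_layers=4, seq_len=3, start_index=3)
print(masks[0])  # causal layer
print(masks[3])  # bidirectional layer
```

Under this reading, the value `'31'` used above would make only the final layer of a 32-layer model bidirectional.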

4️⃣ Espresso/Matryoshka Models

Truncate layers and embedding dimensions for flexible model compression.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
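
Matryoshka-style truncation is usually understood as keeping the leading dimensions of an embedding. The sketch below shows that convention on a toy vector and rescales the result to unit length. Both the assumption that `embedding_size` keeps the first dimensions and the renormalisation step are illustrative choices, not confirmed library behaviour; `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vec, size):
    """Keep the first `size` dimensions and rescale to unit length."""
    head = vec[:size]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0, 84.0]          # toy 4-dimensional embedding
small = truncate_embedding(full, 2)    # keep the first 2 dimensions
print(len(small), small)
```

Because cosine similarity ignores vector length, the rescaling does not change similarity scores; it only keeps truncated vectors comparable under dot-product search.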

5️⃣ Third-party Models

Load any transformer-based models (e.g., sentence-transformers, BAAI/bge, etc.) using AnglE.

from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)

⚡ Batch Inference

Speed up inference with the batched library (recommended for large-scale processing).

uv pip install batched

import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)

🕸️ Custom Training

💡 For complete details, see the official training documentation.

🗂️ Step 1: Prepare Your Dataset

AnglE supports three dataset formats. Choose based on your task:

Format	Columns	Description	Use Case
Format A	`text1`, `text2`, `label`	Paired texts with similarity scores (0-1)	Similarity scoring
Format B	`query`, `positive`	Query-document pairs	Retrieval without hard negatives
Format C	`query`, `positive`, `negative`	Query with positive and negative samples	Contrastive learning

Notes:

All formats use HuggingFace datasets.Dataset
text1, text2, query, positive, and negative can be str or List[str] (random sampling for lists)
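
To make the three formats concrete, the sketch below builds one example row per format as a plain dict, identifies the format from the column names, and shows one way list-valued fields could be sampled. `detect_format` and `sample_text` are hypothetical helpers written for this illustration; the sample sentences are made up, and the library's own sampling may differ.

```python
import random

# Required columns per format, taken from the table above.
FORMAT_COLUMNS = {
    "A": {"text1", "text2", "label"},
    "B": {"query", "positive"},
    "C": {"query", "positive", "negative"},
}


def detect_format(row):
    """Return 'A', 'B', or 'C' for a row whose keys match one format exactly."""
    for name, columns in FORMAT_COLUMNS.items():
        if set(row) == columns:
            return name
    raise ValueError(f"columns {sorted(row)} match no known format")


def sample_text(value, rng=random):
    """A text field may be a str or a list of str; pick one at random from a list."""
    return rng.choice(value) if isinstance(value, list) else value


rows = [
    {"text1": "A man is eating.", "text2": "A person eats food.", "label": 0.8},
    {"query": "what is the weather?", "positive": "It is rainy today."},
    {
        "query": "what is the weather?",
        "positive": ["It is rainy today.", "The weather is great!"],
        "negative": "I am going to bed.",
    },
]

for row in rows:
    print(detect_format(row), {k: sample_text(v) for k, v in row.items()})
```

Rows that all share one format can then be wrapped into a HuggingFace dataset, for example with `datasets.Dataset.from_list`.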

🚂 Step 2: Training Methods

Option A: CLI Training (Recommended)

Single GPU:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU with FSDP:

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  --config_file examples/FSDP/fsdp_config.yaml \
  -m angle_emb.angle_trainer \
  --gradient_checkpointing 1 \
  --use_reentrant 0 \
  ...

Multi-GPU (Standard):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  -m angle_emb.angle_trainer \
  --model_name_or_path YOUR_MODEL \
  --train_name_or_path YOUR_DATASET \
  ...

📁 More examples: examples/Training

Option B: Python API Training

from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
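
As rough intuition for `cosine_w` and `cosine_tau`, the sketch below implements a CoSENT-style ranking loss in pure Python. It assumes the cosine term follows the CoSENT formulation, with `tau` acting as a multiplicative scale on cosine differences, and that the `*_w` values weight the individual terms in a sum. It is an illustration under those assumptions, not the library's training code.

```python
import math


def cosent_loss(cos_scores, labels, tau=20.0):
    """CoSENT-style ranking loss over a batch of scored text pairs.

    For every two pairs where labels[i] > labels[j], the penalty grows as
    cos_scores[j] approaches or exceeds cos_scores[i].
    """
    total = 0.0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                total += math.exp(tau * (cos_scores[j] - cos_scores[i]))
    return math.log(1.0 + total)


labels = [1.0, 0.5, 0.0]
good = cosent_loss([0.9, 0.5, 0.1], labels)   # ranking agrees with the labels
bad = cosent_loss([0.1, 0.5, 0.9], labels)    # ranking is reversed
print(good < bad)
```

The loss depends only on the relative order of the predicted similarities, which matches the Spearman correlation reported by the evaluation step: both reward correct ranking rather than exact score values.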

⚙️ Advanced Configuration

Training Special Models

Model Type	CLI Flags	Description
LLM	`--is_llm 1` + LoRA params	Must manually enable LLM mode
BiLLM	`--apply_billm 1 --billm_model_class LlamaForCausalLM`	Bidirectional LLMs (guide)
Espresso (ESE)	`--apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256`	Matryoshka-style embeddings

Applying Prompts

Format	Flag	Applies To
Format A	`--text_prompt "text: {text}"`	Both `text1` and `text2`
Format B/C	`--query_prompt "query: {text}"`	`query` field
Format B/C	`--doc_prompt "document: {text}"`	`positive` and `negative` fields
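
The table above amounts to a mapping from column name to prompt template. The sketch below applies such a mapping to a single Format C row, including a list-valued field. `apply_field_prompts` is a hypothetical helper and the sentences are made up; how the trainer applies prompts internally is not shown in this document.

```python
def apply_field_prompts(row, prompts):
    """Fill the `{text}` placeholder for each field that has a prompt.

    `prompts` maps a column name to its template; other columns pass through.
    """
    def fill(template, value):
        if isinstance(value, list):
            return [template.replace("{text}", v) for v in value]
        return template.replace("{text}", value)

    return {
        key: fill(prompts[key], value) if key in prompts else value
        for key, value in row.items()
    }


query_prompt = "query: {text}"
doc_prompt = "document: {text}"

row = {
    "query": "what is the weather?",
    "positive": "It is rainy today.",
    "negative": ["I am going to bed.", "The stock market fell."],
}
prompted = apply_field_prompts(
    row,
    {"query": query_prompt, "positive": doc_prompt, "negative": doc_prompt},
)
print(prompted)
```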

Column Mapping (Legacy Compatibility)

Adapt old datasets without modification:

# CLI
--column_rename_mapping "text:query"

# Python
column_rename_mapping={"text": "query"}
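
The effect of a rename mapping can be sketched on plain dict rows: each old column name is replaced by its new name and everything else passes through. `rename_columns` below is a hypothetical stand-in for illustration only, not the library's implementation.

```python
def rename_columns(rows, mapping):
    """Return new rows with keys renamed according to `mapping` (old -> new)."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


# A legacy row that still uses `text` where the current formats expect `query`.
legacy_rows = [{"text": "what is the weather?", "positive": "It is rainy today."}]
print(rename_columns(legacy_rows, {"text": "query"}))
```

For a HuggingFace dataset, `datasets.Dataset.rename_columns` performs the equivalent operation ahead of training.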

Model Conversion

Convert trained models to sentence-transformers format:

python scripts/convert_to_sentence_transformers.py --help

💡 Fine-tuning Tips

📖 Full documentation

Format	Recommendation
Format A	Increase `cosine_w` or decrease `ibn_w`
Format B	Only tune `ibn_w` and `ibn_tau`
Format C	Set `cosine_w=0`, `angle_w=0.02`, and configure `cln_w` + `ibn_w`

Prevent Catastrophic Forgetting:

Set teacher_name_or_path for knowledge distillation
Use same model path for self-distillation
⚠️ Ensure teacher and student use the same tokenizer

🔄 Integration with sentence-transformers

Task	Status	Notes
Training	⚠️ Partial	SentenceTransformers has AnglE loss, but use official `angle_emb` for best results
Inference	✅ Full	Convert trained models: `examples/convert_to_sentence_transformers.py`

🫡 Citation

If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

📜 ChangeLogs

📅	Description
2025 Jan	v0.6.0 - Major refactoring 🎉:
• Removed `AngleDataTokenizer` - no need to pre-tokenize datasets!
• Removed `DatasetFormats` class - use string literals ('A', 'B', 'C')
• Removed auto-detection of LLM models - set `is_llm` manually
• Renamed `--prompt_template` to `--text_prompt` (Format A only)
• Added `--query_prompt` and `--doc_prompt` for Format B/C
• Added `--column_rename_mapping` to adapt old datasets without modification
• Updated data formats: Format B/C now use `query`, `positive`, `negative` fields
• Support list-based sampling in Format B/C
• Updated examples to use `accelerate launch`
• See MIGRATION_GUIDE.md for upgrade instructions
2024 May 21	Support Espresso Sentence Embeddings
2024 Feb 7	Support training with only positive pairs (`query`, `positive`; Format B in the current naming)
2023 Dec 4	Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1
2023 Nov 2	Release an English pretrained model: `SeanLee97/angle-llama-13b-nli`
2023 Oct 28	Release two Chinese pretrained models: `SeanLee97/angle-roberta-wwm-base-zhnli-v1` and `SeanLee97/angle-llama-7b-zhnli-v1`; add Chinese README.md

📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: [email protected]

© License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.
