About AnglE

AnglE is a library for training and running inference with sentence embeddings, with state-of-the-art results reported on the STS and MTEB leaderboards. It lets users train BERT-based models and large language models such as LLaMA, Mistral, and Qwen with minimal code. The framework supports several backbone types: standard BERT variants, modern LLMs, and bi-directional LLM configurations. It offers multiple training loss functions (AnglE loss, Contrastive loss, CoSENT loss, and Espresso loss) to tune embedding quality for specific tasks, and it supports training on both single-GPU and multi-GPU setups. Pretrained models available through the library cover general-purpose English text, code similarity, and the medical domain. Originally presented in the ACL 2024 paper "AnglE: Angle-optimized Text Embeddings", AnglE also integrates research from NAACL 2024 on backward dependency enhancement in large language models.


AnglE

Sponsored by Mixedbread

For more detailed usage, please read the documentation: https://angle.readthedocs.io/en/latest/index.html

Paper: https://arxiv.org/abs/2309.12871

Train and run inference with powerful sentence embeddings using AnglE. This library accompanies the paper "AnglE: Angle-optimized Text Embeddings". It allows training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that can run inference with a variety of transformer-based sentence embedding models.

Features

Loss:

  • AnglE loss (ACL 2024)
  • Contrastive loss
  • CoSENT loss
  • Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)

Backbones:

  • BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
  • LLM-based models (LLaMA, Mistral, Qwen, etc.)
  • Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELM, etc.; see https://github.com/WhereIsAI/BiLLM)

Training:

  • Single-GPU training
  • Multi-GPU training

Contributions are welcome (http://makeapullrequest.com). More features will be added in the future.

Achievements

May 16, 2024 | The paper "AnglE: Angle-optimized Text Embeddings" was accepted to the ACL 2024 Main Conference.

Mar 13, 2024 | The paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" was accepted to the NAACL 2024 Main Conference.

Mar 8, 2024 | mixedbread's embedding model (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!

Dec 4, 2023 | Our universal sentence embedding model WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.

Dec 2023 | AnglE achieves SOTA performance on the STS Benchmark (Semantic Textual Similarity)!

Official Pretrained Models

BERT-based models:

| HF Model | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |

LLM-based models:

| HF Model (LoRA weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
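The "Pooling Strategy" column in the tables above names which token's hidden state becomes the sentence embedding. The following is a minimal, library-independent sketch of the two strategies listed (cls and last token). The function `pool` and the toy vectors are hypothetical, and right padding is assumed; the library's own pooling code may differ.

```python
from typing import List

Vector = List[float]


def pool(token_vecs: List[Vector], attention_mask: List[int], strategy: str) -> Vector:
    """Pick one token vector to represent the whole sequence (hypothetical helper)."""
    if strategy == "cls":
        # BERT-style models: the first position holds the [CLS] token.
        return token_vecs[0]
    if strategy == "last":
        # Decoder-only LLMs: use the last position that is not padding.
        last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
        return token_vecs[last_index]
    raise ValueError(f"unsupported strategy: {strategy}")


# Toy hidden states for four positions; the last one is padding (assumed right padding).
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(pool(tokens, mask, "cls"))
print(pool(tokens, mask, "last"))
```

This is why the inference examples pass pooling_strategy='cls' for BERT-based checkpoints and pooling_strategy='last' for the LLaMA-based ones.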

You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.

Quick Start

Installation

Using uv:

uv pip install -U angle-emb

Or using pip:

pip install -U angle-emb

Inference

1. BERT-based Models

Option A: With Prompts (for Retrieval Tasks)

Use prompts with {text} as placeholder. Check available prompts via Prompts.list_prompts().

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
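To make the role of the {text} placeholder concrete, here is a minimal sketch of how a prompt template could be filled before encoding. It assumes plain string substitution; `apply_prompt` is a hypothetical helper, not part of angle_emb, and the library's internal handling may differ.

```python
from typing import Optional


def apply_prompt(text: str, prompt: Optional[str] = None) -> str:
    """Return the text unchanged, or substituted into the prompt template."""
    if prompt is None:
        return text
    if "{text}" not in prompt:
        raise ValueError("prompt must contain the {text} placeholder")
    # str.replace avoids surprises from other braces in the template.
    return prompt.replace("{text}", text)


# Query gets a prompt, documents do not (mirrors the retrieval example above).
print(apply_prompt("what is the weather?", "query: {text}"))
print(apply_prompt("The weather is great!"))
```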

Option B: Without Prompts (for Similarity Tasks)

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
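For readers unfamiliar with the metric, the sketch below shows what a cosine similarity between two embedding vectors computes. It is a plain-Python illustration written for this README, not the implementation in angle_emb.utils; the function name `cosine` is hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Same direction, orthogonal, and opposite direction.
print(cosine([1.0, 0.0], [2.0, 0.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
print(cosine([1.0, 1.0], [-1.0, -1.0]))
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.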

2. LLM-based Models

For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

3. BiLLM-based Models

Enable bidirectional LLMs with apply_billm=True and specify the model class.

import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

4. Espresso/Matryoshka Models

Truncate layers and embedding dimensions for flexible model compression.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
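As a rough illustration of what truncating the embedding dimension means, the sketch below keeps the leading components of a vector and rescales the result to unit length. This is the usual Matryoshka-style convention and is an assumption here: whether encode(..., embedding_size=...) renormalizes is not stated above, and `truncate_embedding` is a hypothetical helper.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], embedding_size: int, normalize: bool = True) -> List[float]:
    """Keep the first `embedding_size` dimensions, optionally rescaled to unit length."""
    if not 0 < embedding_size <= len(vec):
        raise ValueError("embedding_size must be between 1 and len(vec)")
    head = vec[:embedding_size]
    if not normalize:
        return head
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0]           # toy 3-dimensional embedding
print(truncate_embedding(full, 2))  # leading 2 dimensions, unit length
```

Since cosine similarity ignores vector length, the normalization step only matters when the truncated vectors are compared with a dot product.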

5. Third-party Models

Load any transformer-based models (e.g., sentence-transformers, BAAI/bge, etc.) using AnglE.

from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)

Batch Inference

Speed up inference with the batched library (recommended for large-scale processing).

uv pip install batched

import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)
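For intuition about why batching helps, the sketch below splits a long input list into fixed-size chunks and encodes each chunk in one call. Note that this is plain fixed-size chunking, a simpler technique than the dynamic batching that the batched library provides; `iter_batches`, `encode_in_batches`, and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for model.encode.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    encode: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 64,
) -> List[List[float]]:
    """Call `encode` once per chunk and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode(batch))
    return vectors


def fake_encode(batch: Sequence[str]) -> List[List[float]]:
    # Hypothetical stand-in for model.encode: one 1-dimensional vector per text.
    return [[float(len(text))] for text in batch]


texts = ['The weather is great!', 'The weather is very good!', 'i am going to bed'] * 50
vecs = encode_in_batches(fake_encode, texts, batch_size=64)
print(len(vecs), [len(b) for b in iter_batches(texts, 64)])
```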

Custom Training

For complete details, see the official training documentation.

Step 1: Prepare Your Dataset

AnglE supports three dataset formats. Choose based on your task:

| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |

Notes:

  • All formats use HuggingFace datasets.Dataset
  • text1, text2, query, positive, and negative can each be a str or a List[str]; when a list is given, one item is sampled at random
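The three formats can be pictured as rows with different column sets. The sketch below shows one toy row per format, a hypothetical `detect_format` helper that identifies the format from the column names, and a hypothetical `sample_text` helper illustrating the list case. Uniform sampling is an assumption; the library's sampling behavior is not specified here.

```python
import random
from typing import Dict, List, Union

Text = Union[str, List[str]]
Row = Dict[str, Union[Text, float]]

# Column sets taken from the format table above.
FORMAT_COLUMNS = {
    "A": {"text1", "text2", "label"},
    "B": {"query", "positive"},
    "C": {"query", "positive", "negative"},
}


def detect_format(row: Row) -> str:
    """Return 'A', 'B', or 'C' based on the row's column names."""
    columns = set(row)
    for name, expected in FORMAT_COLUMNS.items():
        if columns == expected:
            return name
    raise ValueError(f"columns {sorted(columns)} match no known format")


def sample_text(value: Text, rng: random.Random) -> str:
    """Pick one candidate when a list is given; pass strings through."""
    return rng.choice(value) if isinstance(value, list) else value


row_a: Row = {"text1": "A man is playing a guitar.", "text2": "Someone plays guitar.", "label": 0.8}
row_b: Row = {"query": "what is the weather?", "positive": "it is rainy today."}
row_c: Row = {
    "query": "what is the weather?",
    "positive": ["it is rainy today.", "The weather is great!"],
    "negative": "i am going to bed",
}

rng = random.Random(0)
print(detect_format(row_a), detect_format(row_b), detect_format(row_c))
print(sample_text(row_c["positive"], rng))
```

With the HuggingFace datasets package installed, a list of such rows can be turned into a datasets.Dataset with Dataset.from_list.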

Step 2: Training Methods

Option A: CLI Training (Recommended)

Single GPU:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU with FSDP:

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  --config_file examples/FSDP/fsdp_config.yaml \
  -m angle_emb.angle_trainer \
  --gradient_checkpointing 1 \
  --use_reentrant 0 \
  ...

Multi-GPU (Standard):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  -m angle_emb.angle_trainer \
  --model_name_or_path YOUR_MODEL \
  --train_name_or_path YOUR_DATASET \
  ...

More examples: examples/Training

Option B: Python API Training

from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
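The evaluation step reports a Spearman correlation between predicted similarities and gold labels. As background, the sketch below computes Spearman's coefficient in plain Python: rank both lists (averaging ranks for ties), then take the Pearson correlation of the ranks. It is an illustration only; how angle.evaluate computes the value internally is not described here, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Toy cosine scores versus gold labels: same ordering, different scale.
print(spearman([0.1, 0.4, 0.9], [1.0, 2.0, 5.0]))
```

Because only the ordering matters, a model can score well on this metric even when its raw cosine values are on a different scale from the labels.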

Advanced Configuration

Training Special Models

| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |

Applying Prompts

| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text: {text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query: {text}" | query field |
| Format B/C | --doc_prompt "document: {text}" | positive and negative fields |
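The mapping in the table above can be read as "each flag rewrites specific columns". The sketch below applies that mapping to a single row. It is a hypothetical illustration: `apply_prompts` and `PROMPT_TARGETS` are names invented here, string-valued columns are assumed, and the trainer's actual preprocessing may differ.

```python
from typing import Dict

# Which columns each prompt flag applies to, per the table above.
PROMPT_TARGETS = {
    "text_prompt": ("text1", "text2"),
    "query_prompt": ("query",),
    "doc_prompt": ("positive", "negative"),
}


def apply_prompts(row: Dict[str, str], prompts: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with each template filled into its target columns."""
    out = dict(row)
    for flag, template in prompts.items():
        for field in PROMPT_TARGETS[flag]:
            if field in out:
                out[field] = template.replace("{text}", out[field])
    return out


row = {"query": "what is the weather?", "positive": "it is rainy today.", "negative": "i am going to bed"}
prompts = {"query_prompt": "query: {text}", "doc_prompt": "document: {text}"}
print(apply_prompts(row, prompts))
```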

Column Mapping (Legacy Compatibility)

Adapt old datasets without modification:

# CLI
--column_rename_mapping "text:query"

# Python
column_rename_mapping={"text": "query"}
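Conceptually, the mapping renames columns of each row before training. A minimal sketch of that idea follows; `parse_pair` and `rename_columns` are hypothetical helpers, and only the single "old:new" form shown above is handled, since the syntax for multiple pairs is not documented here.

```python
from typing import Dict


def parse_pair(spec: str) -> Dict[str, str]:
    """Parse a single 'old:new' string into a one-entry mapping."""
    old, sep, new = spec.partition(":")
    if not sep or not old or not new:
        raise ValueError("expected a value of the form 'old:new'")
    return {old: new}


def rename_columns(row: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename keys found in the mapping; leave all other keys untouched."""
    return {mapping.get(key, key): value for key, value in row.items()}


legacy_row = {"text": "what is the weather?", "positive": "it is rainy today."}
mapping = parse_pair("text:query")
print(rename_columns(legacy_row, mapping))
```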

Model Conversion

Convert trained models to sentence-transformers format:

python scripts/convert_to_sentence_transformers.py --help

Fine-tuning Tips

See the full documentation for details.

| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
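To see why these weights matter, the sketch below treats the training objective as a weighted sum of component losses, where a weight of 0 switches a component off. The weighted-sum form is an assumption made for illustration, `combine_losses` is a hypothetical helper, and the component loss values are made-up numbers; only the weights 1.0, 1.0, and 0.02 come from the Python API example.

```python
from typing import Dict


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum each component loss scaled by its '<name>_w' weight; skip non-positive weights."""
    total = 0.0
    for name, value in losses.items():
        weight = weights.get(f"{name}_w", 0.0)
        if weight > 0:
            total += weight * value
    return total


# Made-up component losses for one batch (illustrative values only).
losses = {"cosine": 0.5, "ibn": 2.0, "angle": 10.0}

default_weights = {"cosine_w": 1.0, "ibn_w": 1.0, "angle_w": 0.02}
format_c_weights = {"cosine_w": 0.0, "ibn_w": 1.0, "angle_w": 0.02}

print(combine_losses(losses, default_weights))
print(combine_losses(losses, format_c_weights))
```

Under this reading, raising cosine_w or lowering ibn_w shifts the balance toward the pairwise similarity term, which matches the Format A recommendation.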

Prevent Catastrophic Forgetting:

  • Set teacher_name_or_path to enable knowledge distillation
  • Use the same model path as the student for self-distillation
  • Warning: ensure the teacher and student use the same tokenizer

Integration with sentence-transformers

| Task | Status | Notes |
|---|---|---|
| Training | Partial | SentenceTransformers has AnglE loss, but use the official angle_emb for best results |
| Inference | Full | Convert trained models: examples/convert_to_sentence_transformers.py |

Citation

If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

ChangeLogs

2025 Jan: v0.6.0, major refactoring:

  • Removed AngleDataTokenizer; there is no need to pre-tokenize datasets
  • Removed the DatasetFormats class; use string literals ('A', 'B', 'C')
  • Removed auto-detection of LLM models; set is_llm manually
  • Renamed --prompt_template to --text_prompt (Format A only)
  • Added --query_prompt and --doc_prompt for Format B/C
  • Added --column_rename_mapping to adapt old datasets without modification
  • Updated data formats: Format B/C now use query, positive, negative fields
  • Added support for list-based sampling in Format B/C
  • Updated examples to use accelerate launch
  • See MIGRATION_GUIDE.md for upgrade instructions

2024 May 21: Added support for Espresso Sentence Embeddings.

2024 Feb 7: Added support for training with only positive pairs (Format C: query, positive).

2023 Dec 4: Released a universal English sentence embedding model: WhereIsAI/UAE-Large-V1.

2023 Nov 2: Released an English pretrained model: SeanLee97/angle-llama-13b-nli.

2023 Oct 28: Released two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; added a Chinese README.md.

Contact

If you have any questions or suggestions, please feel free to contact us via email: [email protected]

License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.