funaudiollm

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Software by funaudiollm

Open Source

SenseVoice

([简体中文](./README_zh.md)|English|[日本語](./README_ja.md)) # Introduction SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). <div align="center"> <img src="image/sensevoice2.png"> </div> [//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>) <div align="center"> <h4> <a href="https://funaudiollm.github.io/"> Homepage </a> ｜<a href="#What's News"> What's News </a> ｜<a href="#Benchmarks"> Benchmarks </a> ｜<a href="#Install"> Install </a> ｜<a href="#Usage"> Usage </a> ｜<a href="#Community"> Community </a> </h4> Model Zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall) Online Demo: [modelscope demo](https://www.modelscope.cn/studios/iic/SenseVoice), [huggingface space](https://huggingface.co/spaces/FunAudioLLM/SenseVoice) </div> <a name="Highligts"></a> # Highlights 🎯 **SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection. - **Multilingual Speech Recognition:** Trained with over 400,000 hours of data, supporting more than 50 languages, the recognition performance surpasses that of the Whisper model. - **Rich transcribe:** - Possess excellent emotion recognition capabilities, achieving and surpassing the effectiveness of the current best emotion recognition models on test data. - Offer sound event detection capabilities, supporting the detection of various common human-computer interaction events such as bgm, applause, laughter, crying, coughing, and sneezing. - **Efficient Inference:** The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. 
It requires only 70ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large. - **Convenient Finetuning:** Provide convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios. - **Service Deployment:** Offer service deployment pipeline, supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others. <a name="What's News"></a> # What's New 🔥 - 2026/05: SenseVoice now supports speaker diarization. Use with `vad_model` + `spk_model` + `punc_model` to get per-sentence speaker labels. Requires installing FunASR from source: `pip install git+https://github.com/modelscope/FunASR.git` - 2024/11: Add support for timestamp based on the CTC alignment. - 2024/7: Added Export Features for [ONNX](./demo_onnx.py) and [libtorch](./demo_libtorch.py), as well as Python Version Runtimes: [funasr-onnx-0.4.0](https://pypi.org/project/funasr-onnx/), [funasr-torch-0.1.1](https://pypi.org/project/funasr-torch/) - 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model is open-sourced, which offers high-precision multilingual speech recognition, emotion recognition, and audio event detection capabilities for Mandarin, Cantonese, English, Japanese, and Korean and leads to exceptionally low inference latency. - 2024/7: The CosyVoice for natural speech generation with multi-language, timbre, and emotion control. CosyVoice excels in multi-lingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M). 
- 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR. <a name="Benchmarks"></a> # Benchmarks 📝 ## Multilingual Speech Recognition We compared the performance of multilingual speech recognition between SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages. <div align="center"> <img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" /> </div> ## Speech Emotion Recognition Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models. <div align="center"> <img src="image/ser_table.png" width="1000" /> </div> Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets. <div align="center"> <img src="image/ser_figure.png" width="500" /> </div> ## Audio Event Detection Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. 
We compared its performance on the environmental sound classification ESC-50 dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks. However, due to limitations in training data and methodology, its event classification performance has some gaps compared to specialized AED models. <div align="center"> <img src="image/aed_figure.png" width="500" /> </div> ## Computational Efficiency The SenseVoice-Small model deploys a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a similar number of parameters to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. <div align="center"> <img src="image/inference.png" width="1000" /> </div> # Requirements ```shell pip install -r requirements.txt ``` <a name="Usage"></a> # Usage ## Inference Supports input of audio in any format and of any duration. ```python from funasr import AutoModel from funasr.utils.postprocess_utils import rich_transcription_postprocess model_dir = "iic/SenseVoiceSmall" model = AutoModel( model=model_dir, trust_remote_code=True, remote_code="./model.py", vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, device="cuda:0", ) # en res = model.generate( input=f"{model.model_path}/example/en.mp3", cache={}, language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" use_itn=True, batch_size_s=60, merge_vad=True, # merge_length_s=15, ) text = rich_transcription_postprocess(res[0]["text"]) print(text) ``` <details><summary>Parameter Description (Click to Expand)</summary> - `model_dir`: The name of the model, or the path to the model on the local disk. - `trust_remote_code`: - When `True`, it means that the model's code implementation is loaded from `remote_code`, which specifies the exact location of the `model` code (for example, `model.py` in the current directory). 
It supports absolute paths, relative paths, and network URLs. - When `False`, it indicates that the model's code implementation is the integrated version within [FunASR](https://github.com/modelscope/FunASR). At this time, modifications made to `model.py` in the current directory will not be effective, as the version loaded is the internal one from FunASR. For the model code, [click here to view](https://github.com/modelscope/FunASR/tree/main/funasr/models/sense_voice). - `vad_model`: This indicates the activation of VAD (Voice Activity Detection). The purpose of VAD is to split long audio into shorter clips. In this case, the inference time includes both VAD and SenseVoice total consumption, and represents the end-to-end latency. If you wish to test the SenseVoice model's inference time separately, the VAD model can be disabled. - `vad_kwargs`: Specifies the configurations for the VAD model. `max_single_segment_time`: denotes the maximum duration for audio segmentation by the `vad_model`, with the unit being milliseconds (ms). - `use_itn`: Whether the output result includes punctuation and inverse text normalization. - `batch_size_s`: Indicates the use of dynamic batching, where the total duration of audio in the batch is measured in seconds (s). - `merge_vad`: Whether to merge short audio fragments segmented by the VAD model, with the merged length being `merge_length_s`, in seconds (s). - `ban_emo_unk`: Whether to ban the output of the `emo_unk` token. 
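As a rough picture of what `merge_vad` and `merge_length_s` describe, the sketch below greedily merges consecutive VAD segments while the merged span stays within the length limit. It is an illustration only, not FunASR's actual implementation; the function name `merge_segments` and the sample segment times are hypothetical.

```python
def merge_segments(segments_ms, merge_length_s=15):
    """Greedily merge consecutive (start_ms, end_ms) segments.

    A segment is folded into the previous merged span as long as the
    resulting span does not exceed merge_length_s seconds.
    """
    limit_ms = merge_length_s * 1000
    merged = []
    for start, end in segments_ms:
        if merged and end - merged[-1][0] <= limit_ms:
            # Extend the current merged span to cover this segment.
            merged[-1] = (merged[-1][0], end)
        else:
            # Start a new span.
            merged.append((start, end))
    return merged


# Hypothetical VAD output in milliseconds.
segments = [(0, 4000), (4200, 9000), (9500, 16000), (16500, 20000)]
print(merge_segments(segments, merge_length_s=15))
```

Fewer, longer clips generally mean fewer forward passes, which is why merging pairs naturally with duration-based batching via `batch_size_s`.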
</details> ### Speaker Diarization SenseVoice supports speaker diarization when used with VAD + CAM++ speaker model + punctuation model: ```python from funasr import AutoModel from funasr.utils.postprocess_utils import rich_transcription_postprocess model = AutoModel( model="iic/SenseVoiceSmall", trust_remote_code=True, remote_code="./model.py", vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, spk_model="cam++", punc_model="ct-punc", device="cuda:0", ) res = model.generate( input="example.wav", cache={}, language="auto", use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15, ) # Per-sentence results with speaker labels for sent in res[0]["sentence_info"]: text = rich_transcription_postprocess(sent["text"]) print(f"Speaker {sent['spk']}: [{sent['start']}ms - {sent['end']}ms] {text}") ``` > Note: Requires installing FunASR from source: `pip install git+https://github.com/modelscope/FunASR.git` If all inputs are short audios (<30s), and batch inference is needed to speed up inference efficiency, the VAD model can be removed, and `batch_size` can be set accordingly. ```python model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0") res = model.generate( input=f"{model.model_path}/example/en.mp3", cache={}, language="zh", # "zh", "en", "yue", "ja", "ko", "nospeech" use_itn=False, batch_size=64, ) ``` For more usage, please refer to [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md) ### Inference directly Supports input of audio in any format, with an input duration limit of 30 seconds or less. 
```python from model import SenseVoiceSmall from funasr.utils.postprocess_utils import rich_transcription_postprocess model_dir = "iic/SenseVoiceSmall" m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0") m.eval() res = m.inference( data_in=f"{kwargs['model_path']}/example/en.mp3", language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" use_itn=False, ban_emo_unk=False, **kwargs, ) text = rich_transcription_postprocess(res[0][0]["text"]) print(text) ``` ### Export and Test <details><summary>ONNX and Libtorch Export</summary> #### ONNX ```python # pip3 install -U funasr funasr-onnx from pathlib import Path from funasr_onnx import SenseVoiceSmall from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess model_dir = "iic/SenseVoiceSmall" model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True) # inference wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)] res = model(wav_or_scp, language="auto", use_itn=True) print([rich_transcription_postprocess(i) for i in res]) ``` Note: ONNX model is exported to the original model directory. #### Libtorch ```python from pathlib import Path from funasr_torch import SenseVoiceSmall from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess model_dir = "iic/SenseVoiceSmall" model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0") wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)] res = model(wav_or_scp, language="auto", use_itn=True) print([rich_transcription_postprocess(i) for i in res]) ``` Note: Libtorch model is exported to the original model directory. 
</details> ## Service ### Deployment with FastAPI ```shell export SENSEVOICE_DEVICE=cuda:0 fastapi run --port 50000 ``` ## Finetune ### Requirements ```shell git clone https://github.com/alibaba/FunASR.git && cd FunASR pip3 install -e ./ ``` ## 🐳 Docker Support SenseVoice can be built and run using Docker to simplify setup, ensure reproducibility, and support both CPU and GPU inference. ### Build with Docker ```bash docker build -t sensevoice . ``` ### Run (GPU – default) ```bash docker run --gpus all -p 50000:50000 sensevoice ``` ### Run (CPU-only) ```bash docker run -e SENSEVOICE_DEVICE=cpu -p 50000:50000 sensevoice ``` ### Docker Compose Docker Compose provides an easier way to run SenseVoice with persistent model caching, networking etc. ### Start Stack ```bash docker compose up --build ``` ### Data prepare Data examples ```text {"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140} {"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360} ``` Full ref to `data/train_example.jsonl` <details><summary>Data Prepare Details</summary> Description： - `key`: audio file unique ID - `source`：path to the audio file - `source_len`：number of fbank frames of the audio file - `target`：transcription - `target_len`：length of target - `text_language`：language id of the audio file - 
`emo_target`：emotion label of the audio file - `event_target`：event label of the audio file - `with_or_wo_itn`：whether includes punctuation and inverse text normalization `train_text.txt` ```bash BAC009S0764W0121 甚至出现交易几乎停滞的情况 BAC009S0916W0489 湖北一公司以员工名义贷款数十员工负债千万 asr_example_cn_en 所有只要处理 data 不管你是做 machine learning 做 deep learning 做 data analytics 做 data science 也好 scientist 也好通通都要都做的基本功啊那 again 先先对有一些>也许对 ID0012W0014 he tried to think how it could be ``` `train_wav.scp` ```bash BAC009S0764W0121 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav BAC009S0916W0489 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0916W0489.wav asr_example_cn_en https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_cn_en.wav ID0012W0014 https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav ``` `train_text_language.txt` The language ids include `<|zh|>`、`<|en|>`、`<|yue|>`、`<|ja|>` and `<|ko|>`. ```bash BAC009S0764W0121 <|zh|> BAC009S0916W0489 <|zh|> asr_example_cn_en <|zh|> ID0012W0014 <|en|> ``` `train_emo.txt` The emotion labels include`<|HAPPY|>`、`<|SAD|>`、`<|ANGRY|>`、`<|NEUTRAL|>`、`<|FEARFUL|>`、`<|DISGUSTED|>` and `<|SURPRISED|>`. ```bash BAC009S0764W0121 <|NEUTRAL|> BAC009S0916W0489 <|NEUTRAL|> asr_example_cn_en <|NEUTRAL|> ID0012W0014 <|NEUTRAL|> ``` `train_event.txt` The event labels include`<|BGM|>`、`<|Speech|>`、`<|Applause|>`、`<|Laughter|>`、`<|Cry|>`、`<|Sneeze|>`、`<|Breath|>` and `<|Cough|>`. 
```bash BAC009S0764W0121 <|Speech|> BAC009S0916W0489 <|Speech|> asr_example_cn_en <|Speech|> ID0012W0014 <|Speech|> ``` `Command` ```shell # generate train.jsonl and val.jsonl from wav.scp, text.txt, text_language.txt, emo_target.txt, event_target.txt sensevoice2jsonl \ ++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt", "../../../data/list/train_text_language.txt", "../../../data/list/train_emo.txt", "../../../data/list/train_event.txt"]' \ ++data_type_list='["source", "target", "text_language", "emo_target", "event_target"]' \ ++jsonl_file_out="../../../data/list/train.jsonl" ``` If `train_text_language.txt`, `train_emo.txt` and `train_event.txt` are not provided, the language, emotion and event labels are predicted automatically by the `SenseVoice` model. ```shell # generate train.jsonl and val.jsonl from wav.scp and text.txt sensevoice2jsonl \ ++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \ ++data_type_list='["source", "target"]' \ ++jsonl_file_out="../../../data/list/train.jsonl" \ ++model_dir='iic/SenseVoiceSmall' ``` </details> ### Finetune Make sure to set `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation directory you set up earlier. ```shell bash finetune.sh ``` ## WebUI ```shell python webui.py ``` <div align="center"><img src="image/webui.png" width="700"/> </div> ## Remarkable Third-Party Work - Triton (GPU) Deployment Best Practices: Using Triton + TensorRT, tested with FP32, achieving an acceleration ratio of 526 on V100 GPU. FP16 support is in progress. [Repository](https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/README.md) - Sherpa-onnx Deployment Best Practices: Supports using SenseVoice in 10 programming languages: C++, C, Python, C#, Go, Swift, Kotlin, Java, JavaScript, and Dart. Also supports deploying SenseVoice on platforms like iOS, Android, and Raspberry Pi. 
[Repository](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) - [SenseVoice.cpp](https://github.com/lovemefan/SenseVoice.cpp). Inference of SenseVoice in pure C/C++ based on GGML, supporting 3-bit, 4-bit, 5-bit, 8-bit quantization, etc. with no third-party dependencies. - [streaming-sensevoice](https://github.com/pengzhendong/streaming-sensevoice) processes inference in chunks. To achieve pseudo-streaming, it employs a truncated attention mechanism, sacrificing some accuracy. It also supports CTC prefix beam search and hot-word boosting. - [OmniSenseVoice](https://github.com/lifeiteng/OmniSenseVoice) is optimized for lightning-fast inference and batch processing. - [SenseVoice Hotword](https://www.modelscope.cn/models/dengcunqin/SenseVoiceSmall_hotword): Neural Network Hotword Enhancement, [Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network](https://mp.weixin.qq.com/s/1QkIvh8j7rrUjRyWOgAvdA). ## Ecosystem SenseVoice is part of the **FunAudioLLM** family: | Project | Description | Stars | |---------|-------------|-------| | [FunASR](https://github.com/modelscope/FunASR) | Industrial speech recognition toolkit: VAD, ASR, punctuation, diarization | [![](https://img.shields.io/github/stars/modelscope/FunASR?style=social)](https://github.com/modelscope/FunASR) | | [Fun-ASR-Nano](https://github.com/FunAudioLLM/Fun-ASR) | End-to-end LLM-based ASR: 31 languages, streaming, hotwords | [![](https://img.shields.io/github/stars/FunAudioLLM/Fun-ASR?style=social)](https://github.com/FunAudioLLM/Fun-ASR) | | [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) | Natural speech generation: multi-language, zero-shot cloning | [![](https://img.shields.io/github/stars/FunAudioLLM/CosyVoice?style=social)](https://github.com/FunAudioLLM/CosyVoice) | | [FunClip](https://github.com/modelscope/FunClip) | AI-powered video clipping with speech recognition | 
[![](https://img.shields.io/github/stars/modelscope/FunClip?style=social)](https://github.com/modelscope/FunClip) | <a name="Community"></a> # Community If you run into problems, you can open an issue on the GitHub page. You can also scan the following DingTalk group QR code to join the community group for discussion. | FunASR | |:--------------------------------------------------------:| | <img src="image/dingding_funasr.png" width="250"/> | <a href="https://star-history.com/#FunAudioLLM/SenseVoice&modelscope/FunASR&FunAudioLLM/Fun-ASR&Date"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=FunAudioLLM/SenseVoice,modelscope/FunASR,FunAudioLLM/Fun-ASR&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=FunAudioLLM/SenseVoice,modelscope/FunASR,FunAudioLLM/Fun-ASR&type=Date" /> <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=FunAudioLLM/SenseVoice,modelscope/FunASR,FunAudioLLM/Fun-ASR&type=Date" /> </picture> </a>
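For readers preparing finetuning data by hand, the record layout listed under "Data prepare" can be sketched in plain Python. This is a minimal illustration, not the `sensevoice2jsonl` tool: the helper name `make_record` and the audio path are hypothetical, `source_len` (the number of fbank frames) is supplied by the caller rather than computed, and `target_len` is assumed to be a whitespace token count, which matches the two English examples in the README but may not hold for other languages.

```python
import json


def make_record(key, source, source_len, target,
                text_language="<|en|>", emo_target="<|NEUTRAL|>",
                event_target="<|Speech|>", with_itn=False):
    """Build one training record using the field names from the README."""
    return {
        "key": key,
        "text_language": text_language,
        "emo_target": emo_target,
        "event_target": event_target,
        "with_or_wo_itn": "<|withitn|>" if with_itn else "<|woitn|>",
        "target": target,
        "source": source,
        # Assumption: length counted in whitespace-separated tokens.
        "target_len": len(target.split()),
        # Assumption: caller has already counted the fbank frames.
        "source_len": source_len,
    }


record = make_record(
    key="AUD0000001556_S0007580",
    source="audio/AUD0000001556_S0007580.wav",  # hypothetical local path
    source_len=360,
    target="there is a tendency to identify the self or take interest "
           "in what one has got used to",
)

# One JSON object per line is what a .jsonl file contains.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per utterance to a file yields the same shape as `data/train_example.jsonl`.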

AI & Machine Learning Audio Editing & DAW

8.5K Github Stars

Open Source

CosyVoice

![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210) ## 👉🏻 CosyVoice 👈🏻 **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/pdf/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval) **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/pdf/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M) ## Highlight🔥 **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild. ### Key Features - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning. - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness. - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, which gives more control over the output and makes it suitable for production use. 
- **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module. - **Bi-Streaming**: Support both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output. - **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc. ## Roadmap - [x] 2025/12 - [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script - [x] release Fun-CosyVoice3-0.5B modelscope gradio space - [x] 2025/08 - [x] Thanks to the contribution from NVIDIA Yuekai Zhang, add triton trtllm runtime support and cosyvoice2 grpo training support - [x] 2025/07 - [x] release Fun-CosyVoice 3.0 eval set - [x] 2025/05 - [x] add CosyVoice2-0.5B vllm support - [x] 2024/12 - [x] 25hz CosyVoice2-0.5B released - [x] 2024/09 - [x] 25hz CosyVoice-300M base model - [x] 25hz CosyVoice-300M voice conversion function - [x] 2024/08 - [x] Repetition Aware Sampling(RAS) inference for llm stability - [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization - [x] 2024/07 - [x] Flow matching training support - [x] WeTextProcessing support when ttsfrd is not available - [x] Fastapi server and client ## Evaluation | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>SS (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>SS (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>SS (%) ↑ | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - | | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 | | MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - | | F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 | | Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - | | CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 | | FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - | | 
Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 | | VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - | | VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - | | HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - | | VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 | | GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - | | GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - | | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 | | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 | ## Install ### Clone and install - Clone the repo ``` sh git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git # If you failed to clone the submodule due to network failures, please run the following command until success cd CosyVoice git submodule update --init --recursive ``` - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html - Create Conda env: ``` sh conda create -n cosyvoice -y python=3.10 conda activate cosyvoice pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com # If you encounter sox compatibility issues # ubuntu sudo apt-get install sox libsox-dev # centos sudo yum install sox sox-devel ``` ### Model download We strongly recommend that you download our pretrained `Fun-CosyVoice3-0.5B` `CosyVoice2-0.5B` `CosyVoice-300M` `CosyVoice-300M-SFT` `CosyVoice-300M-Instruct` model and `CosyVoice-ttsfrd` resource. 
``` python # modelscope SDK model download from modelscope import snapshot_download snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B') snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B') snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M') snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT') snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct') snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd') # for oversea users, huggingface SDK model download from huggingface_hub import snapshot_download snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B') snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B') snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M') snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT') snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct') snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd') ``` Optionally, you can unzip `ttsfrd` resource and install `ttsfrd` package for better text normalization performance. Notice that this step is not necessary. If you do not install `ttsfrd` package, we will use wetext by default. ``` sh cd pretrained_models/CosyVoice-ttsfrd/ unzip resource.zip -d . pip install ttsfrd_dependency-0.1-py3-none-any.whl pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl ``` ### Basic Usage We strongly recommend using `Fun-CosyVoice3-0.5B` for better performance. Follow the code in `example.py` for detailed usage of each model. 
```sh python example.py ``` #### vLLM Usage CosyVoice2/3 now supports **vLLM 0.11.x+ (V1 engine)** and **vLLM 0.9.0 (legacy)**. Older vLLM versions (<0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are untested. Note that `vllm` has many specific requirements. You can create a separate environment for it, so that your existing environment is not corrupted if your hardware does not support vLLM. ``` sh conda create -n cosyvoice_vllm --clone cosyvoice conda activate cosyvoice_vllm # for vllm==0.9.0 pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com # for vllm>=0.11.0 pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com python vllm_example.py ``` #### Start web demo You can use our web demo page to get familiar with CosyVoice quickly. Please see the demo website for details. ``` sh # change iic/CosyVoice-300M-SFT for sft inference, or iic/CosyVoice-300M-Instruct for instruct inference python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M ``` #### Advanced Usage For advanced users, we have provided training and inference scripts in `examples/libritts`. #### Build for deployment Optionally, if you want service deployment, you can run the following steps. ``` sh cd runtime/python docker build -t cosyvoice:v1.0 . 
# change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference # for grpc usage docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity" cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct> # for fastapi usage docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity" cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct> ``` #### Using NVIDIA TensorRT-LLM for deployment Using TensorRT-LLM to accelerate the CosyVoice2 LLM can give a 4x speedup compared with the Hugging Face Transformers implementation. To get started quickly: ``` sh cd runtime/triton_trtllm docker compose up -d ``` For more details, see [here](https://github.com/FunAudioLLM/CosyVoice/tree/main/runtime/triton_trtllm) ## Discussion & Communication You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues). You can also scan the QR code to join our official Dingding chat group. <img src="./asset/dingding.png" width="250px"> ## Acknowledge 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR). 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec). 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS). 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec). 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet). 
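The vLLM version constraints stated under "vLLM Usage" can be written down as a small check, which may be handy before launching `vllm_example.py`. This is a sketch under stated assumptions: the helper `classify_vllm_version` is hypothetical (it is not part of CosyVoice or vLLM), it handles plain numeric versions only, and it treats anything strictly between 0.9.0 and 0.11.0 (including 0.9.x patch releases, which the text does not mention) as untested.

```python
def classify_vllm_version(version: str) -> str:
    """Map a vLLM version string to the support status described above."""
    # Accept "0.9.0" or "v0.11.0"; missing components count as 0.
    parts = [int(p) for p in version.lstrip("v").split(".")[:3]]
    parts += [0] * (3 - len(parts))
    release = tuple(parts)

    if release < (0, 9, 0):
        return "unsupported"
    if release == (0, 9, 0):
        return "supported (legacy)"
    if release >= (0, 11, 0):
        return "supported (V1 engine)"
    # Assumption: everything in between is treated as untested.
    return "untested"


for candidate in ("0.8.5", "v0.9.0", "0.10.2", "v0.11.0"):
    print(candidate, "->", classify_vllm_version(candidate))
```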
## Citations ``` bibtex @article{du2024cosyvoice, title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens}, author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others}, journal={arXiv preprint arXiv:2407.05407}, year={2024} } @article{du2024cosyvoice2, title={Cosyvoice 2: Scalable streaming speech synthesis with large language models}, author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others}, journal={arXiv preprint arXiv:2412.10117}, year={2024} } @article{du2025cosyvoice, title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training}, author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others}, journal={arXiv preprint arXiv:2505.17589}, year={2025} } @inproceedings{lyu2025build, title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice}, author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao}, booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={1--2}, year={2025}, organization={IEEE} } ``` ## Ecosystem CosyVoice is part of the **FunAudioLLM** family, a complete speech AI toolkit: | Project | Description | Stars | |---------|-------------|-------| | [FunASR](https://github.com/modelscope/FunASR) | Industrial speech recognition: 50+ languages, speaker diarization, streaming | [![](https://img.shields.io/github/stars/modelscope/FunASR?style=social)](https://github.com/modelscope/FunASR) | | [Fun-ASR-Nano](https://github.com/FunAudioLLM/Fun-ASR) | End-to-end LLM-based ASR: 31 languages, hotwords, vLLM streaming | 
[![](https://img.shields.io/github/stars/FunAudioLLM/Fun-ASR?style=social)](https://github.com/FunAudioLLM/Fun-ASR) | | [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) | Ultra-fast ASR + emotion + audio event detection | [![](https://img.shields.io/github/stars/FunAudioLLM/SenseVoice?style=social)](https://github.com/FunAudioLLM/SenseVoice) | | [FunClip](https://github.com/modelscope/FunClip) | AI video clipping powered by speech recognition | [![](https://img.shields.io/github/stars/modelscope/FunClip?style=social)](https://github.com/modelscope/FunClip) | ## Disclaimer The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.

LLM Tools & Chat UIs Audio Editing & DAW

21.5K Github Stars