
speaches

Open source, MIT, Python. 3.4K stars, 400 forks, 136 issues, 30 watchers; last commit 1 week ago.

About speaches

Speaches is an OpenAI API-compatible server for streaming transcription, translation, and speech generation. Speech-to-Text is powered by faster-whisper, while Text-to-Speech uses piper and Kokoro models (Kokoro is ranked first in the TTS Arena). The project aims to be the Ollama equivalent for TTS and STT models. Because it is compatible with the OpenAI API, existing tools and SDKs should work with it unchanged. It supports audio generation through a chat completions endpoint for tasks like spoken audio summaries and sentiment analysis. Streaming transcription delivers results via server-sent events as the audio is processed, so you do not need to wait for the full transcription. Models are loaded on demand and unloaded after a period of inactivity. It runs on both GPU and CPU and can be deployed with Docker Compose or standard Docker images. A Realtime API is available for interactive use cases. The server is highly configurable.

Platforms

Web Self-hosted Docker

Languages

Python

Speaches

speaches is an OpenAI API-compatible server supporting streaming transcription, translation, and speech generation. Speech-to-Text is powered by faster-whisper, and Text-to-Speech uses piper and Kokoro. This project aims to be Ollama, but for TTS/STT models.

See the documentation for installation instructions and usage: speaches.ai

Features:

  • OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with speaches.
  • Audio generation via the chat completions endpoint (see the OpenAI documentation):
    • Generate a spoken audio summary of a body of text (text in, audio out)
    • Perform sentiment analysis on a recording (audio in, text out)
    • Async speech to speech interactions with a model (audio in, audio out)
  • Streaming support: transcription is sent via SSE as the audio is transcribed, so you don't need to wait for the whole recording to be transcribed before receiving results.
  • Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
  • Text-to-Speech via Kokoro (ranked #1 in the TTS Arena) and piper models.
  • GPU and CPU support.
  • Deployable via Docker Compose / Docker
  • Realtime API
  • Highly configurable
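
The streaming bullet above can be illustrated from the client side. This is a minimal sketch, assuming the server emits standard server-sent events whose payload arrives on `data:` lines. The payload shape and any end-of-stream sentinel are assumptions here, not taken from the speaches documentation, and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event found in `lines`.

    Basic SSE framing: `data:` lines accumulate, and a blank line
    dispatches the event. Comment lines (starting with ':') and other
    fields (event:, id:, retry:) are ignored in this sketch.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line marks the end of one event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient choice: flush a trailing event that lacks its final blank
        # line. A strict SSE parser would discard it instead.
        yield "\n".join(buffer)


# Hypothetical captured stream: two transcript fragments and a keep-alive.
sample = ["data: Hello\n", "\n", ": keep-alive\n", "data: world\n", "\n"]
print(list(iter_sse_data(sample)))  # prints ['Hello', 'world']
```

In practice the `lines` iterable would come from a streaming HTTP response, so each fragment can be handled as soon as the server sends it.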

Please create an issue if you find a bug, have a question, or want to suggest a feature.

Demos

Realtime API

https://github.com/user-attachments/assets/457a736d-4c29-4b43-984b-05cc4d9995bc

(Excuse the breathing lol. Didn't have enough time to record a better demo)

Streaming Transcription

TODO

Speech Generation

https://github.com/user-attachments/assets/0021acd9-f480-4bc3-904d-831f54c4d45b
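
Because the server follows the OpenAI API, a speech generation request can be sketched with the standard library alone. The example below only builds the request for the OpenAI-style `/v1/audio/speech` endpoint. The base URL, model ID, and voice name are assumptions for illustration (check the documentation for the values your deployment uses), and `build_speech_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    text: str,
    model: str,
    voice: str,
    response_format: str = "mp3",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: a local server on port 8000, a Kokoro model ID, and a voice.
req = build_speech_request(
    "http://localhost:8000",
    "Hello from speaches!",
    model="speaches-ai/Kokoro-82M-v1.0-ONNX",
    voice="af_heart",
)

# Sending it requires a running server, so this part is left commented out:
# with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
#     f.write(resp.read())
```

The official OpenAI SDK pointed at the same base URL should produce an equivalent request; the standard-library version is used here only to keep the sketch dependency-free.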