Supertonic 3 · open weights · out now

Supertonic 3: Lightning-Fast, On-Device, Multilingual TTS.

A 99M-parameter open-weight text-to-speech model running locally on CPU via ONNX Runtime. No GPU. No cloud. No API.

  • 31 Languages
  • 99M Params
  • CPU Only
  • ONNX Runtime
  • OpenRAIL-M
Try it on Hugging Face

Open weights. Runs in your browser or fully offline on your machine.

Try voice cloning

Bring your own voice and hear it speak 31 languages.

Voice cloning currently runs on Supertonic 2; the Supertonic 3 upgrade is rolling out soon.

31 languages

Speaks 31 languages

One 99M-parameter model. No per-language fine-tuning. No GPU.

Highlighted languages have audio samples below.

Listening samples

Hear it next to the giants.

Same input text, same reference voice prompt, three systems. Supertonic 3 (ours) runs 99M params on CPU; OmniVoice and Chatterbox Multilingual are 5–8× larger and run on a GPU.

  • Supertonic 3 (ours, 99M · CPU)
  • Chatterbox Multilingual (500M · GPU)
  • OmniVoice (800M · GPU)
  • Reference prompt voice
Voice Builder

Want to hear your own voice?

Supertonic supports zero-shot voice cloning in Voice Builder — record or upload a short reference and synthesize across 31 languages.

Rollout: Voice Builder currently runs on Supertonic 2. The Supertonic 3 upgrade is rolling out soon; you'll get the speed and 31-language coverage automatically.
Speed benchmark

GPU-class speed without a GPU.

RTF (real-time factor) measures how long synthesis takes per second of audio — lower is faster. ×RT is the inverse. Supertonic 3 reaches parity with an 800M-parameter GPU baseline while running on a 16-thread CPU.

N = 30 · same machine, same text, same reference voices
Model                     Hardware           Params   N    Synth      Audio      RTF ↓    ×RT ↑
Supertonic 3              CPU (16 threads)   99M      30   57.99 s    289.92 s   0.200    5.00×
OmniVoice                 RTX 3090           800M     30   53.90 s    275.17 s   0.196    5.11×
Chatterbox Multilingual   RTX 3090           500M     30   199.70 s   252.68 s   0.790    1.27×
  • 8× smaller than OmniVoice (99M vs 800M params)
  • 5× smaller than Chatterbox Multilingual (99M vs 500M params)
  • RTF parity with the 800M GPU baseline, but on CPU

Synthesis throughput (×RT, higher is better)

Seconds of speech produced per second of wall-clock time, across the same 30 inputs.

  • Supertonic 3 (CPU · 99M): 5.00×
  • OmniVoice (GPU · 800M): 5.11×
  • Chatterbox (GPU · 500M): 1.27×
Methodology
  • N = 30 samples (same set published in ./samples/ on this page).
  • Mean audio duration ≈ 9.66 s per sample.
  • Single machine. Identical text, identical reference voice prompts across all three systems.
  • Supertonic 3 timed on CPU with 16 threads via ONNX Runtime. Baselines timed on a single RTX 3090.
  • CPU model: (to be filled in).
  • RTF = synthesis time ÷ audio duration. ×RT = 1 ÷ RTF.
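The RTF and ×RT columns follow directly from the synthesis and audio totals in the table above; a quick sanity check in Python (numbers copied from the benchmark table):

```python
# Recompute RTF and ×RT from the published totals: (synthesis seconds, audio seconds).
rows = {
    "Supertonic 3": (57.99, 289.92),
    "OmniVoice": (53.90, 275.17),
    "Chatterbox Multilingual": (199.70, 252.68),
}

for name, (synth_s, audio_s) in rows.items():
    rtf = synth_s / audio_s   # RTF: synthesis time per second of audio, lower is faster
    xrt = audio_s / synth_s   # ×RT = 1 / RTF: seconds of speech per wall-clock second
    print(f"{name}: RTF {rtf:.3f}, {xrt:.2f}x RT")
```

Running this reproduces the table: 0.200 / 5.00× for Supertonic 3, 0.196 / 5.11× for OmniVoice, 0.790 / 1.27× for Chatterbox Multilingual.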
Install / Quickstart

Drop it into your stack.

Officially supported runtimes. Each tab links to working examples in the upstream repo.

# pip install supertonic
from supertonic import TTS

tts = TTS(auto_download=True)

# 1) Default: synthesize English with voice "M1"
style = tts.get_voice_style(voice_name="M1")
wav, duration = tts.synthesize(
    "A gentle breeze moved through the open window.",
    voice_style=style,
    lang="en",
)
tts.save_audio(wav, "output.wav")

# 2) Swap the voice → "M2"
style = tts.get_voice_style(voice_name="M2")

# 3) Swap the language → Japanese
wav, _ = tts.synthesize("こんにちは、世界。", voice_style=style, lang="ja")
tts.save_audio(wav, "output_ja.wav")

Full reference and example scripts: supertonic-py docs.
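The API above takes one string at a time. For long passages, a simple sentence chunker (our own helper sketch, not part of the supertonic package) can feed the documented `synthesize` call piece by piece:

```python
import re

def split_sentences(text, max_chars=200):
    """Split text into sentence-sized chunks no longer than max_chars each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk then goes through the documented API, e.g.:
#   for i, chunk in enumerate(split_sentences(long_text)):
#       wav, _ = tts.synthesize(chunk, voice_style=style, lang="en")
#       tts.save_audio(wav, f"part_{i:03d}.wav")
```

The 200-character default here is an arbitrary illustration; pick a limit that suits your latency and memory budget.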

License

Open weights. Permissive code. Read the fine print.

Model weights OpenRAIL-M

The trained Supertonic 3 model is released under the OpenRAIL-M license. Weights are open and usable commercially, with use-based restrictions (no harm, no impersonation without consent) and an attribution requirement.

Read the model card →

Note: OpenRAIL-M is not equivalent to MIT — it imposes downstream use restrictions. Read the full license text before deploying.

Sample code MIT

The Python package, runtime bindings, and example code in the upstream repo are MIT-licensed. Use, modify, and redistribute freely with attribution.

Read the LICENSE →

Standard MIT terms: no warranty, attribution required, no restrictions on commercial use.