Model card

Kokoro v1.0.

Hexgrad · Text-to-speech · 82M params · Lightweight autoregressive · Open source

82M params, Apache 2.0. Runs on CPU.

§ 01 · Card

Model card, inline.

Rendered server-side from the upstream README on Hugging Face — same content as the source repo, with editorial typography. The full card, sample weights, and revision history live on HF.


Source: hexgrad/Kokoro-82M
License: apache-2.0
Pipeline: text-to-speech

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

<audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/samples/HEARME.wav" type="audio/wav"></audio>

🐈 GitHub: https://github.com/hexgrad/kokoro

🚀 Demo: https://hf.co/spaces/hexgrad/Kokoro-TTS

[!NOTE] As of April 2025, the market rate of Kokoro served over API is under $1 per million characters of text input, or under $0.06 per hour of audio output. (On average, 1000 characters of input is about 1 minute of output.) Sources: ArtificialAnalysis/Replicate at 65 cents per M chars and DeepInfra at 80 cents per M chars. This is an Apache-licensed model, and Kokoro has been deployed in numerous projects and commercial APIs. We welcome the deployment of the model in real use cases.

[!CAUTION] Fake websites like kokorottsai_com (snapshot: https://archive.ph/nRRnk) and kokorotts_net (snapshot: https://archive.ph/60opa) are likely scams masquerading under the banner of a popular model. Any website containing "kokoro" in its root domain (e.g. kokorottsai_com, kokorotts_net) is NOT owned by and NOT affiliated with this model page or its author, and attempts to imply otherwise are red flags.
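
The per-hour figure in the pricing note above follows from simple arithmetic. A minimal sketch, assuming the note's rule of thumb that 1000 characters of input yield about 1 minute of audio; the function name `usd_per_audio_hour` is illustrative and not part of any Kokoro API:

```python
# Rule of thumb from the pricing note: ~1000 input characters per minute of audio.
CHARS_PER_AUDIO_MINUTE = 1000


def usd_per_audio_hour(usd_per_million_chars: float) -> float:
    """Convert a per-million-characters price into a per-hour-of-audio price."""
    chars_per_hour = CHARS_PER_AUDIO_MINUTE * 60
    return usd_per_million_chars * chars_per_hour / 1_000_000


# Rates quoted in the note (April 2025), plus the $1 upper bound.
quoted_rates = [
    ("upper bound", 1.00),
    ("ArtificialAnalysis/Replicate", 0.65),
    ("DeepInfra", 0.80),
]
for provider, rate in quoted_rates:
    print(f"{provider}: ${usd_per_audio_hour(rate):.3f} per hour of audio")
```

At $1 per million characters this works out to $0.06 per hour of audio, which matches the bound stated in the note.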

Releases

| Model | Published | Training Data | Langs & Voices | SHA256 |
| ----- | --------- | ------------- | -------------- | ------ |
| v1.0 | 2025 Jan 27 | Few hundred hrs | **8 & 54** | 496dba11 |
| v0.19 | 2024 Dec 25 | <100 hrs | 1 & 10 | 3b0c392f |

| Training Costs | v0.19 | v1.0 | Total |
| -------------- | ----- | ---- | ----- |
| in A100 80GB GPU hours | 500 | 500 | 1000 |
| average hourly rate | $0.80/h | $1.20/h | $1/h |
| in USD | $400 | $600 | $1000 |
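
The totals in the training cost table can be checked directly from the per-release GPU hours and hourly rates. A small sketch using only the numbers listed above; the variable names are illustrative:

```python
# GPU hours and average hourly rates as listed in the training cost table.
runs = {
    "v0.19": {"gpu_hours": 500, "usd_per_hour": 0.80},
    "v1.0": {"gpu_hours": 500, "usd_per_hour": 1.20},
}

# Cost per release is GPU hours multiplied by the average hourly rate.
costs = {name: run["gpu_hours"] * run["usd_per_hour"] for name, run in runs.items()}

total_hours = sum(run["gpu_hours"] for run in runs.values())
total_cost = sum(costs.values())

# The "Total" hourly rate is the blended rate across both releases.
blended_rate = total_cost / total_hours

for name, cost in costs.items():
    print(f"{name}: ${cost:.0f}")
print(f"Total: {total_hours} GPU hours, ${total_cost:.0f}, ${blended_rate:.2f}/h")
```

Because both releases used the same number of GPU hours, the blended rate is the plain average of the two hourly rates.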

Usage

You can run this basic cell on Google Colab. Listen to samples. For more languages and details, see Advanced Usage.

```py
# Install dependencies (Colab/Jupyter cell syntax). The version specifier is
# quoted so the shell does not treat ">=" as an output redirect.
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

# lang_code='a' selects American English
pipeline = KPipeline(lang_code='a')

# [Kokoro](/kˈOkəɹO/) supplies an explicit phoneme pronunciation for the word
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters.
Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient.
With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''

generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)  # index, graphemes (text), phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)  # one 24 kHz WAV file per segment
```
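
The usage cell above writes every segment at a fixed 24000 Hz sample rate, so a segment's duration is its sample count divided by 24000. A minimal sketch, assuming each yielded `audio` is a one-dimensional sequence of samples that supports `len()`; `segment_seconds` and `total_seconds` are hypothetical helpers, not part of the `kokoro` package, and the data below is a stand-in for real pipeline output:

```python
SAMPLE_RATE = 24000  # sample rate used by the usage cell


def segment_seconds(audio, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of one audio segment in seconds."""
    return len(audio) / sample_rate


def total_seconds(segments, sample_rate: int = SAMPLE_RATE) -> float:
    """Sum the durations of (graphemes, phonemes, audio) tuples."""
    return sum(segment_seconds(audio, sample_rate) for _, _, audio in segments)


# Stand-in segments shaped like the tuples in the usage cell:
# silent audio of 24000 and 12000 samples respectively.
fake_segments = [
    ("first segment text", "placeholder phonemes", [0.0] * 24000),
    ("second segment text", "placeholder phonemes", [0.0] * 12000),
]
print(f"{total_seconds(fake_segments)} seconds of audio")
```

Combined with the rule of thumb of roughly 1000 characters per minute, this gives a quick sanity check that a synthesis run produced about as much audio as expected.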

Under the hood, kokoro uses `misaki`, a G2P library at https://github.com/hexgrad/misaki

Model Facts

Architecture:

  • StyleTTS 2: https://arxiv.org/abs/2306.07691
  • ISTFTNet: https://arxiv.org/abs/2203.02395
  • Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: Multiple

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
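
A downloaded weights file can be checked against the hash above with Python's standard `hashlib`. This is a sketch, not an official verification tool: the local filename `kokoro-v1_0.pth` is an assumption, so substitute the path of whatever file you downloaded.

```python
import hashlib
from pathlib import Path

# Full hash from the Model Facts section; the Releases table shows its first 8 characters.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    weights = Path("kokoro-v1_0.pth")  # assumed local filename, adjust as needed
    if weights.exists():
        print("hash OK" if matches_expected(weights) else "hash MISMATCH")
    else:
        print(f"{weights} not found")
```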

Training Details

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:

  • Public domain audio
  • Audio licensed under Apache, MIT, etc
  • Synthetic audio [1] generated by closed [2] TTS models from large providers

[1] https://copyright.gov/ai/aipolicyguidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"

Total Dataset Size: A few hundred hours of audio

Total Training Cost: About $1000 for 1000 GPU hours on A100 80GB

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| ---------- | ------------- | ------- | --------------------------- |
| Koniwa tnc | <1h | CC BY 3.0 | v0.19 / 22 Nov 2024 |
| SIWIS | <11h | CC BY 4.0 | v0.19 / 22 Nov 2024 |

Acknowledgements

  • 🛠️ @yl4579 for architecting StyleTTS 2.
  • 🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
  • 📊 Thank you to everyone who contributed synthetic training data.
  • ❤️ Special thanks to all compute sponsors.
  • 👾 Discord server: https://discord.gg/QuGxSWBfQy
  • 🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". It is also the name of an AI in the Terminator franchise.

<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />

Card content reproduced from huggingface.co/hexgrad/Kokoro-82M under the upstream license. Rendering trims fenced HTML, raw widgets and tables for safety; tap the link for the untouched original.
§ 02 · Benchmarks

No recorded benchmark results yet.

This model is in the registry but doesn’t have any benchmark_results rows yet. If you have a score, submit it →
