
Train Custom TTS Voices for Free in Google Colab — Full Tutorial

Mar 2025 · 9 min read · TTS Training · Unsloth · Spark TTS · Colab

What You Can Train — The Unsloth TTS Notebooks

The Unsloth team has built a library of fine-tuning notebooks that make training large models dramatically more efficient. Their TTS notebooks bring that same accessibility to speech models — you can train a custom voice in about 15 minutes on Colab's free T4 GPU.

Available models include:

Spark TTS

Compact 0.5B parameter model — fast to train, easy to run locally on 4GB VRAM. Great starting point for beginners.

Orpheus TTS

Highly expressive with emotion tags and zero-shot voice cloning. Great quality, slightly more demanding to run.

Sesame CSM TTS

Conversational speech model with natural turn-taking and dialogue flow — ideal for voice assistant applications.

OuteTTS

Advanced control over audio output — suited for users who want fine-grained tuning of delivery and style.

💡 Which model should you start with? Spark TTS is the recommended starting point. It's a 0.5B parameter model, trains in ~15 minutes on the free T4, and runs comfortably on 4GB VRAM locally. All Unsloth notebooks follow the same structure, so skills transfer directly to the others.

What You'll Need Before Starting

  - A Google account (Colab's free tier with a T4 GPU is enough — no Colab Pro required)
  - A Hugging Face account and an access token with write permissions
  - A dataset of audio clips paired with text transcriptions (covered in Step 2)

Step 1 — Set Up Your Colab Environment

  1. Open the Unsloth Spark TTS Notebook — head to the Unsloth documentation page (link below) and open the Spark TTS Colab notebook. Sign into your Google account if prompted.
  2. Connect to the T4 GPU Runtime — click the connect button in the top-right. Go to Change runtime type and select T4 GPU. This is free — no Colab Pro needed.
  3. Create a Hugging Face Access Token — on Hugging Face, go to Settings → Access Tokens and create a new token with write permissions. Copy it — you'll need it in the next step.
  4. Add Your HF Token to Colab Secrets — in the Colab left sidebar, click the key icon (Secrets). Add a new secret named HF_TOKEN (all caps, underscore). Paste your token and toggle it to be accessible by the notebook.
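Inside the notebook, the token is read from Colab's Secrets store. If you later adapt any cells to run outside Colab, a small wrapper like this one (a sketch, not part of the Unsloth notebook) falls back to an environment variable:

```python
import os

def get_hf_token():
    """Return the Hugging Face token, preferring Colab's Secrets store.

    Outside Colab (where google.colab is unavailable), fall back to the
    HF_TOKEN environment variable.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get("HF_TOKEN")
    except ImportError:
        return os.environ.get("HF_TOKEN")
```

Either way, the secret name must be exactly HF_TOKEN for the notebook to find it.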

Step 2 — Prepare Your Training Dataset

TTS fine-tuning requires a dataset of audio clips paired with their text transcriptions. The format for single-speaker models is typically text and audio columns; multi-speaker models add a source column.
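As a concrete illustration, single-speaker rows carry just those two columns. The `validate_rows` helper below is hypothetical — not part of the Unsloth notebooks — but catching a missing column before upload is cheaper than a failed training run:

```python
# Expected column layout (hypothetical validator, not part of the
# Unsloth notebooks): single-speaker rows need "text" and "audio";
# multi-speaker rows add a "source" column identifying the speaker.
SINGLE_SPEAKER_COLUMNS = {"text", "audio"}
MULTI_SPEAKER_COLUMNS = SINGLE_SPEAKER_COLUMNS | {"source"}

def validate_rows(rows, multi_speaker=False):
    """Raise ValueError if any row lacks a required column."""
    required = MULTI_SPEAKER_COLUMNS if multi_speaker else SINGLE_SPEAKER_COLUMNS
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns: {sorted(missing)}")
    return True

rows = [
    {"text": "Hello there.", "audio": "clips/0001.wav"},
    {"text": "Welcome back to the show.", "audio": "clips/0002.wav"},
]
validate_rows(rows)  # single-speaker layout passes
```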

You have two options: assemble the clips and transcriptions yourself, or use a dedicated tool to automate the process.

The TTS Dataset Creator Tool

To streamline custom dataset creation, The Local Lab built a Python/Gradio tool called the TTS Dataset Creator that automates the process.

💡 Dataset creator availability: The TTS Dataset Creator with a one-click Windows installer (CPU and NVIDIA GPU versions) is available to Patreon members at The Local Lab Patreon.

Uploading Your Dataset to Hugging Face

  1. Create a New Dataset Repository on HF — on Hugging Face, create a new dataset repo. Navigate to Files and versions and upload your .parquet file.
  2. Copy Your Repo Name — copy the repository name in the format your-username/your-dataset-name. You'll paste this into the Colab notebook's data preparation cell.
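A malformed repo id is a common cause of load failures later on. This tiny check (a hypothetical helper, not part of the notebook) verifies the namespace/name shape before you paste it:

```python
import re

# Hypothetical sanity check for the repo id you paste into the notebook;
# Hugging Face repo ids follow the "namespace/name" pattern.
_REPO_ID_RE = re.compile(r"[\w.-]+/[\w.-]+")

def is_valid_repo_id(repo_id):
    """Return True if repo_id looks like your-username/your-dataset-name."""
    return _REPO_ID_RE.fullmatch(repo_id) is not None
```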

Step 3 — Run the Training

  1. Run the Setup Cells — run the first cells in order — they install dependencies and download the base Spark TTS model. Don't skip any cells or interrupt this process.
  2. Load Your Dataset — in the data preparation cell, find the load_dataset() call and paste your Hugging Face repository name inside the parentheses. Run the cell, then run the tokenization cell that follows.
  3. Handle the BF16 Bug (if needed) — if you hit an error related to BF16 detection, install a slightly older Unsloth version:

     pip install unsloth==2025.5.6

     After installing, restart the Colab runtime and re-run the last three import cells before continuing.
  4. Run the Trainer Cell — run the trainer cell to begin fine-tuning. Monitor training loss and VRAM usage as it runs. On the free T4 GPU, Spark TTS typically finishes in ~15 minutes.
  5. Test and Download the Model — once training completes, use the inference cell to test your fine-tuned model. Then download the output files from the outputs/ folder to your local machine.

Running Your Trained Model Locally

Spark TTS runs on as little as 4GB VRAM, making it one of the most accessible local TTS options available. To use your fine-tuned version locally:

  1. Install Spark TTS locally (one-click installer on Patreon, or follow the GitHub repo setup)
  2. Navigate to the Spark TTS models directory
  3. Replace the original model files with the files you downloaded from Colab
  4. Launch Spark TTS — it will now speak in your fine-tuned voice
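Step 3 above — swapping in the files you downloaded from Colab — can be sketched as a small copy helper. Both paths are placeholders, and this is not an official Spark TTS utility:

```python
import shutil
from pathlib import Path

def install_finetuned_model(downloaded_dir, spark_models_dir):
    """Copy the files downloaded from Colab's outputs/ folder into the
    local Spark TTS models directory, overwriting any same-named files.

    Both path arguments are placeholders — point them at your real
    download and model directories.
    """
    src = Path(downloaded_dir)
    dst = Path(spark_models_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.iterdir():
        if f.is_file():
            shutil.copy2(f, dst / f.name)  # copy2 preserves timestamps
    return sorted(p.name for p in dst.iterdir())
```

Back up the original model files first so you can restore the stock voice later.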

🎯 Training tips for better results: The default training parameters work well for a first run. For better results, experiment with more training steps, a lower learning rate, and a larger, more varied dataset. More audio diversity in your training data generally produces more natural, consistent output.

📦 Want to skip the setup?

The Local Lab offers pre-configured AI installer packages so you can get running in minutes, not hours.

Browse the Store →