Building a Local Voice Cloning Studio That Actually Runs on My Machine
I wanted a local voice cloning workflow that felt like a real product, not a notebook demo.
The goal was simple: record or upload a short reference clip, create a reusable voice profile, type arbitrary text, synthesize speech locally, and keep every file on my own machine. No accounts, no cloud voice API, no remote inference.
The result is local-voice-studio: a React and FastAPI application with local profile management, reference audio handling, generation history, diagnostics, and pluggable local TTS engines.
The Product Shape
The app is built around a few core workflows.
You can create multiple voice profiles, each with notes, language preference, tags, saved reference clips, a primary clip, and synthesis defaults. Reference clips can come from browser microphone recording or from uploaded audio files. The backend normalizes every clip to a model-friendly WAV file, stores metadata in SQLite, and keeps the original upload on disk.
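The normalization step can be sketched as a thin FFmpeg wrapper. This is a minimal illustration rather than the app's actual code; the function names and the 16 kHz mono, 16-bit PCM target are assumptions about what "model-friendly WAV" means here.

```python
import subprocess
from pathlib import Path

def build_normalize_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that converts any uploaded clip to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite an existing output file
        "-i", str(src),           # original upload (any container/codec ffmpeg can read)
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to a model-friendly rate
        "-sample_fmt", "s16",     # 16-bit PCM samples
        str(dst),
    ]

def normalize_clip(src: Path, dst: Path) -> Path:
    """Run the conversion; the original upload stays on disk untouched."""
    subprocess.run(build_normalize_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command construction separate from the subprocess call makes the conversion easy to unit-test without FFmpeg installed.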
The generation page is a focused workspace: choose a profile, enter text, optionally provide delivery instructions, tweak advanced settings, generate audio, and then preview or download the WAV result. Every generation is stored in history with status, parameters, input text, output path, and errors if something fails.
The settings page shows the local runtime state: active engine, model name, GPU detection, FFmpeg availability, directories in use, and whether local transcription is available.
Everything lives under the local data/ directory.
Architecture
The backend is FastAPI with a deliberately boring service/repository split.
Routes handle HTTP only. Services coordinate profile management, audio processing, generation jobs, runtime diagnostics, and TTS orchestration. Repositories isolate SQLite persistence. Audio utilities handle normalization and metadata probing through FFmpeg and FFprobe.
The TTS runtime sits behind a TtsEngine interface. The generation service does not care whether the active engine is XTTS or Qwen3. It prepares a profile, builds a synthesis payload, writes the output to data/generated, and records status updates in SQLite.
Synthesis runs as an in-process background job. That is enough for a local desktop-style app and avoids adding Redis, Celery, or other distributed infrastructure too early.
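The pattern can be sketched with the standard library alone. The real app runs its jobs inside the FastAPI process and records status in SQLite; `run_generation`, `create_generation`, and the `jobs` dict here are hypothetical stand-ins for that machinery.

```python
from concurrent.futures import ThreadPoolExecutor
import uuid

executor = ThreadPoolExecutor(max_workers=1)  # one synthesis at a time; no Redis/Celery
jobs: dict[str, str] = {}                     # stand-in for the SQLite-backed history table

def run_generation(job_id: str, text: str) -> None:
    """Worker: record status transitions so the UI can poll for progress."""
    jobs[job_id] = "running"
    try:
        # engine.synthesize(...) would write data/generated/<job_id>.wav here
        jobs[job_id] = "succeeded"
    except Exception as exc:                  # failures land in history, not just logs
        jobs[job_id] = f"failed: {exc}"

def create_generation(text: str) -> str:
    """Route handler body: enqueue the job and return immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = "queued"
    executor.submit(run_generation, job_id, text)
    return job_id
```

The request returns as soon as the job is queued, and the history table is the single source of truth for status.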
The frontend is React, TypeScript, Vite, React Router, and TanStack Query. The UI is desktop-first, dark, and intentionally simple: left navigation, profile cards, reference clip panels, a generation workspace, history, and diagnostics.
The XTTS Lesson
The first implementation used Coqui XTTS v2.
XTTS was a useful baseline because it can do zero-shot voice cloning from reference audio and it is relatively well-known in local TTS experiments. The app normalized clips, cached conditioning artifacts when possible, and used the selected primary clip as the conditioning source.
But in practice, the output did not sound close enough. The voice often drifted into a generic speaker: in one test, the result sounded like a random English male voice rather than the target voice.
That was an important product lesson: a working local model integration is not the same thing as a good cloning experience.
The app was technically generating speech, but the speaker identity was not being preserved.
Why Transcript-Aware Cloning Helped
The breakthrough was moving from audio-only cloning to transcript-aware cloning.
Qwen3-TTS expects the reference audio and the exact text spoken in that reference audio. That small workflow change matters. Instead of asking the model to infer everything from the waveform alone, the app gives it aligned acoustic and linguistic context.
So the profile workflow changed:
- Record or upload a clean reference clip.
- Mark the best clip as primary.
- Save the exact transcript of that clip.
- Generate using Qwen3-TTS.
The app now stores reference_text on each clip. The primary clip transcript becomes part of the conditioning fingerprint, so changing the transcript invalidates old cached prompts. There is also an optional local transcription path using a Whisper-style local ASR dependency, but manual transcript entry is always available and often better if the user knows exactly what they said.
This made the system feel much closer to the Voicebox-style workflow that worked well in practice.
UX Details That Matter
Local AI workflows often have long, silent waits. That is bad UX.
The app now shows loading feedback where it matters:
- The "Transcribe locally" button shows a spinner while transcription is running.
- The transcription button is disabled while the action is in progress, preventing accidental duplicate requests.
- The "Generate audio" button shows a spinner while the job is starting or running.
- The generate page warns if the selected primary clip is missing its transcript.
These are small details, but they make the app feel more trustworthy. When local models take time to load, users need visible progress.
Running It Locally
The recommended path is Qwen3-TTS.
From the repository root:
powershell -ExecutionPolicy Bypass -File .\scripts\setup.ps1
& .\.venv\Scripts\python.exe -m pip install -e ".\backend[qwen,transcription]"
Start the backend:
$env:LVS_TTS_ENGINE = "qwen3"
powershell -ExecutionPolicy Bypass -File .\scripts\dev-backend.ps1
Start the frontend:
powershell -ExecutionPolicy Bypass -File .\scripts\dev-frontend.ps1
Then open:
http://localhost:5173
If another local app is already using port 8000, run the backend on 8010 and point the frontend proxy at it:
$env:LVS_TTS_ENGINE = "qwen3"
$env:LVS_BACKEND_PORT = "8010"
powershell -ExecutionPolicy Bypass -File .\scripts\dev-backend.ps1
$env:VITE_API_TARGET = "http://127.0.0.1:8010"
powershell -ExecutionPolicy Bypass -File .\scripts\dev-frontend.ps1
What I Would Improve Next
The next improvements are mostly product polish and runtime hardening.
The app should make the active engine more visible in the main UI. Profiles should get a clear "Qwen-ready" badge when their primary clip has a transcript. The transcription workflow could show progress beyond a spinner, especially on first model load. The settings page could provide stronger guidance when the environment is missing Qwen, Whisper, CUDA, FFmpeg, or model weights.
Long-form generation also needs more work. The app currently supports normal generation jobs, but paragraph-aware chunking, concatenation, and better history grouping would make it more useful for narration.
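One possible shape for that chunking, sketched under the assumption that the engine has a per-request character budget (the function name and the 400-character default are made up for illustration): split on blank lines, then pack whole paragraphs into chunks that stay under the budget.

```python
def chunk_paragraphs(text: str, max_chars: int = 400) -> list[str]:
    """Pack paragraphs into chunks no longer than max_chars, never splitting a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        # +2 accounts for the blank-line separator rejoined between paragraphs
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then become one synthesis job, with the resulting WAVs concatenated and grouped under a single history entry.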
Finally, Qwen and XTTS have different Python/runtime expectations. A future version could support separate engine environments or a small process boundary so users do not have to fit every model into one Python virtual environment.
Takeaway
The biggest lesson was that local-first voice cloning is not just about wiring up a model.
The workflow matters.
For this project, XTTS proved the architecture and local storage model. Qwen3-TTS made the voice identity closer by adding the missing reference transcript. The resulting app is still early, but it now has the shape of a real local voice studio: profiles, clips, transcripts, generation history, runtime diagnostics, and no cloud dependency.