Guides · 12 min read · Updated 2026-05-19

How to Train Your Own RVC Voice Model

Create a custom AI voice clone by training your own RVC model from scratch. This guide covers everything from dataset preparation to real-time conversion.

What is RVC Training?

RVC (Retrieval-based Voice Conversion) training is the process of teaching a neural network to reproduce a specific voice. You provide audio samples of the target voice, and the model learns its vocal characteristics (timbre, pitch patterns, formant structure) so it can convert any input speech into that voice in real time. Unlike text-to-speech, RVC preserves your natural speech patterns, emotion, and timing while changing only the voice identity.

What You Need Before Training

Training an RVC model requires three things: clean audio data, a training environment, and patience. The quality of your training data is the single biggest factor in model quality.

  • 10-30 minutes of clean, isolated vocal audio (no background music, no noise, no reverb)
  • Audio should be a single speaker with consistent recording conditions
  • WAV format, 44.1kHz or 48kHz sample rate, mono channel
  • A GPU with at least 6GB VRAM (NVIDIA recommended) or a cloud training service
  • RVC training software (Applio, Mangio-RVC, or the original RVC WebUI)
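
The audio requirements above can be checked before training starts. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_clip` and `check_dataset` are hypothetical, and the allowed sample rates and the 10-30 minute window are simply the values from the list. Note that `wave` reads uncompressed PCM WAV files, so other encodings would need a different reader.

```python
import wave
from pathlib import Path

ALLOWED_RATES = {44100, 48000}  # sample rates named in the checklist


def check_clip(path):
    """Return (problems, duration_seconds) for one PCM WAV file."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if rate not in ALLOWED_RATES:
            problems.append(f"sample rate {rate} Hz")
        duration = wav.getnframes() / rate
    return problems, duration


def check_dataset(folder, min_minutes=10, max_minutes=30):
    """Print per-file problems and report whether total length is in range."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        problems, duration = check_clip(path)
        total += duration
        if problems:
            print(f"{path.name}: {', '.join(problems)}")
    minutes = total / 60
    print(f"total: {minutes:.1f} min")
    return min_minutes <= minutes <= max_minutes
```

Calling `check_dataset("dataset/")` on your clip folder lists any files that are not mono or use an unexpected sample rate. It cannot detect noise, reverb, or a second speaker, so you still need to listen to the clips.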

Step 1 — Prepare Your Dataset

Start by collecting clean vocal audio. The best sources are isolated vocal tracks (use our Vocal Remover tool to extract vocals from songs), podcast recordings, audiobook narration, or direct microphone recordings in a treated room. Remove any segments with background noise, music, or other speakers. Split long recordings into 5-15 second clips. The goal is variety — include different pitches, emotions, and speaking styles to give the model a complete picture of the voice.
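
As a rough illustration of the splitting step, the sketch below cuts a long PCM WAV file into fixed-length clips with the standard `wave` module. The function name `split_wav`, the 10-second clip length, and the output naming scheme are assumptions made for this example. It cuts at fixed positions, so a clip can end mid-word; dedicated tools usually cut at pauses instead.

```python
import wave


def split_wav(src_path, out_prefix, clip_seconds=10.0, min_seconds=5.0):
    """Cut a WAV into fixed-length clips.

    A final remainder shorter than min_seconds is dropped so every
    clip stays within the 5-15 second range suggested above.
    """
    written = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        bytes_per_frame = params.sampwidth * params.nchannels
        frames_per_clip = int(clip_seconds * params.framerate)
        min_frames = int(min_seconds * params.framerate)
        index = 0
        while True:
            data = src.readframes(frames_per_clip)
            n_frames = len(data) // bytes_per_frame
            if n_frames == 0 or n_frames < min_frames:
                break  # end of file, or a leftover that is too short
            out_path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(data)
            written.append(out_path)
            index += 1
    return written
```

For example, `split_wav("narration.wav", "clips/narration")` on a 23-second file would write two 10-second clips and discard the last 3 seconds.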

Step 2 — Configure Training Parameters

The key training parameters are epochs (200-500 for most voices), batch size (limited by your VRAM), target sample rate (40kHz or 48kHz), and the feature extractors (RMVPE is recommended for pitch detection, ContentVec for extracting speech content features). Start with default settings and adjust based on results. More epochs mean longer training and potentially better quality, though over-training can cause artifacts.
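
To make these settings concrete, here is a small sketch that keeps them in one place and flags values outside the ranges mentioned in this guide. The class `TrainingConfig`, its field names, and its defaults (including the batch size of 8) are hypothetical and are not the configuration format of Applio, Mangio-RVC, or the RVC WebUI. The step count assumes each clip is used once per epoch and the last partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # Hypothetical field names and defaults, not taken from any RVC tool.
    epochs: int = 300
    batch_size: int = 8  # assumed value; lower it if you run out of VRAM
    sample_rate: int = 40000
    pitch_extractor: str = "rmvpe"
    save_every: int = 50

    def warnings(self):
        """Return notes for values outside the ranges this guide suggests."""
        notes = []
        if not 200 <= self.epochs <= 500:
            notes.append(f"epochs={self.epochs} is outside 200-500")
        if self.sample_rate not in (40000, 48000):
            notes.append(f"sample_rate={self.sample_rate} is not 40000 or 48000")
        if self.batch_size < 1:
            notes.append("batch_size must be at least 1")
        return notes

    def steps_per_epoch(self, num_clips):
        """Training steps per epoch when the last partial batch is kept."""
        return math.ceil(num_clips / self.batch_size)


cfg = TrainingConfig(batch_size=4)
print(cfg.warnings())            # empty list: all values are in range
print(cfg.steps_per_epoch(120))  # 120 clips / 4 per batch = 30 steps
```

Halving the batch size doubles the number of steps per epoch, which is one reason the same epoch count can take noticeably longer on a GPU with less VRAM.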

Step 3 — Train and Evaluate

Launch training and monitor the loss curve. Training typically takes 1-4 hours depending on dataset size and GPU. Save checkpoints every 50 epochs so you can compare quality at different stages. Test each checkpoint by running inference on a sample audio file. Listen for naturalness, clarity, and whether the voice sounds like the target. The best checkpoint is not always the last one — over-trained models can sound robotic or introduce artifacts.
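
When there are many checkpoints, it can help to decide which ones to listen to first. The sketch below lists the checkpoint epochs for a run and orders them by a smoothed loss value. The function names and the per-epoch loss list are assumptions for illustration; real tools log loss in their own formats. A low loss does not guarantee a good-sounding model, so treat the ranking only as a listening order.

```python
def checkpoint_epochs(total_epochs, save_every=50):
    """Epochs at which a checkpoint is saved, e.g. 50, 100, 150, ..."""
    return list(range(save_every, total_epochs + 1, save_every))


def smoothed(values, window=5):
    """Trailing moving average that damps epoch-to-epoch noise."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def rank_checkpoints(losses, save_every=50, window=5):
    """Order checkpoint epochs from lowest to highest smoothed loss.

    losses[i] is assumed to hold the loss logged for epoch i + 1.
    """
    smooth = smoothed(losses, window)
    epochs = checkpoint_epochs(len(losses), save_every)
    return sorted(epochs, key=lambda epoch: smooth[epoch - 1])
```

For a 200-epoch run, `checkpoint_epochs(200)` returns `[50, 100, 150, 200]`, and `rank_checkpoints(losses)` puts the checkpoint with the lowest smoothed loss at the front of that list.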

Step 4 — Import into Echo

Once you have a trained .pth model file, importing it into Echo takes seconds. Open Echo, go to the Voice Models section, and drag your .pth file into the import area. Echo automatically detects the model parameters and makes it available for real-time conversion. You can start using your custom voice immediately in Discord, games, or any voice chat application.

Tips for Better Model Quality

The difference between a good model and a great model comes down to dataset quality. Here are proven tips from the RVC community:

  • Use our Noise Remover tool to clean your training audio before training
  • Include whispered and shouted segments for better dynamic range
  • Remove breaths and silence between phrases for cleaner training
  • Train at 48kHz for the highest quality output
  • Test with both male and female input voices to check conversion quality
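
The silence-removal tip can be sketched with a simple energy threshold. This example assumes the audio is already loaded as a list of 16-bit integer samples; the function names, the 480-sample frame length (10 ms at 48kHz), and the threshold of 300 are assumptions that would need tuning per recording. It makes hard cuts at frame boundaries, which can click, so dedicated editors that pad and crossfade will usually sound cleaner.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silence(samples, frame_len=480, threshold=300.0):
    """Keep only frames whose RMS level is at or above the threshold.

    samples is a sequence of 16-bit integer values (-32768..32767).
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            kept.extend(frame)
    return kept


# One silent frame, one loud frame, one silent frame.
audio = [0] * 480 + [1000] * 480 + [0] * 480
print(len(drop_silence(audio)))  # 480 samples remain: the loud frame
```

Quiet breaths sit close to the level of soft speech, so a threshold that removes them can also remove whispered segments. Check the result by ear before training on it.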

FAQ

How much audio do I need to train a good RVC model?
10-30 minutes of clean vocal audio is ideal. Less than 5 minutes usually produces poor results. More than 30 minutes rarely improves quality significantly and mainly increases training time.
Can I train an RVC model without a GPU?
Training is extremely slow on CPU — a 200-epoch training run that takes 1 hour on GPU could take 24+ hours on CPU. We recommend using a cloud GPU service like Google Colab, Paperspace, or RunPod if you don't have a local GPU.
Is it legal to train an RVC model on someone else's voice?
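It depends on where you live and how you use the model. Fair use is a US copyright doctrine, and the rules covering a person's voice and likeness differ from one jurisdiction to the next, so personal, non-commercial training is not automatically permitted everywhere. Using a cloned voice to impersonate someone, commit fraud, or create misleading content can be illegal. Always use voice models responsibly and ethically.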
What is the difference between RVC v1 and v2 models?
RVC v2 models produce higher quality output with fewer artifacts, especially for cross-gender conversion. They also support 48kHz audio natively. Echo supports both v1 and v2 models.
Can I share my trained RVC model with others?
Yes — .pth model files can be shared freely. The RVC community has thousands of shared models available. However, be mindful of the source audio: if you trained on copyrighted material, sharing the model may have legal implications.
