Takes a Gurudev Sri Sri Ravi Shankar meditation video in English and produces a fully dubbed German audio track with:
# Server: Ubuntu with 16GB+ RAM, no GPU needed
# Python 3.12+
# Audio processing
pip install numpy scipy soundfile
pip install faster-whisper # Transcription
pip install "audio-separator[cpu]" # UVR5 vocal separation (better than Demucs); quoted so shells like zsh do not expand the brackets
# APIs (keys needed)
pip install anthropic # Claude Sonnet 4 for translation
# ElevenLabs API key # Voice cloning + TTS
# DashScope API key # Gurudev voice clone (registered voice IDs)
# Audio tools
apt install ffmpeg
Extract WAV files from source video.
ffmpeg -y -i source.mp4 -ar 44100 -ac 2 audio_44k_stereo.wav # Processing quality
ffmpeg -y -i source.mp4 -ar 16000 -ac 1 audio_16k_mono.wav # Whisper input
Output: audio_44k_stereo.wav, audio_16k_mono.wav
Separate vocals from instrumental (flute/background music) using UVR5 MDX-Net.
This gives us a clean instrumental track with zero vocals.
from audio_separator.separator import Separator
sep = Separator(output_dir="uvr_out", output_format="wav")
sep.load_model("UVR-MDX-NET-Inst_HQ_3.onnx")
sep.separate("audio_44k_stereo.wav")
Output: (Instrumental).wav (flute/music only), (Vocals).wav (all voice: English + Om + Sanskrit)
Why UVR5, not Demucs:
Transcribe English speech using faster-whisper with meditation-optimized settings.
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio_16k_mono.wav",
    language="en",
    initial_prompt="Meditation guidance by Gurudev Sri Sri Ravi Shankar. Terms: Om, pranayama, samadhi...",
    no_speech_threshold=0.45,  # Lower for soft meditation speech
    compression_ratio_threshold=2.8,  # Higher for repetitive meditation phrases
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=800),
    word_timestamps=True,
)
Human review required: Verify transcription at flagged timestamps. Common corrections:
QC Gate 1 checks:
Auto-fix on fail: Remove hallucinated segments, re-run with stricter VAD.
Output: segments_corrected.json (id, start, end, duration, text per segment)
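As a rough illustration of the output format above, the sketch below turns transcription segments into records with the listed fields (id, start, end, duration, text). The function name `segments_to_records` and the rounding to two decimals are assumptions for illustration. The only thing assumed about the input is that each segment exposes `start`, `end`, and `text` attributes, as faster-whisper segments do.

```python
import json


def segments_to_records(segments):
    """Convert segment objects (with .start, .end, .text) into plain dict records."""
    records = []
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # Skip segments that contain only whitespace
        records.append({
            "id": len(records) + 1,  # Sequential ids over the kept segments
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "duration": round(seg.end - seg.start, 2),
            "text": text,
        })
    return records


def records_to_json(records):
    """Serialize records; ensure_ascii=False keeps non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Writing the result of `records_to_json(...)` to `segments_corrected.json` would then give human reviewers a file they can edit directly.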
Full-scan the entire audio for Om and chanting. Do NOT hardcode locations — scan automatically.
# Scan for sustained high-energy non-speech events
# Pass 1: Om detection (>-20dB, >3s, not speech)
# Pass 2: Chanting detection (>-30dB, merge within 30s for phrase gaps)
Key insight: Whisper reports an Om as a roughly 0.5s event at a single timestamp, while the Om actually resonates for 10+ seconds. Energy analysis finds the true boundaries.
A typical meditation has 3 Oms (at the beginning of the meditation, before deepening, and before a section transition). Sanskrit chanting is one long block (60-80s).
Output: sacred_segments_verified.json
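The two scan passes described above can be sketched with a simple frame-energy detector. This is a minimal sketch, not the pipeline's actual code: the function names, the 100 ms frame size, and the use of RMS level relative to full scale are assumptions. The "not speech" test is left out; it would need a cross-check against the transcript segments.

```python
import numpy as np


def find_sustained_regions(samples, sr, threshold_db=-20.0, min_dur_s=3.0, frame_s=0.1):
    """Return (start_s, end_s) spans whose frame RMS stays above threshold_db
    for at least min_dur_s. Levels are in dB relative to full scale (1.0)."""
    frame_len = int(sr * frame_s)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))  # Floor avoids log(0)
    active = level_db > threshold_db

    regions = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # A loud run begins
        elif not is_active and start is not None:
            if (i - start) * frame_s >= min_dur_s:
                regions.append((start * frame_s, i * frame_s))
            start = None
    # Close a run that lasts until the end of the audio
    if start is not None and (n_frames - start) * frame_s >= min_dur_s:
        regions.append((start * frame_s, n_frames * frame_s))
    return regions


def merge_close_regions(regions, max_gap_s=30.0):
    """Merge regions separated by at most max_gap_s (phrase gaps in chanting)."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Pass 1 would call `find_sustained_regions(mono, sr, threshold_db=-20.0, min_dur_s=3.0)`. Pass 2 would use `threshold_db=-30.0` and then `merge_close_regions(..., max_gap_s=30.0)`; the minimum duration for chanting is not stated in this document and would need tuning.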
Check if any speech segment overlaps a sacred region. If so, automatically split the sacred region around the speech.
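Example: the Om is detected at 54-65s and its sacred region spans 42.1-66.3s, but Seg 3 "Another deep breath in" falls at 50.5-52.7s, inside that region.
→ Split into: sacred 42.1-50.0s (before speech) + sacred 53.2-66.3s (after speech), leaving a 0.5s pad on each side of the speech.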
QC Gate 1.6 checks:
Auto-fix on fail: Split sacred region, pad 0.5s gap, retry.
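The split-and-pad rule can be expressed in a few lines. The sketch below is illustrative only; `split_sacred_region` is a hypothetical name. It reproduces the split from the example, which implies a sacred region of 42.1-66.3s around speech at 50.5-52.7s.

```python
def split_sacred_region(region, speech, pad_s=0.5):
    """Split a sacred (start, end) region around a speech (start, end) span.

    Returns the remaining sacred parts, keeping pad_s seconds of clearance
    on both sides of the speech. A region with no overlap is returned as-is.
    """
    r_start, r_end = region
    s_start, s_end = speech
    if s_end <= r_start or s_start >= r_end:
        return [region]  # No overlap, nothing to split

    parts = []
    before_end = round(s_start - pad_s, 3)
    after_start = round(s_end + pad_s, 3)
    if before_end > r_start:
        parts.append((r_start, before_end))  # Sacred audio before the speech
    if after_start < r_end:
        parts.append((after_start, r_end))   # Sacred audio after the speech
    return parts


print(split_sacred_region((42.1, 66.3), (50.5, 52.7)))
```

Running the split once per overlapping speech segment, then re-checking for overlaps, matches the "split, pad, retry" behaviour described for this gate.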
Analyze HOW Gurudev says each segment in the original audio. This drives per-segment ElevenLabs voice settings.
Audio features extracted per segment:
Maps to emotion labels:
| Label | When | ElevenLabs stability | ElevenLabs style |
|---|---|---|---|
| `soft_warm` | Opening, settling | 0.45 (steady) | 0.20 (subtle) |
| `deep_resonant` | Om transitions | 0.50 (very steady) | 0.15 (minimal) |
| `gentle_guiding` | Standard instructions | 0.40 | 0.25 |
| `uplifting` | Smile, affirmation | 0.30 (varied) | 0.40 (expressive) |
| `releasing` | Letting go, breathe out | 0.25 (most varied) | 0.45 (most expressive) |
| `peaceful_close` | Coming back | 0.50 (steady) | 0.15 (minimal) |
QC Gate 1.7 checks:
Auto-fix on fail: Widen extraction window, interpolate sharp transitions.
Output: segment_prosody.json (per-segment features + emotion label + ElevenLabs settings)
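The label table above translates directly into a lookup. In this sketch the stability and style values are copied from the table. The `similarity_boost` of 0.75 is taken from the example settings shown in the ElevenLabs step, and the fallback to `gentle_guiding` for unknown labels is an assumption.

```python
# Stability and style per emotion label, copied from the table above
EMOTION_SETTINGS = {
    "soft_warm":      {"stability": 0.45, "style": 0.20},
    "deep_resonant":  {"stability": 0.50, "style": 0.15},
    "gentle_guiding": {"stability": 0.40, "style": 0.25},
    "uplifting":      {"stability": 0.30, "style": 0.40},
    "releasing":      {"stability": 0.25, "style": 0.45},
    "peaceful_close": {"stability": 0.50, "style": 0.15},
}

DEFAULT_LABEL = "gentle_guiding"  # Assumed fallback for unknown labels
SIMILARITY_BOOST = 0.75           # Assumed constant across all segments


def settings_for(label):
    """Return ElevenLabs voice settings for an emotion label."""
    base = EMOTION_SETTINGS.get(label, EMOTION_SETTINGS[DEFAULT_LABEL])
    return {
        "stability": base["stability"],
        "similarity_boost": SIMILARITY_BOOST,
        "style": base["style"],
    }
```

For example, `settings_for("releasing")` yields the most varied and expressive settings in the table.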
Send ALL segments to Claude Sonnet 4 in ONE call with full meditation context.
Why batch, not per-segment: "Another deep breath in" appears 4 times. Per-segment translation gives identical German 4 times. Batch translation with context lets Claude vary them naturally across the meditation arc.
Claude prompt includes:
3 iterations:
QC Gate 2 checks:
Auto-fix on fail: Re-run Claude with specific failure injected in prompt.
Output: translations_de_context_v3.json (id, english, german, variation_note per segment)
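To make the single-call approach concrete, the sketch below builds one prompt containing every segment and validates the reply. The prompt wording, the JSON reply format, and both function names are assumptions for illustration, not the pipeline's actual prompt. Sending the prompt to Claude is left out.

```python
import json


def build_batch_prompt(segments, context):
    """Build one prompt that carries all segments so repeats can be varied."""
    numbered = [{"id": s["id"], "english": s["text"]} for s in segments]
    return (
        f"{context}\n\n"
        "Translate every segment below into German using the informal du-form. "
        "Where the same English line repeats, vary the German naturally across "
        "the meditation. Reply with only a JSON array of objects with the keys "
        "id, german, variation_note.\n\n"
        + json.dumps(numbered, ensure_ascii=False, indent=2)
    )


def parse_batch_reply(reply_text, segments):
    """Parse the model's JSON reply and check that every segment was translated."""
    rows = {row["id"]: row for row in json.loads(reply_text)}
    missing = [s["id"] for s in segments if s["id"] not in rows]
    if missing:
        raise ValueError(f"Missing translations for segment ids: {missing}")
    return [
        {
            "id": s["id"],
            "english": s["text"],
            "german": rows[s["id"]]["german"],
            "variation_note": rows[s["id"]].get("variation_note", ""),
        }
        for s in segments
    ]
```

A `ValueError` from `parse_batch_reply` is the kind of specific failure that could be injected into the prompt on the next iteration.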
Generate all segments with the registered Gurudev voice clone. This produces accurate voice identity but flat/emotionless delivery.
# DashScope API
MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_ID = "qwen-tts-vc-gurudev-voice-20260321233012828-6e25" # Bhakti Sutras clone
# POST to: https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
Output: segments_dashscope/gurudev_vc/seg_001-015.wav
Clone the DashScope output into ElevenLabs, then re-generate each segment with per-segment emotion settings from Step 1.7.
# 1. Concatenate all DashScope WAVs → reference audio (30-60s)
# 2. Create ElevenLabs voice clone from reference
# 3. Generate each segment with tailored settings:
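for seg in segments:
    settings = prosody_map[seg["id"]]  # From segment_prosody.json
    # e.g., {"stability": 0.25, "similarity_boost": 0.75, "style": 0.45}
    german_text = seg["german"]  # Assumes each segment carries its German text from the translation step
    output_path = f"segments_elevenlabs/seg_{seg['id']:03d}.mp3"  # Assumes integer ids
    elevenlabs_tts(german_text, output_path, voice_id, settings)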
Slow down 8% after generation (atempo=0.92) for meditation pacing.
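The slowdown can be applied per segment with ffmpeg's `atempo` filter. The helper below only builds the command line and is a sketch; the function names are hypothetical. Note that a tempo of 0.92 lengthens each clip by a factor of 1/0.92 (about 8.7%), which matters when checking that a slowed segment still fits its time slot.

```python
def atempo_command(src, dst, tempo=0.92):
    """Build an ffmpeg command that changes tempo without changing pitch."""
    if not 0.5 <= tempo <= 2.0:
        # Staying inside the range every ffmpeg build accepts for one atempo stage
        raise ValueError("tempo must be between 0.5 and 2.0")
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst]


def slowed_duration(duration_s, tempo=0.92):
    """Duration of a clip after the tempo change."""
    return duration_s / tempo
```

The command list can be passed to `subprocess.run(cmd, check=True)`.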
QC Gate 3 checks:
QC Gate 3.5 checks:
Auto-fix on fail: Regenerate missing/bad segments.
Output: segments_elevenlabs/seg_001-015.mp3
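The gate summary table lists the level check for Gate 3.5 as "within 15dB of original". One way to implement such a check is to compare RMS levels, as sketched below. Using RMS (rather than LUFS or peak) and the function names are assumptions.

```python
import numpy as np


def rms_db(samples):
    """RMS level in dB relative to full scale (1.0)."""
    x = np.asarray(samples, dtype=np.float64)
    rms = float(np.sqrt(np.mean(x ** 2))) if x.size else 0.0
    return 20.0 * np.log10(max(rms, 1e-10))  # Floor avoids log(0)


def check_level(tts, original, max_diff_db=15.0):
    """Return (passed, diff_db), where diff_db = TTS level minus original level."""
    diff_db = rms_db(tts) - rms_db(original)
    return bool(abs(diff_db) <= max_diff_db), float(diff_db)


def apply_gain_db(samples, gain_db):
    """Scale samples by a gain given in dB."""
    return np.asarray(samples, dtype=np.float64) * (10.0 ** (gain_db / 20.0))
```

If the check fails, applying `apply_gain_db(tts, -diff_db)` brings the TTS segment to the same RMS level as the original speech, which is one reading of the "Adjust ratio" auto-fix.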
Build final audio from 3 layers. No English anywhere.
Layer 0: UVR5 Instrumental stem (flute + background music, zero vocals)
Layer 1: Sacred audio (Om + Sanskrit from UVR5 vocal stem at verified timestamps)
Layer 2: German TTS (level-matched, slowed 8%, crossfaded) at speech timestamps
Mix: Duck instrumental 6dB during German speech for clarity.
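Ducking can be implemented as a gain envelope multiplied onto the instrumental stem. The sketch below is one possible approach: the moving-average smoothing and the 200 ms fade length are assumptions, chosen so the gain change does not click.

```python
import numpy as np


def ducking_gain(n_samples, sr, speech_spans, duck_db=6.0, fade_s=0.2):
    """Per-sample gain: 1.0 outside speech, reduced by duck_db inside speech.

    speech_spans is a list of (start_s, end_s) tuples. The step edges are
    smoothed with a moving average of length fade_s.
    """
    gain = np.ones(n_samples)
    ducked = 10.0 ** (-duck_db / 20.0)  # 6 dB down is a factor of about 0.5
    for start_s, end_s in speech_spans:
        a = max(0, int(start_s * sr))
        b = min(n_samples, int(end_s * sr))
        gain[a:b] = ducked

    fade_len = max(1, int(fade_s * sr))
    kernel = np.ones(fade_len) / fade_len
    # Edge padding keeps the output the same length as the input
    left = fade_len // 2
    right = fade_len - 1 - left
    padded = np.pad(gain, (left, right), mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

For a stereo stem shaped `(n_samples, 2)`, the envelope is applied as `instrumental * gain[:, None]` before the sacred and German layers are summed in.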
QC Gate 4.5 — Segment Audit:
Auto-fix: Re-check sacred conflicts, re-split if needed, retry assembly.
QC Gate 4.6 — Transition Smoothness:
QC Gate 4.7 — English Bleed:
Auto-fix: Widen crossfades (200ms → 400ms → 500ms), retry.
Output: meditation_german_final.wav
Normalize loudness to -16 LUFS (true peak -1 dBTP) and run the final quality check.
# Two-pass LUFS normalization
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" -f null - # Measure
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=....:linear=true" -ar 44100 output.wav # Apply
ffmpeg -i output.wav -codec:a libmp3lame -b:a 192k output.mp3 # MP3
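The measured values for the second pass come from the JSON that the first pass prints to stderr. The sketch below shows one way to carry them over. The function names are hypothetical. The JSON keys (`input_i`, `input_tp`, `input_lra`, `input_thresh`, `target_offset`) are the ones the `loudnorm` filter reports with `print_format=json`.

```python
import json


def parse_loudnorm_stats(stderr_text):
    """Extract the JSON object that loudnorm prints at the end of stderr."""
    start = stderr_text.rfind("{")
    end = stderr_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No loudnorm JSON found in ffmpeg output")
    return json.loads(stderr_text[start : end + 1])


def second_pass_filter(stats, target_i=-16, target_tp=-1, target_lra=11):
    """Build the loudnorm filter string for the second (apply) pass."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={stats['input_i']}"
        f":measured_TP={stats['input_tp']}"
        f":measured_LRA={stats['input_lra']}"
        f":measured_thresh={stats['input_thresh']}"
        f":offset={stats['target_offset']}"
        ":linear=true"
    )
```

The first pass can be run with `subprocess.run(..., capture_output=True, text=True)`; its `stderr` attribute is the text to pass to `parse_loudnorm_stats`.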
QC Gate 5 checks:
Auto-fix:
Output: meditation_german_final.wav + meditation_german_final.mp3
| # | Gate | After Step | Pass Criteria | Auto-Fix |
|---|---|---|---|---|
| 1 | Transcription | 1 | No hallucinations, >5 segments | Remove bad segments, re-run |
| 2 | Conflict | 1.6 | Zero speech/sacred overlaps | Split sacred regions |
| 3 | Prosody | 1.7 | Features valid, labels consistent | Interpolate, widen windows |
| 4 | Translation | 2 | du-form, varied, natural German | Re-translate with fixes |
| 5 | TTS Files | 3 | All exist, valid duration | Regenerate |
| 6 | TTS Levels | 3.5 | Within 15dB of original | Adjust ratio |
| 7 | Segment Audit | 4.5 | All 15 segments audible | Re-check conflicts |
| 8 | Transitions | 4.6 | No harsh jumps (>55dB) | Widen crossfades |
| 9 | English Bleed | 4.7 | Zero vocal energy in silent regions | Better separation model |
| 10 | Final | 5 | Duration, peak, sacred OK | Re-normalize, re-assemble |
Max 3 retries per gate. If all 3 fail, pipeline stops and reports the issue.
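The retry rule can be sketched as a small wrapper around each gate. This is an illustration under stated assumptions: `run_gate`, `GateFailed`, and the `(passed, issue)` return shape are hypothetical, and "3 retries" is read here as three attempts in total.

```python
class GateFailed(Exception):
    """Raised when a QC gate still fails after the allowed attempts."""


def run_gate(name, check, auto_fix, max_attempts=3):
    """Run check(); on failure call auto_fix(issue) and try again.

    check() returns (passed, issue). Returns the attempt number that passed,
    or raises GateFailed so the pipeline stops and reports the issue.
    """
    issue = None
    for attempt in range(1, max_attempts + 1):
        passed, issue = check()
        if passed:
            return attempt
        if attempt < max_attempts:
            auto_fix(issue)  # No point fixing after the final attempt
    raise GateFailed(f"Gate '{name}' failed after {max_attempts} attempts: {issue}")
```

Each row of the table then becomes one `run_gate(...)` call, with the "Pass Criteria" column as `check` and the "Auto-Fix" column as `auto_fix`.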
# Set API keys
export ANTHROPIC_API_KEY="sk-..."
# Run
cd /root/hetzner-aol/sattva
python3 pipeline.py
# Output
# → test_run_001/05_finalization/meditation_german_final.mp3
# → test_run_001/06_qc_reports/pipeline_report.json
| Step | Service | Cost |
|---|---|---|
| Transcription | faster-whisper (self-hosted) | $0 |
| Source separation | UVR5 (self-hosted) | $0 |
| Translation | Claude Sonnet 4 (~3 API calls) | ~$0.10 |
| DashScope TTS | 15 segments | ~$0.02 |
| ElevenLabs TTS | 15 segments (~60s output) | ~$0.30-0.50 |
| **Total** | | **~$0.50** |
Built with the Sattva Pipeline by Art of Living.