🎧 Listen to the result: German meditation (20 min)

Sattva Pipeline — Step-by-Step Guide

What This Does

Takes a Gurudev Sri Sri Ravi Shankar meditation video in English and produces a fully dubbed German audio track with:

Gurudev's cloned voice speaking German
All Om chanting preserved from original
All Sanskrit/sacred chanting preserved from original
Background music/flute preserved throughout
Emotion-matched voice delivery per segment
Zero English in the final output

Prerequisites

# Server: Ubuntu with 16GB+ RAM, no GPU needed
# Python 3.12+

# Audio processing
pip install numpy scipy soundfile
pip install faster-whisper          # Transcription
pip install audio-separator[cpu]    # UVR5 vocal separation (better than Demucs)

# APIs (keys needed)
pip install anthropic               # Claude Sonnet 4 for translation
# ElevenLabs API key               # Voice cloning + TTS
# DashScope API key                 # Gurudev voice clone (registered voice IDs)

# Audio tools
apt install ffmpeg

Pipeline Steps

Step 0: Audio Extraction

Extract WAV files from source video.

ffmpeg -y -i source.mp4 -ar 44100 -ac 2 audio_44k_stereo.wav   # Processing quality
ffmpeg -y -i source.mp4 -ar 16000 -ac 1 audio_16k_mono.wav     # Whisper input

Output: audio_44k_stereo.wav, audio_16k_mono.wav

Step 0.5: Source Separation (UVR5)

Separate vocals from instrumental (flute/background music) using UVR5 MDX-Net.

This gives us a clean instrumental track with zero vocals.

from audio_separator.separator import Separator

sep = Separator(output_dir="uvr_out", output_format="wav")
sep.load_model("UVR-MDX-NET-Inst_HQ_3.onnx")
sep.separate("audio_44k_stereo.wav")

Output: (Instrumental).wav (flute/music only), (Vocals).wav (all voice: English + Om + Sanskrit)

Why UVR5, not Demucs:

Demucs background stem has vocal leakage (Om leaked at -4dB into bg stem)
Demucs background stem has artifacts in quiet regions
UVR5 MDX-Net produces much cleaner instrumental separation
Demucs also had OOM issues (6.5GB) and torchcodec save failures on this server

Step 1: Transcription + QC Gate 1

Transcribe English speech using faster-whisper with meditation-optimized settings.

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio_16k_mono.wav",
    language="en",
    initial_prompt="Meditation guidance by Gurudev Sri Sri Ravi Shankar. Terms: Om, pranayama, samadhi...",
    no_speech_threshold=0.45,       # Lower for soft meditation speech
    compression_ratio_threshold=2.8, # Higher for repetitive meditation phrases
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=800),
    word_timestamps=True,
)

Human review required: Verify transcription at flagged timestamps. Common corrections:

"oh," → "Om" (sacred sound)
"Living away" → "Leaving away" (Gurudev's phrasing)
Remove hallucinated text in long silence regions

QC Gate 1 checks:

☐ No hallucinated repeated phrases
☐ No segments in silence regions (<-45dB)
☐ All segments >0.3s and <60s
☐ Minimum 5 speech segments detected

Auto-fix on fail: Remove hallucinated segments, re-run with stricter VAD.
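
The mechanical part of Gate 1 could look roughly like the sketch below. It assumes each segment is a dict with "id", "start", "end" and "text", as in segments_corrected.json. The function name `qc_gate_1` and the `max_repeats` threshold are hypothetical, and the silence-region check (<-45dB) is left out because it needs the audio itself.

```python
from collections import Counter

def qc_gate_1(segments, min_dur=0.3, max_dur=60.0, min_segments=5, max_repeats=4):
    """Return a list of failure messages; an empty list means the gate passes.

    max_repeats is an illustrative threshold, not a value taken from the pipeline.
    """
    failures = []
    if len(segments) < min_segments:
        failures.append(f"only {len(segments)} segments (need {min_segments})")
    for seg in segments:
        dur = seg["end"] - seg["start"]
        if not (min_dur < dur < max_dur):
            failures.append(f"segment {seg['id']}: duration {dur:.2f}s out of range")
    # Whisper hallucinations often appear as one sentence repeated many times.
    counts = Counter(seg["text"].strip().lower() for seg in segments)
    for text, n in counts.items():
        if n > max_repeats:
            failures.append(f"phrase repeated {n}x: {text!r}")
    return failures
```

Legitimate repetition is common in guided meditation ("Another deep breath in" occurs 4 times), so the repetition threshold probably needs tuning per recording.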

Output: segments_corrected.json (id, start, end, duration, text per segment)

Step 1.5: Sacred Audio Identification

Full-scan the entire audio for Om and chanting. Do NOT hardcode locations — scan automatically.

# Scan for sustained high-energy non-speech events
# Pass 1: Om detection (>-20dB, >3s, not speech)
# Pass 2: Chanting detection (>-30dB, merge within 30s for phrase gaps)

Key insight: Whisper reports Om as 0.5s at one timestamp. Reality: Om resonates for 10+ seconds. Energy analysis finds the true boundaries.

A typical meditation has 3 Oms (at the beginning of the meditation, before deepening, and before a section transition). The Sanskrit chanting is one long block (60-80s).
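
One way the two scanning passes might be implemented is sketched below, assuming a mono float signal in [-1, 1]. The function names and the 100ms frame size are assumptions; excluding frames that Whisper already marked as speech is not shown.

```python
import numpy as np

def frame_levels_db(samples, sr, frame_s=0.1):
    """RMS level in dBFS for consecutive non-overlapping frames."""
    n = int(sr * frame_s)
    usable = len(samples) // n * n
    frames = np.asarray(samples[:usable], dtype=np.float64).reshape(-1, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-10))

def sustained_regions(levels_db, frame_s, threshold_db, min_dur_s, merge_gap_s=0.0):
    """Find (start_s, end_s) runs above threshold_db, merge close runs, drop short ones."""
    above_flags = list(np.asarray(levels_db) > threshold_db) + [False]
    regions, start = [], None
    for i, above in enumerate(above_flags):
        if above and start is None:
            start = i
        elif not above and start is not None:
            regions.append([start * frame_s, i * frame_s])
            start = None
    merged = []
    for reg in regions:
        if merged and reg[0] - merged[-1][1] <= merge_gap_s:
            merged[-1][1] = reg[1]      # close the gap between neighbouring runs
        else:
            merged.append(reg)
    return [(s, e) for s, e in merged if e - s >= min_dur_s]
```

Pass 1 would then correspond to `sustained_regions(levels, 0.1, -20.0, 3.0)` and Pass 2 to `sustained_regions(levels, 0.1, -30.0, 3.0, merge_gap_s=30.0)`. The minimum duration for Pass 2 is a guess, since the guide does not state one.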

Output: sacred_segments_verified.json

Step 1.6: Conflict Check + QC Gate 1.6

Check if any speech segment overlaps a sacred region. If so, automatically split the sacred region around the speech.

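Example: Om detected at 54-65s inside a sacred region spanning 42.1-66.3s, but Seg 3 "Another deep breath in" at 50.5-52.7s falls inside that region.

→ Split into: sacred 42.1-50.0s (before speech) + sacred 53.2-66.3s (after speech), leaving a 0.5s pad on each side of the speech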

QC Gate 1.6 checks:

☐ Zero overlaps between speech segments and sacred regions

Auto-fix on fail: Split sacred region, pad 0.5s gap, retry.
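
The split itself is plain interval arithmetic. A minimal sketch, assuming both inputs are lists of (start_s, end_s) tuples; `split_sacred` is a hypothetical name:

```python
def split_sacred(sacred, speech, pad=0.5):
    """Cut each speech interval (plus padding) out of each sacred interval.

    Returns the remaining sacred intervals, sorted by start time.
    """
    result = []
    for s_start, s_end in sacred:
        pieces = [(s_start, s_end)]
        for sp_start, sp_end in speech:
            cut_start, cut_end = sp_start - pad, sp_end + pad
            next_pieces = []
            for a, b in pieces:
                if cut_end <= a or cut_start >= b:   # no overlap, keep as is
                    next_pieces.append((a, b))
                    continue
                if a < cut_start:                    # part before the speech
                    next_pieces.append((a, cut_start))
                if cut_end < b:                      # part after the speech
                    next_pieces.append((cut_end, b))
            pieces = next_pieces
        result.extend(pieces)
    return sorted(result)
```

With a sacred region of 42.1-66.3s and speech at 50.5-52.7s, this yields the two pieces from the example: 42.1-50.0s and 53.2-66.3s.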

Step 1.7: Emotion/Prosody Analysis + QC Gate 1.7

Analyze HOW Gurudev says each segment in the original audio. This drives per-segment ElevenLabs voice settings.

Audio features extracted per segment:

**RMS energy** → normalized 0-1 ratio (how loud)
**F0 pitch** → median + range via autocorrelation (how deep)
**Speaking rate** → words/second (how fast)
**Dynamic range** → max-min RMS in 200ms windows (how expressive)
**Silence ratio** → fraction below -40dB (how contemplative)
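
The energy-based features in this list could be computed along these lines. This is a sketch under assumptions: mono float samples in [-1, 1], segments longer than one 200ms window, and a hypothetical function name. F0 estimation and the 0-1 normalization of RMS across segments are omitted.

```python
import numpy as np

def segment_features(samples, sr, n_words, win_s=0.2, silence_db=-40.0):
    """Energy-based prosody features for one speech segment."""
    x = np.asarray(samples, dtype=np.float64)
    n = int(sr * win_s)
    usable = len(x) // n * n
    win_rms = np.sqrt(np.mean(x[:usable].reshape(-1, n) ** 2, axis=1))
    win_db = 20 * np.log10(np.maximum(win_rms, 1e-10))
    return {
        "rms": float(np.sqrt(np.mean(x ** 2))),                 # how loud (raw, not normalized)
        "speaking_rate": n_words / (len(x) / sr),               # words per second
        "dynamic_range": float(win_rms.max() - win_rms.min()),  # max-min RMS over 200ms windows
        "silence_ratio": float(np.mean(win_db < silence_db)),   # fraction of windows below -40dB
    }
```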

Maps to emotion labels:

Label	When	ElevenLabs stability	ElevenLabs style
`soft_warm`	Opening, settling	0.45 (steady)	0.20 (subtle)
`deep_resonant`	Om transitions	0.50 (very steady)	0.15 (minimal)
`gentle_guiding`	Standard instructions	0.40	0.25
`uplifting`	Smile, affirmation	0.30 (varied)	0.40 (expressive)
`releasing`	Letting go, breathe out	0.25 (most varied)	0.45 (most expressive)
`peaceful_close`	Coming back	0.50 (steady)	0.15 (minimal)
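
In code, the table reduces to a lookup. The sketch below copies the stability and style values from the table; the `similarity_boost` default of 0.75 is taken from the example settings in Step 3b, and the names `EMOTION_SETTINGS` and `voice_settings` are hypothetical.

```python
# (stability, style) per emotion label, copied from the table above.
EMOTION_SETTINGS = {
    "soft_warm": (0.45, 0.20),
    "deep_resonant": (0.50, 0.15),
    "gentle_guiding": (0.40, 0.25),
    "uplifting": (0.30, 0.40),
    "releasing": (0.25, 0.45),
    "peaceful_close": (0.50, 0.15),
}

def voice_settings(label, similarity_boost=0.75):
    """Return the ElevenLabs voice settings dict for one emotion label."""
    stability, style = EMOTION_SETTINGS[label]
    return {"stability": stability, "similarity_boost": similarity_boost, "style": style}
```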

QC Gate 1.7 checks:

☐ All features computed (no NaN)
☐ Labels match audio features (rule consistency)
☐ ElevenLabs params within safe range (stability 0.15-0.60, style 0.10-0.50)
☐ No sharp jumps (>0.25) between adjacent segments
☐ Same-phase segments have consistent labels

Auto-fix on fail: Widen extraction window, interpolate sharp transitions.
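
The range and jump checks can be sketched as follows, assuming the per-segment settings are available as a list of dicts in segment order. `check_settings` is a hypothetical name, and the label-consistency checks are not covered.

```python
SAFE_RANGES = {"stability": (0.15, 0.60), "style": (0.10, 0.50)}
MAX_JUMP = 0.25

def check_settings(settings_seq):
    """settings_seq: dicts with "stability" and "style", one per segment, in order."""
    failures = []
    for i, s in enumerate(settings_seq):
        for key, (lo, hi) in SAFE_RANGES.items():
            if not lo <= s[key] <= hi:
                failures.append(f"segment {i}: {key}={s[key]} outside {lo}-{hi}")
            # Compare with the previous segment to catch abrupt changes in delivery.
            if i > 0 and abs(s[key] - settings_seq[i - 1][key]) > MAX_JUMP:
                failures.append(f"segment {i}: {key} jumps by more than {MAX_JUMP}")
    return failures
```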

Output: segment_prosody.json (per-segment features + emotion label + ElevenLabs settings)

Step 2: Context-Aware Translation + QC Gate 2

Send ALL segments to Claude Sonnet 4 in ONE call with full meditation context.

Why batch, not per-segment: "Another deep breath in" appears 4 times. Per-segment translation gives identical German 4 times. Batch translation with context lets Claude vary them naturally across the meditation arc.

Claude prompt includes:

9-phase meditation arc (Settling In → Deepening → Releasing → Deep Meditation → Affirmation → Deepest → Chanting → Closing → Coming Back)
Per-segment emotion labels from Step 1.7
Rules: du-form only, sacred terms never translate, natural German, vary repeated phrases
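
A prompt with these three ingredients might be assembled as in the sketch below. The wording of the prompt and the function name are illustrative only, not the pipeline's actual prompt; segments are assumed to carry "id" and "text".

```python
import json

PHASES = ["Settling In", "Deepening", "Releasing", "Deep Meditation", "Affirmation",
          "Deepest", "Chanting", "Closing", "Coming Back"]

RULES = [
    "Use the informal du-form only.",
    "Never translate sacred terms such as Om, pranayama, samadhi.",
    "Write natural German.",
    "Vary repeated phrases across the meditation arc.",
]

def build_translation_prompt(segments, emotion_by_id):
    """Build ONE prompt for all segments so repeated phrases can be varied in context."""
    payload = [
        {"id": s["id"], "english": s["text"], "emotion": emotion_by_id[s["id"]]}
        for s in segments
    ]
    return (
        "Translate this guided meditation from English to German.\n"
        "Meditation arc: " + " -> ".join(PHASES) + "\n"
        "Rules:\n" + "\n".join(f"- {r}" for r in RULES) + "\n"
        "Return a JSON list with id, english, german, variation_note per segment.\n\n"
        + json.dumps(payload, ensure_ascii=False, indent=2)
    )
```

The resulting string would be sent as a single user message, so the model sees every occurrence of a repeated phrase together with its emotion label.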

3 iterations:

v1: Context-aware literal
v2: Natural/flowing with variation emphasis
v3: Claude picks best per-segment from v1 + v2

QC Gate 2 checks:

☐ No formal pronouns (Sie/Ihnen) — du-form only
☐ Sacred terms preserved unchanged
☐ Length ratios 0.3x-2.0x per segment
☐ Identical English → at least 50% different German (variation check)
☐ No calques ("Lass uns" OK when English has "Let's", but not elsewhere)
☐ All 15 segments present in response
☐ Valid JSON returned

Auto-fix on fail: Re-run Claude with specific failure injected in prompt.
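
Several of the Gate 2 checks are mechanical and could be approximated as below. Two assumptions are worth stating: the formal-pronoun test is a capitalization heuristic that skips the first word of each segment, and the variation check is read here as "at least 50% of the renderings of a repeated phrase are distinct", which is only one possible reading. The calque and sacred-term checks are not shown.

```python
import re
from collections import defaultdict

FORMAL = re.compile(r"\b(Sie|Ihnen|Ihr|Ihre[mnrs]?)\b")

def qc_gate_2(rows, expected_count, min_ratio=0.3, max_ratio=2.0, min_variation=0.5):
    """rows: dicts with "id", "english", "german". Returns a list of failure messages."""
    failures = []
    if len(rows) != expected_count:
        failures.append(f"expected {expected_count} segments, got {len(rows)}")
    by_english = defaultdict(list)
    for r in rows:
        # Skip the first word: a sentence-initial "Sie"/"Ihr" may be an ordinary pronoun.
        _, _, rest = r["german"].partition(" ")
        if FORMAL.search(rest):
            failures.append(f"segment {r['id']}: formal pronoun")
        ratio = len(r["german"]) / max(len(r["english"]), 1)
        if not min_ratio <= ratio <= max_ratio:
            failures.append(f"segment {r['id']}: length ratio {ratio:.2f}")
        by_english[r["english"].strip().lower()].append(r["german"].strip().lower())
    for english, germans in by_english.items():
        if len(germans) > 1 and len(set(germans)) / len(germans) < min_variation:
            failures.append(f"too little variation for {english!r}")
    return failures
```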

Output: translations_de_context_v3.json (id, english, german, variation_note per segment)

Step 3a: DashScope TTS — Gurudev Voice Clone

Generate all segments with the registered Gurudev voice clone. This produces accurate voice identity but flat/emotionless delivery.

# DashScope API
MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_ID = "qwen-tts-vc-gurudev-voice-20260321233012828-6e25"  # Bhakti Sutras clone
# POST to: https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Output: segments_dashscope/gurudev_vc/seg_001-015.wav

Step 3b: ElevenLabs TTS — Emotion Layer + QC Gates 3, 3.5

Clone the DashScope output into ElevenLabs, then re-generate each segment with per-segment emotion settings from Step 1.7.

# 1. Concatenate all DashScope WAVs → reference audio (30-60s)
# 2. Create ElevenLabs voice clone from reference
# 3. Generate each segment with tailored settings:

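for seg in segments:  # entries of translations_de_context_v3.json
    settings = prosody_map[seg["id"]]  # From segment_prosody.json
    # e.g., {"stability": 0.25, "similarity_boost": 0.75, "style": 0.45}
    output_path = f"segments_elevenlabs/seg_{seg['id']:03d}.mp3"  # assumes integer ids
    # elevenlabs_tts is the pipeline's own helper (not shown here)
    elevenlabs_tts(seg["german"], output_path, voice_id, settings)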

Slow down 8% after generation (atempo=0.92) for meditation pacing.
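
For orientation, a single-segment request against the ElevenLabs text-to-speech REST endpoint could be built as sketched here. This is not the pipeline's `elevenlabs_tts` helper: the function name is hypothetical, and the `model_id` is an assumption, since the guide does not say which model is used. The request is only constructed, not sent.

```python
import json
import urllib.request

def build_tts_request(text, voice_id, settings, api_key, model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request for one segment."""
    body = {"text": text, "model_id": model_id, "voice_settings": settings}
    return urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` should return the audio bytes, which would then be written to the segment file and slowed with atempo=0.92.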

QC Gate 3 checks:

☐ All MP3 files exist
☐ Each file >1KB
☐ Each segment >0.3s duration
☐ No silence-only files (RMS > -40dB)

QC Gate 3.5 checks:

☐ TTS level within 15dB of original speech level at same timestamp
☐ Level adjustments computed and applied

Auto-fix on fail: Regenerate missing/bad segments.
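
The level comparison in Gate 3.5 amounts to an RMS difference in dB. A minimal sketch, assuming both clips are available as sequences of float samples in [-1, 1]; the helper names are hypothetical:

```python
import math

def rms_db(samples):
    """RMS level in dBFS of a sequence of float samples."""
    mean_sq = sum(s * s for s in samples) / max(len(samples), 1)
    return 10 * math.log10(max(mean_sq, 1e-20))

def level_match(tts_samples, original_samples, max_diff_db=15.0):
    """Return (gain_db, within_limit): the gain that brings TTS to the original level."""
    gain_db = rms_db(original_samples) - rms_db(tts_samples)
    return gain_db, abs(gain_db) <= max_diff_db

def apply_gain(samples, gain_db):
    """Scale samples by a gain given in dB."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]
```

A segment whose required gain exceeds 15dB in either direction would fail the gate rather than being silently corrected.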

Output: segments_elevenlabs/seg_001-015.mp3

Step 4: Assembly + QC Gates 4.5, 4.6, 4.7

Build final audio from 3 layers. No English anywhere.

Layer 0: UVR5 Instrumental stem (flute + background music, zero vocals)
Layer 1: Sacred audio (Om + Sanskrit from UVR5 vocal stem at verified timestamps)
Layer 2: German TTS (level-matched, slowed 8%, crossfaded) at speech timestamps

Mix: Duck instrumental 6dB during German speech for clarity.
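
The ducking and three-layer sum could be sketched as follows, assuming equal-length mono float arrays in which the sacred and TTS layers are already silent outside their own regions. The 200ms gain ramp and the function names are assumptions.

```python
import numpy as np

def duck_gain_curve(n_samples, sr, speech_regions, duck_db=6.0, fade_s=0.2):
    """Per-sample gain for the instrumental: 1.0 normally, lowered by duck_db during speech."""
    ducked = 10 ** (-duck_db / 20)
    gain = np.ones(n_samples)
    fade = int(sr * fade_s)
    for start_s, end_s in speech_regions:
        a, b = int(start_s * sr), min(int(end_s * sr), n_samples)
        gain[a:b] = ducked
        # Linear ramps just outside the region so the level change is not abrupt.
        lo = max(a - fade, 0)
        gain[lo:a] = np.minimum(gain[lo:a], np.linspace(1.0, ducked, a - lo, endpoint=False))
        hi = min(b + fade, n_samples)
        gain[b:hi] = np.minimum(gain[b:hi], np.linspace(ducked, 1.0, hi - b, endpoint=False))
    return gain

def mix(instrumental, sacred, tts, sr, speech_regions):
    """Sum the three layers with the instrumental ducked under the German speech."""
    gain = duck_gain_curve(len(instrumental), sr, speech_regions)
    return instrumental * gain + sacred + tts
```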

QC Gate 4.5 — Segment Audit:

☐ All segments present and audible (>-35dB) at their timestamps

Auto-fix: Re-check sacred conflicts, re-split if needed, retry assembly.

QC Gate 4.6 — Transition Smoothness:

☐ No entry/exit jumps >55dB (truly broken transitions)
Note: 30-50dB jumps are normal — speech starting from quiet flute background

QC Gate 4.7 — English Bleed:

☐ No vocal energy detected in non-TTS, non-sacred regions
Compares final output to instrumental stem — any difference = potential vocal leakage

Auto-fix: Widen crossfades (200ms → 400ms → 500ms), retry.
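
The comparison against the instrumental stem in Gate 4.7 might look like the sketch below. The -60dB threshold and 0.5s frame size are illustrative, not values from the pipeline, and the protected regions should include any fade or ducking margins so that intentional gain ramps are not flagged.

```python
import numpy as np

def bleed_regions(final, instrumental, sr, protected, threshold_db=-60.0, frame_s=0.5):
    """Return (start_s, end_s) frames where the mix differs from the instrumental stem
    outside the protected (TTS + sacred) regions."""
    residual = np.asarray(final, dtype=np.float64) - np.asarray(instrumental, dtype=np.float64)
    n = int(sr * frame_s)
    flagged = []
    for i in range(0, len(residual) - n + 1, n):
        start_s, end_s = i / sr, (i + n) / sr
        if any(start_s < p_end and end_s > p_start for p_start, p_end in protected):
            continue  # frame touches a TTS or sacred region, differences are expected
        rms = np.sqrt(np.mean(residual[i:i + n] ** 2))
        if 20 * np.log10(max(rms, 1e-10)) > threshold_db:
            flagged.append((start_s, end_s))
    return flagged
```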

Output: meditation_german_final.wav

Step 5: Mastering + QC Gate 5

Normalize to broadcast standard and final quality check.

# Two-pass LUFS normalization
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" -f null -   # Measure
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=....:linear=true" -ar 44100 output.wav  # Apply
ffmpeg -i output.wav -codec:a libmp3lame -b:a 192k output.mp3  # MP3
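
Between the two passes, the measurements printed by the first command have to be carried into the second. The sketch below follows ffmpeg's documented loudnorm options (`measured_I`, `measured_TP`, `measured_LRA`, `measured_thresh`, `offset`). The helper names are hypothetical, and it assumes the stderr of pass 1 has been captured as a string.

```python
import json

def parse_loudnorm_json(ffmpeg_stderr):
    """Extract the JSON object that the loudnorm filter prints at the end of stderr."""
    start = ffmpeg_stderr.rfind("{")
    end = ffmpeg_stderr.rfind("}")
    return json.loads(ffmpeg_stderr[start:end + 1])

def second_pass_filter(measured, target_i=-16, target_tp=-1, target_lra=11):
    """Build the loudnorm filter string for pass 2 from the pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}"
        ":linear=true"
    )
```

The returned string is what would be passed to `-af` in the second ffmpeg call.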

QC Gate 5 checks:

☐ Duration matches original (within 2s)
☐ Peak < -1 dBTP (no clipping)
☐ Sacred regions preserved (correlation >0.85 vs vocal stem)

Auto-fix:

Clipping → re-normalize peak to 0.95
Sacred corrupted → widen sacred padding by 0.5s, re-assemble
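
A rough version of the three Gate 5 checks is sketched below, assuming mono float arrays at the same sample rate. It uses the sample peak as a stand-in for true peak (a real dBTP measurement needs oversampling), and `qc_gate_5` is a hypothetical name.

```python
import numpy as np

def qc_gate_5(final, original_len, vocal_stem, sr, sacred_regions,
              max_dur_diff_s=2.0, peak_limit_db=-1.0, min_corr=0.85):
    """Return a list of failure messages for duration, peak and sacred preservation."""
    failures = []
    if abs(len(final) - original_len) / sr > max_dur_diff_s:
        failures.append("duration mismatch")
    peak_db = 20 * np.log10(max(float(np.max(np.abs(final))), 1e-10))
    if peak_db >= peak_limit_db:
        failures.append(f"peak {peak_db:.2f} dBFS")
    for start_s, end_s in sacred_regions:
        a, b = int(start_s * sr), int(end_s * sr)
        corr = np.corrcoef(final[a:b], vocal_stem[a:b])[0, 1]
        if not corr > min_corr:          # also catches NaN from silent input
            failures.append(f"sacred region {start_s}-{end_s}s: correlation {corr:.2f}")
    return failures
```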

Output: meditation_german_final.wav + meditation_german_final.mp3

QC Summary — All Must Pass

#	Gate	After Step	Pass Criteria	Auto-Fix
1	Transcription	1	No hallucinations, ≥5 segments	Remove bad segments, re-run
2	Conflict	1.6	Zero speech/sacred overlaps	Split sacred regions
3	Prosody	1.7	Features valid, labels consistent	Interpolate, widen windows
4	Translation	2	du-form, varied, natural German	Re-translate with fixes
5	TTS Files	3	All exist, valid duration	Regenerate
6	TTS Levels	3.5	Within 15dB of original	Adjust ratio
7	Segment Audit	4.5	All 15 segments audible	Re-check conflicts
8	Transitions	4.6	No harsh jumps (>55dB)	Widen crossfades
9	English Bleed	4.7	Zero vocal energy in silent regions	Better separation model
10	Final	5	Duration, peak, sacred OK	Re-normalize, re-assemble

Max 3 retries per gate. If all 3 fail, pipeline stops and reports the issue.
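
The retry policy can be expressed as a small wrapper around any gate. This is a sketch: it assumes each gate exposes a `check` callable that returns a list of failure messages and an `auto_fix` callable that receives them; the names `run_gate` and `GateFailed` are hypothetical.

```python
class GateFailed(RuntimeError):
    """Raised when a QC gate still fails after all retries."""

def run_gate(name, check, auto_fix, max_retries=3):
    """Run check(); on failure call auto_fix(failures) and re-check, up to max_retries times.

    Returns the number of retries that were needed.
    """
    failures = check()
    for attempt in range(1, max_retries + 1):
        if not failures:
            return attempt - 1
        auto_fix(failures)
        failures = check()
    if failures:
        raise GateFailed(f"{name}: {failures}")
    return max_retries
```

An uncaught `GateFailed` stops the run, and its message is what would end up in the pipeline report.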

Running the Pipeline

# Set API keys
export ANTHROPIC_API_KEY="sk-..."
# The ElevenLabs and DashScope keys listed under Prerequisites must be set as well

# Run
cd /root/hetzner-aol/sattva
python3 pipeline.py

# Output
# → test_run_001/05_finalization/meditation_german_final.mp3
# → test_run_001/06_qc_reports/pipeline_report.json

Key Decisions (Why We Do It This Way)

**UVR5 over Demucs**: Demucs leaked vocals into background stem, had OOM issues, torchcodec failures. UVR5 MDX-Net produces clean separation.

**Surgical replace on instrumental, NOT layered rebuild**: V1-V4 tried rebuilding from Demucs layers — artifacts everywhere. V5+ uses UVR5 instrumental as base, adds only what we want.

**Full-scan sacred detection, NOT hardcoded timestamps**: Whisper said Om was 0.5s at one location. Energy analysis found 3 Oms each lasting 10+ seconds. Hardcoding misses events.

**Context-aware batch translation, NOT per-segment**: Identical English phrases get varied German to match the meditation arc. Claude sees all 15 segments + emotion labels + 9-phase context.

**Per-segment ElevenLabs settings, NOT global**: Settling segments get steady/subtle voice. Releasing segments get varied/expressive voice. Matches Gurudev's actual delivery.

**Two-stage TTS (DashScope → ElevenLabs)**: DashScope gives accurate Gurudev voice identity from registered embeddings. ElevenLabs adds emotion/prosody. A/B tested 2026-04-08, confirmed as best approach.

**8% slower TTS**: ElevenLabs natural pace was slightly fast for meditation. 0.92x atempo matches Gurudev's contemplative pace.

**10 QC gates with auto-retry**: Every step verified before proceeding. Failures auto-fix and retry up to 3x. No manual intervention needed for known failure modes.

Cost Per Meditation (20 min episode, 1 language)

Step	Service	Cost
Transcription	faster-whisper (self-hosted)	$0
Source separation	UVR5 (self-hosted)	$0
Translation	Claude Sonnet 4 (~3 API calls)	~$0.10
DashScope TTS	15 segments	~$0.02
ElevenLabs TTS	15 segments (~60s output)	~$0.30-0.50
Total		~$0.50

Built with the Sattva Pipeline by Art of Living.