🎧 Listen to the result: German meditation (20 min)

Sattva Pipeline — Step-by-Step Guide

What This Does

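Takes a Gurudev Sri Sri Ravi Shankar meditation video in English and produces a fully dubbed German audio track that keeps the original flute/background music and the Om and Sanskrit chanting, and replaces the English guidance with German speech in a cloned Gurudev voice.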

Prerequisites

# Server: Ubuntu with 16GB+ RAM, no GPU needed
# Python 3.12+

# Audio processing
pip install numpy scipy soundfile
pip install faster-whisper          # Transcription
pip install "audio-separator[cpu]"  # UVR5 vocal separation (better than Demucs); quoted so shells like zsh do not expand the brackets

# APIs (keys needed)
pip install anthropic               # Claude Sonnet 4 for translation
# ElevenLabs API key               # Voice cloning + TTS
# DashScope API key                 # Gurudev voice clone (registered voice IDs)

# Audio tools
apt install ffmpeg

Pipeline Steps

Step 0: Audio Extraction

Extract WAV files from source video.

ffmpeg -y -i source.mp4 -ar 44100 -ac 2 audio_44k_stereo.wav   # Processing quality
ffmpeg -y -i source.mp4 -ar 16000 -ac 1 audio_16k_mono.wav     # Whisper input

Output: audio_44k_stereo.wav, audio_16k_mono.wav


Step 0.5: Source Separation (UVR5)

Separate vocals from instrumental (flute/background music) using UVR5 MDX-Net.

This gives us a clean instrumental track with zero vocals.

from audio_separator.separator import Separator

sep = Separator(output_dir="uvr_out", output_format="wav")
sep.load_model("UVR-MDX-NET-Inst_HQ_3.onnx")
sep.separate("audio_44k_stereo.wav")

Output: (Instrumental).wav (flute/music only), (Vocals).wav (all voice: English + Om + Sanskrit)

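Why UVR5, not Demucs: Demucs leaked vocals into the background stem and suffered OOM and torchcodec failures, while UVR5 MDX-Net separates cleanly.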


Step 1: Transcription + QC Gate 1

Transcribe English speech using faster-whisper with meditation-optimized settings.

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    "audio_16k_mono.wav",
    language="en",
    initial_prompt="Meditation guidance by Gurudev Sri Sri Ravi Shankar. Terms: Om, pranayama, samadhi...",
    no_speech_threshold=0.45,       # Lower for soft meditation speech
    compression_ratio_threshold=2.8, # Higher for repetitive meditation phrases
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=800),
    word_timestamps=True,
)

Human review required: Verify transcription at flagged timestamps. Common corrections:

QC Gate 1 checks:

Auto-fix on fail: Remove hallucinated segments, re-run with stricter VAD.

Output: segments_corrected.json (id, start, end, duration, text per segment)
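
The output schema above can be sketched as a small conversion helper. This is a minimal sketch, not the pipeline's actual code: the function name `segments_to_records`, the rounding to two decimals, and the dropping of empty segments are assumptions. It accepts any iterable of objects exposing `.start`, `.end` and `.text`, which is what faster-whisper yields.

```python
import json


def segments_to_records(segments):
    """Convert transcription segments into JSON-ready dicts.

    Each record carries the fields listed for segments_corrected.json:
    id, start, end, duration, text.
    """
    records = []
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # assumed behaviour: skip segments with no text
        records.append({
            "id": len(records) + 1,          # sequential id after filtering
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "duration": round(seg.end - seg.start, 2),
            "text": text,
        })
    return records


def save_records(records, path="segments_corrected.json"):
    """Write the records as UTF-8 JSON for human review."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Calling `save_records(segments_to_records(segments))` right after `model.transcribe(...)` would produce a file a reviewer can correct by hand.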


Step 1.5: Sacred Audio Identification

Full-scan the entire audio for Om and chanting. Do NOT hardcode locations — scan automatically.

# Scan for sustained high-energy non-speech events
# Pass 1: Om detection (>-20dB, >3s, not speech)
# Pass 2: Chanting detection (>-30dB, merge within 30s for phrase gaps)

Key insight: Whisper reports Om as 0.5s at one timestamp. Reality: Om resonates for 10+ seconds. Energy analysis finds the true boundaries.

A typical meditation has 3 Oms (at the beginning of the meditation, before deepening, and before a section transition). Sanskrit chanting is one long block (60-80s).

Output: sacred_segments_verified.json
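
The two scanning passes can be approximated with a windowed RMS scan. The following is a rough sketch under stated assumptions: the 100 ms window, the function names, and the use of dBFS on a float signal in [-1, 1] are illustrative choices, and the "not speech" test from Pass 1 is left out because the source does not say how it is done.

```python
import numpy as np


def frame_db(audio, sr, win_s=0.1):
    """Return (levels_db, hop_s): RMS level per non-overlapping window."""
    win = max(1, int(sr * win_s))
    n = len(audio) // win
    frames = np.asarray(audio[: n * win], dtype=np.float64).reshape(n, win)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(np.maximum(rms, 1e-10)), win / sr


def find_loud_regions(audio, sr, thresh_db=-20.0, min_dur_s=3.0,
                      merge_gap_s=0.0, win_s=0.1):
    """Find (start_s, end_s) runs louder than thresh_db and longer than min_dur_s."""
    levels, hop = frame_db(audio, sr, win_s)
    loud = levels > thresh_db
    regions, start = [], None
    for i, flag in enumerate(loud):
        if flag and start is None:
            start = i                                  # run begins
        elif not flag and start is not None:
            regions.append([start * hop, i * hop])     # run ends
            start = None
    if start is not None:
        regions.append([start * hop, len(loud) * hop])
    # Merge runs separated by short gaps (phrase pauses inside chanting).
    merged = []
    for reg in regions:
        if merged and reg[0] - merged[-1][1] <= merge_gap_s:
            merged[-1][1] = reg[1]
        else:
            merged.append(reg)
    return [(s, e) for s, e in merged if e - s > min_dur_s]
```

Pass 1 would correspond to `find_loud_regions(audio, sr, -20.0, 3.0)` and Pass 2 to `find_loud_regions(audio, sr, -30.0, 3.0, merge_gap_s=30.0)`.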


Step 1.6: Conflict Check + QC Gate 1.6

Check if any speech segment overlaps a sacred region. If so, automatically split the sacred region around the speech.

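Example: Om is audible at about 54-65s and its detected sacred region spans 42.1-66.3s, but Seg 3 "Another deep breath in" is at 50.5-52.7s, inside that region.

→ Split into: sacred 42.1-50.0s (before speech) + sacred 53.2-66.3s (after speech), which leaves a 0.5s pad on each side of the speech.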

QC Gate 1.6 checks:

Auto-fix on fail: Split sacred region, pad 0.5s gap, retry.
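
The split-and-pad rule can be written as a short interval operation. This is a sketch, assuming regions and speech segments are `(start, end)` tuples in seconds; the function name `split_sacred` is hypothetical.

```python
def split_sacred(region, speech_segments, pad=0.5, min_len=0.0):
    """Cut every speech span, widened by `pad` seconds, out of one sacred region."""
    pieces = [tuple(region)]
    for sp_start, sp_end in sorted(speech_segments):
        cut_start, cut_end = sp_start - pad, sp_end + pad
        next_pieces = []
        for start, end in pieces:
            if cut_end <= start or cut_start >= end:
                next_pieces.append((start, end))        # no overlap, keep whole
                continue
            if cut_start > start:
                next_pieces.append((start, cut_start))  # part before the speech
            if cut_end < end:
                next_pieces.append((cut_end, end))      # part after the speech
        pieces = next_pieces
    # Drop leftovers that are too short to be worth keeping.
    return [(s, e) for s, e in pieces if e - s > min_len]
```

With the example above, `split_sacred((42.1, 66.3), [(50.5, 52.7)])` gives two pieces at roughly 42.1-50.0s and 53.2-66.3s.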


Step 1.7: Emotion/Prosody Analysis + QC Gate 1.7

Analyze HOW Gurudev says each segment in the original audio. This drives per-segment ElevenLabs voice settings.

Audio features extracted per segment:

  1. **RMS energy** → normalized 0-1 ratio (how loud)
  2. **F0 pitch** → median + range via autocorrelation (how deep)
  3. **Speaking rate** → words/second (how fast)
  4. **Dynamic range** → max-min RMS in 200ms windows (how expressive)
  5. **Silence ratio** → fraction below -40dB (how contemplative)
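
A rough sketch of how these features might be computed with NumPy follows. Several details are assumptions, not taken from the pipeline: the function names, normalizing RMS against a caller-supplied `peak_rms` (for example the loudest segment), the 60-400 Hz pitch search range, and estimating F0 on a single frame. The median and range would then be taken over many frames.

```python
import numpy as np


def window_rms(audio, sr, win_s=0.2):
    """RMS per non-overlapping 200 ms window (whole clip if shorter)."""
    x = np.asarray(audio, dtype=np.float64)
    win = max(1, int(sr * win_s))
    n = len(x) // win
    if n == 0:
        return np.array([np.sqrt(np.mean(x ** 2))])
    return np.sqrt(np.mean(x[: n * win].reshape(n, win) ** 2, axis=1))


def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Pitch of one frame in Hz, taken from the autocorrelation peak."""
    x = np.asarray(frame, dtype=np.float64)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    if hi <= lo or ac[0] <= 0:
        return 0.0  # frame too short or silent
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag


def prosody_features(audio, sr, n_words, peak_rms):
    """Energy, rate, dynamics and silence features for one segment."""
    x = np.asarray(audio, dtype=np.float64)
    rms = window_rms(x, sr)
    db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    return {
        "rms_energy": float(min(1.0, np.sqrt(np.mean(x ** 2)) / peak_rms)),
        "speaking_rate": n_words / (len(x) / sr),        # words per second
        "dynamic_range": float(rms.max() - rms.min()),   # max-min window RMS
        "silence_ratio": float(np.mean(db < -40.0)),     # share of quiet windows
    }
```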

Maps to emotion labels:

| Label | When | ElevenLabs stability | ElevenLabs style |
|---|---|---|---|
| `soft_warm` | Opening, settling | 0.45 (steady) | 0.20 (subtle) |
| `deep_resonant` | Om transitions | 0.50 (very steady) | 0.15 (minimal) |
| `gentle_guiding` | Standard instructions | 0.40 | 0.25 |
| `uplifting` | Smile, affirmation | 0.30 (varied) | 0.40 (expressive) |
| `releasing` | Letting go, breathe out | 0.25 (most varied) | 0.45 (most expressive) |
| `peaceful_close` | Coming back | 0.50 (steady) | 0.15 (minimal) |
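
The label-to-settings mapping lends itself to a plain lookup table. The sketch below copies the stability and style values from the table; the constant `similarity_boost` of 0.75 and the fallback to `gentle_guiding` for unknown labels are assumptions.

```python
# Stability/style values per emotion label, as listed in the table.
EMOTION_SETTINGS = {
    "soft_warm":      {"stability": 0.45, "style": 0.20},
    "deep_resonant":  {"stability": 0.50, "style": 0.15},
    "gentle_guiding": {"stability": 0.40, "style": 0.25},
    "uplifting":      {"stability": 0.30, "style": 0.40},
    "releasing":      {"stability": 0.25, "style": 0.45},
    "peaceful_close": {"stability": 0.50, "style": 0.15},
}

DEFAULT_LABEL = "gentle_guiding"  # assumed fallback for unknown labels


def voice_settings(label, similarity_boost=0.75):
    """Return the ElevenLabs voice settings for one emotion label."""
    base = EMOTION_SETTINGS.get(label, EMOTION_SETTINGS[DEFAULT_LABEL])
    return {**base, "similarity_boost": similarity_boost}
```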

QC Gate 1.7 checks:

Auto-fix on fail: Widen extraction window, interpolate sharp transitions.

Output: segment_prosody.json (per-segment features + emotion label + ElevenLabs settings)


Step 2: Context-Aware Translation + QC Gate 2

Send ALL segments to Claude Sonnet 4 in ONE call with full meditation context.

Why batch, not per-segment: "Another deep breath in" appears 4 times. Per-segment translation gives identical German 4 times. Batch translation with context lets Claude vary them naturally across the meditation arc.

Claude prompt includes: all 15 segments, their emotion labels, and the 9-phase meditation context.

3 iterations:

  1. v1: Context-aware literal
  2. v2: Natural/flowing with variation emphasis
  3. v3: Claude picks best per-segment from v1 + v2

QC Gate 2 checks:

Auto-fix on fail: Re-run Claude with specific failure injected in prompt.

Output: translations_de_context_v3.json (id, english, german, variation_note per segment)
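
One way to assemble such a single-call prompt is sketched below. The wording of the prompt, the function names, and the segment dict keys (`id`, `text`, `emotion`) are assumptions for illustration; the actual prompt used by the pipeline is not shown in this guide, and sending the prompt to the API is left out.

```python
from collections import defaultdict


def repeated_phrases(segments):
    """Map each phrase that occurs more than once to the ids where it occurs."""
    seen = defaultdict(list)
    for seg in segments:
        seen[seg["text"].strip().lower()].append(seg["id"])
    return {text: ids for text, ids in seen.items() if len(ids) > 1}


def build_batch_prompt(segments):
    """Build one prompt that contains every segment, in order."""
    lines = [
        "Translate the following guided meditation from English into German.",
        "Use the informal du-form and natural, flowing German.",
        "Translate each segment in the context of the whole meditation.",
        "",
        "Segments (id, emotion label, English text):",
    ]
    for seg in segments:
        lines.append(f"[{seg['id']}] ({seg['emotion']}) {seg['text']}")
    repeats = repeated_phrases(segments)
    if repeats:
        lines += ["", "These phrases repeat; vary the German wording across them:"]
        for text, ids in repeats.items():
            id_list = ", ".join(str(i) for i in ids)
            lines.append(f"- '{text}' in segments {id_list}")
    lines += ["", "Return a JSON list with the keys id, english, german, variation_note."]
    return "\n".join(lines)
```

Flagging repeats explicitly is one way to encourage the variation that per-segment translation cannot provide.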


Step 3a: DashScope TTS — Gurudev Voice Clone

Generate all segments with the registered Gurudev voice clone. This produces accurate voice identity but flat/emotionless delivery.

# DashScope API
MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_ID = "qwen-tts-vc-gurudev-voice-20260321233012828-6e25"  # Bhakti Sutras clone
# POST to: https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Output: segments_dashscope/gurudev_vc/seg_001-015.wav


Step 3b: ElevenLabs TTS — Emotion Layer + QC Gates 3, 3.5

Clone the DashScope output into ElevenLabs, then re-generate each segment with per-segment emotion settings from Step 1.7.

# 1. Concatenate all DashScope WAVs → reference audio (30-60s)
# 2. Create ElevenLabs voice clone from reference
# 3. Generate each segment with tailored settings:

for seg in segments:
    settings = prosody_map[seg["id"]]  # From segment_prosody.json
    # e.g., {"stability": 0.25, "similarity_boost": 0.75, "style": 0.45}
    elevenlabs_tts(german_text, output_path, voice_id, settings)

Slow down 8% after generation (atempo=0.92) for meditation pacing.
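
The `elevenlabs_tts` helper itself is not shown in this guide. As a sketch, the request it would send and the slowdown command could be built as below. The function names are hypothetical, and the `eleven_multilingual_v2` model is an assumption since the guide does not name the ElevenLabs model. The code only builds the request and the command; it does not send or run anything.

```python
def build_tts_request(text, voice_id, settings, api_key,
                      model_id="eleven_multilingual_v2"):
    """Return (url, headers, json_body) for an ElevenLabs text-to-speech call."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": model_id,          # assumed model, see note above
        "voice_settings": settings,    # stability, similarity_boost, style
    }
    return url, headers, body


def slowdown_command(src, dst, tempo=0.92):
    """ffmpeg argument list that slows audio to `tempo` without changing pitch."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={tempo}", dst]
```

The tuple from `build_tts_request` can be passed to an HTTP client of your choice, and the list from `slowdown_command` to `subprocess.run(..., check=True)`.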

QC Gate 3 checks:

QC Gate 3.5 checks:

Auto-fix on fail: Regenerate missing/bad segments.

Output: segments_elevenlabs/seg_001-015.mp3


Step 4: Assembly + QC Gates 4.5, 4.6, 4.7

Build final audio from 3 layers. No English anywhere.

Layer 0: UVR5 Instrumental stem (flute + background music, zero vocals)
Layer 1: Sacred audio (Om + Sanskrit from UVR5 vocal stem at verified timestamps)
Layer 2: German TTS (level-matched, slowed 8%, crossfaded) at speech timestamps

Mix: Duck instrumental 6dB during German speech for clarity.
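
Ducking can be expressed as a per-sample gain envelope applied to the instrumental layer. The sketch below assumes mono float arrays of equal length and a 200 ms linear ramp around each speech span; the ramp length and both function names are illustrative assumptions.

```python
import numpy as np


def duck_gain(n_samples, sr, speech_spans, duck_db=6.0, fade_s=0.2):
    """Gain curve: 1.0 normally, lowered by duck_db inside speech spans."""
    floor = 10.0 ** (-duck_db / 20.0)   # 6 dB down is about 0.5
    t = np.arange(n_samples) / sr
    gain = np.ones(n_samples)
    for start, end in speech_spans:
        # depth is 1 inside the span and ramps to 0 over fade_s outside it
        depth = np.clip(np.minimum(t - start, end - t) / fade_s + 1.0, 0.0, 1.0)
        gain = np.minimum(gain, 1.0 - depth * (1.0 - floor))
    return gain


def mix_layers(instrumental, sacred, speech, sr, speech_spans):
    """Sum the three layers, ducking only the instrumental under speech."""
    gain = duck_gain(len(instrumental), sr, speech_spans)
    return instrumental * gain + sacred + speech
```

Ramping the gain instead of switching it avoids an audible step in the music at each speech boundary.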

QC Gate 4.5 — Segment Audit:

Auto-fix: Re-check sacred conflicts, re-split if needed, retry assembly.

QC Gate 4.6 — Transition Smoothness:

QC Gate 4.7 — English Bleed:

Auto-fix: Widen crossfades (200ms → 400ms → 500ms), retry.

Output: meditation_german_final.wav


Step 5: Mastering + QC Gate 5

Normalize to a broadcast loudness target (-16 LUFS integrated, -1 dBTP true peak, LRA 11) and run a final quality check.

# Two-pass LUFS normalization
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" -f null -   # Measure
ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=....:linear=true" -ar 44100 output.wav  # Apply
ffmpeg -i output.wav -codec:a libmp3lame -b:a 192k output.mp3  # MP3
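
The measured values for the second pass come from the JSON that the first pass prints to stderr. The sketch below shows one way to carry them over. The function names are hypothetical, and it assumes the JSON object is the last `{...}` block in the captured stderr.

```python
import json


def parse_loudnorm_json(stderr_text):
    """Extract the JSON object that loudnorm prints at the end of pass 1."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])


def second_pass_filter(measured, target_i=-16, target_tp=-1, target_lra=11):
    """Build the pass-2 loudnorm filter string from pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}"
        ":linear=true"
    )
```

The returned string is what would follow `-af` in the apply command.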

QC Gate 5 checks:

Auto-fix: Re-normalize, re-assemble.

Output: meditation_german_final.wav + meditation_german_final.mp3


QC Summary — All Must Pass

| # | Gate | After Step | Pass Criteria | Auto-Fix |
|---|---|---|---|---|
| 1 | Transcription | 1 | No hallucinations, >5 segments | Remove bad segments, re-run |
| 2 | Conflict | 1.6 | Zero speech/sacred overlaps | Split sacred regions |
| 3 | Prosody | 1.7 | Features valid, labels consistent | Interpolate, widen windows |
| 4 | Translation | 2 | du-form, varied, natural German | Re-translate with fixes |
| 5 | TTS Files | 3 | All exist, valid duration | Regenerate |
| 6 | TTS Levels | 3.5 | Within 15dB of original | Adjust ratio |
| 7 | Segment Audit | 4.5 | All 15 segments audible | Re-check conflicts |
| 8 | Transitions | 4.6 | No harsh jumps (>55dB) | Widen crossfades |
| 9 | English Bleed | 4.7 | Zero vocal energy in silent regions | Better separation model |
| 10 | Final | 5 | Duration, peak, sacred OK | Re-normalize, re-assemble |

Max 3 retries per gate. If all 3 fail, pipeline stops and reports the issue.
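
The retry policy can be sketched as a small gate runner. This is an illustration rather than the pipeline's code: the names `run_gate` and `GateFailed` are hypothetical, and it reads "3 retries" as one initial check followed by up to three fix-and-recheck rounds.

```python
class GateFailed(RuntimeError):
    """Raised when a QC gate still fails after all retries."""


def run_gate(name, check, auto_fix, max_retries=3):
    """Run one QC gate.

    check() returns (passed, issue); auto_fix(issue) attempts a repair.
    Returns the number of retries that were needed.
    """
    passed, issue = check()
    retries = 0
    while not passed and retries < max_retries:
        auto_fix(issue)            # apply the gate's known fix
        retries += 1
        passed, issue = check()    # verify again before moving on
    if not passed:
        # Stop the pipeline and report what went wrong.
        raise GateFailed(f"{name} failed after {max_retries} retries: {issue}")
    return retries
```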


Running the Pipeline

# Set API keys
export ANTHROPIC_API_KEY="sk-..."

# Run
cd /root/hetzner-aol/sattva
python3 pipeline.py

# Output
# → test_run_001/05_finalization/meditation_german_final.mp3
# → test_run_001/06_qc_reports/pipeline_report.json

Key Decisions (Why We Do It This Way)

  1. **UVR5 over Demucs**: Demucs leaked vocals into the background stem, ran out of memory (OOM), and hit torchcodec failures. UVR5 MDX-Net produces clean separation.
  2. **Surgical replace on instrumental, NOT layered rebuild**: V1-V4 tried rebuilding from Demucs layers and produced artifacts everywhere. V5+ uses the UVR5 instrumental as the base and adds only what we want.
  3. **Full-scan sacred detection, NOT hardcoded timestamps**: Whisper said Om was 0.5s at one location. Energy analysis found 3 Oms each lasting 10+ seconds. Hardcoding misses events.
  4. **Context-aware batch translation, NOT per-segment**: Identical English phrases get varied German to match the meditation arc. Claude sees all 15 segments + emotion labels + 9-phase context.
  5. **Per-segment ElevenLabs settings, NOT global**: Settling segments get a steady/subtle voice. Releasing segments get a varied/expressive voice. This matches Gurudev's actual delivery.
  6. **Two-stage TTS (DashScope → ElevenLabs)**: DashScope gives accurate Gurudev voice identity from registered embeddings. ElevenLabs adds emotion/prosody. A/B tested 2026-04-08, confirmed as best approach.
  7. **8% slower TTS**: ElevenLabs' natural pace was slightly fast for meditation. 0.92x atempo matches Gurudev's contemplative pace.
  8. **10 QC gates with auto-retry**: Every step is verified before proceeding. Failures auto-fix and retry up to 3x. No manual intervention is needed for known failure modes.

Cost Per Meditation (20 min episode, 1 language)

| Step | Service | Cost |
|---|---|---|
| Transcription | faster-whisper (self-hosted) | $0 |
| Source separation | UVR5 (self-hosted) | $0 |
| Translation | Claude Sonnet 4 (~3 API calls) | ~$0.10 |
| DashScope TTS | 15 segments | ~$0.02 |
| ElevenLabs TTS | 15 segments (~60s output) | ~$0.30-0.50 |
| **Total** | | **~$0.50** |

Built with the Sattva Pipeline by Art of Living.