Internal provider executor for the creative-media-generation plugin — translates and dubs an existing video into another language with voice cloning and lip-sync (HeyGen Video Translation: same face, cloned voice, re-synced lips). Invoked by the orchestrator (creative-media-generation), not selected directly from user chat. The orchestrator handles request understanding, provider discovery, spend-safety, levers, and artifacts, then delegates the raw translate/dub here. If the user names HeyGen directly, route to the orchestrator first. Operates only on an existing source video (audio-only mode supported); returns the translated video/audio URL per language.
Slash command
/creative-media-generation:heygen-translate [video_url_or_path] [--to language]
> **Vendored provider executor.** This skill is bundled inside the `creative-media-generation` plugin. Route through the orchestrator (`/creative-media-generation:creative-media-generation`) before any credit-consuming run or durable `media/` write — the orchestrator owns provider discovery, the spend-safety confirmation, and the artifact contract. Do not spend or write durable artifacts directly.
CMG executor mode (binding when invoked inside the creative-media-generation plugin)
When this skill runs as a vendored executor inside the `creative-media-generation` plugin, the orchestrator governs routing, billing, and the artifact contract. In CMG mode this skill MUST:
- Use ONLY the orchestrator-supplied `provider` / `transport` / `billing_pool`. Do NOT make its own `HEYGEN_API_KEY` billing decision and do NOT treat API-key presence as a signal to prefer the CLI/API route; the orchestrator's HeyGen API-key footgun stop decides the wallet. `HEYGEN_API_KEY` is optional here (MCP-first via the orchestrator). Do NOT ask the user to paste a HeyGen API key into chat; provider key setup happens outside the transcript, and the orchestrator handles auth.
- Write raw assets (translated videos / audio) ONLY to the orchestrator-supplied `output_dir` (`media/<instance>/assets/`). Do NOT write local logs and do NOT read workspace-root files.
- Not use OpenClaw paths. Do NOT call OpenClaw `video_generate` or `message(action:send)`; those are upstream/standalone-only and are NOT a CMG path.
- Return `job_id` / `session_id` (translation id) / `asset_paths` (plus the `provider` / `transport` / `billing_pool` / `levers_applied` used, e.g. target language, `voice_id`, glossary) to the orchestrator. The orchestrator owns the durable `media/` writes; this skill never writes them.

Sections below marked (standalone/upstream only, NOT in CMG mode) describe the bare OpenClaw/CLI behavior and are superseded by this block whenever the skill runs inside the plugin.
Translate and dub an existing video into 175+ languages. The system clones the presenter's voice into the target language, re-syncs their lips to the new audio, and returns a fully dubbed video. You provide a source video and a target language — the engine handles transcription, translation, voice cloning, lip-sync, and (optionally) burned-in captions.
This is not new-video generation. The presenter, performance, framing, and brand assets in the original video are preserved. Translation rides on top of what's already there.
Pick one transport at session start. Never mix, never switch mid-session, never narrate the choice.
Detect in this order:
> **CMG mode:** Skip OpenClaw-plugin mode and the API-key override. Use the orchestrator-supplied `transport` (`mcp` or `cli`) and `billing_pool`; the orchestrator already ran the HeyGen API-key footgun stop and chose the wallet. Items 1–2 below are (standalone/upstream only, NOT in CMG mode).
1. If the `video_generate` tool exposes a HeyGen translation model, prefer that. Currently the plugin generates videos but does not expose translation directly; fall through to the next tier until HeyGen ships translation through `video_generate`.
2. If `HEYGEN_API_KEY` is set in the environment AND `heygen --version` exits 0, use CLI. (Standalone only: API-key presence is an explicit signal that the user wants direct API access. In CMG mode this is neutralized: the orchestrator's footgun stop decides the wallet; key presence is never an auto-green-light.)
3. If there is no `HEYGEN_API_KEY` AND HeyGen MCP tools are visible (`mcp__heygen__*`), use MCP. OAuth auth, runs against the user's plan credits. (This is the CMG-mode default unless the orchestrator hands you `transport: cli`.)
4. If `heygen --version` exits 0, use CLI. Auth via `heygen auth login`.
5. If none of the above applies, tell the user to install the CLI: "`curl -fsSL https://static.heygen.ai/cli/install.sh | bash` then `heygen auth login`."

After mode detection, verify auth actually works before entering Phase 1. This avoids wasting the user's time gathering inputs only to hit an auth error on submit.

CLI mode: run `heygen auth status` (silent). If it exits 0, proceed. If it exits non-zero (no key, expired, invalid):
1. Log in with `echo "<key>" | heygen auth login` (writes to `~/.heygen/credentials`, survives across sessions).
2. Re-run `heygen auth status`. If still failing, surface the error and stop.

This is a one-time setup. Once `heygen auth login` persists the key, future sessions pick it up automatically. Don't ask again if `heygen auth status` passes.
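The detection order above can be sketched as a pure function. This is a minimal Python illustration, not the plugin's implementation: `pick_transport`, its keyword arguments, and the returned labels are hypothetical names, and the third rule is read here as "no API key set and MCP tools visible".

```python
def pick_transport(*, cmg_transport=None, video_generate_has_translation=False,
                   api_key_set=False, cli_ok=False, mcp_tools_visible=False):
    """Return one transport label for the whole session (hypothetical labels)."""
    # CMG mode: the orchestrator already chose; never second-guess it.
    if cmg_transport is not None:
        if cmg_transport not in ("mcp", "cli"):
            raise ValueError(f"unexpected orchestrator transport: {cmg_transport!r}")
        return cmg_transport
    # Standalone tiers, checked in the documented order.
    if video_generate_has_translation:
        return "openclaw"
    if api_key_set and cli_ok:
        return "cli"
    if not api_key_set and mcp_tools_visible:
        return "mcp"
    if cli_ok:
        return "cli"
    return "none"  # caller should tell the user how to install the CLI
```

Keeping the choice in one function makes it easy to call once at session start and reuse the result, which matches the "never switch mid-session" rule.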
Hard rules:
- No raw `curl api.heygen.com/...` calls. Every operation in this skill has a CLI command and (where supported) an MCP tool. Use those.
- MCP mode: use the `mcp__heygen__*` tools. If translation isn't exposed via MCP yet, fall through to CLI for translation operations specifically. Do not synthesize raw HTTP calls.
- CLI mode: use `heygen ...` commands. Run `heygen video-translate --help` and `heygen video-translate <subcommand> --help` to discover arguments. Use `--request-schema` to see the full JSON shape of any create command.
- MCP currently exposes `create_video_translation` (single language). Multi-language and proofreads are not yet exposed via MCP; fall through to CLI for those. List the `mcp__heygen__*` tools at session start to confirm what's available; the tool surface evolves.
heygen video-translate
├── languages list # supported target languages
├── create # submit a translation job (single or batch)
├── get <id> # check status / fetch result
├── list # list past translations
├── update # update job metadata
├── delete # delete a job
└── proofreads
├── create # extract editable subtitles before final render
├── get <id> # check proofread session status
├── srt get <id> # download the extracted SRT
├── srt update <id> # upload edited SRT
└── generate <id> # render the final video from approved SRT
heygen asset create --file <path> # for local source video uploads (max 32 MB)
Every command supports --help. Use --request-schema on any create to see the full JSON body. CLI output: JSON on stdout, {error:{code,message,hint}} envelope on stderr, exit codes 0 ok · 1 API · 2 usage · 3 auth · 4 timeout. Add --wait on create to block until the job completes (default timeout 20m).
📖 Detailed CLI/MCP error → action mapping → references/troubleshooting.md
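The exit codes and the stderr envelope described above can be interpreted in one place. The following is a rough Python sketch that assumes stderr carries the `{error:{code,message,hint}}` envelope as valid JSON; `classify_cli_result` and its `kind` labels are hypothetical helpers, not part of the HeyGen CLI.

```python
import json

# Exit codes as documented above: 0 ok, 1 API, 2 usage, 3 auth, 4 timeout.
EXIT_KIND = {0: "ok", 1: "api_error", 2: "usage_error", 3: "auth_error", 4: "timeout"}

def classify_cli_result(exit_code, stderr_text=""):
    """Summarize a finished CLI call as a dict with kind/code/message/hint."""
    summary = {"kind": EXIT_KIND.get(exit_code, "unknown"),
               "code": None, "message": None, "hint": None}
    if exit_code == 0 or not stderr_text.strip():
        return summary
    try:
        envelope = json.loads(stderr_text)
    except json.JSONDecodeError:
        return summary  # stderr was not the expected JSON envelope
    error = envelope.get("error") if isinstance(envelope, dict) else None
    if isinstance(error, dict):
        for key in ("code", "message", "hint"):
            summary[key] = error.get(key)
    return summary
```

A caller could branch on `kind` (for example, re-running the auth check on `auth_error`) and show `hint` to the user only when it is present.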
The skill runs four phases. Phase 1 (Discovery) is the only place you ask questions. Phase 2 (Pre-flight) is silent. Phase 3 (Submit + Poll) is silent. Phase 4 (Deliver) is one short message.
Phase 1 — Discovery — gather minimum needed inputs from the user
Phase 2 — Pre-flight — validate language, classify content, set flags
Phase 3 — Submit + Poll — kick off, background poll, surface only on done/fail
Phase 4 — Deliver — post the result with one-line summary
Ask only what you don't already have. Communicate in the user's language. Never run a form. One or two questions per turn, max.
Required inputs (block until you have these):
- Source video (public URL, local file path, or existing HeyGen `asset_id`)
- Target language(s)
Important inputs (ask if not provided, with smart defaults):
- Duration flexibility: flexible by default (`enable_dynamic_duration: true`). Only set to `false` when the user needs frame-exact timing (e.g., syncing to a timeline, ad slot, or external audio track).

Optional (ask only if relevant):
- Partial translation: `start_time` and `end_time` in seconds.

📖 Locale-pair gotchas (formality registers, RTL languages, tonal compression, lip-sync ceiling) → references/language-locale-guide.md
Silent. No user-facing chatter. Three checks, in order.
Check 2a: Language validation.
MCP: list_video_translation_languages() (if exposed). Otherwise CLI.
CLI: heygen video-translate languages list | jq -r '.data.languages[]'
The list contains exact strings ("Spanish (Spain)", "Chinese (Mandarin, Simplified)", "Arabic (Saudi Arabia)"). Match the user's input case-insensitively against these exact strings. If they say "Spanish", default to "Spanish (Spain)" and confirm in Phase 4. If they say "Chinese", default to "Chinese (Mandarin, Simplified)". If they specify a region ("Mexican Spanish"), map it ("Spanish (Mexico)"). If no match: ask the user to pick from the closest options.
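The matching rules above can be sketched in a few lines of Python. This is an illustration under assumptions: `resolve_language` and `BARE_DEFAULTS` are hypothetical names, `supported` stands for the strings returned by `languages list`, and region phrases such as "Mexican Spanish" are not mapped here (that step would need its own lookup).

```python
# Defaults named in the text above for bare language names.
BARE_DEFAULTS = {
    "spanish": "Spanish (Spain)",
    "chinese": "Chinese (Mandarin, Simplified)",
}

def resolve_language(user_input, supported):
    """Return (canonical_name_or_None, closest_options)."""
    wanted = user_input.strip().lower()
    by_lower = {name.lower(): name for name in supported}
    # 1. Case-insensitive match against the exact canonical strings.
    if wanted in by_lower:
        return by_lower[wanted], []
    # 2. Bare language name with a documented default.
    default = BARE_DEFAULTS.get(wanted)
    if default is not None and default in supported:
        return default, []
    # 3. No match: offer canonical strings that contain the input.
    options = [name for name in supported if wanted and wanted in name.lower()]
    return None, options
```

When the first element is `None`, the second holds the options to present to the user; an empty list means nothing close was found.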
Check 2b: Source video routing.
| Source the user gave you | Route |
|---|---|
| Public HTTPS URL (no auth, returns video MIME on HEAD) | Pass directly as {type: "url", url: "..."} |
| Auth-walled URL, 403, 404, or HTML response | Tell the user, ask for a public URL or local file |
| Local file path | Upload via heygen asset create --file <path> (CLI) or upload_asset (MCP). Max 32 MB. Use the returned asset_id as {type: "asset_id", asset_id: "..."} |
| Existing HeyGen asset_id | Pass directly as {type: "asset_id", asset_id: "..."} |
📖 Asset routing edge cases (very large files, presigned URLs, auth-walled sources) → references/asset-routing.md
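The two payload shapes in the routing table can be built by one small helper. A minimal Python sketch, assuming the caller has already uploaded any local file and holds the returned `asset_id`; `build_video_payload` is a hypothetical name, and the `https://` prefix check is only a cheap stand-in for the HEAD check described above.

```python
def build_video_payload(*, url=None, asset_id=None):
    """Return the `video` object for a translation request."""
    # Exactly one source must be given.
    if (url is None) == (asset_id is None):
        raise ValueError("pass exactly one of url or asset_id")
    if url is not None:
        if not url.lower().startswith("https://"):
            raise ValueError("source URL must be a public HTTPS URL")
        return {"type": "url", "url": url}
    return {"type": "asset_id", "asset_id": asset_id}
```

Rejecting ambiguous input early keeps a bad source from reaching the submit step, where the failure would cost a round trip.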
Check 2c: Content profile.
Pick one profile based on the source. Don't list all five to the user — propose silently and only ask if the source is genuinely ambiguous (e.g., a music-heavy talking-head where you can't tell if speech enhancement will help).
| Profile | Use when | Flags |
|---|---|---|
| Talking head / presenter (default) | One person speaks to camera; clean audio | mode: precision, enable_speech_enhancement: true, enable_caption: true, enable_dynamic_duration: true, keep_the_same_format: true |
| Podcast / audio-only | The visual is static, doesn't matter, or doesn't exist | mode: precision, translate_audio_only: true, enable_speech_enhancement: true, enable_caption: true |
| Music / high-soundtrack | Background music interferes with speech | mode: precision, disable_music_track: true, enable_speech_enhancement: true, enable_dynamic_duration: true, keep_the_same_format: true |
| Multi-speaker | Two or more distinct speakers | Talking-head defaults + speaker_num: <count>. Speaker count is REQUIRED here — don't guess. |
| Corporate / branded | Brand voice, glossary discipline, high-stakes | Talking-head defaults + (if user has one) brand_voice_id. Strongly consider proofreads for this profile. |
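The profile table can be encoded as data so the flags are never retyped by hand. The sketch below is illustrative: the profile keys (`talking_head`, `audio_only`, and so on) and the `flags_for` helper are hypothetical names, while the flag names and values are copied from the table.

```python
TALKING_HEAD = {
    "mode": "precision",
    "enable_speech_enhancement": True,
    "enable_caption": True,
    "enable_dynamic_duration": True,
    "keep_the_same_format": True,
}

PROFILES = {
    "talking_head": TALKING_HEAD,
    "audio_only": {"mode": "precision", "translate_audio_only": True,
                   "enable_speech_enhancement": True, "enable_caption": True},
    "music": {"mode": "precision", "disable_music_track": True,
              "enable_speech_enhancement": True, "enable_dynamic_duration": True,
              "keep_the_same_format": True},
    "multi_speaker": TALKING_HEAD,  # plus a required speaker_num
    "corporate": TALKING_HEAD,      # plus brand_voice_id when the user has one
}

def flags_for(profile, speaker_num=None, brand_voice_id=None):
    """Return a fresh flag dict for the chosen profile."""
    flags = dict(PROFILES[profile])  # copy so the shared defaults stay untouched
    if profile == "multi_speaker":
        if speaker_num is None or speaker_num < 2:
            raise ValueError("multi-speaker sources need an explicit speaker_num >= 2")
        flags["speaker_num"] = speaker_num
    if profile == "corporate" and brand_voice_id:
        flags["brand_voice_id"] = brand_voice_id
    return flags
```

Raising on a missing speaker count mirrors the "don't guess" rule in the table.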
Always:
- `mode: "precision"` unless the user explicitly asks for "fast" / "quick" / "speed".
- `enable_dynamic_duration`: set based on the user's answer to the duration flexibility question in Phase 1. Default `true` (recommended), which lets translated speech breathe instead of being crammed into the source's exact timing. Set `false` only when the user explicitly needs fixed-length output. Tonal compression makes flexibility especially important for en→zh, en→ja, en→ko (Asian languages run shorter); de→en, ja→en (run longer); ar/he/ur (RTL + register shifts).
- `keep_the_same_format: true` for visual translations. It preserves the source's resolution and bitrate so the dubbed video matches the original's encoding.
- `enable_watermark: false` (the default).

Silent. Background work. Surface only on (a) per-language completion, (b) per-language hard failure, (c) >5 min progress check.
Branching:
Submit one job per target language using batch syntax (--output-languages accepts multiple).
MCP (single language only at time of writing):
create_video_translation(
video={type, url|asset_id},
output_languages=["Spanish (Spain)"],
mode="precision",
enable_speech_enhancement=true,
enable_caption=true,
enable_dynamic_duration=true,
keep_the_same_format=true,
speaker_num=<n>, # only when known multi-speaker
)
CLI:
heygen video-translate create \
-d '{"video":{"type":"url","url":"https://..."},"output_languages":["Spanish (Spain)","Japanese (Japan)"]}' \
--mode precision \
--enable-speech-enhancement \
--enable-caption \
--enable-dynamic-duration \
--keep-the-same-format \
--speaker-num 1 \
--title "<short title>"
Response returns one video_translation_id per language. Capture all of them.
Polling (silent, backgrounded):
Use --wait on create to block until completion when running ONE language. For batch, drop --wait and poll each ID:
# CLI mode polling (background)
heygen video-translate get <video-translation-id>
# Returns { data: { status: "pending"|"running"|"succeeded"|"failed", video_url, ... } }
Polling cadence: 30s for the first 3 minutes, then 60s. Most translations complete in 5–15 min; some (long videos, batched languages) take 30+ min. Hard timeout: 60 min per translation — beyond that, treat as stuck and surface the issue.
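The cadence above (30s for the first 3 minutes, then 60s, with a 60-minute hard timeout) can be expressed as a schedule generator. A minimal Python sketch; `poll_times` and its parameter names are hypothetical, and the caller is assumed to sleep until each yielded offset before running `heygen video-translate get`.

```python
def poll_times(fast_interval=30, fast_window=180, slow_interval=60, hard_timeout=3600):
    """Yield the elapsed-seconds offsets at which to poll one translation."""
    elapsed = 0
    while True:
        # Fast cadence inside the first window, slow cadence afterwards.
        delay = fast_interval if elapsed < fast_window else slow_interval
        if elapsed + delay > hard_timeout:
            return  # past the hard timeout: treat the job as stuck
        elapsed += delay
        yield elapsed
```

With the defaults this yields polls at 30, 60, ..., 180 seconds and then every minute up to 3600 seconds, after which the job should be surfaced as stuck.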
MCP equivalents: get_video_translation(id) (if exposed). Otherwise fall through to CLI for polling.
📖 Background polling pattern (don't poll in foreground / harness-specific notes) → references/troubleshooting.md#polling
For high-stakes content, run a proofread session first so the user can review/edit the translated subtitles before the engine commits to a final render.
# 1. Create proofread session — returns proofread_ids (one per language)
heygen video-translate proofreads create \
-d '{"video":{"type":"url","url":"https://..."}}' \
--output-languages "Spanish (Spain)" \
--mode precision \
--enable-speech-enhancement \
--keep-the-same-format \
--speaker-num 1 \
  --title "<short filename-safe title>"
# → status: processing (3–5 min for short videos)
# 2. Poll until completed (or failed + failure_message)
heygen video-translate proofreads get <proofread-id>
# → status: completed
# 3. Fetch presigned URLs for editable + original SRTs
heygen video-translate proofreads srt get <proofread-id> > /tmp/srt-resp.json
SRT_URL=$(jq -r '.data.srt_url' /tmp/srt-resp.json) # target-lang, edit this
ORIG_URL=$(jq -r '.data.original_srt_url' /tmp/srt-resp.json) # source-lang transcript
curl -s "$SRT_URL" -o /tmp/proofread.srt
# 4. Edit /tmp/proofread.srt by hand or sed (glossary, register, names)
# See references/proofreads-workflow.md for the full edit playbook.
# 5. Host the edited SRT at a public URL, then upload by reference.
# ⚠️ asset_id route is currently BLOCKED for SRTs —
# `heygen asset create` only accepts png/jpeg/mp4/webm/mp3/wav/pdf.
# Use the URL route. (gist raw, S3 public-read, presigned ≥2h, etc.)
EDITED_URL="https://example.com/proofread-edited.srt"
heygen video-translate proofreads srt update <proofread-id> \
-d "{\"srt\":{\"type\":\"url\",\"url\":\"$EDITED_URL\"}}"
# 6. Kick off final render — returns a video_translation_id
heygen video-translate proofreads generate <proofread-id> --captions
# → {"data":{"video_translation_id":"<vid-id>","status":"processing"}}
# 7. Poll the translation to completion (NOT proofreads get — graduates here)
heygen video-translate get <vid-id>
# → status: running → succeeded; data.video_url has the final mp4
📖 When to insist on proofread, common SRT edits, glossary discipline → references/proofreads-workflow.md
> **CMG mode:** Download each completed translation into the orchestrator-supplied `output_dir` (`media/<instance>/assets/`) and return `job_id` / translation id / `asset_paths` (with the target language + `voice_id` / glossary used) to the orchestrator. Do NOT call `message(action:send)`; the orchestrator surfaces results and owns the durable `media/` writes.
One message per completed language. Format (standalone delivery):
✅ Spanish (Spain) — <video_url> 1m 47s, precision mode, captions on.
If a language failed: one short line with the cause (from troubleshooting reference). Don't flood the user with retry options unless they ask. If the user batched many languages, deliver each as it completes — don't wait for all to finish before posting any.
Source-quality disclaimer. Translation can't improve on the source. If the source has muffled audio, fast cuts, heavy occlusion of the face, or low resolution, lip-sync and voice quality will degrade. When you detect these conditions in Phase 2 (or the user mentions them), warn upfront. Don't surface this after a bad result.
The defaults above cover the common case. The decisions below are what separate this skill from a generic API wrapper. Use them as judgement calls during the workflow, not as a checklist to recite.
For talking-head: 1 speaker. For interviews / podcasts / panels: count exactly, don't guess. The engine separates voices by speaker_num; wrong count means voices bleed across speakers in the dubbed output. If the user is unsure, ask them to scrub the video and count.
A 30-second triage in Phase 2 saves 10–30 minutes of bad translation. Watch/listen to the first ~10 seconds of the source and check:
- Audio quality: if speech is noisy, `enable_speech_enhancement: true`. If music dominates, `disable_music_track: true`. If both, warn the user that quality may be lower regardless of flags.
- Existing burned-in captions: `enable_caption: true` AND warn about the existing burn-in.

📖 Full locale-pair table with register notes and known quirks → references/language-locale-guide.md
Lip-sync is best on:
Lip-sync degrades on:
If the user's source has these conditions, warn them in Phase 1/2: "Heads up — the source has [X], so lip-sync won't be as tight as it would on a static talking-head. Want me to proceed anyway, or switch to audio-only translation?"
enable_caption: true produces captions burned into the video by default. Pros: no separate file, plays anywhere. Cons: not editable later, can collide with source graphics, fixed font/style. For high-stakes content where the user might want to restyle captions (brand kit, language-specific font), prefer the proofreads workflow — it gives an SRT they can use as a sidecar caption file.
translate_audio_only: true skips lip-sync entirely. Use it for:
Output is an audio file (typically MP3). Tell the user how to use it: "This gives you a translated audio track. Composite it back over the original video in your editor, or use it standalone." Do NOT pitch audio-only as a "quality workaround" for bad lip-sync — it's a different deliverable.
Translations bill by source video duration. A 5-minute video translated into 5 languages = 25 billable minutes. Surface time and cost expectations in Phase 1 when the user requests batches: "That's 5 languages × 5 minutes = ~25 min of translation time. Each one will take 10–20 min to render. Sound good?"
Don't quote dollar figures (pricing changes, varies by plan). Quote source minutes × language count, plus an honest render-time range.
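The billing arithmetic above is simple enough to compute before asking the user to confirm a batch. A short Python sketch; `batch_estimate` is a hypothetical helper, and the default render range is only the 10 to 20 minute figure from the example message, not a guarantee.

```python
def batch_estimate(source_minutes, language_count, render_range_min=(10, 20)):
    """Return billable minutes and a one-line expectation message (no dollar figures)."""
    if source_minutes <= 0 or language_count < 1:
        raise ValueError("need a positive duration and at least one language")
    billable = source_minutes * language_count  # billed by source duration, per language
    low, high = render_range_min
    message = (
        f"That's {language_count} languages x {source_minutes:g} minutes = "
        f"~{billable:g} billable minutes. "
        f"Each one will take about {low} to {high} min to render."
    )
    return {"billable_minutes": billable, "message": message}
```

For the example in the text, `batch_estimate(5, 5)` gives 25 billable minutes.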
Common error responses → human-readable causes → next action:
| Symptom | Likely cause | Fix |
|---|---|---|
| 400 "video URL not accessible" | URL requires auth, returned HTML, or wrong MIME | Ask for public URL or local file → upload route |
| 400 "language not supported" | String didn't match canonical languages list | Re-run languages list, present closest matches |
| failed status with "audio extraction" | Source has no audible speech, very corrupted audio, or wrong codec | Verify the source has speech; consider re-encoding |
| failed with "speaker detection" | speaker_num mismatched actual speakers, or audio is too noisy | Re-submit with correct speaker count or enable_speech_enhancement: true |
| Stuck >30 min in running | Backend queue / occasional stalls | Check status, give it 60 min total, then surface to user |
| Lip-sync looks bad on output | Source face conditions (see lip-sync ceiling) | Re-frame expectation; offer audio-only as alternative |
| Captions in wrong direction | RTL language with burned-in caption colliding with source layout | Switch to proofread + sidecar SRT |
📖 Full error → action table including auth and asset upload errors → references/troubleshooting.md
Signals the user is new to HeyGen translation specifically:
- No prior translations in the account (`heygen video-translate list`)

For first-timers, suggest a 30–60 second test clip before committing to a full video. This catches source-quality issues, voice-clone fidelity, and lip-sync ceiling without burning a long-video translation.
When the source is borderline:
"Heads up — the source video has [muffled audio / dim lighting / fast cuts / heavy music / etc.]. The translation engine can't improve on the source, so the dub might inherit some of that. Want to proceed, fix the source first, or test with a short clip?"
Don't surface this after a bad result. Surface it in Phase 2.
Currently best-supported via the proofreads workflow but not yet first-class flags:
- The `brand_voice_id` flag is for voice consistency across translations, not for glossaries.
- `srt` field (Enterprise plan) on create, or via `proofreads srt update` for any plan. Use `srt_role: "input"` to apply YOUR subtitles to the source language; `output` to apply them as the target-language captions.
- `start_time` / `end_time` in seconds. Useful for "translate just minute 2:00–4:00".
- `--output-languages` (CLI) or `output_languages` array (MCP). One job ID returned per language.
- `--callback-url` and `--callback-id` skip polling entirely. Use when you have a webhook endpoint and want event-driven completion.