From voice-assistant
Manage Claude Code, Codex, or Opencode voice assistant — TTS with personality. Use when the user says /voice, asks about speech settings, wants to change voice persona, enable/disable TTS, or configure voice providers. Also use when user says "stop talking", "disable speech", "enable voice", or similar. Also use when user mentions voice names, accents, pacing, style, or says things like "more expressive", "use a male voice", "speak in Dutch", "slow down", "what voice is this?".
Slash command: /voice-assistant:voice-assistant
Text-to-speech with configurable personality for Claude Code, Codex, and Opencode project configuration.
The voice system supports five provider families: macos, voxtral,
gemini, local_tts, and omlx. In provider names, voxtral means cloud
Voxtral through the Mistral API. Local Voxtral models run through
provider=local_tts. The local_tts provider is a generic OpenAI-compatible
HTTP adapter and has been real-E2E verified with MLX-Audio for Kokoro 4-bit,
PocketTTS 4-bit, and Voxtral TTS 4-bit. Do not claim every MLX-Audio model is
production-ready until that specific model has a real audio test on the user's
machine.
omlx is the same OpenAI-compatible /v1/audio/speech client pointed at a
separate, externally-managed oMLX server (default :8001, auth-gated)
instead of the bundled mlx-audio server (local_tts, :8000). It exists as a
distinct provider so both can be configured at once and switched, and so it
never auto-starts a server. It is unverified for TTS until oMLX is confirmed
to serve /v1/audio/speech with a TTS model loaded — see the omlx schema
below.
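
As a rough illustration of how one OpenAI-compatible client can serve both providers, the sketch below builds (but does not send) a /v1/audio/speech request. The function name, the oMLX model and voice placeholders, and the token value are assumptions for illustration; only the endpoint path, the default ports, and the bearer-token requirement come from the text above.

```python
import json
import urllib.request


def build_speech_request(base_url, model, voice, text, token=None, response_format="wav"):
    """Build (but do not send) a POST to an OpenAI-compatible /v1/audio/speech endpoint."""
    body = json.dumps(
        {"model": model, "input": text, "voice": voice, "response_format": response_format}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Auth-gated servers (oMLX in this document) expect a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech", data=body, headers=headers, method="POST"
    )


# local_tts: bundled mlx-audio server on :8000, no token.
local_req = build_speech_request(
    "http://127.0.0.1:8000", "mlx-community/Kokoro-82M-4bit", "af_sarah", "Hello"
)
# omlx: externally managed server on :8001; model, voice, and token are placeholders.
omlx_req = build_speech_request(
    "http://127.0.0.1:8001", "example-tts-model", "example-voice", "Hello", token="PLACEHOLDER"
)
```

The two requests differ only in base URL and authorization header, which is why the providers can share a client while staying separately configurable.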
Read references/local-tts-onboarding.md before advising users about local
models, PocketTTS, device support, custom voices, or installation mode.
For the EliteExperts project-local SAMI setup, read
requirements.md before installing, repairing, or handing
off the skill. Run
python3 <skill>/scripts/check_requirements.py --target both from the project
root to verify Apple Silicon, uv, Claude CLI, Hugging Face CLI, real
project-local skill copies, local sami_de persona files, Voxtral model cache,
and the shared MLX-Audio server. The checker must not install anything
silently; when a requirement is missing, report the missing item and ask the
user before installing dependencies, downloading the model, or starting the
server.
Project-local lifecycle hooks run a deliberately small SessionStart
preflight: the project must explicitly define enabled, provider,
active_profile, and primary_language, and the active persona must exist in
the project-local host directory when project_local_persona_required is true.
Vendored handoff projects should set that flag. Symlinked personal workspaces
may continue to use global or built-in personas intentionally. Treat broader readiness as a markdown rubric plus
check_requirements.py, not as a large programmatic test framework.
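
The preflight described above could look roughly like the following sketch. The function name, the message wording, and the assumption that a persona is stored as <active_profile>.json in the project-local directory are illustrative, not taken from the hook code.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("enabled", "provider", "active_profile", "primary_language")


def preflight(config: dict, persona_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the preflight passes."""
    problems = [f"missing project setting: {key}" for key in REQUIRED_KEYS if key not in config]
    if config.get("project_local_persona_required") and "active_profile" in config:
        # Assumption: personas are stored as <active_profile>.json.
        persona_file = persona_dir / f"{config['active_profile']}.json"
        if not persona_file.is_file():
            problems.append(f"persona not found in project-local directory: {persona_file.name}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    config = {
        "enabled": True,
        "provider": "local_tts",
        "active_profile": "sami_de",
        "primary_language": "de",
        "project_local_persona_required": True,
    }
    before = preflight(config, Path(tmp))  # persona file not yet present
    Path(tmp, "sami_de.json").write_text("{}", encoding="utf-8")
    after = preflight(config, Path(tmp))   # persona file now present

incomplete = preflight({"enabled": True}, Path("."))
```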
| Provider | Best for | Main tradeoff |
|---|---|---|
| macos | Zero-setup speech and emergency fallback | Robotic, limited personality |
| gemini | Most expressive cloud voice, Director's Notes, accents | Requires Gemini credentials/config |
| voxtral | Cloud Voxtral through the Mistral API with simple emotion-mapped voices | Requires Mistral credentials/config |
| local_tts | Local/private speech through MLX-Audio or another OpenAI-compatible server | Needs a running local server, model downloads, RAM, and real E2E testing |
| omlx | Local/private speech through a separate, externally-managed oMLX server (:8001, auth-gated); co-exists with local_tts | Needs oMLX running with a TTS model loaded + a bearer token; no auto-start |
Machine-readable provider credential requirements live in
defaults/provider-requirements.json.
That file contains only non-secret metadata. It must distinguish cloud Voxtral
(provider=voxtral, Mistral API key) from local Voxtral
(provider=local_tts, no Mistral API key unless the local server itself uses a
bearer-token setting).
For Apple Silicon users who want local/private English summaries, start with
Kokoro. The built-in vulcan-science-officer persona uses Kokoro 4-bit with
the English bm_george voice. For German or Dutch projects, prefer local
Voxtral 4-bit. The complete model/language hard rules and selection logic are
authoritative in defaults/local-tts-models.json under
project_language_rules. Keep fallbacks disabled until failures are
understood; this system should fail loudly during rollout.
| Command | Description |
|---|---|
| /voice | Show current status (see Status Display below) |
| /voice setup | Guided first-time setup (interactive Q&A) |
| /voice create | Create a new persona interactively |
| /voice list | List all available personas with traits preview |
| /voice switch <name> | Switch active persona for this project |
| /voice provider macos\|voxtral\|gemini\|local_tts\|omlx | Switch voice provider |
| /voice language <code> | Persist primary project conversation language for voice/model checks |
| /voice local status | Show local TTS endpoint, model, voice, fallback, health, and last synthesis diagnostics |
| /voice local check | Check configured local TTS server health without generating audio |
| /voice local doctor | Run full local TTS readiness checks and print actionable setup notices |
| /voice local test | Generate and play a short local TTS test phrase |
| /voice local models | List OpenAI-compatible model IDs when the server supports /v1/models |
| /voice local-models --language <code> | Show known local TTS model/language compatibility before choosing a model |
| /voice doctor | Cross-cutting voice self-diagnostic: config, credential presence, both servers + loaded models, last synthesis, recent announcer outcomes, and observations that flag the recurring failure modes (see Voice self-diagnostic below) |
| /voice credentials status | Show redacted credential readiness by provider and target |
| /voice credentials set gemini\|voxtral | Prompt locally for a cloud-provider key and write it to a user-global credential file |
| /voice credentials verify gemini\|voxtral\|local_tts | Exit nonzero when required credentials for that provider are missing |
| /voice server status | Show shared local TTS server readiness, managed PID, model, voice, and log path |
| /voice server ensure | Reuse a ready shared server or start one detached shared MLX-Audio server |
| /voice server logs | Show recent shared MLX-Audio server logs |
| /voice on | Enable TTS (persists across sessions) |
| /voice off | Disable TTS (persists across sessions) |
| /voice test | Speak a test sentence with current persona + provider |
| /voice voice | Browse all 30 Gemini voices with traits |
| /voice voice preview <name> | Play a sample in that voice |
| /voice voice <name> | Set the active Gemini voice |
| /voice style | Show current Director's Notes (style, pacing, accent) |
| /voice style <description> | Update style/pacing/accent from description |
| /voice cache warm | Pre-generate cached audio for acknowledgments |
(I want to engage voice)
This is a markdown-driven setup conversation, not a silent wizard. Ask one decision at a time, wait for the user's answer, then act. Do not install hooks, choose a provider, start a local server, or enable voice until that exact decision has been made.
Comprehensive walkthrough: see
references/onboarding/00-master-flow.md for the full ordered checklist linking persona setup, language alignment, and platform-specific persona blocks. Use that as the master conductor; this SKILL.md section covers the technical gates only (install scope, agent target, provider, language, audible test, enable).
Start by using the helper script for state:
uv run <skill>/scripts/voice.py status --target auto
uv run <skill>/scripts/voice.py doctor --target auto
Run those commands from the consuming project root, or set
VOICE_ASSISTANT_PROJECT_DIR=/absolute/path/to/project when invoking helpers
from the skill-development checkout. The helpers use that explicit env var,
then CLAUDE_PROJECT_DIR/CODEX_PROJECT_DIR, then the current working
directory, then the skill path.
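
The lookup order above can be sketched as a small resolver. This is a minimal sketch, not the helpers' actual code: the function name is hypothetical, and the relative order of CLAUDE_PROJECT_DIR and CODEX_PROJECT_DIR is an assumption, since the text lists them together.

```python
from pathlib import Path

# Explicit override first, then the host-provided project directories.
ENV_ORDER = ("VOICE_ASSISTANT_PROJECT_DIR", "CLAUDE_PROJECT_DIR", "CODEX_PROJECT_DIR")


def resolve_project_dir(env, cwd=None, skill_path=None):
    """Return the first available project directory following the documented order."""
    for name in ENV_ORDER:
        value = env.get(name)
        if value:
            return Path(value)
    if cwd:
        return Path(cwd)
    return Path(skill_path) if skill_path else None


explicit = resolve_project_dir(
    {"VOICE_ASSISTANT_PROJECT_DIR": "/work/app", "CLAUDE_PROJECT_DIR": "/other"}, cwd="/tmp"
)
from_host = resolve_project_dir({"CODEX_PROJECT_DIR": "/work/codex"}, cwd="/tmp")
from_cwd = resolve_project_dir({}, cwd="/tmp", skill_path="/skills/voice-assistant")
from_skill = resolve_project_dir({}, skill_path="/skills/voice-assistant")
```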
The helper script is not the onboarding experience. The skill is. The script exists so the AI can inspect state, persist explicit choices, and fail loudly without inventing shell snippets. Hook installation and audio tests remain in their specialist scripts.
Required order:
- AGENTS.md and/or CLAUDE.md when those files exist.
- user_name, align language. See references/onboarding/02-persona-naming-and-tone.md.
- voice.<provider> per references/onboarding/03-platform-gemini.md, 04-platform-voxtral.md, or 05-platform-local-tts.md.

Ask exactly one:
Should I install voice for this project only, or as your user-wide default? Recommended for a fresh repo: project-only, so hooks and skill code are explicit here.
Choices:
- project-only: use .agents/skills/voice-assistant and/or .claude/skills/voice-assistant; hooks point at this project copy.
- user-wide: use the global skill/preferences; suitable only when the user wants the same voice behavior across projects.

For project-local hooks, the project-local skill copy is the runtime source of truth. Do not install both unless the user explicitly asks.
Ask exactly one:
Which agent should I wire voice into: Codex, Claude Code, Opencode config, or all supported targets?
Choices:
- codex: install Codex hooks only.
- claude: install Claude hooks only.
- opencode: distribute skill/persona/config AND install the Opencode lifecycle adapter (an ESM plugin at .opencode/plugins/voice-assistant.plugin.ts). Audible lifecycle hooks are implemented for Opencode, pending live verification: the adapter, event mapping, detached Stop dispatch, and the hook stdin/project-root handoff are tested in isolation, but an end-to-end audible run inside a live Opencode TUI / opencode run session has not yet been confirmed on a user machine. See references/opencode-harness-contract.md.
- both: keep .agents/skills/voice-assistant and .claude/skills/voice-assistant mirrored except documented host-specific differences.
- all: install Codex and Claude hooks, plus the Opencode lifecycle adapter and skill/persona/config distribution.

For both, persist provider and enabled state by running each host's own
scripts/voice.py helper. Do not use the Codex helper to write Claude config
or the Claude helper to write Codex config; config reads and writes must use the
same runtime source of truth.
Ask exactly one:
Which voice provider do you want first?
macos is fastest, gemini is most polished, voxtral is cloud Voxtral through the Mistral API, and local_tts is private/local but requires a running server.
Provider rules:
- macos: zero credentials; acceptable fallback; least expressive.
- gemini: cloud; needs Gemini credentials saved where hooks can read them; best Director's Notes/style support.
- voxtral: cloud Voxtral through the Mistral API; needs Mistral credentials; use documented cloud Voxtral voices.
- local_tts: local/private, including local Voxtral models; requires explicit model + voice, loopback server, diagnostics, and successful audible test before enabling.

Ask exactly one before selecting a local TTS model:
What is the primary language for instructions and AI replies in this project?
Persist the answer, then inspect known local model support:
uv run <skill>/scripts/voice.py language <code> --scope project
uv run <skill>/scripts/voice.py local-models --language <code>
This is a project/persona setting today; the skill does not dynamically switch
local models per response. The model/language hard rules (for example, Kokoro/PocketTTS
blocked for German, Voxtral recommended for German/Dutch) are authoritative in
defaults/local-tts-models.json under
project_language_rules. Use the helper call above and follow its output;
do not maintain a parallel copy of the rules.
If the user chooses local_tts, branch to the platform-specific docs:
- references/onboarding/05-platform-local-tts.md.
- references/local-tts-onboarding.md.
- defaults/local-tts-models.json project_language_rules + models[]. Other docs refer to it.

Apple Silicon first; on Intel/Windows/Linux, local_tts works with any
OpenAI-compatible server, but do not promise MLX-Audio. Hooks never import
MLX directly; SessionStart reuses the shared server or fails loudly.
Voice is not considered engaged until:
- claude CLI is installed and authenticated because greeting and summary rewrites use claude -p --model sonnet before TTS.
- local_tts, voice.py server status shows a ready shared server or voice.py server ensure starts/reuses one; failures must produce an explicit user-facing diagnostic.
- "User" user_name. Built-in professional / casual ship with "User" as a placeholder; the notification hook explicitly suppresses the name-prefix in that case, so onboarding is not complete until user_name has been set per references/onboarding/02-persona-naming-and-tone.md.
- voice.<provider> block matching the chosen provider. For Gemini that means voice_name, audio_profile, scene, directors_notes (style/pacing/accent), and sample_context; for Voxtral cloud primary_voice + emotions; for local_tts model + voice + emotion_map.
- primary_language != "en": persona's acknowledgments, greeting_style, language_style, and (if used) notification_message / precompact_message are in the target language.

Command mapping:
uv run <skill>/scripts/voice.py status --target codex
uv run <skill>/scripts/voice.py doctor --target codex
uv run <skill>/scripts/install_hooks.py --target codex
uv run <skill>/scripts/voice.py server status --target codex
uv run <skill>/scripts/voice.py server ensure --target codex --timeout 30
uv run <skill>/scripts/local_tts.py --target codex doctor
uv run <skill>/scripts/local_tts.py --target codex test --text "Local TTS is working."
uv run <skill>/scripts/voice.py provider local_tts --target codex --scope project --confirm-audible-test
uv run <skill>/scripts/voice.py on --target codex --scope project
uv run <skill>/scripts/test_tts.py --provider local_tts --text "Voice is working."
Custom voice/reference-audio cloning is not implemented. If requested, explain
that current behavior is consent-gated but intentionally unsupported:
custom_voice_consent_required for missing consent and
unsupported_custom_voice even when valid consent/reference data is supplied.
Do not clone, ignore the custom voice block, or silently substitute a built-in
voice.
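
A minimal sketch of that gating, assuming the custom voice request lives under a custom_voice key with a consent flag and a reference_audio field (all three names are hypothetical; only the two result codes come from the text above):

```python
def check_custom_voice(voice_config: dict):
    """Return a refusal code for custom voice requests, or None when none is configured."""
    custom = voice_config.get("custom_voice")  # hypothetical key name
    if not custom:
        return None  # nothing requested, so nothing to refuse
    if not custom.get("consent"):  # hypothetical consent flag
        return "custom_voice_consent_required"
    # Even with valid consent and reference data, cloning stays unsupported.
    return "unsupported_custom_voice"


no_request = check_custom_voice({"model": "mlx-community/Kokoro-82M-4bit"})
missing_consent = check_custom_voice({"custom_voice": {"reference_audio": "sample.wav"}})
with_consent = check_custom_voice(
    {"custom_voice": {"reference_audio": "sample.wav", "consent": True}}
)
```

Neither branch falls through to a built-in voice, which matches the rule that the custom voice block is never ignored or silently substituted.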
When running /voice, show this information:
Always show:
When provider is gemini, also show:
- (defaults/gemini-voices.md)
- (voice.gemini.directors_notes)
- voice.gemini block, note "Using defaults (Aoede, Breezy)" and offer to configure

Example output:
Voice Assistant Status
Enabled: yes
Persona: Sami (warm and personal)
Provider: gemini
Gemini TTS:
Voice: Aoede (Female, Breezy)
Style: Warm, witty, affectionate. Light British charm...
Pacing: Varied — brisk for wins, slower for reassurance
Accent: Modern British, London-adjacent
When provider is local_tts, also show:
- (127.0.0.1, ::1, or localhost)
- voice.local_tts block
- (wav, mp3, flac, ogg, or m4a)
- (connect_timeout_seconds, read_timeout_seconds, retry_count)
- (not_checked, disabled, healthy, reachable_warning, or failed), endpoint, HTTP status, latency, and error type when present. Show disabled when health.enabled: false is configured.
- local_tts is configured and a loopback server is unreachable, say so plainly, show the shared server manager command, and ask whether the user wants you to start/reuse it or will start it themselves.

Example output:
Local TTS:
Endpoint: http://127.0.0.1:8000/v1/audio/speech (loopback: yes)
Model: mlx-community/Kokoro-82M-4bit
Voice: af_sarah
Response format: wav
Timeout/retry: connect 2s, read 20s, retries 0
Fallback: disabled (provider: macos; on: connection_error, timeout)
Custom voice: disabled
Last health: not checked
Last synthesis: not checked
Last error: none
(/voice voice)
When running /voice voice (no argument), show all 30 Gemini voices grouped by gender.
Read defaults/gemini-voices.md for the complete voice table with official traits.
Authority: The trait labels in the block below mirror
defaults/gemini-voices.md. If a trait drifts, fix it in defaults/gemini-voices.md first; that file is the single source of truth. Do not edit the labels here in isolation.
Display format:
Female voices:
Achernar (Soft) · Aoede* (Breezy) · Autonoe (Bright)
Callirrhoe (Easy-going) · Despina (Smooth) · Erinome (Clear)
Gacrux (Mature) · Kore (Firm) · Laomedeia (Upbeat)
Leda (Youthful) · Pulcherrima (Forward) · Schedar (Even)
Sulafat (Warm) · Vindemiatrix (Gentle) · Zephyr (Bright)
Male voices:
Achird (Friendly) · Algenib (Gravelly) · Algieba (Smooth)
Alnilam (Firm) · Charon (Informative) · Enceladus (Breathy)
Fenrir (Excitable) · Iapetus (Clear) · Orus (Firm)
Puck (Upbeat) · Rasalgethi (Informative) · Sadachbia (Lively)
Sadaltager (Knowledgeable) · Umbriel (Easy-going) · Zubenelgenubi (Casual)
* = currently active
Say "preview <name>" to hear a sample, or "<name>" to switch.
To preview a voice: Run uv run <skill>/scripts/preview_voice.py --voice-name <name>.
To set a voice: Run uv run <skill>/scripts/update_gemini_config.py --persona <id> --voice-name <name>, then speak a test phrase.
(/voice style)
/voice style (no argument) shows the current Director's Notes:
- voice.gemini.directors_notes object

/voice style <description> updates style conversationally:
- uv run <skill>/scripts/update_gemini_config.py with the new values
- directors_notes.style, add more vivid descriptors, save via update_gemini_config.py, test
- gemini-voices.md, offer preview
- /voice voice browser
- directors_notes.pacing to slower variant, save
- directors_notes.accent with the language directive (e.g., "Dutch, natural Netherlands accent"). Check gemini-voices.md for BCP-47 code support.
- ~/.codex/voice/profiles/<id>.json with fallback to ~/.claude/voice/profiles/<id>.json
- voice.gemini.directors_notes or voice.gemini.voice_name fields
- uv run <skill>/scripts/update_gemini_config.py

When enabled, Codex speaks its responses using a configurable personality. The Stop hook extracts Codex's last response, sends it to the configured summarizer backend (Claude Haiku/Sonnet by default, or an opt-in local model, see below) for persona-flavored rewriting with emotion/style/pacing/accent selection, and speaks via macOS say, Mistral Voxtral, Google Gemini TTS, or local HTTP TTS.
If the rewrite fails, the Stop hook degrades gracefully (it does not announce a raw failure): it speaks a short un-stylized excerpt of the actual message and writes a visible stderr diagnostic with the failure reason.
(providers.summarizer)
The Stop summary and SessionStart greeting rewrites run through a config-driven
provider (utils/summarizer.py). Two backends:
- claude_cli (default, shipped): the legacy claude -p path (--model haiku for the Stop summary, --model sonnet for the greeting). Unchanged behavior.
- openai_compatible (opt-in, per project): POSTs to a local OpenAI-compatible /v1/chat/completions (e.g. an oMLX server running a fast local model such as gemma-4-12B-it-OptiQ-4bit). Reach for this when running several projects in parallel makes the Anthropic CLI calls contend and time out; a local model has no shared rate limit and returns in seconds.

By default the local path reuses the same persona-tuned rewrite prompt the
Claude path builds (prompt_style: "shared") — there is no second prompt to
maintain — and parses the response leniently: prose that skips the JSON
envelope is still used as the spoken line, so the local hop succeeds rather than
falling through. prompt_style: "minimal" is an optional stripped-down prompt
kept purely as a per-platform A/B tuning lever (see
scripts/test_summarizer.py). On local failure the chain runs a
short-budgeted claude -p fallback
(claude_fallback_timeout_seconds, default 12s — deliberately not the full
110s, which would recreate the timeout), then the hook's terminal excerpt
fallback. Set allow_claude_fallback: false for a fully-offline chain.
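
The lenient parsing described above might be sketched as follows. The function name is hypothetical; the text field matches the rewriter output shape {text, emotion, style, pacing, accent} described in this document.

```python
import json


def parse_spoken_line(raw: str):
    """Prefer the JSON envelope's text field; otherwise use the prose itself."""
    text = raw.strip()
    if not text:
        return None  # nothing usable; the caller moves on to the next fallback
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return text  # prose that skipped the JSON envelope is still spoken
    if isinstance(payload, dict) and isinstance(payload.get("text"), str):
        return payload["text"].strip() or None
    return text


from_envelope = parse_spoken_line('{"text": "Build passed.", "emotion": "happy"}')
from_prose = parse_spoken_line("Build passed, all tests green.")
from_empty = parse_spoken_line("   ")
```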
Opt a project in — in <project>/.claude/tts_config.json (deep-merges over
the shipped defaults, so only the changed keys are needed):
{
  "providers": {
    "summarizer": {
      "type": "openai_compatible",
      "openai_compatible": {
        "base_url": "http://127.0.0.1:8001",
        "model": "gemma-4-12B-it-OptiQ-4bit",
        "credential_key": "omlx_api_key"
      }
    }
  }
}
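
To illustrate why only the changed keys are needed, here is a minimal deep-merge sketch. The function is illustrative rather than the skill's own loader, and the shape of the shipped defaults shown here (including where claude_fallback_timeout_seconds sits) is an assumption.

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults without mutating either input."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Assumed shape of the shipped defaults, for illustration only.
shipped = {"providers": {"summarizer": {"type": "claude_cli", "claude_fallback_timeout_seconds": 12}}}
project = {
    "providers": {
        "summarizer": {
            "type": "openai_compatible",
            "openai_compatible": {"base_url": "http://127.0.0.1:8001"},
        }
    }
}
effective = deep_merge(shipped, project)
```

Keys the project file does not mention, such as the fallback timeout, survive from the defaults.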
The bearer token (oMLX is auth-gated) is read from
~/.skills/voice-assistant/credentials.json under credential_key via the R50
loader — never place the token in project config. Store it with
scripts/setup_credentials.py. base_url is loopback-only by default (set
require_loopback: false to allow a remote host); redirects and URL-embedded
credentials are refused, and header values redact in diagnostics. Leave type
at claude_cli to keep the Anthropic path — openai_compatible never becomes a
silent global default.
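
The loopback and embedded-credential rules above could be checked along these lines. This is a sketch under assumptions: the function name and message strings are hypothetical, and redirect handling is out of scope here.

```python
import ipaddress
from urllib.parse import urlsplit


def check_base_url(base_url: str, require_loopback: bool = True) -> list[str]:
    """Return reasons to refuse base_url; an empty list means it is acceptable."""
    parts = urlsplit(base_url)
    problems = []
    if parts.username or parts.password:
        problems.append("credentials embedded in URL")
    host = parts.hostname or ""
    if require_loopback:
        try:
            is_loopback = ipaddress.ip_address(host).is_loopback
        except ValueError:
            # Not an IP literal; accept only the conventional loopback name.
            is_loopback = host == "localhost"
        if not is_loopback:
            problems.append(f"host is not loopback: {host}")
    return problems


ok_v4 = check_base_url("http://127.0.0.1:8001")
ok_v6 = check_base_url("http://[::1]:8001")
refused = check_base_url("http://user:pw@example.com")
remote_allowed = check_base_url("http://example.com", require_loopback=False)
```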
Reasoning models — disable "thinking". A reasoning/thinking model (e.g.
gemma-4-12B-it on oMLX) spends the token budget on chain-of-thought and
returns message.reasoning_content with an empty message.content (the rewrite
then fails closed and falls back). Disable thinking server-side (oMLX: turn off
the model's thinking; OpenAI-compatible flag chat_template_kwargs: {"enable_thinking": false} or reasoning_effort: "none" also works) — measured
~2–5s clean output vs ~30s+ with thinking on. Note model IDs are server-specific:
oMLX uses the bare name (gemma-4-12B-it-OptiQ-4bit), other MLX servers use the
mlx-community/… form.
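
The failure mode above can be detected from the response body. The sketch below assumes the standard OpenAI-compatible chat completion shape (choices[0].message); the function name and the status strings are hypothetical.

```python
def extract_rewrite(response: dict):
    """Return (content, status) from an OpenAI-compatible chat completion body."""
    choices = response.get("choices") or []
    message = (choices[0].get("message") if choices else None) or {}
    content = (message.get("content") or "").strip()
    if content:
        return content, "ok"
    if message.get("reasoning_content"):
        # The token budget went to chain-of-thought; fail closed so the fallback runs.
        return None, "empty_content_reasoning_only"
    return None, "empty_content"


clean = extract_rewrite({"choices": [{"message": {"content": "Tests are green."}}]})
thinking = extract_rewrite(
    {"choices": [{"message": {"content": "", "reasoning_content": "Let me think..."}}]}
)
empty = extract_rewrite({"choices": []})
```

Surfacing the reasoning-only case as its own status makes it easier to tell a model that needs thinking disabled apart from a server that returned nothing.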
Local TTS synthesis of a summary-length utterance is slow — Qwen3/Ryan via oMLX takes ~30–40s for ~200 chars, then playback adds ~15–20s — and the rewrite calls a model that serializes across concurrent projects on a shared local model. Running all of that inline would (a) trip the provider's HTTP read timeout, aborting long synths so the summary never plays, and (b) blow the harness hook budget.
So the Stop hook is a thin dispatcher: after its cheap guards + cooldown it
hands the raw input to a detached worker (scripts/speak_worker.py, spawned by
utils/announce.py with start_new_session=True) and returns immediately. The
worker owns the whole slow chain — transcript extract → produce_spoken rewrite →
TTS synth → playback — and survives the hook's exit, so audio duration is fully
decoupled from the hook budget and concurrent Stop hooks never queue model calls
inside it. The SessionStart greeting (short, once per session) detaches only its
playback via announce.speak_detached. Every attempt appends a durable line to
~/.<host>/voice/logs/announcer.log (result=ok|fail|no_text|…); the per-provider
status/omlx_tts.json keeps only the most recent synthesis.
Two settings make this robust: providers.omlx.read_timeout_seconds (default
120s, comfortably above worst-case synth) bounds only the detached worker, and
announce.cap_spoken_text caps a runaway summary.
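
The detached hand-off can be sketched with the standard library alone. This is not the code in utils/announce.py: the function name, the stdin hand-off of the payload, and the stand-in worker command are assumptions; only start_new_session=True comes from the text above.

```python
import json
import subprocess
import sys


def spawn_detached(worker_argv, payload: dict) -> subprocess.Popen:
    """Start a worker in its own session and hand it the raw hook input on stdin."""
    proc = subprocess.Popen(
        worker_argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the worker outlives the hook process (POSIX)
    )
    proc.stdin.write(json.dumps(payload).encode("utf-8"))
    proc.stdin.close()  # the hook can now return without waiting for the worker
    return proc


# Stand-in worker that only parses its input; the real worker would run
# transcript extraction, rewrite, TTS synthesis, and playback.
stand_in_worker = [sys.executable, "-c", "import json, sys; json.load(sys.stdin)"]
proc = spawn_detached(stand_in_worker, {"last_assistant_message": "Tests pass."})
```

In a real hook the caller would not wait on the returned process; waiting here would defeat the purpose of detaching.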
A broken pipeline must be heard, not silently logged. When synth/playback fails
the worker speaks a short audit notice via speak.speak_audit_failure — and it uses
the active persona's voice.macos voice + rate (macOS say, which needs no
key/quota and works even when the failing provider is the real TTS). So a failure in
a Vulcan project is heard in Daniel's voice and a failure in a Californian persona in
hers — when several projects run at once you can tell which one broke from the voice
alone. Give every persona a voice.macos block (voice + rate) so its failures are
identifiable; without one the notice uses the system default voice. (The rare
"couldn't even launch the announcer" case is voiced from the hook the same way.)
(/voice doctor)
scripts/voice_doctor.py gathers factual evidence into JSON: config summary,
credential presence (booleans, never values), each configured server's reachability,
recent announcer.log
outcomes, hook-install state, and a list of plain observations that encode the
failure modes we keep re-deriving (TTS read timeout too low for slow local synth,
configured server unreachable, recent result=fail, stale hook install). Run it and
interpret the evidence into a health read + concrete fixes. Per R32 it emits facts
and observations only — there is no numeric health score and no PASS/FAIL gate;
the judgment is yours, over the evidence. This is the runtime half of the
self-improvement loop: each hard-won lesson becomes a live, visible check instead of a
surprise the next time.

Two checks make that loop concrete (added 2026-06-10 after a missing-dependency
incident; see references/retro-2026-06-10-doctor-gaps.md):
- synth_path_dep_coverage: verifies the active provider's SDK is declared in the PEP 723 deps of every script that actually synthesizes (the detached scripts/speak_worker.py plus the hook entries), not merely that a credential exists. Catches the class where a refactor moves synthesis into a new entry script but leaves its inline deps short (e.g. worker declared only requests while Gemini needs google-genai). The synth_path_dep_coverage block carries the per-script matrix; a shortfall also emits a provider_dep_missing_on_synth_path observation.
- failure_signature: classifies announcer-log fail-lines and status.*.last_error text against the living FAILURE_SIGNATURES registry in voice_doctor.py into a typed cause + concrete fix, so the doctor interprets a known error string instead of dumping it raw. Add a row to that registry (and a runbook entry) whenever a new failure mode is root-caused; that is how the loop stays closed.

The checks field in the output is the tracked manifest of everything the doctor covers.
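
A signature registry of this kind might look like the sketch below. The patterns, cause names (other than provider_dep_missing_on_synth_path), and fix texts are hypothetical examples, not the contents of FAILURE_SIGNATURES in voice_doctor.py.

```python
import re

# Hypothetical rows: (pattern, typed cause, concrete fix).
FAILURE_SIGNATURES_SKETCH = [
    (
        re.compile(r"read timed out|ReadTimeout", re.IGNORECASE),
        "tts_read_timeout",
        "Raise read_timeout_seconds above worst-case synthesis time.",
    ),
    (
        re.compile(r"connection refused", re.IGNORECASE),
        "server_unreachable",
        "Start or reuse the configured server, then re-run the doctor.",
    ),
    (
        re.compile(r"No module named", re.IGNORECASE),
        "provider_dep_missing_on_synth_path",
        "Declare the provider SDK in the synthesizing script's PEP 723 deps.",
    ),
]


def classify_failure(error_text: str) -> dict:
    """Map a raw error string to a typed cause and fix; first matching row wins."""
    for pattern, cause, fix in FAILURE_SIGNATURES_SKETCH:
        if pattern.search(error_text):
            return {"cause": cause, "fix": fix}
    return {"cause": "unclassified", "fix": None, "raw": error_text}


known = classify_failure("ModuleNotFoundError: No module named 'google.genai'")
unknown = classify_failure("something new went wrong")
```

An unclassified result is the signal to root-cause the error and add a new row.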
Gemini TTS (gemini-3.1-flash-tts-preview) is an LLM that generates audio
directly. Prompts follow Google's official TTS prompting guide structure:
# AUDIO PROFILE: Character Name
### DIRECTOR'S NOTES
Style: Warm, witty, affectionate...
Pacing: Varied — brisk for wins, slower for reassurance
Accent: Modern British
### TRANSCRIPT
The actual text to speak.
The rewriter (Sonnet) produces {text, emotion, style, pacing, accent}.
The _build_gemini_prompt() function in speak.py renders the structured
prompt. Each TTS call is independent (32k token context window, per-call).
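
As an illustration of the structure above, the following sketch renders a prompt from the rewriter's output. It is not the _build_gemini_prompt() implementation in speak.py; the function name and the handling of missing fields are assumptions.

```python
def build_gemini_prompt_sketch(character: str, rewrite: dict) -> str:
    """Render the AUDIO PROFILE / DIRECTOR'S NOTES / TRANSCRIPT layout."""
    lines = [f"# AUDIO PROFILE: {character}", "### DIRECTOR'S NOTES"]
    for label in ("style", "pacing", "accent"):
        value = rewrite.get(label)
        if value:  # skip notes the rewriter did not provide
            lines.append(f"{label.capitalize()}: {value}")
    lines += ["### TRANSCRIPT", rewrite["text"]]
    return "\n".join(lines)


prompt = build_gemini_prompt_sketch(
    "Sami",
    {
        "text": "The build is green.",
        "emotion": "happy",
        "style": "Warm, witty",
        "pacing": "Brisk",
        "accent": "Modern British",
    },
)
```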
Gemini also supports inline audio tags ([whispers], [excited], [warmly]) as delivery modifiers; when a persona declares voice.gemini.preferred_audio_tags, the rewriter emits a transcript_with_tags field with sparse inline tags. Details: 03-platform-gemini.md §Audio Tags.
Custom personas without a voice.gemini block fall back to GEMINI_DEFAULTS
(Aoede voice, no style).
If the active persona has no voice.gemini block, proactively offer the onboarding from
references/onboarding/03-platform-gemini.md
(Steps G1–G6) instead of silently using GEMINI_DEFAULTS.
- preferred_audio_tags, also emits transcript_with_tags

When provider=local_tts, SessionStart first runs a bounded shared-server
readiness check. If a compatible loopback server is already running, it reuses
it. If none is running, it starts one detached shared MLX-Audio server. If the
server cannot become ready, SessionStart writes the diagnostic and skips the
greeting/rewrite work instead of silently falling back.
Codex currently installs only SessionStart, UserPromptSubmit, and
Stop. Do not install PreCompact for Codex: current Codex hook review UX
can surface it as needing review without offering a reliable approval path.
Claude also supports Notification, PreCompact, and SubagentStop
hooks. The shared defaults include a harmless notification flag for
Claude/project compatibility, but the Codex installer does not wire a
Notification hook.
Opencode installs the lifecycle adapter plugin, which maps Opencode's native event stream onto the same Python hooks:
| Opencode event | Voice hook | Notes |
|---|---|---|
| session.created | SessionStart | Fires once per new session |
| user message text part (message.part.updated) | UserPromptSubmit | First text part of a new user message |
| session.status with status.type === "idle" | Stop | Reliable in both TUI and opencode run; session.idle is deprecated. Passes the current turn's assistant text as last_assistant_message. Dispatch depends on the client (via OPENCODE_CLIENT): interactive desktop/TUI → plain awaited spawn (loop stays alive); opencode run one-shot / unknown client → detached via opencode/detach_hook.py so the 35-75s claude -p rewrite survives teardown. Detached is the safe default. Skips cleanly if the turn has no assistant text |
| session.compacted | PreCompact | Opencode surfaces compaction natively |
Opencode has no Notification/SubagentStop equivalents wired today (no native
event maps cleanly); those remain Claude-only. Full contract:
references/opencode-harness-contract.md.
Install Codex hooks with:
uv run .agents/skills/voice-assistant/scripts/install_hooks.py --target codex
Install the Opencode lifecycle adapter with:
uv run .opencode/skills/voice-assistant/scripts/install_hooks.py --target opencode
This writes the ESM plugin into <project>/.opencode/plugins/ (OpenCode cannot
reference a local plugin by path, so placement there is mandatory) and adopts
the project's existing .claude/.codex voice config so Opencode runs the
identical voice; if no config exists anywhere it points you at onboarding.
The plugin is installed as a real-file copy (never a symlink — Bun resolves a
symlink's target, which would break the plugin's import.meta.dir skill-path
anchor), so plugin edits need a re-install; hook/config edits are live via the
project skill symlink.
Notification and PreCompact speak fixed-template messages — no LLM rewrite.
Their text is resolved by utils/locales.py:pick_message() using a three-tier
priority chain:
1. persona["<event>_message"] (e.g. notification_message, precompact_message). Single string or list of variants (random.choice picks one per call so repeated messages don't feel monotonous).
2. .agents/locales/<primary_language>.json (e.g. .agents/locales/de.json). Same string-or-list shape.
3. .agents/locales/en.json. Always present.

Locales ship inside the skill under .agents/locales/ and are accessible to any
AI that loads this skill via the standard skill mount points. To add a new
language, drop a <lang>.json file in .agents/locales/ with the same keys
(notification, precompact) — no hook edits needed.
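
The three-tier lookup can be sketched as below. This is not the pick_message() in utils/locales.py: the signature is assumed, and the sample messages are placeholders written for the demo.

```python
import json
import random
import tempfile
from pathlib import Path


def pick_message_sketch(event, persona, locales_dir, primary_language):
    """Return the first non-empty message: persona, then language locale, then en."""
    candidates = [persona.get(f"{event}_message")]  # tier 1: persona override
    for lang in (primary_language, "en"):           # tier 2, then tier 3
        path = Path(locales_dir) / f"{lang}.json"
        if path.is_file():
            candidates.append(json.loads(path.read_text(encoding="utf-8")).get(event))
    for candidate in candidates:
        if isinstance(candidate, list) and candidate:
            return random.choice(candidate)  # vary repeated messages
        if isinstance(candidate, str) and candidate:
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    en = {"notification": "Your attention is needed.", "precompact": "Compacting context."}
    de = {"notification": ["Ich brauche dich kurz.", "Bitte kurz schauen."]}
    Path(tmp, "en.json").write_text(json.dumps(en), encoding="utf-8")
    Path(tmp, "de.json").write_text(json.dumps(de), encoding="utf-8")
    from_persona = pick_message_sketch("notification", {"notification_message": "Heads up."}, tmp, "de")
    from_locale = pick_message_sketch("notification", {}, tmp, "de")
    from_english = pick_message_sketch("precompact", {}, tmp, "de")
```

In the demo, the German locale has no precompact entry, so that event falls through to the English file.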
For a German-speaking project, set primary_language: "de" in the project's
tts_config.json; the German locale defaults will be used automatically for
any persona that doesn't override.
SessionStart and Stop use full LLM rewrites and follow the persona's voice
naturally — they don't need locale files. SubagentStop plays only a chime.
UserPromptSubmit reads acknowledgments directly from the persona's
acknowledgments array (translate per-persona).
(/voice setup with local_tts provider)
Cross-reference only. SKILL.md describes the provider-choice gate; the setup steps live elsewhere:
- references/onboarding/05-platform-local-tts.md.
- references/local-tts-onboarding.md.
- defaults/local-tts-models.json project_language_rules.

(/voice setup with Gemini provider)
For the persona-side onboarding conversation for Gemini, see
references/onboarding/03-platform-gemini.md
(Steps G1–G6 plus pre-flight checks). Voice catalog and official traits:
defaults/gemini-voices.md.
| Tier | Path | Purpose |
|---|---|---|
| Project | .codex/tts_config.json | Per-project Codex settings |
| Project | .claude/tts_config.json | Per-project Claude fallback |
| Project | .opencode/tts_config.json | Per-project Opencode voice settings |
| Shared global | ~/.skills/voice-assistant/preferences.json | Host-agnostic user defaults |
| Global | ~/.codex/voice/preferences.json | User-wide Codex defaults |
| Global | ~/.claude/voice/preferences.json | User-wide Claude fallback |
| Global | ~/.config/opencode/voice/preferences.json | User-wide Opencode defaults |
| Shared global | ~/.skills/voice-assistant/profiles/*.json | Custom personas available to all hosts |
| Global | ~/.codex/voice/profiles/*.json | Custom Codex personas |
| Global | ~/.claude/voice/profiles/*.json | Custom Claude fallback personas |
| Shared global | ~/.skills/voice-assistant/credentials.json | Recommended user-level API keys (Voxtral, Gemini, local bearer tokens) |
| Global | ~/.codex/voice/credentials.json | Optional Codex-specific API key overrides |
| Global | ~/.claude/voice/credentials.json | Optional Claude-specific API key overrides |
| Global | ~/.config/opencode/voice/credentials.json | Optional Opencode-specific API key overrides |
| Env var | GEMINI_API_KEY | Gemini API key fallback when no file key exists |
| Env var | MISTRAL_API_KEY | Mistral/Voxtral API key fallback when no file key exists |
| Defaults | <skill>/defaults/settings.json | Fallback settings |
| Defaults | <skill>/defaults/personas/*.json | Built-in personas |
| Locales | <skill>/.agents/locales/en.json | English message templates for Notification + PreCompact (always present, ultimate fallback) |
| Locales | <skill>/.agents/locales/<lang>.json | Localized message templates loaded when config.primary_language matches (e.g. de.json, nl.json) |
Do not store API keys in the skill folder, project config, persona JSON, hook settings, or repository files. The workable second-best credential model is:
- Keep user-level keys in ~/.skills/voice-assistant/credentials.json.
- Do not treat ~/.voice-assistant or ~/.iurfriend-skills as silent fallbacks. If a legacy credentials file exists, ask the user before copying or moving secrets into the shared skill root and keep the resulting file mode at 600.

Credential resolution merges the shared file first, then the selected host's credential file, then process environment values when file keys are absent. Host-specific keys override shared keys for that host only.
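The merge order described above can be sketched like this. It is an illustration under assumptions, not the skill's resolver: the function name, the dict-based inputs, and the `gemini_api_key` key name are hypothetical (only `mistral_api_key` and the two environment variable names appear in this document).

```python
import os

# Hypothetical mapping from credential keys to their env-var fallbacks.
ENV_FALLBACKS = {
    "gemini_api_key": "GEMINI_API_KEY",
    "mistral_api_key": "MISTRAL_API_KEY",
}


def resolve_credentials(shared, host, env=None):
    """Merge shared file keys, then host file keys, then env fallbacks."""
    env = os.environ if env is None else env
    # Start from the shared file, ignoring empty values.
    merged = {k: v for k, v in shared.items() if v}
    # Host-specific keys override shared keys for that host only.
    merged.update({k: v for k, v in host.items() if v})
    # Environment values fill in only when no file supplied the key.
    for key, var in ENV_FALLBACKS.items():
        if key not in merged and env.get(var):
            merged[key] = env[var]
    return merged
```

Because the environment is consulted last and only for absent keys, exporting GEMINI_API_KEY does not shadow a key already stored in either credentials file.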
For managed installs and tests, VOICE_ASSISTANT_SHARED_DIR may point at an
explicit shared directory. VOICE_ASSISTANT_SKILLS_ROOT may point at a
different parent root; the default shared directory remains
~/.skills/voice-assistant.
For onboarding, never ask the user to paste an API key into chat. Use the local helper instead:
uv run <skill>/scripts/voice.py credentials status --target auto
uv run <skill>/scripts/voice.py credentials set gemini --target shared
uv run <skill>/scripts/voice.py credentials set voxtral --target shared
uv run <skill>/scripts/voice.py credentials verify gemini --target auto
credentials set voxtral stores mistral_api_key for cloud Voxtral through
the Mistral API. It is not used for local Voxtral under provider=local_tts.
The helper prompts in the local terminal with hidden input and writes only to a
user-global credential file with restricted file permissions when supported by
the filesystem.
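For readers who want to see what a restricted-permission write can look like on a POSIX filesystem, here is a minimal sketch. It is not the helper's actual code; `write_credentials` is a hypothetical name, and the hidden-input prompt is left out.

```python
import json
import os


def write_credentials(path, data):
    """Write a credentials dict as JSON with owner-only permissions (0o600)."""
    # Create the file with mode 600 from the start so the secret is never
    # briefly readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    # Tighten the mode as well in case the file already existed with looser bits.
    os.chmod(path, 0o600)
```

On filesystems that do not honor POSIX modes, the chmod call may have no effect, which matches the "when supported by the filesystem" caveat above.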
The skill can be installed globally for the user or copied into a project. When
project hooks point at .agents/skills/voice-assistant, that project-local copy
is the code that actually runs. Global preferences and profiles still merge in,
but runtime behavior comes from the hooked project skill. Keep the Codex copy
(.agents/skills/voice-assistant) and Claude copy (.claude/skills/voice-assistant)
synchronized unless an agent-specific hook difference is intentional.
Opencode does not use the Claude/Codex hook JSON formats. It loads ESM plugins
from .opencode/plugins/ via its Bun runtime. The voice-assistant Opencode
lifecycle adapter is .opencode/plugins/voice-assistant.plugin.ts, installed by
install_hooks.py --target opencode. It subscribes to Opencode's event stream
and dispatches the SAME project-local Python hooks the Claude/Codex installers
use (via scripts/hook_wrapper.sh + stdin JSON) — it adds no voice logic of its
own, so the Python hooks stay the single runtime source of truth. For Opencode,
keep .opencode/skills/voice-assistant, .opencode/tts_config.json, and
.opencode/agents/*.md synchronized with the shared project persona, and run
the installer to (re)generate the adapter. Full harness contract + event mapping:
references/opencode-harness-contract.md.
The voice.gemini block schema is authoritative in
references/onboarding/03-platform-gemini.md §The voice.gemini block.
The optional preferred_audio_tags field (persona tag bias for the rewriter)
is documented in the same file, §Audio Tags. Working examples: defaults/personas/*.json.
Local TTS is a generic OpenAI-compatible HTTP adapter. Hooks never import MLX,
load models directly, or choose a hidden model or voice. The user must configure
both providers.local_tts transport settings and each persona's
voice.local_tts model/voice values before switching the provider. When
provider=local_tts, SessionStart may run the shared server manager to reuse or
start a detached bundled MLX-Audio server after making failures visible.
Project or global settings:
    {
      "provider": "local_tts",
      "primary_language": "en",
      "providers": {
        "local_tts": {
          "base_url": "http://127.0.0.1:8000",
          "endpoint": "/v1/audio/speech",
          "models_endpoint": "/v1/models",
          "api_key_env": null,
          "credential_key": null,
          "headers": {},
          "timeout_seconds": 20,
          "connect_timeout_seconds": 2,
          "read_timeout_seconds": 20,
          "retry_count": 0,
          "response_format": "wav",
          "request_format": "openai_audio_speech",
          "response_encoding": "binary",
          "server": {
            "mode": "shared",
            "auto_start_on_session_start": true,
            "startup_timeout_seconds": 30,
            "session_start_timeout_seconds": 8
          },
          "health": {
            "enabled": true,
            "method": "GET",
            "endpoint": "/health",
            "timeout_seconds": 2,
            "healthy_statuses": [200, 204],
            "treat_404_as_reachable_warning": true
          },
          "fallback": {
            "enabled": false,
            "provider": "macos",
            "on": ["connection_error", "timeout"]
          },
          "diagnostics": {
            "status_filename": "local_tts.json",
            "redact_headers": ["authorization", "x-api-key"]
          }
        }
      }
    }
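One plausible reading of the health block is sketched below. The interpretation is an assumption: the document does not define the semantics of treat_404_as_reachable_warning, and the function name and the three result labels are hypothetical.

```python
def classify_health(status_code, health):
    """Map an HTTP status from the health endpoint to a coarse state.

    Assumed semantics: a 404 means the server answered but exposes no
    health route, so it is reachable and worth a warning, not a failure.
    """
    if status_code in health.get("healthy_statuses", [200]):
        return "healthy"
    if status_code == 404 and health.get("treat_404_as_reachable_warning"):
        return "reachable_warning"
    return "unhealthy"
```

With the settings shown above, 200 and 204 count as healthy, 404 is downgraded to a warning, and anything else is treated as unhealthy.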
The persona-side voice.local_tts schema (model/voice/emotion_map/custom_voice) is
authoritative in references/onboarding/05-platform-local-tts.md §The voice.local_tts block.
PocketTTS model IDs and the voice list are in
references/local-tts-onboarding.md §PocketTTS.
Switch explicitly with /voice provider local_tts or by setting
"provider": "local_tts" in .codex/tts_config.json or the global Codex voice
preferences. The resolver intentionally has no hidden Kokoro or af_heart
defaults: missing model returns missing_model, and missing voice returns
missing_voice.
Custom voice/reference-audio support is intentionally blocked in Phase 2. A
persona with voice.local_tts.custom_voice.enabled: true must return
unsupported_custom_voice after consent/reference-audio validation; it must not
silently clone a voice or fall back to an implicit voice.
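The no-hidden-defaults rule can be illustrated with a small resolver sketch. The error codes come from the text above; the function name, the return shape, and the order of the checks are assumptions, and the consent/reference-audio validation that precedes unsupported_custom_voice is omitted.

```python
def resolve_local_tts_voice(persona):
    """Return the configured model/voice or an explicit error code.

    Sketch only: no default model or voice is ever substituted.
    """
    block = (persona.get("voice") or {}).get("local_tts") or {}
    custom = block.get("custom_voice") or {}
    if custom.get("enabled") is True:
        # Phase 2 blocks custom voices; never clone or fall back silently.
        return {"ok": False, "error": "unsupported_custom_voice"}
    if not block.get("model"):
        return {"ok": False, "error": "missing_model"}
    if not block.get("voice"):
        return {"ok": False, "error": "missing_voice"}
    return {"ok": True, "model": block["model"], "voice": block["voice"]}
```

The point of returning a code instead of a fallback value is that the caller can surface the exact gap to the user before any audio is produced.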
The omlx provider reuses the entire local_tts synth/play/cache/fallback
engine; its config and persona blocks mirror local_tts, just under the omlx
keys and pointed at the oMLX server (default :8001).
    {
      "provider": "omlx",
      "primary_language": "en",
      "providers": {
        "omlx": {
          "base_url": "http://127.0.0.1:8001",
          "endpoint": "/v1/audio/speech",
          "credential_key": "omlx_api_key",
          "diagnostics": { "status_filename": "omlx_tts.json" }
        }
      }
    }
The active persona must carry a voice.omlx block whose model + voice the
loaded oMLX TTS model actually serves:
{ "voice": { "omlx": { "model": "mlx-community/Voxtral-4B-TTS-2603-mlx-4bit", "voice": "<a voice the model serves>", "response_format": "wav" } } }
Differences from local_tts: no server block and no auto-start (oMLX is
run by its own app/CLI, e.g. omlx serve … --port 8001), plus a separate
omlx_tts.json status file. The bearer token (oMLX is auth-gated) resolves from
~/.skills/voice-assistant/credentials.json via credential_key — store it
with scripts/setup_credentials.py, never in project config.
Verification gate — do this before relying on omlx for speech. The
provider ships regardless, but oMLX may only have an LLM (e.g. gemma) loaded:
# 1) confirm a TTS model is loadable (auth-gated)
curl -H "Authorization: Bearer $TOKEN" http://127.0.0.1:8001/v1/models
# 2) confirm /v1/audio/speech actually returns audio bytes
curl -s -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
-d '{"model":"<tts-model-id>","input":"oMLX speech test.","voice":"<voice>"}' \
http://127.0.0.1:8001/v1/audio/speech --output /tmp/omlx_test.wav && file /tmp/omlx_test.wav
If only the LLM is loaded, load a TTS model first (via omlx serve / the oMLX
app) before switching provider to omlx. The Kokoro local_tts path (:8000)
is unaffected by configuring omlx.
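When scripting the second check instead of eyeballing the `file` output, the response body can be tested for a WAV container header. The sketch below covers only that check; `looks_like_wav` is a hypothetical helper, and it assumes the request asked for WAV output.

```python
def looks_like_wav(data):
    """Return True if the bytes start with a RIFF/WAVE container header."""
    # A WAV file begins with b"RIFF", a 4-byte size, then b"WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```

A JSON error body, which is what a server with only an LLM loaded is likely to return, fails this check immediately.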
defaults/tts-lenses.json
Different TTS models behave differently along three axes: voice strategy
(named voices vs zero-shot clone), emotion (voice-map vs instruct-scene vs
none), and transport (oMLX :8001 vs mlx-audio :8000). A lens is a
per-model preset that encodes exactly that, so a persona can pick a model by
name instead of restating its quirks. A persona opts in with
voice.<provider>.lens: "<name>"; speak() resolves it and maps the turn's
emotion onto the model's controls (no-op when no lens is set). The default voice
across lenses is a happy, cheerful female.
{ "voice": { "omlx": { "lens": "voxtral-omlx" } } }
| Lens | Transport | Emotion mechanism | Status |
|---|---|---|---|
| voxtral-omlx | oMLX :8001 | voice-map (emotion → a named voice; cheerful_female, de_*, nl_*, …) | ✅ verified |
| pocket-omlx | oMLX :8001 | none (single voice) | ✅ verified |
| kokoro-local | mlx-audio :8000 | none (voice selection) | ✅ verified |
| higgs-omlx | oMLX :8001 | instruct-scene (emotion → natural-language instruct) | ⏳ gated on oMLX (Higgs adapter bug) |
voice_map lenses set the voice from the emotion (and expose emotion_map so
the synth layer re-resolves per turn). instruct_scene lenses (Higgs) put a
scene string in extra_body.instruct and thread the reference-clone
ref_audio/ref_text (base64 audio) + temperature — that path is blocked by
an oMLX-side bug (Kokoro via oMLX likewise fails on oMLX's torch; both keep
working on their healthy paths). Lens resolution + emotion mapping live in
utils/lens.py; the registry is defaults/tts-lenses.json.
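To make the voice_map behaviour concrete, here is a minimal sketch of lens resolution. It does not reproduce utils/lens.py: the registry field names, the emotion names, and every voice name except cheerful_female are hypothetical.

```python
# Hypothetical registry shaped after the lens table; the real entries live in
# defaults/tts-lenses.json and may use different field names.
EXAMPLE_LENSES = {
    "voxtral-omlx": {
        "emotion": "voice_map",
        "default_voice": "cheerful_female",
        "voice_map": {"happy": "cheerful_female", "calm": "example_calm_voice"},
    },
    "pocket-omlx": {"emotion": "none", "default_voice": "example_single_voice"},
}


def apply_lens(voice_block, emotion, lenses=EXAMPLE_LENSES):
    """Resolve a persona's voice.<provider> block against its lens, if any."""
    name = voice_block.get("lens")
    if not name:
        # No lens set: leave the persona's explicit values untouched.
        return dict(voice_block)
    lens = lenses[name]  # an unknown lens raises KeyError instead of guessing
    resolved = dict(voice_block)
    if lens["emotion"] == "voice_map":
        # The turn's emotion selects a named voice; unmapped emotions use the default.
        resolved["voice"] = lens["voice_map"].get(emotion, lens["default_voice"])
    else:
        resolved.setdefault("voice", lens["default_voice"])
    return resolved
```

Calling `apply_lens({"lens": "voxtral-omlx"}, "calm")` picks the mapped voice for that turn, while a block without a `lens` key comes back unchanged.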
The shared server runtime, the MLX-Audio pin commit, the Voxtral pre-download,
the fail-loud rules, and the missing-server notice are authoritative in
references/local-tts-onboarding.md §Shared Server Runtime.
uv run <skill>/scripts/voice.py server ensure --timeout 30
- <skill>/requirements.md: Project-local EliteExperts requirements for SAMI with Apple Silicon, MLX-Audio, Voxtral, persona, and setup gates
- <skill>/references/troubleshooting-and-audit.md: Start here when voice misbehaves. Ordered debugging runbook (reproduce the Stop hook with stderr, failure-signature table, rewriter/Gemini/credential checks) plus a copy-paste audit rubric to certify a healthy install. Captures the known failure modes (rewrite timeout, MCP-overflow on Haiku, preview-model throttle) and the fail-loud design invariants.
- <skill>/references/timeouts.md: Single reference for every timeout. Knob table (timeouts.* in settings.json / per-project tts_config.json: greeting_rewrite_seconds, summary_rewrite_seconds, rewrite_hook_headroom_seconds), the values derived from them (hook ceilings = budget + headroom, via utils/timeouts.py), the fixed safety ceilings that must NOT be tuned, and the known no-timeout gap on the Gemini synth call. Read before changing any timeout.
- <skill>/references/opencode-harness-contract.md: Opencode lifecycle adapter: the harness plugin contract, event → voice-hook mapping, stdin/env/project-root preservation, install/uninstall, and the supported/unsupported event matrix
- <skill>/references/local-tts-onboarding.md: Local provider readiness, model guide, setup steps, installation model, custom voice position, and fail-loud rules
- <skill>/references/voxtral-best-practices.md: Voxtral TTS local prompting best practices (Voice-as-an-Instruction paradigm, 20-preset voice catalog verified against Hugging Face model card, CC BY-NC 4.0 production caveat, provider-asymmetry vs. Gemini audio tags)
- <skill>/references/onboarding/00-master-flow.md: Master conversation flow for onboarding a fresh user end-to-end (technical + persona + platform); start here for new-user setup
- <skill>/references/onboarding/01-persona-anatomy.md: Field-by-field reference for every persona JSON key (built-in vs user-tier, validation rules, anti-patterns)
- <skill>/references/onboarding/02-persona-naming-and-tone.md: Conversation script for naming, user_name, relationship, traits, language alignment
- <skill>/references/onboarding/02b-persona-from-character-concept.md: Generator path: derive a complete persona from a character concept (triggers, per-field mapping heuristics, worked examples, validation loop)
- <skill>/references/onboarding/03-platform-gemini.md: Gemini-specific persona block (voice_name, audio_profile, scene, directors_notes, sample_context) with onboarding prompts
- <skill>/references/onboarding/04-platform-voxtral.md: Voxtral cloud persona block (primary_voice + emotion-mapping) with onboarding prompts
- <skill>/references/onboarding/05-platform-local-tts.md: Local TTS persona block (model + voice + emotion_map + language coordination) with onboarding prompts
- <skill>/defaults/local-tts-models.json: AI-facing local model/language compatibility table with E2E verification status
- <skill>/defaults/gemini-voices.md: All 30 voices with official traits, 87 supported languages with BCP-47 codes, pacing examples, prompting guide
- <skill>/defaults/voxtral-voices.md: Voxtral voice reference
- <skill>/.agents/locales/*.json: Hook message templates per language (Notification + PreCompact); resolved by utils/locales.py:pick_message() with priority persona override > locale > English fallback. Add a new language by dropping <lang>.json with the same keys; no hook edits required.
- <skill>/scripts/check_requirements.py --target both: Project-local requirements check for EliteExperts/SAMI; installs nothing automatically and prints concrete questions for the user when prerequisites are missing
- <skill>/scripts/preview_voice.py: Preview a Gemini voice: uv run preview_voice.py --voice-name <name> [--text "sample"]
- <skill>/scripts/update_gemini_config.py: Update persona Gemini config: uv run update_gemini_config.py --persona <id> [--voice-name X] [--style "..."] [--pacing "..."] [--accent "..."]
- <skill>/scripts/warm_cache.py: Pre-generate acknowledgment audio cache: uv run warm_cache.py [--persona <id>]
- <skill>/scripts/voice.py status|doctor|on|off|provider: Thin AI-facing status and switching helper
- <skill>/scripts/voice.py local-models --language <code>: Show local TTS model/language compatibility before choosing local_tts model/voice
- <skill>/scripts/test_tts.py: Test TTS with current config
- <skill>/scripts/voice.py server status|ensure|start|stop|logs: Manage the shared local TTS server via the AI-facing helper
- <skill>/scripts/voice_server.py status|ensure|start|stop|logs: Direct shared local TTS server manager; PID/logs live in ~/.skills/voice-assistant/local-tts-server/
- <skill>/scripts/mlx_audio_server.py --host 127.0.0.1 --port 8000: Run a local MLX-Audio OpenAI-compatible server with voice-assistant-safe defaults
- <skill>/scripts/local_tts.py status: Read durable local TTS diagnostics
- <skill>/scripts/local_tts.py check: Check local server health
- <skill>/scripts/local_tts.py doctor: Check readiness and print missing-server onboarding notices
- <skill>/scripts/local_tts.py test --text "Hello": Generate and play a test phrase
- <skill>/scripts/local_tts.py models: List models when supported
- <skill>/scripts/test_tts.py --provider local_tts: Exercise local TTS through normal provider dispatch