Voice Configuration

Device Agent uses the voice channel for speech recognition (ASR) and text-to-speech (TTS). Voice requests enter the selected device session; device online state, command execution, and reporting still use MQTT or the device SDK.

Console Configuration

Open http://127.0.0.1:3000, go to Settings → Voice, and configure:

| Field | Notes |
| --- | --- |
| Voice enabled | Controls whether the /ws/voice channel is available. |
| Voice WebSocket URL | Read-only in the UI. The default is ws://127.0.0.1:3001/ws/voice. Browser and device clients must be able to reach it. |
| Speech provider | Select volcengine, aliyun, aws, or elevenlabs. |
| Region | Supports cn, us, eu, and global; this affects eligible providers. |
| Speech recognition | ASR model, language, or resource ID. Fields vary by provider. |
| Text-to-speech | TTS model, voice, and sample rate. Fields vary by provider. |
| Provider credentials | API key, access key, AWS credentials, or equivalent provider secrets. |

Provider, model, voice, and credential changes can be saved from the page. Enablement, bind host, port, or TLS changes require a service restart.

Speech Providers

| Provider | Regions | Main settings |
| --- | --- | --- |
| Volcengine (volcengine) | cn, global | VOLCENGINE_SPEECH_APP_ID, VOLCENGINE_SPEECH_ACCESS_KEY, plus ASR/TTS resource IDs, language, voice, and sample rate. |
| Aliyun DashScope (aliyun) | cn, global | ALIYUN_DASHSCOPE_API_KEY or QWEN_API_KEY, plus ASR model, TTS model, voice, and sample rate. |
| AWS (aws) | us, eu, global | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, plus Transcribe language and Polly voice. |
| ElevenLabs (elevenlabs) | us, eu, global | ELEVENLABS_API_KEY, API endpoint, ASR model, TTS model, voice, and sample rate. |

VOICE_REGION=cn allows Volcengine or Aliyun; us allows AWS or ElevenLabs; eu requires an AWS eu-* region or the ElevenLabs EU residency endpoint; global does not restrict providers.
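
The region rule can be expressed as a small lookup, which is handy for rejecting an ineligible combination before the service starts. The sketch below is a minimal illustration built only from the provider table; `voice_region_allows` is a hypothetical helper, not a Device Agent command, and it does not verify the ElevenLabs EU residency endpoint.

```bash
# Hypothetical helper: succeeds (exit 0) when PROVIDER is eligible for REGION,
# following the provider table above. Not part of Device Agent itself.
# Usage: voice_region_allows REGION PROVIDER [AWS_REGION]
voice_region_allows() {
  region=$1 provider=$2 aws_region=${3:-}
  case "$region:$provider" in
    global:volcengine|global:aliyun|global:aws|global:elevenlabs) return 0 ;;
    cn:volcengine|cn:aliyun) return 0 ;;
    us:aws|us:elevenlabs) return 0 ;;
    eu:elevenlabs) return 0 ;;  # assumption: the EU residency endpoint is checked elsewhere
    eu:aws)
      # eu only accepts AWS when AWS_REGION is an eu-* region.
      case "$aws_region" in
        eu-*) return 0 ;;
        *) return 1 ;;
      esac ;;
    *) return 1 ;;
  esac
}

voice_region_allows cn aliyun && echo "cn + aliyun: eligible"
voice_region_allows eu aws us-east-1 || echo "eu + aws (us-east-1): not eligible"
```

Returning an exit status rather than printing a message keeps the helper usable in `if` conditions and startup scripts.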

.env Configuration

Use .env for first startup, container deployment, or environments without UI access. You can switch the provider later from Settings → Voice.

Aliyun DashScope:

```bash
VOICE_ENABLED=true
VOICE_REGION=cn
ALIYUN_DASHSCOPE_API_KEY=sk-...
ALIYUN_ASR_MODEL=paraformer-realtime-v2
ALIYUN_TTS_MODEL=cosyvoice-v3-flash
ALIYUN_TTS_VOICE=longanyang
```

Volcengine:

```bash
VOICE_ENABLED=true
VOICE_REGION=cn
VOLCENGINE_SPEECH_APP_ID=...
VOLCENGINE_SPEECH_ACCESS_KEY=...
VOLCENGINE_TTS_VOICE=zh_female_shuangkuaisisi_moon_bigtts
```

AWS:

```bash
VOICE_ENABLED=true
VOICE_REGION=us
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1
AWS_TRANSCRIBE_LANGUAGE_CODE=en-US
AWS_POLLY_VOICE=Joanna
```

ElevenLabs:

```bash
VOICE_ENABLED=true
VOICE_REGION=us
ELEVENLABS_API_KEY=...
ELEVENLABS_TTS_MODEL_ID=eleven_multilingual_v2
ELEVENLABS_TTS_VOICE=UgBBYS2sOqTuMpoF3BR0
```
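
Each of the four provider examples above needs its credentials set before the service starts, so a pre-start check can catch an incomplete .env early. The following is a rough sketch: `missing_voice_vars` is a hypothetical helper, and treating exactly these variables as required is an assumption drawn from the provider table, not a documented rule.

```bash
# Hypothetical pre-start check: prints the credential variables from the
# provider table that are unset or empty for the given provider.
# Assumption: only the credential variables listed here are strictly required.
# Usage: missing_voice_vars PROVIDER
missing_voice_vars() {
  case "$1" in
    volcengine) names="VOLCENGINE_SPEECH_APP_ID VOLCENGINE_SPEECH_ACCESS_KEY" ;;
    aliyun)
      # Either key is accepted, so report only when both are absent.
      if [ -z "${ALIYUN_DASHSCOPE_API_KEY:-}${QWEN_API_KEY:-}" ]; then
        echo "ALIYUN_DASHSCOPE_API_KEY"
      fi
      return 0 ;;
    aws) names="AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION" ;;
    elevenlabs) names="ELEVENLABS_API_KEY" ;;
    *) echo "unknown provider: $1" >&2; return 2 ;;
  esac
  for name in $names; do
    eval "value=\${$name:-}"  # POSIX-compatible indirect variable lookup
    [ -n "$value" ] || echo "$name"
  done
  return 0
}
```

If the .env file uses plain shell-compatible `KEY=value` lines, it can be loaded into the current shell with `set -a; . ./.env; set +a` before calling, for example, `missing_voice_vars aws`. Empty output means nothing in the checked set is missing.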

Bind Address and TLS

The voice service listens on 127.0.0.1:3001 by default. For LAN or server access, set:

```bash
VOICE_HOST=0.0.0.0
VOICE_PORT=3001
```

For production use or HTTPS console access, enable TLS:

```bash
VOICE_TLS_ENABLED=true
VOICE_TLS_CERT_FILE=/path/to/cert.pem
VOICE_TLS_KEY_FILE=/path/to/key.pem
```
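
Because TLS changes only take effect after a restart, it can help to confirm that both files are readable before restarting. This is a minimal sketch; `check_voice_tls_files` is a hypothetical pre-flight helper, not something Device Agent provides, and it checks readability only, not certificate validity.

```bash
# Hypothetical pre-flight check, not a Device Agent command.
# Succeeds when TLS is disabled, or when both configured files are readable.
check_voice_tls_files() {
  [ "${VOICE_TLS_ENABLED:-false}" = "true" ] || return 0
  for file in "${VOICE_TLS_CERT_FILE:-}" "${VOICE_TLS_KEY_FILE:-}"; do
    if [ -z "$file" ] || [ ! -r "$file" ]; then
      echo "TLS file missing or unreadable: ${file:-<unset>}" >&2
      return 1
    fi
  done
  return 0
}
```

Run it as the same user that runs the voice service, since file permissions are what the `-r` test evaluates.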

After configuration, the console shows the voice WebSocket URL that clients should use.
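
For scripted clients, the URL can also be assembled from the same settings. The sketch below assumes the scheme switches from ws:// to wss:// when TLS is enabled, which this page does not state explicitly; the value shown in the console remains authoritative. `voice_ws_url` is a hypothetical helper name.

```bash
# Hypothetical helper: builds the URL a client would connect to.
# Assumption: wss:// is used when TLS is enabled; /ws/voice is the documented path.
# Usage: voice_ws_url CLIENT_HOST [PORT] [TLS_ENABLED]
voice_ws_url() {
  host=$1 port=${2:-3001} tls=${3:-false}
  if [ "$tls" = "true" ]; then scheme=wss; else scheme=ws; fi
  printf '%s://%s:%s/ws/voice\n' "$scheme" "$host" "$port"
}

voice_ws_url 127.0.0.1                    # ws://127.0.0.1:3001/ws/voice
voice_ws_url agent.example.com 3001 true  # wss://agent.example.com:3001/ws/voice
```

Pass a hostname or address that clients can actually reach: 0.0.0.0 is a bind wildcard for the server, not a destination for clients. The hostname agent.example.com is a placeholder.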

For usage flow, see Voice Interaction. For protocol messages, see API Reference.