Volcano Engine Speech Services

Volcano Engine Real-Time Conversational AI provides core capabilities such as RTC audio/video transmission, ASR speech recognition, and TTS speech synthesis. Developers can integrate their own AI backends through the CustomLLM mode to build voice-driven intelligent interactions.

What Are Volcano Engine Speech Services

Volcano Engine Real-Time Conversational AI is an end-to-end voice interaction solution that enables intelligent agents to “hear, speak, see, and reason.” It is suitable for scenarios such as AI assistants, AI customer service, AI companionship, AI spoken-language learning, and intelligent hardware.

Core Components

RTC (Real-Time Audio and Video)

Responsible for audio and video transmission between clients and the cloud.

  • Based on the WebRTC protocol, supporting mainstream browsers
  • Multi-platform SDKs: Web (@volcengine/rtc), iOS, Android, Windows, Linux, macOS
  • Built-in AI noise suppression (AI-ANS) to filter environmental noise
  • Binary message channel for transmitting structured data such as subtitles and status, as shown in the sketch after this list
  • Strong resilience to weak network conditions, ensuring reliable transmission in complex environments
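
For orientation, here is a minimal sketch of a web client joining a room with @volcengine/rtc and listening on the binary message channel. The method and event names (createEngine, joinRoom, startAudioCapture, onRoomBinaryMessageReceived) and option fields are assumptions about the Web SDK and may differ between versions; the app ID, token, and room/user IDs are placeholders, and the binary payload is naively decoded as UTF-8 for illustration rather than parsed according to the official message format.

```ts
// Hedged sketch, not an official example: the SDK names and options below
// are assumptions about the @volcengine/rtc Web SDK and should be checked
// against the SDK reference for your version.
import VERTC from '@volcengine/rtc';

const APP_ID = '<your-app-id>';   // placeholder
const ROOM_ID = '<room-id>';      // placeholder
const USER_ID = '<user-id>';      // placeholder
const TOKEN = '<room-token>';     // issued by your own server

async function startVoiceSession(): Promise<void> {
  const engine = VERTC.createEngine(APP_ID);

  // Subtitles and agent status arrive as structured data on the binary channel.
  engine.on(VERTC.events.onRoomBinaryMessageReceived, (event: { message: ArrayBuffer }) => {
    // Naive UTF-8 decode for illustration; real payloads follow the
    // conversational AI message format described in the docs.
    console.log('binary message:', new TextDecoder().decode(event.message));
  });

  await engine.joinRoom(TOKEN, ROOM_ID, { userId: USER_ID }, {
    isAutoPublish: true,        // publish the local microphone track
    isAutoSubscribeAudio: true, // play the agent's synthesized speech
  });
  await engine.startAudioCapture(); // start microphone capture
}

startVoiceSession().catch(console.error);
```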

ASR (Automatic Speech Recognition)

Converts user speech into text in real time.

  • Streaming recognition with real-time transcription
  • Supports multiple languages, including Chinese, English, Japanese, and Spanish
  • Supports hotword configuration to improve recognition accuracy for domain-specific terms
  • Frame-level Voice Activity Detection (VAD) for accurate speech start and end detection
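
The options above are easier to picture as a concrete configuration. The sketch below is illustrative only: the field names are hypothetical and do not reproduce the official ASR configuration schema; they simply map the knobs described in this list (language, hotwords, VAD) onto a typed object.

```ts
// Illustrative only: hypothetical field names, not the official ASR schema.
interface AsrConfig {
  language: 'zh-CN' | 'en-US' | 'ja-JP' | 'es-ES';
  streaming: boolean;          // emit partial transcripts while the user speaks
  hotwords: string[];          // boost recognition of domain-specific terms
  vad: {
    enabled: boolean;          // frame-level voice activity detection
    silenceTimeoutMs: number;  // pause length that ends an utterance
  };
}

const asrConfig: AsrConfig = {
  language: 'zh-CN',
  streaming: true,
  hotwords: ['EMQX', 'MQTT', 'MCP'],
  vad: { enabled: true, silenceTimeoutMs: 800 },
};
```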

TTS (Text-to-Speech)

Converts AI-generated text responses into natural-sounding speech.

  • Streaming synthesis with low latency
  • Multiple voice options (male, female, and different styles)
  • Supports adjustment of speech rate, pitch, and volume
  • Supports emotional synthesis (e.g., happy, calm)
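
As with ASR, a small illustrative configuration makes the knobs concrete. The field names below are hypothetical rather than the official TTS schema; they only mirror the options listed above (voice, rate, pitch, volume, emotion).

```ts
// Illustrative only: hypothetical field names, not the official TTS schema.
interface TtsConfig {
  voiceType: string;            // voice/timbre identifier
  speedRatio: number;           // 1.0 = normal speaking rate
  pitchRatio: number;           // 1.0 = default pitch
  volumeRatio: number;          // 1.0 = default loudness
  emotion?: 'happy' | 'calm';   // optional emotional style
}

const ttsConfig: TtsConfig = {
  voiceType: '<voice-id>',      // placeholder voice identifier
  speedRatio: 1.0,
  pitchRatio: 1.0,
  volumeRatio: 1.0,
  emotion: 'calm',
};
```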

LLM (Large Language Models)

Handles user intent understanding and response generation, with two integration modes:

Volcano Ark (ArkV3)

Uses large language models hosted by Volcano Engine, ready to use out of the box.

  • Supports multiple models such as Doubao, Claude, and GLM
  • No additional service deployment required
  • Automatic cloud scaling
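
To illustrate what “ready to use out of the box” means, the snippet below calls an Ark-hosted model directly through what is assumed to be Ark's OpenAI-compatible v3 interface; the base URL and the endpoint-ID-style model name should be verified against the Ark documentation. Within Real-Time Conversational AI itself you would normally just reference the Ark endpoint ID in the session configuration rather than calling Ark from your own code.

```ts
// Hedged sketch: a direct call to an Ark-hosted model through its
// OpenAI-compatible API. Base URL and endpoint ID format are assumptions.
import OpenAI from 'openai';

const ark = new OpenAI({
  baseURL: 'https://ark.cn-beijing.volces.com/api/v3', // assumed Ark v3 base URL
  apiKey: process.env.ARK_API_KEY ?? '',
});

async function ask(question: string): Promise<string> {
  const completion = await ark.chat.completions.create({
    model: '<ark-endpoint-id>', // placeholder endpoint ID (e.g. for a Doubao model)
    messages: [{ role: 'user', content: question }],
  });
  return completion.choices[0]?.message.content ?? '';
}

ask('Hello!').then(console.log).catch(console.error);
```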

CustomLLM (Custom Backend)

Volcano Engine invokes the developer’s custom service to obtain LLM responses.

  • Can integrate with any LLM (OpenAI, Qwen, local models, etc.)
  • Full control over conversation logic
  • Supports agent architectures and tool invocation
  • Can integrate private knowledge bases

The EMQX MCP AI voice assistant uses the CustomLLM mode to enable MCP tool invocation.
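
As a rough sketch of what such a backend can look like, the service below assumes that CustomLLM calls an OpenAI-compatible, SSE-streaming chat-completions endpoint; the exact request/response contract, URL path, and authentication must be verified against the CustomLLM documentation. The route and port are placeholders, and the echo logic stands in for your own agent, knowledge base, or MCP tool calls.

```ts
// Hedged sketch of a CustomLLM backend, assuming an OpenAI-compatible,
// SSE-streaming chat-completions contract; verify the real contract
// against the CustomLLM documentation.
import express from 'express';

const app = express();
app.use(express.json());

app.post('/v1/chat/completions', (req, res) => {
  // Last user message from the OpenAI-style request body.
  const userText: string = req.body?.messages?.at(-1)?.content ?? '';

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  // Replace this echo with your own agent logic: call any LLM, query a
  // knowledge base, or invoke MCP tools, then stream the generated text.
  const reply = `You said: ${userText}`;
  for (const token of reply.split(' ')) {
    const chunk = {
      object: 'chat.completion.chunk',
      choices: [{ index: 0, delta: { content: token + ' ' }, finish_reason: null }],
    };
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(8080, () => console.log('CustomLLM backend listening on :8080'));
```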

Extended Capabilities

Volcano Engine Speech Services also provide the following extended features:

  • Intelligent Interruption: Full-duplex communication; users can interrupt the AI at any time for more natural interaction
  • Visual Understanding: Supports image and video input, enabling AI to “see” and understand visual content
  • Function Calling: Allows the LLM to identify user intent and invoke external functions (see the sketch after this list)
  • MCP Protocol Support: Standardized access to external tool ecosystems
  • Real-Time Subtitles: Returns ASR results and LLM responses in real time
  • Context Management: Supports short-term and long-term memory (via vector databases)
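
To make the Function Calling entry concrete, here is an illustrative, OpenAI-style tool definition. The idea is that the LLM returns a tool call with JSON arguments, the backend executes it (for example through an MCP server), and the result is fed back to the model before the final spoken answer is synthesized. The tool name and parameters are hypothetical and not part of the Volcano Engine API.

```ts
// Illustrative only: a hypothetical tool exposed to the LLM via
// OpenAI-style function calling; not part of the Volcano Engine API.
const tools = [
  {
    type: 'function' as const,
    function: {
      name: 'get_device_status', // hypothetical tool name
      description: 'Query the online/offline status of an IoT device',
      parameters: {
        type: 'object',
        properties: {
          deviceId: { type: 'string', description: 'Unique device identifier' },
        },
        required: ['deviceId'],
      },
    },
  },
];
```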

For detailed feature descriptions, see the Volcano Engine Real-Time Conversational AI Documentation.

Pricing

Volcano Engine Speech Services are billed based on usage. Each billing item includes a free trial quota. For details, see the Real-Time Conversational AI pricing page.