A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)
{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>
Used in Rhasspy and Home Assistant for communication with voice services.
- Satellite for Home Assistant
- Piper text to speech
- Faster Whisper speech to text
- openWakeWord wake word detection
- porcupine1 wake word detection
- snowboy wake word detection
- `mic-external`
- `snd-external`
1. A JSON object header as a single line ending with `\n` (UTF-8, required)
   - `type` - event type (string, required)
   - `data` - event data (object, optional)
   - `data_length` - bytes of additional data (int, optional)
   - `payload_length` - bytes of binary payload (int, optional)
2. Additional data (UTF-8, optional)
   - JSON object with additional event-specific data
   - Merged on top of header `data`
   - Exactly `data_length` bytes long
   - Immediately follows the header's `\n`
3. Payload
   - Typically PCM audio, but can be any binary data
   - Exactly `payload_length` bytes long
   - Immediately follows the additional data, or the header's `\n` if there is no additional data
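The framing above can be sketched in Python. `write_event` and `read_event` are hypothetical helper names for this sketch, not the official library API:

```python
import io
import json

def write_event(stream, event_type, data=None, payload=b""):
    # Header is a single JSON line; the payload (if any) follows immediately.
    header = {"type": event_type}
    if data is not None:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        stream.write(payload)

def read_event(stream):
    # Read the header line, then optional additional data, then optional payload.
    header = json.loads(stream.readline().decode("utf-8"))
    data = dict(header.get("data") or {})
    data_length = header.get("data_length")
    if data_length:
        # Additional data is merged on top of the header's "data" object.
        data.update(json.loads(stream.read(data_length).decode("utf-8")))
    payload = stream.read(header.get("payload_length") or 0)
    return header["type"], data, payload

# Round-trip an event through an in-memory stream.
buf = io.BytesIO()
write_event(buf, "audio-stop", data={"timestamp": 1500})
buf.seek(0)
event_type, data, payload = read_event(buf)
```

The same pair of functions works over any file-like byte stream, such as a TCP socket's `makefile("rwb")`.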
Available events with their `type` and `data` fields.
Send raw audio and indicate begin/end of audio streams.
- `audio-chunk` - chunk of raw PCM audio
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp of audio chunk in milliseconds (int, optional)
  - Payload is raw PCM audio samples
- `audio-start` - start of an audio stream
  - `rate` - sample rate in hertz (int, required)
  - `width` - sample width in bytes (int, required)
  - `channels` - number of channels (int, required)
  - `timestamp` - timestamp in milliseconds (int, optional)
- `audio-stop` - end of an audio stream
  - `timestamp` - timestamp in milliseconds (int, optional)
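For example, a single `audio-chunk` event for 16-bit mono PCM at 16 kHz is framed as a header line followed by the raw samples (a sketch; the sample bytes here are placeholder silence):

```python
import json

# 10 ms of 16 kHz, 16-bit mono PCM: 160 samples * 2 bytes each (placeholder silence).
samples = b"\x00\x00" * 160
header = {
    "type": "audio-chunk",
    "data": {"rate": 16000, "width": 2, "channels": 1},
    "payload_length": len(samples),
}
# Header line first, then exactly payload_length bytes of raw audio.
message = json.dumps(header).encode("utf-8") + b"\n" + samples
```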
Describe available services.
- `describe` - request for available voice services
- `info` - response describing available voice services
  - `asr` - list of speech recognition services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
  - `tts` - list of text to speech services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `speakers` - list of speakers (optional)
        - `name` - unique name of speaker (required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
  - `wake` - list of wake word detection services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
  - `handle` - list of intent handling services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
  - `intent` - list of intent recognition services (optional)
    - `models` - list of available models (required)
      - `name` - unique name (required)
      - `languages` - supported languages by model (list of string, required)
      - `attribution` (required)
        - `name` - name of creator (required)
        - `url` - URL of creator (required)
      - `installed` - true if currently installed (bool, required)
      - `description` - human-readable description (string, optional)
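As an illustration, a minimal `info` response for a single speech recognition service might look like this (every name, URL, and description below is invented for the example):

```python
import json

# Hypothetical info event data; all names/URLs are placeholders.
info_data = {
    "asr": [
        {
            "name": "example-asr",
            "models": [
                {
                    "name": "example-model",
                    "languages": ["en"],
                    "attribution": {
                        "name": "Example Author",
                        "url": "https://example.org",
                    },
                    "installed": True,
                    "description": "An example speech to text model",
                }
            ],
        }
    ]
}
event_line = json.dumps({"type": "info", "data": info_data}).encode("utf-8") + b"\n"
```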
Transcribe audio into text.
- `transcribe` - request to transcribe an audio stream
  - `name` - name of model to use (string, optional)
  - `language` - language of spoken audio (string, optional)
- `transcript` - response with transcription
  - `text` - text transcription of spoken audio (string, required)
Synthesize audio from text.
- `synthesize` - request to generate audio from text
  - `text` - text to speak (string, required)
  - `voice` - use a specific voice (optional)
    - `name` - name of voice (string, optional)
    - `language` - language of voice (string, optional)
    - `speaker` - speaker of voice (string, optional)
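A `synthesize` request that pins a specific voice could be framed like this (the voice and speaker names are invented for the example):

```python
import json

# Hypothetical synthesize request; the voice/speaker names are placeholders.
event = {
    "type": "synthesize",
    "data": {
        "text": "Hello world",
        "voice": {"name": "example-voice", "language": "en", "speaker": "default"},
    },
}
event_line = json.dumps(event).encode("utf-8") + b"\n"
```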
Detect wake words in an audio stream.
- `detect` - request detection of specific wake word(s)
  - `names` - wake word names to detect (list of string, optional)
- `detection` - response when detection occurs
  - `name` - name of wake word that was detected (string, optional)
  - `timestamp` - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
- `not-detected` - response when audio stream ends without a detection
Detect speech and silence in an audio stream.

- `voice-started` - user has started speaking
  - `timestamp` - timestamp of audio chunk when speaking started in milliseconds (int, optional)
- `voice-stopped` - user has stopped speaking
  - `timestamp` - timestamp of audio chunk when speaking stopped in milliseconds (int, optional)
Recognize intents from text.

- `recognize` - request to recognize an intent from text
  - `text` - text to recognize (string, required)
- `intent` - response with recognized intent
  - `name` - name of intent (string, required)
  - `entities` - list of entities (optional)
    - `name` - name of entity (string, required)
    - `value` - value of entity (any, optional)
  - `text` - response for user (string, optional)
- `not-recognized` - response indicating no intent was recognized
  - `text` - response for user (string, optional)
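A hypothetical `intent` response for a recognized timer command, using the fields above (the intent and entity names are invented for the example):

```python
import json

# Hypothetical recognized-intent response; names and values are placeholders.
event = {
    "type": "intent",
    "data": {
        "name": "SetTimer",
        "entities": [{"name": "minutes", "value": 5}],
        "text": "Timer set for 5 minutes",
    },
}
event_line = json.dumps(event).encode("utf-8") + b"\n"
```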
Handle structured intents or text directly.
- `handled` - response when intent was successfully handled
  - `text` - response for user (string, optional)
- `not-handled` - response when intent was not handled
  - `text` - response for user (string, optional)
Play audio stream.
- `played` - response when audio finishes playing
- → is an event from client to server
- ← is an event from server to client
Service description:

1. → `describe` (required)
2. ← `info` (required)
Speech to text:

1. → `transcribe` event with `name` of model to use or `language` (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
   - Send audio chunks until silence is detected
4. → `audio-stop` (required)
5. ← `transcript`
   - Contains text transcription of spoken audio
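The client side of the speech-to-text flow can be sketched as a generator that yields `(header, payload)` pairs in order. `stt_events` is a hypothetical helper for this sketch, not part of the protocol:

```python
def stt_events(pcm, rate=16000, width=2, channels=1, samples_per_chunk=1024):
    # Yield (header, payload) pairs for the client side of the STT flow:
    # transcribe, audio-start, audio-chunk(s), audio-stop.
    fmt = {"rate": rate, "width": width, "channels": channels}
    yield {"type": "transcribe"}, b""
    yield {"type": "audio-start", "data": fmt}, b""
    step = samples_per_chunk * width * channels
    for i in range(0, len(pcm), step):
        chunk = pcm[i:i + step]
        yield {"type": "audio-chunk", "data": fmt, "payload_length": len(chunk)}, chunk
    yield {"type": "audio-stop"}, b""

# Two chunks' worth of placeholder silence (4096 bytes at 2048 bytes per chunk).
events = list(stt_events(b"\x00" * 4096))
```

In a real client each pair would be serialized with the framing described earlier and written to the server's socket, then the client would wait for the `transcript` response.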
Text to speech:

1. → `synthesize` event with `text` (required)
2. ← `audio-start`
3. ← `audio-chunk`
   - One or more audio chunks
4. ← `audio-stop`
Wake word detection:

1. → `detect` event with `names` of wake words to detect (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
   - Keep sending audio chunks until a `detection` is received
4. ← `detection`
   - Sent for each wake word detection
5. → `audio-stop` (optional)
   - Manually end audio stream
6. ← `not-detected`
   - Sent after `audio-stop` if no detections occurred
Voice activity detection:

1. → `audio-chunk` (required)
   - Send audio chunks until silence is detected
2. ← `voice-started`
   - When speech starts
3. ← `voice-stopped`
   - When speech stops
Intent recognition:

1. → `recognize` (required)
2. ← `intent` if successful
3. ← `not-recognized` if not successful
Intent handling, for structured intents:

1. → `intent` (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful

Intent handling, for text only:

1. → `transcript` with `text` to handle (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful
Audio output:

1. → `audio-start` (required)
2. → `audio-chunk` (required)
   - One or more audio chunks
3. → `audio-stop` (required)
4. ← `played`