Wyoming Protocol

A peer-to-peer protocol for voice assistants (basically JSONL + PCM audio)

{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
<data_length bytes (optional)>
<payload_length bytes (optional)>

Used in Rhasspy and Home Assistant for communication with voice services.

Format

  1. A JSON object header on a single line terminated by \n (UTF-8, required)
    • type - event type (string, required)
    • data - event data (object, optional)
    • data_length - bytes of additional data (int, optional)
    • payload_length - bytes of binary payload (int, optional)
  2. Additional data (UTF-8, optional)
    • JSON object with additional event-specific data
    • Merged on top of header data
    • Exactly data_length bytes long
    • Immediately follows header \n
  3. Payload (binary, optional)
    • Typically PCM audio but can be any binary data
    • Exactly payload_length bytes long
    • Immediately follows the additional data, or the header \n if there is no additional data
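
The framing rules above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the reference implementation: the names `write_event`, `read_event`, and `_read_exact` are hypothetical, and the writer assumes it is acceptable to send all event data in the additional-data section rather than inline in the header.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def write_event(stream: BinaryIO, event_type: str,
                data: Optional[dict] = None, payload: bytes = b"") -> None:
    """Write one event: header line, then optional additional data and payload."""
    header = {"type": event_type}
    data_bytes = b""
    if data:
        # Assumption: send event data as the separate additional-data section.
        data_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header, ensure_ascii=False).encode("utf-8") + b"\n")
    stream.write(data_bytes)
    stream.write(payload)


def _read_exact(stream: BinaryIO, length: int) -> bytes:
    """Read exactly `length` bytes or fail."""
    chunk = stream.read(length)
    if len(chunk) != length:
        raise EOFError("stream ended in the middle of an event")
    return chunk


def read_event(stream: BinaryIO) -> Optional[Tuple[str, dict, bytes]]:
    """Read one event as (type, data, payload); return None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line.decode("utf-8"))
    data = dict(header.get("data") or {})
    data_length = header.get("data_length") or 0
    if data_length:
        # Additional data is merged on top of the header data.
        extra = json.loads(_read_exact(stream, data_length).decode("utf-8"))
        data.update(extra)
    payload_length = header.get("payload_length") or 0
    payload = _read_exact(stream, payload_length) if payload_length else b""
    return header["type"], data, payload
```

Because the lengths are counted in bytes, the reader decodes UTF-8 only after reading the exact number of bytes announced in the header.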

Event Types

Available events with type and fields.

Audio

Send raw audio and indicate begin/end of audio streams.

  • audio-chunk - chunk of raw PCM audio
    • rate - sample rate in hertz (int, required)
    • width - sample width in bytes (int, required)
    • channels - number of channels (int, required)
    • timestamp - timestamp of audio chunk in milliseconds (int, optional)
    • Payload is raw PCM audio samples
  • audio-start - start of an audio stream
    • rate - sample rate in hertz (int, required)
    • width - sample width in bytes (int, required)
    • channels - number of channels (int, required)
    • timestamp - timestamp in milliseconds (int, optional)
  • audio-stop - end of an audio stream
    • timestamp - timestamp in milliseconds (int, optional)
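
As an illustration of how the three audio events fit together, the sketch below splits a buffer of raw PCM into an audio-start, a series of audio-chunk events, and a final audio-stop. The function name `audio_events` and the `samples_per_chunk` parameter are hypothetical, and the timestamps are assumed to be millisecond offsets from the start of the stream, which the list above does not spell out.

```python
from typing import Iterator, Tuple


def audio_events(pcm: bytes, rate: int = 16000, width: int = 2, channels: int = 1,
                 samples_per_chunk: int = 1024) -> Iterator[Tuple[str, dict, bytes]]:
    """Yield (type, data, payload) tuples describing one complete audio stream."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    bytes_per_frame = width * channels
    bytes_per_chunk = samples_per_chunk * bytes_per_frame

    yield "audio-start", {**fmt, "timestamp": 0}, b""

    frames_sent = 0
    for offset in range(0, len(pcm), bytes_per_chunk):
        chunk = pcm[offset:offset + bytes_per_chunk]
        # Assumption: timestamp is the offset of the chunk's first frame, in ms.
        timestamp = frames_sent * 1000 // rate
        yield "audio-chunk", {**fmt, "timestamp": timestamp}, chunk
        frames_sent += len(chunk) // bytes_per_frame

    yield "audio-stop", {"timestamp": frames_sent * 1000 // rate}, b""
```

For example, one second of 16 kHz, 16-bit mono audio (32000 bytes) with `samples_per_chunk=4000` produces four audio-chunk events of 8000 bytes each.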

Info

Describe available services.

  • describe - request for available voice services
  • info - response describing available voice services
    • asr - list of speech recognition services (optional)
      • models - list of available models (required)
        • name - unique name (required)
        • languages - supported languages by model (list of string, required)
        • attribution (required)
          • name - name of creator (required)
          • url - URL of creator (required)
        • installed - true if currently installed (bool, required)
        • description - human-readable description (string, optional)
    • tts - list of text to speech services (optional)
      • models - list of available models (required)
        • name - unique name (required)
        • languages - supported languages by model (list of string, required)
        • speakers - list of speakers (optional)
          • name - unique name of speaker (required)
        • attribution (required)
          • name - name of creator (required)
          • url - URL of creator (required)
        • installed - true if currently installed (bool, required)
        • description - human-readable description (string, optional)
    • wake - list of wake word detection services (optional)
      • models - list of available models (required)
        • name - unique name (required)
        • languages - supported languages by model (list of string, required)
        • attribution (required)
          • name - name of creator (required)
          • url - URL of creator (required)
        • installed - true if currently installed (bool, required)
        • description - human-readable description (string, optional)
    • handle - list of intent handling services (optional)
      • models - list of available models (required)
        • name - unique name (required)
        • languages - supported languages by model (list of string, required)
        • attribution (required)
          • name - name of creator (required)
          • url - URL of creator (required)
        • installed - true if currently installed (bool, required)
        • description - human-readable description (string, optional)
    • intent - list of intent recognition services (optional)
      • models - list of available models (required)
        • name - unique name (required)
        • languages - supported languages by model (list of string, required)
        • attribution (required)
          • name - name of creator (required)
          • url - URL of creator (required)
        • installed - true if currently installed (bool, required)
        • description - human-readable description (string, optional)
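
To make the nesting concrete, here is a sketch of what the data of an info event might look like as a Python dictionary, together with a small helper that lists installed models. All model names, creators, and URLs are made-up placeholders, and `installed_models` is a hypothetical helper, not part of any library.

```python
from typing import List, Optional

# Example data of an info event, following only the fields listed above.
info_data = {
    "asr": [
        {
            "models": [
                {
                    "name": "example-asr-en",
                    "languages": ["en"],
                    "attribution": {"name": "Example Author", "url": "https://example.com"},
                    "installed": True,
                    "description": "Placeholder English model",
                },
                {
                    "name": "example-asr-de",
                    "languages": ["de"],
                    "attribution": {"name": "Example Author", "url": "https://example.com"},
                    "installed": False,
                },
            ]
        }
    ],
    "tts": [
        {
            "models": [
                {
                    "name": "example-voice",
                    "languages": ["en", "de"],
                    "speakers": [{"name": "speaker-0"}],
                    "attribution": {"name": "Example Author", "url": "https://example.com"},
                    "installed": True,
                }
            ]
        }
    ],
}


def installed_models(info: dict, service_type: str,
                     language: Optional[str] = None) -> List[str]:
    """Return names of installed models for a service type such as "asr" or "tts"."""
    names = []
    for service in info.get(service_type, []):  # service lists are optional
        for model in service["models"]:
            if not model["installed"]:
                continue
            if language is not None and language not in model["languages"]:
                continue
            names.append(model["name"])
    return names
```

A client could call `installed_models(info_data, "tts", language="de")` to pick a voice before sending a synthesize event.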

Speech Recognition

Transcribe audio into text.

  • transcribe - request to transcribe an audio stream
    • name - name of model to use (string, optional)
    • language - language of spoken audio (string, optional)
  • transcript - response with transcription
    • text - text transcription of spoken audio (string, required)

Text to Speech

Synthesize audio from text.

  • synthesize - request to generate audio from text
    • text - text to speak (string, required)
    • voice - use a specific voice (optional)
      • name - name of voice (string, optional)
      • language - language of voice (string, optional)
      • speaker - speaker of voice (string, optional)

Wake Word

Detect wake words in an audio stream.

  • detect - request detection of specific wake word(s)
    • names - wake word names to detect (list of string, optional)
  • detection - response when detection occurs
    • name - name of wake word that was detected (string, optional)
    • timestamp - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
  • not-detected - response when audio stream ends without a detection

Voice Activity Detection

Detect speech and silence in an audio stream.

  • voice-started - user has started speaking
    • timestamp - timestamp of audio chunk when speaking started in milliseconds (int, optional)
  • voice-stopped - user has stopped speaking
    • timestamp - timestamp of audio chunk when speaking stopped in milliseconds (int, optional)

Intent Recognition

Recognize intents from text.

  • recognize - request to recognize an intent from text
    • text - text to recognize (string, required)
  • intent - response with recognized intent
    • name - name of intent (string, required)
    • entities - list of entities (optional)
      • name - name of entity (string, required)
      • value - value of entity (any, optional)
    • text - response for user (string, optional)
  • not-recognized - response indicating no intent was recognized
    • text - response for user (string, optional)

Intent Handling

Handle structured intents or text directly.

  • handled - response when intent was successfully handled
    • text - response for user (string, optional)
  • not-handled - response when intent was not handled
    • text - response for user (string, optional)

Audio Output

Play an audio stream.

  • played - response when audio finishes playing

Event Flow

  • → is an event from client to server
  • ← is an event from server to client

Service Description

  1. → describe (required)
  2. ← info (required)

Speech to Text

  1. → transcribe event with name of model to use or language (optional)
  2. → audio-start (required)
  3. → audio-chunk (required)
    • Send audio chunks until silence is detected
  4. → audio-stop (required)
  5. ← transcript
    • Contains text transcription of spoken audio
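
The speech-to-text exchange above could look roughly like this from the client side. It is a sketch under several assumptions: `writer` and `reader` are binary file-like objects (for a TCP connection they might come from `sock.makefile("wb")` and `sock.makefile("rb")`), the helper names `send_event`, `read_event`, and `transcribe` are hypothetical, and the whole utterance is already available as one PCM buffer instead of being streamed until silence.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def send_event(writer: BinaryIO, event_type: str,
               data: Optional[dict] = None, payload: bytes = b"") -> None:
    """Write one event with its data inline in the header."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    writer.write(json.dumps(header).encode("utf-8") + b"\n" + payload)


def read_event(reader: BinaryIO) -> Tuple[str, dict, bytes]:
    """Read one event as (type, data, payload)."""
    line = reader.readline()
    if not line:
        raise EOFError("connection closed")
    header = json.loads(line)
    data = dict(header.get("data") or {})
    if header.get("data_length"):
        data.update(json.loads(reader.read(header["data_length"])))
    payload = reader.read(header.get("payload_length") or 0)
    return header["type"], data, payload


def transcribe(writer: BinaryIO, reader: BinaryIO, pcm: bytes,
               language: Optional[str] = None, rate: int = 16000,
               width: int = 2, channels: int = 1, chunk_bytes: int = 2048) -> str:
    """Send one utterance and return the text of the transcript event."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    send_event(writer, "transcribe", {"language": language} if language else None)
    send_event(writer, "audio-start", fmt)
    for offset in range(0, len(pcm), chunk_bytes):
        send_event(writer, "audio-chunk", fmt, pcm[offset:offset + chunk_bytes])
    send_event(writer, "audio-stop")
    writer.flush()
    while True:
        event_type, data, _payload = read_event(reader)
        if event_type == "transcript":  # ignore any other events
            return data["text"]
```

The same function can be exercised without a network by passing `io.BytesIO` objects for both streams.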

Text to Speech

  1. → synthesize event with text (required)
  2. ← audio-start
  3. ← audio-chunk
    • One or more audio chunks
  4. ← audio-stop
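
A client for the text-to-speech exchange might collect the returned chunks and wrap them in a WAV container, as in the sketch below. The names `read_event` and `synthesize_to_wav` are hypothetical, `writer` and `reader` are assumed to be binary file-like objects for the connection, and the audio format is taken from audio-start (falling back to the first audio-chunk).

```python
import io
import json
import wave
from typing import BinaryIO, Optional, Tuple


def read_event(reader: BinaryIO) -> Tuple[str, dict, bytes]:
    """Read one event as (type, data, payload)."""
    line = reader.readline()
    if not line:
        raise EOFError("connection closed")
    header = json.loads(line)
    data = dict(header.get("data") or {})
    if header.get("data_length"):
        data.update(json.loads(reader.read(header["data_length"])))
    payload = reader.read(header.get("payload_length") or 0)
    return header["type"], data, payload


def synthesize_to_wav(writer: BinaryIO, reader: BinaryIO, text: str,
                      voice_name: Optional[str] = None) -> bytes:
    """Request speech for `text` and return the result as WAV file bytes."""
    data = {"text": text}
    if voice_name:
        data["voice"] = {"name": voice_name}
    request = {"type": "synthesize", "data": data}
    writer.write(json.dumps(request).encode("utf-8") + b"\n")
    writer.flush()

    fmt = None
    pcm = bytearray()
    while True:
        event_type, event_data, payload = read_event(reader)
        if event_type == "audio-start":
            fmt = event_data
        elif event_type == "audio-chunk":
            if fmt is None:
                fmt = event_data  # fall back to the chunk's own format fields
            pcm.extend(payload)
        elif event_type == "audio-stop":
            break
    if fmt is None:
        raise ValueError("no audio was received")

    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(fmt["channels"])
        wav.setsampwidth(fmt["width"])
        wav.setframerate(fmt["rate"])
        wav.writeframes(bytes(pcm))
    return out.getvalue()
```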

Wake Word Detection

  1. → detect event with names of wake words to detect (optional)
  2. → audio-start (required)
  3. → audio-chunk (required)
    • Keep sending audio chunks until a detection is received
  4. ← detection
    • Sent for each wake word detection
  5. → audio-stop (optional)
    • Manually end audio stream
  6. ← not-detected
    • Sent after audio-stop if no detections occurred
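
Seen from the service side, the wake word exchange above can be modeled as a generator that consumes incoming events and yields responses. This is only a sketch of the event ordering: `wake_service` is a hypothetical name, events are represented as `(type, data, payload)` tuples, and `detect_in_chunk` stands in for a real detector that returns a wake word name or `None` for each chunk.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

Event = Tuple[str, dict, bytes]


def wake_service(events: Iterable[Event],
                 detect_in_chunk: Callable[[bytes], Optional[str]]) -> Iterator[Event]:
    """Yield detection / not-detected events for one incoming audio stream."""
    wanted = None      # None means "report every wake word"
    detected = False
    for event_type, data, payload in events:
        if event_type == "detect":
            names = data.get("names")
            wanted = set(names) if names else None
        elif event_type == "audio-start":
            detected = False
        elif event_type == "audio-chunk":
            name = detect_in_chunk(payload)
            if name is not None and (wanted is None or name in wanted):
                detected = True
                result = {"name": name}
                if "timestamp" in data:
                    result["timestamp"] = data["timestamp"]
                yield "detection", result, b""
        elif event_type == "audio-stop":
            # not-detected is only sent if the stream ended without a detection.
            if not detected:
                yield "not-detected", {}, b""
            return
```

A real service would read the incoming events from a connection and write each yielded event back as it is produced.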

Voice Activity Detection

  1. → audio-chunk (required)
    • Send audio chunks until silence is detected
  2. ← voice-started
    • When speech starts
  3. ← voice-stopped
    • When speech stops

Intent Recognition

  1. → recognize (required)
  2. ← intent if successful
  3. ← not-recognized if not successful
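
The request and its two possible responses can be handled with a single function, sketched below. The name `recognize` is hypothetical, `writer` and `reader` are assumed to be binary file-like objects for the connection, and flattening the entities into a name-to-value dictionary is a convenience chosen for this example.

```python
import io
import json
from typing import Any, BinaryIO, Dict, Optional


def recognize(writer: BinaryIO, reader: BinaryIO, text: str) -> Optional[Dict[str, Any]]:
    """Send a recognize event; return the intent, or None if not recognized."""
    request = {"type": "recognize", "data": {"text": text}}
    writer.write(json.dumps(request).encode("utf-8") + b"\n")
    writer.flush()

    while True:
        line = reader.readline()
        if not line:
            raise EOFError("connection closed before a response arrived")
        header = json.loads(line)
        data = dict(header.get("data") or {})
        if header.get("data_length"):
            data.update(json.loads(reader.read(header["data_length"])))
        reader.read(header.get("payload_length") or 0)  # skip any payload

        if header["type"] == "intent":
            entities = {e["name"]: e.get("value") for e in data.get("entities") or []}
            return {"name": data["name"], "entities": entities, "text": data.get("text")}
        if header["type"] == "not-recognized":
            return None
        # Any other event type is ignored in this sketch.
```

Returning `None` discards the optional text of a not-recognized event; a fuller client would probably surface it to the user.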

Intent Handling

For structured intents:

  1. → intent (required)
  2. ← handled if successful
  3. ← not-handled if not successful

For text only:

  1. → transcript with text to handle (required)
  2. ← handled if successful
  3. ← not-handled if not successful

Audio Output

  1. → audio-start (required)
  2. → audio-chunk (required)
    • One or more audio chunks
  3. → audio-stop (required)
  4. ← played
