nearform / fast-speech-to-text

Experiments with implementing fast speech to text

JavaScript 24.90% TypeScript 67.85% HTML 0.50% CSS 6.75%

fast-speech-to-text's Introduction

Fast speech to text / Real time multi-lingual voice chat

Repository containing experiment/proof-of-concept for a "real time", multilingual voice chat.

The basic idea is to use the Web Speech API and the Google Cloud Translation API to enable a voice chat application that can translate speech into multiple languages.

Note

The project currently runs properly only in recent versions of the Chrome browser (v117+).

Running things locally

Pre-requisites

If you want to run the server on your own, you will need:

set up a Google Cloud project & enable Cloud Translation API
set up a Firebase project and enable a Realtime Database

The project uses Application Default Credentials to authenticate with the Google Cloud Translation API and Firebase.

Before running the server, you will need to authenticate using the Google Cloud CLI. For the server to connect successfully to the Firebase Realtime Database, you'll need to impersonate the Service Account that is used in GCP.

To set up user credentials using Google Cloud CLI, follow these instructions.
To impersonate a service account using Google Cloud CLI, follow these instructions. Make sure that your account has the Service Account Token Creator role in GCP.

Once authenticated successfully, you can run the server with npm run -w server.

Note

This process has to be done ONLY once, as the credentials will be generated for you and kept in a "well-known" location. For more information, see How Application Default Credentials Work.

Environment Variables

You will need the following environment variable. Set it in a .env file in the server/ directory:

# server/.env

FIREBASE_RTDB_URL=
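
A minimal sketch of how the server could fail fast when this variable is missing or malformed. The helper name readRtdbUrl is hypothetical and not taken from the repository, and the https-only check is an assumption about what a Realtime Database URL looks like:

```typescript
// Hypothetical helper (not from the repository): reads and validates
// FIREBASE_RTDB_URL from an environment-like object.
type Env = Record<string, string | undefined>;

function readRtdbUrl(env: Env): string {
  const raw = (env.FIREBASE_RTDB_URL || "").trim();
  if (!raw) {
    throw new Error("FIREBASE_RTDB_URL is not set (expected in server/.env)");
  }
  // Assumption: the database URL is always served over https.
  if (!/^https:\/\/\S+$/.test(raw)) {
    throw new Error("FIREBASE_RTDB_URL must be an https URL");
  }
  return raw;
}
```

On the server this would typically be called once at startup, for example as readRtdbUrl(process.env), so a misconfigured environment surfaces before any database call is attempted.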

Launching the application

# install all dependencies for the 'web' and 'server' packages
npm install

# run frontend and backend in dev
npm run dev
# or `npm run -w web` & `npm run -w server` in separate terminals if you so wish

Open localhost:5173 in your browser.

Demo application (NearForm access only)

There is a live demo application, however it is ONLY accessible to people with a NearForm Google account. You can access it here.

fast-speech-to-text's Issues

Live translations?

How about we add the ability to translate the user speech to another language, live?

The setup would be the same as we have, in the sense that we capture the audio, but rather than just processing it and echoing it back to the user, we pipe that through one of the classic AI models, which are also capable of translating text to other languages. The challenge here might be in doing the live translation. Similar to this, but a simultaneous translation.

Clearly this would also depend on #1 so that we can speak it back to the user

Convert to React/Next.js

The initial work on this project's UI has been carried out in Svelte but, in order to remain consistent with the other NearForm apps, we should convert it to React before beginning any other work.

Pressing the button twice sends the same message again

When double clicking the record message button right after recording a message, the recorded message gets duplicated on the screen.
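
The issue does not state the root cause. One possible guard, assuming the duplicate comes from the previous transcript still being held when the button is clicked again, is to hand the pending text out at most once. The PendingMessage class below is a hypothetical sketch, not code from the repository:

```typescript
// Hypothetical sketch: holds the latest recognised transcript and hands it
// out at most once, so a second click finds nothing left to send.
class PendingMessage {
  private text: string | null = null;

  // Store the transcript produced by the most recent recording.
  set(text: string): void {
    this.text = text;
  }

  // Return the pending text and clear it; null when nothing is pending.
  take(): string | null {
    const current = this.text;
    this.text = null;
    return current;
  }
}
```

A click handler would then send a message only when take() returns a non-null value.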

Make the application deployable

We need to make the application deployable to GCP.

Hint: use the same approach to bundle everything as the llm-playground project

Can't create new chatrooms

Can't create new chatrooms due to a constraint that is not needed anymore.

UI: Restyle the home page

The front page needs to be updated according to the provided design.

The Figma designs can be found here

Make voice recognition more reliable

When a participant in the chat room hits the "record" button, the app should record and recognise the voice until the user presses the "stop recording" button.

The current behaviour is a bit erratic and shows a few issues:

talking with small pauses between words restarts the voice recognition, and everything spoken before is lost
the "stop recording" button is enabled while voice recognition is still processing the spoken words; it should be disabled during that time
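
One way to address the first point is to keep the transcript outside the recognition session, so that a restart after a pause does not discard earlier words. The TranscriptBuffer class below is a hypothetical sketch under that assumption. In the browser, addResult would be fed from the SpeechRecognition result events (using each result's isFinal flag) and onSessionEnd from the end event, before recognition is started again:

```typescript
// Hypothetical sketch: accumulates recognised text across recognition
// sessions so that a restart after a pause keeps what was said before.
class TranscriptBuffer {
  private finalised: string[] = [];
  private interim = "";

  // Record one recognition result; interim results replace each other,
  // final results are appended permanently.
  addResult(text: string, isFinal: boolean): void {
    const trimmed = text.trim();
    if (isFinal) {
      if (trimmed) this.finalised.push(trimmed);
      this.interim = "";
    } else {
      this.interim = trimmed;
    }
  }

  // Call when a session ends (for example after a pause) and before
  // restarting: promote any interim text so it is not lost.
  onSessionEnd(): void {
    if (this.interim) {
      this.finalised.push(this.interim);
      this.interim = "";
    }
  }

  // Full transcript so far, including the current interim text.
  text(): string {
    const parts = this.interim ? this.finalised.concat(this.interim) : this.finalised;
    return parts.join(" ");
  }
}
```

The buffer would be reset only when the user presses "stop recording" and the message has been sent.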

Set up text-to-speech

Let's focus only on the Google streaming solution for this.

Let's set up the app so that it speaks the text back to the user as soon as it comes in.

We can use the Web Speech API for this https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

Update the UI look & feel

The UI look and feel needs to be updated to be more in line with the NearForm design language.

Update Readme

Update the Readme file to explain how to run the project locally.

Add firebase authentication

As the title implies, we need to set up Firebase authentication for the application and set up rules on the Realtime Database as well.

UI Overhaul & Demo Prep

Add firebase auth to the server

The server uses the Firebase Realtime Database to store the chatrooms and the events. After rules were set up so that only authenticated users can access the database, the server is now failing because it lacks authentication for the service account.

UI: Restyle the chatroom page

The chatroom page needs to be updated according to the provided design.

The Figma designs can be found here

Live chat with translation

After playing around with #3, I realize that in practice this would be quite awkward, because hearing your message spoken back to you in a different language while you are still speaking doesn't make for a terribly pleasant conversation. In any case, let's work on this under the assumption that the recipient of the translated audio would be a different person. As an extension of this work (but not as part of the PR for it), we could consider setting up a simple p2p audio conversation so you could:

speak in English to, say, an Italian
they would hear Italian voice and talk back in Italian
you would hear what they said in English

Obviously, replace Italian with any other language
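
The flow above implies translating each message once per distinct listener language, and skipping listeners who share the speaker's language. A minimal sketch of that selection step follows. The Participant shape and the targetLanguages function are hypothetical and not taken from the repository:

```typescript
// Hypothetical participant shape: a display name and a language code.
interface Participant {
  name: string;
  language: string;
}

// Returns the distinct languages a message needs to be translated into,
// in first-seen order, excluding the speaker's own language.
function targetLanguages(
  sourceLanguage: string,
  participants: readonly Participant[]
): string[] {
  const targets: string[] = [];
  for (const p of participants) {
    if (p.language !== sourceLanguage && targets.indexOf(p.language) === -1) {
      targets.push(p.language);
    }
  }
  return targets;
}
```

Each returned language would then be passed to the translation service, and the result delivered to every participant who uses that language.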

Originally posted by @simoneb in #2 (comment)

Fix styling and FE bugs

When typing a character in the language dropdown, the selection should jump to the first option that starts with that character.
Fix the font family.
Add flags next to user names in the chatroom, per the design.
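
The type-ahead behaviour in the first item can be sketched as a small pure function. The name findTypeaheadIndex is hypothetical, and case-insensitive matching is an assumption, since the issue does not specify it:

```typescript
// Hypothetical helper: returns the index of the first option whose label
// starts with the typed text (case-insensitive), or -1 when none matches.
function findTypeaheadIndex(options: readonly string[], typed: string): number {
  const needle = typed.trim().toLowerCase();
  if (!needle) return -1;
  for (let i = 0; i < options.length; i++) {
    if (options[i].toLowerCase().slice(0, needle.length) === needle) {
      return i;
    }
  }
  return -1;
}
```

A keydown handler on the dropdown could call this with the pressed key and move the selection when the result is not -1.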

UI: set the layout to support message container self-contained scrolling

Set CSS rules for the layout so that it supports self-contained scrolling for every message container that holds enough messages to exceed its initial height.

This behavior is expected on both home page and the chatroom page.

Implement live translation

Implement live translation functionality. The "Add a voice message" button should be replaced with a speech detection toggle (which should be enabled by default).