
Kiirkirjutaja

About

Kiirkirjutaja is a realtime speech-to-text tool designed for real-time subtitling of TV broadcasts and streaming media.

Its architecture is almost monolithic and therefore quite ugly, as the different components are hardwired to each other.

Some of the components are specific to Estonian, but it should be relatively easy to adapt this tool to other languages.

It consists of the following components (described in more detail below):

  • speaker change detection
  • language identification, which filters out speech that is not in the target language
  • online speech recognition (ASR)
  • post-processing: compound word reconstruction, punctuation, and conversion of number words to digits
  • output "presenters": stdout, YouTube Live captions, and the FAB Live speech interface

News

  • 2021-12-08: implemented a presenter for FAB Subtitler BCAST.

  • 2021-12-06: the language ID check is now done after 3 seconds of speech; if a non-target language is detected, another check is done after 5 seconds. This filters out false negatives when an utterance starts with hesitation sounds or other non-standard words that sometimes confuse the language ID module.

  • 2021-11-16: fixed a memory leak and increased robustness against cases where the network API endpoint that receives subtitles is down.

  • 2021-10-08: integrated a language identification module that filters out speech that is not in the target language. The filtering works as follows: each new turn is buffered until 3 seconds of audio have been received, at which point language ID is performed. If the detected language is the target language, the buffer is forwarded to the ASR module and all subsequent chunks are sent to the ASR as usual. The prior probability of the target language can be tuned (usually we expect most of the data to be in the target language). A rough sketch of this gating logic is shown after this list.
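
The gating described in the last item can be pictured roughly like the sketch below. This is a minimal illustration, not the actual Kiirkirjutaja code; names like identify_language and asr.process are hypothetical.

    # Sketch of the language ID gating described above (hypothetical names).
    SAMPLE_RATE = 16000
    FIRST_CHECK_SEC = 3   # first language ID check after 3 seconds of speech
    SECOND_CHECK_SEC = 5  # re-check after 5 seconds if the first check fails

    def process_turn(chunks, identify_language, asr, target_lang="et", prior=0.8):
        """Buffer audio until enough has been collected for language ID; if the
        target language is detected, flush the buffer to the ASR module and pass
        all further chunks straight through."""
        buffer = b""
        accepted = False
        rejected_once = False
        for chunk in chunks:  # chunk: raw 16-bit mono PCM bytes
            if accepted:
                asr.process(chunk)
                continue
            buffer += chunk
            seconds = len(buffer) / (2 * SAMPLE_RATE)
            check_at = SECOND_CHECK_SEC if rejected_once else FIRST_CHECK_SEC
            if seconds >= check_at:
                if identify_language(buffer, prior=prior) == target_lang:
                    asr.process(buffer)   # forward everything buffered so far
                    accepted = True
                elif rejected_once:
                    return                # both checks failed: drop this turn
                else:
                    rejected_once = True  # keep buffering, check again at 5 s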

Demo

Government press conference with live subtitles, generated using the YouTube presenter (see below): https://www.youtube.com/watch?v=ZN03kzEo-V8

Hardware requirements

  • Around 16 GB of memory should be enough (you can probably get by with less)
  • Fairly modern fast CPU (development machine has Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz)
  • 4 free CPU cores

Using

Kiirkirjutaja has a lot of dependencies, so the recommended way to run it is through the Docker container. If you want to run it outside the container, check the docker/Dockerfile file to see how to install all the dependencies.

Running and using the Docker container (with generic Estonian models) is outlined below:

Pull the Docker image:

docker pull koodivaramu.eesti.ee:5050/taltechnlp/kiirkirjutaja:latest

Start the Docker container (use the --shm-size 2GB argument, because the program uses shared memory for inter-process communication):

docker run --shm-size 2GB --name kiirkirjutaja --rm -d -t koodivaramu.eesti.ee:5050/taltechnlp/kiirkirjutaja:latest

Decode a Vikerraadio real-time stream (it takes 10-20 seconds to load the models, and you'll get some warnings that can usually be ignored):

docker exec -it kiirkirjutaja python main.py https://icecast.err.ee/vikerraadio.mp3

By default, Kiirkirjutaja writes recognized words to stdout, word by word. Note that the recognition results are delayed by around 2-3 seconds, for two reasons:

  • The speaker change detection model needs a lookahead buffer of 1 second, which delays recognition by 1 second;
  • The various post-processing steps (compound word reconstruction, punctuation, words to numbers) can change already decoded words, and those changes depend on words that have not been recognized yet. That is why Kiirkirjutaja outputs a word only after 3 more words have been recognized: only then is it relatively sure that the word won't change any more (unless a segment end is reached, in which case everything can be output, since each segment is post-processed independently). A rough sketch of this holdback logic is shown after this list.
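
A rough sketch of the word holdback logic described above (illustrative only, not the actual implementation):

    # Emit a word only after HOLDBACK newer words exist; flush on segment end.
    HOLDBACK = 3

    class WordEmitter:
        def __init__(self, output):
            self.pending = []    # words whose post-processing may still change
            self.output = output

        def add_word(self, word):
            self.pending.append(word)
            # Words with at least HOLDBACK newer words after them are stable
            while len(self.pending) > HOLDBACK:
                self.output.write(self.pending.pop(0) + " ")
                self.output.flush()

        def end_of_segment(self):
            # Each segment is post-processed independently, so flush everything
            for word in self.pending:
                self.output.write(word + " ")
            self.pending.clear()
            self.output.flush()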

You can also write the word-by-word output to a file (which can be a named pipe, if you want to process the generated subtitles with some external program), e.g.:

docker exec -it kiirkirjutaja python main.py --word-output-file out.txt https://icecast.err.ee/vikerraadio.mp3

Note that this file will be created on the Docker container's filesystem.

If you want to pipe the word-by-word output to another program, you can simply capture the stdout of the decoding process, as all warning and information messages are written to stderr. For example:

  docker exec -it kiirkirjutaja python main.py https://icecast.err.ee/vikerraadio.mp3 | some_external_program
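
For illustration, some_external_program could be something as simple as the following hypothetical consumer, which reads the word-by-word output from its standard input and timestamps each word as it arrives:

    #!/usr/bin/env python3
    # Hypothetical consumer for Kiirkirjutaja's word-by-word output, e.g.:
    #   docker exec -it kiirkirjutaja python main.py <stream> | python3 consumer.py
    import sys
    import time

    buf = ""
    while True:
        ch = sys.stdin.read(1)
        if not ch:                      # stream ended
            break
        if ch.isspace():
            if buf:
                # A completed word arrives here as soon as Kiirkirjutaja emits it
                print(time.strftime("%H:%M:%S"), buf, flush=True)
                buf = ""
        else:
            buf += ch
    if buf:
        print(time.strftime("%H:%M:%S"), buf, flush=True)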

Kiirkirjutaja uses ffmpeg to decode the given file/stream and convert it to a mono, 16-bit, 16 kHz audio stream.
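
Conceptually, the conversion is similar to the sketch below, which spawns ffmpeg and reads the resulting raw PCM in chunks (a simplified illustration, not the actual Kiirkirjutaja code; feed_to_recognizer is a hypothetical downstream function):

    # Decode any file/stream with ffmpeg into mono, 16-bit, 16 kHz raw audio.
    import subprocess

    def open_audio_stream(url, chunk_size=4000):
        """Yield raw PCM chunks (s16le, mono, 16 kHz) decoded from url by ffmpeg."""
        proc = subprocess.Popen(
            ["ffmpeg", "-loglevel", "quiet", "-i", url,
             "-f", "s16le", "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "-"],
            stdout=subprocess.PIPE)
        try:
            while True:
                chunk = proc.stdout.read(chunk_size)
                if not chunk:
                    break
                yield chunk
        finally:
            proc.kill()

    # for chunk in open_audio_stream("https://icecast.err.ee/vikerraadio.mp3"):
    #     feed_to_recognizer(chunk)   # hypothetical downstream function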

If the input stream is called "-", it is assumed to be an already decoded mono/16-bit/16kHz raw audio stream. This way you can stream raw audio to it. For example, to stream audio directly from the microphone, use something like this:

rec -t raw -r 16k -e signed -b 16 -c 1 - | docker exec -i kiirkirjutaja python main.py -

If the server with Kiirkirjutaja resides on a remote machine, you can stream audio via network using netcat. On server:

nc -l 8022 | docker exec -i kiirkirjutaja python main.py -

On desktop:

rec -t raw -r 16k -e signed -b 16 -c 1 - | nc server_name 8022

If the server port (8022) is behind a firewall, this has to be done via an SSH tunnel, but that is left as an exercise for the reader :)

There are a few other output mechanisms (or 'presenters', as we call them) implemented:

  • YouTube Live captions
  • FAB Live speech interface

Generating captions for YouTube Live

In order to generate captions for a YouTube Live stream, "Go Live" on YouTube, select "Streaming Software" (i.e., not "Webcam"), turn on "Closed captions", select "POST captions to URL" and copy the "Captions ingestion URL".

Now, start streaming video to the YouTube input streaming URL (e.g., from some external RTMP source), using a script that does something like this:

#!/bin/bash

# Re-stream an external RTMP source to YouTube Live
VBR="2500k"       # video bitrate
FPS="30"          # frame rate
QUAL="medium"     # x264 encoding preset
YOUTUBE_URL="rtmp://a.rtmp.youtube.com/live2"

SOURCE="rtmp://your-source-video"
KEY=<YOUR-YOUTUBE-LIVE-ID>   # your YouTube Live stream key

ffmpeg \
    -i "$SOURCE" -deinterlace \
    -vcodec libx264 -pix_fmt yuv420p -preset "$QUAL" -r "$FPS" -g $(($FPS * 2)) -b:v "$VBR" \
    -acodec libmp3lame -ar 44100 -threads 6 -qscale 3 -b:a 712000 -bufsize 512k \
    -f flv "$YOUTUBE_URL/$KEY"

At the same time, start Kiirkirjutaja, using the following arguments (i.e., it should get the same external video stream as input):

docker exec -it kiirkirjutaja python main.py --youtube-caption-url <YOUR-CAPTIONS-INGESTION-URL> rtmp://your-source-video

It is recommended to use a 30-second stream delay setting for YouTube Live streams; otherwise the subtitles are displayed somewhat sporadically.

Generating captions for FAB Live

FAB Live is professional media subtitling software.

In order to use Kiirkirjutaja with FAB Live, first go to "Options -> Special -> Speech interface" in FAB, and set "Send mode" to "Word by word".

Now set the "Mode" in FAB to "Speech" and start Kiirkirjutaja with the following options (change localhost to the machine where FAB is running, and make sure the port is accessible to the machine running Kiirkirjutaja):

docker exec -it kiirkirjutaja python main.py --fab-speechinterface-url http://localhost:8001/speechinterface rtmp://your-source-video
