Comments (19)

ivelin commented on August 10, 2024

@yongtang I've made a bit of progress educating myself on this topic. I see potential to split Issue #50 into two related but separate issues:

  1. Capturing an audio stream on a local operating system port (hardware or TCP/IP) and feeding it to TF for training and predictions. The code provided by @Alireza89 is a great start. He has already written a TF class that can continuously read audio input from a local memory file into a TF session. It would be straightforward to use a tool such as pcapsipdump to log the audio packets coming to a local network interface over SIP/RTP into a local pcap file, and from there feed them to audio_data_tensor.

  2. Cloud-scale processing of many thousands of simultaneous SIP/RTP calls. I am still researching this problem. So far I am hitting a wall with request/response based solutions like REST and gRPC. Some of the key problems I see:

  • If a TF model is involved in some way in a real phone call, with the expectation that the application can process natural language and engage with a caller (or callers in a conference call), the latency for a response cannot be more than 100ms. Longer than that makes the interaction feel artificial; the caller doesn't quite know if the synthetic model received and understood the input. Some sort of immediate minimal feedback has to be provided as soon as the caller stops talking. Normally in telephony networks we look for latency under 30ms. When it's much more than that, callers get confused about whether to keep talking or listen, and you start hearing callers talking over each other although they don't mean to.

  • Using request/response API to transform streaming UDP packets into TF input and then potentially again on the way back introduces several challenges I'd like us to solve:
    -- Latency due to buffering of packets to fill a data array for a gRPC/HTTP request.
    -- Latency due to repackaging of data from its raw wire format to a different data array format for the gRPC/HTTP request and then again for the TF input format.
    -- Loss of UDP RTP packets during transformation to a TCP interface. Since we are working with real-time speech, if the gRPC interface has a network delay, we can't just continue buffering the incoming RTP as if it were a song or a movie and play it later when the gRPC interface is available again; that would throw off the rhythm of the conversation as described above. If we drop the packets to allow processing of fresh packets, then the voice will sound interrupted.
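To put rough numbers on the buffering point above (all figures here are illustrative assumptions, not measurements): RTP commonly carries 20ms of audio per packet (ptime=20), so batching several packets into one request adds waiting time before anything is even sent.

```python
# Rough latency-budget arithmetic for the buffering concern above.
# Assumed numbers: 20 ms of audio per RTP packet (ptime=20) and a
# hypothetical fixed per-request overhead for gRPC/HTTP.

def buffering_latency_ms(packets_per_request, ptime_ms=20, request_overhead_ms=5):
    """Added latency from batching RTP packets into one request.

    The first packet in a batch waits for the rest of the batch to
    arrive before anything is sent, so the wait alone is
    (packets_per_request - 1) * ptime_ms.
    """
    wait_for_batch = (packets_per_request - 1) * ptime_ms
    return wait_for_batch + request_overhead_ms

# Batching 5 packets already consumes 85 ms of a 100 ms budget:
# buffering_latency_ms(5) -> 85
```

Under these assumptions, even a 5-packet batch leaves almost no headroom for network transit and model inference.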

Please feel free to correct me if I am missing something. Otherwise I will keep looking for options and share feedback here when I find a suitable answer.

Beyond the issue of efficiently interfacing SIP/RTP with TF, there is also the problem of performance and scalability due to TF processing time, and potentially delays due to integration with other data sources. When we get to this bridge, we may think about crossing it by splitting TF models into several categories: ones that are less computationally intensive, live closer to the edge of the network, and allow instant speech feedback to callers (for example, detecting silence and reacting with something simple like "Ok, let me see"), and then the more computationally intensive models for a full response. Although this type of problem probably doesn't belong in the SIG IO group, except to the extent of finding the upper limit of latency for (IO + TF model processing) < 30ms.

from io.

yongtang commented on August 10, 2024

FYI @Alireza89

Alireza89 commented on August 10, 2024

Thanks @yongtang for raising this issue.

Recently I've written two C/C++ programs on top of speech_commands/test_streaming_accuracy.cc which use the JACK2 audio server daemon in Linux to capture sound and perform real-time audio recognition with little latency (~30ms).

Here is a demonstration of the working code in Ubuntu 18.04:
https://youtu.be/2CoRbuRRKbw

And these are links to the source code:

  1. test_live_stream.cc :
    https://github.com/Alireza89/tensorflow/blob/master/tensorflow/examples/speech_commands/test_live_stream.cc

  2. jack_capture_tensorflow.c :
    https://github.com/Alireza89/tensorflow/blob/master/tensorflow/examples/speech_commands/jack_capture_tensorflow.c

  3. Updated BUILD file for Bazel :
    https://github.com/Alireza89/tensorflow/blob/master/tensorflow/examples/speech_commands/BUILD

Hope the TensorFlow community finds this useful.

ivelin commented on August 10, 2024

I am looking at wiring TensorFlow models into live SIP/RTP streams for VoIP and WebRTC call scenarios, such as live virtual agents. Still researching design and implementation options. Please let me know if anyone is working on this and interested in coordinating efforts.

@Alireza89 thank you for sharing your work. I think it's relevant to my subject and will study it.

yongtang commented on August 10, 2024

@ivelin As far as I know, no one is actively working on this issue at the moment. Support for live audio may not be very difficult if there is a reliable and actively maintained C/C++ audio library available (it is not realistic for tensorflow-io to start from scratch). I could help with the implementation if such a library is available.

The requirements are:

  • Library is in C/C++
  • Actively maintained with good quality
  • Compatible license
    • Prefer licenses without restrictions (Apache/MIT/BSD)
    • tensorflow-io is Apache License so we could not take GPL license
    • If the library is LGPL licensed, then it needs to be part of a major Linux distribution. For example, tensorflow-io includes ffmpeg with dynamic linking. However, we only link against Ubuntu 14.04/16.04's system-installed ffmpeg at fixed versions. We could not afford to support dynamic linking against a wide range of arbitrary ffmpeg versions.

ivelin commented on August 10, 2024

@yongtang Thank you for the guidance.

There is a plethora of open source SIP and RTP libraries out there in C, Java and other languages. My team leads one of the more popular open source Java telephony middleware stacks - Restcomm.

The search here is for a library that satisfies your guidelines as well as:

  • Allows convenient attachment to VoIP calls for bi-directional TF processing.
  • Allows production deployment at scale to millions of concurrent calls.
  • Allows two or more participants in a conversation. Conference calls are an interesting area of exploration for ML.
  • Allows long running bi-directional conversations connected to LSTM RNN or other types of contextual recurrent NN models. Mainstream production deployments seem to be voice commands and responses via Echo-type devices, which simulate more traditional IVR menus, but it would be great to make TF easier to test with more natural contextual dialogs.
  • Allows experimentation with models that go beyond Dialogflow-type tools, i.e. that do not require explicit feature engineering for context labels or explicit call path design between each request/response pair.

On this note, does anyone know if the Google Assistant team has released (or is planning to release) under ASL their voice streaming connectors to TF? That could be a good place to start.

I looked at the AWS Lex REST API as a potential approach, but I don't think it's a great fit here, because it's designed for a discrete series of request/response pairs over HTTP, which is fine for voice commands but is not exactly the way natural phone conversations take place.

I will do some more research and get back with suggestions.

yongtang commented on August 10, 2024

@ivelin Thanks for the details. Thinking again, I realized even C/C++ is not necessarily needed. We could actually use libraries written in other languages in various ways:

  1. For RTP/SIP libraries written in golang (or maybe even swift), we could build the golang code and expose a cdecl API so that tensorflow-io could call the cdecl API directly. This should be fairly straightforward.
  2. For RTP/SIP libraries written in Java, the easiest way is to expose a gRPC endpoint so that tensorflow-io could call the gRPC endpoint and access the stream. gRPC is used by Google Cloud as the default API format, and gRPC automatically generates the server/client side stub code for you. It is very easy as well, and could be done in a matter of a couple of days in most cases.

So overall, we will need a library:

  • Actively maintained with good quality
  • Compatible license
  • Could be C/C++/Golang/Java

ivelin commented on August 10, 2024

@yongtang noted. Allow me a few days to dig deeper in the options and get back to you.

yongtang commented on August 10, 2024

Thanks @ivelin, I created a new issue #241 to cover the local os live audio capture part.

For cloud-scale processing of many thousands of simultaneous SIP/RTP calls, there are several potential issues, though I think it still might be feasible.

  1. For gRPC serialization cost, I haven't done performance testing for quite some time, but 30ms seems to be within range.
  2. For model processing, one thing to note is that if a user talks, the model does not need to wait for the very last moment (last ms) to process. It might at least be possible for the model to build up the needed knowledge so that it has several possible paths to explore as the user's talk continues. There might still be some challenges, as the model or workflow might not be the same as traditional ways of processing.
  3. Simultaneous SIP/RTP calls might be more about scale-out than scale-up. Scale-out is typically easier, as in many cases the bottleneck is just load balancing (could be a combination of IP-level LB and DNS).

Overall, I think those are good topics, and a good chunk of them (not all) falls into SIG IO's focus. Things could be built up gradually so that they help the community.

ivelin commented on August 10, 2024

Thanks @yongtang for splitting the issue. To your points:

  1. I will do some benchmarking to see what the boundaries of the gRPC-related transformation costs are. While researching, I came across the 2019 Google I/O keynote announcing that the 100GB voice model was crunched down to 500MB to fit on a phone. The primary reason cited was reduction in latency. This is good news for folks who can solve their problems with the local OS IO that issue #241 targets. It remains an open topic how to bring this low latency to VoIP nodes deeper in the SIP network.

  2. Yes, I am aware that some models (e.g. RNNs) allow continuous processing and output of predictions with partial input of samples, which is great. That's a task to tackle once we figure out a way to feed high performance RTP streams in and out of TF models. :)

  3. Yes, I agree that scale-out is a separate issue that we can address separately. The main challenge right now is to find a high performance processing solution for a single RTP stream in and out of TF with less than 30ms total delay.

I'll post again as soon as I find something tangible to share. Of course if someone else has the answer, please comment.

ivelin commented on August 10, 2024

@yongtang it seems that bi-directional gRPC streaming could be a way forward. Here is an example:
https://grpc.io/docs/tutorials/basic/java/#bidirectional-streaming-rpc

I found a related thread under TF Serving, where the notion of bi-directional gRPC streaming was acknowledged over a year ago but declared out of scope. Do you know why that might be the case, and whether it's currently in scope for Serving or for IO?
tensorflow/serving#722

I also researched the Google Speech-to-Text gRPC APIs a bit more and found that bi-directional streaming was introduced in the client SDKs as Alpha. The latest GitHub activity suggests that bi-directional streaming for speech-to-text is moving towards Beta.

My team and I have done some testing against the previous version of the speech-to-text API, which only supported requests via a fixed byte array or the URL of a file in a Google Cloud bucket. Testing from multiple locations around the world showed latency on the order of 2-5 seconds for speech recognition of a short phrase plus receiving a callback with the text. That's significantly above the 100ms maximum (30ms ideally).

Hopefully we can find a way to enable bi-directional gRPC into TF Serving. Looking forward to comments.

yongtang commented on August 10, 2024

@ivelin Sorry for the late reply, as I was out of office last week. For bi-directional gRPC streaming, the protocol itself is straightforward and easy to implement in the first place. Since TensorFlow is protobuf-based, writing a gRPC server is really trivial. A starter server could normally be done in < 100 lines of implementation, as most of the code is machine-generated automatically. Going with TF Serving is probably not necessary, as TF Serving has some extra layers between the server, the format, and TensorFlow's graph.

I think I could help the implementation in this case.
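As a rough illustration of how small such a server can be, here is a sketch of the handler logic in Python. Everything named here is an assumption for illustration: the `AudioChunk`/`Inference` messages, the `StreamAudio` method, and the `run_model` stand-in are hypothetical, and in a real server the generator below would be the bidirectional-streaming method of a servicer class generated by `grpc_tools` from a proto definition.

```python
# Hypothetical stand-in for TF inference: "recognizes" a chunk by
# returning its byte length.
def run_model(pcm_bytes):
    return len(pcm_bytes)

def stream_audio(request_iterator):
    """Body of a bidirectional-streaming handler.

    In a real gRPC server this would be the
    StreamAudio(self, request_iterator, context) method of a servicer
    generated from a proto along the lines of:

        rpc StreamAudio(stream AudioChunk) returns (stream Inference);

    Each chunk is answered as soon as it arrives, with no batching,
    which is what keeps per-chunk latency low.
    """
    for chunk in request_iterator:
        yield run_model(chunk)

# Simulate a client streaming two 20 ms chunks of 8 kHz 16-bit audio
# (160 samples * 2 bytes = 320 bytes per chunk).
replies = list(stream_audio([b"\x00" * 320, b"\x00" * 320]))
# replies == [320, 320]
```

The key property is that requests and replies interleave per chunk rather than per call, so the caller gets feedback while still speaking.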

yongtang commented on August 10, 2024

To rephrase the problem, I think the goal is to have a TF implementation with gRPC bi-directional streaming. Both input and output share the same resource (e.g., a gRPC endpoint). The input is received from the client and, after inference, the output is returned back to the client. tensorflow-io normally implements either an input (e.g., a Dataset) or an output as a resource.

It could be interesting to see an implementation with a shared RESOURCE type for both input and output.

ivelin commented on August 10, 2024

@yongtang Hope you enjoyed time off.

Thank you for the thoughtful response.

It led me to realize that we may want to split the problem again into two separate ones. While the live RTP audio stream remains an important topic, I also see scenarios where NNs can be trained on live network audio data captured and stored in pcap format.

Pcap is a fairly accurate snapshot of network traffic on a time axis. One important advantage of the pcap format over WAV, MP3 or other file streaming formats is that pcap stores the timestamp when the UDP packet was actually captured on the network interface. This is important for understanding packet loss and latency, which can have a significant impact on the resulting audio/voice quality. Normally some sort of padding algorithm is used in RTP jitter buffers to correct for latency and loss, but these paddings have fairly limited quality correction ability. However, I am hopeful that with modern generative NN models, which have shown a lot of promise with image and video denoising, we will soon see results applicable to audio.

If you are interested in adding pcap as an input (and possibly output) file format for TF IO, here are references in Java that have been used successfully in production for several years.
https://www.telestax.com/blog/pcap-files-for-media-server-testing/
https://github.com/RIPE-NCC/hadoop-pcap
https://github.com/RestComm/media-core/blob/master/pcap/src/main/java/org/restcomm/media/core/pcap/PcapPlayer.java

There are use cases where I can see pcap as both a TF input and output format. For example, a GAN encoder-decoder sequence that tries to come up with NN parameters that smooth RTP audio streams, reliably detect DTMF, AMD, etc.

Returning to the original topic of streaming RTP directly into a TF model via TF IO, and potentially back out through the same or another RTP UDP channel: is there any code you can point me to that shows TF IO processing UDP streams? It seems that UDP was rejected in gRPC, and if that is the case, I am not sure how gRPC could help here.

Maybe we should look at a more direct implementation of a TF IO dataset that can read from a UDP port and convert RTP packets to a format TF can work with, and, in the output direction, write a dataset out as RTP packets through a UDP port.

My closest guess is that we need an implementation of the DatasetSource class introduced in TF 2.0. That's the base class for the new TFRecordDataset and the experimental CSV and SQL Datasets. I looked around but I could not find a base class for UDP or even TCP input sources.
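In the meantime, absent such a base class, one stop-gap is a plain Python generator that reads a UDP socket, strips the 12-byte fixed RTP header (RFC 3550), and yields the payloads; its output could then be wrapped with `tf.data.Dataset.from_generator`. This is only my own illustration of the idea, not a tensorflow-io API, and the port number is an arbitrary assumption.

```python
import socket
import struct

RTP_HEADER_LEN = 12  # fixed RTP header size per RFC 3550

def parse_rtp(packet):
    """Split an RTP packet (no header extension, no CSRCs) into
    (sequence_number, timestamp, payload)."""
    # Header fields: V/P/X/CC (1 byte), M/PT (1 byte), sequence (2 bytes),
    # timestamp (4 bytes), SSRC (4 bytes), all network byte order.
    _vpxcc, _mpt, seq, ts, _ssrc = struct.unpack("!BBHII", packet[:RTP_HEADER_LEN])
    return seq, ts, packet[RTP_HEADER_LEN:]

def rtp_payloads(host="127.0.0.1", port=5004, max_packets=None):
    """Yield raw audio payloads arriving on a UDP port.

    Could be wrapped with something like
    tf.data.Dataset.from_generator(rtp_payloads, output_types=tf.string).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    try:
        received = 0
        while max_packets is None or received < max_packets:
            packet, _addr = sock.recvfrom(2048)
            yield parse_rtp(packet)[2]
            received += 1
    finally:
        sock.close()
```

This sketches the input direction only; the output direction (re-packetizing model output as RTP) would need sequence numbers and timestamps generated on the way out, which is exactly the kind of logic better left to a telephony stack.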

yongtang commented on August 10, 2024

@ivelin There are a few concepts here. TensorFlow is essentially a graph with nodes as ops. There are not a lot of limitations on the graph, but you also have to build many parts yourself if you go too low-level. The Dataset op (e.g., TFRecordDataset and others) is a subgraph that has been abstracted and specialized for processing data pipeline inputs. Dataset ops are not capable of handling both inputs and outputs at the same time by design.

To handle input and output at the same time, you have to go deep into the TF graph implementation (without the abstraction of Dataset). Essentially, you need a TF_RESOURCE tensor for the endpoint (be it a UDP server, a gRPC server, or simply a file parser) as it should be unique. Then add additional input and output ops referencing the resource for further processing. The input and output ops are custom ops in C++. If those custom ops (input/output) consist of too many building blocks written from scratch, with protocols and components in other domains like networking, buffering, packetizing, and handling lost packets, it may not be very realistic for a project with a different focus.

For that reason, gRPC could be a good choice in that:

  1. It is very simple to implement a gRPC server or client.
  2. It effectively is a bridge between different scopes/domains.

For example, in the case of audio, I assume there is some mature server software available in Java. Then one way to wire everything up is: set up the audio server in Java with mature and proven software, receive and preprocess the data there, add a small gRPC client in Java, and send the data to the final gRPC server with TensorFlow. On the TensorFlow side, the gRPC server could easily be implemented in either C++ or Python: receive the data, do the inference, and send the output back.

In other words, Java audio software could serve as a relay between the original RTP and the final destination of TensorFlow. The assumption is that wiring up with gRPC would be much easier than processing UDP in C++ as part of TensorFlow.

With gRPC as a relay in between, it might incur additional latency, but I would assume this additional latency is manageable in most situations.

ivelin commented on August 10, 2024

@yongtang Thanks for the helpful breakdown of TF data types and flow.

I still believe there is value in separating the issue into two:

  1. offline processing of pcap data sets
  2. real time bi-directional streaming.

For offline processing of network packets, I would still suggest opening a separate TF IO issue. There are several Python pcap parsers that can read and write files in pcap format. dpkt stands out because it supports various network packet formats (SIP, RTP and many others), which could simplify and accelerate applied ML on network protocols. It looks like, due to the lack of pcap support in TF, applications currently transform pcap files to CSV, XML or JSON format first, which adds 10x if not 100x bloat. Here is an example, another one and a third one.
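For reference, the classic pcap framing those parsers handle is simple: a 24-byte global header followed by per-packet records, each with a 16-byte header carrying the capture timestamp mentioned above. A minimal stdlib-only sketch of a reader (my own illustration, not dpkt's API, and covering only the most common variant):

```python
import struct

PCAP_MAGIC_LE = 0xA1B2C3D4  # little-endian, microsecond-resolution captures

def read_pcap(data):
    """Yield (timestamp_seconds, raw_packet_bytes) from classic pcap bytes.

    Handles only the little-endian microsecond variant; a real reader
    would also cover big-endian, nanosecond, and pcapng captures.
    """
    magic, = struct.unpack_from("<I", data, 0)
    if magic != PCAP_MAGIC_LE:
        raise ValueError("unsupported pcap variant")
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_usec, captured length, original length
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            "<IIII", data, offset)
        offset += 16
        yield ts_sec + ts_usec / 1e6, data[offset:offset + incl_len]
        offset += incl_len
```

Because the timestamp lives in the record header rather than the payload, it survives exactly as captured, which is what makes pcap attractive for studying loss and jitter.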

As it relates to bi-directional real-time streaming, you are correct that there are several very stable and well-established open source telephony servers. Someone could write a bi-directional gRPC client for any of these servers that transforms UDP RTP packets into gRPC streaming data packets. If the overhead is minimal, as you suspect, then this may be a fine solution. The fact that Google Speech Recognition uses this method is encouraging in terms of latency and scalability expectations.

If you can help with the development of a TF IO gRPC server endpoint with bi-directional streaming, I can work on my side to add a bi-directional streaming client to the Restcomm Media Server. Once connected, we should be able to experiment with handling live voice traffic on the SIP network via TF.

yongtang commented on August 10, 2024

@ivelin Created a separate issue #264 to track offline pcap file format support. Will look into the implementation of gRPC bi-directional streaming server soon.

yongtang commented on August 10, 2024

@ivelin With gRPC bi-directional streaming, it is possible to have both input and output on the same channel. However, one potential issue is that a write to a channel will block, and there is no true way to set a time limit other than cancelling the operation.

So true bi-directional communication through gRPC with one channel could only be done in a ping-pong style (one request, one reply), or with a predefined behavior pattern (e.g., write 5 data requests and wait to read 1 reply). This could be an issue as one side has to wait for the other side (blocking).

An alternative is to set up two channels: one dedicated to sending data to the server continuously, and another dedicated to receiving results from the server continuously. This avoids the blocking read or write.
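The two-channel idea can be mocked up in-process with queues standing in for the two independent streaming RPCs (a toy sketch to show the non-blocking shape, not actual gRPC code):

```python
import queue
import threading

def inference_server(uplink, downlink, n_items):
    """Consumes audio chunks from the uplink and pushes results onto the
    downlink, independently of whether the client is reading yet."""
    for _ in range(n_items):
        chunk = uplink.get()
        downlink.put(f"result-for-{chunk}")

uplink, downlink = queue.Queue(), queue.Queue()
server = threading.Thread(target=inference_server, args=(uplink, downlink, 3))
server.start()

# Client side: write the whole utterance without waiting for replies...
for i in range(3):
    uplink.put(i)
# ...then drain results from the other channel whenever they are ready.
results = [downlink.get() for _ in range(3)]
server.join()
# results == ["result-for-0", "result-for-1", "result-for-2"]
```

Neither direction ever waits on the other, which is exactly the property that the single-channel ping-pong pattern lacks.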

ivelin commented on August 10, 2024

Good catch @yongtang
RTP streams are bi-directional and independent of each other. Each party in a call can speak at any time, although people normally wait for each other and ping-pong. When one of the sides is a bot and the other is a person, it's more likely for the person to interrupt the bot with instructions or questions.

I think your suggestion to use two separate grpc streams could work and we can measure how it performs.

I continue to look for a way to represent real world network conditions with UDP packet delays and losses that feed into a TF model without potential data leakage from interim processing steps.

It looks like aiortc is being used by some ML projects for real-time audio/video processing. A promising direction I will explore further.
