webrtc-encoded-transform's People

Contributors

aboba, alvestrand, autokagami, chrisguttandin, dontcallmedom, ekr, fippo, foolip, guest271314, guidou, henbos, jan-ivar, palak8669, saschanaz, sean-der, shaseley, tidoust, tonyherre, youennf

webrtc-encoded-transform's Issues

Insertable MediaStreams in Chrome issues

This may just be unclear to me, but the standard lists the following use cases:

  • Funny Hats (processing inserted before encoding or after decoding)
  • Background removal
  • Voice processing
  • Dynamic control of codec parameters
  • App-defined bandwidth distribution between tracks
  • Custom codecs for special purposes (in combination with WebCodecs)

The approach section also describes the standard as following the pattern of WebCodecs. However, the current Chrome 83 release does not seem to support converting the frame back to image data, even though WebCodecs appears to.

I've followed the example from the medooze blog article and posted the code in this repository: https://github.com/tato123/face-detection-insertable-mediastreams. Note that in their example they perform face detection using createImageBitmap on the video element and pass the result to an offscreen canvas. There does not currently seem to be a way of directly accessing the image data from a frame.

I would like clarification on how the use cases for the standard could be implemented given the data provided within a stream.

Add custom encryption with aiortc Python WebRTC

I am trying to use WebRTC Insertable Streams. Right now my sender peer is aiortc (https://github.com/aiortc/aiortc), a Python WebRTC library, and the receiver is a normal browser (latest Chrome, which supports Insertable Streams).

Right now I am applying a simple change to the transmitted encoded bytes: subtract 1 on the sender side (python-webrtc-aiortc) and add 1 on the receiver side to reconstruct the frame.

JavaScript:

    pc.ontrack = function (event) {
      var remoteVideo = document.getElementById("video_");
      remoteVideo.srcObject = event.streams[0];

      let receiverTransform = new TransformStream({
        start() {},
        flush() {},
        async transform(encodedFrame, controller) {
          // Reconstruct the original frame by adding 1 back to each byte
          // the sender decremented.
          let view = new DataView(encodedFrame.data);

          let newData = new ArrayBuffer(encodedFrame.data.byteLength);
          let newView = new DataView(newData);

          for (let i = 0; i < encodedFrame.data.byteLength; i++) {
            var add = 0;
            if (view.getUint8(i) > 0 && view.getUint8(i) < 255)
              add = 1;
            newView.setUint8(i, view.getUint8(i) + add);
          }
          encodedFrame.data = newData;
          controller.enqueue(encodedFrame);
        },
      });

      let receiverStreams = event.receiver.track.kind === 'video'
        ? event.receiver.createEncodedVideoStreams()
        : event.receiver.createEncodedAudioStreams();
      receiverStreams.readableStream
        .pipeThrough(receiverTransform)
        .pipeTo(receiverStreams.writableStream);
    };

Python code:

I am editing the following file, https://github.com/aiortc/aiortc/blob/c0504b6962484ac26ba8ad065191794ac6f607a4/src/aiortc/rtcrtpsender.py#L284, and the corresponding code, where the encoded frame payload is available:

    tmpdata = list(payload)  # decoded frame packet
    for x in range(len(payload)):
        # The first four bytes are header info not seen by the JS-side
        # preprocessing, so skip them.
        if (x > 3) and (tmpdata[x] > 0) and (tmpdata[x] < 255):
            tmpdata[x] = tmpdata[x] - 1
    print(tmpdata)
    packet.ssrc = self._ssrc
    packet.payload = bytearray(tmpdata)

I am getting the correct data on the JS side, but the video does not play.

A few doubts about the Python-side encryption:

  1. Where should I apply the encryption? On the frame data just after the frame is encoded, before the header, sequence number, and timestamp are added?

  2. Why does the video not play when I apply the above changes, even though the frames are received exactly as sent?

Please give some advice on how I can add encryption on the aiortc side and decryption on the browser side. I have implemented it as above and the data is received exactly as sent, but the video does not play. If I remove the subtract operation (the simple "encryption") from both sides, then it works.

Describe accurate threading model

Currently, we are not precisely describing the threading model and instead rely on pipeTo et al.
We should probably define an encoded media thread: the thread on which the generation and consumption of frames happen.
There are also the window's thread and the worker's thread.
We could post/enqueue tasks between the various threads, which would further clarify things.

Generalize ScriptTransform constructor to allow main-thread processing

The RTCRtpScriptTransform constructor takes a Worker argument, limiting the usage of this form of the transform to Workers.

The older createEncodedStreams() function was agnostic as to where the processing was going to take place; a number of existing demos and apps have been written that do processing on the main thread; some have even prototyped both worker-based and main-thread-based processing and deliberately chosen main-thread-based processing.

The normal use case should be a worker, but other use cases should be possible.

Proposal: Change the argument type of the constructor from Worker to (Worker or MessagePort). Dispatch the event (which could then be a message) on either the worker's implicit port or the explicit MessagePort.

This allows all the use cases that the older API allowed, but ensures that the simplest code will be the one invoking a Worker.
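A minimal sketch of what the proposed overload could look like in use. The MessagePort acceptance and the message shape are this proposal, not shipped behavior; the transformer field on the message is assumed here for illustration:

    // sender is an RTCRtpSender obtained elsewhere (e.g. from addTrack()).
    const channel = new MessageChannel();
    channel.port1.onmessage = ({ data }) => {
      // Hypothetical: the event arrives as a message carrying the transformer,
      // mirroring what the rtctransform event delivers to a worker today.
      const { readable, writable } = data.transformer;
      readable
        .pipeThrough(new TransformStream({
          transform(frame, controller) { controller.enqueue(frame); }, // identity
        }))
        .pipeTo(writable);
    };
    sender.transform = new RTCRtpScriptTransform(channel.port2);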

Does transform receive Frames or Chunks?

Many of the examples here (as in the current explainer.md) have the signature:

async transform(encodedFrame, controller)

but assuming that this is really just Streams, isn't the original example from the slides (https://docs.google.com/presentation/d/1NIHzumglY9cYa4b7rcEbHGVsMam5BiY80VfFDB6cDjQ/edit#slide=id.g7eb1549726_1_10):

async transform(chunk, controller)

more accurate? That is, does InsertableStreams build up a full frame for arg0.data?

Nearly all online examples just overwrite / mess with all the bytes in a uniform way (e.g., bit-complementing everything) that wouldn't reveal whether arg0 was just a chunk or the full frame.
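For concreteness, here is that uniform pattern as a sketch (using the createEncodedStreams-style frame objects); it behaves identically whether encodedFrame.data holds a full frame or just a chunk, which is exactly why these examples can't settle the question:

    const bitComplement = new TransformStream({
      transform(encodedFrame, controller) {
        const bytes = new Uint8Array(encodedFrame.data);
        for (let i = 0; i < bytes.length; i++) {
          bytes[i] = ~bytes[i] & 0xff; // complement every byte uniformly
        }
        encodedFrame.data = bytes.buffer;
        controller.enqueue(encodedFrame);
      },
    });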

Maybe documenting what happens for a video stream in VP8 would be useful? #32 (comment) mentions that:

It actually doesn't provide access to the full info of the RTP payload; the RTP headers and the segmenting that goes into putting frames into RTP packets isn't reflected in the Insertable Streams API.

But that note isn't reflected in either the spec (index.bs, correct?) or explainer.md for "what actually goes into the ArrayBuffer data".

@alvestrand could you clarify this?

Frames should be Serializable

In order to support running streams on a Worker, chunks must be marked serializable, since that is the mechanism streams use to send chunks to a Worker.
For simplicity and efficiency, we should neuter frames once they're serialized, so that it is easier to make the deserialized frame on the Worker side reuse the underlying WebRTC frames.

Need feature detection

Apps that require this feature in order to work need the ability to feature detect it; setting up a PeerConnection and making a connection in order to detect that nothing happens seems like it's too complicated.
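For illustration, a minimal sketch of the detection apps resort to today, checking for the Chromium createEncodedStreams shape and the RTCRtpScriptTransform shape respectively; neither check is defined by this spec:

    const supportsEncodedStreams =
      typeof RTCRtpSender !== 'undefined' &&
      'createEncodedStreams' in RTCRtpSender.prototype;
    const supportsScriptTransform = typeof RTCRtpScriptTransform !== 'undefined';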

Race conditions with async/await inside transform streams

The crypto.subtle encrypt/decrypt functions return a promise, so a call to the transform function of the TransformStream is not guaranteed to finish before the next frame is processed.

That could cause frame order to be reversed and images to be decoded with artifacts (especially with large frames vs. short ones, or if the encrypted frame contains signature info, as in SFrame).

This is solvable in the JS (although not easily), but I'm not sure if we could do anything to make things easier for devs.
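One possible JS-side workaround, sketched here under the assumption that key and iv are provisioned elsewhere: chain each frame's async work on the previous frame's promise, so frames are enqueued in arrival order.

    let previousFrameDone = Promise.resolve();
    const encryptTransform = new TransformStream({
      transform(encodedFrame, controller) {
        previousFrameDone = previousFrameDone.then(async () => {
          // key and iv are assumed to be set up elsewhere.
          encodedFrame.data = await crypto.subtle.encrypt(
            { name: 'AES-GCM', iv }, key, encodedFrame.data);
          controller.enqueue(encodedFrame);
        });
        return previousFrameDone; // also applies backpressure to the stream
      },
    });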

what metadata is useful?

This API should enable cryptographic schemes to be built on top of it. What we have at a slightly lower level is SRTP and SRTP + GCM, which use input from the RTP packet.

I am not sure the SSRC is always useful, as some SFUs do SSRC rewriting and so cannot use it in their E2EE encryption scheme. The same goes for the pictureId and (RTP) timestamp. For simulcast, rewriting the pictureId towards the receiver is effectively a must.

Note that this just means that when using these as input for the IV or counter, that IV/counter must still be sent along. For GCM, I suspect that allowing the IV to be generated programmatically and then sent alongside the packet would still be better than requiring the generation of a large number of (cryptographically?) random numbers.

Does Chromium require anything in SDP or RTP Header to make this work?

Sorry to talk about this on W3C repo, but I have no idea how to contact any Chrome developers.

I tried to code up an example but I'm hitting weird behavior. After processing my buffers I get the right values (the values I sent), but the browser fails to decode them. In the debug logs I just get "Failed to decode frame with timestamp 2656706362, error code: -1".

I am not that familiar with the Chromium code base, but do you know if this feature depends on anything else? I see lots of extmap entries and multiple RTP headers; I'm hoping this is gated behind something I haven't found yet.

I'll keep debugging; I'm going to send both tracks in, diff each packet, and see if I am making a mistake here. I can't find anything yet, though.

thanks

Interaction with Congestion Control

The Virtual Reality Gaming use case may potentially involve adding metadata to the encoded frame. The metadata could be substantial (hundreds of bytes).

Similarly, there are accessibility scenarios (captioning) in which the captions might be sent along with the frames.

So the question arises as to the interaction with congestion control in these scenarios. When adding to (or even subtracting from?) the size of the encoded frame, is there a way to properly interact with congestion control?

Off-the-main thread processing by default

Since this is dedicated to RTC, it is important that this processing does not get blocked by other processing. One solution is to define an API and a processing model that would set things up from the main thread but run in a background thread by default.

Similarly to WebAudio, this could be defined in terms of:

  • A pipeline processing some data (frames + metadata), as the current proposal seems to do, but on a background thread
  • A way to set up the pipeline by connecting nodes together, with an input and a destination node
  • A way to create native nodes having a specific functionality
  • An optional way to create JS processing nodes à la AudioWorklet

Ability to insert native source nodes in the pipeline

End-to-end encryption is one node that would be best implemented natively.
There are several benefits:

  1. Implementation of a single standardised format, widely studied, widely tested, and well maintained.
  2. An API to provide the key material to the encryption node. This API can be extended to support different trust models.
  3. The ability to not expose encryption keys to the JS (directly or through attacks like Spectre).

Consider using TransformStreams instead of exposing ReadableStream/WritableStream

It might be worth considering using TransformStreams instead of exposing ReadableStream/WritableStream directly.
One reason is consistency with other APIs like https://encoding.spec.whatwg.org/#interface-textencoderstream or https://encoding.spec.whatwg.org/#interface-textdecoderstream.
This for instance makes it easier to define native transforms.
Not dealing with ReadableStream directly is also nice, as it removes some potential footguns like cloning a ReadableStream.
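A runnable sketch of the consistency argument: because TextEncoderStream is a TransformStream, it drops straight into a pipe chain, and a native encoded-media transform exposed the same way would compose identically.

    new ReadableStream({
      start(controller) {
        controller.enqueue('hello');
        controller.close();
      },
    })
      .pipeThrough(new TextEncoderStream()) // a natively defined TransformStream
      .pipeTo(new WritableStream({
        write(chunk) { console.log(chunk); }, // Uint8Array of encoded bytes
      }));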

Optimizing encoded frame buffer allocation and memory copies

RTCEncodedAudioFrame and RTCEncodedVideoFrame both own an ArrayBuffer.
This array buffer is exposed to JavaScript by ReadableStream and consumed by WritableStream.

One important API design goal is to limit memory copies, and maybe to allow in-place transformation of the array buffer so that there is no memory copy and no memory allocation.

One possibility would be to allow the frame array buffer to be detached after the frame is enqueued in the WritableStream.
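For illustration, the in-place pattern this would enable, as a sketch (assuming the transform is allowed to mutate encodedFrame.data instead of allocating a replacement buffer):

    const inPlaceTransform = new TransformStream({
      transform(encodedFrame, controller) {
        // Mutate the frame's existing buffer; no new ArrayBuffer is allocated.
        const bytes = new Uint8Array(encodedFrame.data);
        for (let i = 0; i < bytes.length; i++) {
          bytes[i] ^= 0x55; // placeholder for a real in-place transform
        }
        controller.enqueue(encodedFrame);
      },
    });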

{ readableStream, writableStream } should be { readable, writable }

https://htmlpreview.github.io/?https://github.com/w3c/webrtc-insertable-streams/blob/master/index.html#dictdef-rtcinsertablestreams

The conventional names for a pair of readable and writable streams are { readable, writable }, with no suffix. Aligning these will improve interoperability with the rest of the streams ecosystem, both concretely (e.g., making the object usable with pipeThrough()) and just in terms of web developer familiarity.
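For example, with the conventional names the pair slots directly into the usual pipe chain (a sketch, assuming createEncodedStreams() returns the renamed pair):

    const streams = sender.createEncodedStreams(); // assumed: { readable, writable }
    streams.readable
      .pipeThrough(new TransformStream({
        transform(frame, controller) { controller.enqueue(frame); }, // identity
      }))
      .pipeTo(streams.writable);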

Adopt feedback streams in RTCRtpScriptTransformer

In some applications, especially those where the input or output of a transform goes somewhere other than the normal path, it is vital that the upstream frame source be informed of events other than just frames being consumed. These may include bandwidth adaptation signals, frame-size adaptation signals, or other signals.

In mediacapture-transform, this need is satisfied by control channels.

I propose that we add two more attributes to the ScriptTransformer interface: ReadableStream readableControl and WritableStream writableControl.
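A sketch of how a worker-side transformer might consume such a feedback stream, using the attribute names from this proposal; the message shape, myTransform, and the adjustMetadataBudget handler are illustrative only:

    // In the worker:
    onrtctransform = (event) => {
      const transformer = event.transformer;
      // React to upstream signals such as bandwidth or frame-size adaptation.
      transformer.readableControl.pipeTo(new WritableStream({
        write(message) {
          if (message.type === 'bitrate') adjustMetadataBudget(message.value);
        },
      }));
      transformer.readable.pipeThrough(myTransform).pipeTo(transformer.writable);
    };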

Security evaluation

This is a general issue about evaluating the security risks this new API can bring to existing infrastructure and adding a security section.

One potential threat is the following: by allowing JS to modify media content post-encoding, this API allows an attacker that is able to inject code in a page doing a WebRTC call to send RTP packets with poisoned content to either SFU or other participants in the call. Without this API, the attack is more difficult since the encoder will probably generate sanitised content. A non-browser attacker might be able to generate the same poisoned content but may not be able to connect easily to either SFU or other participants.

Add an API to know if createEncoded{Audio,Video}Streams was called

Hey there!

While integrating this on Jitsi Meet we ran into the case of calling createEncodedVideoStreams more than once by mistake. This currently throws an exception, which is nice, but there is no way to know in advance if we already created the encoded audio / video stream.

We solved it by using a custom hidden (with a Symbol) attribute on the sender / receiver, but it would be nice to have an "official" API for this.
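For reference, a sketch of that Symbol-based workaround (helper and symbol names are illustrative, not the actual Jitsi Meet code):

    const kEncodedStreams = Symbol('encodedStreams');

    function createEncodedVideoStreamsOnce(receiver) {
      // Cache the result on first call so a second call doesn't throw.
      if (!receiver[kEncodedStreams]) {
        receiver[kEncodedStreams] = receiver.createEncodedVideoStreams();
      }
      return receiver[kEncodedStreams];
    }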

statistics

@emcho asked me some good questions about performance. This is measurable with performance.now() and summing the results.

Should we have something similar to totalEncodeTime to allow measuring how much time is spent in insertable streams?
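A sketch of that performance.now() measurement, accumulating an app-level analogue of totalEncodeTime (doWork stands in for the app's actual processing):

    let totalTransformTime = 0;
    const timedTransform = new TransformStream({
      async transform(encodedFrame, controller) {
        const start = performance.now();
        await doWork(encodedFrame); // placeholder for the real transform work
        totalTransformTime += performance.now() - start;
        controller.enqueue(encodedFrame);
      },
    });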

Applicability Statement

The Insertable Streams API provides access to the RTP payload, which has generated considerable interest. I have heard suggestions that it might be used to implement some of the following:

  • Support for audio redundancy (e.g. RED, FEC, etc.)
  • Accessibility (captioning, real-time text)
  • Generic bitstream access (similar to WebCodecs, but with WebRTC parity and support for WHATWG streams)

It might be helpful to have an applicability statement somewhere in the document, to clarify what use cases might not be supportable.

"Get" is not a good name

The name "get" on a function implies that it doesn't change the state of the object, but getEncodedStreams() definitely changes the state. Can we call it "extract", "insert" or something else suggesting that it modifies things?

What about simulcast?

If the sender is a simulcast sender, what should be the behavior of the streams?
One per RTP stream, or one for the whole sender? If the latter, how do we know which encoding the frame belongs to?

Piping captured audio to an insertable stream from a shell script

Because of Chromium's refusal to support capture of monitor devices, I am using Native Messaging and Native File System to write and read a file, which is then parsed and set as outputs at AudioWorkletProcessor.process(). In pertinent part:

parec --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor ../app/output

which is read in the main thread in the browser. However, one issue is that Native File System currently cannot get a single handle on a file that is simultaneously being written to, for the purpose of reading and writing at the same time; DOMExceptions will be thrown, and this requires reading the entire file at each iteration to slice() from the previous offset:

    async function* fileStream() {
      while (true) {
        let fileHandle, fileBit, buffer;
        try {
          // if no exception is thrown, slice the file from readOffset; handle exceptions
          // https://bugs.chromium.org/p/chromium/issues/detail?id=1084880
          // TODO: stream file being written at local filesystem
          // without reading entire file at each iteration before slice
          fileHandle = await dir.getFileHandle('output', {
            create: false,
          });
          fileBit = await fileHandle.getFile();
          if (fileBit) {
            const slice = fileBit.slice(readOffset);
            if (slice.size === 0 && done) {
              break;
            }
            if (slice.size > 0) {
              buffer = await slice.arrayBuffer();
              readOffset = readOffset + slice.size;
              const u8_sab_view = new Uint8Array(memory.buffer);
              const u8_file_view = new Uint8Array(buffer);
              u8_sab_view.set(u8_file_view, writeOffset);
              // accumulate 512 * 346 * 2 bytes of data before resuming
              if (
                writeOffset > 512 * 346 * 2 &&
                ac.state === 'suspended'
              ) {
                await ac.resume();
              }
              writeOffset = readOffset;
            }
          }
        } catch (err) {
          // handle DOMException:
          // - A requested file or directory could not be found at the time an operation was processed.
          // - The requested file could not be read, typically due to permission problems that have occurred after a reference to a file was acquired.
          if (
            err instanceof DOMException ||
            err instanceof TypeError ||
            err
          ) {
            console.warn(err);
          }
        } finally {
          yield;
        }
      }
    }
    for await (const _ of fileStream()) {
      if (done) break;
    }

Does opus-tools have the capability to create an Opus bitstream that supports piping its output to the writable side of the insertable stream? That is, instead of writing and then re-reading the file, we could do something like:

parec --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor | opusenc - - | opusenc <options_to_make_stdout_insertable_stream_writable_input> -

where we could then write() the output from the native shell script directly to an insertable stream, avoiding the need to re-read the same file just to get the current offset, or to use SharedArrayBuffer to store the contents of the file in memory. We would not need to write a file at all; instead we would actually stream output from the native application to the RTCPeerConnection.

How to easily support messaging between RTCScriptTransform and RTCScriptTransformer

The Safari prototype supports a MessagePort natively so that RTCScriptTransform and RTCScriptTransformer can exchange messages conveniently. This mimics AudioWorkletProcessor.port.

Another approach would be to add a parameter to RTCScriptTransform to transfer some objects when serialising the options constructor parameter. Something like:

    const channel = new MessageChannel();
    const transform = new RTCRtpScriptTransform(
      worker,
      { name: 'myPortTransform', port: channel.port2 },
      [channel.port2]);
    transform.port = channel.port1;

How to handle transforms largely changing frame size

Transforms may be able to introduce large changes in frame size (decrease or increase).
It seems interesting to understand how to handle these cases.

I can see different variations of these cases:

  1. Metadata size is not really negotiable by the JS transform; the size change is more or less fixed.
    The user agent can handle it.
    The user agent can detect the size of the metadata and update the encoder bitrate according to the average size of the transformed data.

  2. The transform may decide to decrease the metadata size.
    The transform may add more or less metadata based on available bandwidth.
    The transform could be made aware of the target bit rate from the network side, and maybe the encoder target bit rate as well.
    The transform would then compute how much space it can add to the frame.

  3. The transform might want to trade media quality to include more metadata.
    In that case, the transform can decide whether to reduce the encoder bit rate, the size of the metadata, or both.
    It seems useful to notify the JS whenever a change of the encoder bit rate is planned, and potentially allow the JS to override the default behavior.

Case 1 requires no new API.
Case 2 could be implemented as getter APIs.
Case 3 can be implemented in various ways (a ReadableStream/WritableStream pair, a transform, events, maybe even frame-dedicated fields). It seems all these variants would provide roughly the same functionality.

I feel like a single object that the JS could use for all of this when processing a frame might be the most convenient from a web developer perspective.

Additional space in the buffer

It would be useful to allow the user to request additional bytes to be prepended and appended for each frame, so adding a header/footer/nonce/whatever kind of additional data does not require copying into a new ArrayBuffer, which can be expensive and may trigger garbage collection.

This is useful for e.g. encryption modes with additional MACs and nonces that need to be transmitted.

Could look like:

createEncodedVideoStreams(optional EncodedVideoStreamsParameters parameters)

dictionary EncodedVideoStreamsParameters {
  unsigned long byteHeadroom = 0;
  unsigned long byteLegroom = 0;
};
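A hypothetical usage sketch of these proposed parameters, reserving headroom so an encryption transform can write its nonce and MAC in place:

    // byteHeadroom/byteLegroom are the proposed (not shipped) parameters above.
    const streams = sender.createEncodedVideoStreams({
      byteHeadroom: 12 + 16, // room for a 12-byte nonce plus a 16-byte MAC
      byteLegroom: 0,
    });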

Privacy evaluation

This is a general issue about evaluating the privacy risks this new API can bring to existing infrastructure and adding a privacy section to the proposal.

This API may provide access to encoder/decoder states otherwise not available to applications, for instance timing information. It would be good to investigate this potential issue and the potential mitigations.
For instance, a fully native pipeline probably does not bring much fingerprinting, or makes it easier to add mitigations. Limiting what JS can do/observe is a potential mitigation.

rename to createEncodedStreams?

Can we rename createEncodedVideoStreams and createEncodedAudioStreams to createEncodedStreams? The sender's kind is always known and cannot change, so having to decide which method to call is a bit cumbersome.

Data channels

If we're able to use Streams on underlying video and audio data, it stands to reason we should also have that ability on data channels themselves. Given how difficult it is to effectively address backpressure using the traditional JavaScript event model, the Streams API can give a huge performance win to developers.
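For context, this is what backpressure handling looks like today with the event model, using the existing RTCDataChannel bufferedAmount API; a Streams-based writable would reduce all of it to awaiting write():

    async function sendAll(channel, chunks) {
      channel.bufferedAmountLowThreshold = 65536;
      for (const chunk of chunks) {
        // Wait for the send buffer to drain before queuing more data.
        if (channel.bufferedAmount > channel.bufferedAmountLowThreshold) {
          await new Promise((resolve) =>
            channel.addEventListener('bufferedamountlow', resolve, { once: true }));
        }
        channel.send(chunk);
      }
    }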

metadata for start and flush

related to #9

Some things like the SSRC are constant over the lifetime of the stream (well, modulo SSRC changes...).

It would be useful to avoid first-time-initialization bookkeeping along the lines of "we haven't seen this SSRC" inside the main transform function. The same goes for flush; there one could still do periodic cleanup, but that is even worse.
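For illustration, the per-frame bookkeeping this issue wants to avoid, as a sketch (reading the SSRC from the frame's getMetadata().synchronizationSource is assumed available here):

    const seenSsrcs = new Set();
    const transform = new TransformStream({
      transform(encodedFrame, controller) {
        const ssrc = encodedFrame.getMetadata().synchronizationSource;
        if (!seenSsrcs.has(ssrc)) {
          seenSsrcs.add(ssrc); // one-time per-stream setup would go here
        }
        controller.enqueue(encodedFrame);
      },
    });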
