webrtc-encoded-transform's People

Contributors

aboba, alvestrand, autokagami, chrisguttandin, dontcallmedom, ekr, fippo, foolip, guest271314, guidou, henbos, jan-ivar, palak8669, saschanaz, sean-der, shaseley, tidoust, tonyherre, youennf

webrtc-encoded-transform's Issues

Insertable MediaStreams in Chrome issues

This may just be unclear to me, but the standard lists the following use cases:

  • Funny Hats (processing inserted before encoding or after decoding)
  • Background removal
  • Voice processing
  • Dynamic control of codec parameters
  • App-defined bandwidth distribution between tracks
  • Custom codecs for special purposes (in combination with WebCodecs)

The approach section also describes the standard as following the pattern of WebCodecs. However, the current Chrome 83 release does not seem to support converting the frame back to image data, even though WebCodecs appears to.

I've followed the example from the medooze blog article and posted the code in this repository: https://github.com/tato123/face-detection-insertable-mediastreams. Note that in their example they perform face detection using createImageBitmap on the video element and pass the result to an offscreen canvas. There does not currently seem to be a way of directly accessing the image data from a frame.

I would like clarification on how the use cases for the standard could be implemented given the data provided within a stream.

Add custom encryption with aiortc Python WebRTC

I am trying to use WebRTC Insertable Streams. Right now my sender peer is aiortc (https://github.com/aiortc/aiortc), a Python WebRTC library, and the receiver is a normal browser (latest Chrome, which supports Insertable Streams).

Right now I am applying a simple change to the transmitted encoded bytes: subtract 1 on the sender side (python-webrtc-aiortc) and add 1 on the receiver side to reconstruct the frame.

JavaScript:

    pc.ontrack = function (event) {
      var remoteVideo = document.getElementById("video_");
      remoteVideo.srcObject = event.streams[0];

      let receiverTransform = new TransformStream({
        start() {},
        flush() {},
        async transform(encodedFrame, controller) {
          // Reconstruct the original frame by adding 1 back to each byte
          // the sender decremented.
          let view = new DataView(encodedFrame.data);

          let newData = new ArrayBuffer(encodedFrame.data.byteLength);
          let newView = new DataView(newData);

          for (let i = 0; i < encodedFrame.data.byteLength; i++) {
            var add = 0;
            if (view.getUint8(i) > 0 && view.getUint8(i) < 255)
              add = 1;
            newView.setUint8(i, view.getUint8(i) + add);
          }
          encodedFrame.data = newData;
          controller.enqueue(encodedFrame);
        },
      });

      let receiverStreams = event.receiver.track.kind === 'video'
        ? event.receiver.createEncodedVideoStreams()
        : event.receiver.createEncodedAudioStreams();
      receiverStreams.readableStream
        .pipeThrough(receiverTransform)
        .pipeTo(receiverStreams.writableStream);
    };

Python code:

I am editing the following file, https://github.com/aiortc/aiortc/blob/c0504b6962484ac26ba8ad065191794ac6f607a4/src/aiortc/rtcrtpsender.py#L284, and the corresponding code, where the encoded frame payload is available:

    tmpdata = list(payload)  # decoded frame packet
    for x in range(len(payload)):
        # The first four bytes are header info not seen by the JS-side
        # preprocessing, so skip them.
        if (x > 3) and (tmpdata[x] > 0) and (tmpdata[x] < 255):
            tmpdata[x] = tmpdata[x] - 1
    print(tmpdata)
    packet.ssrc = self._ssrc
    packet.payload = bytearray(tmpdata)

I am getting the correct data on the JS side, but the video does not play.

A few doubts about the Python-side encryption:

  1. Where should I apply the encryption? On the frame data just after the frame is encoded, before the header, sequence number, and timestamp are added?

  2. Why does the video not play when I apply the above changes, even though the frames are received exactly as sent?

Please give some advice on how I can add encryption on the aiortc side and decryption on the browser side. I have implemented it as above and the data is received exactly as sent, but the video does not play. If I remove the subtract operation (the simple "encryption") from both sides, then it works.

Describe accurate threading model

Currently, we are not precisely describing the threading model and instead rely on pipeTo et al.
We should probably define an encoded media thread: the thread on which the generation and consumption of frames happen.
There are also the window's thread and the worker's thread.
We could post/enqueue tasks between the various threads, which would further clarify things.

Generalize ScriptTransform constructor to allow main-thread processing

The RTCRtpScriptTransform constructor takes a Worker argument, limiting the usage of this form of the transform to Workers.

The older createEncodedStreams() function was agnostic as to where the processing was going to take place; a number of existing demos and apps have been written that do processing on the main thread; some have even prototyped both worker-based and main-thread-based processing and deliberately chosen main-thread-based processing.

The normal use case should be a worker, but other use cases should be possible.

Proposal: Change the argument type of the constructor from Worker to (Worker or MessagePort). Dispatch the event (which could then be a message) on either the worker's implicit port or the explicit MessagePort.

This allows all the use cases that the older API allowed, but ensures that the simplest code will be the one invoking a Worker.
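A minimal sketch of what the proposed overload could look like in use. The MessagePort acceptance and the message shape are this proposal, not shipped behavior; the transformer field on the message is assumed here for illustration:

    // sender is an RTCRtpSender obtained elsewhere (e.g. from addTrack()).
    const channel = new MessageChannel();
    channel.port1.onmessage = ({ data }) => {
      // Hypothetical: the event arrives as a message carrying the transformer,
      // mirroring what the rtctransform event delivers to a worker today.
      const { readable, writable } = data.transformer;
      readable
        .pipeThrough(new TransformStream({
          transform(frame, controller) { controller.enqueue(frame); }, // identity
        }))
        .pipeTo(writable);
    };
    sender.transform = new RTCRtpScriptTransform(channel.port2);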

Does transform receive Frames or Chunks?

Many of the examples here (as in the current explainer.md) have the signature:

async transform(encodedFrame, controller)

but assuming that this is really just Streams, isn't the original example from the slides (https://docs.google.com/presentation/d/1NIHzumglY9cYa4b7rcEbHGVsMam5BiY80VfFDB6cDjQ/edit#slide=id.g7eb1549726_1_10):

async transform(chunk, controller)

more accurate? That is, does InsertableStreams build up a full frame for arg0.data?

Nearly all online examples just overwrite / mess with all the bytes in a uniform way (e.g., bit-complementing everything) that wouldn't reveal whether arg0 was just a chunk or the full frame.
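For concreteness, here is that uniform pattern as a sketch (using the createEncodedStreams-style frame objects); it behaves identically whether encodedFrame.data holds a full frame or just a chunk, which is exactly why these examples can't settle the question:

    const bitComplement = new TransformStream({
      transform(encodedFrame, controller) {
        const bytes = new Uint8Array(encodedFrame.data);
        for (let i = 0; i < bytes.length; i++) {
          bytes[i] = ~bytes[i] & 0xff; // complement every byte uniformly
        }
        encodedFrame.data = bytes.buffer;
        controller.enqueue(encodedFrame);
      },
    });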

Maybe documenting what happens for a video stream in VP8 would be useful? #32 (comment) mentions that:

It actually doesn't provide access to the full info of the RTP payload; the RTP headers and the segmenting that goes into putting frames into RTP packets isn't reflected in the Insertable Streams API.

But that note isn't reflected in either the spec (index.bs, correct?) or explainer.md for "what actually goes into the ArrayBuffer data".

@alvestrand could you clarify this?

Frames should be Serializable

In order to support running streams on a Worker, chunks must be marked serializable, since that is the mechanism streams use to send chunks to a Worker.
For simplicity and efficiency, we should neuter frames once they're serialized, so that it is easier to make the deserialized frame on the Worker side reuse the underlying WebRTC frames.

Need feature detection

Apps that require this feature in order to work need the ability to feature detect it; setting up a PeerConnection and making a connection in order to detect that nothing happens seems like it's too complicated.
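For illustration, a minimal sketch of the detection apps resort to today, checking for the Chromium createEncodedStreams shape and the RTCRtpScriptTransform shape respectively; neither check is defined by this spec:

    const supportsEncodedStreams =
      typeof RTCRtpSender !== 'undefined' &&
      'createEncodedStreams' in RTCRtpSender.prototype;
    const supportsScriptTransform = typeof RTCRtpScriptTransform !== 'undefined';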

Race conditions with async/await inside transform streams

The crypto.subtle encrypt/decrypt functions return a promise, so a call to the transform function of the TransformStream is not guaranteed to finish before the next frame is processed.

That could cause frame order to be reversed and images to be decoded with artifacts (especially with large frames vs. short ones, or if the encrypted frame contains signature info, as in SFrame).

This is solvable in the JS (although not easily), but I'm not sure if we could do anything to make things easier for devs.
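One possible JS-side workaround, sketched here under the assumption that key and iv are provisioned elsewhere: chain each frame's async work on the previous frame's promise, so frames are enqueued in arrival order.

    let previousFrameDone = Promise.resolve();
    const encryptTransform = new TransformStream({
      transform(encodedFrame, controller) {
        previousFrameDone = previousFrameDone.then(async () => {
          // key and iv are assumed to be set up elsewhere.
          encodedFrame.data = await crypto.subtle.encrypt(
            { name: 'AES-GCM', iv }, key, encodedFrame.data);
          controller.enqueue(encodedFrame);
        });
        return previousFrameDone; // also applies backpressure to the stream
      },
    });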

what metadata is useful?

This API should enable cryptographic schemes to be built on top of it. What we have at a slightly lower level is SRTP and SRTP + GCM, which use input from the RTP packet.

I am not sure the SSRC is always useful, as some SFUs do SSRC rewriting and so cannot use it in their E2EE encryption scheme. The same goes for the pictureId and (RTP) timestamp. For simulcast, rewriting the pictureId towards the receiver is effectively a must.

Note that this just means that when using these as input for the IV or counter, that IV/counter must still be sent along. For GCM, I suspect that allowing the IV to be generated programmatically and then sent alongside the packet would still be better than requiring the generation of a large number of (cryptographically?) random numbers.

Does Chromium require anything in SDP or RTP Header to make this work?

Sorry to talk about this on W3C repo, but I have no idea how to contact any Chrome developers.

I tried to code up an example but I'm hitting weird behavior. After processing my buffers I get the right values (the values I sent), but the browser fails to decode them. In the debug logs I just get "Failed to decode frame with timestamp 2656706362, error code: -1".

I am not that familiar with the Chromium code base, but do you know if this feature depends on anything else? I see lots of extmap entries and multiple RTP headers; I'm hoping this is gated behind something I haven't found yet.

I'll keep debugging; I'm going to send both tracks in, diff each packet, and see if I am making a mistake here. I can't find anything yet, though.

thanks

Interaction with Congestion Control

The Virtual Reality Gaming use case may potentially involve adding metadata to the encoded frame. The metadata could be substantial (hundreds of bytes).

Similarly, there are accessibility scenarios (captioning) in which the captions might be sent along with the frames.

So the question arises as to the interaction with congestion control in these scenarios. When adding to (or even subtracting from?) the size of the encoded frame, is there a way to properly interact with congestion control?

Off-the-main thread processing by default

Since this is dedicated to RTC, it is important that this processing does not get blocked by other processing. One solution is to define an API and a processing model that would set things up from the main thread but run in a background thread by default.

Similarly to WebAudio, this could be defined in terms of:

  • A pipeline processing some data (frames + metadata), as the current proposal seems to do, but on a background thread
  • A way to set up the pipeline by connecting nodes together, with an input and a destination node
  • A way to create native nodes having a specific functionality
  • An optional way to create JS processing nodes à la AudioWorklet

Ability to insert native source nodes in the pipeline

End-to-end encryption is one node that would be best implemented natively.
There are several benefits:

  1. Implementation of a single standardised format, widely studied, widely tested, and well maintained.
  2. An API to provide the key material to the encryption node. This API can be extended to support different trust models.
  3. The ability to not expose encryption keys to the JS (directly or through attacks like Spectre).

Consider using TransformStreams instead of exposing ReadableStream/WritableStream

It might be worth considering using TransformStreams instead of exposing ReadableStream/WritableStream directly.
One reason is consistency with other APIs like https://encoding.spec.whatwg.org/#interface-textencoderstream or https://encoding.spec.whatwg.org/#interface-textdecoderstream.
This for instance makes it easier to define native transforms.
Not dealing with ReadableStream directly is also nice, as it removes some potential footguns like cloning a ReadableStream.
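A runnable sketch of the consistency argument: because TextEncoderStream is a TransformStream, it drops straight into a pipe chain, and a native encoded-media transform exposed the same way would compose identically.

    new ReadableStream({
      start(controller) {
        controller.enqueue('hello');
        controller.close();
      },
    })
      .pipeThrough(new TextEncoderStream()) // a natively defined TransformStream
      .pipeTo(new WritableStream({
        write(chunk) { console.log(chunk); }, // Uint8Array of encoded bytes
      }));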

Optimizing encoded frame buffer allocation and memory copies

RTCEncodedAudioFrame and RTCEncodedVideoFrame both own an ArrayBuffer.
This array buffer is exposed to JavaScript by ReadableStream and consumed by WritableStream.

One important API design goal is to limit memory copies, and maybe to allow in-place transformation of the array buffer so that there is no memory copy and no memory allocation.

One possibility would be to allow the frame array buffer to be detached after the frame is enqueued in the WritableStream.
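For illustration, the in-place pattern this would enable, as a sketch (assuming the transform is allowed to mutate encodedFrame.data instead of allocating a replacement buffer):

    const inPlaceTransform = new TransformStream({
      transform(encodedFrame, controller) {
        // Mutate the frame's existing buffer; no new ArrayBuffer is allocated.
        const bytes = new Uint8Array(encodedFrame.data);
        for (let i = 0; i < bytes.length; i++) {
          bytes[i] ^= 0x55; // placeholder for a real in-place transform
        }
        controller.enqueue(encodedFrame);
      },
    });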

{ readableStream, writableStream } should be { readable, writable }

https://htmlpreview.github.io/?https://github.com/w3c/webrtc-insertable-streams/blob/master/index.html#dictdef-rtcinsertablestreams

The conventional names for a pair of readable and writable streams are { readable, writable }, with no suffix. Aligning these will improve interoperability with the rest of the streams ecosystem, both concretely (e.g., making the object usable with pipeThrough()) and just in terms of web developer familiarity.
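For example, with the conventional names the pair slots directly into the usual pipe chain (a sketch, assuming createEncodedStreams() returns the renamed pair):

    const streams = sender.createEncodedStreams(); // assumed: { readable, writable }
    streams.readable
      .pipeThrough(new TransformStream({
        transform(frame, controller) { controller.enqueue(frame); }, // identity
      }))
      .pipeTo(streams.writable);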

Adopt feedback streams in RTCRtpScriptTransformer

In some applications, especially those where the input or output of a transform goes somewhere other than the normal path, it is vital that the upstream frame source be informed of events other than just frames being consumed. These may include bandwidth adaptation signals, frame-size adaptation signals, or other signals.

In mediacapture-transform, this need is satisfied by control channels.

I propose that we add two more attributes to the ScriptTransformer interface: ReadableStream readableControl and WritableStream writableControl.
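A sketch of how a worker-side transformer might consume such a feedback stream, using the attribute names from this proposal; the message shape, myTransform, and the adjustMetadataBudget handler are illustrative only:

    // In the worker:
    onrtctransform = (event) => {
      const transformer = event.transformer;
      // React to upstream signals such as bandwidth or frame-size adaptation.
      transformer.readableControl.pipeTo(new WritableStream({
        write(message) {
          if (message.type === 'bitrate') adjustMetadataBudget(message.value);
        },
      }));
      transformer.readable.pipeThrough(myTransform).pipeTo(transformer.writable);
    };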

Security evaluation

This is a general issue about evaluating the security risks this new API can bring to existing infrastructure and adding a security section.

One potential threat is the following: by allowing JS to modify media content post-encoding, this API allows an attacker that is able to inject code in a page doing a WebRTC call to send RTP packets with poisoned content to either SFU or other participants in the call. Without this API, the attack is more difficult since the encoder will probably generate sanitised content. A non-browser attacker might be able to generate the same poisoned content but may not be able to connect easily to either SFU or other participants.

Add an API to know if createEncoded{Audio,Video}Streams was called

Hey there!

While integrating this on Jitsi Meet we ran into the case of calling createEncodedVideoStreams more than once by mistake. This currently throws an exception, which is nice, but there is no way to know in advance if we already created the encoded audio / video stream.

We solved it by using a custom hidden (with a Symbol) attribute on the sender / receiver, but it would be nice to have an "official" API for this.
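For reference, a sketch of that Symbol-based workaround (helper and symbol names are illustrative, not the actual Jitsi Meet code):

    const kEncodedStreams = Symbol('encodedStreams');

    function createEncodedVideoStreamsOnce(receiver) {
      // Cache the result on first call so a second call doesn't throw.
      if (!receiver[kEncodedStreams]) {
        receiver[kEncodedStreams] = receiver.createEncodedVideoStreams();
      }
      return receiver[kEncodedStreams];
    }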

statistics

@emcho asked me some good questions about performance. This is measurable with performance.now() and summing the results.

Should we have something similar to totalEncodeTime to allow measuring how much time is spent in insertable streams?
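A sketch of that performance.now() measurement, accumulating an app-level analogue of totalEncodeTime (doWork stands in for the app's actual processing):

    let totalTransformTime = 0;
    const timedTransform = new TransformStream({
      async transform(encodedFrame, controller) {
        const start = performance.now();
        await doWork(encodedFrame); // placeholder for the real transform work
        totalTransformTime += performance.now() - start;
        controller.enqueue(encodedFrame);
      },
    });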

Applicability Statement

The Insertable Streams API provides access to the RTP payload, which has generated considerable interest. I have heard suggestions that it might be used to implement some of the following:

  • Support for audio redundancy (e.g. RED, FEC, etc.)
  • Accessibility (captioning, real-time text)
  • Generic bitstream access (similar to WebCodecs, but with WebRTC parity and support for WHATWG streams)

It might be helpful to have an applicability statement somewhere in the document, to clarify what use cases might not be supportable.

"Get" is not a good name

The name "get" on a function implies that it doesn't change the state of the object, but getEncodedStreams() definitely changes the state. Can we call it "extract", "insert" or something else suggesting that it modifies things?

What about simulcast?

If the sender is a simulcast sender, what should be the behavior of the streams?
One per RTP stream, or one for the whole sender? If the latter, how do we know which encoding the frame belongs to?

Piping captured audio to an insertable stream from a shell script

Because of Chromium's refusal to support capture of monitor devices, I am using Native Messaging and Native File System to write and read a file, which is then parsed and set as outputs at AudioWorkletProcessor.process(). In pertinent part:

parec --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor ../app/output

which is read in the main thread in the browser. However, one issue is that Native File System currently cannot get a single handle on a file that is simultaneously being written to, for the purpose of reading and writing at the same time; DOMExceptions will be thrown, and this requires reading the entire file at each iteration to slice() from the previous offset:

    async function* fileStream() {
      while (true) {
        let fileHandle, fileBit, buffer;
        try {
          // if no exception is thrown, slice the file from readOffset; handle exceptions
          // https://bugs.chromium.org/p/chromium/issues/detail?id=1084880
          // TODO: stream file being written at local filesystem
          // without reading entire file at each iteration before slice
          fileHandle = await dir.getFileHandle('output', {
            create: false,
          });
          fileBit = await fileHandle.getFile();
          if (fileBit) {
            const slice = fileBit.slice(readOffset);
            if (slice.size === 0 && done) {
              break;
            }
            if (slice.size > 0) {
              buffer = await slice.arrayBuffer();
              readOffset = readOffset + slice.size;
              const u8_sab_view = new Uint8Array(memory.buffer);
              const u8_file_view = new Uint8Array(buffer);
              u8_sab_view.set(u8_file_view, writeOffset);
              // accumulate 512 * 346 * 2 bytes of data before resuming
              if (
                writeOffset > 512 * 346 * 2 &&
                ac.state === 'suspended'
              ) {
                await ac.resume();
              }
              writeOffset = readOffset;
            }
          }
        } catch (err) {
          // handle DOMException:
          // - A requested file or directory could not be found at the time an operation was processed.
          // - The requested file could not be read, typically due to permission problems that have occurred after a reference to a file was acquired.
          if (
            err instanceof DOMException ||
            err instanceof TypeError ||
            err
          ) {
            console.warn(err);
          }
        } finally {
          yield;
        }
      }
    }
    for await (const _ of fileStream()) {
      if (done) break;
    }

Does opus-tools have the capability to create an Opus bitstream that supports piping its output to the writable side of the insertable stream? That is, instead of writing and then re-reading the file, we could do something like:

parec --raw -d alsa_output.pci-0000_00_1b.0.analog-stereo.monitor | opusenc - - | opusenc <options_to_make_stdout_insertable_stream_writable_input> -

where we could then write() the output from the native shell script directly to an insertable stream, avoiding the need to re-read the same file just to get the current offset, or to use SharedArrayBuffer to store the contents of the file in memory. We would not need to write a file at all; instead we would actually stream output from the native application to the RTCPeerConnection.

How to easily support messaging between RTCScriptTransform and RTCScriptTransformer

The Safari prototype supports a MessagePort natively so that RTCScriptTransform and RTCScriptTransformer can exchange messages conveniently. This mimics AudioWorkletProcessor.port.

Another approach would be to add a parameter to RTCScriptTransform to transfer some objects when serialising the options constructor parameter. Something like:

    const channel = new MessageChannel();
    const transform = new RTCRtpScriptTransform(
      worker,
      { name: 'myPortTransform', port: channel.port2 },
      [channel.port2]);
    transform.port = channel.port1;

How to handle transforms largely changing frame size

Transforms may be able to introduce large changes in frame size (decrease or increase).
It seems interesting to understand how to handle these cases.

I can see different variations of these cases:

  1. Metadata size is not really negotiable by the JS transform; the size change is more or less fixed.
    The user agent can handle it.
    The user agent can detect the size of the metadata and update the encoder bitrate according to the average size of the transformed data.

  2. The transform may decide to decrease the metadata size.
    The transform may add more or less metadata based on available bandwidth.
    The transform could be made aware of the target bit rate from the network side, and maybe the encoder target bit rate as well.
    The transform would then compute how much space it can add to the frame.

  3. The transform might want to trade media quality to include more metadata.
    In that case, the transform can decide whether to reduce the encoder bit rate, the size of the metadata, or both.
    It seems useful to notify the JS whenever a change of the encoder bit rate is planned, and potentially allow the JS to override the default behavior.

Case 1 requires no new API.
Case 2 could be implemented as getter APIs.
Case 3 can be implemented in various ways (a ReadableStream/WritableStream pair, a transform, events, maybe even frame-dedicated fields). It seems all these variants would provide roughly the same functionality.

I feel like a single object that the JS could use for all of this when processing a frame might be the most convenient from a web developer perspective.

Additional space in the buffer

It would be useful to allow the user to request additional bytes to be prepended and appended for each frame, so adding a header/footer/nonce/whatever kind of additional data does not require copying into a new ArrayBuffer, which can be expensive and may trigger garbage collection.

This is useful for e.g. encryption modes with additional MACs and nonces that need to be transmitted.

Could look like:

createEncodedVideoStreams(optional EncodedVideoStreamsParameters parameters)

dictionary EncodedVideoStreamsParameters {
  unsigned long byteHeadroom = 0;
  unsigned long byteLegroom = 0;
};
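A hypothetical usage sketch of these proposed parameters, reserving headroom so an encryption transform can write its nonce and MAC in place:

    // byteHeadroom/byteLegroom are the proposed (not shipped) parameters above.
    const streams = sender.createEncodedVideoStreams({
      byteHeadroom: 12 + 16, // room for a 12-byte nonce plus a 16-byte MAC
      byteLegroom: 0,
    });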

Privacy evaluation

This is a general issue about evaluating the privacy risks this new API can bring to existing infrastructure and adding a privacy section to the proposal.

This API may provide access to encoder/decoder states otherwise not available to applications, for instance timing information. It would be good to investigate this potential issue and the potential mitigations.
For instance, a fully native pipeline probably does not bring much fingerprinting, or makes it easier to add mitigations. Limiting what JS can do/observe is a potential mitigation.

rename to createEncodedStreams?

Can we rename createEncodedVideoStreams and createEncodedAudioStreams to createEncodedStreams? The sender's kind is always known and cannot change, so having to decide which method to call is a bit cumbersome.

Data channels

If we're able to use Streams on underlying video and audio data, it stands to reason we should also have that ability on data channels themselves. Given how difficult it is to effectively address backpressure using the traditional JavaScript event model, the Streams API can give a huge performance win to developers.
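For context, this is what backpressure handling looks like today with the event model, using the existing RTCDataChannel bufferedAmount API; a Streams-based writable would reduce all of it to awaiting write():

    async function sendAll(channel, chunks) {
      channel.bufferedAmountLowThreshold = 65536;
      for (const chunk of chunks) {
        // Wait for the send buffer to drain before queuing more data.
        if (channel.bufferedAmount > channel.bufferedAmountLowThreshold) {
          await new Promise((resolve) =>
            channel.addEventListener('bufferedamountlow', resolve, { once: true }));
        }
        channel.send(chunk);
      }
    }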

metadata for start and flush

related to #9

Some things like the SSRC are constant over the lifetime of the stream (well, modulo SSRC changes...).

It would be useful to avoid first-time-initialization bookkeeping along the lines of "we haven't seen this SSRC" inside the main transform function. The same goes for flush; there one could still do periodic cleanup, but that is even worse.
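For illustration, the per-frame bookkeeping this issue wants to avoid, as a sketch (reading the SSRC from the frame's getMetadata().synchronizationSource is assumed available here):

    const seenSsrcs = new Set();
    const transform = new TransformStream({
      transform(encodedFrame, controller) {
        const ssrc = encodedFrame.getMetadata().synchronizationSource;
        if (!seenSsrcs.has(ssrc)) {
          seenSsrcs.add(ssrc); // one-time per-stream setup would go here
        }
        controller.enqueue(encodedFrame);
      },
    });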
