
Comments (10)

huitema commented on August 26, 2024

There is related experience with DNS log formats. In particular, look at the CBOR encoding of DNS logs proposed in RFC 8618, https://datatracker.ietf.org/doc/rfc8618/. They started from PCAP, but there was a practical issue with managing huge PCAP files. The first attempt was to just compress the binary, but they ended up with a more structured approach. The logical syntax follows the "natural" repetitions in the data, so that, for example, DNS names are encoded just once and then represented by indices into a table of names. Then they encode the "syntactically organized" data in CBOR (binary JSON), and they apply compression on top of that.

The main value of the logical syntax comes when processing logs. For example, I observed a factor-of-50 performance gain between doing DNS statistics directly on the PCAP and doing the same statistics on the logical CBOR data, due both to reduced I/O from the shorter data and to more compact code that follows the logical references.

I suspect there is something similar hiding in the QUIC traces.
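
To make the idea concrete, here is a minimal TypeScript sketch of that kind of "logical" encoding, under my own assumptions (the RawDnsRecord and LogicalDnsLog shapes are hypothetical, not the actual C-DNS schema from RFC 8618): distinct DNS names are stored once in a table, and each record refers to a name by its index before the whole structure is CBOR-encoded and compressed.

```typescript
// Hypothetical record shapes, loosely inspired by the RFC 8618 approach.
interface RawDnsRecord {
  timestamp: number;
  queryName: string; // repeated across many records
  responseCode: number;
}

interface LogicalDnsLog {
  nameTable: string[]; // each distinct name is stored exactly once
  records: {
    timestamp: number;
    nameIndex: number; // index into nameTable instead of the full name
    responseCode: number;
  }[];
}

function toLogical(records: RawDnsRecord[]): LogicalDnsLog {
  const nameTable: string[] = [];
  const indexByName = new Map<string, number>();
  const logical: LogicalDnsLog = { nameTable, records: [] };

  for (const r of records) {
    let idx = indexByName.get(r.queryName);
    if (idx === undefined) {
      idx = nameTable.push(r.queryName) - 1; // push returns the new length
      indexByName.set(r.queryName, idx);
    }
    logical.records.push({
      timestamp: r.timestamp,
      nameIndex: idx,
      responseCode: r.responseCode,
    });
  }
  return logical;
}

// The LogicalDnsLog would then be serialized with a CBOR library and
// compressed (e.g. with zstd or xz), as described above.
```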


huitema commented on August 26, 2024

I am quite familiar with the work on using CBOR to record DNS traces in RFC 8618. The captures were originally in PCAP, but PCAP produces very large files. They evaluated a set of variations:

| Format | File size (MB) | Comp. | Comp. size (MB) | RSS (kB) | User time (s) |
|---|---|---|---|---|---|
| PCAP | 661.87 | snzip | 212.48 | 2696 | 1.26 |
| | | lz4 | 181.58 | 6336 | 1.35 |
| | | gzip | 153.46 | 1428 | 18.20 |
| | | zstd | 87.07 | 3544 | 4.27 |
| | | xz | 49.09 | 97416 | 160.79 |
| JSON simple | 4113.92 | snzip | 603.78 | 2656 | 5.72 |
| | | lz4 | 386.42 | 5636 | 5.25 |
| | | gzip | 271.11 | 1492 | 73.00 |
| | | zstd | 133.43 | 3284 | 8.68 |
| | | xz | 51.98 | 97412 | 600.74 |
| Avro simple | 640.45 | snzip | 148.98 | 2656 | 0.90 |
| | | lz4 | 111.92 | 5828 | 0.99 |
| | | gzip | 103.07 | 1540 | 11.52 |
| | | zstd | 49.08 | 3524 | 2.50 |
| | | xz | 22.87 | 97308 | 90.34 |
| CBOR simple | 764.82 | snzip | 164.57 | 2664 | 1.11 |
| | | lz4 | 120.98 | 5892 | 1.13 |
| | | gzip | 110.61 | 1428 | 12.88 |
| | | zstd | 54.14 | 3224 | 2.77 |
| | | xz | 23.43 | 97276 | 111.48 |
| PBuf simple | 749.51 | snzip | 167.16 | 2660 | 1.08 |
| | | lz4 | 123.09 | 5824 | 1.14 |
| | | gzip | 112.05 | 1424 | 12.75 |
| | | zstd | 53.39 | 3388 | 2.76 |
| | | xz | 23.99 | 97348 | 106.47 |
| JSON block | 519.77 | snzip | 106.12 | 2812 | 0.93 |
| | | lz4 | 104.34 | 6080 | 0.97 |
| | | gzip | 57.97 | 1604 | 12.70 |
| | | zstd | 61.51 | 3396 | 3.45 |
| | | xz | 27.67 | 97524 | 169.10 |
| Avro block | 60.45 | snzip | 48.38 | 2688 | 0.20 |
| | | lz4 | 48.78 | 8540 | 0.22 |
| | | gzip | 39.62 | 1576 | 2.92 |
| | | zstd | 29.63 | 3612 | 1.25 |
| | | xz | 18.28 | 97564 | 25.81 |
| CBOR block | 75.25 | snzip | 53.27 | 2684 | 0.24 |
| | | lz4 | 51.88 | 8008 | 0.28 |
| | | gzip | 41.17 | 1548 | 4.36 |
| | | zstd | 30.61 | 3476 | 1.48 |
| | | xz | 18.15 | 97556 | 38.78 |
| PBuf block | 67.98 | snzip | 51.10 | 2636 | 0.24 |
| | | lz4 | 52.39 | 8304 | 0.24 |
| | | gzip | 40.19 | 1520 | 3.63 |
| | | zstd | 31.61 | 3576 | 1.40 |
| | | xz | 17.94 | 97440 | 33.99 |

You can see that there are some differences between the various algorithms. JSON clearly produces larger files than the binary alternatives, even after compression. But the biggest differences come from switching from what they call "simple" to what they call "block".

The simple alternative is pretty similar to the current qlog. Each DNS transaction is represented by a corresponding record in JSON, CBOR, Avro or protobuf. In contrast, the "block" format starts by building tables of objects seen in multiple records: a table of DNS names, a table of record values, etc. Then the individual PCAP records are represented by "block records" which, instead of listing DNS names, simply list the index of the name in the table of names. You can think of that as "logical compression", and it does reduce the size of the recording by a factor of about 10. After that, they can still apply compression.

The real beauty of the block format comes when processing the data in back-end programs. Compare:

uncompress < pcap.xz | process-pcap

To:

uncompress < cbor.xz | process-cbor

In the CBOR alternative, about 10 times less data is piped into the analysis program than in the PCAP alternative. That's a much lower I/O load. On top of that, since the CBOR data is structured in blocks, parsing and processing are much easier, resulting in a much lower CPU load. In a project I was involved with, replacing process-pcap with process-cbor made us run 40 times faster!
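
As an illustration of why the block layout is cheaper to process, here is a small TypeScript sketch (reusing the hypothetical LogicalDnsLog shape from the earlier sketch, not any real C-DNS tooling): per-name statistics are accumulated over small integer indices, and the name strings are only touched once at the end.

```typescript
// The same hypothetical "logical" shape as in the earlier sketch:
// a table of names plus records that reference names by index.
interface LogicalDnsLog {
  nameTable: string[];
  records: { timestamp: number; nameIndex: number; responseCode: number }[];
}

// Count queries per DNS name using only integer indices; the (larger)
// name strings are read just once, when building the final result.
function queryCountsByName(log: LogicalDnsLog): Map<string, number> {
  const countsByIndex = new Array<number>(log.nameTable.length).fill(0);
  for (const rec of log.records) {
    countsByIndex[rec.nameIndex] += 1;
  }
  const result = new Map<string, number>();
  countsByIndex.forEach((count, idx) => {
    if (count > 0) {
      result.set(log.nameTable[idx], count);
    }
  });
  return result;
}
```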

Also note that there are no practical differences between the various binary alternatives. Yes, ±10% here or there, but compared to a factor of 40 that's really in the noise.


marten-seemann commented on August 26, 2024

If we use CBOR, does that mean that we can get rid of the event_fields? Having implemented both an encoder and a parsing tool, I found that event_fields complicated things quite a bit for me (compared to just encoding an event as a normal JSON object).
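
For context, the complication comes from the column-oriented layout of the qlog drafts of that time: a trace declares event_fields once and every event is a positional array that has to be matched against that list, instead of a self-describing object. Roughly, in TypeScript terms (the field names below are indicative, not copied from a specific draft):

```typescript
// With event_fields: each event is a positional array, so both the
// serializer and the parser must map array positions back to field names.
const traceWithEventFields = {
  event_fields: ["relative_time", "category", "event", "data"],
  events: [
    [0, "transport", "packet_sent", { packet_type: "initial" }],
    [15, "transport", "packet_received", { packet_type: "handshake" }],
  ],
};

// Without event_fields: each event is a plain, self-describing object.
const traceWithPlainObjects = {
  events: [
    { relative_time: 0, category: "transport", event: "packet_sent", data: { packet_type: "initial" } },
    { relative_time: 15, category: "transport", event: "packet_received", data: { packet_type: "handshake" } },
  ],
};
```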


rmarx commented on August 26, 2024

Talking about this with @nibanks, he would primarily like this for larger traces (he has logs of several hundreds of megabytes) and for integration with other tools (like https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer).

He suggests https://diamon.org/ctf/ as one possible format (though, at first glance, it doesn't seem to have a JavaScript parser available).


rmarx commented on August 26, 2024

@huitema that's some very interesting stuff that I wasn't aware of yet, thanks!


rmarx commented on August 26, 2024

Talking about it some more with @nibanks, he states:

I'd prefer something that is light-weight and doesn't depend on yet another thing (protobuf). Or something that exists and is light weight to implement a parser for from scratch

@LPardue did some initial tests with CBOR and found that the file-size gains did not really outweigh those of compressed JSON.

I am currently experimenting with a few binary scheme options to get a first feel for potential file size and (de)serialization gains. That should give us some additional data to work from.


LPardue commented on August 26, 2024

To be clear, I am no CBOR expert. All I did for my serializing code was swap serde_json out for serde_cbor and compare the resulting output. CBOR shaved about 10% off the identity encoding; gzipped JSON shaved off about 40%.
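
An analogous experiment in TypeScript (rather than the Rust serde swap described above) is to serialize the same qlog-like object once with JSON.stringify and once with a CBOR encoder and compare the byte counts. This sketch assumes the npm "cbor" package and a made-up event object:

```typescript
import * as cbor from "cbor"; // assumes the npm "cbor" package is installed

// A made-up qlog-like event, just for a rough size comparison.
const qlogEvent = {
  relative_time: 15,
  category: "transport",
  event: "packet_received",
  data: { packet_type: "handshake", header: { packet_number: 42 } },
};

const jsonBytes = Buffer.byteLength(JSON.stringify(qlogEvent), "utf8");
const cborBytes = cbor.encode(qlogEvent).length; // cbor.encode returns a Buffer

console.log(`JSON: ${jsonBytes} bytes, CBOR: ${cborBytes} bytes`);
```

CBOR mainly saves on structural overhead (quotes, braces, digit-by-digit numbers) but still repeats field names, which is one reason a general-purpose compressor applied to the JSON can close much of the gap.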

AFAIK it is possible to profile CBOR to be more efficient (e.g., https://tools.ietf.org/html/draft-raza-ace-cbor-certificates-04), but that is beyond my skill set.


rmarx commented on August 26, 2024

Thanks a lot for that, @huitema. Doing something similar to the "block format" would be trivial for qlog as well. However, it clashes with my mental model of how general-purpose compression works... don't those algorithms also build that type of lookup table on the fly? I will test with manual block formats as well and see what that gives.

Another interesting ref from @martinthomson https://tools.ietf.org/html/draft-mattsson-tls-cbor-cert-compress-00


rmarx commented on August 26, 2024

So I've been doing some tests of my own to figure out the best approach to this for qlog.

I've created converter scripts (see https://github.com/quiclog/pcap2qlog/tree/binary/src/converters): one that uses a lookup table/dictionary instead of repeating values, one that CBOR-encodes the files, and a (rudimentary) protobuf schema. The dictionary is currently fully dynamic and stored inside the resulting file, but this can obviously be improved by using a static shared dictionary with a dynamic part for just the field values (much like QPACK and Chrome's NetLog).
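
A minimal sketch of that dictionary idea, under my own assumptions (the STATIC_DICTIONARY contents and the encodeWithDictionary helper are hypothetical, not the actual pcap2qlog code): well-known strings resolve against a static dictionary shared by all tools, and any other string value is appended to a dynamic dictionary that is stored inside the resulting file.

```typescript
// Hypothetical dictionary-based string encoding for qlog events.
// Indices below STATIC_DICTIONARY.length refer to the shared static
// dictionary; higher indices refer to the per-file dynamic dictionary.
const STATIC_DICTIONARY = ["transport", "recovery", "packet_sent", "packet_received", "initial", "handshake"];

interface DictionaryEncodedLog {
  dynamic_dictionary: string[]; // stored inside the resulting file
  events: unknown[];            // events with strings replaced by indices
}

function encodeWithDictionary(events: unknown[]): DictionaryEncodedLog {
  const dynamicDictionary: string[] = [];
  const dynamicIndex = new Map<string, number>();

  const replace = (value: unknown): unknown => {
    if (typeof value === "string") {
      const staticIdx = STATIC_DICTIONARY.indexOf(value);
      if (staticIdx !== -1) return staticIdx;
      let idx = dynamicIndex.get(value);
      if (idx === undefined) {
        idx = STATIC_DICTIONARY.length + dynamicDictionary.push(value) - 1;
        dynamicIndex.set(value, idx);
      }
      return idx;
    }
    if (Array.isArray(value)) return value.map(replace);
    if (value !== null && typeof value === "object") {
      return Object.fromEntries(
        Object.entries(value as Record<string, unknown>).map(
          ([k, v]) => [k, replace(v)] as [string, unknown]
        )
      );
    }
    return value;
  };

  // Note: a real format would also need to distinguish dictionary indices
  // from genuine numeric values (e.g. via CBOR tags or by only encoding
  // known string-valued fields); this sketch glosses over that.
  return { dynamic_dictionary: dynamicDictionary, events: events.map(replace) };
}
```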

I've then also looked at various compression schemes (https://github.com/quiclog/pcap2qlog/blob/binary/src/scripts/comparisons/compare.sh) (xz, gzip, brotli, zstd, lz4), focusing mainly on the schemes most often seen on the web for on-the-fly compression (gzip at level 6 and brotli at level 4).

Full results can be found at https://gist.github.com/rmarx/49bb14f83157d9fe59fb40e7c05b1f3f; a somewhat nicer representation is in the following image (sizes are for traces in which a 500MB or 100MB file was downloaded from the lsquic public endpoint). The blue value is the reference point for the percentages; green is the "best in class" for that row:

[Screenshot: results.xlsx spreadsheet comparing file sizes per format and compression scheme]

Main takeaways for me:

  1. protobuf is the smallest, but not by a huge margin compared to dictionary+CBOR, especially not when compression is used.
  2. compression alone already saves massively, even on the original JSON file or a direct CBOR version of it
  3. protobuf without compression is still quite large (23% of the original), so I'd reckon you'd always use compression for storage/transmission anyway?

Next to these tests, we also ran a survey among QUIC experts (implementers and researchers), to which we got replies from 28 participants (thanks everyone!). Part of the survey asked how important they felt features like "fast (de)serialization, small file size, flexibility (e.g., easily adding new event types), grep-ability" were. The full results will be posted soon (they are part of a publication we're preparing), but the gist of it is:

[Screenshot: QUIC and HTTP/3 Debugging Survey (March 2020, Google Forms) results on feature importance]

My interpretation:

  1. Flexibility is the major selling point across the board. I personally believe moving to something like protobuf robs us of that (it is much more difficult to add new event types, as you have to update the schema)
  2. Most don't care too much about (de)serialization performance
  3. Small file size was regarded as important, but again not as much as flexibility.
  4. grep-ability was also considered an important feature by many
  5. easy integration is also a major point, and this would be easier with something like protobuf

Finally, we also talked to Facebook (cc @mjoras), who have been deploying qlog at scale, logging over 30 billion qlog events per day. Compared to their earlier binary format, qlog is about 2-3x larger and takes 50% longer to serialize. This is quite manageable on the server side, where they log full-string JSON events to a centralized service. On the client, however, they find the file size prohibitive for uploading granular full qlogs (containing all the events they'd like). Still, Matt was adamant that they'd rather keep the flexibility of the JSON format than move to a more inflexible binary one. They were considering utilizing compression and writing a custom JSON (de)serializer, optimized for qlog, to help deal with some of the overhead.


So, presented with those results, my standpoint today is still to keep using JSON as the basis for qlog. I would propose adding the "dictionary" setup to the spec, though, as an optional optimized mode, and also recommending that tools support it (I'm not sure about a default static dictionary at this point). Furthermore, I'd recommend using CBOR if file size is important.

Companies that do need more optimizations can write their own protobuf (or equivalent) schema (which I've shown is possible) and then write a post-processor to convert back to proper JSON qlog for shared tooling.

Still, feedback on all this is more than welcome of course! @marten-seemann, @martinthomson, @huitema, @LPardue, @nibanks, @mjoras


rmarx commented on August 26, 2024

With the latest commit linked above (eb59e69), I feel this issue has been resolved.

qlog has not moved to a binary format by default, but it is now much easier to serialize it as one and to define a binary schema for it. Some of the reasoning behind that has also been included in the qlog document.

