
qlog's People

Contributors

jlaine, lnicco, lpardue, marten-seemann, mxinden, rmarx


qlog's Issues

How descriptive should EVENT_TYPE names be?

For example:
PACKET_RECEIVED vs 1RTT_PACKET_RECEIVED
FRAME_NEW vs ACK_FRAME_NEW

For the shorter entries (the left-hand side of each pair), one would add a "type" field to the "DATA" value, e.g.,

{
     "type": "1RTT"
}

The shorter form means we have far fewer distinct EVENT_TYPEs, but it also makes the log a bit harder to parse for human readers and harder to quickly filter in tools.
The longer form is much more explicit, but requires much more definition up-front and a proliferation of different EVENT_TYPEs.

We could also break consistency. i.e., the original qlog used PACKET_RECEIVED with an explicit type in the DATA, but used ACK_FRAME_NEW for individual frames.

Currently, we use the short-form, since this is most similar to quic-trace and keeps it consistent if we want to log frames both in their own events and again when sending a packet.

Extra edge-case: Errors
If you go for extreme short-form, you would just have a single ERROR EVENT_TYPE for each CATEGORY, and define the error type in the DATA.
However, for easier manual debugging, tracking the specific type of error directly in the EVENT_TYPE is arguably easier. Maybe an exception should be made for errors?

Add connection_closed or connection_dropped event

Currently, we rely on packet_* with a CONNECTION_CLOSE frame, but that's not always enough. E.g., the server can decide to drop a connection after a long timeout without sending a CONNECTION_CLOSE. Or, we might want additional information on when a connection is effectively dropped completely (according to @marten-seemann, this is supposed to happen 3 PTOs after it is retired).

Maybe a connection_closed event with a trigger field suffices? Should this be importance Base or Extra?
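
A minimal sketch of what such an event could look like, reusing the normal event structure (the exact trigger strings are purely illustrative, not defined anywhere yet):

{
    "category": "CONNECTIVITY",
    "event_type": "connection_closed",
    "data": {
        "trigger": "idle_timeout"
    }
}

Other triggers could then, e.g., distinguish an explicit CONNECTION_CLOSE from silently dropping the connection 3 PTOs after retirement.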

Additional triggers and info for dropped packets

The current text lists about 8 reasons for dropping packets.
Microsoft's implementation lists 60+ individual reasons (via @nibanks)

Some of those:

LogDrop("Retry sent to server");
LogDrop("Already received server response");
LogDrop("No room for ODCID");
LogDrop("Invalid ODCID");
LogDrop("InitialToken alloc failed");
LogDrop("OrigCID alloc failed");
LogDrop("Max deferred datagram count reached");
LogDrop("Key no longer accepted");
LogDrop("SH packet during version negotiation");
LogDrop("Too short for HP");
LogDrop("Packet number too big");
LogDrop("Payload length less than encryption tag");
LogDrop("Generate new packet keys");
LogDrop("Decryption failure");
LogDrop("Invalid SH Reserved bits values");
LogDrop("Invalid LH Reserved bits values");
LogDrop("Duplicate packet number");
LogDrop("Key no longer accepted (batch)");
LogDrop("Failed to compute HP mask");
LogDrop("Different remote address");
LogDrop("Too small for Packet->Invariant");
LogDrop("LH no room for DestCID");
LogDrop("Zero length DestCID");
LogDrop("LH no room for SourceCID");
LogDrop("SH no room for DestCID");
LogDrop("DestCID don't match");
LogDrop("SourceCID don't match");
LogDrop("Greater than allowed max CID length");
LogDropWithValue("Invalid client/server packet type", Packet->LH->Type);
LogDrop("Invalid LH FixedBit bits values");
LogDrop("Long header has invalid token length");
LogDropWithValue("Long header has token length larger than buffer length", TokenLengthVarInt);
LogDrop("Long header has invalid payload length");
LogDropWithValue("Long header has length larger than buffer length", LengthVarInt);
LogDropWithValue("Long Header doesn't have enough room for packet number",
LogDrop("Invalid SH FixedBit bits values");
LogDrop("Non-initial packet not matched with a Connection");
LogDrop("Retry received after initial");

I feel that we don't need to list things at this level of detail in the qlog spec (the "trigger" field allows any text anyway). However, some guidance text on this would be helpful, and a few more suggested triggers would be interesting.

Add guidance to server developers

@marten-seemann asked how to best approach logging from a server's perspective, given that things like version negotiation and stateless retry are not inherently tied to a single connection. We should add some informative guidance to the spec on how to best approach this, depending on how much state you're willing to keep around.

Some options:

  1. low state: keep a separate qlog file for the entire server. This logs vneg, retry, etc. Then, when a connection is truly accepted, start a new .qlog file for the individual connection, containing all events thereafter. The server.qlog can then also contain an event signalling the acceptance of a new connection, for later cross-linking between the files.
  2. low state: keep a single huge qlog file for everything, using the "group_id" field to allow later de-multiplexing into separate connections (I believe quant does this atm)
  3. stateful: if you already track vneg/retry and link them up with the final connection, you can output them in the per-connection qlog file as well

Maybe also briefly discuss some of the trade-offs of each option, and talk about how to approach server-level events like server_listening and packet_dropped in the separate scenarios.
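
As an illustration of option 1, the server-wide qlog could contain a small cross-linking event when a connection is accepted (event name and fields are hypothetical):

{
    "category": "CONNECTIVITY",
    "event_type": "connection_accepted",
    "data": {
        "odcid": "127ecc830d98f9d5",
        "qlog_file": "127ecc830d98f9d5.qlog"
    }
}

Tools could then follow the odcid / file name from the server.qlog to the matching per-connection log.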

QLOGDIR environment variable

Maybe the main schema should specify something like SSLKEYLOGFILE that implementations SHOULD adhere to.

Though it needs to be a directory, not a single file, since we probably don't want to dump all qlogs into one file as we do with the SSL keys.

A good option might be QLOGDIR.

We should also specify how to write/name individual files in that directory (or at least list options).

TODO: find out where SSLKEYLOGFILE is specified exactly (seems a bit difficult to google).

CC @bagder @xquery
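
A minimal sketch of how an implementation might honor such a variable, assuming QLOGDIR is chosen and files are named by ODCID (both are just options at this point):

import { join } from "path";

// Hypothetical helper: qlog output is enabled only when QLOGDIR is set,
// with one file per connection, named after the ODCID.
function qlogPathFor(odcid: string): string | undefined {
    const dir = process.env.QLOGDIR;
    if (dir === undefined) {
        return undefined; // qlog disabled, same spirit as an unset SSLKEYLOGFILE
    }
    return join(dir, odcid + ".qlog");
}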

Allow index-based referencing for all event field names

Currently, we have a concept that you can have a "groups_id" object in "common_fields".
If you then have a "group_id" in your "event_fields", its value in each event is an index into the "groups_id" field, to prevent replication of complex fields.

We could make this more general, applicable to each field name.
E.g., if you know up-front which EVENT_TYPE values you support, you could do something like:

{
    "common_fields": {
        "group_id": "127ecc830d98f9d54a42c4f0842aa87e181a",
        "ODCID": "127ecc830d98f9d54a42c4f0842aa87e181a",
        "CATEGORY": ["PACKET_RX", "DATA_FRAME_NEW"]
        "protocol_type":  "QUIC_HTTP3",
        "reference_time": "1553986553572"
    },
    "event_fields": [
        "delta_time",
        "CATEGORY",
        "EVENT_TYPE",
        "TRIGGER",
        "DATA"
    ],
    "events": [[
            2,
            "TRANSPORT",
            0,
            "LINE",
            [...]
        ],[
            7,
            "APPLICATION",
            1,
            "GET",
            [...]
        ],
        ...
    ]
}

We would then have a general rule:

If the field is present as a value of type array in "common_fields" AND the field-name is present in "event_fields", the value per-event MUST be treated as an index into the "common_fields" entry, rather than taken as a direct value.

("groups_id" would then also be renamed to "group_id" in "common_fields")

This would allow smaller file sizes (and less string writing overhead) for applications that have a static list of e.g., CATEGORY or EVENT_TYPE up front.
Downside 1 is that it complicates the tools reading the file (but only a bit imo).
Downside 2 is that it complicates humans reading the file (so it depends on the use case).
(Either way, it's easy to go from one to the other with a simple script.)
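
To show that the tool-side complexity is indeed limited, a minimal sketch of the lookup a tool would do per field under the proposed rule:

// If common_fields holds an array under this field name, the per-event value
// is an index into that array; otherwise the value is used as-is.
function resolveField(name: string, value: unknown, commonFields: Record<string, unknown>): unknown {
    const common = commonFields[name];
    if (Array.isArray(common) && typeof value === "number") {
        return common[value]; // e.g., EVENT_TYPE 0 -> "PACKET_RX" in the example above
    }
    return value;
}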


This concept could be extended to make it a fully self-describing format.
In other words, we could also describe the fields in the DATA for known events up-front and replace those entries with in-order arrays of the values instead of key-value object definitions.

Very high-level concept (probably needs proper description of the fields etc.):

"data_fields" : {
      "TRANSPORT+PACKET_SENT" : [
         "frame_type",
         "packet_number",
         "frames"
     ]
}
...
[ 57, "TRANSPORT", "PACKET_SENT", "TRIGGER", ["STREAM", 15, [...]]]

Taking this all to the extreme, you could have a fully self-describing format that lists all known events (and potentially some values, similar to QPACK's static table) up front, and then each entry just uses indexes plus potentially a few raw values. However, I'm personally not of the opinion this added complexity is worth it.

0-RTT is a bit ambiguous in -01

Using transport.parameters_set, there are two parameters to signal 0-RTT:

  • resumption_allowed
  • early_data_enabled

As pointed out by @jlaine, these are a bit ambiguous, as 0-RTT can either be used for the current connection or enabled for a future connection.

Solution 1:

  • Rename parameters to resumption_accepted and early_data_accepted
  • Add new events: session_ticket_sent/received (with early_data_enabled?:boolean member)

Solution 2:

  • Rename parameters to resumption_accepted and early_data_accepted
  • Add new parameters: resumption_offered and early_data_offered

Require specific encoding for string fields

For example, NewTokenFrame doesn't really need a length field if the encoding of the token is specified (e.g., if it is hex-encoded, the byte length can be derived directly: half the number of hex characters).

We already indicate hex encoding at other places in the text (e.g., for version); maybe it's a good idea to enforce this across the board (then also specify whether it should have a 0x prefix or not, etc., plus examples).
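
For example, if hex encoding without a 0x prefix were mandated, a NewTokenFrame could be logged as below (field names are just for illustration); the token's byte length (14 here) follows directly from the 28-character string, so no separate length field is needed:

{
    "frame_type": "new_token",
    "token": "94b1c2d3e4f5061728394a5b6c7d"
}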

Thanks to @marten-seemann for reporting

Decide upon a language to define the schema definition

We need to indicate various things, such as field types (int, string, ...), whether fields are required or optional, give example values for fields, etc.

We could use normal JSON-schema for this, but this is quite verbose...
We could also use TypeScript, though this is non-standard...
Other RFCs are known to use their own specific languages for this (i.e., see TLS 1.3's type definitions), but maybe there is something workable already out there.

Currently, we use TypeScript format, since this is used in the Quicker prototype qlog implementation directly + is easy enough to parse for newcomers.
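
To illustrate what that looks like in practice, a TypeScript-style definition can express types and optionality quite compactly (this particular definition is made up for illustration, not copied from the draft):

// Illustrative only: a possible event data definition in TypeScript notation.
class ConnectionClosed {
    error_code?: number;  // optional transport or application error code
    reason?: string;      // optional human-readable reason phrase
    trigger?: string;     // optional: why the connection went away
}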

Aggregated metrics

Example: in-network logger aggregates over a period of time and sends back summarized data in 1 go (e.g., average RTT in past 10s, measured over 10k packets).

This can be supported in a variety of ways.

The easiest would probably be a new EVENT type that just contains the aggregated metrics.
However, this depends on which types of aggregated data you want to pass.

e.g., saying: median RTT was 50ms over the past 5s and we saw 1k packets in that time is fine

e.g., saying: median RTT was 50ms across these 5 connections identified by these 4-tuples, is not really adhering to the semantics of the original setup.
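
A very rough sketch of the acceptable case as a dedicated event (category, event name and field names are all hypothetical):

{
    "category": "RECOVERY",
    "event_type": "metrics_aggregated",
    "data": {
        "period_ms": 5000,
        "packet_count": 1000,
        "median_rtt_ms": 50
    }
}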

We need more input from the people doing aggregated use cases to see how this data would be used and which types of metadata are needed to make an informed decision.

Mark events by their importance

Not all events are equally useful in a debugging/tooling scenario.
Mark events according to order of usefulness/expectedness.

For example:

  • Core
  • Base
  • Extra

Make triggers behave like mixins

Since no-one is implementing triggers as top-level fields and it was always unclear how to best approach them, we should punt them to optional properties of the "data" field instead. This allows their flexibility without their overhead.

See also issue #7

consider moving to a binary format

I just started working on implementing qlog in quic-go. Maybe it's because I'm still fairly unfamiliar with qlog, but I feel like encoding things in JSON leads to some awkward hacks. Examples of these are:

  • A lot of numbers are encoded as strings, e.g. stream offset or packet numbers. I assume this is because JSON doesn't properly handle uint64s (or does it?).
  • IP addresses are encoded as strings. If that means they're supposed to be encoded in the human-readable encoding (with . and :), that's ambiguous for IPv6 addresses. Really, IP addresses should be a byte array.
  • (Raw) packet data is supposed to be hex encoded, which greatly increases the log size.
  • Some fields are defined as enums, whereas other fields that just have a few options are encoded as strings. Examples are the stream_side ("sending" or "receiving") and stream_type ("unidirectional" or "bidirectional"), which are both string fields.

I'm not sure if I like the trick to save bytes on the events by first defining the event_fields and then using a list instead of an object to encode the events. To me, this feels more like a hack to work around the shortcomings of JSON, namely the repetition of the field labels when using objects.
As far as I can see, a binary encoding scheme would be able to provide a type-safe representation here without repeating the field labels (and blowing up the file size), as long as it's possible to define some common_fields for a connection.

A protobuf-based logging format (This is just a suggestion. Protobufs are the thing I'm most familiar with, maybe there are better choices out there.) would resolve the encoding ambiguities I listed above, because we'd be able to make use of a strong typing system, which would allow us to completely eliminate the use of strings (except for places where things actually are strings, e.g. CONNECTION_CLOSE reason phrases). Furthermore, it would greatly simplify implementing qlog: Just fill in the corresponding fields in the Protobuf messages, call Marshal(), and you're done. No need to manually define dozens of logging structs and make sure they're correctly serialized into qlog's flavor of JSON.

Support partial logs

@nibanks mentioned that the winquic implementation does logging in a circular buffer. If this runs out of space, the earliest logs (e.g., the IP addresses, initial connection IDs, etc.) might have been dropped.

We can't really force tools to support this, but potentially we can add text in the draft so people know they should take this into account.

Readability vs file size

We would like to keep the file (semi-) human readable.
This means more advanced techniques, like listing all fields in an ENUM up-front and then referencing them by index, are not optimal.

However, when repeating strings, we might want to limit the length of individual fields.
e.g., TRANS instead of TRANSPORT, APP instead of APPLICATION, RX instead of RECEIVE, etc.

Current numbers (with readable strings):

  • qlog version without whitespace is about 4x the size of the binary .qtr file for the quic-trace example file (3.5MB to 823KB)
  • qlog version with whitespace is 10.6MB. However, this should matter little, since if manually reviewing large logs, you'll probably use a text editor that can auto-format the json from the version without whitespace

4x size difference is a lot, but better than the direct .json protobuf transform, which clocks in at 14MB.

Define the semantics of RX and TX for NETWORK vantage point

For CLIENT and SERVER, the difference between RECEIVE (RX) and TRANSMIT (TX) is obvious. Not so for an in-network observer (or, e.g., a proxy server), where these terms make less sense...

Some options:

  • Type is NETWORK_CLIENT and NETWORK_SERVER (instead of NETWORK)
  • add separate "flow" field indicating if we use CLIENT or SERVER semantics (currently in the draft)
  • add separate metadata field indicating which 5-tuple is the conceptual "client" and which is the "server" and use RX/TX based on that
  • Don't fix this and let the tooling layer figure it out (if packet nr 6 is a client TX and a RX in the NETWORK trace, the network is from the viewpoint of the SERVER)

Broader discussion: does it make sense to log packets as PACKET_TX and _RX here? How about a single PACKET event instead (similar to how Wireshark does it)? However: this doesn't make sense for (stateful) proxies that act as both client and server when transforming the traffic (e.g., Facebook's load balancer).

Lacking a way to indicate ALPN list for client

Currently, we can log a list of supported ALPN values for the server in "server_listening" and log the finally selected ALPN in "parameters_set". However, we lack a way to log the list of ALPN values the client supports (and offers the server).

Options:

  • add alpn_values?: Array<string> to "connection_started" (ALPN values offered by the client / received by the server); use parameters_set to log the finally selected ALPN
  • make "alpn" in "parameters_set" an array instead of a string

I personally prefer the 1st option, since the second doesn't match the semantics of "set" (it would be emitted twice), and logging the negotiation options should be optional in a "Base" event (while the final value is in a "Core" event).
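
A sketch of the first option (the ALPN strings are just example values):

{
    "event_type": "connection_started",
    "data": {
        "alpn_values": ["h3-24", "hq-24"]
    }
}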

CC @jlaine

Revisit the category for generic events

We have a number of "generic" events in the draft (section 7, General error, warning and debugging definitions). These currently have their own categories: error, warning, info, debug, verbose, simulation. This is a bit awkward, since most of these only have a single event type.

It might be a good idea to group these under a single category, e.g., "generic" or "general" or "textual" or... (or maybe 2: generic and simulation). Another option would be to define a new event field called log_level next to category, but that would increase overhead; right now we're essentially squatting on the category to provide that information.

The purpose of these events (sans simulation) is to allow one to replace the default textual logging with qlog completely (e.g., everything you now send to stdout/stderr with printf() would go into these kinds of events).

Thanks to @LPardue for reporting.

packet_size doesn't belong in the PacketHeader, but PacketType does

In my interpretation, the PacketHeader is a QUIC packet header. Therefore, it should contain the QUIC packet type.
However, the packet_size (as opposed to the payload_length, which is the length of the QUIC payload, e.g., in a coalesced packet) is a property of the UDP packet, and therefore should be a property of the packet_sent / packet_received event.
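
Concretely, that interpretation would look roughly like this (values are illustrative): packet_size lives next to the header, packet_type inside it:

{
    "event_type": "packet_sent",
    "data": {
        "packet_size": 1252,
        "header": {
            "packet_type": "1RTT",
            "packet_number": 55,
            "payload_length": 1222
        }
    }
}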

Mention JSON earlier

The fact that qlog is in JSON format is not mentioned until section 3.3.4 of the main logging schema draft. This should be stated earlier: before any JSON examples are given.

Allow more wire-image indicators

Currently, we abstract out a lot of the on-the-wire specifics.

A good example is in the STREAM frame: there, the length and offset fields are optional, depending on how the frame should be handled. Simply having the same fields optional in qlog doesn't convey quite the same semantics: did the implementation simply not log them or were they not present from the start?

More importantly though: if no length is set, the frame extends to the end of the packet, so it does have a length, which you'd probably want to log (that's the way it's currently designed), but then you lose the info that the length field wasn't encoded on the wire.

This is just one example of similar problems across the board. I rather like the simplicity of the current setup, but we should have ways to signal the explicit wire image as well.
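
One possible direction, sketched for the STREAM case: keep logging the effective length, but add an explicit indicator of whether the length field was present on the wire (the length_present name is made up here purely as an example):

{
    "frame_type": "stream",
    "stream_id": 0,
    "offset": 0,
    "length": 1024,
    "length_present": false
}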

cc @huitema

Numbers in JSON

Typically, integers in JavaScript and most JSON implementations are limited to 2^53-1.
This causes problems, as the VLIE (variable-length integer) values in the QUIC/H3 specs can be up to 2^62-1.

Two options:

  • Allow bigger values in the JSON. Tools MUST use a JSON parser that can deal with this and a JavaScript engine that supports BigInt (currently limited to Chromium: https://caniuse.com/#search=bigint)
  • Encode all VLIE fields as strings. Tools have to deal with this themselves (most will probably just take the shortcut of assuming actual values will be < 2^53 and use the JavaScript "Number" type). This is best for wide tooling support in browsers.

Currently, the draft uses option 2.
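
With option 2, any field that can hold a full varint is simply logged as a string, e.g., the maximum value 2^62-1:

{
    "offset": "4611686018427387903"
}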

Replace specific events with a single encompassing event

We want to reduce the total number of events as much as possible.

Especially specific events for things happening in reaction to the receipt of a specific frame in a packet (e.g., ACK, MAX_DATA, etc.) can be removed, since they can usually be inferred from that frame. Initially we had separate events for these (e.g., "packet_acknowledged" or "flow_control_updated"), but those were rarely used in addition to packet_received events, and many implementations do not defer frame handling from reception.

One notable exception is @nibanks' msquic, which does not log all frames in a received packet, but rather logs only specific events (e.g., packet_acknowledged). One of the reasons is that he feels logging each packet in full does not scale. Another reason for this pattern might be that an implementation does not wish to log all types of frames OR, conversely, does not want to log packet-level information at all but only very select frames.

To support this use case and still keep a low amount of event types, I will add a "frame_parsed" encompassing event. This will log the frame with its associated data, but without the encompassing packet-level data. This prevents re-defining semantics for many events. The downside is that you sometimes might want to log e.g., "packet_acknowledged" a long time after frame receipt. In that case, you would pretend you're parsing the frame only then. I feel this is a good trade-off to make here though.
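
A sketch of what such a frame_parsed event could look like for the packet_acknowledged case: just the ACK frame contents, without any packet-level wrapper (field names illustrative):

{
    "category": "TRANSPORT",
    "event_type": "frame_parsed",
    "data": {
        "frame_type": "ack",
        "acked_ranges": [[0, 5], [7, 9]]
    }
}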

Payload for packet_dropped

First, it looks like there is no payload for this event. Is that expected? It's a hard problem. Practically, there is only one case in which you drop a packet after decryption succeeds, and that is because it's a duplicate packet number. All other drop events would occur before a packet is decrypted. If it can't be decrypted, you don't know the packet number, which would likely be the most interesting payload of this event. So it therefore likely isn't too useful to have the packet number as payload.

So, in the absence of including the packet number as payload, it might just be worth having a "reason" which is a string. That's what winquic has already at least.
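
A minimal sketch of that, with reason as free-form text (as winquic already logs today):

{
    "event_type": "packet_dropped",
    "data": {
        "reason": "Decryption failure"
    }
}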

well-known URI might include an extension

Currently, we just use the ODCID directly as an identifier, without a ".qlog" extension.
It might be interesting to include the extension, but I don't really have a good view on the pros and cons.

UTF-8

Draft 01 doesn't specify a character encoding, but the .ts files work with UTF-8, don't they?
I propose to standardize that choice as well.

More fine-grained connection states

Per @huitema:

I would split the active state between "start" and "confirmed", and the handshake start on the server side between "received a request" and "address validated".

The server will go through the "anti dos mitigation" phase until the client's address is validated. That's important, because the server behavior in that state is restricted. Once it has sent all the handshake packets, the server goes to a "false start" phase in which it can send 1-RTT packets but (should) not receive. And then once it receives the client finished and sends the "handshake done", it moves to a confirmed state, at which point it can deal with migration and key update.

Client is kind of the same. It goes from initiating to handshake, then to an "almost ready" phase after sending the "client finished" and getting the 1rtt keys. But it will only become really active once it receives the "handshake done" or an equivalent.
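
If those phases were adopted, a state change could be logged along these lines (the event name and state strings below only paraphrase the description above; nothing is defined yet):

{
    "category": "CONNECTIVITY",
    "event_type": "connection_state_updated",
    "data": {
        "old": "false_start",
        "new": "confirmed"
    }
}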

Allow raw logging of packets and/or frames

As mentioned by @kazuho on the mailing list, it would be useful to (re-)introduce the ability to log raw packet and/or frame contents, probably as a hex-encoded string.

Probably easiest to add an additional field like this:

{
    "stream_type": "ACK",
    "raw_hex": "0892e340dbaa354f800239dddc7be78406fe3726bea050bb8c56ab36",
    ...
}

Updates for draft-25

For now, there's just one I'm aware of:

parameters_set.idle_timeout was renamed to .max_idle_timeout

Use cases for the TRIGGER field

To have the TRIGGER as a top-level field, there need to be good use-cases and people willing to use this in their tools.

Since the TRIGGER will only be useful for specific events (e.g., PACKET_RETRANSMIT can be due to several loss-detection related situations) and its value might in some cases also just be deduced from the context of surrounding log messages, it is debatable whether it will have much use in practice.

An alternate approach could be to log it as part of the DATA field of specific events, instead of as a top-level field.

Better qpack support

Some comments from @LPardue on qpack support:

but how do you correlate a qpack header block event (or whatever you want to call it) to the header frame that it was carried in? maybe simply that header block event contains a stream_id for correlation

other option is add raw_header_block to the HeaderFrame event data

Though that would still require additional events for encoder/decoder instructions, no?

a qpack event that can consist of encoder instructions, decoder instructions and/or header block

So this seems the better option then, combined with the stream_id and the expectation that the qpack events are logged in the same order as the HeaderFrames... (though I'm not sure how important the ordering is, personally).
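
A rough sketch of such a combined qpack event, correlated via stream_id (all names and values are hypothetical):

{
    "category": "HTTP3",
    "event_type": "qpack_headers_processed",
    "data": {
        "stream_id": 0,
        "encoder_instructions": [],
        "decoder_instructions": [],
        "raw_header_block": "0000d1d7508b"
    }
}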

Open questions:

  • What about seeing what's in the dynamic table (or initial static table? or is that always the same?)? specific dynamic_table_updated event or... ?

Be more consistent in numbers vs strings for (potentially) large numbers (varints)

Now, we only make varints strings if it's likely they will go over 2^53 (JSON's / JavaScript's number limit).

For example, this means error codes are just numbers, even though they are varints, as it's unlikely to see a very large error code. However, for fields that are greased (e.g., see quicwg/base-drafts#3360) they could be larger and a number no longer suffices.

See also dcil and scil in PacketHeader, which could be numbers but are now strings.

More in general: maybe it's best to simply allow both for all number fields and have the tools figure it out?

general observations

The following is a 'bag' of general comments ... perhaps not so actionable, but I did not find a mailing list to send these kinds of general observations to.

scope

Some readers may try to contrast/compare this effort with higher-level HTTP logging formats (e.g., NCSA, Common Log Format and friends) ... might be worth reducing their confusion by adding some context, e.g., qlog sits near pcap.

You may consider explicitly constraining scope to h2/h3 and degrading gracefully for use with other protocols as a side effect of good design, instead of asserting broad applicability.

There is a statement of compliance attempted in the 'tooling' section, but this document defines a high-level schema, e.g., I would have expected the only statement of compliance achievable (in this document) to be the correct validation of the schema against instance data.

Unlike pcap (which is defined in terms of an API), qlog (so far) is defined in the form of a schema; hence I was looking for a clear definition of optional vs. required ... my expectation was for the core of qlog to be as small as possible.

I would separate out protocol definition (endpoint) ... I am sure there exists a good IETF example of this ... but I am too lazy to find one and will point you to a W3C set of specs as an example: https://www.w3.org/TR/2013/REC-sparql11-overview-20130321/

json

You might consider adding some section on JSON convention/style (assert snake_case, etc.), e.g., remarking on the challenges of using JSON for representing log data (limitations with datetime, decimal, no comments - all come to mind).

You are defining a high-level schema (which for me implies no dependency on a concrete format like JSON), but as you have used JSON throughout to illustrate relationships/structure, I was looking (maybe I missed it) for a non-normative JSON schema definition.

keys

'error' is buried in the text; it should be normatively defined.

I dislike the term 'vantage_point' ... I understand the requirements, but maybe other terms like 'src' and 'target' are more appropriate.

values

Have you considered defining a time unit in some human-readable datetime format (including tz notation, e.g., ISO 8601 et al: https://www.w3.org/International/core/2005/09/timezone.html)?

transforms

Have you considered demonstrating how transforms might work? I like the way the CSV spec goes about this: https://www.w3.org/2013/csvw/wiki/Main_Page

general

It is unclear how easy it is for a database to index qlog-formatted JSON.

I think the section on 'Practical Use' might consider how compressed json compares to a binary format.

Invalid Assumptions in packet_sent triggers

The packet_sent triggers field makes the assumption that a packet is sent because a previous packet is being retransmitted. For instance, in winquic a connection has a queue/set of what data needs to be sent. When data is (suspected) lost, the data in the packet is added back to the queue. Similarly for PTO, we mark an outstanding packet as lost, and if we don't have any, queue a PING frame.

The send logic just grabs data from the queue/set and builds up packets to be sent. There is no direct relationship between different packets.

So, IMO, triggers are the reason data is queued to be sent, not actually sent. What is actually sent will depend on the entire state of the send queue at the time the send logic actually executes.

For example, assume you have two outstanding packets that end up getting marked as lost:

  PktNum=1 { STREAM ID=1, Offset=0, Length=100 }
  PktNum=2 { STREAM ID=1, Offset=100, Length=100 }

Both are marked as lost. Around the same time, the app queues another 100 bytes on stream 1 to be sent. Then another packet ends up getting sent:

  PktNum=55 { STREAM ID=1, Offset=0, Length=300 }

Revise design of dual-endpoint events

Currently, we have some events that are used both for indicating changes in the local as well as the remote endpoint. An example is parameters_set, which logs both the connection params we set locally as well as the ones we get from the other side. parameters_set uses an "owner" field to disambiguate between these two cases.

However, other events with a similar purpose, like connection_id_updated, use another approach (src_ vs dst_ prefixes). We should decide on a single consistent approach (currently leaning towards the "owner" field myself, as it is the most flexible).
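
For example, with the owner approach, connection_id_updated could be expressed like this instead of using src_/dst_ prefixes (sketch only, field names not final):

{
    "category": "CONNECTIVITY",
    "event_type": "connection_id_updated",
    "data": {
        "owner": "remote",
        "old": "127ecc83",
        "new": "d98f9d54"
    }
}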

Simplify / fix group_id usage

Right now, the group_id concept is -very- flexible. It can be in common_fields (e.g., 1 trace per connection) but it can also be in event_fields (combining several connections into 1 trace). That already puts quite a burden on tools to support these different approaches.

Then, another problem comes up if you would have the same group_id across multiple traces (e.g., trace 1 has some events for group_id x, but trace 2 has some group_id x events as well.)

Note: This concept was mainly added to support the spindump use case (https://github.com/EricssonResearch/spindump), CC @jariarkko. There, a network intermediary logs (aggregated) data on many connections and even protocols. It would be tedious for that setup to split everything out into separate traces.

Possible solutions I currently see:

  • Only allow 1 of the options (e.g., group_id only in common_fields or only in event_fields). I'm not a big fan of this (common only is inflexible, event_fields only has much overhead) (also: common can be seen as a special case of event_fields, so that could be the implementation target)
  • Disallow the same group_id across different traces: I think this makes a lot of sense, my preference
  • Discard the whole group_id concept altogether (in practice, this would lead to many different approaches in different tools; basically the same as event_fields only, but with no standard way of calling the thing)

Additional suggestion: rename group_id to "luid" (locally unique identifier)

Thanks to @nibanks for reporting this

Provide clearer usage advice

At the moment, it's not entirely clear how qlog is "supposed to be used".

For example, it's not clear to implementers why some fields (e.g., quic version) are duplicated across connection_started and also the PacketHeader.

We should provide some examples of what things look like if you implement "all of qlog" and what should happen if you only implement the Core events (i.e., some fields in the Core events can be skipped if you're using the Base or Extra events instead). This is also important info for tool implementers.

This also depends on the use case: tracing while debugging vs production-level logging.

Reduce importance of connection_id_updated

Currently, it is a Core event.
However, as pointed out by @marten-seemann, not all implementations track when the remote endpoint changes their CID (e.g., they only notice when they first receive a packet with a new CID). In those cases, they might decide to take the log filesize hit by just logging the CID for each incoming PacketHeader.

As such, connection_id_updated should probably be a Base event, with guidance on when to use which option.

Specify time units used for ack_delay

And probably for other non-timestamp time values (like RTT)

Options:

  • Force people to use the resolution set in the "configuration"
  • Choose a fixed resolution (always milli or always micro)
  • Allow people to indicate resolution inside .data of each event
  • Combination: default is milli, add config parameter to specify, allow overrides in the .data, etc.
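
A sketch of the combination option: a default resolution in the configuration, which individual events could override (the time_units name is just a placeholder, structure simplified):

{
    "configuration": {
        "time_units": "ms"
    }
}

All ack_delay / RTT values in the file would then be interpreted as milliseconds unless an event's data says otherwise.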

Thanks @jlaine for reporting.

Consolidate repeated padding frames?

Padding frames contain no information, but when eyeballing a qlog for the start of a connection, they swamp more interesting things.

It's a bit of a cheat, given that qlog otherwise is 1:1 between packet contents and logging, but I was wondering if a more compact representation of repeated padding frames might be nice.

This is less about qlog file size -- I'm assuming repeated padding entries in json would compress amazingly -- more about human readability.

I see pros and cons, but wanted to raise it as an issue. Cheers.

mandatory new field in key_updated compromises security

On a production system you probably don't want to log TLS secrets, even if you qlog (some of the) connections. The new field in the key_updated event therefore should not be mandatory.

I'm not sure I understand the old field either. If you're logging 1-RTT key updates and their sequence numbers, the key would already be written to the qlog, so there's no need to export it again. Or am I missing something?

Maybe it would be a good idea to keep key material to the SSLKEYLOGFILE and not even offer an option to write them to qlog?

Make event names more consistent

Change event names to the trend of "metrics_updated" instead of "metric_update".
This is what we use for "packet_sent" and "packet_received" etc. and it's nicer to have this everywhere.

Streaming format

Plain JSON is not entirely streamable... it requires its fields to be closed properly (by ] and }) at the end.

e.g.,

"events": [
           { "key": "val" }

Will fail, but

"events": [
           { "key": "val" }
]

will succeed.

However, one could employ a streaming JSON parser (e.g., http://oboejs.com/) and ignore unclosed fields at the end that way.
For the way the format is currently defined, an implementation would then write the "header" of the qlog file, and then it could stream individual events that are just appended to the file.
If the file isn't properly closed, that's not a problem: the streaming parser user just ignores those errors.
However, this breaks compatibility with the built-in parsers in many stdlibs and the browsers themselves.
It would also be possible to write a simple postprocessing tool that simply appends the necessary "closing" characters to a qlog file if it isn't complete, though that adds another step in a pipeline...

There are also various JSON-subformats that address this problem (see https://en.wikipedia.org/wiki/JSON_streaming), but they also do not seem to be supported in the standard JSON parsers for many platforms...

Given all this, my personal preference is to stay with just the normal JSON format and tools are recommended to use a streaming parser.
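
As a sketch of that recommendation (assuming a top-level events array as in the snippet above, and oboe's node()/fail() callback API), a tool running in Node could do something like:

import * as fs from "fs";
import oboe = require("oboe"); // the streaming JSON parser mentioned above

// Each completed event is delivered as soon as it has been parsed; if the
// file was never closed properly, only the fail() callback fires at the end,
// after all complete events have already been handled.
oboe(fs.createReadStream("connection.qlog"))
    .node("events.*", (event: unknown) => {
        console.log("event:", event);
    })
    .fail(() => {
        // ignore: the tail of the file simply wasn't terminated
    });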

(Screenshots omitted: one showing the browser's built-in JSON parser failing on such special JSON, and one showing a streaming parser (oboe) handling it properly.)

Allow flexible fields definitions

The fields that are logged for each individual event depend on the context of usage of the format.

E.g., if you split your logs per connection-id yourself, you do not need to log the connection-id for each event.
However, if you do not log only QUIC data, but also ICMP/TCP info (e.g., the in-network measurement use-case), you need an additional field "protocol-type".

Proposal:
Allow common fields to be set in each "trace" header.

{
     "common_fields": { "connection_id": "0xdeadbeef", "protocol-type": "QUIC" },
     "field_headers": ["timestamp", "category", "type", "trigger", "data"],
     "events": [ ... ]
}

If you do not log a field, you just leave it out of both common_fields and field_headers.

Reference the JSON specification

Currently JSON is mentioned, but not officially referenced.
Make it clear that formats unspecified in the qlog draft should be taken from the JSON spec (e.g., that booleans should be spelled true and false).

thanks @hawkinsw for reporting

Think about usage in proxies

If qlog is used on a proxy, the question becomes what vantage_point it should use for its traces, since it's now simultaneously accepting and opening connections, and thus behaves as both a client and a server.

One approach would be to have (two) separate traces with separate vantage_points, but if there is a 1-to-1 mapping between client -> proxy -> origin connections, it -might- make sense to log everything into one trace (though I would need to reflect more on this).

In this latter case, it might make sense to add a "role" or "vantage_point" indication to events like connection_started.

Thanks to @hawkinsw for reporting.
Maybe @LPardue has some comments, given his experience with the proxy-ing use case?
