Comments (11)

llamahunter commented on August 17, 2024

So, how do you want to support this? Without changing the output format dramatically, it would require a new 'magic' message payload, which readers would need to know about, containing just the key of the deleted entry. (This is still an output format change, but one that only consumers of topics with tombstones would need to know about. They wouldn't be able to use a stock serde, though.)

I still think the output format should include both the key and the offset, separate from the message payload. It should possibly also include the other info that comes along with a message, like the attribute flags (e.g. the compression codec). But it seems unlikely the CRC is necessary for archiving; if the message body is bad, maybe just discard it?

llamahunter commented on August 17, 2024

https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets

pgarbacki commented on August 17, 2024

I haven't thought about it much, but at this point I don't have a good suggestion for a generic solution that includes all metadata in addition to the message payload.

Regarding the specific case of tombstones, the simplest solution is to replace the null tombstone message body with a custom magic + key. It is indeed the case that readers will need to be aware of this formatting convention.
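
For illustration, here is a minimal sketch of what such a 'magic + key' payload convention could look like. The marker bytes and helper names are assumptions for the sketch, not anything Secor actually implements (and the thread later moves away from this approach):

```java
// Hypothetical convention: a reserved magic prefix in the archived payload
// means "this record was a tombstone; the remaining bytes are the Kafka key".
public class TombstonePayload {
    private static final byte[] TOMBSTONE_MAGIC = {(byte) 0xDE, (byte) 0xAD, 0x00, 0x01};

    // Encode a tombstone: magic prefix followed by the deleted entry's key.
    public static byte[] encode(byte[] kafkaKey) {
        byte[] out = new byte[TOMBSTONE_MAGIC.length + kafkaKey.length];
        System.arraycopy(TOMBSTONE_MAGIC, 0, out, 0, TOMBSTONE_MAGIC.length);
        System.arraycopy(kafkaKey, 0, out, TOMBSTONE_MAGIC.length, kafkaKey.length);
        return out;
    }

    // Readers that know the convention check for the prefix before deserializing.
    public static boolean isTombstone(byte[] payload) {
        if (payload == null || payload.length < TOMBSTONE_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < TOMBSTONE_MAGIC.length; i++) {
            if (payload[i] != TOMBSTONE_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The drawback, as noted above, is that every consumer needs this custom logic instead of a stock serde.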

llamahunter commented on August 17, 2024

From some reading on Hadoop SequenceFiles, it appears the suggestion is that all metadata should be encoded in the key. Thus, perhaps the strategy should instead be to put the key+offset into the SequenceFile key and leave the message body as the value? That way, there's no need for special handling of 'magic' payloads by downstream readers. However, downstream readers that care about the offset (or key) would need to know how to parse it out of the SequenceFile key field.

http://thinkbig.teradata.com/hadoop-sequence-files-and-a-use-case/
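
As a rough illustration of that idea (not Secor's actual writer code, just a sketch under assumed file paths and encoding), a SequenceFile could carry a BytesWritable key holding the offset followed by the raw Kafka key, with the message body as the value:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class KeyedSequenceFileSketch {

    public static SequenceFile.Writer open(Configuration conf, String file) throws IOException {
        // BytesWritable key (offset + Kafka key) and BytesWritable value (message body).
        return SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(file)),
                SequenceFile.Writer.keyClass(BytesWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
    }

    public static void append(SequenceFile.Writer writer, long offset,
                              byte[] kafkaKey, byte[] payload) throws IOException {
        // Sketch encoding: 8-byte big-endian offset, then the raw Kafka key (if any).
        int keyLen = kafkaKey == null ? 0 : kafkaKey.length;
        ByteBuffer buf = ByteBuffer.allocate(8 + keyLen);
        buf.putLong(offset);
        if (kafkaKey != null) {
            buf.put(kafkaKey);
        }
        writer.append(new BytesWritable(buf.array()), new BytesWritable(payload));
    }
}
```

A fixed 8-byte offset prefix like this is trivial to parse, but readers that expect plain offset keys would break, which is the backwards-compatibility concern raised below.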

pgarbacki commented on August 17, 2024

Yes, storing the key alongside the offset in the seq file key is more elegant. My only concern is backwards compatibility: currently seq file keys store offsets, and people use those to verify output consistency. That said, what you suggest, even at the cost of breaking backwards compatibility, is probably the right thing to do.

llamahunter commented on August 17, 2024

I think it could be a configuration option. If people want message keys archived (e.g. for tombstone tracking or some other purpose), they would need to parse the SequenceFile key field to separate the key from the offset. If they don't care about keys, the SequenceFile key is just the offset, as it is now.

I'd argue for encoding the SequenceFile key in some reasonably extensible but extremely easy-to-parse format. Note that keys can be binary. Suggestions?

pgarbacki commented on August 17, 2024

We could use something like MessagePack to serialize a key-value map for extensibility, with enum (int) keys for compactness and byte values for flexibility.

llamahunter commented on August 17, 2024

So, I'm going to start working on making key archiving a reality in the next few days. Any new thoughts on this? I'm probably going to go with the above approach of modifying the SequenceFile key field to contain both the partition hash key and sequence number. I'm sort of inclined to use JSON for the key/value map of the SequenceFile key rather than MessagePack, unless you think size is a real issue. I guess repeating the keys in JSON over and over for 'hash key' and 'sequence number' is kind of a bummer. If we go with int keys via MessagePack, there needs to be a page documenting them somewhere.

HenryCaiHaiying commented on August 17, 2024

JSON is too loose in terms of schema conformance. I would prefer a strongly typed object; you basically want a struct with two fields: KafkaMessageKey and the offset.
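
In Java terms, presumably something like this hypothetical two-field class (names are illustrative, not from Secor):

```java
// Hypothetical strongly typed SequenceFile key, per the suggestion above.
public class ArchivedMessageKey {
    private final byte[] kafkaMessageKey; // null for keyless messages
    private final long offset;

    public ArchivedMessageKey(byte[] kafkaMessageKey, long offset) {
        this.kafkaMessageKey = kafkaMessageKey;
        this.offset = offset;
    }

    public byte[] getKafkaMessageKey() { return kafkaMessageKey; }
    public long getOffset() { return offset; }
}
```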


llamahunter commented on August 17, 2024

So, MessagePack is not really strongly typed either; it's just a binary encoding of keys and values. Also, it's not clear to me that strong typing is a good idea for the metadata fields. I think that defining known keys, and allowing the set of keys to expand over time, is a good strategy. Given that MessagePack has a lot of language bindings, I'm ok using it instead of JSON. And I do sort of like the idea of not repeating the keys over and over again as strings. So... proposal.

Keys will be encoded as MessagePack Integers. The keys will be defined in the Secor source code and documented in DESIGN.md. Initial definitions are:

Key   Meaning            MessagePack Value Type
1     partition offset   64-bit Integer
2     message key        Raw binary byte array, or Nil if no key

Note that this 'new' output format of storing the MessagePack map in the SequenceFile key field would be a non-default option of Secor. The default behavior of storing just the partition offset would be maintained for backwards-compatibility use cases.
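
A minimal sketch of that proposal using the msgpack-java library (org.msgpack:msgpack-core); the class and method names here are illustrative, not Secor's actual implementation:

```java
import java.io.IOException;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;
import org.msgpack.core.MessageUnpacker;

public class SequenceFileKeyCodec {
    private static final int FIELD_OFFSET = 1;      // partition offset
    private static final int FIELD_MESSAGE_KEY = 2; // Kafka message key

    // Pack {1: offset, 2: key-or-nil} into the SequenceFile key bytes.
    public static byte[] pack(long offset, byte[] messageKey) throws IOException {
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(2);
        packer.packInt(FIELD_OFFSET);
        packer.packLong(offset);
        packer.packInt(FIELD_MESSAGE_KEY);
        if (messageKey == null) {
            packer.packNil();                        // no key (e.g. keyless messages)
        } else {
            packer.packBinaryHeader(messageKey.length);
            packer.writePayload(messageKey);
        }
        byte[] bytes = packer.toByteArray();
        packer.close();
        return bytes;
    }

    // Read back only the offset, skipping fields the reader doesn't care about.
    public static long unpackOffset(byte[] seqFileKey) throws IOException {
        MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(seqFileKey);
        long offset = -1;
        int entries = unpacker.unpackMapHeader();
        for (int i = 0; i < entries; i++) {
            int field = unpacker.unpackInt();
            if (field == FIELD_OFFSET) {
                offset = unpacker.unpackLong();
            } else {
                unpacker.skipValue();                // unknown fields keep the map extensible
            }
        }
        unpacker.close();
        return offset;
    }
}
```

Readers skip field IDs they don't recognize, which is what keeps the map extensible as new keys are defined and documented in DESIGN.md.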

HenryCaiHaiying commented on August 17, 2024

This sounds good to me.

