Comments (11)

llamahunter commented on August 17, 2024

So, how do you want to support this? Without changing the output format dramatically, it would require a new 'magic' message payload, which readers would need to know about, containing just the key of the deleted entry. (This is still an output format change, but one that only consumers of topics with tombstones would need to know about. They wouldn't be able to use a stock serde, though.)

I still think the output format should include both the key and the offset, separate from the message payload. It should possibly also include the other info that comes along with a message, like the attribute flags (e.g. the compression codec). But it seems unlikely the CRC is necessary for archiving; if the message body is bad, maybe just discard it?

llamahunter commented on August 17, 2024

https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets

pgarbacki commented on August 17, 2024

I haven't thought about it much, but at this point I don't have a good suggestion for a generic solution that includes all metadata in addition to the message payload.

Regarding the specific case of tombstones, the simplest solution is to replace the null tombstone message body with a custom magic + key. It is indeed the case that readers will need to be aware of this formatting convention.
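
For illustration, here is a minimal sketch of what such a 'magic + key' payload convention could look like. The marker bytes and helper names are assumptions for the sketch, not anything Secor actually implements (and the thread later moves away from this approach):

```java
// Hypothetical convention: a reserved magic prefix in the archived payload
// means "this record was a tombstone; the remaining bytes are the Kafka key".
public class TombstonePayload {
    private static final byte[] TOMBSTONE_MAGIC = {(byte) 0xDE, (byte) 0xAD, 0x00, 0x01};

    // Encode a tombstone: magic prefix followed by the deleted entry's key.
    public static byte[] encode(byte[] kafkaKey) {
        byte[] out = new byte[TOMBSTONE_MAGIC.length + kafkaKey.length];
        System.arraycopy(TOMBSTONE_MAGIC, 0, out, 0, TOMBSTONE_MAGIC.length);
        System.arraycopy(kafkaKey, 0, out, TOMBSTONE_MAGIC.length, kafkaKey.length);
        return out;
    }

    // Readers that know the convention check for the prefix before deserializing.
    public static boolean isTombstone(byte[] payload) {
        if (payload == null || payload.length < TOMBSTONE_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < TOMBSTONE_MAGIC.length; i++) {
            if (payload[i] != TOMBSTONE_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The drawback, as noted above, is that every consumer needs this custom logic instead of a stock serde.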

llamahunter commented on August 17, 2024

From some reading on Hadoop SequenceFiles, it appears the suggestion is that all metadata should be encoded in the key. Thus, perhaps the strategy should instead be to put the key+offset into the SequenceFile key and leave the message body as the value? That way, there's no need for special handling of 'magic' payloads by downstream readers. However, downstream readers that care about the offset (or key) would need to know how to parse it out of the SequenceFile key field.

http://thinkbig.teradata.com/hadoop-sequence-files-and-a-use-case/
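
As a rough illustration of that idea (not Secor's actual writer code, just a sketch under assumed file paths and encoding), a SequenceFile could carry a BytesWritable key holding the offset followed by the raw Kafka key, with the message body as the value:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class KeyedSequenceFileSketch {

    public static SequenceFile.Writer open(Configuration conf, String file) throws IOException {
        // BytesWritable key (offset + Kafka key) and BytesWritable value (message body).
        return SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(file)),
                SequenceFile.Writer.keyClass(BytesWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
    }

    public static void append(SequenceFile.Writer writer, long offset,
                              byte[] kafkaKey, byte[] payload) throws IOException {
        // Sketch encoding: 8-byte big-endian offset, then the raw Kafka key (if any).
        int keyLen = kafkaKey == null ? 0 : kafkaKey.length;
        ByteBuffer buf = ByteBuffer.allocate(8 + keyLen);
        buf.putLong(offset);
        if (kafkaKey != null) {
            buf.put(kafkaKey);
        }
        writer.append(new BytesWritable(buf.array()), new BytesWritable(payload));
    }
}
```

A fixed 8-byte offset prefix like this is trivial to parse, but readers that expect plain offset keys would break, which is the backwards-compatibility concern raised below.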

pgarbacki commented on August 17, 2024

Yes, storing the key alongside the offset in the seq file key is more elegant. My only concern is backwards compatibility: currently seq file keys store offsets, and people use those to verify output consistency. That said, what you suggest, even at the cost of breaking backwards compatibility, is probably the right thing to do.

llamahunter commented on August 17, 2024

I think it could be a configuration option. If people want message keys archived (e.g. for tombstone tracking or some other purpose), they would need to parse the SequenceFile key field to separate the key from the offset. If they don't care about keys, the SequenceFile key is just the offset, as it is now.

I'd argue for encoding the SequenceFile key in some reasonably extensible but extremely easy-to-parse format. Note that keys can be binary. Suggestions?

pgarbacki commented on August 17, 2024

We could use something like MessagePack to serialize a key-value map for extensibility, with enum (int) keys for compactness and byte values for flexibility.

llamahunter commented on August 17, 2024

So, I'm going to start working on making key archiving a reality in the next few days. Any new thoughts on this? I'm probably going to go with the above approach of modifying the SequenceFile key field to contain both the partition hash key and sequence number. I'm sort of inclined to use JSON for the key/value map of the SequenceFile key rather than MessagePack, unless you think size is a real issue. I guess repeating the keys in JSON over and over for 'hash key' and 'sequence number' is kind of a bummer. If we go with int keys via MessagePack, there needs to be a page documenting them somewhere.

HenryCaiHaiying commented on August 17, 2024

JSON is too loose in terms of schema conformance. I would prefer a strongly typed object; you basically want a struct with two fields: KafkaMessageKey and the offset.
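
In Java terms, presumably something like this hypothetical two-field class (names are illustrative, not from Secor):

```java
// Hypothetical strongly typed SequenceFile key, per the suggestion above.
public class ArchivedMessageKey {
    private final byte[] kafkaMessageKey; // null for keyless messages
    private final long offset;

    public ArchivedMessageKey(byte[] kafkaMessageKey, long offset) {
        this.kafkaMessageKey = kafkaMessageKey;
        this.offset = offset;
    }

    public byte[] getKafkaMessageKey() { return kafkaMessageKey; }
    public long getOffset() { return offset; }
}
```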


llamahunter commented on August 17, 2024

So, MessagePack is not really strongly typed either; it's just a binary encoding of keys and values. Also, it's not clear to me that strong typing is a good idea for the metadata fields. I think that defining known keys, and allowing the set of keys to expand over time, is a good strategy. Given that MessagePack has a lot of language bindings, I'm ok using it instead of JSON. And I do sort of like the idea of not repeating the keys over and over again as strings. So... proposal.

Keys will be encoded as MessagePack Integers. The keys will be defined in the Secor source code and documented in DESIGN.md. Initial definitions are:

Key   Meaning            MessagePack Value Type
1     partition offset   64-bit Integer
2     message key        Raw binary byte array, or Nil if no key

Note that this 'new' output format of storing the MessagePack map in the SequenceFile key field would be a non-default option of Secor. The default behavior of storing just the partition offset would be maintained for backwards-compatibility use cases.
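
A minimal sketch of that proposal using the msgpack-java library (org.msgpack:msgpack-core); the class and method names here are illustrative, not Secor's actual implementation:

```java
import java.io.IOException;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;
import org.msgpack.core.MessageUnpacker;

public class SequenceFileKeyCodec {
    private static final int FIELD_OFFSET = 1;      // partition offset
    private static final int FIELD_MESSAGE_KEY = 2; // Kafka message key

    // Pack {1: offset, 2: key-or-nil} into the SequenceFile key bytes.
    public static byte[] pack(long offset, byte[] messageKey) throws IOException {
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(2);
        packer.packInt(FIELD_OFFSET);
        packer.packLong(offset);
        packer.packInt(FIELD_MESSAGE_KEY);
        if (messageKey == null) {
            packer.packNil();                        // no key (e.g. keyless messages)
        } else {
            packer.packBinaryHeader(messageKey.length);
            packer.writePayload(messageKey);
        }
        byte[] bytes = packer.toByteArray();
        packer.close();
        return bytes;
    }

    // Read back only the offset, skipping fields the reader doesn't care about.
    public static long unpackOffset(byte[] seqFileKey) throws IOException {
        MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(seqFileKey);
        long offset = -1;
        int entries = unpacker.unpackMapHeader();
        for (int i = 0; i < entries; i++) {
            int field = unpacker.unpackInt();
            if (field == FIELD_OFFSET) {
                offset = unpacker.unpackLong();
            } else {
                unpacker.skipValue();                // unknown fields keep the map extensible
            }
        }
        unpacker.close();
        return offset;
    }
}
```

Readers skip field IDs they don't recognize, which is what keeps the map extensible as new keys are defined and documented in DESIGN.md.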

HenryCaiHaiying commented on August 17, 2024

This sounds good to me.

