Git Product home page Git Product logo

Comments (6)

cowtowncoder avatar cowtowncoder commented on September 27, 2024

Hmmh. Not sure I understood the exact description but I think that:

  1. Decoder must use same index as encoder (obviously, otherwise result is wrong)
  2. There should be no stipulation that everything that could be shared must be shared -- it is legal (if not optimal) to keep on adding Strings instead of using back-reference (otherwise lookup structures must be able to keep all N referencable Strings)
  3. Yes, if not mentioned, preventing use of specific indexes should be documented.

This should probably be tackled at:

https://github.com/FasterXML/smile-format-specification/

from jackson-dataformats-binary.

vooft avatar vooft commented on September 27, 2024

Sorry for not being clear.

The problem here in particular is due to skipping the index and not remembering which value was skipped.

For example:

  1. Value "test" happened to be 254th, the encoder wrote it as tiny ascii, but it was just not added to the back-referencing table.
  2. On the decoder side it was recorded in the back-referencing table as normal.
  3. There is no issue yet at the moment, because the next index for both encoder and decoder is still 255.
  4. If we never see "test" again, everything is fine, because the only occurrence was a single tiny ascii token and we never try to resolve the shared value number 254.
  5. However if we do see a "test" again, encoder and decoder may do different things:
    1. Encoder will write it again as a tiny ascii, but it will also assign a new index (in my case it was 307).
    2. Decoder sees tiny ascii "test" again and assumes this is just a non-shared repeated token and will not create a new index (as in point 2 in your answer).
  6. From now on, any new indexes will differ by 1 in encoder vs decoder.

IMO the best option would be to document the current Jackson behavior in the spec (i.e. skip first occurrence, remember the second one), since everyone is using Jackson as a reference anyways.

Another option would be implementing a logic to remember which strings were skipped on the encoder side and never try to share them again. In my example, since "test" was skipped the first time, from now on it will only be written as a tiny ascii and never as a shared value. From compatibility perspective, this is probably a better option.

Shall I create another issue in https://github.com/FasterXML/smile-format-specification/ and close this one?

from jackson-dataformats-binary.

cowtowncoder avatar cowtowncoder commented on September 27, 2024

@vooft Ok. Right: so the logic should indeed be that EVERY String written out occupies "slot", but some of these (254) are NOT allowed to be back-referenced by encoder. For decoder it should not really matter, it simply should not see any such references. It may keep it in decoder table, or (if it wants to super-optimize :) ) not store. But it has to consider that index taken.

And in fact this behavior is how it has to be: otherwise value 254 would not be avoided.

However: I don't think "test" value per-se could not be referenced, only slot #254 must be avoided. If and when it is repeated, following occurrence could be referenced.
I also do not think specification should strictly prohibit duplicates in encoding table -- ideally no duplicates should be kept, but some encoders could allow that by necessity (hash tables with collisions and no overflow chains -- i.e. imperfect set of references kept).

So yes, I think I agree: current behavior should be clarified.

Does above make sense?

from jackson-dataformats-binary.

vooft avatar vooft commented on September 27, 2024

Yes, absolutely, thank you for the clarification!

Just a thought: wouldn't it be simpler just to skip those invalid indexes and pick a next valid one? Like, in this case 254 and 255 must be skipped, but "test" could be remembered as 256 on the first occurrence.

from jackson-dataformats-binary.

cowtowncoder avatar cowtowncoder commented on September 27, 2024

If I was to specify behavior from the scratch, yes, but for backwards compatibility reasons (existing Smile encoded data; earlier versions), that would be majorly backwards-incompatible change -- if I understand the suggestion correctly.

So I don't think that can be done, even tho it is a good idea on its own, if we didn't have to worry about existing usage and codec versions.

from jackson-dataformats-binary.

vooft avatar vooft commented on September 27, 2024

Closing this issue, as it will be handled in the spec FasterXML/smile-format-specification#21

from jackson-dataformats-binary.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.