Hello! I'm working on my own smile serializer/deserializer and, I th

Hmmh. Not sure I understood the exact deion but I think that: <ol dir="auto"

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Closing this issue, as it will be handled in the spec <a class="issue-link js-issue-li

Issue in Smile shared string values due to avoiding 0xFE and 0xFF about jackson-dataformats-binary HOT 6 CLOSED

vooft commented on September 27, 2024

Issue in Smile shared string values due to avoiding 0xFE and 0xFF

from jackson-dataformats-binary.

Comments (6)

cowtowncoder commented on September 27, 2024

Hmmh. Not sure I understood the exact description but I think that:

Decoder must use same index as encoder (obviously, otherwise result is wrong)
There should be no stipulation that everything that could be shared must be shared -- it is legal (if not optimal) to keep on adding Strings instead of using back-reference (otherwise lookup structures must be able to keep all N referencable Strings)
Yes, if not mentioned, preventing use of specific indexes should be documented.

This should probably be tackled at:

https://github.com/FasterXML/smile-format-specification/

from jackson-dataformats-binary.

vooft commented on September 27, 2024

Sorry for not being clear.

The problem here in particular is due to skipping the index and not remembering which value was skipped.

For example:

Value "test" happened to be 254th, the encoder wrote it as tiny ascii, but it was just not added to the back-referencing table.
On the decoder side it was recorded in the back-referencing table as normal.
There is no issue yet at the moment, because the next index for both encoder and decoder is still 255.
If we never see "test" again, everything is fine, because the only occurrence was a single tiny ascii token and we never try to resolve the shared value number 254.
However if we do see a "test" again, encoder and decoder may do different things:
1. Encoder will write it again as a tiny ascii, but it will also assign a new index (in my case it was 307).
2. Decoder sees tiny ascii "test" again and assumes this is just a non-shared repeated token and will not create a new index (as in point 2 in your answer).
From now on, any new indexes will differ by 1 in encoder vs decoder.

IMO the best option would be to document the current Jackson behavior in the spec (i.e. skip first occurrence, remember the second one), since everyone is using Jackson as a reference anyways.

Another option would be implementing a logic to remember which strings were skipped on the encoder side and never try to share them again. In my example, since "test" was skipped the first time, from now on it will only be written as a tiny ascii and never as a shared value. From compatibility perspective, this is probably a better option.

Shall I create another issue in https://github.com/FasterXML/smile-format-specification/ and close this one?

from jackson-dataformats-binary.

cowtowncoder commented on September 27, 2024

@vooft Ok. Right: so the logic should indeed be that EVERY String written out occupies "slot", but some of these (254) are NOT allowed to be back-referenced by encoder. For decoder it should not really matter, it simply should not see any such references. It may keep it in decoder table, or (if it wants to super-optimize :) ) not store. But it has to consider that index taken.

And in fact this behavior is how it has to be: otherwise value 254 would not be avoided.

However: I don't think "test" value per-se could not be referenced, only slot #254 must be avoided. If and when it is repeated, following occurrence could be referenced.
I also do not think specification should strictly prohibit duplicates in encoding table -- ideally no duplicates should be kept, but some encoders could allow that by necessity (hash tables with collisions and no overflow chains -- i.e. imperfect set of references kept).

So yes, I think I agree: current behavior should be clarified.

Does above make sense?

from jackson-dataformats-binary.

vooft commented on September 27, 2024

Yes, absolutely, thank you for the clarification!

Just a thought: wouldn't it be simpler just to skip those invalid indexes and pick a next valid one? Like, in this case 254 and 255 must be skipped, but "test" could be remembered as 256 on the first occurrence.

from jackson-dataformats-binary.

cowtowncoder commented on September 27, 2024

If I was to specify behavior from the scratch, yes, but for backwards compatibility reasons (existing Smile encoded data; earlier versions), that would be majorly backwards-incompatible change -- if I understand the suggestion correctly.

So I don't think that can be done, even tho it is a good idea on its own, if we didn't have to worry about existing usage and codec versions.

from jackson-dataformats-binary.

vooft commented on September 27, 2024

Closing this issue, as it will be handled in the spec FasterXML/smile-format-specification#21

from jackson-dataformats-binary.

Issue in Smile shared string values due to avoiding 0xFE and 0xFF about jackson-dataformats-binary HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent