Comments (6)
Hmmh. Not sure I understood the exact description but I think that:
- Decoder must use same index as encoder (obviously, otherwise result is wrong)
- There should be no stipulation that everything that could be shared must be shared -- it is legal (if not optimal) to keep on adding Strings instead of using back-reference (otherwise lookup structures must be able to keep all N referencable Strings)
- Yes, if not mentioned, preventing use of specific indexes should be documented.
This should probably be tackled at:
https://github.com/FasterXML/smile-format-specification/
from jackson-dataformats-binary.
Sorry for not being clear.
The problem here in particular is due to skipping the index and not remembering which value was skipped.
For example:
- Value "test" happened to be 254th, the encoder wrote it as tiny ascii, but it was just not added to the back-referencing table.
- On the decoder side it was recorded in the back-referencing table as normal.
- There is no issue yet at the moment, because the next index for both encoder and decoder is still 255.
- If we never see "test" again, everything is fine, because the only occurrence was a single tiny ascii token and we never try to resolve the shared value number 254.
- However if we do see a "test" again, encoder and decoder may do different things:
- Encoder will write it again as a tiny ascii, but it will also assign a new index (in my case it was 307).
- Decoder sees tiny ascii "test" again and assumes this is just a non-shared repeated token and will not create a new index (as in point 2 in your answer).
- From now on, any new indexes will differ by 1 in encoder vs decoder.
IMO the best option would be to document the current Jackson behavior in the spec (i.e. skip first occurrence, remember the second one), since everyone is using Jackson as a reference anyways.
Another option would be implementing a logic to remember which strings were skipped on the encoder side and never try to share them again. In my example, since "test" was skipped the first time, from now on it will only be written as a tiny ascii and never as a shared value. From compatibility perspective, this is probably a better option.
Shall I create another issue in https://github.com/FasterXML/smile-format-specification/ and close this one?
from jackson-dataformats-binary.
@vooft Ok. Right: so the logic should indeed be that EVERY String written out occupies "slot", but some of these (254) are NOT allowed to be back-referenced by encoder. For decoder it should not really matter, it simply should not see any such references. It may keep it in decoder table, or (if it wants to super-optimize :) ) not store. But it has to consider that index taken.
And in fact this behavior is how it has to be: otherwise value 254 would not be avoided.
However: I don't think "test" value per-se could not be referenced, only slot #254 must be avoided. If and when it is repeated, following occurrence could be referenced.
I also do not think specification should strictly prohibit duplicates in encoding table -- ideally no duplicates should be kept, but some encoders could allow that by necessity (hash tables with collisions and no overflow chains -- i.e. imperfect set of references kept).
So yes, I think I agree: current behavior should be clarified.
Does above make sense?
from jackson-dataformats-binary.
Yes, absolutely, thank you for the clarification!
Just a thought: wouldn't it be simpler just to skip those invalid indexes and pick a next valid one? Like, in this case 254 and 255 must be skipped, but "test" could be remembered as 256 on the first occurrence.
from jackson-dataformats-binary.
If I was to specify behavior from the scratch, yes, but for backwards compatibility reasons (existing Smile encoded data; earlier versions), that would be majorly backwards-incompatible change -- if I understand the suggestion correctly.
So I don't think that can be done, even tho it is a good idea on its own, if we didn't have to worry about existing usage and codec versions.
from jackson-dataformats-binary.
Closing this issue, as it will be handled in the spec FasterXML/smile-format-specification#21
from jackson-dataformats-binary.
Related Issues (20)
- `ProtobufParser.currentName()` returns wrong value at root level
- Unexpected `ArrayIndexOutOfBoundsException` in `CBORParser` for corrupt String value
- IonReader.newBytes() throwing `NegativeArraySizeException` HOT 1
- `IonReader` throws `AssertionError` for Timestamp value HOT 1
- `IonReader.next()` throws `ArrayIndexOutOfBoundsException` for some corrupt content HOT 4
- Support for `StreamReadConstraints.maxNameLength` validation for CBOR, Smile backends
- Protobuf `LocalDateTime` serialization as Root value throws `UnsupportedOperationException` HOT 5
- Upgrade `ion-java` to latest (1.11.4)
- Upgrade `ion-java` to latest (1.11.7)
- ION deserialization type change from Double to Float in 2.17.0 HOT 7
- ION - MappingIterator skip the first line HOT 8
- Avro Schema generation: allow mapping Java Enum properties to Avro String values HOT 7
- Failed copy(): `IonValueMapper` does not override copy() HOT 5
- [AVRO] 2 fields in schema for single property when getter and setter contain property name in capitals. HOT 3
- Upgrade `ion-java` dependency to 1.11.8
- Cannot deserialize `UUID` values HOT 8
- IonValueMapper.builder() not implemented, does not register modules HOT 2
- Avro 1.12.0 breaks jackson-dataformat-avro (also: requires JDK 11) HOT 14
- Issue with canonical CBOR serialization and removing RFC7049#2.4 type hints HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from jackson-dataformats-binary.