fmtbuf's Issues

Mechanism for Handling Zero Width Joiners

Certain graphemes like "🙇‍♀" (which you might see as two separate graphemes) are composed of three code points:

  1. 🙇 U+1F647 "Person Bowing Emoji"
  2. U+200D "Zero Width Joiner"
  3. ♀ U+2640 "Female Sign"

So the single grapheme is the character sequence b"\xf0\x9f\x99\x87\xe2\x80\x8d\xe2\x99\x80".
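
As a quick check of that byte sequence, here is a minimal sketch using only the Rust standard library. The helper names `bowing_woman_bytes` and `code_point_widths` are made up for illustration and are not part of fmtbuf.

```rust
/// Returns the UTF-8 bytes of the three-code-point sequence described above.
/// (Hypothetical helper for illustration; not part of fmtbuf.)
fn bowing_woman_bytes() -> Vec<u8> {
    // U+1F647 (person bowing) + U+200D (zero width joiner) + U+2640 (female sign)
    let grapheme = "\u{1F647}\u{200D}\u{2640}";
    grapheme.as_bytes().to_vec()
}

/// Pairs each code point in `s` with the number of UTF-8 bytes it occupies.
fn code_point_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}
```

The emoji takes 4 bytes and the joiner and the female sign take 3 bytes each, which is where the total of 10 comes from.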

Steps to Reproduce

If there is enough space in the buffer, everything is fine:

Cli { buffer_size: 10, reserve: 0, finish_with: None, truncate_with: None, debug: true, input: "🙇\u{200d}♀" }
🙇‍♀
+ version: 0.1.2
+ written_len: 10
+ truncated: false
+ output_bytes: [240, 159, 153, 135, 226, 128, 141, 226, 153, 128]
+ input: 🙇‍♀
+ input_bytes: [240, 159, 153, 135, 226, 128, 141, 226, 153, 128]

But shrinking the buffer to 9 is a bit odd:

Cli { buffer_size: 9, reserve: 0, finish_with: None, truncate_with: None, debug: true, input: "🙇\u{200d}♀" }
🙇
+ version: 0.1.2
+ written_len: 7
+ truncated: true
+ output_bytes: [240, 159, 153, 135, 226, 128, 141]
+ input: 🙇‍♀
+ input_bytes: [240, 159, 153, 135, 226, 128, 141, 226, 153, 128]

The output string is b"\xf0\x9f\x99\x87\xe2\x80\x8d", which includes the zero width joiner character, but it joins nothing.
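
The behavior above can be reproduced with a std-only sketch that truncates at the nearest UTF-8 character boundary. This is an assumption about the general technique, not fmtbuf's actual implementation, and `truncate_to_char_boundary` is a hypothetical name.

```rust
/// Returns the longest prefix of `input` that fits in `capacity` bytes and
/// ends on a UTF-8 character boundary.
/// (Illustrative sketch; not fmtbuf's implementation.)
fn truncate_to_char_boundary(input: &str, capacity: usize) -> &str {
    if input.len() <= capacity {
        return input;
    }
    let mut end = capacity;
    // Index 0 is always a boundary, so this loop terminates.
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    &input[..end]
}
```

With a capacity of 9, byte index 9 falls inside the 3-byte female sign, so the cut backs up to index 7. That keeps the joiner, because a boundary check works on code points and knows nothing about graphemes.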

Potential Fixes

It is not entirely clear what should happen here.

Option 1: Do Nothing

This might be perfectly okay. The output is still valid UTF-8 and every decoder should be able to read it. This library only guarantees UTF-8, not that the output is a sensible grapheme.

Option 2: Cut Off the Zero Width Joiner

Another option is to drop the trailing ZWJ, leaving 🙇 b"\xf0\x9f\x99\x87". This keeps the idea of "person bowing" but drops the gender modifier.

Cli { buffer_size: 9, reserve: 0, finish_with: None, truncate_with: Some("..."), debug: true, input: "🙇\u{200d}♀" }
🙇...
+ version: 0.1.2
+ written_len: 7
+ truncated: true
+ output_bytes: [240, 159, 153, 135, 46, 46, 46]
+ input: 🙇‍♀
+ input_bytes: [240, 159, 153, 135, 226, 128, 141, 226, 153, 128]

This has the unfortunate side effect of potentially changing the meaning of the written output. The family of 4 written by b"\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x91\xa6\xe2\x80\x8d\xf0\x9f\x91\xa6" showing up as 👨‍👩‍👦 might change the interpretation. At least the loss of the second child is conveyed by the Err returned for truncation.
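
A minimal sketch of Option 2, again using only the standard library: truncate at a character boundary, then strip any joiner left dangling at the end. The function name is hypothetical and the code only illustrates the idea. It does not use fmtbuf's API.

```rust
/// Truncates `input` to at most `capacity` bytes on a character boundary,
/// then removes any zero width joiner left at the end of the kept prefix.
/// (Illustrative sketch of Option 2; not fmtbuf's implementation.)
fn truncate_dropping_trailing_zwj(input: &str, capacity: usize) -> &str {
    const ZWJ: char = '\u{200D}';
    let mut end = input.len().min(capacity);
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    let prefix = &input[..end];
    if end < input.len() {
        // Only strip when something was actually cut off.
        prefix.trim_end_matches(ZWJ)
    } else {
        prefix
    }
}
```

Applied to the 25-byte family sequence with a capacity of 24, this returns the 18-byte man, woman, boy sequence: a well-formed family of three, which is exactly the change of meaning described above.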

Option 3: Cut Off the Entire Trailing Emoji

If we assume the input to WriteBuf::write_str is meant to be a sequence of graphemes (which is probably a fair assumption), then the right choice is to remove the entire 🙇‍♀ as a unit.

Cli { buffer_size: 9, reserve: 0, finish_with: None, truncate_with: Some("..."), debug: true, input: "🙇\u{200d}♀" }
...
+ version: 0.1.2
+ written_len: 3
+ truncated: true
+ output_bytes: [46, 46, 46]
+ input: 🙇‍♀
+ input_bytes: [240, 159, 153, 135, 226, 128, 141, 226, 153, 128]

This is not an unreasonable ask if the only character of this nature is the zero width joiner. We can handle one of these funky characters, but we probably do not want to get into the business of handling arbitrary modifiers. My personal suspicion is that there are other such Unicode sequences that would fall into this category. For example: should the characters between the left-to-right mark U+200E and right-to-left mark U+200F be considered atomic? What about the opposite way?
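
For comparison, a rough sketch of Option 3 that treats only the zero width joiner as special: whenever the cut lands next to a ZWJ, it backs up until the whole joined sequence is gone. The function name is hypothetical, and this is not full grapheme segmentation. It ignores every other combining or modifying code point, which is the concern raised above.

```rust
/// Truncates `input` to at most `capacity` bytes, removing an entire
/// ZWJ-joined sequence if the cut would otherwise split it.
/// (Illustrative sketch of Option 3; handles U+200D only.)
fn truncate_dropping_zwj_sequence(input: &str, capacity: usize) -> &str {
    const ZWJ: char = '\u{200D}';
    let mut end = input.len().min(capacity);
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    if end == input.len() {
        return input; // nothing was truncated
    }
    loop {
        let (kept, rest) = input.split_at(end);
        if kept.ends_with(ZWJ) {
            // Drop the dangling joiner and the code point it was attached to.
            let without_zwj = &kept[..kept.len() - ZWJ.len_utf8()];
            end = without_zwj.char_indices().next_back().map_or(0, |(i, _)| i);
        } else if end > 0 && rest.starts_with(ZWJ) {
            // The last kept code point is joined to what follows; drop it too.
            end = kept.char_indices().next_back().map_or(0, |(i, _)| i);
        } else {
            break;
        }
    }
    &input[..end]
}
```

For the example input with a capacity of 9 this keeps nothing, so only the truncation marker would be written, matching the output shown for this option.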
