
Comments (9)

pascaldekloe commented on June 10, 2024

I think text conversion depends on the implementation, i.e., the rules are not related to the data format. The compiler manual (see README) states the following.

	Text validation is not part of the marshalling and unmarshalling
	process. C and Go just pass any malformed UTF-8 characters. Java
	and JavaScript replace unmappable content with the '?' character
	(ASCII 63).
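
As a minimal sketch of the Java half of that statement: the JDK's own UTF-8 encoder substitutes '?' for a char it cannot encode, such as an unpaired surrogate. The class name `ReplacementDemo` is made up for illustration and is not part of Colfer's generated code.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {
    // Encodes with the JDK's built-in UTF-8 encoder, which replaces
    // unencodable input (e.g. an unpaired surrogate) with '?'.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', a lone high surrogate, then 'b'.
        byte[] out = encode("a\uD800b");
        for (byte b : out) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println(); // 61 3f 62
    }
}
```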

So java.lang.String is no longer backed by a char(acter) array. With the new implementation it is even harder to access the data in an efficient way. 😖 Happy to hear about better alternatives for String#charAt(int). String#getBytes allocates memory. The unmarshaller uses String(byte[],int,int,java.nio.charset.Charset) now, and that works fine.
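
For readers unfamiliar with that constructor, here is a small sketch of decoding text straight out of a larger buffer with String(byte[],int,int,Charset). The class name `DecodeDemo` is hypothetical, and reading the second byte of the golden case '0801417f' as the text length is an assumption made for the example, not a statement about the wire format.

```java
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decodes `length` bytes of UTF-8, starting at `offset`, without
    // copying the slice into a temporary array first.
    static String decode(byte[] buf, int offset, int length) {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes of the golden case '0801417f'. Assumption: byte 1 holds
        // the text length and the text starts at byte 2.
        byte[] buf = {0x08, 0x01, 0x41, 0x7F};
        System.out.println(decode(buf, 2, buf[1])); // A
    }
}
```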

No external libraries for generated code is key!

Feel free to open an issue for a specific improvement idea.

from colfer.

pascaldekloe commented on June 10, 2024

Had a quick look at the new streams with String#chars. It is way slower 😱 than String#charAt(int).


pascaldekloe commented on June 10, 2024
  1. The size fit is the maximum ratio from UTF-16 char(acters) to UTF-8 bytes. That is, encoding of a char costs 1, 2 or 3 bytes; never more.

  2. The golden cases have a string with all cases covered.
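
The 3-bytes-per-char bound from point 1 can be checked with a short sketch. It relies only on the JDK's UTF-8 encoder; the names `SizeFit`, `maxUtf8Size` and `utf8Size` are invented for illustration. A code point outside the BMP takes two chars and four bytes, so it costs only two bytes per char.

```java
import java.nio.charset.StandardCharsets;

final class SizeFit {
    // Worst case from point 1: at most 3 UTF-8 bytes per UTF-16 char.
    static int maxUtf8Size(String s) {
        return 3 * s.length();
    }

    // Actual encoded size according to the JDK's UTF-8 encoder.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041: 1 char, 1 byte
            "\u0080",       // U+0080: 1 char, 2 bytes
            "\u0800",       // U+0800: 1 char, 3 bytes
            "\uD800\uDF48", // U+10348: 2 chars (surrogate pair), 4 bytes
            "\uD800",       // unpaired surrogate: 1 char, encoded as '?'
        };
        for (String s : samples) {
            System.out.printf("chars=%d bytes=%d bound=%d%n",
                s.length(), utf8Size(s), maxUtf8Size(s));
        }
    }
}
```

Every sample stays within the bound, including the malformed one, since String#length counts chars rather than code points.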


pascaldekloe commented on June 10, 2024

What do you mean by "for now"? 😬 This must hold forever, even with malformed UTF-16 sequences.


pascaldekloe commented on June 10, 2024
--- a/ecma/test.js
+++ b/ecma/test.js
@@ -50,6 +50,7 @@ function newGoldenCases() {
                '87ffffffffffffffff2e5da4e77f': {t: new Date(-223), t_ns: 888999},
                '0801417f': {s: 'A'},
                '080261007f': {s: 'a\x00'},
+               '0804f0908d887f': {s: '𐍈'},
                '0809c280e0a080f09080807f': {s: '\u0080\u0800\u{10000}'},
                '08800120202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020207f': {s: '                                                                                                                                '},
                '0901ff7f': {a: new Uint8Array([0xFF])},

… passes just fine.
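
As a cross-check from the Java side (a sketch, not the project's test suite), the text bytes in the added golden case and in its neighbour can be reproduced with the JDK's UTF-8 encoder. `GoldenText` and `utf8Hex` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class GoldenText {
    // Returns the UTF-8 encoding of s as lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10348 written as its surrogate pair.
        System.out.println(utf8Hex("\uD800\uDF48"));             // f0908d88
        // U+0080, U+0800 and U+10000: the 2, 3 and 4 byte ranges.
        System.out.println(utf8Hex("\u0080\u0800\uD800\uDC00")); // c280e0a080f0908080
    }
}
```

Both hex strings match the payload portion of the corresponding keys in the diff above.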


guilt commented on June 10, 2024

Thank you for clarifying. So here are some specific suggestions:

  1. Java's size fit should probably use *4 or *5 as a general rule, assuming 𐍈🐱‍🐉🤷‍♂️🥗🚂 emojis are the norm rather than the exception.
  2. We should probably add serialization/deserialization tests with characters from all 4 UTF-8 length ranges (1 to 4 bytes) in these tests.


guilt commented on June 10, 2024

Ah, yes, based on characters, not code points. That should be okay for now.


guilt commented on June 10, 2024

By "for now" I meant: until Unicode ups the range dramatically.


pascaldekloe commented on June 10, 2024

Unicode won't up the range dramatically. It would also be against their own stability policy.

