Reader flag to support unicode escapes in string literals about edn-java HOT 14 CLOSED

avodonosov commented on September 27, 2024

Reader flag to support unicode escapes in string literals

from edn-java.

Comments (14)

bpsm commented on September 27, 2024

A few things:

I totally missed the addition of \uXXXX for character literals way back in 2014. edn-java does not support that, but should do so. #60
As I'm sure you're aware, unicode escapes in strings are not, strictly speaking, necessary since "edn elements, streams and files should be encoded using UTF-8.".

What's the use case for unicode escapes in string literals? Are you getting edn that you have no control over that is broken in this way? Are you dumping raw bytes of some other format (JSON) directly into an edn stream, crossing your fingers and hoping everything will be fine? Do you need to persist or transmit edn in a character encoding other than UTF-8?

I'd be willing to add \uXXXX interpretation when reading string literals with edn-java, but it would help to better understand how it would be useful to you.
I wouldn't want printing to ever produce \uXXXX escapes, however, as this would produce output that's broken for other edn parsers.

from edn-java.

avodonosov commented on September 27, 2024

Hello, thank you for the response.

The motivation are invisible, zero-width characters like \u200c.

I keep in .edn files certain configuration and unit test data for that configuration. One of the examples I must support includes text containing \u200c. If I simply save it as UTF-8, other people reading this file will have difficulty understanding what is going on.

As for printing, I also was thinking it should not produce unicode escapes, even for characters, because I don't know how to decide what characters should use the escaped representation.

from edn-java.

bpsm commented on September 27, 2024

Yes, I can see the utility of that. I'll see what I can do about that over the weekend.

from edn-java.

bpsm commented on September 27, 2024

Please have a look at what I just pushed to the develop branch.

from edn-java.

avodonosov commented on September 27, 2024

@bpsm , thank you, that would suit my needs.

So you don't' see a need for a special flag to enable Unicode escapes in string literals, it's fine to be always enabled?

from edn-java.

bpsm commented on September 27, 2024

Yes, I think I'll just have it always enabled. It seems to me like a sensible addition and inoffensive enough to leave always on.

It occurred to me (after I'd implemented it) that there's actually another approach you could have taken to address your use case. (Maybe you'd thought of it, but I hadn't.)
EDN is, Extensible data notation, after all:

#my/unicode-escaped "\\u1234 mumble ~ mumble \\u4444 mumble"

And then install a corresponding handler for the tag #my/unicode-escaped which scans the string for things that look like \uXXXX and does the necessary conversion. Note, however that if you're married to \uXXXX as the concrete syntax this solution means using \\ in the string literal since them's the rules. Of course, you could invent your own convention that doesn't overload \ for representing arbitrary Unicode characters if you wanted to at this point.

#my/unicode-swizzled "~1234 mumble ~007E mumble ~4444 mumble"

from edn-java.

avodonosov commented on September 27, 2024

If you keep this feature always on, it's less work for me - just update the version number in dependency.

But to be honest I have to tell you one pro-flag consideration: people who want only valid EDN to enter their system. So if they can parse it with edn-java and store somewhere, then any other component down the processing chain is guaranteed to read the file.

That's why I reported the issue mentioning the flag.

The workaround with tags is interesting, I haven't thought of it. If you reject this issue and not implement I maybe use it. But would prefer to avoid - my configs are complex enough already for users to edit, EDN is not as widely know as JSON or XML, the config includes regular expressions represented as EDN string literals, so users already have to deal with double escaping of \ (regex syntax + edn string literals), adding tags and custom Unicode escapes into the mix would not be friendly.

Ideally, IMHO, the spec would include unicode escapes for string literals. Missing them, while they are already specified for characters looks like the result of the spec being a bit drafty, not polished.

from edn-java.

bpsm commented on September 27, 2024

I hadn't considered the use case of using edn-java to check incoming edn for well-formedness before persisting the original bytes. (Rather than, printing the just parsed edn back out to fresh bytes). I suppose that would argue in favor of making this extension off by default. (Maximum compatibility.)

I'll have a look at what adding this as an option to the parser configuration would look like.

from edn-java.

bpsm commented on September 27, 2024

Please have a look at this change which makes \uXXXX an option to be configured by installing a TagHandler into the Parser.Config and tell me if you think that would be acceptable to you.

from edn-java.

avodonosov commented on September 27, 2024

@bpsm , thank you, this solution would work for me.

FWIW, I'm not sure why you decided to express this option as a pseudo tag handler. Other pseudo tag handlers allow the user to customize the representation of EDN numbers in the object structure created by the parser. In the case of Unicode escapes not much room for custom behaviour, only one of the two predefined variants to chose: support or not support. Also, I noticed passing through the tag handler takes some redundant memory allocations: firs the Unicode escape text is put nnto the scanner buffer, then extracted as a substring of it and passed to the tag handler, the tag handler takes the substring of unicode escape to pass to parseInteger, then the resulting character is converted into a string and returned from the tag handler - 3 temporary string objects are created.
When imagining how to implement the option I personally thought about simply adding an IF within the previous case 'u' : branch.

Of course, unicode escapes would usually be rare so these allocations will not be significant usually. If you like the code as is, I'm fine.

Thank you again for your support.

from edn-java.

avodonosov commented on September 27, 2024

@bpsm, another aspect to consider is escapes support by the https://github.com/clojure/tools.reader from official clojure github project.

(Of course, double escapes are used here)

So it supports unicode and even octal escapes by default.

In this case, maybe edn-java does not need to make this feature optional, probably I was overthinking.

from edn-java.

bpsm commented on September 27, 2024

Yea, I think you're right. I'll have edn-java follow clojure.tools.read.edn's lead on this.

So, I'll go back to making support for "\uXXXX" always on.

(This reminds me of another thing I have yet to create an issue for, Clojure's reader supports unicode characters in symbol and keyword names, though the spec makes it sound like it doesn't and edn-java does not currently support this. Do you have any thoughts on this?)

The garden path that lead me to (ab)using a TagHandler for in-string unicode escaping was something like this:

I didn't want to add a method to the Parser.Config interface just in case someone had chosen to write their own implementation of that interface, though that's not intended usage. (I had forgotten, since edn-java was originally Java 6 compatible that default methods are a thing since Java 8.)
I already use edn-specific TagHandlers at various points for extension purposes. (When all you have is a hammer, everything looks like a nail.)
I considered having the tag handler accept an Integer instead of a String, but thought it might be convenient to have the contract go String -> String since that would allow trivial Identity tag handler to just leave the string unchanged. I now think that's more "cute" than useful.

Shrug.

from edn-java.

bpsm commented on September 27, 2024

This is implemented in 0.7.0

(But 0.7.0 is not currently on Maven central because of GPG issues I have been unable to resolve.)

from edn-java.

avodonosov commented on September 27, 2024

Thank you!

from edn-java.

Reader flag to support unicode escapes in string literals about edn-java HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent