Comments (14)
A few things:
- I totally missed the addition of \uXXXX for character literals way back in 2014. edn-java does not support that, but should do so. #60
- As I'm sure you're aware, unicode escapes in strings are not, strictly speaking, necessary since "edn elements, streams and files should be encoded using UTF-8.".
What's the use case for unicode escapes in string literals? Are you getting edn that you have no control over that is broken in this way? Are you dumping raw bytes of some other format (JSON) directly into an edn stream, crossing your fingers and hoping everything will be fine? Do you need to persist or transmit edn in a character encoding other than UTF-8?
- I'd be willing to add \uXXXX interpretation when reading string literals with edn-java, but it would help to better understand how it would be useful to you.
- I wouldn't want printing to ever produce \uXXXX escapes, however, as this would produce output that's broken for other edn parsers.
from edn-java.
Hello, thank you for the response.
The motivation are invisible, zero-width characters like \u200c.
I keep in .edn files certain configuration and unit test data for that configuration. One of the examples I must support includes text containing \u200c. If I simply save it as UTF-8, other people reading this file will have difficulty understanding what is going on.
As for printing, I also was thinking it should not produce unicode escapes, even for characters, because I don't know how to decide what characters should use the escaped representation.
from edn-java.
Yes, I can see the utility of that. I'll see what I can do about that over the weekend.
from edn-java.
Please have a look at what I just pushed to the develop branch.
from edn-java.
@bpsm , thank you, that would suit my needs.
So you don't' see a need for a special flag to enable Unicode escapes in string literals, it's fine to be always enabled?
from edn-java.
Yes, I think I'll just have it always enabled. It seems to me like a sensible addition and inoffensive enough to leave always on.
It occurred to me (after I'd implemented it) that there's actually another approach you could have taken to address your use case. (Maybe you'd thought of it, but I hadn't.)
EDN is, Extensible data notation, after all:
#my/unicode-escaped "\\u1234 mumble ~ mumble \\u4444 mumble"
And then install a corresponding handler for the tag #my/unicode-escaped which scans the string for things that look like \uXXXX
and does the necessary conversion. Note, however that if you're married to \uXXXX
as the concrete syntax this solution means using \\
in the string literal since them's the rules. Of course, you could invent your own convention that doesn't overload \
for representing arbitrary Unicode characters if you wanted to at this point.
#my/unicode-swizzled "~1234 mumble ~007E mumble ~4444 mumble"
from edn-java.
If you keep this feature always on, it's less work for me - just update the version number in dependency.
But to be honest I have to tell you one pro-flag consideration: people who want only valid EDN to enter their system. So if they can parse it with edn-java and store somewhere, then any other component down the processing chain is guaranteed to read the file.
That's why I reported the issue mentioning the flag.
The workaround with tags is interesting, I haven't thought of it. If you reject this issue and not implement I maybe use it. But would prefer to avoid - my configs are complex enough already for users to edit, EDN is not as widely know as JSON or XML, the config includes regular expressions represented as EDN string literals, so users already have to deal with double escaping of \ (regex syntax + edn string literals), adding tags and custom Unicode escapes into the mix would not be friendly.
Ideally, IMHO, the spec would include unicode escapes for string literals. Missing them, while they are already specified for characters looks like the result of the spec being a bit drafty, not polished.
from edn-java.
I hadn't considered the use case of using edn-java to check incoming edn for well-formedness before persisting the original bytes. (Rather than, printing the just parsed edn back out to fresh bytes). I suppose that would argue in favor of making this extension off by default. (Maximum compatibility.)
I'll have a look at what adding this as an option to the parser configuration would look like.
from edn-java.
Please have a look at this change which makes \uXXXX an option to be configured by installing a TagHandler into the Parser.Config and tell me if you think that would be acceptable to you.
from edn-java.
@bpsm , thank you, this solution would work for me.
FWIW, I'm not sure why you decided to express this option as a pseudo tag handler. Other pseudo tag handlers allow the user to customize the representation of EDN numbers in the object structure created by the parser. In the case of Unicode escapes not much room for custom behaviour, only one of the two predefined variants to chose: support or not support. Also, I noticed passing through the tag handler takes some redundant memory allocations: firs the Unicode escape text is put nnto the scanner buffer, then extracted as a substring of it and passed to the tag handler, the tag handler takes the substring of unicode escape to pass to parseInteger, then the resulting character is converted into a string and returned from the tag handler - 3 temporary string objects are created.
When imagining how to implement the option I personally thought about simply adding an IF within the previous case 'u' :
branch.
Of course, unicode escapes would usually be rare so these allocations will not be significant usually. If you like the code as is, I'm fine.
Thank you again for your support.
from edn-java.
@bpsm, another aspect to consider is escapes support by the https://github.com/clojure/tools.reader from official clojure github project.
(Of course, double escapes are used here)
So it supports unicode and even octal escapes by default.
In this case, maybe edn-java does not need to make this feature optional, probably I was overthinking.
from edn-java.
Yea, I think you're right. I'll have edn-java follow clojure.tools.read.edn's lead on this.
So, I'll go back to making support for "\uXXXX" always on.
(This reminds me of another thing I have yet to create an issue for, Clojure's reader supports unicode characters in symbol and keyword names, though the spec makes it sound like it doesn't and edn-java does not currently support this. Do you have any thoughts on this?)
The garden path that lead me to (ab)using a TagHandler for in-string unicode escaping was something like this:
- I didn't want to add a method to the Parser.Config interface just in case someone had chosen to write their own implementation of that interface, though that's not intended usage. (I had forgotten, since edn-java was originally Java 6 compatible that default methods are a thing since Java 8.)
- I already use edn-specific TagHandlers at various points for extension purposes. (When all you have is a hammer, everything looks like a nail.)
- I considered having the tag handler accept an Integer instead of a String, but thought it might be convenient to have the contract go String -> String since that would allow trivial Identity tag handler to just leave the string unchanged. I now think that's more "cute" than useful.
Shrug.
from edn-java.
This is implemented in 0.7.0
(But 0.7.0 is not currently on Maven central because of GPG issues I have been unable to resolve.)
from edn-java.
Thank you!
from edn-java.
Related Issues (20)
- Iterate over list recieved via get. HOT 1
- > and < symbols are not supported HOT 3
- Values returned by Parser should be Serializable by default HOT 1
- Comma character prints incorrectly HOT 3
- PrintingExamples are sensitive to hash ordering HOT 1
- Teach edn-java to read namespaced maps as per CLJ-1910 HOT 1
- Teach edn-java to optionally print namespaced maps as per CLJ-1910
- Duplicate map keys are parsed without error HOT 5
- Add Java 8 to test with Travis too. HOT 1
- QuickCheck to test? HOT 1
- mispelling, in the docs
- Version 1.0? HOT 1
- Support reading \uXXXX character literals HOT 1
- Review edn-format; compare to implementation HOT 1
- Support # embedded in symbol names HOT 1
- Clojure's edn-read allows unicode in names, should edn-java do the same?
- Fix broken GPG magic HOT 4
- Octal escapes in string and character literals? HOT 1
- 0.7.0 or 0.7.1 HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from edn-java.