bpsm / edn-java
a reader for extensible data notation
License: Eclipse Public License 1.0
I'm thinking of something like:
Parser p = Parsers.newParser(…);
Parseable r = Parsers.newParseable( … );
Parser.Config cfg = (Parser.Config) p.nextValue(r);
Where the Parseable r reads edn that's something like this:
#bpsm.edn-java/parser-config {
:listFactory #bpsm.edn-java/class "fully.qualified.class.Name"
;; new fully.qualified.class.Name() returns a CollectionBuilder.Factory
…
:handlers {
:my/base64 #bpsm.edn-java/method {
:class #bpsm.edn-java/class "fully.qualified.class.Name"
:name "staticFactoryMethod"
}
;; fully.qualified.class.Name.staticFactoryMethod()
;; returns a TagHandler.
…
}
Open design questions:
parser-config, class, and method tags. An alternative that leaves out the inner tags for brevity, preferring instead to pack the knowledge into the handler for parser-config, is conceivable.
#bpsm.edn-java is just made up. Convention elsewhere in the code base is to use #info.bsmithmannschott, but that's rather long -- though it's certainly unique.
It would be very helpful to nest and compose parsers to deal with situations where you have a schema with different components that require different parsing logic.
For example consider the following hypothetical edn message:
{:vector-value [1.0 2.0 3.0 4.0]
:attribute-list [:foo :bar]}
You might want a different parser for the vector-value to produce a specialised double[] array or similar, but stick with the regular parser for the attribute-list of keywords.
Implementation thoughts:
See also
edn-format/edn#32 (comment)
// Symbols begin with a non-numeric character and can contain
// alphanumeric characters and `. * + ! - _ ? $ % & = < >`. If `-`,
// `+` or `.` are the first character, the second character (if any)
// must be non-numeric. Additionally, `: #` are allowed as constituent
// characters in symbols other than as the first character.
It appears that edn-java parses symbols with embedded "#" incorrectly. Given the text [a#b {}] we expect a vector of these two items:
a#b
Instead we get these two items:
a
#b
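The constituent-character rule quoted above can be sketched with the JDK alone. This is a minimal illustration, not edn-java's ScannerImpl: the class and method names (SymbolChars, isSymbolStart, isSymbolConstituent, readSymbol) are hypothetical, Character.isLetter stands in for "non-numeric alphabetic character", and the special role of "/" as a namespace separator is left out.

```java
final class SymbolChars {
    // Punctuation the quoted comment allows anywhere in a symbol.
    private static final String PUNCT = ".*+!-_?$%&=<>";

    static boolean isSymbolStart(char c) {
        return Character.isLetter(c) || PUNCT.indexOf(c) >= 0;
    }

    // ':' and '#' count as constituents, but only after the first character.
    static boolean isSymbolConstituent(char c) {
        return Character.isLetterOrDigit(c) || PUNCT.indexOf(c) >= 0
                || c == ':' || c == '#';
    }

    // Returns the symbol that starts at index 'start' of 'text'.
    static String readSymbol(String text, int start) {
        if (start >= text.length() || !isSymbolStart(text.charAt(start))) {
            throw new IllegalArgumentException("no symbol at index " + start);
        }
        int end = start + 1;
        while (end < text.length() && isSymbolConstituent(text.charAt(end))) {
            end++;
        }
        return text.substring(start, end);
    }
}
```

With this rule, reading at index 1 of the text [a#b {}] consumes a#b as one token instead of stopping at "#".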
Currently Printer requires a Writer. This is unnecessarily restrictive. We should be able to work with anything that's Appendable.
This follows up on issue-11.
This may be a design decision, but it is not clear so here goes.
The documentation states "Lists "(...)" and vectors "[...]" are both mapped to implementations of java.util.List. A vector maps to a List implementation that also implements the marker interface java.util.RandomAccess."
However, due to this commit which changes the backing of the List type to ArrayList instead of LinkedList, there is no way to distinguish between Lists or Vectors when "roundtripped" through the parser.
This means that "(1, 2, 3)" and "[1, 2, 3]" will parse to identical values. Printing back the parsed List "(1, 2, 3)" to EDN via the printer will yield the Vector "[1, 2, 3]".
This is not caught in the unit tests because the method assertEquals merely tests that the parsed objects are equal (they are, as Vectors), but does not compare a second round of printing with the original string.
#59 and #60 implement unicode escapes in character and string literals.
How about octal escapes?
I discovered that both the Clojure language reader and the edn reader from the official Clojure GitHub project - https://github.com/clojure/tools.reader - support this.
(Octal escapes in string literals come from Java, except that the Java syntax is a backslash followed by up to 3 digits, while Clojure and tools.reader require exactly 3 digits.)
In string literals the syntax is a backslash followed by 3 digits: \NNN. The first digit can be between 0 and 3, the last two digits are between 0 and 7.
For character literals the syntax is \oNNN. Again, the first digit is between 0 and 3, the last two are between 0 and 7.
$ clj -r
Clojure 1.10.1
user=> "aaa\062aaa"
"aaa2aaa"
user=> \o062
\2
user=> (require '[clojure.tools.reader.edn :as edn])
nil
user=> (edn/read-string "\"aaa\\062aaa\"")
"aaa2aaa"
user=> (edn/read-string "\\o062")
\2
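For string literals, the \NNN rule described above could be decoded roughly as follows. This is a sketch under stated assumptions: OctalEscapes and decode are hypothetical names, not edn-java API, and the method handles only octal escapes. A real scanner would also have to process \\, \n, \" and so on, so that an escaped backslash followed by digits is not misread.

```java
final class OctalEscapes {
    // Replaces each \NNN (first digit 0-3, other two 0-7) with the character
    // it denotes; all other input is copied through unchanged.
    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 3 < s.length()
                    && inRange(s.charAt(i + 1), '3')
                    && inRange(s.charAt(i + 2), '7')
                    && inRange(s.charAt(i + 3), '7')) {
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean inRange(char c, char max) {
        return c >= '0' && c <= max;
    }
}
```

Applied to the text aaa\062aaa from the transcript above, decode yields aaa2aaa, since octal 062 is the code point of the character 2.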
The proposal in edn-format/edn#65 also includes octal escapes.
That is not to say I personally need octal literals in my code; this is just FYI. It may be good to have some consistency between EDN implementations.
Currently we are using X.Y version naming. May I suggest we move to X.Y.Z?
This would be more consistent with other Maven/Clojure projects and make it easier to distinguish between major/minor/bugfix releases which we are likely to need ultimately.
Hi,
Thanks for the hard work you put in to make this awesome library.
I see the library is at version 0.6, and no commit since summer last year.
Is the library stable enough to be used as is? Are there known problems to be aware of?
I am still in the process of learning - but I am happy to help.
With kind regards,
Nicolas
Add Java 8 to test with Travis too.
Thank you.
Whatever magic set of configurations I made years ago to allow releases to Maven Central via Sonatype has ceased functioning for reasons I have not yet been able to diagnose. As a result 0.7.0 has been tagged, but not published on Maven Central.
[INFO] --- maven-gpg-plugin:1.6:sign-and-deploy-file (default-cli) @ edn-java ---
gpg: using "........" as default secret key for signing
gpg: signing failed: No pinentry
gpg: signing failed: No pinentry
I've googled around for what to do about "No pinentry" but have not yet found anything that leads me to a solution.
The printer interface currently declares checked exceptions (IOException).
I believe these should be removed for two main reasons:
import us.bpsm.edn.printer.Printers;
public class CommaBug {
public static void main(String[] args){
System.out.println(Printers.printString(','));
}
}
Running this program produces the following output
Exception in thread "main" us.bpsm.edn.EdnException: Whitespace character 0x2c is unsupported.
at us.bpsm.edn.printer.Printers$11.eval(Printers.java:384)
at us.bpsm.edn.printer.Printers$11.eval(Printers.java:357)
at us.bpsm.edn.printer.Printers$1.printValue(Printers.java:142)
at us.bpsm.edn.printer.Printers.printString(Printers.java:74)
at us.bpsm.edn.printer.Printers.printString(Printers.java:56)
The expected output is
\,
java.meth.BigDecimal
should be
java.math.BigDecimal
I wanted to try out the new 0.4.0 release, but wasn't able to find it on any of the usual public repos (Clojars, Maven Central). I could only find the snapshot versions.
Are the releases available anywhere that I should be aware of?
If not, I think they should be - it makes it much easier for people to pick up and run with the library
For some use cases it is desirable that the Values produced by parsing EDN text be able to participate in Java Serialization.
Currently this is impossible because Keyword, Symbol, Tag, TaggedValue and DelegatingList do not implement Serializable.
Currently Printer produces output that is all in one line:
[{:a "asdfasdfasdfasdfasdfasdf" :b 1234 :c "uoiuojoijoijmoinoihohkjhlkjhlkjhu", :d #{ … } … } … ]
This is great for communication, since it's compact and no human has to be able to read it. On the other hand, it stinks for debugging scenarios or where edn data is stored in version control where it may be subject to merges.
Printer should support the option to format output in multiple lines with some amount of indentation to indicate logical nesting. It need not be highly configurable. It need not match the output of Clojure's pprint. It must be faster than Clojure's pprint.
e.g. Symbol.newSymbol(name) should construct a Symbol without a namespace. Keyword.newKeyword(ns, name) should be equivalent to Keyword.newKeyword(Symbol.newSymbol). Similarly for Tag.
"if -, + or . are the first character, the second character must be non-numeric."
See also:
https://github.com/edn-format/edn/blob/180328d96e0e48176618e2a92044bb23c5528593/README.md#symbols
Commit 7e5bd61 merges an expanded performance-test branch to master, but this still needs work.
ScannerImpl does not recognize symbols '>' and '<'.
;; clojure defacto-standard edn reader
user=> (clojure.edn/read-string "1/2")
1/2
// scala
val p = newParser(defaultConfiguration())
val r = p.nextValue(newParseable("{:x 1/2}")).asInstanceOf[java.util.Map[Keyword,Any]].toMap
us.bpsm.edn.EdnSyntaxException: Not a number: '1/'.
at us.bpsm.edn.parser.ScannerImpl.readNumber(ScannerImpl.java:434)
at us.bpsm.edn.parser.ScannerImpl.scanNextToken(ScannerImpl.java:153)
at us.bpsm.edn.parser.ScannerImpl.nextToken(ScannerImpl.java:61)
...
As suggested by abernard in a comment on issue 32:
On the issue of this handling Guava collections, I wonder if the code to determine List or Vector should be separated out into an interface. This would be something like:
interface SequenceTypeSelector {
boolean isVector(Object o);
}
The selector could be attached to the ProtocolBuilder for the Printer (with a default implementation provided of course). Extending SequenceTypeSelector would allow custom dispatch for types, allowing the simple if-else select on java.util.RandomAccess, or a Map lookup for more complex type hierarchies.
This was in response to a comment of mine that I was losing the list/vector distinction in edn-java-guava since guava's immutable list implementations implement RandomAccess.
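A minimal sketch of how such a selector might be used when printing. Only the SequenceTypeSelector interface comes from the suggestion above; the SequencePrinting class, its RANDOM_ACCESS default and its print method are hypothetical, and nothing here reflects how edn-java's Printer is actually wired.

```java
import java.util.List;
import java.util.RandomAccess;

interface SequenceTypeSelector {
    boolean isVector(Object o);
}

final class SequencePrinting {
    // Default rule: a List that implements RandomAccess is printed as a vector.
    static final SequenceTypeSelector RANDOM_ACCESS = o -> o instanceof RandomAccess;

    // Prints the items space-separated, in [...] or (...) as the selector decides.
    static String print(List<?> items, SequenceTypeSelector selector) {
        boolean vector = selector.isVector(items);
        StringBuilder sb = new StringBuilder().append(vector ? '[' : '(');
        String sep = "";
        for (Object item : items) {
            sb.append(sep).append(item);
            sep = " ";
        }
        return sb.append(vector ? ']' : ')').toString();
    }
}
```

A Guava-aware caller could pass its own selector (for example one backed by a Map from classes to a list-or-vector flag) without changing the print method.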
Issue 60 involves a requirement documented in edn-format that I'd not implemented because I missed it somehow.
I'd like to release 1.0, but before doing that it makes sense to review edn-format one last time to make sure that there's not something else I've missed.
I am trying to parse a small bit of EDN in Processing and the only way I can import this is by using it as a 'library' via a compiled .jar.
"If the target platform supports some notion of interning, it is a further semantic of keywords that all instances of the same keyword yield the identical object."
See also
https://github.com/edn-format/edn/blob/180328d96e0e48176618e2a92044bb23c5528593/README.md#keywords
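One way to get the "identical object" semantics quoted above is an intern table keyed by name. The sketch below is illustrative only: InternedKeyword is a hypothetical stand-in, not edn-java's Keyword, and the cache never evicts entries, which a real implementation might address with weak references.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class InternedKeyword {
    // One shared instance per distinct name.
    private static final ConcurrentMap<String, InternedKeyword> CACHE =
            new ConcurrentHashMap<>();

    private final String name;

    private InternedKeyword(String name) {
        this.name = name;
    }

    // Returns the single instance for 'name', creating it on first use.
    static InternedKeyword of(String name) {
        return CACHE.computeIfAbsent(name, InternedKeyword::new);
    }

    @Override
    public String toString() {
        return ":" + name;
    }
}
```

Because every call with the same name returns the same instance, callers may compare such keywords with == as well as equals.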
Although the edn format doesn't specify unicode escapes in string literals (unlike for characters), their absence is sometimes very inconvenient in practice. Optional support for unicode escapes in string literals, controlled by a reader config flag, would be very desirable.
See also: edn-format/edn#65
Looking into the code I think unicode escapes are not even supported in characters. This is a bug - the EDN specification requires unicode escapes in characters: https://github.com/edn-format/edn#characters.
{:a 1 :a 2} parses without error in edn-java, whereas
user=> (clojure.edn/read-string "{:a 1 :a 2}")
IllegalArgumentException Duplicate key: :a clojure.lang.PersistentArrayMap.createWithCheck (PersistentArrayMap.java:71)
The edn spec states that keys should appear "at most once" so I think an error should be reported in this scenario.
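A duplicate-key check of the kind requested could look roughly like this. StrictMapBuilder is a hypothetical class for illustration, not edn-java's CollectionBuilder, and it throws IllegalArgumentException only to mirror the Clojure behaviour shown above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class StrictMapBuilder {
    private final Map<Object, Object> map = new LinkedHashMap<>();

    // Adds an entry, rejecting any key that has already been seen.
    StrictMapBuilder put(Object key, Object value) {
        if (map.containsKey(key)) {
            throw new IllegalArgumentException("Duplicate key: " + key);
        }
        map.put(key, value);
        return this;
    }

    Map<Object, Object> build() {
        return Collections.unmodifiableMap(map);
    }
}
```

containsKey is used rather than testing the return value of put, so that a key previously mapped to null is still detected as a duplicate.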
This follows up on issue-11
This would open up CharBuffer and the like as possible sources. This will take some doing, as a raw Readable does not support character-at-a-time reading, which is what our Scanner needs internally.
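The character-at-a-time gap can be bridged by buffering through a CharBuffer, roughly as sketched here. CharAtATime is a hypothetical adapter, not part of edn-java; the buffer size is arbitrary, and wrapping IOException in UncheckedIOException is just one possible choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

final class CharAtATime {
    private final Readable source;
    private final CharBuffer buffer = CharBuffer.allocate(1024);
    private boolean eof = false;

    CharAtATime(Readable source) {
        this.source = source;
        buffer.flip(); // nothing buffered yet
    }

    // Returns the next character, or -1 once the source is exhausted.
    int read() {
        while (!buffer.hasRemaining()) {
            if (eof) {
                return -1;
            }
            buffer.clear();
            try {
                if (source.read(buffer) < 0) {
                    eof = true;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            buffer.flip(); // switch from filling to draining
        }
        return buffer.get();
    }
}
```

Since both Reader and CharBuffer implement Readable, the same adapter accepts new StringReader("...") and CharBuffer.wrap("...").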
See clojure/clojure@6d48ae3
See: http://dev.clojure.org/jira/browse/CLJ-1910
I don't think this new printing behavior should be the default as this could produce output that older versions of edn-java could not read by default.
I've found Travis CI to be a pretty good tool for continuous integration and testing.
Ben - want me to add Travis CI support for edn-Java?
It basically requires:
The Lists, Sets and Maps returned by Parser should be immutable by default. The simplest way to achieve this within the JDK is to wrap them in Collections.unmodifiableXXX before returning them.
"Thus the resulting values should be considered immutable, and a reader implementation should yield values that ensure this, to the extent possible."
See also:
https://github.com/edn-format/edn/blob/180328d96e0e48176618e2a92044bb23c5528593/README.md#rationale
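The wrapping step could be as small as the sketch below. ImmutableResults and finish are hypothetical names, and the wrapping is shallow: it relies on nested collections having been wrapped when they were built.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ImmutableResults {
    // Each overload returns a read-only view of the collection just built.
    static <T> List<T> finish(List<T> built) {
        return Collections.unmodifiableList(built);
    }

    static <T> Set<T> finish(Set<T> built) {
        return Collections.unmodifiableSet(built);
    }

    static <K, V> Map<K, V> finish(Map<K, V> built) {
        return Collections.unmodifiableMap(built);
    }
}
```

Collections.unmodifiableList returns a view that implements RandomAccess exactly when the wrapped list does, so wrapping in this way does not by itself disturb a list/vector distinction based on that marker interface.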
The pom.xml contains version 0.7.0, but the README mentions 0.7.1.
Is that a mistake in the README? Or has the pom.xml change not been pushed to GitHub?
I intend to switch the branching model of edn-java to git-flow.
This entails:
The git-flow model strikes me as a clean way to manage branches.
It has the advantage, on GitHub, that the branch users see by default, 'master', will show them the README of the most recent stable release.
One potential drawback is that there's a monotonicity to putting all releases on master.
Consider this hypothetical: we release 2.0.0 but need to continue maintenance of 1.1.x, for some time because it takes users a while to upgrade to 2.0.0. Git-flow doesn't make explicit allowances for this, but it seems it could be addressed by branching "master-1.1.x" from the last 1.1.x tag and treating it like a second "master" branch.
I don't consider it likely that I'll need to maintain two production versions of edn-java in parallel at this stage in its life cycle (or really, ever), so I think this potential drawback is acceptable.
I suspect this is a defect. Awaiting feedback on edn-format/edn#51 to know how to proceed.
See https://github.com/edn-format/edn#symbols which states that symbols beginning with + must continue with a non-digit.
Currently we catch this for namespace-less symbols and for prefixes of symbols but don't catch it if the symbol has a legal prefix, but the name itself violates this rule.
This test should pass (by having scan throw an exception):
@Test(expected=EdnException.class)
public void symbolNameStartsWithPlusDigit() {
scan("foo/+4blah");
}
Currently this test fails.
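The missing check amounts to applying the same leading-character rule to the name after the slash as to the prefix before it. A sketch, with hypothetical names (SymbolNameRule, startsLikeNumber, obeysLeadingSignRule) and covering only this one rule rather than full symbol validation:

```java
final class SymbolNameRule {
    // True if 'part' begins with '+', '-' or '.' immediately followed by a digit.
    static boolean startsLikeNumber(String part) {
        if (part.length() < 2) {
            return false;
        }
        char first = part.charAt(0);
        return (first == '+' || first == '-' || first == '.')
                && Character.isDigit(part.charAt(1));
    }

    // Applies the rule to the prefix and to the name of a symbol.
    static boolean obeysLeadingSignRule(String symbol) {
        int slash = symbol.indexOf('/');
        if (slash <= 0 || slash == symbol.length() - 1) {
            // No usable prefix/name split: check the text as a whole.
            return !startsLikeNumber(symbol);
        }
        return !startsLikeNumber(symbol.substring(0, slash))
                && !startsLikeNumber(symbol.substring(slash + 1));
    }
}
```

Under this check foo/+4blah is rejected because its name part +4blah starts like a number, while foo/+blah is accepted.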
Am storing parsed EDN in a map m
m.get(Keyword.newKeyword("modules"));
Under the keyword modules I have a list of maps:
[
{:active=1, :addr=10657, :sensors=[520, 519, 0, 0]},
{:active=0, :addr=8217, :sensors=[212, 520, 0, 0]},
{:active=0, :addr=0, :sensors=[0, 0, 0, 0]}
]
That's parsed. (if I println(m.get(Keyword.newKeyword("modules")));)
My question is how would I iterate through the [] list and access each of the maps directly. ( I haven't been able to do it as I am getting an error "... cannot convert from capture#2-of ? to ..." )
Thanks.
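The "capture of ?" error usually appears when a wildcard-typed collection is assigned to a more specific type without a cast. One possible approach is to keep the wildcard types and cast each element, as in this sketch. It uses only the JDK, so IterateModules and valuesOf are hypothetical names and the key is passed as a plain Object; in the setting described above the key would presumably be something like Keyword.newKeyword("addr").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class IterateModules {
    // Collects the value stored under 'key' in each map of the list.
    static List<Object> valuesOf(Object modules, Object key) {
        List<Object> result = new ArrayList<>();
        for (Object element : (List<?>) modules) {   // elements arrive as Object
            Map<?, ?> module = (Map<?, ?>) element;  // reading from Map<?, ?> is allowed
            result.add(module.get(key));
        }
        return result;
    }
}
```

The casts are unchecked only in the sense that a wrongly shaped value fails at run time with a ClassCastException; the compiler accepts them because no specific type is claimed for the "?" parameters.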
Printing a string containing single quotes will incorrectly escape these quotes with a backslash. In a groovy shell:
> System.out.withWriter("UTF-8") { us.bpsm.edn.printer.Printers.newPrinter(it).printValue("a 'b' c") }
"a \'b\' c">
Version affected: 0.4.0
info.bsmithmannschott is OK as a groupId, but very long as a package prefix. I'd like to use the same thing for both, so something shorter is desirable.
I've registered the domain bpsm.us, which is short enough.
GroupId will change from info.bsmithmannschott to us.bpsm.
For consistency, the common package prefix should be changed from bpsm.edn to us.bpsm.edn, though this will cause some complication for merging back branches created before this switch.
edn-java should be free of cyclic dependencies between packages. The solution, in this case, is to pull the contents of us.bpsm.edn.parser.inst into us.bpsm.parser.
The protocols implementation the printer builds on is a hack, particularly WRT how it detects and deals with ambiguity. Is there some better way to implement this? Which implementation do we choose if more than one could apply? Currently we require an explicit binding for the Object's concrete class.
Things to look at for ideas: