simongray / datalinguist
Stanford CoreNLP in idiomatic Clojure.
License: GNU General Public License v3.0
(from notes)
I have started working on making the CRFClassifier easy to use from Clojure.
My goal is to integrate the classifier into the tech.ml
library, which has an extension mechanism so that third parties can contribute models to it:
https://github.com/techascent/tech.ml
I will use a tech.ml.dataset dataframe to get data in and out:
https://github.com/techascent/tech.ml.dataset
The code is basically done, it's just a few lines of data conversions.
I could contribute it eventually here, instead of making a standalone library.
It has the following dependencies:
edu.stanford.nlp/stanford-corenlp {:mvn/version "4.2.0"}
techascent/tech.ml {:mvn/version "5.01"}
clojurewerkz/propertied {:mvn/version "1.3.0"}
Do you think it could fit here?
Tregex should be wrapped in a similar way to the existing Semgrex wrapping. See: edu.stanford.nlp.trees.tregex.TregexPattern and the current sem-* implementation.
See also the parallel issue #11.
As discussed, I want to document the train options properly as metadata.
This is rather easy to do. There is just one question,
namely whether we should "clojurize" the options or not:
Should ":maxLeft" become ":max-left"?
There is a library that does this automatically in both directions:
https://github.com/clj-commons/camel-snake-kebab
so it would be easy.
It depends a bit on how close your library wants to stay to Java land vs. Clojure land.
I don't have a preference; I can do either.
Just decide and let me know.
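To make the trade-off concrete, here is a minimal sketch of the conversion rule in both directions. It is written in Python purely for illustration and is not how camel-snake-kebab is implemented; the function names and the option name "useNERTags" are hypothetical.

```python
import re


def camel_to_kebab(name: str) -> str:
    """Convert a camelCase option name such as 'maxLeft' to 'max-left'."""
    # Put a hyphen before each uppercase letter that follows a
    # lowercase letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"-\1", name).lower()


def kebab_to_camel(name: str) -> str:
    """Convert a kebab-case option name such as 'max-left' to 'maxLeft'."""
    head, *rest = name.split("-")
    # Keep the first word as-is and capitalize every following word.
    return head + "".join(part.capitalize() for part in rest)


if __name__ == "__main__":
    print(camel_to_kebab("maxLeft"))
    print(kebab_to_camel("max-left"))
    # Hypothetical option name containing an acronym: the naive rule
    # does not restore the original spelling on the way back.
    print(kebab_to_camel(camel_to_kebab("useNERTags")))
```

Simple names round-trip cleanly, but names containing acronyms do not under this naive rule, so if the options are clojurized it may be safer to store the original Java name alongside the Clojure key rather than derive it.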
I need to wrap the TokensRegex methods in approximately the same way as Semgrex. See: edu.stanford.nlp.ling.tokensregex.TokensRegexPattern and the existing sem-* implementation in dk.simongray.datalinguist.dependency.
This is probably a long shot, but since there's working Python interop in Clojure, it might be possible to also integrate Stanford's Stanza, which has a different - and much bigger - set of language models available. Unfortunately, these models are not compatible with CoreNLP. The currently available models notably include Danish!
Attempting to create a pipeline with the kbp annotator currently results in the following error:
Execution error (VerifyError) at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer/toProtoBuilder (ProtobufAnnotationSerializer.java:673).
Bad type on operand stack
Exception Details:
Location:
com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
Reason:
Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
Current Frame:
bci: @3
flags: { }
locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
Bytecode:
0000000: 2a2b 1cb6 0024 b0
I initially thought the error was caused by a mismatch between the protobuf version used to compile the included protobuf data and the version that CoreNLP officially depends on (3.9.2). However, if I drop both the corenlp jar and the models jar inside a directory along with an example.txt file containing some text, annotation works fine from the command line:
# long example
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,kbp -coref.md.type RULE -file example.txt
# shorter example (no coref)
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,kbp -file example.txt
CoreNLP annotations do have a de facto hierarchy, e.g. a dependency graph is always a child of a sentence, but in principle annotations can appear as children of any other annotation. Consequently, there is really no delimited annotation tree to illustrate, nor is there an obvious built-in way to infer whether a certain annotation supports a specific annotation as its child.
Explore various ways this could be enhanced. Maybe there is some role for metadata?
Seems like it could be very useful to map explanations of and references to universal dependencies and then have a function exposing these in the REPL, e.g. https://universaldependencies.org/u/dep/conj.html
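A rough sketch of what such a lookup could look like, written in Python for illustration only. The table is a tiny hand-written sample, the function name explain is hypothetical, and the URL pattern is simply generalized from the conj link above.

```python
UD_BASE_URL = "https://universaldependencies.org/u/dep/"

# Hand-written sample; a real table would cover every relation.
UD_RELATIONS = {
    "conj": "conjunct",
    "nsubj": "nominal subject",
    "obj": "object",
}


def explain(relation: str) -> dict:
    """Return the name, short description and reference URL for a relation."""
    if relation not in UD_RELATIONS:
        raise KeyError(f"no explanation recorded for relation: {relation}")
    return {
        "relation": relation,
        "description": UD_RELATIONS[relation],
        "url": f"{UD_BASE_URL}{relation}.html",
    }


if __name__ == "__main__":
    print(explain("conj"))
```

In the library itself the same idea would presumably be a plain Clojure map plus a small function to call from the REPL.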
It seems to be an infinite loop in the datafy-tsm implementation. Removing the datafy call from (assoc m k (datafy v)) and leaving just v seems to solve it for the regular datafy. This is also how it should be: datafy itself shouldn't be recursive.
In the case of recur-datafy I will need to look further into what's causing it. I guess some sort of memory is needed to avoid this issue.
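The "memory" idea can be sketched as follows. This is not the library's datafy-tsm or recur-datafy; it is a Python illustration in which plain dicts stand in for annotation objects and both function names are hypothetical. The recursive version remembers every object it has already started converting, so a cycle is closed instead of being followed forever.

```python
def shallow_datafy(obj):
    """Convert one level only; nested values are returned untouched."""
    if isinstance(obj, dict):
        return dict(obj)
    return obj


def recur_datafy(obj, _seen=None):
    """Convert recursively, remembering visited objects to survive cycles."""
    seen = {} if _seen is None else _seen
    if not isinstance(obj, dict):
        return obj
    key = id(obj)
    if key in seen:
        # Already visited: reuse the copy that is under construction.
        return seen[key]
    result = {}
    seen[key] = result  # register before descending into the children
    for k, v in obj.items():
        result[k] = recur_datafy(v, seen)
    return result


if __name__ == "__main__":
    parent = {"name": "sentence"}
    child = {"name": "token", "parent": parent}
    parent["child"] = child  # parent -> child -> parent is a cycle
    data = recur_datafy(parent)
    print(data["child"]["parent"] is data)
```

Registering the result before descending is what breaks the loop; an alternative design would replace repeated objects with a placeholder value rather than reproducing the cycle.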