simongray / datalinguist

Stanford CoreNLP in idiomatic Clojure.

License: GNU General Public License v3.0

Clojure 100.00%

nlp computational-linguistics natural-language-processing clojure corenlp stanford stanford-corenlp dependency-parsing dependency-parser part-of-speech-tagger

datalinguist's Issues

Automatic coercion to string in properties map (booleans, regexes, etc)

(from notes)
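The issue gives no further detail, so the following is only a minimal sketch of what such coercion might look like. It assumes a hypothetical helper named `properties` (not datalinguist's actual API) that turns a Clojure map into a `java.util.Properties` object and stringifies keywords, booleans, numbers, regexes and sequences along the way. The `:example.*` keys in the usage note are made up for illustration.

```clojure
(ns example.props
  (:require [clojure.string :as str])
  (:import [java.util Properties]
           [java.util.regex Pattern]))

(defn- coerce-key
  "Keywords lose their leading colon; anything else goes through str."
  [k]
  (if (keyword? k)
    (subs (str k) 1)
    (str k)))

(defn- coerce-val
  "Turn a Clojure value into the string form a properties file would hold."
  [v]
  (cond
    (instance? Pattern v) (.pattern ^Pattern v)        ; regex -> its source
    (keyword? v)          (name v)                     ; :RULE -> "RULE"
    (sequential? v)       (str/join "," (map coerce-val v))
    :else                 (str v)))                    ; booleans, numbers, strings

(defn properties
  "Hypothetical helper: build a java.util.Properties from a Clojure map,
  coercing every key and value to a string."
  [m]
  (let [props (Properties.)]
    (doseq [[k v] m]
      (.setProperty props (coerce-key k) (coerce-val v)))
    props))
```

With this sketch, `(properties {:annotators ["tokenize" "ssplit"], :coref.md.type :RULE, :example.flag true, :example.pattern #"\d+"})` would yield a `Properties` object holding only strings, e.g. `"tokenize,ssplit"` under `"annotators"`.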

adding support for edu.stanford.nlp.ie.crf.CRFClassifier ?

I have started working on making the CRFClassifier easy to use from Clojure.

My goal is to "integrate" the classifier into the tech.ml library, which has an extension mechanism so that third parties can contribute models to it:
https://github.com/techascent/tech.ml

And I will use a dataset dataframe to get data in and out.
https://github.com/techascent/tech.ml.dataset

The code is basically done; it's just a few lines of data conversion.

I could contribute it eventually here, instead of making a standalone library.

It has the following dependencies:
edu.stanford.nlp/stanford-corenlp {:mvn/version "4.2.0"}
techascent/tech.ml {:mvn/version "5.01"}
clojurewerkz/propertied {:mvn/version "1.3.0"}

Do you think it could fit here?

Tregex wrapping

Tregex should be wrapped in a similar way to the existing Semgrex wrapping. See: edu.stanford.nlp.trees.tregex.TregexPattern and the current sem-* implementation.

document options for crf.clj/train

As discussed, I want to document the train options properly as metadata.

This is rather easy to do. There is just one question, namely whether we should "clojurize" the options or not:
Should ":maxLeft" become ":max-left"?

There is a library that does this automatically in both directions:
https://github.com/clj-commons/camel-snake-kebab
so it would be easy.

It depends a bit on how close your library wants to stay to Java land vs. Clojure land.

I don't have a preference; I can do either.

Just decide and let me know.
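To make the question concrete, here is a small hand-rolled sketch of the two conversions. It has no dependencies and is not how camel-snake-kebab is implemented; the function names `camel->kebab`, `kebab->camel` and `javaize-options` are hypothetical, and the library linked above would handle edge cases more robustly.

```clojure
(ns example.options
  (:require [clojure.string :as str]))

(defn camel->kebab
  "Convert a camelCase keyword such as :maxLeft into :max-left."
  [k]
  (-> (name k)
      (str/replace #"([a-z0-9])([A-Z])" "$1-$2") ; put a dash at each case boundary
      (str/lower-case)
      (keyword)))

(defn kebab->camel
  "Convert a kebab-case keyword such as :max-left into :maxLeft."
  [k]
  (-> (name k)
      (str/replace #"-([a-z0-9])" (fn [[_ c]] (str/upper-case c)))
      (keyword)))

(defn javaize-options
  "Rewrite every key of a Clojure-style options map to its camelCase form."
  [opts]
  (into {} (map (fn [[k v]] [(kebab->camel k) v])) opts))
```

One design point this surfaces: the round trip is only lossless for simple names. A name containing consecutive capitals is flattened when lower-cased and cannot be recovered by `kebab->camel`, which is an argument for either keeping the Java names or maintaining an explicit mapping.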

TokensRegex wrapping

I need to wrap the TokensRegex methods in approximately the same way as Semgrex. See: edu.stanford.nlp.ling.tokensregex.TokensRegexPattern and the existing sem-* implementation in dk.simongray.datalinguist.dependency.

Long shot: Stanza wrapper

This is probably a long shot, but since there's working Python interop in Clojure, it might be possible to also integrate Stanford's Stanza, which has a different - and much bigger - set of language models available. Unfortunately, these models are not compatible with CoreNLP. The currently available models notably include Danish!

Pipeline with KBP annotator not working

Attempting to create a pipeline with the kbp annotator currently results in the following error:

Execution error (VerifyError) at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer/toProtoBuilder (ProtobufAnnotationSerializer.java:673).
Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0000000: 2a2b 1cb6 0024 b0

I initially thought the error had something to do with a difference between the protobuf version used to compile the included protobuf data and the version that CoreNLP officially depends on (3.9.2). However, if I drop both the corenlp jar and the models jar inside a directory along with an example.txt file containing some text, annotation works fine from the command line:

# long example
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,kbp -coref.md.type RULE -file example.txt

# shorter example (no coref)
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,kbp -file example.txt

Annotation hierarchy observability

CoreNLP annotations do have a de facto hierarchy, e.g. a dependency graph is always a child of a sentence, but in principle annotations can appear as children of any other annotation. Consequently, there is really no delimited annotation tree to illustrate, nor is there an obvious built-in way to infer whether a certain annotation supports a specific annotation as its child.

Explore various ways this could be enhanced. Maybe there is some role for metadata?

universal dependency metadata

Seems like it could be very useful to map explanations of and references to universal dependencies and then have a function exposing these in the REPL, e.g. https://universaldependencies.org/u/dep/conj.html
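A rough sketch of what such a lookup could look like, assuming a hypothetical `explain` function and a hand-written table of a few relations. The URL pattern is inferred from the conj link above and is only assumed to hold for plain relation names like these; nothing here is part of datalinguist.

```clojure
(ns example.ud)

(def ud-base-url
  "Assumed base of the per-relation documentation pages."
  "https://universaldependencies.org/u/dep/")

(def relations
  "A tiny hand-written sample; a real table would cover every relation."
  {"conj"  "conjunct"
   "nsubj" "nominal subject"
   "obj"   "object"
   "amod"  "adjectival modifier"})

(defn explain
  "Return a short explanation and a documentation link for a relation,
  or nil when the relation is not in the table."
  [rel]
  (when-let [title (get relations rel)]
    {:relation rel
     :title    title
     :url      (str ud-base-url rel ".html")}))
```

At the REPL, `(explain "conj")` would return a map with the relation's title and a link to its documentation page.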

datafy and recur-datafy throw StackOverflowError

It seems there is an infinite loop in the datafy-tsm implementation. Removing the datafy call from (assoc m k (datafy v)) and leaving just v seems to solve it for the regular datafy. This is also how it should be: datafy itself shouldn't be recursive.

In the case of recur-datafy I will need to look further into what's causing it. I guess some sort of memory is needed to avoid this issue.
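One way the "memory" idea could be sketched, assuming the overflow comes from objects that refer back to themselves: keep an identity-based set of the objects currently being expanded and stop when one of them is seen again. This is an illustration only, not the library's recur-datafy; the function below and its `::cycle` marker are hypothetical.

```clojure
(ns example.recur-datafy
  (:require [clojure.datafy :refer [datafy]])
  (:import [java.util IdentityHashMap]))

(defn recur-datafy
  "Datafy x, then recursively datafy the values inside the result.
  An object that is already being expanded further up the path is
  replaced by ::cycle instead of being expanded again."
  ([x] (recur-datafy x (IdentityHashMap.)))
  ([x ^IdentityHashMap seen]
   (if (.containsKey seen x)
     ::cycle
     (do
       (.put seen x true)               ; remember x while its children are walked
       (try
         (let [d (datafy x)]
           ;; Everything below is eager on purpose: `seen` is mutable,
           ;; so a lazy seq would be realized after x has been forgotten.
           (cond
             (map? d)        (reduce-kv (fn [m k v]
                                          (assoc m k (recur-datafy v seen)))
                                        {} d)
             (set? d)        (into #{} (map #(recur-datafy % seen)) d)
             (sequential? d) (mapv #(recur-datafy % seen) d)
             :else           d))
         (finally
           (.remove seen x)))))))       ; forget x once its subtree is done
```

Because an object is forgotten once its subtree is finished, the same object may still appear several times in the output; only true cycles are cut. For example, an atom that contains a map pointing back at the atom terminates with `::cycle` in place of the back-reference.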
