datalinguist's People

Contributors

behrica, simongray

datalinguist's Issues

Adding support for edu.stanford.nlp.ie.crf.CRFClassifier?

I have started working on making the CRFClassifier easy to use from Clojure.

My goal is to "integrate" the classifier into the tech.ml library, which has an extension mechanism so that third parties can contribute models to it:
https://github.com/techascent/tech.ml

I will use a tech.ml.dataset dataframe to get data in and out:
https://github.com/techascent/tech.ml.dataset

The code is basically done; it's just a few lines of data conversions.

I could eventually contribute it here instead of making a standalone library.

It has the following dependencies:
edu.stanford.nlp/stanford-corenlp {:mvn/version "4.2.0"}
techascent/tech.ml {:mvn/version "5.01"}
clojurewerkz/propertied {:mvn/version "1.3.0"}

Do you think it could fit here?

Tregex wrapping

Tregex should be wrapped in a similar way to the existing Semgrex wrapping. See: edu.stanford.nlp.trees.tregex.TregexPattern and the current sem-* implementation.

See also the parallel issue #11.

document options for crf.clj/train

As discussed, I want to document the train options properly as metadata.

This is rather easy to do. There is just one question,
namely whether we should "clojurize" the options or not:
should ":maxLeft" become ":max-left"?

There is a library that does this automatically in both directions:
https://github.com/clj-commons/camel-snake-kebab
so it would be easy.

It depends a bit on "how close" your library wants to stay to Java land vs. Clojure land.

I don't have a preference, I can do both.

Just decide and let me know.
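
For illustration, the conversion in question could be sketched in plain Clojure roughly as follows. This is a hand-rolled stand-in and not the camel-snake-kebab API; the function names camel->kebab, kebab->camel and camelize-keys are hypothetical and only meant to show what "clojurizing" the option keys would involve.

```clojure
(require '[clojure.string :as str])

(defn camel->kebab
  "Converts a camelCase keyword such as :maxLeft into :max-left."
  [k]
  (-> (name k)
      ;; insert a hyphen wherever a lower-case letter or digit meets an upper-case letter
      (str/replace #"([a-z0-9])([A-Z])" "$1-$2")
      str/lower-case
      keyword))

(defn kebab->camel
  "Converts a kebab-case keyword such as :max-left back into :maxLeft."
  [k]
  (let [[head & tail] (str/split (name k) #"-")]
    (keyword (apply str head (map str/capitalize tail)))))

(defn camelize-keys
  "Rewrites the keys of a Clojure-style option map into Java-style keys
  (hypothetical helper; values are passed through untouched)."
  [opts]
  (into {} (for [[k v] opts] [(kebab->camel k) v])))

;; Usage: Clojure-style options in, Java-style option names out.
(camelize-keys {:max-left 1})
```

One caveat with any automatic conversion: the round trip is lossy for names containing consecutive capitals. With the sketch above, a key such as :useNGrams becomes :use-ngrams and comes back as :useNgrams, so an explicit lookup table may be needed for such options.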

TokensRegex wrapping

I need to wrap the TokensRegex methods approximately the same way as Semgrex. See: edu.stanford.nlp.ling.tokensregex.TokensRegexPattern and the existing sem-* implementation in dk.simongray.datalinguist.dependency.

Long shot: Stanza wrapper

This is probably a long shot, but since there's working Python interop in Clojure, it might be possible to also integrate Stanford's Stanza, which has a different - and much bigger - set of language models available. Unfortunately, these models are not compatible with CoreNLP. The currently available models notably include Danish!

Pipeline with KBP annotator not working

Attempting to create a pipeline with the kbp annotator currently results in the following error:

Execution error (VerifyError) at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer/toProtoBuilder (ProtobufAnnotationSerializer.java:673).
Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0000000: 2a2b 1cb6 0024 b0  

I initially thought the error had something to do with a difference between the protobuf version used to compile the included protobuf data and the version that CoreNLP officially depends on (3.9.2). However, if I drop both the corenlp jar and the models jar inside a directory along with an example.txt file containing some text, annotation works fine from the command line:

# long example
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,coref,kbp -coref.md.type RULE -file example.txt

# shorter example (no coref)
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,kbp -file example.txt

Annotation hierarchy observability

CoreNLP annotations do have a de facto hierarchy, e.g. a dependency graph is always a child of a sentence, but in principle annotations can appear as children of any other annotation. Consequently, there is really no delimited annotation tree to illustrate, nor is there an obvious built-in way to infer whether a certain annotation supports a specific annotation as its child.

Explore various ways this could be enhanced. Maybe there is some role for metadata?

datafy and recur-datafy throw StackOverflowError

This looks like an infinite loop in the datafy-tsm implementation. Removing the datafy call from (assoc m k (datafy v)) and leaving just v seems to solve it for the regular datafy. This is also how it should be: datafy itself shouldn't be recursive.

In the case of recur-datafy I will need to look further into what's causing it. I guess some sort of memory is needed to avoid this issue.
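
A minimal sketch of what such a memory could look like, assuming the overflow comes from objects that (directly or indirectly) contain themselves. This is not the datafy-tsm or recur-datafy implementation: java.util.Map stands in for CoreNLP's annotation maps, and the function name recur-walk and the :cycle marker are hypothetical.

```clojure
(defn recur-walk
  "Recursively converts java.util.Map instances into Clojure maps.
  Objects currently being converted are remembered by identity, so a
  map that contains itself yields a :cycle marker instead of recursing forever."
  ([x] (recur-walk x (java.util.IdentityHashMap.)))
  ([x ^java.util.IdentityHashMap seen]
   (cond
     ;; leaves are returned untouched
     (not (instance? java.util.Map x)) x
     ;; x is already on the current path: stop here
     (.containsKey seen x) :cycle
     :else
     (do
       (.put seen x true)
       ;; `into` fully realizes the lazy `for` before x is forgotten again
       (let [result (into {} (for [[k v] x] [k (recur-walk v seen)]))]
         (.remove seen x)
         result)))))

;; Usage: a mutable map that refers back to itself.
(def m (java.util.HashMap.))
(.put m "name" "root")
(.put m "self" m)
(recur-walk m)
```

Identity-based tracking is used here because calling hashCode on a self-containing java.util.HashMap would itself recurse without end. Forgetting each object once it has been converted means shared, non-cyclic references are still expanded in full; only true cycles are cut off.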
