biryani's People

Contributors: phanijella

biryani's Issues

Make the code a python package

To make the code easier to test and to help others use it, this repository should be packaged as a Python package. There are many good instructions for doing this; a simple example is available on readthedocs.org. Additionally, you should set up py.test with a simple test case that implicitly shows a developer how this code works.

Merge adaptive, all-annotators link to master and clean up

The master branch should have the code that most users will run, either all annotators or kalman_filter_all_anno. Ideally, both should be in the same branch with an option to turn it on or off. At a minimum, master should be a default choice of one of those two.

Use standard biryani config file names

Instead of using specific config files packaged with the dockerfiles, such as petrach.config and corenlp.config, let's define one config file name for each package. I suggest biryani.json.

Within it we can create a JSON object. We should also try to use similar conventions across packages. We can add a schema for each config key as we develop.
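As a sketch of what a shared biryani.json might look like (the key names and values below are assumptions, not an agreed schema):

```json
{
  "rabbitmq": { "host": "localhost", "port": 5672, "queue": "biryani_docs" },
  "mongo": { "uri": "mongodb://localhost:27017", "db": "biryani" },
  "corenlp": {
    "annotators": "tokenize,ssplit,pos,lemma,parse",
    "parse_model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz"
  },
  "batch_size": 500
}
```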

Add CoreNLP container

We will have a Docker container running CoreNLP natively (not the CoreNLP server) that will consume jobs from the RabbitMQ.

Components/Tasks:

  • Consume jobs from the RabbitMQ queue
  • Process job through the CoreNLP command line tool, configured with the right options (see below)
  • Format CoreNLP output into the right JSON schema
  • Store formatted output in cache
  • Update MongoDB with contents of cache

CoreNLP options

annotators = tokenize,ssplit,pos,lemma,parse
parse.model = edu/stanford/nlp/models/srparser/englishSR.ser.gz

CoreNLP can be downloaded here: http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip

The shift-reduce parser (faster) can be downloaded here: http://nlp.stanford.edu/software/stanford-srparser-2014-10-23-models.jar

How to use shift-reduce parser with CoreNLP: http://nlp.stanford.edu/software/srparser.shtml
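With the options above, the CoreNLP command-line invocation can be assembled along these lines (a sketch; the jar/zip file names match the download links above, but the input/output file names and memory setting are assumptions):

```python
def corenlp_command(input_file, output_dir):
    """Build a CoreNLP CLI invocation that uses the shift-reduce parser."""
    return [
        "java", "-mx4g",
        # classpath: the full CoreNLP distribution plus the SR parser models jar
        "-cp", "stanford-corenlp-full-2015-12-09/*:stanford-srparser-2014-10-23-models.jar",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,parse",
        "-parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
        "-outputFormat", "json",
        "-file", input_file,
        "-outputDirectory", output_dir,
    ]

cmd = corenlp_command("docs.txt", "out/")
# run with e.g. subprocess.run(cmd, check=True)
```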

Other questions

  • pull jobs in batches from the queue?
  • store processed jobs in SQLite for batch writing back to the Mongo?
  • write directly to Mongo or write to queue that writes to Mongo (I think we decided on writing directly to Mongo, but good to make sure).

And note: the job queue consumer in this container will be written in Java, while in the other processing containers it will be written in Python.

Unable to parse remaining documents when the total is not a multiple of the batch size

Suppose our batch size is 500 and we have 756 documents in total.

Every time we receive a message from the RabbitMQ queue, we call a function named dowork. What dowork does is store the message in a blocking queue, and we wrote logic so that parsing starts when the blocking queue reaches 500 messages or when 1 minute passes idle without parsing.

The problem is that, with 756 documents, the first iteration parses 500 documents, leaving 756 - 500 = 256.

In the second iteration we never reach a queue size of 500, because only 256 documents remain!
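One way to make the idle timeout actually flush the remainder is sketched below. The dowork name, the 500-document batch size, and the 1-minute idle timeout come from the description above; the class structure and tick() method are hypothetical:

```python
import time

class BatchBuffer:
    """Accumulate messages; flush on a full batch OR after an idle timeout,
    so a final partial batch (e.g. 256 of 756 docs) still gets parsed."""

    def __init__(self, parse_fn, batch_size=500, idle_seconds=60):
        self.parse_fn = parse_fn
        self.batch_size = batch_size
        self.idle_seconds = idle_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def dowork(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def tick(self):
        """Call periodically (e.g. from a timer thread): flushes a partial
        batch once the idle timeout has elapsed."""
        if self.buffer and time.monotonic() - self.last_flush >= self.idle_seconds:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.last_flush = time.monotonic()
        self.parse_fn(batch)

# demo: 756 messages with batch size 500 (idle timeout forced to 0 for the demo)
parsed = []
buf = BatchBuffer(parsed.append, batch_size=500, idle_seconds=0)
for i in range(756):
    buf.dowork(i)
buf.tick()  # idle timeout elapsed -> flushes the remaining 256
```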

Broken Jar links

Stanford CoreNLP uses several jar files internally. Biryani downloads these jar files and adds them to a local path when building its image. We need to update these jar files so that they are downloaded from the proper locations.

One that needs to be moved immediately is https://json-simple.googlecode.com/files/json-simple-1.1.1.jar ->
https://cliftonlabs.github.io/json-simple/target/json-simple-2.1.2.jar

Write code to go from SQLite back to Mongo

The CoreNLP and Petrarch2 containers write their output to a SQLite DB right now, for speed and convenience. We need code that will take this output and perform a bulk update back into Mongo for permanent storage.
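A minimal sketch of that step. The table and column names are assumptions, and the operations are built as plain dicts so the sketch runs without a Mongo connection; with pymongo they would become `UpdateOne(op["filter"], op["update"], upsert=True)` passed to `collection.bulk_write(...)`:

```python
import json
import sqlite3

def read_bulk_updates(con):
    """Read processed output rows from SQLite and turn them into
    Mongo bulk-update operations (represented here as plain dicts)."""
    ops = []
    for doc_id, payload in con.execute("SELECT doc_id, output FROM results"):
        ops.append({
            "filter": {"_id": doc_id},
            "update": {"$set": {"corenlp": json.loads(payload)}},
            "upsert": True,
        })
    return ops

# tiny demo with an in-memory DB standing in for the container's SQLite file
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (doc_id TEXT, output TEXT)")
con.execute("INSERT INTO results VALUES ('d1', ?)",
            (json.dumps({"tree": "(ROOT ...)"}),))
ops = read_bulk_updates(con)
```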

Pipeline

  • Step 1: Decrypt the Truecrypt file.
    Input: Truecrypt file
    Output: XML files
  • Step 2: Parse the XML files into JSON format and store them in a RabbitMQ queue.
    We can modify Andy's code and integrate RabbitMQ into it to store the documents in the queue.
    Input: XML files
    Output: RabbitMQ queue holding all the documents in JSON format
  • Step 3: Receive the documents from the RabbitMQ queue and send them to biryani, which produces the CoreNLP parse-tree output.
    Input: Documents from the RabbitMQ queue
    Output: CoreNLP parse trees
  • Step 4: Store the CoreNLP output in a SQLite file and send it to the Petrarch containers, which give us event data.
    One way of doing this is to create another RabbitMQ queue for the CoreNLP output and let the Petrarch containers receive and process it to produce event data.
    Input: CoreNLP output
    Output: 1) Read the SQLite file containing the CoreNLP output and send it to Petrarch for processing.
    2) Store the output in a SQLite file and batch-update MongoDB (using Mongo bulk update).
  • Step 5: From Petrarch, store the event data in the database used by the website.
    Input: Petrarch output (event data)
    Output: Event data stored in the database used by the website
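Step 2 above (XML to JSON before queueing) can be sketched as follows; the `<doc>`, `<id>`, and `<text>` element names are assumptions about the input format, not the actual schema of the decrypted files:

```python
import json
import xml.etree.ElementTree as ET

def xml_doc_to_json(xml_string):
    """Convert one document's XML into the JSON string to push to RabbitMQ."""
    root = ET.fromstring(xml_string)
    doc = {
        "id": root.findtext("id"),      # hypothetical element names
        "text": root.findtext("text"),
    }
    return json.dumps(doc)

msg = xml_doc_to_json("<doc><id>42</id><text>Some article text.</text></doc>")
# msg would then be published to the RabbitMQ queue
```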

May need to balance container usage by changing the batch size

@cegme @PhaniJella @ahalterman One thing to notice is that the different containers not only have the same pattern of memory usage, their up and down times also follow the same pattern. This makes me think we could have them consume different batch sizes, e.g. node 1 uses batch size 400, node 2 uses 500, and node 3 uses 600, so that their schedules differ. I think the CPU would be utilized better this way.

(screenshots: yanli114, yanli116)

Make the output of petrarch to json format

The Petrarch output is in Python dictionary format. We need to serialize it to JSON so that, when reading data from the SQLite DB, we can parse the JSON string into JSON objects and extract the required data.
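A sketch of the conversion (the event-dict shape below is a made-up example, not Petrarch's actual output schema):

```python
import json

def petrarch_output_to_json(event_dict):
    """Serialize the Petrarch event dict to a JSON string for SQLite storage."""
    return json.dumps(event_dict, sort_keys=True)

events = {"20150101": {"source": "USA", "target": "RUS", "code": "042"}}
serialized = petrarch_output_to_json(events)   # store this string in SQLite
restored = json.loads(serialized)              # round-trips back to the dict
```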

Logstash with log4j configuration

@phani, I experimented with Logstash and log4j tonight, and here is what I found:
Use log4j2; version 1 is outdated. The way Logstash works is that it listens on a port, which needs to be configured both in logstash.config and in log4j via an "appender" defined in log4j2.xml. Elasticsearch then listens on the port you set up in logstash.config, and you can go to localhost:<elastic port> to view the log info.
I pushed the setup for these two to my branch. Note that a lot of the older setup guides you find on Google may not work; some of the approaches they describe are no longer supported by Logstash and are marked as obsolete. Also, if your Java code fails to compile, check this answer about deleting some related jars:

http://stackoverflow.com/questions/25891737/getting-exception-org-apache-logging-slf4j-slf4jloggercontext-cannot-be-cast-to

That should fix it. The problem I have not figured out is that an appender apparently also needs to be set up on the Logstash side, not only in log4j; I did not find a good answer on how to tackle this, so more research is needed.

Below is the error I got. Logstash does run, but Elasticsearch does not seem to receive what the Java code sends.

(screenshot: selection_003)

Output of the corenlp

What is the desired output we are looking for from CoreNLP? If we pass a whole paragraph, my current understanding is that the output is generated sentence by sentence; each sentence in the paragraph is parsed rather than the paragraph as a whole.

How should the structure of the output be formatted, and where and how should it be stored? A sample format of the output would be helpful.

Parse Issue

ISSUE 1:

java.lang.IllegalArgumentException
at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:699)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:324)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1102)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.&lt;init&gt;(GraphRelation.java:1083)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.&lt;init&gt;(GraphRelation.java:309)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:309)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:320)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.&lt;init&gt;(NodePattern.java:315)
at edu.stanford.nlp.semgraph.semgrex.NodePattern.matcher(NodePattern.java:276)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.&lt;init&gt;(CoordinationPattern.java:147)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern.matcher(CoordinationPattern.java:121)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:339)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:438)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:555)
at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:182)
at edu.stanford.nlp.dcoref.Mention.findDependentVerb(Mention.java:1068)
at edu.stanford.nlp.dcoref.Mention.setDiscourse(Mention.java:319)
at edu.stanford.nlp.dcoref.Mention.process(Mention.java:237)
at edu.stanford.nlp.dcoref.Mention.process(Mention.java:244)
at edu.stanford.nlp.dcoref.MentionExtractor.arrange(MentionExtractor.java:215)
at edu.stanford.nlp.dcoref.MentionExtractor.arrange(MentionExtractor.java:133)
at edu.stanford.nlp.dcoref.MentionExtractor.arrange(MentionExtractor.java:108)
at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:120)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.pipeline.AnnotationPipeline$1.lambda$next$36(AnnotationPipeline.java:148)
at edu.stanford.nlp.util.logging.Redwood$Util$1$1.run(Redwood.java:1071)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ISSUE 2:

(The stack trace is identical to Issue 1.)

Switch to config file for options

Rather than changing hard-coded options in the code, read them from a config file (e.g. like this). That will also make it easier to combine biryani with birdcage. (The config file should be interoperable between the two.)
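A minimal loader sketch, assuming a biryani.json as proposed in the config-file-names issue above; the default values and key names are hypothetical:

```python
import json

DEFAULTS = {"batch_size": 500, "num_threads": 8}  # fallback values (assumed)

def load_config(path="biryani.json"):
    """Read options from a config file instead of hard-coding them,
    falling back to defaults for any missing keys."""
    cfg = dict(DEFAULTS)
    try:
        with open(path) as f:
            cfg.update(json.load(f))
    except FileNotFoundError:
        pass  # no config file present: run on defaults
    return cfg

cfg = load_config("nonexistent.json")  # no such file, so DEFAULTS apply
```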

Write dynamic batch sizing algorithm

We want to change the batch size based on CPU and memory usage for the container. Hopefully this will make it more efficient, and it will definitely make things more interesting.
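One possible shape for such an algorithm, sketched below. The thresholds, step size, and AIMD-style policy (grow additively when underutilized, halve when hot) are arbitrary assumptions; the CPU/memory percentages would come from container metrics:

```python
def next_batch_size(current, cpu_pct, mem_pct,
                    lo=50.0, hi=85.0, step=50, min_size=100, max_size=1000):
    """Adjust batch size from the container's CPU/memory usage:
    shrink multiplicatively when either resource runs hot,
    grow additively while both are underutilized (AIMD-style)."""
    if cpu_pct > hi or mem_pct > hi:
        return max(min_size, current // 2)
    if cpu_pct < lo and mem_pct < lo:
        return min(max_size, current + step)
    return current

size = next_batch_size(500, cpu_pct=30.0, mem_pct=40.0)  # underutilized: grows to 550
```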

Keep folder names lowercase

I suggest we keep each of the folder/package names lowercase where possible. This will prevent us from running into case-sensitivity issues later, which is especially important when importing modules in Python or Java.

For example:
biryani/CoreNlp/ --> biryani/corenlp
biryani/Petrarch/ --> biryani/petrarch

Check out the PEP8 style guide for general guidelines.

Exception Involving Morphology class

Program Details
Annotators

  1. tokenize
  2. ssplit
  3. pos
  4. parse

Program Parameters

  1. Number of Threads: 128
  2. Batch size or number of documents : 1000

Exception
java.lang.Error: Error: pushback value was too large
at edu.stanford.nlp.process.Morpha.zzScanError(Morpha.java)
at edu.stanford.nlp.process.Morpha.yypushback(Morpha.java)
at edu.stanford.nlp.process.Morpha.next(Morpha.java)
at edu.stanford.nlp.process.Morphology.lemmatize(Morphology.java:156)
at edu.stanford.nlp.process.Morphology.lemma(Morphology.java:110)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.correctWHAttachment(UniversalEnglishGrammaticalStructure.java:689)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.postProcessDependencies(UniversalEnglishGrammaticalStructure.java:173)
at edu.stanford.nlp.trees.GrammaticalStructure.getDeps(GrammaticalStructure.java:560)
at edu.stanford.nlp.trees.GrammaticalStructure.&lt;init&gt;(GrammaticalStructure.java:215)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:92)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:71)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:29)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:5)
at edu.stanford.nlp.pipeline.ParserAnnotatorUtils.fillInParseAnnotations(ParserAnnotatorUtils.java:59)
at edu.stanford.nlp.pipeline.ParserAnnotator.finishSentence(ParserAnnotator.java:290)
at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:260)
at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:98)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.pipeline.AnnotationPipeline$1.lambda$next$36(AnnotationPipeline.java:148)
at edu.stanford.nlp.util.logging.Redwood$Util$1$1.run(Redwood.java:1071)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Exception 2

java.lang.StringIndexOutOfBoundsException: String index out of range: -4
at java.lang.String.&lt;init&gt;(String.java:196)
at edu.stanford.nlp.process.Morpha.yytext(Morpha.java)
at edu.stanford.nlp.process.Morpha.next(Morpha.java)
at edu.stanford.nlp.process.Morphology.lemmatize(Morphology.java:157)
at edu.stanford.nlp.process.Morphology.lemma(Morphology.java:110)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.correctWHAttachment(UniversalEnglishGrammaticalStructure.java:689)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.postProcessDependencies(UniversalEnglishGrammaticalStructure.java:173)
at edu.stanford.nlp.trees.GrammaticalStructure.getDeps(GrammaticalStructure.java:560)
at edu.stanford.nlp.trees.GrammaticalStructure.&lt;init&gt;(GrammaticalStructure.java:215)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:92)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:71)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:29)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:5)
at edu.stanford.nlp.pipeline.ParserAnnotatorUtils.fillInParseAnnotations(ParserAnnotatorUtils.java:61)
at edu.stanford.nlp.pipeline.ParserAnnotator.finishSentence(ParserAnnotator.java:290)
at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:260)
at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:98)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.pipeline.AnnotationPipeline$1.lambda$next$36(AnnotationPipeline.java:148)
at edu.stanford.nlp.util.logging.Redwood$Util$1$1.run(Redwood.java:1071)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
java.lang.ArrayIndexOutOfBoundsException
at edu.stanford.nlp.process.Morpha.zzRefill(Morpha.java)
at edu.stanford.nlp.process.Morpha.next(Morpha.java)
at edu.stanford.nlp.process.Morphology.lemmatize(Morphology.java:156)
at edu.stanford.nlp.process.Morphology.lemma(Morphology.java:110)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.correctWHAttachment(UniversalEnglishGrammaticalStructure.java:689)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.postProcessDependencies(UniversalEnglishGrammaticalStructure.java:173)
at edu.stanford.nlp.trees.GrammaticalStructure.getDeps(GrammaticalStructure.java:560)
at edu.stanford.nlp.trees.GrammaticalStructure.&lt;init&gt;(GrammaticalStructure.java:215)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:92)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructure.&lt;init&gt;(UniversalEnglishGrammaticalStructure.java:71)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:29)
at edu.stanford.nlp.trees.UniversalEnglishGrammaticalStructureFactory.newGrammaticalStructure(UniversalEnglishGrammaticalStructureFactory.java:5)
at edu.stanford.nlp.pipeline.ParserAnnotatorUtils.fillInParseAnnotations(ParserAnnotatorUtils.java:59)
at edu.stanford.nlp.pipeline.ParserAnnotator.finishSentence(ParserAnnotator.java:290)
at edu.stanford.nlp.pipeline.ParserAnnotator.doOneSentence(ParserAnnotator.java:260)
at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:98)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.pipeline.AnnotationPipeline$1.lambda$next$36(AnnotationPipeline.java:148)
at edu.stanford.nlp.util.logging.Redwood$Util$1$1.run(Redwood.java:1071)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
