ziqizhang / jate Goto Github PK

NEWS: JATE2.0 Beta.11 Released, see details below.

License: GNU Lesser General Public License v3.0

Java 100.00%

terminology-extraction terminology solr nlp text-mining ate atr

jate's Introduction

NEWS

07 Apr 2018: JATE 2.0 Beta.11 released. The main changes include: 1) migration to Solr 7.2.1. WARNING: the index files created by this version of Solr is not compatible with the previous versions; 2) fixing a couple of minor bugs documented in the Issues page; 3) added two more example configrations for the TTC corpora; 4) added two new algorithms, Basic and ComboBasic; 5) improved introduction page.

02 Apr 2018: JATE 2.0 Beta.9 released. The main change is migration to Solr 6.6.0 (thanks to MysterionRise) - JATE is now based on Solr 6.6.0. WARNING: the index files created by this version of Solr is not compatible with the previous versions. Please consider this before upgrading!

Introduction
Cite JATE
Reasons for using JATE
Support
Contributing
Other downloads
License
Contact
Release history

Introduction

JATE (Java Automatic Term Extraction) is an open source library for Automatic Term Extraction (or Recognition) from text corpora. It is implemented within the Apache Solr framework (currently Solr 7.2.1), currently supporting more than 10 ATE algorithms, and almost any kinds of term candidate patterns. The integration with Solr gives JATE potential to be easily customised and adapted to different document formats, domains, and languages.

JATE is not just a library for ATE. It also implements several text processing utilities that can be easily used for other general-purpose indexing, such as tokenisation, advanced phrase and n-gram extraction. See Reasons for using JATE

Cite JATE

Please support us by citing JATE as below:

If you use the version from this Git repository: Zhang, Z., Gao, J., Ciravegna, F. 2016. JATE 2.0: Java Automatic Term Extraction with Apache Solr. In The Proceedings of the 10th Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia

If you use the old JATE 1.11 available here (no longer supported except an outdated JATE 1.0 wiki page): Zhang, Z., Iria, J., Brewster, C., and Ciravegna, F. 2008. A Comparative Evaluation of Term Recognition Algorithms. In Proceedings of The 6th Language Resources and Evaluation Conference, May 2008, Marrakech, Morocco.

Reasons for using JATE

A wide range of ATE tools and libraries have been developed over the years. In comparison, there are five reasons why JATE is unique:

Free to use, for commercial or non-commercial purposes.
Built on the Apache Solr framework to benefit from its powerful text analysis libraries, high compatibility and scalability, and rigorous community support. As examples, you can plug in the Tika library to process different document formats, use different text preprocessing (e.g., character filtering, HTML entity conversion), tokenisation and normalisation methods available through Lucene, or index your documents and boost your queries with extracted terms easily thanks to its integration with Solr.
Highly configurable linguistic processors for candidate term extraction, such as noun phrases, PoS patterns, and n-grams.
10 state of the art ATE scoring and ranking algorithms.
A set of highly configurable, complex text processing utilities that can be used as Solr plugins for general purpose text indexing and retrieval. For example, sentence splitter, statistical tokeniser, lemmatiser, PoS tagger, phrase chunker and n-gram extractors that are sentence context aware and stopwords removable, etc.

For terminology practitioners, this means you can quickly build highly customisable ATE tools that suit your data and domain, at no cost. For terminology researchers and developers, this means that you have many necessary building blocks for developing novel ATE methods, and a uniform environment where you can evaluate and compare different methods. For general information retrieval users, you have a range of advanced text processing utilities that you can easily plug into your existing Solr or Lucene based indexing and retrieval applications.

Support

JATE is currently maintained by a team of two members, who have other full-time roles but use as much their spare time as possible on this work. We try our best to respond to your queries but we apologise for any potential delays for this reason. However there are many ways you can contribute to JATE to potentially make it better. Currently you can obtain support from us in the following ways:

A wiki page to get you started.
A Google Group to ask questions about JATE.
An issues page to report bugs - only bug reporting please. For any questions please use the Google Group above.
Contact the team directly - please use this only if your query does not fall into any of the above categories.

Contributing to JATE

JATE is a research software that originates from an EPSRC funded project 'Abraxas'. As you may appreciate, since the project termination, there is no more funding to support the software and therefore all subsequent development and its current maintenance have been undertaken voluntarily by the team. JATE is far from perfect and yet we are trilled to see it becoming one of the most popular free text mining tools in the community, thanks to your support. 1We are also keen to make it better and therefore, we would be grateful for your contributions in many forms:

Tell us your use case with JATE

We would be grateful if you tell us a little more of your use cases with JATE: are you using JATE to conduct cutting-edge research in another (or the same) subject area? Or are you using JATE to enable your business applications? By gathering as many detailed use cases as possible, you are helping us make a compelling case to apply for fundings from various institutions to support the development and maintenance of JATE. Please get in touch with us by email and share your story with us - it costs you no money but just a little of your time!

Collaboration

We are keen to collaborate with any partners (academia or industry) to develop new project ideas. This can be, but not limited to, any of the following:

further development of JATE, by adding new algorithms, text processing capacities, user friendly interface, support for other programming languages etc.
integration with other, existing implementations of ATE methods, frameworks, or platforms.
using ATE for downstream applications, such as ontology engineering, information retrieval etc.

Please get in touch with us by email to discuss your ideas.

Code contribution

We welcome bug fixes, improvements, new features etc. Before embarking on making significant changes, please open an issue and ask first so that you do not risk duplicating efforts or spending time working on something that may be out of scope. To contribute code, please follow:

1. Fork the project, clone your fork, and add the upstream to your remote:

$ git clone [email protected]:<your-username>/jate.git
$ cd jate
$ git remote add upstream https://github.com/ziqizhang/jate

2. If you need to pull new changes committed upstream:

$ git checkout master
$ git fetch upstream
$ git merge upstream/master

3. Create a feature branch for your fix or new feature:

$ git checkout -b <feature-branch-name>

4. Please try to commit your changes in logical chunks and reference the issue number in your commit messages:

$ git commit -m "Issue #<issue-number> - <commit-message>"

5. Push your feature branch to your fork.

$ git push origin <feature-branch-name>

6. Open a Pull Request against the upstream master branch. Please give your pull request a clear title and description and note which issue(s) your pull request fixes.

Important: By submitting a patch, you agree to allow the project owners to license your work under the LGPLv3 license.

Data contribution

A crucial resource for developing ATE methods is data, and particularly 'annotated' data that consists of text corpora as well as a list of expected 'real' terms to be found within the corpora. We call this 'gold standard'. This is critical for evaluating and improving the performance of ATE in particular domains.

If you would like to share any data you have created please also get in touch by email. We will acknowledge your credits and share a download within the Other downloads section, subject to your consent.

Other downloads

Other versions of JATE (no longer supported)

This Git repository only hosts the most recent version of JATE. You can obtain some of the previous versions below:

JATE 1.11: download here
Other JATE 2.0 based versions in the Maven central repository

Data

We share datasets used for the development and evaluation of ATE below.

Ziqi Zhang's research data page contains 4 datasets used for ATE research.

License

JATE is licensed with LGPL 3.0, which permits free commercial and non-commercial use. See details here.

Contact

The team member's personal webpages contain their email contacts:

JATE release history

JATE2.0 Beta.11 version - 7 Apr 2018
JATE2.0 Beta version - 20 May 2016
JATE2.0 Alpha version - 04 April 2016

jate's People

Contributors

Stargazers

Watchers

jate's Issues

a better more efficient shinglefilter

current default shinglefilter ignores sentence boundaries. this is very inefficient as it generates lots of false positives, significantly increasing overheads.

we should support a better shinglefilter, and also a closed filter such as NP chunking

Purging index between JATE calls

I am trying to run JATE on different corpora, but found that it seems to incrementally add to the SOLR index when it indexes a new corpus, meaning I get terms from not just the corpus of interest, but the union of all corpora processed to that point. My solution to the problem has been to rm purge files from the relevant data/index file, but this is now causing an exception:

`2016-10-25 09:24:04 INFO  AppCValue:328 - Indexing corpus from [docs/english] and perform candidate extraction ...
2016-10-25 09:24:05 INFO  AppCValue:331 -  [151996] files are scanned and will be indexed and analysed.
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading done
2016-10-25 09:24:09 ERROR SolrCore:525 - [jateCore] Solr index directory '/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/jateCore/data/index/' is locked.  Throwing exception.
2016-10-25 09:24:09 ERROR CoreContainer:740 - Error creating core [jateCore]: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
org.apache.solr.common.SolrException: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
2016-10-25 09:24:12 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id 112188-q to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
        at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:339)
        at uk.ac.shef.dcs.jate.app.App.indexJATEDocuments(App.java:374)
        at uk.ac.shef.dcs.jate.app.App.lambda$index$4(App.java:340)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at uk.ac.shef.dcs.jate.app.App.index(App.java:338)
        at uk.ac.shef.dcs.jate.app.AppCValue.main(AppCValue.java:45)
Caused by: java.lang.NullPointerException
        at opennlp.tools.util.Cache.put(Cache.java:134)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:195)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:87)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:32)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:102)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:168)
        at opennlp.tools.ml.BeamSearch.bestSequence(BeamSearch.java:173)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:194)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:190)
        at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.tag(POSTaggerOpenNLP.java:23)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.assignPOS(OpenNLPPOSTaggerFilter.java:103)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.createTags(OpenNLPPOSTaggerFilter.java:97)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.incrementToken(OpenNLPPOSTaggerFilter.java:51)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.getNextToken(ComplexShingleFilter.java:335)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.shiftInputWindow(ComplexShingleFilter.java:412)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.incrementToken(ComplexShingleFilter.java:175)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
        ... 23 more

2016-10-25 09:24:12 ERROR TransactionLog:567 - Error: Forcing close of tlog{file=/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/ACLRDTEC/data/tlog/tlog.0000000000000004167 refcount=2}`

Is there a clean way to do what I want to do?

Also, by way of note, the lack of support for concurrent processes (also caused by SOLR only wanting one JATE indexer running at a time) is a real bottleneck ...

test class & utility for GENIA benchmarking

Relate to #2

It needs put source code as sort of automatic test and also services as demo for end-user to run benchmark test to re-produce the result on GENIA corpus. The benchmark test source code should be put into src/test/java.

Large scaled term extraction

There is a performance issue for computational expensive term extraction (e.g., contigency table based Chi Square test, mutual information, Z-score) when dealing with a large corpus.

It would be good to add:

parallel processing configurable with multi-core processor resource;
mapReduce model support for cloud/cluster computing;
association rule algorithms (e.g., Apriori, FP-Growth)
PMI algorithm

Error creating core [GENIA]: Function not implemented

15:25:58 ERROR CoreContainer:740 - Error creating core [ACLRDTEC]: Function not implemented
15:25:58 ERROR CoreContainer:740 - Error creating core [GENIA]: Function not implemented

Exception in thread "main" org.apache.solr.common.SolrException: SolrCore 'ACLRDTEC' is not available due to init failure: Function not implemented
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:978)
        at uk.ac.shef.dcs.jate.app.App.extract(App.java:268)
        at uk.ac.shef.dcs.jate.app.AppATTF.main(AppATTF.java:51)
Caused by: org.apache.solr.common.SolrException: Function not implemented
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Function not implemented
        at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
        at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1082)
        at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
        at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:114)
        at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
        at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
        at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:94)
        at org.apache.lucene.index.IndexWriter.isLocked(IndexWriter.java:4508)
        at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:524)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:761)
        ... 9 more

null value causes chisquare to exit without warning, exit with 0

There is a bug which affects chisquare because FrequencyCtxSentenceBasedFBWorker can get document that has no TermVector.

Specifically, line 68

Terms lookupVector = SolrUtil.getTermVector(docId, properties.getSolrFieldNameJATENGramInfo(), solrIndexSearcher);

can return null, when an indexed document is empty. For some reason this did not cause program to throw any exception but just exit with code 0.

To fix this, a check like below should be inserted. This is currently available through the dev branch and will be merged in the next release

if(lookupVector==null){ LOG.error("Term vector for document id="+count+" is null. The document may be empty"); System.err.println("Term vector for document id="+count+" is null. The document may be empty"); continue; }

OpenNLPRegexChunkerFactory failed to execute PoS tagging with initialised PoS tagger

@ziqizhang I have found the bug when testing Genia corpus. OpenNLPRegexChunkerFactory filter exposes the setting for POS tagger (with class and model). However, setting the POS tagger with the following configuration failed to tagging the content which results in 0 part-of-speech sequence matching.

<filter class="org.apache.lucene.analysis.jate.OpenNLPRegexChunkerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="en-pos-maxent.bin"
                            patterns="genia.patterns"
                            minTokens="1" maxTokens="5"
                            maxCharLength="50" minCharLength="1" removeLeadingStopWords="true"
                            removeTrailingStopWords="true" removeLeadingSymbolicTokens="true"
                            removeTrailingSymbolicTokens="true"
                            stopWords="stopwords.txt" stopWordsIgnoreCase="true"/>

My current solution is to have a POS filter before the regex chunker filter as below.

<filter class="org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="en-pos-maxent.bin"/>
                <filter class="org.apache.lucene.analysis.jate.OpenNLPRegexChunkerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="en-pos-maxent.bin"
                            patterns="genia.patterns"
                            minTokens="1" maxTokens="5"
                            maxCharLength="50" minCharLength="1" removeLeadingStopWords="true"
                            removeTrailingStopWords="true" removeLeadingSymbolicTokens="true"
                            removeTrailingSymbolicTokens="true"
                            stopWords="stopwords.txt" stopWordsIgnoreCase="true"/>

In current version, configuring POS tagger in sequence chunker does not work and needs extra PoS filter in the pipeline.

Upgrade JATE to use latest Solr?

What do you think about it?

I mean, I could do this by myself, just want to get confirmation

OpenNLPTokenizer + CharFilter occasionally generates invalid long tokens

Tested version: JATE 2.0 Alpha
Tested schema:

<analyzer type="index">
          <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u0027" replacement=" ' " />
          <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u000a" replacement=" \\n " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u0009" replacement=" \\t " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u0008" replacement=" \\b " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u000d" replacement=" \\r " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u000c" replacement=" \\f " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u0022" replacement=" &quot; " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u005c" replacement=" \\ " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u003c" replacement=" &lt; " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u003e" replacement=" &gt; " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u003d" replacement=" = " />
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\u0026" replacement=" &amp; " />
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <!--tokenizer class="solr.StandardTokenizerFactory" /-->
                <tokenizer class="org.apache.lucene.analysis.jate.OpenNLPTokenizerFactory"
                           sentenceModel="en-sent.bin"
                           tokenizerModel="en-token.bin"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory"
                        posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                        posTaggerModel="en-pos-maxent.bin"/>
.... <!-- the remaining lines are not important -->
</analyzer>

Tested file:
C00-2128, see attachment. Download and change extension to .xml

Description:
When OpenNLPTokenizer is preceded by CharFilters such as that shown above in the schema, it seems that only on specific files, OpenNLPTokenizer creates invalid, extremely long token. For example, with the above schema, using the attached file, the following code (beginning from line 114 of OpenNLPTokenizer) will generate a token with 1 character "&", start offset = 7248, end offset=8298.

termAtt.setLength(termLength);
char[] buffer = termAtt.buffer();
if(word.getStart()==109 && sentenceOffset==7139)
                System.out.println("wrong");
finalOffset = correctOffset(sentenceOffset + word.getEnd());
int start = correctOffset(word.getStart() + sentenceOffset);
offsetAtt.setOffset(start, finalOffset);

But in fact, the end offset should be 7249.

The error seems to be caused by line 4 above, which calls 'correctOffset' of the previous filter in the chain, i.e., HTMLStripCharFilterFactory in the example schema above. Without calling that method, the value "sentenceOffset + word.getEnd()" would have been the correct value.

The invalid offsets of the token can cause issues to subsequent processors in the chain. For example, in OpenNLPPOSTaggerFilter, line 84:

word = new String(buffer, 0, offsetAtt.endOffset() - offsetAtt.startOffset());

would get StringIndexOutOfBound exception, when it is trying to reconstruct the string using start, end offset on the char array object (buffer) that holds the actual string (1 char length '&').

COMMENTS
It has been difficult to debug this issue, because the bug only happens with OpenNLPTokenizer + CharFilter combination, and only on certain files. It does not happen when solr's StandardTokenizer is used, or when tested on a separate input file that also contains HTML entities which will be stripped by HTMLStripCharFilterFactory. For example, the second attached file does not cause this problem.

C00-2128_cln.change.to.xml.pdf
C00-2128_cln_small.change.to.xml.pdf

Google group does not exist

When I click the link to the google group in the README file, Google tells me the group does not exist or I do not have permission to access it (I am logged in with my personal Google account).

refactor JATE to Standard Directory Layout

JATE needs to be refactored to conform to Standard Directory Layout so as to allow unit test.

/resource may be considered to be moved to src/main/resource so that resources can be packaged in jar.

see also the source directory config in build config in pom.xml

Recommended practice for package hierarchy is introduced in https://maven.apache.org/guides/introduction/introduction-to-the-standard-directory-layout.html

java.lang.IllegalStateException in jate.OpenNLPMWEFilter

@ziqizhang java.lang.IllegalStateException encountered in jate.OpenNLPMWEFilter when processing ACL RD-TEC corpus. any idea ?

2016-02-05 17:42:48 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id P06-1139 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
        at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:244)
        at uk.ac.shef.dcs.jate.app.ACLRDTECTest.indexJATEDocuments(ACLRDTECTest.java:104)
        at uk.ac.shef.dcs.jate.app.ACLRDTECTest.lambda$indexAndExtract$0(ACLRDTECTest.java:93)
        at java.util.ArrayList.forEach(ArrayList.java:1249)
        at uk.ac.shef.dcs.jate.app.ACLRDTECTest.indexAndExtract(ACLRDTECTest.java:91)
        at uk.ac.shef.dcs.jate.app.AppATTFACLRDTECTest.main(AppATTFACLRDTECTest.java:79)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException
        at java.util.ArrayList$Itr.remove(ArrayList.java:864)
        at org.apache.lucene.analysis.jate.OpenNLPMWEFilter.prune(OpenNLPMWEFilter.java:152)
        at org.apache.lucene.analysis.jate.OpenNLPRegexChunker.incrementToken(OpenNLPRegexChunker.java:54)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
        at org.apache.lucene.analysis.en.EnglishMinimalStemFilter.incrementToken(EnglishMinimalStemFilter.java:48)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.analysis.jate.PunctuationRemover.incrementToken(PunctuationRemover.java:50)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
        ... 29 more

Porting to ESearch?

Hi all,

first of all, congrats for such a good project. I've come across JATE as I'm taking a look on automatic term extraction literature to test a disambiguation approach I'm working on. Unfortunately, my ecosystem is currently built on top of Elasticsearch. Do you think it'd be worthy considering porting JATE 2.0 to ESearch? I'm going to use it as a separate module, but I wanted to ask this just in case the interaction with Solr was isolated in a connector and the port could be feasible (At first sight, I'm assuming that at least it would require the implementation effort of porting the analyzers to ESearch, as well as the appropriate connectors and candidate ranking algorithms).

Thank you very much,

test class & utility for ACL RD-TEC benchmarking

Relates to #2
It needs further experiment & test for 10 algorithms with ACL RD-TEC corpus. The source should be put in the src/test/java, providing sort of automatic test as well as a demo of how to use the tool for large text analysis for end-user

test/demo class for using jate.solr.TermCandidateFilterFactory

We need test/demo class to demonstrate how to use jate.solr.TermCandidateFilterFactory for PoS sequence matching during indexing time.

The idea is that we can load an example "schema.xml" file as test resource to start an embedded server and a test corpus can be indexed and enriched with candidates as metadata.

java.lang.StringIndexOutOfBoundsException in OpenNLPPOSTaggerFilter

Exception throws out when indexing documents.

This happens when replacing JATE customized tokeniser with Solr standard tokeniser and charFilter for irregular text in jate_text_2_terms analyser chain.

2016-03-06 00:49:08 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id A00-1002_cln.xml to the index; possible analysis error.
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
    at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:292)
    at uk.ac.shef.dcs.jate.app.ACLRDTECTest.indexJATEDocuments(ACLRDTECTest.java:164)
    at uk.ac.shef.dcs.jate.app.ACLRDTECTest.indexAndExtract(ACLRDTECTest.java:127)
    at uk.ac.shef.dcs.jate.app.AppATEACLRDTECTest.main(AppATEACLRDTECTest.java:59)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 18
    at java.lang.String.<init>(String.java:199)
    at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.walkTokens(OpenNLPPOSTaggerFilter.java:81)
    at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.incrementToken(OpenNLPPOSTaggerFilter.java:45)
    at org.apache.lucene.analysis.jate.OpenNLPMWEFilter.walkTokens(OpenNLPMWEFilter.java:267)
    at org.apache.lucene.analysis.jate.OpenNLPRegexChunker.incrementToken(OpenNLPRegexChunker.java:53)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
    at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
    ... 25 more

support term frequency filtering before rankin

JATE Solr architecture issue

@ziqizhang Just to clarify what you've mentioned about architecture solution of JATE Solr plugins yesterday.

For JATE solr toolset, we configure TR aware fields and analyser first. The candidate extraction pipeline will also be configured in schema and solrconfig.xml and first stage of candidate extraction/boundary detection should be done in indexing time.

Required setting in schema.xml for candidate extraction should be :

content field for indexing all n-grams

<!-- Field to index and store token-n-grams. These are used as a field to lookup information
         including frequency, offsets, etc. for candidate terms from the candidate term's field 
         (default=jate_cterms). Must be indexed, termVectors and termOffsets set to true-->
<field name="jate_ngraminfo" type="jate_text_2_ngrams" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />                              
                <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
                        outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            </analyzer>
        </fieldType>

content field for indexing all term candidates (1st stage filtering)


<!-- Field to index and store candidate terms. Must be indexed, and termVectors set to true-->
<field name="jate_cterms" type="jate_text_2_terms" indexed="true" stored="false" multiValued="false" termVectors="true"  termOffsets="true"/>

<fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <!--tokenizer class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
                            sentenceModel="../resource/en-sent.bin"
                            tokenizerModel="../resource/en-token.bin"/-->
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="uk.ac.shef.dcs.jate.lucene.filter.OpenNLPRegexChunkerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="../jate/resource/en-pos-maxent.bin"
                            patterns="D:/Work/jate_github/jate/jate.candidate.patterns"/>
                <filter class="solr.LowerCaseFilterFactory" />                  
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            </analyzer>
        </fieldType>

content field for indexing word-level features

<!-- Field to index and store words. You only need this if you use algorithms that require
            word-level features, such as Weirdness, GlossEx, and TermEx
            Must be indexed, termVectors and termOffsets set to true -->
<field name="jate_words" type="jate_text_2_words" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>


<fieldType name="jate_text_2_words" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />              
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            </analyzer>
        </fieldType>

unique field

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

After configuring those, i think i can run index for all documents.

Then, the 2nd stage of candidate filtering should be implement in TermRecognitionRequestHandler.

As the TermRecognitionRequestHandler is configured for every core, so the index path (as in every App*.java) is not needed in configuration. Meanwhile, instead of "jatePropertyFile", i think we should not separate configuration in more external files. So, we can try to configure various setting in solr.

We can support to configure the request handler in following options in solrconfig.xml. We can see some overlapped settings so we should minimise those to make it more neat and simplified.

algorithm
algorithm name choosing to run can be configured here
min_term_freq
minimum tern frequency can be configured here to filter term candidates lower than the min value before term ranking
fieldname_id
fieldname_jate_terminfo
fieldname_jate_cterms
fieldname_jate_sentences
fieldname_jate_words
fieldname_jate_cterms_f
featurebuilder_max_terms_per_worker
featurebuilder_max_docs_per_worker
indexer_max_docs_per_worker
indexer_max_units_to_commit
max_cpu_usage

Log configuration

We should use "org.slf4j.Logger" for all the classes.

An example of usage:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
private final Logger log = LoggerFactory.getLogger(getClass());

We can then change log level in Solr console for real-time. Please refer to solr document for details how to configure log level for debugging in real-time.

Two mode can be supported including Embedded Solr mode and plugin mode

Embedded Solr mode

Each algorithm can be used as a standalone application that can be directly applied to a document directory. The app should be able to start a embedded solr server with default/external configurations (solrHome, coreName, jatePropertyFile) and execute automatic indexing & term extraction. Term can be exported into external csv file for the sake of evaluation and benchmarking. Setting should be as taken simple as possible with advanced configurations as optional.

Plugin mode

This is a way that user can apply term recognition algorithm in a more scalable way so as to analyse a large number of documents (from single server to cloud clusters). That is to configure terminology recognition request handler to run the algorithm to perform whole index analysis.

NullPointException when loading dragon nlp resource with Apache Solr 5.5 or above

NullPointException when loading dragon nlp resource. This problem does not appear in Apache Solr 5.3.0. However, it happens in Apache Solr 5.5 or above.

The stack traces are like following:

java.lang.NullPointerException
        at dragon.nlp.tool.lemmatiser.ExceptionOperation.loadExceptions(ExceptionOperation.java:60)
        at dragon.nlp.tool.lemmatiser.ExceptionOperation.<init>(ExceptionOperation.java:22)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.loadLemmatiser(EngLemmatiser.java:139)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.initialize(EngLemmatiser.java:64)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.<init>(EngLemmatiser.java:41)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilterFactory.inform(EnglishLemmatisationFilterFactory.java:38)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:721)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:160)
        at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:56)
        at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:70)
        at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:108)
        at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:79)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:812)
        at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

This problem happens in plugin mode. It seems that the problem is due to incorrect setting for Lemmatisation. However, the 'lemmatiser' directory is in the correct directory and has correct setting in schema.

<filter class="org.apache.lucene.analysis.jate.EnglishLemmatisationFilterFactory"
                    lemmaResourceDir="lemmatiser"/>

With the exactly same setting, it complains the exceptions mentioned above in Apache Solr 5.5.x while works well in Apache Solr 5.3.0.

Provide API and improvement for PoS tagger with dictionary

At the moment, we are using OpenNLP Part-of-Speech (PoS) tagger, which relies on a pre-trained english maxent pos model (1.5). This is a general purpose model for PoS tagging.

However, for most of domain specific tasks (typically like the biomedical text GENIA contains), the general purposed PoS tagger works very bad and can only gain very low recall. We can provide support for dictionary based PoS tagging in JATE toolset with simple setting. It will be a valuable features as for most of domain specific problem, having a large training set is great challenge, while have a set of manually maintained dictionary is a simple and efficient way such as the approach adopted in GENIA Tagger.

We can provide an example/demo of (default) setting for benchmaking various ATE algorithms over GENIA corpus, by using biomedical PoS dictionary provided by the GENIA Tagger.

OpenNLP PoS tagger provide the support of using Tag Dictionary

For more details, see GENIA Tagger.

test/demo class for using "TR request handler"

We need a test/demo class for using "TR request handler" in plugin mode to extract and enrich indexed documents in Solr.

This will be a dummy data indexed in a test solr index and then run automatic test to extract terms and validate results.

NullPointException when loading dragon nlp resource with Apache Solr 5.5 or above

java.lang.NullPointerException
        at dragon.nlp.tool.lemmatiser.ExceptionOperation.loadExceptions(ExceptionOperation.java:60)
        at dragon.nlp.tool.lemmatiser.ExceptionOperation.<init>(ExceptionOperation.java:22)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.loadLemmatiser(EngLemmatiser.java:139)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.initialize(EngLemmatiser.java:64)
        at dragon.nlp.tool.lemmatiser.EngLemmatiser.<init>(EngLemmatiser.java:41)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilterFactory.inform(EnglishLemmatisationFilterFactory.java:38)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:721)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:160)
        at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:56)
        at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:70)
        at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:108)
        at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:79)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:812)
        at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

parallel indexing exception

there seems to be a bug somewhere in the opennlp tokenizer, when it is used as part of solr processing chain in parallel writing mode, strange exception is thrown

Would be good to add corpus based evaluation

The well-known problem in ATR is the lack of evaluation protocol and standards to compare performances of a proliferation of algorithms. It would be good to add corpus based evaluation taking annotated corpus as input and output benchmark result with selected algorithms. It should naively support some well-known gold standards :

Cannot build with sbt since Dragon tool not linked

Related to the Troubleshooting part in wiki, the artifact edu.drexel:dragontool:jar:1.3.3 is creating a problem with SBT:
[warn] :: edu.drexel#dragontool;1.3.3: not found
I have dragontool.jar in my lib folder, but it seems to want find .../drexel/dragontool/1.3.3/dragontool-1.3.3.pom file.
I am quite new to SBT and maven so if this is a trivial and ignorant issue then please forgive me.
Kindly,
Henri

Caused by: java.lang.NoSuchMethodError: org.apache.solr.common.SolrInputDocument.<init>([Ljava/lang/String;)

Thanks 4 your wonderful work ,but I got a issue with Solr when uploading my docs.

Auto build and deploy JATE GitHub pages with Travis-CI

Need to configure auto build and deploy JATE GitHub pages with Travis-CI. This will make every commit trigger an automatically build of the repository from remote server/service provided by Travis-CI.

TermComponentIndex sort problem.

Why the getSorted function sorts list of pairs with "(o1, o2) -> o2.first().compareTo(String.valueOf(o1.second()))".
And this cause "java.lang.IllegalArgumentException: Comparison method violates its general contract!"

check if any matrix library support concurrent writting to matrix

this can significantly improve performance of coocurrence based algorithsm

export ranked term candidates with surface form

In current version, we export ranked term candidates with lemmatised form which causes integration and evaluation problem for external system. It will be better to have an option to export either normalised term/concept or original surface forms with weights.

Timeout waiting for all directory ref counts to be released

Im trying to exectue TF-IDF on the Embedded Mode jate-2.0-beta.7 but i get this error:

_2018-06-14 14:34:33 INFO  TFIDF:27 - Beginning computing TermEx values,, total terms=4148
2018-06-14 14:34:33 INFO  TFIDF:38 - Complete
2018-06-14 14:34:45 **ERROR CachingDirectoryFactory:184** - Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<<refCount=1;path=C:\Users\Sonja\Desktop\ProjetStage\jateDemo\solr-testbed\ACLRDTEC\data\index;done=false>>
2018-06-14 14:34:45 ERROR CachingDirectoryFactory:150 - Error closing directory:org.apache.solr.common.SolrException: Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<<refCount=1;path=C:\Users\Sonja\Desktop\ProjetStage\jateDemo\solr-testbed\ACLRDTEC\data\index;done=false>>
        at org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:187)
        at org.apache.solr.core.SolrCore.close(SolrCore.java:1257)
        at org.apache.solr.core.SolrCores.close(SolrCores.java:124)
        at org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:562)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.shutdown(EmbeddedSolrServer.java:263)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.close(EmbeddedSolrServer.java:268)
        at uk.ac.shef.dcs.jate.app.App.extract(App.java:317)
        at uk.ac.shef.dcs.jate.app.AppTFIDF.main(AppTFIDF.java:49)

2018-06-14 14:34:45 INFO  AppTFIDF:516 - Exporting terms to [t-terms.json]
2018-06-14 14:34:45 INFO  AppTFIDF:520 - complete._

I don't understand what is the problem with the CachingDirectoryFactory, also i remarcted that even if I run TF-IDF or only TF i get the same first term scored its weird.Normally they are the opposite function ???(so i conclude that this error need to cause this)

embedded mode document content get missed by tika

using the Indexing class to index documents, which uses Tika to parse text and take content.

It seems sometimes Tika fails to parse the text and extract content properly. Example: try aclrd corpus with docid=the-acl-rd-tec_chunk_10228.txt

term candidate extraction issue

@ziqizhang We can support three types of candidates extraction. This is optional as user can configure solr.TermRecognitionRequestHandler to rank and filter only on pre-processed term field.

Part-of-Speech sequence/pattern
Ngram based
Noun-Phrase (NP) chunking

The settings vary from different methods.

Part-of-Speech sequence/pattern filter

part-of-speech(pos) sequence pattern for filtering term candidate lexical units
pos_sequence_filter=../config/pos_sequence_filter3
Maximum number of characters (single candidate term unit) is useful to overcome the erroneous PoS tagging or irregular text in order to increase precision
max_char_length=10
Minimum number of characters allowed in any term candidates units. increase for better precision
min_char_length=1
Stopwords filtering is quite domain-specific and useful to increase precision. This can be default to use SMART stop-word list built by Chris Buckley
stopwords=stoplist.txt
Minimum frequency allowed for term candidates.; increase for better precision. This is a corpus level statistic based filtering, we can leave to term extraction request handler
min_term_freq=2

Ngram filter

Ngram based term candidate generation is useful in the situation where linguistic and language resource/tool is lack. In solr, we can generate ngrams by means of sol.ShingleFilterFactory. Ngram based filter should be more aggressive in order to balance precision and recall for different domains.

Maximum number of words allowed in a multi-word term. Adjust the range to balance precision and recall.

max_tokens=5

Minimum number of words allowed in a multi-word term. Adjust the range to balance precision and recall.

min_tokens=1

Stopwords filtering is useful here to increase precision
stopwords=stoplist.txt
Maximum number of characters (single candidate term unit) is useful to increase precision. This is to increase precision in irregular text
max_char_length=10
Minimum number of characters allowed in any term candidates units. increase for better precision
min_char_length=1
Minimum frequency allowed for term candidates.; increase for better precision. This is a corpus level statistic based filtering, we can leave to term extraction request handler
min_term_freq=2

Noun-Phrase (NP) chunking

Noun-phrase (NP) chunking has some advantage by using machine learning model trained on domain-specific corpus.

Maximum number of words allowed in a multi-word term. Adjust the range to balance precision and recall.

max_tokens=5

Minimum number of words allowed in a multi-word term. Adjust the range to balance precision and recall.

min_tokens=1

Maximum number of characters (single candidate term unit) is useful to increase precision. This is to increase precision in irregular text
max_char_length=10
Minimum number of characters allowed in any term candidates units. increase for better precision
min_char_length=1
Minimum frequency allowed for term candidates.; increase for better precision. This is a corpus level statistic based filtering, we can leave to term extraction request handler
min_term_freq=2

Stopwords filtering strategy

Two different strategy for stopwords filtering we can implement:

Aggressive approach: remove whole term candidates if there is any stopword detected

Aggressive approach based stopword filtering is useful for ngram based term candidate extraction method, since all the possible combinations of term candidates are generated in extraction stage.

Conservative approach: remove only the term unit (heuristic rule: for only first N terms in compound term) if it is matched with a stopword

The conservative approach is beneficial for PoS pattern based approach in order to increase recall.

Suggested modification for current setting:

we can modify uk.ac.shef.dcs.jate.lucene.filter.OpenNLPRegexChunkerFactory first to support a stopword filter to only filter substring stopwords from candidates (with options for ignoreCase, etc)

An example of filter configuration would be:

<filter class="uk.ac.shef.dcs.jate.lucene.filter.OpenNLPRegexChunkerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="D:/Work/jate_github/jate/resource/en-pos-maxent.bin"
                            patterns="D:/Work/jate_github/jate/jate.candidate.patterns"
stopwords="stopwords.txt"
ignoreCase="false"/>`

NullPointerException when generating ngram from empty content

It happens when perform candidate extraction and indexing document 'C02-1055' from ACL RD-TEC 1.0

Caused by: java.lang.NullPointerException
	at org.apache.lucene.analysis.jate.ComplexShingleFilter.incrementToken(ComplexShingleFilter.java:234)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
	at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)

'C02-1055' content:

<?xml version="1.0" standalone="yes"?>

<Paper id="C02-1055">
  <Title>amp;quot;</Title>
  <Keywords></Keywords>
  <Abbreviations></Abbreviations>
  <Authors></Authors>
  <References></References>
</Paper>

API?

I'm wondering if there is any way that I can use it as a library in my application? Could you provide some basic example codes in Wiki? I just want to use the algorithmics.

Thank you.

stemming in the analyzer chain vs reference corpus (affecting termex, glossex, weirdness)

The analyzer chain will normalize candidate terms by stemming/lemmatizing. Depending on the choice of stemmer/lemmatizer, this may cause candidate terms to be incorrectly transformed. For example, "analysis" => "analysi".

This will affect algorithms that look up word/term frequency in a reference corpus. The reference corpus must be processed using the same analyzer chain, and/or stemming/lemmatizing. Otherwise, words/terms may have mismatch and causing unexpected results.

Example SOLR configuration for German text corpus?

For somebody not familiar with SOLR it is very hard to start using this. Would it be possible to
add an example configuration for processing a corpus where each document is just a text file for the language German?

Is there a way to provide the corpus in a way where the necessary NLP preprocessing (POS tagging, lemmatization, stop word identification) has already been performed by other tools?

Some problem with dependencies

Hi all,

there is one dependency that should be updated. The groupId of dragontool seems to be problematic right now. Switching edu.drexel for de.julielab seems to do the magic :)

Best,

solr seems to remove /r from /r/n

this causes offsets to be inconsistent with original input

functionality and test/demo class for embedded mode

It needs some further work on refining the functionality and test/demo classes for embedded mode.

improve handling of unsuccessfully content extraction

In current version (2.0-beta.1) of JATE2.0, the app will complain "uk.ac.shef.dcs.jate.JATEException: Cannot find expected field: jate_ngraminfo" if data format is not supported or simply no content can be extracted.

It is very confusing log information and not very indicative. We should log useful information to indicate if no content can be extracted and explain the cause if "cannot find expected field "

uk.ac.shef.dcs.jate.JATEException: Cannot find expected field: jate_ngraminfo

JATEException when executing AppChiSquare for ACL RD-TEC corpus. Meanwhile, it fails to shutdown threads cleanly which causes main thread hangs.

2016-02-07 22:53:05 ERROR FrequencyCtxSentenceBasedFBWorker:89 - Unable to build feature for document id:6996
uk.ac.shef.dcs.jate.JATEException: Cannot find expected field: jate_ngraminfo
    at uk.ac.shef.dcs.jate.util.SolrUtil.getTermVector(SolrUtil.java:35)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.computeSingleWorker(FrequencyCtxSentenceBasedFBWorker.java:67)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.computeSingleWorker(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:38)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.tryRemoveAndExec(ForkJoinPool.java:969)
    at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2004)
    at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
    at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:713)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:54)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:36)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.tryRemoveAndExec(ForkJoinPool.java:969)
    at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2004)
    at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
    at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:713)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:54)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:36)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:902)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1689)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1644)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

2016-02-07 22:53:06 ERROR FrequencyCtxSentenceBasedFBWorker:89 - Unable to build feature for document id:7015
uk.ac.shef.dcs.jate.JATEException: Cannot find expected field: jate_ngraminfo
    at uk.ac.shef.dcs.jate.util.SolrUtil.getTermVector(SolrUtil.java:35)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.computeSingleWorker(FrequencyCtxSentenceBasedFBWorker.java:67)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.computeSingleWorker(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:38)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.tryRemoveAndExec(ForkJoinPool.java:969)
    at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2004)
    at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
    at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:713)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:54)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:36)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.tryRemoveAndExec(ForkJoinPool.java:969)
    at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2004)
    at java.util.concurrent.ForkJoinTask.doJoin(ForkJoinTask.java:389)
    at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:713)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:54)
    at uk.ac.shef.dcs.jate.feature.FrequencyCtxSentenceBasedFBWorker.mergeResult(FrequencyCtxSentenceBasedFBWorker.java:22)
    at uk.ac.shef.dcs.jate.JATERecursiveTaskWorker.compute(JATERecursiveTaskWorker.java:36)
    at java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:94)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:902)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1689)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1644)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

Jate stops working after couple of corpuses

I have to manually stop the program, delete .lock file and restart the program.

01 Mar 2017 16:50:20 ERROR CoreContainer - Error creating core [GENIA]: Could not load conf for core GENIA: Initiating org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory failed due to:
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at java.io.PushbackInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at java.io.DataInputStream.readDouble(Unknown Source)
at opennlp.tools.ml.model.BinaryFileDataReader.readDouble(BinaryFileDataReader.java:53)
at opennlp.tools.ml.model.AbstractModelReader.readDouble(AbstractModelReader.java:75)
at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:146)
at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
at opennlp.tools.util.model.BaseModel.(BaseModel.java:179)
at opennlp.tools.postag.POSModel.(POSModel.java:105)
at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.(POSTaggerOpenNLP.java:18)
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)

org.apache.solr.common.SolrException: Could not load conf for core GENIA: Initiating org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory failed due to:
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at java.io.PushbackInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at java.io.DataInputStream.readDouble(Unknown Source)
at opennlp.tools.ml.model.BinaryFileDataReader.readDouble(BinaryFileDataReader.java:53)
at opennlp.tools.ml.model.AbstractModelReader.readDouble(AbstractModelReader.java:75)
at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:146)
at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
at opennlp.tools.util.model.BaseModel.(BaseModel.java:179)
at opennlp.tools.postag.POSModel.(POSModel.java:105)
at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.(POSTaggerOpenNLP.java:18)
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)

at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:80)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Caused by: java.lang.IllegalArgumentException: Initiating org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory failed due to:
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at java.io.PushbackInputStream.read(Unknown Source)
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at java.io.DataInputStream.readDouble(Unknown Source)
at opennlp.tools.ml.model.BinaryFileDataReader.readDouble(BinaryFileDataReader.java:53)
at opennlp.tools.ml.model.AbstractModelReader.readDouble(AbstractModelReader.java:75)
at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:146)
at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
at opennlp.tools.util.model.BaseModel.(BaseModel.java:179)
at opennlp.tools.postag.POSModel.(POSModel.java:105)
at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.(POSTaggerOpenNLP.java:18)
at sun.reflect.GeneratedConstructorAccessor138.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at uk.ac.shef.dcs.jate.nlp.InstanceCreator.createPOSTagger(InstanceCreator.java:28)
at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:40)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)

at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory.inform(OpenNLPPOSTaggerFactory.java:45)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:643)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:176)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
... 8 more

Wed Mar 01 16:50:26 EST 2017 loading exception data for lemmatiser...
Wed Mar 01 16:50:27 EST 2017 loading exception data for lemmatiser...
Wed Mar 01 16:50:40 EST 2017 loading done
Wed Mar 01 16:50:40 EST 2017 loading done

payload in the development of lucene plugin pipelines

Lucene supports attaching information to a token produced in the plugin pipeline. This information is attached in a payload attribute, which takes a BytesRef object that is constructed on a byte[] or string.

The best way to attach such information, based on experience so far, is to create a BytesRef object from a String (which can be a JSON string and using Gson library to convert between objects and strings), then convert the BytesRef object by calling BytesRef.utf8tostring.

DO NOT using serialisation or custom byte[] to construct BytesRef objects. This has caused many inconsistent write/read time data that are inexplicable.

solr did not shut down cleanly

I have downloaded the sourcecode of JATE from github, and import it into eclipse. When I use "maven clean install" command, it met some problem as below:

[INFO] Scanning for projects...
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO] ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Java Automatic Term Extraction Toolkit (JATE) 2.0-beta.1
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ jate ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ jate ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 119 source files to F:\maven-projects\jate-2.0-beta.1\target\classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ jate ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 0 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ jate ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 7 source files to F:\maven-projects\jate-2.0-beta.1\target\test-classes
[INFO] /F:/maven-projects/jate-2.0-beta.1/src/test/java/uk/ac/shef/dcs/jate/util/JATEUtilTest.java: 某些输入文件使用或覆盖了已过时的 API。
[INFO] /F:/maven-projects/jate-2.0-beta.1/src/test/java/uk/ac/shef/dcs/jate/util/JATEUtilTest.java: 有关详细信息, 请使用 -Xlint:deprecation 重新编译。
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ jate ---
[INFO] Surefire report directory: F:\maven-projects\jate-2.0-beta.1\target\surefire-reports

T E S T S

Running uk.ac.shef.dcs.jate.app.AppATEGENIATest
Sun Mar 05 17:15:33 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:33 CST 2017 loading done
Sun Mar 05 17:15:35 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:35 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:35 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:35 CST 2017 loading done
Sun Mar 05 17:15:35 CST 2017 loading done
Sun Mar 05 17:15:35 CST 2017 loading done
Sun Mar 05 17:15:36 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:36 CST 2017 loading done
Sun Mar 05 17:15:36 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:36 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:15:36 CST 2017 loading done
Sun Mar 05 17:15:36 CST 2017 loading done
2017-03-05 17:15:54 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:16:08 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:16:21 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:16:35 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:16:35 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55574] milliseconds
2017-03-05 17:16:37 INFO AppATEGENIATest:93 - <>
2017-03-05 17:16:37 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:16:37 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:16:37 INFO TermComponentIndexFBMaster:31 - Beginning building features (TermComponentIndex). Total terms=38850, cpu cores=8, max per core=4856
2017-03-05 17:16:37 INFO TermComponentIndexFBMaster:39 - Complete building features. Total processed terms = 38850
2017-03-05 17:16:37 INFO ContainmentFBMaster:42 - Building features using cpu cores=8, total terms=38850, max per worker=4856
2017-03-05 17:16:37 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:37 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:37 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:37 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:37 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:40 INFO ContainmentFBWorker:49 - Total terms to process=4856
2017-03-05 17:16:40 INFO ContainmentFBWorker:49 - Total terms to process=2428
2017-03-05 17:16:40 INFO ContainmentFBWorker:49 - Total terms to process=2428
2017-03-05 17:16:40 INFO ContainmentFBWorker:49 - Total terms to process=2429
2017-03-05 17:16:41 INFO ContainmentFBWorker:49 - Total terms to process=2429
2017-03-05 17:16:43 INFO ContainmentFBMaster:51 - Complete building features. Total=38850 success=38850
2017-03-05 17:16:43 INFO CValue:39 - Beginning computing CValue, cores=8, total terms=10582, max terms per worker thread=1322
2017-03-05 17:16:43 INFO CValue:46 - Complete
2017-03-05 17:16:43 INFO AppCValue:109 - Complete CValue term extraction.
2017-03-05 17:16:43 INFO AppATEGENIATest:269 - appCValue ranking took [6482] milliseconds
2017-03-05 17:17:01 INFO AppATEGENIATest:282 - =============CVALUE GENIA Benchmarking Results==================
2017-03-05 17:17:01 INFO AppATEGENIATest:349 - top 50 Precision:0.94
2017-03-05 17:17:01 INFO AppATEGENIATest:350 - top 100 Precision:0.91
2017-03-05 17:17:01 INFO AppATEGENIATest:351 - top 300 Precision:0.9
2017-03-05 17:17:01 INFO AppATEGENIATest:352 - top 500 Precision:0.86
2017-03-05 17:17:01 INFO AppATEGENIATest:353 - top 800 Precision:0.84
2017-03-05 17:17:01 INFO AppATEGENIATest:354 - top 1000 Precision:0.82
2017-03-05 17:17:01 INFO AppATEGENIATest:355 - top 1500 Precision:0.79
2017-03-05 17:17:01 INFO AppATEGENIATest:356 - top 2000 Precision:0.77
2017-03-05 17:17:01 INFO AppATEGENIATest:357 - top 3000 Precision:0.74
2017-03-05 17:17:01 INFO AppATEGENIATest:358 - top 4000 Precision:0.71
2017-03-05 17:17:01 INFO AppATEGENIATest:359 - top 5000 Precision:0.65
2017-03-05 17:17:01 INFO AppATEGENIATest:360 - top 6000 Precision:0.64
2017-03-05 17:17:01 INFO AppATEGENIATest:361 - top 7000 Precision:0.65
2017-03-05 17:17:01 INFO AppATEGENIATest:362 - top 8000 Precision:0.65
2017-03-05 17:17:01 INFO AppATEGENIATest:363 - top 9000 Precision:0.64
2017-03-05 17:17:01 INFO AppATEGENIATest:364 - top 10000 Precision:0.63
2017-03-05 17:17:01 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:17:01 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:17:02 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:17:02 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:17:02 CST 2017 loading exception data for lemmatiser...
2017-03-05 17:17:03 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@73060617 (GENIA) has a reference count of 1
Sun Mar 05 17:17:03 CST 2017 loading done
Sun Mar 05 17:17:03 CST 2017 loading done
Sun Mar 05 17:17:03 CST 2017 loading done
Sun Mar 05 17:17:03 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:17:03 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:17:03 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:17:03 CST 2017 loading done
Sun Mar 05 17:17:03 CST 2017 loading done
Sun Mar 05 17:17:03 CST 2017 loading done
2017-03-05 17:17:21 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:17:33 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:17:47 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:18:02 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:18:02 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55354] milliseconds
2017-03-05 17:18:03 INFO AppATEGENIATest:93 - <>
2017-03-05 17:18:03 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:18:03 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBMaster:61 - Building features using cpu cores=8, total docs=2000, max per worker=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBWorker:61 - Total docs to process=250
2017-03-05 17:18:03 INFO FrequencyCtxSentenceBasedFBMaster:66 - Complete building features. Total sentence ctx=18275, from total processed docs=2000
2017-03-05 17:18:03 INFO FrequencyCtxBasedCopier:37 - Copying features using 1 core, filtering 38850 terms.
2017-03-05 17:18:03 INFO FrequencyCtxBasedCopier:52 - Complete filtering, copying for 4206 terms.
2017-03-05 17:18:04 INFO FrequencyCtxBasedCopier:70 - Complete copying features.
2017-03-05 17:18:04 INFO CooccurrenceFBMaster:84 - Building features using cpu cores=8, total ctx where reference terms appear =18275, max per worker=2284
2017-03-05 17:18:04 INFO CooccurrenceFBMaster:86 - Filtering candidates with min.ttf=2 min.tcf=2
2017-03-05 17:18:04 INFO CooccurrenceFBMaster:104 - Beginning building features. Total terms=10430, total contexts=18275
2017-03-05 17:18:04 INFO CooccurrenceFBWorker:62 - Total ctx to process=18275, total ref terms=4206
2017-03-05 17:18:04 INFO CooccurrenceFBMaster:179 - Complete building features, total contexts processed=18275; total indexed candidate terms=10429; total indexed reference terms=4204
2017-03-05 17:18:04 INFO ChiSquareFrequentTermsFBMaster:44 - Beginning building features (ChiSquare frequent terms). Total terms=4206, cpu cores=8, max per core=525
2017-03-05 17:18:04 INFO ChiSquareFrequentTermsFBMaster:52 - Complete building features. Total processed terms = 4206
2017-03-05 17:18:04 INFO ChiSquare:44 - Beginning computing ChiSquare, cores=8, total terms=10430, max terms per worker thread=1303
2017-03-05 17:18:04 INFO ChiSquare:52 - Complete
2017-03-05 17:18:04 INFO AppATEGENIATest:228 - appChiSquare ranking took [1543] milliseconds
2017-03-05 17:18:22 INFO AppATEGENIATest:240 - =============CHISQUARE GENIA Benchmarking Results==================
2017-03-05 17:18:22 INFO AppATEGENIATest:349 - top 50 Precision:0.96
2017-03-05 17:18:22 INFO AppATEGENIATest:350 - top 100 Precision:0.89
2017-03-05 17:18:22 INFO AppATEGENIATest:351 - top 300 Precision:0.84
2017-03-05 17:18:22 INFO AppATEGENIATest:352 - top 500 Precision:0.8
2017-03-05 17:18:22 INFO AppATEGENIATest:353 - top 800 Precision:0.79
2017-03-05 17:18:22 INFO AppATEGENIATest:354 - top 1000 Precision:0.78
2017-03-05 17:18:22 INFO AppATEGENIATest:355 - top 1500 Precision:0.76
2017-03-05 17:18:22 INFO AppATEGENIATest:356 - top 2000 Precision:0.75
2017-03-05 17:18:22 INFO AppATEGENIATest:357 - top 3000 Precision:0.73
2017-03-05 17:18:22 INFO AppATEGENIATest:358 - top 4000 Precision:0.71
2017-03-05 17:18:22 INFO AppATEGENIATest:359 - top 5000 Precision:0.69
2017-03-05 17:18:22 INFO AppATEGENIATest:360 - top 6000 Precision:0.67
2017-03-05 17:18:22 INFO AppATEGENIATest:361 - top 7000 Precision:0.66
2017-03-05 17:18:22 INFO AppATEGENIATest:362 - top 8000 Precision:0.65
2017-03-05 17:18:22 INFO AppATEGENIATest:363 - top 9000 Precision:0.64
2017-03-05 17:18:22 INFO AppATEGENIATest:364 - top 10000 Precision:0.63
2017-03-05 17:18:22 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:18:22 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading done
Sun Mar 05 17:18:23 CST 2017 loading done
Sun Mar 05 17:18:23 CST 2017 loading done
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:18:23 CST 2017 loading done
Sun Mar 05 17:18:23 CST 2017 loading done
Sun Mar 05 17:18:23 CST 2017 loading done
2017-03-05 17:18:24 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@4211655 (GENIA) has a reference count of 1
2017-03-05 17:18:41 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:18:54 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:19:08 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:19:22 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:19:22 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55523] milliseconds
2017-03-05 17:19:23 INFO AppATEGENIATest:93 - <>
2017-03-05 17:19:23 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:19:23 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:19:24 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=18630, max per worker=2328
2017-03-05 17:19:24 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=18630 success=18630
2017-03-05 17:19:24 INFO FrequencyCtxDocBasedFBMaster:49 - Beginning building features. Total terms=38850, cpu cores=8, max per core=4856
2017-03-05 17:19:24 INFO FrequencyCtxDocBasedFBMaster:59 - Complete building features. Total processed terms = 38850/38850
2017-03-05 17:19:24 INFO TermEx:76 - Beginning computing TermEx values,, total terms=10582
2017-03-05 17:19:25 INFO TermEx:143 - Complete
2017-03-05 17:19:25 INFO AppATEGENIATest:461 - appTermEx ranking took [1342] milliseconds
2017-03-05 17:19:41 INFO AppATEGENIATest:492 - =============TERMEX GENIA Benchmarking Results==================
2017-03-05 17:19:41 INFO AppATEGENIATest:349 - top 50 Precision:0.9
2017-03-05 17:19:41 INFO AppATEGENIATest:350 - top 100 Precision:0.93
2017-03-05 17:19:41 INFO AppATEGENIATest:351 - top 300 Precision:0.9
2017-03-05 17:19:41 INFO AppATEGENIATest:352 - top 500 Precision:0.88
2017-03-05 17:19:41 INFO AppATEGENIATest:353 - top 800 Precision:0.85
2017-03-05 17:19:41 INFO AppATEGENIATest:354 - top 1000 Precision:0.86
2017-03-05 17:19:41 INFO AppATEGENIATest:355 - top 1500 Precision:0.87
2017-03-05 17:19:41 INFO AppATEGENIATest:356 - top 2000 Precision:0.86
2017-03-05 17:19:41 INFO AppATEGENIATest:357 - top 3000 Precision:0.84
2017-03-05 17:19:41 INFO AppATEGENIATest:358 - top 4000 Precision:0.84
2017-03-05 17:19:41 INFO AppATEGENIATest:359 - top 5000 Precision:0.83
2017-03-05 17:19:41 INFO AppATEGENIATest:360 - top 6000 Precision:0.81
2017-03-05 17:19:41 INFO AppATEGENIATest:361 - top 7000 Precision:0.79
2017-03-05 17:19:41 INFO AppATEGENIATest:362 - top 8000 Precision:0.77
2017-03-05 17:19:41 INFO AppATEGENIATest:363 - top 9000 Precision:0.72
2017-03-05 17:19:41 INFO AppATEGENIATest:364 - top 10000 Precision:0.65
2017-03-05 17:19:41 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:19:41 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:19:41 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:41 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:41 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:41 CST 2017 loading done
Sun Mar 05 17:19:41 CST 2017 loading done
Sun Mar 05 17:19:41 CST 2017 loading done
Sun Mar 05 17:19:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:42 CST 2017 loading done
Sun Mar 05 17:19:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:19:42 CST 2017 loading done
Sun Mar 05 17:19:42 CST 2017 loading done
2017-03-05 17:19:47 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@6612a1a2 (GENIA) has a reference count of 1
2017-03-05 17:20:00 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:20:13 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:20:27 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:20:41 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:20:41 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [56176] milliseconds
2017-03-05 17:20:42 INFO AppATEGENIATest:93 - <>
2017-03-05 17:20:42 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:20:42 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:20:42 INFO TTF:25 - Beginning computing TTF values,, total terms=10582
2017-03-05 17:20:42 INFO TTF:32 - Complete
2017-03-05 17:20:42 INFO AppATEGENIATest:542 - appTTF ranking took [246] milliseconds
2017-03-05 17:20:42 INFO AppATEGENIATest:547 - termList.size():10582
2017-03-05 17:21:00 INFO AppATEGENIATest:574 - =============TTF GENIA Benchmarking Results==================
2017-03-05 17:21:00 INFO AppATEGENIATest:349 - top 50 Precision:0.96
2017-03-05 17:21:00 INFO AppATEGENIATest:350 - top 100 Precision:0.88
2017-03-05 17:21:00 INFO AppATEGENIATest:351 - top 300 Precision:0.84
2017-03-05 17:21:00 INFO AppATEGENIATest:352 - top 500 Precision:0.82
2017-03-05 17:21:00 INFO AppATEGENIATest:353 - top 800 Precision:0.82
2017-03-05 17:21:00 INFO AppATEGENIATest:354 - top 1000 Precision:0.82
2017-03-05 17:21:00 INFO AppATEGENIATest:355 - top 1500 Precision:0.8
2017-03-05 17:21:00 INFO AppATEGENIATest:356 - top 2000 Precision:0.79
2017-03-05 17:21:00 INFO AppATEGENIATest:357 - top 3000 Precision:0.77
2017-03-05 17:21:00 INFO AppATEGENIATest:358 - top 4000 Precision:0.74
2017-03-05 17:21:00 INFO AppATEGENIATest:359 - top 5000 Precision:0.72
2017-03-05 17:21:00 INFO AppATEGENIATest:360 - top 6000 Precision:0.7
2017-03-05 17:21:00 INFO AppATEGENIATest:361 - top 7000 Precision:0.68
2017-03-05 17:21:00 INFO AppATEGENIATest:362 - top 8000 Precision:0.66
2017-03-05 17:21:00 INFO AppATEGENIATest:363 - top 9000 Precision:0.65
2017-03-05 17:21:00 INFO AppATEGENIATest:364 - top 10000 Precision:0.63
2017-03-05 17:21:00 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:21:00 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:21:00 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:00 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:00 CST 2017 loading done
Sun Mar 05 17:21:00 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:00 CST 2017 loading done
Sun Mar 05 17:21:00 CST 2017 loading done
Sun Mar 05 17:21:01 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:01 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:01 CST 2017 loading done
Sun Mar 05 17:21:01 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:21:01 CST 2017 loading done
Sun Mar 05 17:21:01 CST 2017 loading done
2017-03-05 17:21:18 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:21:31 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:21:44 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:21:59 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:21:59 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55418] milliseconds
2017-03-05 17:22:00 INFO AppATEGENIATest:93 - <>
2017-03-05 17:22:00 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:22:00 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:22:00 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=18630, max per worker=2328
2017-03-05 17:22:00 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=18630 success=18630
2017-03-05 17:22:02 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@2b01b2df (GENIA) has a reference count of 1
2017-03-05 17:22:02 INFO GlossEx:60 - Calculating GlossEx for 10582 candidate terms.
2017-03-05 17:22:02 INFO GlossEx:92 - Complete
2017-03-05 17:22:02 INFO AppGlossEx:111 - complete GlossEx term extraction.
2017-03-05 17:22:02 INFO AppATEGENIATest:313 - appGlossEx ranking took [2077] milliseconds
2017-03-05 17:22:02 INFO AppATEGENIATest:315 - termList.size():10582
2017-03-05 17:22:19 INFO AppATEGENIATest:323 - =============GLOSSEX GENIA Benchmarking Results==================
2017-03-05 17:22:19 INFO AppATEGENIATest:349 - top 50 Precision:0.94
2017-03-05 17:22:19 INFO AppATEGENIATest:350 - top 100 Precision:0.84
2017-03-05 17:22:19 INFO AppATEGENIATest:351 - top 300 Precision:0.78
2017-03-05 17:22:19 INFO AppATEGENIATest:352 - top 500 Precision:0.71
2017-03-05 17:22:19 INFO AppATEGENIATest:353 - top 800 Precision:0.7
2017-03-05 17:22:19 INFO AppATEGENIATest:354 - top 1000 Precision:0.68
2017-03-05 17:22:19 INFO AppATEGENIATest:355 - top 1500 Precision:0.67
2017-03-05 17:22:19 INFO AppATEGENIATest:356 - top 2000 Precision:0.68
2017-03-05 17:22:19 INFO AppATEGENIATest:357 - top 3000 Precision:0.7
2017-03-05 17:22:19 INFO AppATEGENIATest:358 - top 4000 Precision:0.7
2017-03-05 17:22:19 INFO AppATEGENIATest:359 - top 5000 Precision:0.7
2017-03-05 17:22:19 INFO AppATEGENIATest:360 - top 6000 Precision:0.7
2017-03-05 17:22:19 INFO AppATEGENIATest:361 - top 7000 Precision:0.69
2017-03-05 17:22:19 INFO AppATEGENIATest:362 - top 8000 Precision:0.68
2017-03-05 17:22:19 INFO AppATEGENIATest:363 - top 9000 Precision:0.67
2017-03-05 17:22:19 INFO AppATEGENIATest:364 - top 10000 Precision:0.65
2017-03-05 17:22:19 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:22:19 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading done
Sun Mar 05 17:22:20 CST 2017 loading done
Sun Mar 05 17:22:20 CST 2017 loading done
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:22:20 CST 2017 loading done
Sun Mar 05 17:22:20 CST 2017 loading done
Sun Mar 05 17:22:20 CST 2017 loading done
2017-03-05 17:22:37 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:22:50 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:23:04 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:23:19 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:23:19 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [56100] milliseconds
2017-03-05 17:23:20 INFO AppATEGENIATest:93 - <>
2017-03-05 17:23:20 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:23:20 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:23:21 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=18630, max per worker=2328
2017-03-05 17:23:21 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=18630 success=18630
2017-03-05 17:23:21 INFO Weirdness:43 - Beginning computing Weirdness values,, total terms=10582
2017-03-05 17:23:21 INFO Weirdness:77 - Complete
2017-03-05 17:23:21 INFO AppATEGENIATest:586 - appTTF ranking took [954] milliseconds
2017-03-05 17:23:21 INFO AppATEGENIATest:591 - termList.size():10582
2017-03-05 17:23:38 INFO AppATEGENIATest:618 - =============WEIRDNESS GENIA Benchmarking Results==================
2017-03-05 17:23:38 INFO AppATEGENIATest:349 - top 50 Precision:0.88
2017-03-05 17:23:38 INFO AppATEGENIATest:350 - top 100 Precision:0.91
2017-03-05 17:23:38 INFO AppATEGENIATest:351 - top 300 Precision:0.91
2017-03-05 17:23:38 INFO AppATEGENIATest:352 - top 500 Precision:0.89
2017-03-05 17:23:38 INFO AppATEGENIATest:353 - top 800 Precision:0.89
2017-03-05 17:23:38 INFO AppATEGENIATest:354 - top 1000 Precision:0.89
2017-03-05 17:23:38 INFO AppATEGENIATest:355 - top 1500 Precision:0.86
2017-03-05 17:23:38 INFO AppATEGENIATest:356 - top 2000 Precision:0.85
2017-03-05 17:23:38 INFO AppATEGENIATest:357 - top 3000 Precision:0.81
2017-03-05 17:23:38 INFO AppATEGENIATest:358 - top 4000 Precision:0.78
2017-03-05 17:23:38 INFO AppATEGENIATest:359 - top 5000 Precision:0.76
2017-03-05 17:23:38 INFO AppATEGENIATest:360 - top 6000 Precision:0.73
2017-03-05 17:23:38 INFO AppATEGENIATest:361 - top 7000 Precision:0.72
2017-03-05 17:23:38 INFO AppATEGENIATest:362 - top 8000 Precision:0.69
2017-03-05 17:23:38 INFO AppATEGENIATest:363 - top 9000 Precision:0.67
2017-03-05 17:23:38 INFO AppATEGENIATest:364 - top 10000 Precision:0.64
2017-03-05 17:23:38 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:23:38 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:23:39 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:23:39 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:23:39 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:23:39 CST 2017 loading done
Sun Mar 05 17:23:39 CST 2017 loading done
Sun Mar 05 17:23:39 CST 2017 loading done
Sun Mar 05 17:23:41 CST 2017 loading exception data for lemmatiser...
2017-03-05 17:23:41 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@271ec556 (GENIA) has a reference count of 1
2017-03-05 17:23:41 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@6f200838 (GENIA) has a reference count of 1
Sun Mar 05 17:23:41 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:23:41 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:23:41 CST 2017 loading done
Sun Mar 05 17:23:41 CST 2017 loading done
Sun Mar 05 17:23:41 CST 2017 loading done
2017-03-05 17:23:58 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:24:11 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:24:26 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:24:40 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:24:40 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [56501] milliseconds
2017-03-05 17:24:41 INFO AppATEGENIATest:93 - <>
2017-03-05 17:24:41 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
Sun Mar 05 17:24:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:42 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:43 CST 2017 loading done
Sun Mar 05 17:24:43 CST 2017 loading done
Sun Mar 05 17:24:43 CST 2017 loading done
Sun Mar 05 17:24:44 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:44 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:44 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:24:46 CST 2017 loading done
Sun Mar 05 17:24:46 CST 2017 loading done
Sun Mar 05 17:24:46 CST 2017 loading done
2017-03-05 17:25:02 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:25:15 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:25:29 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:25:43 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:25:43 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [54984] milliseconds
2017-03-05 17:25:44 INFO AppATEGENIATest:93 - <>
2017-03-05 17:25:44 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:25:45 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:25:45 INFO ATTF:19 - Calculating ATTF for 10582 candidate terms.
2017-03-05 17:25:45 INFO ATTF:38 - Complete calculating ATTF
2017-03-05 17:25:45 INFO AppATTF:109 - Complete ATTF term extraction.
2017-03-05 17:25:45 INFO AppATEGENIATest:183 - appATTF ranking took [238] milliseconds
2017-03-05 17:26:04 INFO AppATEGENIATest:196 - =============ATTF GENIA Benchmarking Results==================
2017-03-05 17:26:04 INFO AppATEGENIATest:349 - top 50 Precision:0.84
2017-03-05 17:26:04 INFO AppATEGENIATest:350 - top 100 Precision:0.85
2017-03-05 17:26:04 INFO AppATEGENIATest:351 - top 300 Precision:0.77
2017-03-05 17:26:04 INFO AppATEGENIATest:352 - top 500 Precision:0.78
2017-03-05 17:26:04 INFO AppATEGENIATest:353 - top 800 Precision:0.77
2017-03-05 17:26:04 INFO AppATEGENIATest:354 - top 1000 Precision:0.77
2017-03-05 17:26:04 INFO AppATEGENIATest:355 - top 1500 Precision:0.76
2017-03-05 17:26:04 INFO AppATEGENIATest:356 - top 2000 Precision:0.72
2017-03-05 17:26:04 INFO AppATEGENIATest:357 - top 3000 Precision:0.69
2017-03-05 17:26:04 INFO AppATEGENIATest:358 - top 4000 Precision:0.7
2017-03-05 17:26:04 INFO AppATEGENIATest:359 - top 5000 Precision:0.7
2017-03-05 17:26:04 INFO AppATEGENIATest:360 - top 6000 Precision:0.71
2017-03-05 17:26:04 INFO AppATEGENIATest:361 - top 7000 Precision:0.69
2017-03-05 17:26:04 INFO AppATEGENIATest:362 - top 8000 Precision:0.66
2017-03-05 17:26:04 INFO AppATEGENIATest:363 - top 9000 Precision:0.65
2017-03-05 17:26:04 INFO AppATEGENIATest:364 - top 10000 Precision:0.63
2017-03-05 17:26:04 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:26:04 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:26:04 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:04 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:04 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:04 CST 2017 loading done
Sun Mar 05 17:26:04 CST 2017 loading done
Sun Mar 05 17:26:05 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:05 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:05 CST 2017 loading done
Sun Mar 05 17:26:06 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:26:06 CST 2017 loading done
Sun Mar 05 17:26:06 CST 2017 loading done
Sun Mar 05 17:26:07 CST 2017 loading done
2017-03-05 17:26:10 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@1450ed35 (GENIA) has a reference count of 1
2017-03-05 17:26:26 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:26:39 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:26:52 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:27:06 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:27:06 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [54343] milliseconds
2017-03-05 17:27:08 INFO AppATEGENIATest:93 - <>
2017-03-05 17:27:08 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:27:08 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:27:08 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=18630, max per worker=2328
2017-03-05 17:27:08 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=18630 success=18630
2017-03-05 17:27:08 INFO TermComponentIndexFBMaster:31 - Beginning building features (TermComponentIndex). Total terms=38850, cpu cores=8, max per core=4856
2017-03-05 17:27:08 INFO TermComponentIndexFBMaster:39 - Complete building features. Total processed terms = 38850
2017-03-05 17:27:08 INFO RAKE:50 - Beginning computing RAKE values, cores=8 total terms=10582, max terms per worker thread=1322
2017-03-05 17:27:09 INFO RAKEWorker:101 - done =2000/10582
2017-03-05 17:27:10 INFO RAKEWorker:101 - done =4000/10582
2017-03-05 17:27:10 INFO RAKEWorker:101 - done =6000/10582
2017-03-05 17:27:11 INFO RAKEWorker:101 - done =8000/10582
2017-03-05 17:27:12 INFO RAKEWorker:101 - done =10000/10582
2017-03-05 17:27:12 INFO RAKE:58 - Complete
2017-03-05 17:27:12 INFO AppATEGENIATest:374 - appRAKE ranking took [4096] milliseconds
2017-03-05 17:27:31 INFO AppATEGENIATest:388 - =============RAKE GENIA Benchmarking Results==================
2017-03-05 17:27:31 INFO AppATEGENIATest:349 - top 50 Precision:0.82
2017-03-05 17:27:31 INFO AppATEGENIATest:350 - top 100 Precision:0.81
2017-03-05 17:27:31 INFO AppATEGENIATest:351 - top 300 Precision:0.72
2017-03-05 17:27:31 INFO AppATEGENIATest:352 - top 500 Precision:0.67
2017-03-05 17:27:31 INFO AppATEGENIATest:353 - top 800 Precision:0.71
2017-03-05 17:27:31 INFO AppATEGENIATest:354 - top 1000 Precision:0.7
2017-03-05 17:27:31 INFO AppATEGENIATest:355 - top 1500 Precision:0.69
2017-03-05 17:27:31 INFO AppATEGENIATest:356 - top 2000 Precision:0.68
2017-03-05 17:27:31 INFO AppATEGENIATest:357 - top 3000 Precision:0.66
2017-03-05 17:27:31 INFO AppATEGENIATest:358 - top 4000 Precision:0.67
2017-03-05 17:27:31 INFO AppATEGENIATest:359 - top 5000 Precision:0.68
2017-03-05 17:27:31 INFO AppATEGENIATest:360 - top 6000 Precision:0.68
2017-03-05 17:27:31 INFO AppATEGENIATest:361 - top 7000 Precision:0.67
2017-03-05 17:27:31 INFO AppATEGENIATest:362 - top 8000 Precision:0.67
2017-03-05 17:27:31 INFO AppATEGENIATest:363 - top 9000 Precision:0.66
2017-03-05 17:27:31 INFO AppATEGENIATest:364 - top 10000 Precision:0.64
2017-03-05 17:27:31 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:27:31 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:27:31 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:31 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:31 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:31 CST 2017 loading done
Sun Mar 05 17:27:31 CST 2017 loading done
Sun Mar 05 17:27:31 CST 2017 loading done
Sun Mar 05 17:27:32 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:32 CST 2017 loading done
Sun Mar 05 17:27:32 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:32 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:27:32 CST 2017 loading done
Sun Mar 05 17:27:32 CST 2017 loading done
2017-03-05 17:27:49 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:28:02 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:28:16 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:28:30 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:28:30 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55141] milliseconds
2017-03-05 17:28:31 INFO AppATEGENIATest:93 - <>
2017-03-05 17:28:31 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:28:31 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:28:31 INFO RIDF:28 - Beginning computing RIDF values,, total terms=10582
2017-03-05 17:28:31 INFO RIDF:54 - Complete
2017-03-05 17:28:31 INFO AppATEGENIATest:417 - appRIDF ranking took [249] milliseconds
2017-03-05 17:28:49 INFO AppATEGENIATest:448 - =============RIDF GENIA Benchmarking Results==================
2017-03-05 17:28:49 INFO AppATEGENIATest:349 - top 50 Precision:0.92
2017-03-05 17:28:49 INFO AppATEGENIATest:350 - top 100 Precision:0.91
2017-03-05 17:28:49 INFO AppATEGENIATest:351 - top 300 Precision:0.88
2017-03-05 17:28:49 INFO AppATEGENIATest:352 - top 500 Precision:0.87
2017-03-05 17:28:49 INFO AppATEGENIATest:353 - top 800 Precision:0.83
2017-03-05 17:28:49 INFO AppATEGENIATest:354 - top 1000 Precision:0.83
2017-03-05 17:28:49 INFO AppATEGENIATest:355 - top 1500 Precision:0.83
2017-03-05 17:28:49 INFO AppATEGENIATest:356 - top 2000 Precision:0.82
2017-03-05 17:28:49 INFO AppATEGENIATest:357 - top 3000 Precision:0.8
2017-03-05 17:28:49 INFO AppATEGENIATest:358 - top 4000 Precision:0.75
2017-03-05 17:28:49 INFO AppATEGENIATest:359 - top 5000 Precision:0.72
2017-03-05 17:28:49 INFO AppATEGENIATest:360 - top 6000 Precision:0.71
2017-03-05 17:28:49 INFO AppATEGENIATest:361 - top 7000 Precision:0.7
2017-03-05 17:28:49 INFO AppATEGENIATest:362 - top 8000 Precision:0.68
2017-03-05 17:28:49 INFO AppATEGENIATest:363 - top 9000 Precision:0.65
2017-03-05 17:28:49 INFO AppATEGENIATest:364 - top 10000 Precision:0.64
2017-03-05 17:28:49 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:28:49 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Sun Mar 05 17:28:49 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:49 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:49 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:49 CST 2017 loading done
Sun Mar 05 17:28:49 CST 2017 loading done
Sun Mar 05 17:28:49 CST 2017 loading done
Sun Mar 05 17:28:50 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:50 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:50 CST 2017 loading exception data for lemmatiser...
Sun Mar 05 17:28:50 CST 2017 loading done
Sun Mar 05 17:28:50 CST 2017 loading done
Sun Mar 05 17:28:50 CST 2017 loading done
2017-03-05 17:28:52 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@17b16c61 (GENIA) has a reference count of 1
2017-03-05 17:28:52 ERROR SolrCore:1300 - REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@7083bc1c (GENIA) has a reference count of 1
2017-03-05 17:29:08 INFO AppATEGENIATest:142 - 500 documents indexed.
2017-03-05 17:29:21 INFO AppATEGENIATest:142 - 1000 documents indexed.
2017-03-05 17:29:35 INFO AppATEGENIATest:142 - 1500 documents indexed.
2017-03-05 17:29:50 INFO AppATEGENIATest:142 - 2000 documents indexed.
2017-03-05 17:29:50 INFO AppATEGENIATest:156 - Indexing and candidate extraction took [55408] milliseconds
2017-03-05 17:29:51 INFO AppATEGENIATest:93 - <>
2017-03-05 17:29:51 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=38850, max per worker=4856
2017-03-05 17:29:51 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=38850 success=38850
2017-03-05 17:29:51 INFO TFIDF:27 - Beginning computing TermEx values,, total terms=10582
2017-03-05 17:29:51 INFO TFIDF:38 - Complete
2017-03-05 17:29:51 INFO AppATEGENIATest:504 - appTFIDF ranking took [249] milliseconds
2017-03-05 17:30:08 INFO AppATEGENIATest:531 - =============TFIDF GENIA Benchmarking Results==================
2017-03-05 17:30:08 INFO AppATEGENIATest:349 - top 50 Precision:0.96
2017-03-05 17:30:08 INFO AppATEGENIATest:350 - top 100 Precision:0.89
2017-03-05 17:30:08 INFO AppATEGENIATest:351 - top 300 Precision:0.85
2017-03-05 17:30:08 INFO AppATEGENIATest:352 - top 500 Precision:0.83
2017-03-05 17:30:08 INFO AppATEGENIATest:353 - top 800 Precision:0.83
2017-03-05 17:30:08 INFO AppATEGENIATest:354 - top 1000 Precision:0.83
2017-03-05 17:30:08 INFO AppATEGENIATest:355 - top 1500 Precision:0.81
2017-03-05 17:30:08 INFO AppATEGENIATest:356 - top 2000 Precision:0.8
2017-03-05 17:30:08 INFO AppATEGENIATest:357 - top 3000 Precision:0.77
2017-03-05 17:30:08 INFO AppATEGENIATest:358 - top 4000 Precision:0.75
2017-03-05 17:30:08 INFO AppATEGENIATest:359 - top 5000 Precision:0.73
2017-03-05 17:30:08 INFO AppATEGENIATest:360 - top 6000 Precision:0.71
2017-03-05 17:30:08 INFO AppATEGENIATest:361 - top 7000 Precision:0.69
2017-03-05 17:30:08 INFO AppATEGENIATest:362 - top 8000 Precision:0.68
2017-03-05 17:30:08 INFO AppATEGENIATest:363 - top 9000 Precision:0.65
2017-03-05 17:30:08 INFO AppATEGENIATest:364 - top 10000 Precision:0.64
2017-03-05 17:30:08 INFO AppATEGENIATest:365 - overall recall:0.1
2017-03-05 17:30:08 INFO BaseEmbeddedSolrTest:106 - shutting down core in :F:\maven-projects\jate-2.0-beta.1\testdata\solr-testbed
solr did not shut down cleanly
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 875.721 sec
Running uk.ac.shef.dcs.jate.nlp.opennlp.SentenceSplitterOpenNLPTest
2017-03-05 17:30:08 INFO SentenceSplitterOpenNLP:32 - Initializing OpenNLP sentence splitter...
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running uk.ac.shef.dcs.jate.util.JATEUtilTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec

Results :

Tests run: 16, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ jate ---
[INFO] Building jar: F:\maven-projects\jate-2.0-beta.1\target\jate-2.0-beta.1.jar
[INFO]
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ jate ---
[INFO] Building jar: F:\maven-projects\jate-2.0-beta.1\target\jate-2.0-beta.1-sources.jar
[INFO]
[INFO] --- maven-javadoc-plugin:2.9.1:jar (attach-javadocs) @ jate ---

[INFO] Building jar: F:\maven-projects\jate-2.0-beta.1\target\jate-2.0-beta.1-javadoc.jar
[INFO]
[INFO] --- maven-jar-plugin:2.6:test-jar (default) @ jate ---
[INFO] Building jar: F:\maven-projects\jate-2.0-beta.1\target\jate-2.0-beta.1-tests.jar
[INFO]
[INFO] --- maven-shade-plugin:2.3:shade (default) @ jate ---
[INFO] Including org.slf4j:slf4j-log4j12:jar:1.6.0 in the shaded jar.
[INFO] Including org.slf4j:slf4j-api:jar:1.6.0 in the shaded jar.
[INFO] Including log4j:log4j:jar:1.2.14 in the shaded jar.
[INFO] Including org.apache.tika:tika-core:jar:1.10 in the shaded jar.
[INFO] Including org.apache.tika:tika-parsers:jar:1.10 in the shaded jar.
[INFO] Including org.gagravarr:vorbis-java-tika:jar:0.6 in the shaded jar.
[INFO] Including com.healthmarketscience.jackcess:jackcess:jar:2.1.2 in the shaded jar.
[INFO] Including commons-logging:commons-logging:jar:1.1.3 in the shaded jar.
[INFO] Including com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.0 in the shaded jar.
[INFO] Including net.sourceforge.jmatio:jmatio:jar:1.0 in the shaded jar.
[INFO] Including org.apache.james:apache-mime4j-core:jar:0.7.2 in the shaded jar.
[INFO] Including org.apache.james:apache-mime4j-dom:jar:0.7.2 in the shaded jar.
[INFO] Including org.apache.commons:commons-compress:jar:1.9 in the shaded jar.
[INFO] Including org.tukaani:xz:jar:1.5 in the shaded jar.
[INFO] Including commons-codec:commons-codec:jar:1.9 in the shaded jar.
[INFO] Including org.apache.pdfbox:pdfbox:jar:1.8.10 in the shaded jar.
[INFO] Including org.bouncycastle:bcmail-jdk15on:jar:1.52 in the shaded jar.
[INFO] Including org.bouncycastle:bcpkix-jdk15on:jar:1.52 in the shaded jar.
[INFO] Including org.bouncycastle:bcprov-jdk15on:jar:1.52 in the shaded jar.
[INFO] Including org.apache.poi:poi:jar:3.13-beta1 in the shaded jar.
[INFO] Including org.apache.poi:poi-scratchpad:jar:3.13-beta1 in the shaded jar.
[INFO] Including org.apache.poi:poi-ooxml:jar:3.13-beta1 in the shaded jar.
[INFO] Including org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1 in the shaded jar.
[INFO] Including org.ow2.asm:asm-debug-all:jar:4.1 in the shaded jar.
[INFO] Including com.googlecode.mp4parser:isoparser:jar:1.0.2 in the shaded jar.
[INFO] Including com.drewnoakes:metadata-extractor:jar:2.8.0 in the shaded jar.
[INFO] Including de.l3s.boilerpipe:boilerpipe:jar:1.1.0 in the shaded jar.
[INFO] Including rome:rome:jar:1.0 in the shaded jar.
[INFO] Including org.gagravarr:vorbis-java-core:jar:0.6 in the shaded jar.
[INFO] Including com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3 in the shaded jar.
[INFO] Including org.codelibs:jhighlight:jar:1.0.2 in the shaded jar.
[INFO] Including com.pff:java-libpst:jar:0.8.1 in the shaded jar.
[INFO] Including com.github.junrar:junrar:jar:0.7 in the shaded jar.
[INFO] Including commons-logging:commons-logging-api:jar:1.1 in the shaded jar.
[INFO] Including org.apache.commons:commons-vfs2:jar:2.0 in the shaded jar.
[INFO] Including org.apache.maven.scm:maven-scm-api:jar:1.4 in the shaded jar.
[INFO] Including org.codehaus.plexus:plexus-utils:jar:1.5.6 in the shaded jar.
[INFO] Including org.apache.maven.scm:maven-scm-provider-svnexe:jar:1.4 in the shaded jar.
[INFO] Including org.apache.maven.scm:maven-scm-provider-svn-commons:jar:1.4 in the shaded jar.
[INFO] Including regexp:regexp:jar:1.3 in the shaded jar.
[INFO] Including commons-io:commons-io:jar:2.4 in the shaded jar.
[INFO] Including org.apache.commons:commons-exec:jar:1.3 in the shaded jar.
[INFO] Including com.googlecode.json-simple:json-simple:jar:1.1.1 in the shaded jar.
[INFO] Including edu.ucar:netcdf4:jar:4.5.5 in the shaded jar.
[INFO] Including net.jcip:jcip-annotations:jar:1.0 in the shaded jar.
[INFO] Including net.java.dev.jna:jna:jar:4.1.0 in the shaded jar.
[INFO] Including edu.ucar:grib:jar:4.5.5 in the shaded jar.
[INFO] Including org.jdom:jdom2:jar:2.0.4 in the shaded jar.
[INFO] Including org.jsoup:jsoup:jar:1.7.2 in the shaded jar.
[INFO] Including edu.ucar:jj2000:jar:5.2 in the shaded jar.
[INFO] Including org.itadaki:bzip2:jar:0.9.1 in the shaded jar.
[INFO] Including edu.ucar:cdm:jar:4.5.5 in the shaded jar.
[INFO] Including edu.ucar:udunits:jar:4.5.5 in the shaded jar.
[INFO] Including org.quartz-scheduler:quartz:jar:2.2.0 in the shaded jar.
[INFO] Including c3p0:c3p0:jar:0.9.1.1 in the shaded jar.
[INFO] Including net.sf.ehcache:ehcache-core:jar:2.6.2 in the shaded jar.
[INFO] Including com.beust:jcommander:jar:1.35 in the shaded jar.
[INFO] Including edu.ucar:httpservices:jar:4.5.5 in the shaded jar.
[INFO] Including com.google.guava:guava:jar:11.0.2 in the shaded jar.
[INFO] Including com.google.code.findbugs:jsr305:jar:1.3.9 in the shaded jar.
[INFO] Including org.apache.commons:commons-csv:jar:1.0 in the shaded jar.
[INFO] Including org.apache.sis.core:sis-utility:jar:0.5 in the shaded jar.
[INFO] Including org.apache.sis.storage:sis-netcdf:jar:0.5 in the shaded jar.
[INFO] Including org.apache.sis.storage:sis-storage:jar:0.5 in the shaded jar.
[INFO] Including org.apache.sis.core:sis-referencing:jar:0.5 in the shaded jar.
[INFO] Including org.apache.sis.core:sis-metadata:jar:0.5 in the shaded jar.
[INFO] Including org.opengis:geoapi:jar:3.0.0 in the shaded jar.
[INFO] Including javax.measure:jsr-275:jar:0.9.3 in the shaded jar.
[INFO] Including org.apache.opennlp:opennlp-tools:jar:1.6.0 in the shaded jar.
[INFO] Including edu.drexel:dragontool:jar:1.3.3 in the shaded jar.
[INFO] Including org.apache.solr:solr-core:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-analyzers-common:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-analyzers-kuromoji:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-analyzers-phonetic:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-backward-codecs:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-codecs:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-core:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-expressions:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-grouping:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-highlighter:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-join:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-memory:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-misc:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-queries:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-queryparser:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-sandbox:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-spatial:jar:5.3.0 in the shaded jar.
[INFO] Including org.apache.lucene:lucene-suggest:jar:5.3.0 in the shaded jar.
[INFO] Including com.carrotsearch:hppc:jar:0.5.2 in the shaded jar.
[INFO] Including com.fasterxml.jackson.core:jackson-core:jar:2.5.4 in the shaded jar.
[INFO] Including com.fasterxml.jackson.dataformat:jackson-dataformat-smile:jar:2.5.4 in the shaded jar.
[INFO] Including com.google.protobuf:protobuf-java:jar:2.5.0 in the shaded jar.
[INFO] Including com.googlecode.concurrentlinkedhashmap:concurrentlinkedhashmap-lru:jar:1.2 in the shaded jar.
[INFO] Including com.spatial4j:spatial4j:jar:0.4.1 in the shaded jar.
[INFO] Including com.tdunning:t-digest:jar:3.1 in the shaded jar.
[INFO] Including commons-cli:commons-cli:jar:1.2 in the shaded jar.
[INFO] Including commons-collections:commons-collections:jar:3.2.1 in the shaded jar.
[INFO] Including commons-configuration:commons-configuration:jar:1.6 in the shaded jar.
[INFO] Including commons-fileupload:commons-fileupload:jar:1.2.1 in the shaded jar.
[INFO] Including commons-lang:commons-lang:jar:2.6 in the shaded jar.
[INFO] Including dom4j:dom4j:jar:1.6.1 in the shaded jar.
[INFO] Including javax.servlet:javax.servlet-api:jar:3.1.0 in the shaded jar.
[INFO] Including joda-time:joda-time:jar:2.2 in the shaded jar.
[INFO] Including org.antlr:antlr-runtime:jar:3.5 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-annotations:jar:2.6.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-auth:jar:2.6.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-common:jar:2.6.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-hdfs:jar:2.6.0 in the shaded jar.
[INFO] Including org.apache.httpcomponents:httpclient:jar:4.4.1 in the shaded jar.
[INFO] Including org.apache.httpcomponents:httpcore:jar:4.4.1 in the shaded jar.
[INFO] Including org.apache.httpcomponents:httpmime:jar:4.4.1 in the shaded jar.
[INFO] Including org.apache.zookeeper:zookeeper:jar:3.4.6 in the shaded jar.
[INFO] Including org.codehaus.woodstox:stax2-api:jar:3.1.4 in the shaded jar.
[INFO] Including org.codehaus.woodstox:woodstox-core-asl:jar:4.4.1 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-continuation:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-deploy:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-http:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-io:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-jmx:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-rewrite:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-security:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-server:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-servlet:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-servlets:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-util:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-webapp:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-xml:jar:9.2.11.v20150529 in the shaded jar.
[INFO] Including org.htrace:htrace-core:jar:3.0.4 in the shaded jar.
[INFO] Including org.noggit:noggit:jar:0.6 in the shaded jar.
[INFO] Including org.ow2.asm:asm:jar:4.1 in the shaded jar.
[INFO] Including org.ow2.asm:asm-commons:jar:4.1 in the shaded jar.
[INFO] Including org.restlet.jee:org.restlet:jar:2.3.0 in the shaded jar.
[INFO] Including org.restlet.jee:org.restlet.ext.servlet:jar:2.3.0 in the shaded jar.
[INFO] Including org.apache.solr:solr-cell:jar:5.3.0 in the shaded jar.
[INFO] Including com.adobe.xmp:xmpcore:jar:5.1.2 in the shaded jar.
[INFO] Including com.ibm.icu:icu4j:jar:54.1 in the shaded jar.
[INFO] Including jdom:jdom:jar:1.0 in the shaded jar.
[INFO] Including org.apache.pdfbox:fontbox:jar:1.8.8 in the shaded jar.
[INFO] Including org.apache.pdfbox:jempbox:jar:1.8.8 in the shaded jar.
[INFO] Including org.apache.poi:poi-ooxml-schemas:jar:3.11 in the shaded jar.
[INFO] Including org.apache.tika:tika-java7:jar:1.7 in the shaded jar.
[INFO] Including org.apache.tika:tika-xmp:jar:1.7 in the shaded jar.
[INFO] Including org.apache.xmlbeans:xmlbeans:jar:2.6.0 in the shaded jar.
[INFO] Including org.aspectj:aspectjrt:jar:1.8.0 in the shaded jar.
[INFO] Including org.bouncycastle:bcmail-jdk15:jar:1.45 in the shaded jar.
[INFO] Including org.bouncycastle:bcprov-jdk15:jar:1.45 in the shaded jar.
[INFO] Including xerces:xercesImpl:jar:2.9.1 in the shaded jar.
[INFO] Including org.apache.solr:solr-langid:jar:5.3.0 in the shaded jar.
[INFO] Including com.cybozu.labs:langdetect:jar:1.1-20120112 in the shaded jar.
[INFO] Including net.arnx:jsonic:jar:1.2.7 in the shaded jar.
[INFO] Including com.google.code.gson:gson:jar:2.3.1 in the shaded jar.
[INFO] Including org.apache.solr:solr-solrj:jar:5.3.0 in the shaded jar.
[INFO] Including com.googlecode.matrix-toolkits-java:mtj:jar:1.0.5-SNAPSHOT in the shaded jar.
[INFO] Skipping pom dependency com.github.fommil.netlib:all:pom:1.1.2 in the shaded jar.
[INFO] Including net.sourceforge.f2j:arpack_combined_all:jar:0.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:core:jar:1.1.2 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-osx-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:native_ref-java:jar:1.1 in the shaded jar.
[INFO] Including com.github.fommil:jniloader:jar:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-linux-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-linux-i686:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-win-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-win-i686:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_ref-linux-armhf:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-osx-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:native_system-java:jar:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-linux-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-linux-i686:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-linux-armhf:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-win-x86_64:jar:natives:1.1 in the shaded jar.
[INFO] Including com.github.fommil.netlib:netlib-native_system-win-i686:jar:natives:1.1 in the shaded jar.
[INFO] Including junit:junit:jar:4.12 in the shaded jar.
[INFO] Including org.hamcrest:hamcrest-core:jar:1.3 in the shaded jar.
[INFO] Including mysql:mysql-connector-java:jar:5.1.40 in the shaded jar.
[WARNING] bcmail-jdk15-1.45.jar, bcmail-jdk15on-1.52.jar define 55 overlappping classes:
[WARNING] - org.bouncycastle.mail.smime.SMIMEEnvelopedGenerator$EnvelopedGenerator
[WARNING] - org.bouncycastle.mail.smime.validator.SignedMailValidatorException
[WARNING] - org.bouncycastle.mail.smime.examples.ReadLargeCompressedMail
[WARNING] - org.bouncycastle.mail.smime.SMIMESigned
[WARNING] - org.bouncycastle.mail.smime.handlers.PKCS7ContentHandler
[WARNING] - org.bouncycastle.mail.smime.util.FileBackedMimeBodyPart
[WARNING] - org.bouncycastle.mail.smime.CMSProcessableBodyPartOutbound
[WARNING] - org.bouncycastle.mail.smime.examples.ValidateSignedMail
[WARNING] - org.bouncycastle.mail.smime.examples.ReadEncryptedMail
[WARNING] - org.bouncycastle.mail.smime.SMIMESignedGenerator
[WARNING] - 45 more...
[WARNING] asm-debug-all-4.1.jar, asm-4.1.jar define 23 overlappping classes:
[WARNING] - org.objectweb.asm.Type
[WARNING] - org.objectweb.asm.AnnotationVisitor
[WARNING] - org.objectweb.asm.MethodVisitor
[WARNING] - org.objectweb.asm.Attribute
[WARNING] - org.objectweb.asm.FieldWriter
[WARNING] - org.objectweb.asm.signature.SignatureWriter
[WARNING] - org.objectweb.asm.MethodWriter
[WARNING] - org.objectweb.asm.Edge
[WARNING] - org.objectweb.asm.Handler
[WARNING] - org.objectweb.asm.ByteVector
[WARNING] - 13 more...
[WARNING] bcpkix-jdk15on-1.52.jar, bcmail-jdk15-1.45.jar define 58 overlappping classes:
[WARNING] - org.bouncycastle.cms.CMSSignedDataStreamGenerator$CmsSignedDataOutputStream
[WARNING] - org.bouncycastle.cms.RecipientInformationStore
[WARNING] - org.bouncycastle.cms.CMSEnvelopedDataParser
[WARNING] - org.bouncycastle.cms.CMSAuthenticatedData
[WARNING] - org.bouncycastle.cms.CMSAuthenticatedDataParser
[WARNING] - org.bouncycastle.cms.CMSSignedGenerator
[WARNING] - org.bouncycastle.cms.KeyTransRecipientInformation
[WARNING] - org.bouncycastle.cms.CMSException
[WARNING] - org.bouncycastle.cms.SignerInformationStore
[WARNING] - org.bouncycastle.cms.CMSRuntimeException
[WARNING] - 48 more...
[WARNING] commons-logging-1.1.3.jar, commons-logging-api-1.1.jar define 19 overlappping classes:
[WARNING] - org.apache.commons.logging.LogSource
[WARNING] - org.apache.commons.logging.impl.SimpleLog$1
[WARNING] - org.apache.commons.logging.LogFactory$4
[WARNING] - org.apache.commons.logging.Log
[WARNING] - org.apache.commons.logging.impl.WeakHashtable$1
[WARNING] - org.apache.commons.logging.LogFactory$3
[WARNING] - org.apache.commons.logging.LogFactory$2
[WARNING] - org.apache.commons.logging.impl.SimpleLog
[WARNING] - org.apache.commons.logging.impl.WeakHashtable$Entry
[WARNING] - org.apache.commons.logging.impl.Jdk14Logger
[WARNING] - 9 more...
[WARNING] asm-commons-4.1.jar, asm-debug-all-4.1.jar define 22 overlappping classes:
[WARNING] - org.objectweb.asm.commons.CodeSizeEvaluator
[WARNING] - org.objectweb.asm.commons.TryCatchBlockSorter$1
[WARNING] - org.objectweb.asm.commons.RemappingSignatureAdapter
[WARNING] - org.objectweb.asm.commons.JSRInlinerAdapter$Instantiation
[WARNING] - org.objectweb.asm.commons.InstructionAdapter
[WARNING] - org.objectweb.asm.commons.SimpleRemapper
[WARNING] - org.objectweb.asm.commons.SerialVersionUIDAdder
[WARNING] - org.objectweb.asm.commons.LocalVariablesSorter
[WARNING] - org.objectweb.asm.commons.JSRInlinerAdapter
[WARNING] - org.objectweb.asm.commons.SerialVersionUIDAdder$Item
[WARNING] - 12 more...
[WARNING] bcprov-jdk15-1.45.jar, bcpkix-jdk15on-1.52.jar define 9 overlappping classes:
[WARNING] - org.bouncycastle.openssl.PasswordException
[WARNING] - org.bouncycastle.openssl.PEMUtilities
[WARNING] - org.bouncycastle.openssl.EncryptionException
[WARNING] - org.bouncycastle.voms.VOMSAttribute
[WARNING] - org.bouncycastle.openssl.PasswordFinder
[WARNING] - org.bouncycastle.voms.VOMSAttribute$FQAN
[WARNING] - org.bouncycastle.openssl.PEMException
[WARNING] - org.bouncycastle.mozilla.SignedPublicKeyAndChallenge
[WARNING] - org.bouncycastle.openssl.PEMWriter
[WARNING] bcprov-jdk15-1.45.jar, bcprov-jdk15on-1.52.jar define 927 overlappping classes:
[WARNING] - org.bouncycastle.crypto.modes.gcm.Tables8kGCMMultiplier
[WARNING] - org.bouncycastle.asn1.cmp.CRLAnnContent
[WARNING] - org.bouncycastle.i18n.MissingEntryException
[WARNING] - org.bouncycastle.asn1.tsp.TimeStampResp
[WARNING] - org.bouncycastle.asn1.pkcs.PBKDF2Params
[WARNING] - org.bouncycastle.asn1.x509.CRLNumber
[WARNING] - org.bouncycastle.asn1.x509.TBSCertList$1
[WARNING] - org.bouncycastle.asn1.ASN1SequenceParser
[WARNING] - org.bouncycastle.crypto.agreement.DHBasicAgreement
[WARNING] - org.bouncycastle.asn1.cmp.CertResponse
[WARNING] - 917 more...
[WARNING] maven-shade-plugin has detected that some .class files
[WARNING] are present in two or more JARs. When this happens, only
[WARNING] one single version of the class is copied in the uberjar.
[WARNING] Usually this is not harmful and you can skeep these
[WARNING] warnings, otherwise try to manually exclude artifacts
[WARNING] based on mvn dependency:tree -Ddetail=true and the above
[WARNING] output
[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin
[INFO] Attaching shaded artifact.
[INFO]
[INFO] --- maven-gpg-plugin:1.5:sign (sign-artifacts) @ jate ---
'gpg.exe'
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:56 min
[INFO] Finished at: 2017-03-05T17:30:25+08:00
[INFO] Final Memory: 137M/851M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.5:sign (sign-artifacts) on project jate: Exit code: 1 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException