cleartk / cleartk Goto Github PK

View Code? Open in Web Editor NEW

128.0 128.0 58.0 519.09 MB

Machine learning components for Apache UIMA

Home Page: http://cleartk.github.io/cleartk/

License: Other

Java 93.35% C 2.78% C++ 1.50% HTML 0.10% TeX 1.59% Shell 0.03% Roff 0.65%

cleartk's Introduction

ClearTK

Introduction

ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA. It is developed by the Center for Computational Language and Education Research (CLEAR) at the University of Colorado at Boulder.

ClearTK is built with Maven and we recommend that you build your project that depends on ClearTK with Maven. This will allow you to add dependencies for only the parts of ClearTK that you are interested and automatically pull in only those dependencies that those parts depend on. The zip file you have downloaded is provided as a convenience to those who are unable to build with Maven. It provides jar files for each of the sub-projects of ClearTK as well as all the dependencies that each of those sub-projects uses. To use ClearTK in your Java project, simply add all of these jar files to your classpath. If you are only interested in one (or a few) sub-project of ClearTK, then you may not want to add every jar file provided here. Please consult the maven build files to determine which jar files are required for the parts of ClearTK you want to use.

Please see the section titled "Dependencies" below for important licensing information.

License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University of Colorado at Boulder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Dependencies

ClearTK depends on a variety of different open source libraries that are redistributed here subject to the respective licensing terms provided by each library. We have been careful to use only libraries that are commercially friendly. Please see the notes below for exceptions. For a complete listing of the dependencies and their respective licenses please see the file licenses/index.html.

GPL Dependencies

ClearTK has two sub-projects that depend on GPL licensed libraries:

cleartk-syntax-berkeley
cleartk-stanford-corenlp Neither of these projects nor their dependencies are provided in this release. To obtain these projects, please manually download them from our googlecode hosted maven repository:

http://cleartk.googlecode.com/svn/repo/org/cleartk/cleartk-syntax-berkeley/ http://cleartk.googlecode.com/svn/repo/org/cleartk/cleartk-stanford-corenlp/

SVMLIGHT

ClearTK also has two projects called cleartk-ml-svmlight and cleartk-ml-tksvmlight which have special licensing considerations. The ClearTK project does not redistribute SVMlight. ClearTK does, however, facilitate the building of SVMlight models via the ClassifierBuilder interface. In order to use the implementations of this interface to good effect you will need to have SVMlight installed on your machine. The ClassifierBuilders for SVMlight simply call the executable "svm_learn" provided by the SVMlight distribution. ClearTK does not use SVMlight at classification time - it only uses the models that are build by SVMlight. Instead, ClearTK provides its own code for classification that makes use of an SVMlight generated model. This code is provided with ClearTK and is available with the above BSD license as is all of the other code written for ClearTK. Therefore, be advised that while ClearTK is not required (or compelled) to redistribute the code or license of SVMlight or to comply with it (i.e. the noncommercial license provided by SVMlight is not compatible with our BSD License) - it would be very difficult to use the SVMlight wrappers we provide in a commercial setting without obtaining a license for SVMlight directly from its authors.

LGPL

The cleartk-ml-mallet project depends on Mallet (http://mallet.cs.umass.edu/), which depends on trove4j (http://trove.starlight-systems.com/), which is released under the LGPL license. If you do not need Mallet classifiers and would like to avoid the LGPL license, you can omit the cleartk-ml-mallet dependency.

cleartk's People

Contributors

Stargazers

Watchers

cleartk's Issues

add factory method to Feature

Original issue 45 created by ClearTK on 2009-02-06T16:23:21.000Z:

I want a factory method that takes a Feature and creates a new feature with
an updated name. I am going to add a method with the following feature:

public static Feature createFeature(String namePrefix, Feature feature)

which returns:

return new Feature(createName(namePrefix, feature.name), feature.value);

part-of-speech tagging infrastructure

Original issue 31 created by ClearTK on 2009-01-19T18:18:07.000Z:

we need to solidify the part-of-speech tagging functionality so that you
don't have to resort to using the example code or having to do it all
yourself if you want a specialized part-of-speech tagger. I have a bunch
of code that I wrote for an old version of ClearTK that I had in a separate
project that supported the results in our LREC workshop paper. I am in the
process of refactoring this code and adding it to the ClearTK code base.

Move TestsUtil component creation methods into src

Original issue 12 created by ClearTK on 2008-12-05T19:28:49.000Z:

Please use labels and text to provide additional information.

There are a number of use cases for creating analysis engines, collection
readers, etc. outside of unit testing. In fact, any time you are going to
create and run components without using descriptor files - these methods
will come in handy.

We should put the following methods into UIMAUtil:

getAnalysisEngine
getCollectionReader
getTypeSystem
setConfigurationParameters

unit tests that contain material that should not be redistributed

Original issue 7 created by ClearTK on 2008-12-05T18:21:14.000Z:

The following unit tests have data in the text of the .java file that
should not be redistributed (e.g. song lyrics, treebank data, etc.) These
unit tests are only available on the old repository and either need to be
refactored or replaced.

Parent directory
/ClearTK/test/src/

Affected Files
org/cleartk/classifier/encoder/features/string/StringFeatureEncoderTests.java

Document expectations for _InitialView

Original issue 11 created by ClearTK on 2008-12-05T19:20:24.000Z:

Please use labels and text to provide additional information.

We need to document our expectations for what goes into _InitialView:
basically plain text with no markup, no escaping, etc. It might also be
useful to include an overview of what other views we include, e.g.
TreebankView, PropbankView, etc.

TreebankFormatParser makes poor assumption concerning sentence splitting

Original issue 50 created by ClearTK on 2009-02-09T23:10:58.000Z:

The method TreebankFormatParser.parseDocument calls splitSentences which
attempts to break down the document into single sentence parses. This
method assumes that sentences are delimited by "( (S" as they are in the
PTB. I have some data that has one parsed sentence per line where each
sentence start with "(S1 (S". The method splitSentences is not splitting
these files up into sentence-level parse strings. Instead it passes in the
whole file to parse(String, String, int) which returns a single
TopTreebankNode with child nodes corresponding to the parse of the last
sentence of the file.

I think this is a fairly common format that we should handle.

make isSequential available through ClassifierAnnotator

Original issue 30 created by ClearTK on 2009-01-15T17:32:46.000Z:

When you are going to use a non-sequential tagger for a sequential tagging
task (e.g. BIO chunking) then you are going to want to use previous labels
as features to pass into the classifier. This precludes using
consumeAll(List<Instance>) because you need to classify each instance one
at a time so that feature extraction has access to labels already assigned.
This is in general not necessary for data writers and more generally it
should not be part of the instance consumer interface. I would, however,
like to be able to cast my instance consumer and get isSequential.

Inverse document frequency component

Original issue 36 created by ClearTK on 2009-01-23T19:03:55.000Z:

Please use labels and text to provide additional information.

We should have a component that can calculate inverse document frequencies
(IDFs) over a corpus.

Make sure all descriptors are tested

Original issue 47 created by ClearTK on 2009-02-06T21:48:42.000Z:

As of r224 the following descriptors are not being tested anywhere:

"org.cleartk.corpus.ace2005.Ace2005GoldAnnotator"
"org.cleartk.corpus.ace2005.Ace2005GoldReader"
"org.cleartk.corpus.ace2005.Ace2005Writer"
"org.cleartk.corpus.conll2003.Conll2003GoldReader"
"org.cleartk.corpus.penntreebank.PennTreebankReader"
"org.cleartk.srl.conll2005.Conll2005GoldAnnotator"
"org.cleartk.srl.conll2005.Conll2005GoldReader"
"org.cleartk.srl.propbank.PropbankGoldAnnotator"
"org.cleartk.srl.propbank.PropbankGoldReader"
"org.cleartk.srl.propbank.TreebankPropbankGoldAnnotator"

See the output of DescriptorCoverageTests.

Test SVMlight kernels

Original issue 16 created by ClearTK on 2008-12-05T20:01:32.000Z:

Please use labels and text to provide additional information.

We currently test the linear and RBF kernels for SVMlight. Since we're
using Philipp's implementations, we should also test the polynomial kernel
and the sigmoid kernel.

Test parameter type checking in ClassifierAnnotator

Original issue 1 created by ClearTK on 2008-12-05T17:15:36.000Z:

What steps will reproduce the problem?

ReflectionUtil.getTypeParameterClass(Class) is used.
It only works for classes of the form class X extends Y<Z>

What is the expected output? What do you see instead?

ReflectionUtil.getTypeParameterClass(Class, Class) should be used everywhere.

Please use labels and text to provide additional information.

We need to write some tests where we mix a Classifier and a
ClassifierAnnotator with different parameter types, and make sure we're
throwing exceptions correctly.

license/copyright statement in descriptor files

Original issue 29 created by ClearTK on 2009-01-09T20:29:59.000Z:

I just noticed a descriptor file (SentencesAndTokens.xml) that has an
out-of-date license statement that points to the "for research only"
license. This needs to be replaced.

Replace Julie types in TypePathExtractor with ClearTK types

Original issue 14 created by ClearTK on 2008-12-05T19:52:00.000Z:

Please use labels and text to provide additional information.

We should remove our dependency on the Julie type system - we should be
able to exercise TypePathExtractor (the only test that uses the Julie type
system) well enough using just the ClearTK types.

LineWriter should be more flexible wrt blocks

Original issue 32 created by ClearTK on 2009-01-20T18:38:47.000Z:

I have a use case where I want something other than blank lines betwixt my
blocks when using LineWriter. In my case I want to print out the document
id at the beginning. I propose to add a BlockWriter interface that allows
one to write out whatever one wants at the top of each block. The default
behavior will still be to print out a newline at each block.

provide instructions for recreating the dog-fox-parser model

Original issue 28 created by ClearTK on 2008-12-22T18:07:09.000Z:

we have the model checked in and it works but there are not instructions on
how to recreate the model if we ever needed to. Basically, the parsed
sentence is provided in the unit tests that make use of the model. This
needs to be put in a separate data file repeated N times and then passed
into the appropriate model builder - probably a direct call to a main
method of an OpenNLP class.

Introduction documentation feedback

Original issue 38 created by ClearTK on 2009-01-30T22:06:22.000Z:

I received the following feedback from a developer new to UIMA, NLP, and
machine learning - though otherwise very sharp. I think it would be
worthwhile to address all of the points that he makes.

<begin-message>
I looked at the wiki earlier and was a bit overwhelmed
and abandoned it as my first source of information.
After a short conversation with Kevin, it makes a bit
more sense, but I think it would benefit from a paragraph
or two of introduction. I may be on the edge of the expected
audience, but following are some questions I have after looking
at the main page and the main wiki page. A better introduction
on googleCode may not answer them all.

BTW, Do we have a book in the lab library that introduces machine
learning in the context of NLP? I've read Jackson and Moulinier.

http://code.google.com/p/cleartk/

"...feature extraction library" ...like what? POS, named entity,
misc relationships?

" ...wrappers..." UIMA wrappers?
I'd like to learn more about maximum entropy, support vector machines
and conditional random fields, but wouldn't expect that from a ClearTK
intro.

...also sequential taggers, chunkers, role labelling and temporal
resolution.

Where does the name come from? (certainly not Tcl/Tk)

http://code.google.com/p/cleartk/w/list

What's a classifier? ...I'm guessing you could use one
to do tagging in UIMA.

-What's the Maxent classifier and how is it different
than the POS tagger?

-What's a chunk tokenizer and how is that different from
other kinds of tokenizers.

How is ClearTK both a pos tagger and these other things?

introduce CleartkTest class?

Original issue 26 created by ClearTK on 2008-12-12T18:42:29.000Z:

creating a JCas is fairly expensive (my tests indicate about .1 seconds).
So, each unit test should create at most one jcas and then call reset
instead of calling newJCas.

In fact, it may be possible to create a single JCas that gets kept across
all unit tests using a static ThreadLocal - one for each type system used
in the unit tests.

Window Extractor improvements

Original issue 19 created by ClearTK on 2008-12-05T21:35:12.000Z:

Here is a quick sketch of what we expect the window extractor to do:

Given a focus annotation, extract features on either side of or within the
focus annotation such that you can specify (optionally) a boundary past
which the extractor will not go. This will be referred to as boundary
conditions below. Additionally, you should be able to specify which
annotations to examine inside the boundary (all, 0-3, 2-5, etc.) This will
be referred to as start and end annotations below. The following outlines
the different possibilities:

Start annotation:

not specified (default 0)
specified (e.g. 1, 2, 3...)

End annotation:

not specified (examine all elgible annotations past the start annotation)
specified (e.g. 1, 2, 3, ...)

Boundary conditions:

not specified (this condition will allow WindowExtractor to be a
SimpleFeatureExtractor by default - default values will be 0 and
document.size()
character offsets
a single annotation (e.g. a sentence, paragraph, etc.)
multiple annotations (e.g. current sentence + preceding sentence)

The current implementation requires that you specify the start and end
annotation and use a single annotation as the boundary condition.

Similar issues almost certainly exist for WindowNGramExtractor

Remove constituentParse from Sentence type

Original issue 2 created by ClearTK on 2008-12-05T17:34:20.000Z:

What steps will reproduce the problem?

Anything using the constituentParse attribute of the Sentence type

What is the expected output? What do you see instead?

The corresponding TopTreebankNode should be selected using
AnnotationRetrieval instead.

Please use labels and text to provide additional information.

We should keep the requirements of our Sentence type as minimal as possible
so that in the not so distant future, we can allow people to use their own
Sentence types.

Document view names for all AnalysisEngines

Original issue 10 created by ClearTK on 2008-12-05T19:13:19.000Z:

Please use labels and text to provide additional information.

We need to document the names of all views that each AnalysisEngine reads
from or writes to. For example, TreebankGoldAnnotator reads from
"TreebankView" and writes to "_InitialView", but doesn't document this
anywhere. This is crucial for users that want to use our annotators with
other views, e.g. through view mapping:

http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.aae.logging

Remove deprecated code from FeatureProliferatorTests

Original issue 9 created by ClearTK on 2008-12-05T18:52:31.000Z:

There is deprecated code in
/ClearTK/test/src/org/cleartk/classifier/feature/proliferate/FeatureProliferatorTests.java
that should be removed. The warnings are currently suppressed.

DocumentUtil.createDocument should be over-ridable

Original issue 42 created by ClearTK on 2009-02-06T15:51:34.000Z:

We have introduced a nasty and unnecessary dependency on our type system by
using the DocumentUtil class in all of our collection readers and various
places elsewhere. I noticed this most recently when I was trying to use
the XWriter and XReader in a different project and was not wanting anything
to do with the ClearTK type system. We should probably re-engineer this
such that one could implement their own createDocument and getDocument
methods that are specific to a different type system.

the build.xml should compile into the same directory as Eclipse

Original issue 22 created by ClearTK on 2008-12-05T22:08:46.000Z:

What is the expected output? What do you see instead?

The build.xml currently compiles into build/bin, while Eclipse compiles into bin. This makes the
interactions a bit more complicated when one uses both ant and Eclipse. Building in ant and
building in Eclipse should be equivalent.

Move descriptors beside their classes

Original issue 23 created by ClearTK on 2008-12-05T22:42:37.000Z:

Please use labels and text to provide additional information.

Descriptors should sit beside the corresponding java classes in the
hierarchy. They can then be imported directly from the .jar file.

All descriptor files should be on the classpath

Original issue 39 created by ClearTK on 2009-02-04T19:16:05.000Z:

ClearTK descriptors can be used from a .jar with XML like:

But this only works if we put PlainTextCollectionReader.xml in the
appropriate package folder. Philipp has moved some of the descriptor files
from "desc" to their appropriate packages, but we should do it for all of
them (e.g. PlainTextCollectionReader).

In the end, "desc" should be empty except for CPEs. All other descriptor
files should be in an appropriate package.

Document feature extraction library

Original issue 3 created by ClearTK on 2008-12-05T17:48:34.000Z:

Please use labels and text to provide additional information.

There is currently no documentation for the different feature extractors.
We need to both complete the JavaDocs, and, more importantly, give a more
tutorial-like overview of what is available and how it can be used.

Extra blank lines in LineWriter output with chunk-based tokenizer

Original issue 4 created by ClearTK on 2008-12-05T17:55:10.000Z:

What steps will reproduce the problem?

Use a LineWriter to write out tokens that were created by chunk-based
tokenizer

What is the expected output? What do you see instead?

There are blank lines in the output file where there shouldn't be.

Please use labels and text to provide additional information.

Running the same code using the default penn tokenizer, there are no blank
lines. Is the chunk-based tokenizer creating tokens that contain newlines?

OpenNLP parser wrapper integration

Original issue 41 created by ClearTK on 2009-02-06T15:04:00.000Z:

I have started integrating the opennlp.uima wrapper around the "chunking"
constituent parser so that it works with our type system. The class
opennlp.uima.parser.chunking.Parser provides a method createAnnotation
which provides an OpenNLP Parse object which must be converted into
annotations which can be posted to the CAS. I have written a first pass at
a conversion between the opennlp.tools.parser.Parse object and our type
system in org.cleartk.syntax.treebank.util.TreebankNodeUtility (it probably
belongs elsewhere).

One problem with this approach is that opennlp.uima.parser.chunking.Parser
has been declared "final" which means we either need to modify the code for
our own purposes or we need to convince the authors that this class should
not be final (in progress).

http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp.uima/

Document command line training of classifiers

Original issue 15 created by ClearTK on 2008-12-05T19:55:39.000Z:

Please use labels and text to provide additional information.

We need to explain how Train and BuildJar can be used at the command line
(and perhaps what that's actually doing in the code).

Make RowNormalizingFeaturesEncoder more accessible

Original issue 48 created by ClearTK on 2009-02-08T22:24:17.000Z:

Right now, there's no way short of writing your own new EncoderFactory to
use a RowNormalizingFeaturesEncoder. SVMEncoderFactory should do what
ContextValueEncoderFactory and look at the UimaContext to determine whether
to use normalization or not.

unit test and javadoc AnnotationRetrieval.getOverlappingAnnotations()

Original issue 27 created by ClearTK on 2008-12-15T22:03:18.000Z:

I added a new method AnnotationRetrieval.getOverlappingAnnotations() which
seems to work for some code I am writing for my project - but it deserves
some javadoc comments and some unit tests.

TreebankAnnotation.xml and TreebankCoreference.xml should be considered for removal

Original issue 49 created by ClearTK on 2009-02-09T19:50:21.000Z:

I happened to notice two type system descriptor files that don't seem to
have any purpose in the org.cleartk.syntax.treebank.type package. My guess
is that they are there by mistake and can be safely removed.
TreebankCoreference isn't even in the TypeSystem.xml file

Remove all System.{out,err} uses

Original issue 5 created by ClearTK on 2008-12-05T18:14:52.000Z:

Please use labels and text to provide additional information.

To be more compatible with UIMA, we should use Loggers everywhere instead
of printing to System.out or System.err. Some information about how to do
this is here:

http://incubator.apache.org/uima/downloads/releaseDocs/2.2.2-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.aae.logging

Tutorial on writing a feature extractor

Original issue 17 created by ClearTK on 2008-12-05T20:25:38.000Z:

Please use labels and text to provide additional information.

We should write a tutorial on how to write a new feature extractor class,
demonstrating when a FeatureProliferator might be useful, etc.

AE's that use more than one view should say so in their descriptors

Original issue 51 created by ClearTK on 2009-02-10T00:26:44.000Z:

Take TreebankGoldAnnotator.xml as an example - it takes in one view
(TreebankView) that has treebank parses in it and puts the parsed treebank
node objects into another view (_InitialView). These need to be listed in
the capabilities section of the descriptor file. This makes them much more
flexible wrt to aggregate analysis engines where you are trying to map one
view to another.

PlainTextCollectionReader should take in a list of file names to process

Original issue 33 created by ClearTK on 2009-01-20T20:12:25.000Z:

When doing things like cross-validation, it is nice to be able to point the
PlainTextCollectionReader at a directory and provide a list of file names
that should be run (or several lists).

DocumentUtil.createDocument

Original issue 43 created by ClearTK on 2009-02-06T16:07:59.000Z:

I have been using aggregate analysis engines a fair bit recently and I've
noticed that when you change the view of a single-view analysis engine to
something other than _InitialView - you end up posting Document annotations
to the other view when you probably still (or also) want it in the initial
view.

I haven't thought this through carefully - but I think the solution is to
make sure that whenever createDocument is called that the Document
annotation gets added to the _InitialView. I'm not sure whether it should
also get added to the mapped view or not - I think putting it in both might
be preferred and shouldn't cause any problems.

Tests failing due to data redistribution issues

Original issue 24 created by ClearTK on 2008-12-06T00:42:43.000Z:

There are a number of tests that fail because we do not have (yet)
permission to redistribute the data that the tests rely on.

GeniaPOSParserTests
GeniaPosGoldReaderTests
TimeMLGoldAnnotatorTests
TreebankAligningAnnotatorTests

Use UIMA example OpenNLP wrappers

Original issue 20 created by ClearTK on 2008-12-05T21:49:49.000Z:

Please use labels and text to provide additional information.

Wrappers for OpenNLP are already provided with UIMA in the
examples/opennlp_wrappers directory. We should use these instead of
maintaining our own wrappers. We may still want to distribute descriptors
that use our type system.

TreebankFormatParser regex bug

Original issue 35 created by ClearTK on 2009-01-21T18:20:01.000Z:

The class:

org.cleartk.syntax.treebank.util.TreebankFormatParser

contains the following regex:

private static final Pattern nonwhiteSpaceCharPattern =
Pattern.compile("[^\s+]");

It looks like the + sign inside the square brackets is misplaced and is
causing problems with treebank data that used to parse. When I remove the
plus sign (I don't think it needs to be there as a quantifier either) then
my data parses (i.e. it doesn't barf the first time a token beginning with
a + sign appears.)

all config params should be read in using UIMAUtil

Original issue 34 created by ClearTK on 2009-01-20T20:43:32.000Z:

We should establish a best-practice that requires all configuration
parameters to be read using either
UIMAUtil.getDefaultingConfigParameterValue() or
UIMAUtil.getRequiredConfigParameterValue().

This makes the code more consistent and will avoid heart-breaking mishaps
like reading in an empty string when you are expecting a null (for example).

We should do a pass over all initialize methods and replace all occurrences
of context.getConfigParameterValue.

build-jar produces a very large jar file because it includes opennlp models

Original issue 25 created by ClearTK on 2008-12-12T17:47:30.000Z:

the default jar file created by the build script should not include the
large models provided by opennlp. otherwise, you end up with a >40MB jar file.

parameter names should be "fully-qualified"

Original issue 44 created by ClearTK on 2009-02-06T16:16:06.000Z:

I was looking through someone else's UIMA code and noticed that their
parameter names were "somewhat-qualified" wrt the package name you might
find them in. I would like to suggest that we name our parameters with
fully qualified names wrt the class they are defined (and documented) in.
For example,

"ViewName" defined by PlainTextCollectionReader.PARAM_VIEW_NAME should be
named: "org.cleartk.util.PlainTextCollectionReader:ViewName" - or something
simlar.

This will ensure that our parameter names our never ambiguous and provides
a clear way to find where the parameter is defined and documented.

Document how to create/use encoders

Original issue 6 created by ClearTK on 2008-12-05T18:18:59.000Z:

Please use labels and text to provide additional information.

We need a tutorial/overview of how FeaturesEncoder, OutcomeEncoder,
EncoderFactory, etc. work together.

Parameterize SnowballStemmer for Token type

Original issue 18 created by ClearTK on 2008-12-05T20:35:05.000Z:

Please use labels and text to provide additional information.

The snowball stemmer annotator should grow extra parameters that specify
the token type and the slot where to store the stem. Right now it only
works with the default token type, assuming the "stem" slot.

cleartk-opennnlp-tools should just be thin layers over opennlp.uima

Original issue 40 created by ClearTK on 2009-02-06T14:56:57.000Z:

The uima wrappers being developed by the opennlp folks are really nice.
Basically, we can rip out our lame postagger wrapper and replace it with a
descriptor file that specifies our type system.

See http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp.uima/

We will likely need to build the code ourselves for now keeping careful
track of which revision we are using and any changes we've made.

Use @ConfigurationParameter annotation and InitializeUtil.initialize

Original issue 21 created by ClearTK on 2008-12-05T21:53:49.000Z:

Please use labels and text to provide additional information.

We should only ever use UIMAUtil.getDefaultingConfigParameterValue() or
UIMAUtil.getRequiredConfigParameterValue() when getting component
parameters. This way if we decide we need to do extra checking (e.g. for
empty strings), we can do it all in one place.

TreebankGoldAnnotator should post annotations to Gold view - rather than initial view

Original issue 46 created by ClearTK on 2009-02-06T17:52:09.000Z:

for clarity - the TreebankGoldAnnotator should post treebank annotations to
a "GoldView" rather than the "_InitialView". This makes the names of the
views clearer and simplifies the understanding of the sofa mappings in
aggregate analysis engine descriptors. Generally, if you are using gold
standard data to evaluate system output - you shouldn't assume that the
gold-standard data is in the _InitialView - but rather in some special view

a "GoldView".

Descriptor parameters need javadocs

Original issue 8 created by ClearTK on 2008-12-05T18:26:15.000Z:

Please use labels and text to provide additional information.

For bad examples, see:

ClassifierAnnotator.PARAM_CLASSIFIER_JAR
DelegatingDataWriter.PARAM_DATA_WRITER
DataWriter_ImplBase
InstanceConsumer_ImplBase
MaxentDataWriter

For good examples see:

ChunkerHandler

Document requirements for using LIBSVM, SVMlight, etc.

Original issue 13 created by ClearTK on 2008-12-05T19:32:53.000Z:

Please use labels and text to provide additional information.

To train models the following binaries need to be on your path:

LIBSVM: svm-train
LIBLINEAR: train
SVMlight: svm_learn

We should document this somewhere.

cleartk / cleartk Goto Github PK

cleartk's Introduction

ClearTK

Introduction

License

Dependencies

GPL Dependencies

SVMLIGHT

LGPL

cleartk's People

Contributors

Stargazers

Watchers

Forkers

cleartk's Issues

Recommend Projects

Recommend Topics

Recommend Org