
ternip's People

Contributors

cnorthwood, jo-fu, nova77

ternip's Issues

"Malformed rule expression" when running extras/terneval.py

When running extras/terneval.py, the following output is included:

....

chtb_245.eng.sgm
recognition 0.083
extent 0.0
normalisation 0.0

TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "../ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "../ternip/rules/normalisation/gutime-date.ruleblock:134:value", line 1, in <module>
File "../ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 179, in compute_offset_base
ref_m = int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''

chtb_183.eng.sgm
recognition 0.267
extent 0.0
normalisation 0.0

....

Does compute_offset_base() need to be called only after validating/sanitising the current reference date?
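
One possible shape for such a check, as a minimal sketch: `parse_ref_date` is a hypothetical helper, not part of TERNIP, and it assumes the reference date is a string beginning with YYYYMMDD. It returns None for an empty or malformed value so that a caller could skip the rule instead of hitting the ValueError shown above.

```python
import re

# Hypothetical helper, not part of TERNIP. Assumes reference dates are
# strings that start with eight digits in YYYYMMDD order.
_REF_DATE = re.compile(r'(\d{4})(\d{2})(\d{2})')


def parse_ref_date(ref_date):
    """Return (year, month, day) as ints, or None if ref_date is unusable."""
    if not ref_date:
        # Covers both None and the empty string seen in the traceback.
        return None
    match = _REF_DATE.match(ref_date)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year, month, day
```

A caller would test the result for None before doing any arithmetic on the month or day.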

Rules 116 and 117 (gutime-date.ruleblock) miss an integer cast

This causes an error while parsing.

More specifically:

date_to_dow(int(cur_context[:4]), int(cur_context[4:6]), (cur_context[6:8])) + 6)

should be

date_to_dow(int(cur_context[:4]), int(cur_context[4:6]), int(cur_context[6:8])) + 6)

TERNIP isn't PEP-8 compliant

This would involve API breakage to fix. It's probably worth doing though, just to make it easier for Python programmers to get used to.

DCT detection from filename

It's possible to extract the DCT (at day granularity) from filenames. Is this attempted?

From TimeBank:

VOA19980331.1700.1533.tml
WARNING: Could not determine document creation time, use -c to override
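
A rough sketch of what filename-based detection could look like. `dct_from_filename` is a hypothetical function, not an existing TERNIP API, and it assumes the filename contains a standalone eight-digit YYYYMMDD run as in the example above. Names without such a run (for example wsj_0918.tml) yield None.

```python
import re
from datetime import date

# Assumption: the date appears as exactly eight consecutive digits that are
# not part of a longer digit run.
_DATE_IN_NAME = re.compile(r'(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)')


def dct_from_filename(filename):
    """Return a YYYYMMDD string found in filename, or None if there is none."""
    for match in _DATE_IN_NAME.finditer(filename):
        year, month, day = (int(group) for group in match.groups())
        try:
            # Reject digit runs that are not real calendar dates.
            date(year, month, day)
        except ValueError:
            continue
        return '%04d%02d%02d' % (year, month, day)
    return None
```

Validating with `datetime.date` keeps arbitrary eight-digit identifiers from being mistaken for dates.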

Add docs to Read The Docs

As CI is now done using Travis rather than buildbot, docs are no longer pushed to pling.org.uk. We should set up docs to be published to Read the Docs (RTD).

TIMEX3 strings need hyphens

Dates should be formatted with hyphens separating years and months/weeks from their sub-parts.

e.g:

199     =>  199
1993    =>  1993
199307  =>  1993-07
BC0045  =>  BC0045
BC004508    =>  BC0045-08
BC00450829T16:00    =>  BC0045-08-29T16:00
200401  =>  2004-01
20040101    =>  2004-01-01
20040101TNI     =>  2004-01-01TNI
20040101T1802   =>  2004-01-01T1802
200X04  =>  200X-04
2003W32     =>  2003-W32
2007W325    =>  2007-W32-5
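
The mapping above could be implemented along these lines. This is a sketch only: `hyphenate` is a hypothetical name, and the pattern is derived from the listed examples rather than from the TIMEX3 specification. Values that do not fit the pattern are returned unchanged.

```python
import re

# Pattern inferred from the examples above: an optional BC prefix, a
# four-character year (digits or X), then either a week (Wnn) with an
# optional day-of-week digit, or a month with an optional day, then an
# optional time part starting with T.
_COMPACT = re.compile(
    r'^(?P<year>(?:BC)?[0-9X]{4})'
    r'(?:(?P<week>W[0-9X]{2})(?P<dow>[0-9X])?'
    r'|(?P<month>[0-9X]{2})(?P<day>[0-9X]{2})?)?'
    r'(?P<time>T.*)?$'
)


def hyphenate(value):
    """Insert hyphens between the date components of a compact value."""
    match = _COMPACT.match(value)
    if match is None:
        # Not a compact date (e.g. '199' or an already hyphenated value).
        return value
    parts = [match.group('year')]
    for name in ('week', 'dow', 'month', 'day'):
        if match.group(name):
            parts.append(match.group(name))
    # The time part is appended as-is, without a separating hyphen.
    return '-'.join(parts) + (match.group('time') or '')
```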

"Malformed rule expression" in gutime-date.ruleblock:143

In TimeBank, wsj_0918.tml:

WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:143:value", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''

"Malformed rule expression" in gutime-date.ruleblock:102

From TimeBank, wsj_1011.tml

WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:102:value", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''

"Malformed rule expression" gutime-date.ruleblock:148

On TimeBank:

wsj_0612.tml
WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:148:value", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''

Document text is changed during processing

XML entities are inserted in place of literal characters: " becomes &quot; and so on. For example, in document WSJ910225-0066 in TimeBank:

Input:

spokesman Marlin Fitzwater said late yesterday that "the operation has been very successful."

TERNIP output:

spokesman Marlin Fitzwater said late <TIMEX3 tid="t23" type="DATE" value="19910224">yesterday</TIMEX3> 
that &quot;the operation has been very successful.&quot; 
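
For comparison, a small sketch of text escaping that leaves double quotes alone. It uses `xml.sax.saxutils.escape` from the standard library, which by default replaces only &, < and >. Whether TERNIP's serialisation step can be adapted this way is an assumption, and `escape_text_node` is a hypothetical name.

```python
from xml.sax.saxutils import escape


def escape_text_node(text):
    """Escape only the characters that must be escaped in XML text content."""
    # With no extra entities passed, escape() handles &, < and > and
    # leaves double quotes untouched.
    return escape(text)
```

Double quotes only need escaping inside attribute values, so text content can keep them verbatim.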

IndexError in _nodes_to_sents

When running ternip -t timeml -s APW19981205.0374.tml (from TimeBank), the following exception is thrown (also with APW19990312.0251.tml, APW19990122.0193.tml and some other APW* / CNN* docs):

Traceback (most recent call last):
File "/usr/local/bin/annotate_timex", line 76, in <module>
sents = doc.get_sents()
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 734, in get_sents
(nodesents, ndsents, i) = self._nodes_to_sents(self._xml_body, [], [(sent, []) for sent in nltk.tokenize.sent_tokenize(self._get_text(self._xml_body))], 0)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 661, in _nodes_to_sents
(done_sents, nondone_sents, senti) = self._nodes_to_sents(child, done_sents, nondone_sents, senti)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 651, in _nodes_to_sents
(sent, snodes) = nondone_sents[0]
IndexError: list index out of range

Output doesn't handle non-ascii gracefully

From TAC_2010_KBP_Source_Data/data/2010/wb/eng-WL-11-174596-12957493.sgm (http://pastebin.com/Wz2QKEAZ):

Traceback (most recent call last):
File "/usr/local/bin/annotate_timex", line 154, in <module>
print str(doc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 662: ordinal not in range(128)
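
One way to avoid depending on the default codec is to encode explicitly before writing. The sketch below is in Python 3 syntax, whereas the traceback comes from Python 2.6, and `write_utf8` is a hypothetical helper rather than anything in annotate_timex.

```python
import io


def write_utf8(stream, text):
    """Write text to a binary stream as UTF-8 instead of the default codec."""
    stream.write(text.encode('utf-8'))


# U+201C is the character from the traceback above.
sample = u'\u201cquoted\u201d'
buffer = io.BytesIO()
write_utf8(buffer, sample)
```

In Python 3 the script could pass `sys.stdout.buffer` as the stream.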

Add option to normalise only

In order to separate the evaluation of recognition and normalisation performance, and allow the use of non-integrated timex recognisers, it would be useful to have a "normalise only" module, that uses existing TIMEX3/TIMEX2 annotation boundaries and only provides attributes for those elements.

"Error whilst attempting to add TIMEX"

From TimeBank:

wsj_0586.tml

TERNIP: WARNING: Error whilst attempting to add TIMEX
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 640, in reconcile
self._add_timex(timex, sents[i], s_nodes[i])
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 521, in _add_timex
self._add_timex_child(timex, sent, s_node, start, end)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 487, in _add_timex_child
raise nesting_error('Can not tag TIMEX (' + str(timex) + ') without causing invalid XML nesting')
nesting_error: Can not tag TIMEX (<ternip.timex.timex instance at 0x235eb00>) without causing invalid XML nesting

when called with annotate_timex -s -t timeml

GATE PR doesn't separate tokens in the expected way

GATE's ANNIE tokeniser splits on different boundaries to TERNIP's (NLTK). This can cause many TERNIP rules to not match. For example,

nltk.word_tokenize('Example 31/12/2010 text.')
['Example', '31/12/2010', 'text', '.']

NLTK places the dd/mm/yyyy date into a single token, whereas ANNIE gives us a SpaceToken, followed by tokens of '31', '/', '12', '/', '2010', and another SpaceToken.

This should be fixed in the GATE plugin (the preprocessing/postprocessing JAPE), so that the ANNIE Tokeniser's output can be mapped slightly more closely to the results of the NLTK tokeniser. It may also help to specify (if not already done) the tokenisation scheme that NLTK expects, to help in other situations where the upstream tokeniser is switched out from the default.
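
To illustrate the kind of mapping meant here, a Python sketch (the real fix would live in JAPE). It assumes tokens arrive as (kind, text) pairs with kinds 'Token' and 'SpaceToken', which is a simplified, hypothetical stand-in for GATE annotations. Runs of tokens between spaces are re-joined only when the result looks like a slash-separated date.

```python
import re

# Assumption: only d/m/yyyy style dates are merged; other runs stay split.
_DATE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')


def merge_date_tokens(tokens):
    """Merge adjacent non-space tokens that together form a slash date."""
    merged = []
    run = []  # non-space tokens seen since the last SpaceToken

    def flush():
        joined = ''.join(text for _, text in run)
        if _DATE.match(joined):
            merged.append(('Token', joined))
        else:
            merged.extend(run)
        del run[:]

    for kind, text in tokens:
        if kind == 'SpaceToken':
            flush()
            merged.append((kind, text))
        else:
            run.append((kind, text))
    flush()
    return merged
```

A date immediately followed by punctuation, with no space in between, would not be merged by this sketch.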

"Malformed rule expression" with gutime-date.ruleblock:89

From TimeBank, wsj_0751.tml:

WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:89:value", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''

"Malformed rule expression" in gutime-date.ruleblock:149

TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:149:value", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 193, in compute_offset_base
t = day - date_functions.date_to_dow(int(ref_date[:4]), int(ref_date[4:6]), int(ref_date[6:8]))
ValueError: invalid literal for int() with base 10: ''

in TimeBank's wsj_0586.tml.

"Malformed rule expression" gutime-date.ruleblock:99

TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:99:value", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 193, in compute_offset_base
t = day - date_functions.date_to_dow(int(ref_date[:4]), int(ref_date[4:6]), int(ref_date[6:8]))
ValueError: invalid literal for int() with base 10: ''
