cnorthwood / ternip
Temporal Expression Recognition and Normalisation in Python
License: Other
When running extras/terneval.py, the following output is included:
....
chtb_245.eng.sgm
recognition 0.083
extent 0.0
normalisation 0.0
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "../ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "../ternip/rules/normalisation/gutime-date.ruleblock:134:value", line 1, in
File "../ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 179, in compute_offset_base
ref_m = int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''
chtb_183.eng.sgm
recognition 0.267
extent 0.0
normalisation 0.0
....
Does compute_offset_base() need to be called only after validating/sanitising the current reference date?
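One way to avoid the int('') failure above is to validate the reference date before slicing it. The following is only a sketch: parse_ref_date is a hypothetical helper, not part of TERNIP, and it assumes the reference date is a basic-format YYYYMMDD string (possibly followed by a time part).

```python
import re

# Hypothetical helper, not part of TERNIP. Assumes reference dates look like
# YYYYMMDD, optionally followed by more characters (e.g. a time part).
_REF_DATE = re.compile(r'^(\d{4})(\d{2})(\d{2})')


def parse_ref_date(ref_date):
    """Return (year, month, day) as ints, or None if ref_date is unusable."""
    if not ref_date:
        # Covers both None and the empty string seen in the traceback.
        return None
    match = _REF_DATE.match(ref_date)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

A caller could then skip the relative-date rule whenever parse_ref_date returns None, instead of letting int() raise inside the rule expression.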
..which means that we get an error while parsing.
More specifically:
date_to_dow(int(cur_context[:4]), int(cur_context[4:6]), (cur_context[6:8])) + 6)
should be
date_to_dow(int(cur_context[:4]), int(cur_context[4:6]), int(cur_context[6:8])) + 6)
This would involve API breakage to fix. It's probably worth doing though, just to make it easier for Python programmers to get used to.
It's possible to extract the DCT (at day granularity) from filenames. Is this attempted?
From TimeBank:
VOA19980331.1700.1533.tml
WARNING: Could not determine document creation time, use -c to override
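A minimal sketch of the filename idea, assuming the date appears as an isolated run of exactly eight digits in YYYYMMDD order (as in the VOA example). dct_from_filename is a hypothetical name, not an existing TERNIP function.

```python
import re
from datetime import date

# Exactly eight digits that are not part of a longer digit run.
_FILENAME_DATE = re.compile(r'(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)')


def dct_from_filename(filename):
    """Return a YYYYMMDD string found in filename, or None if there is none."""
    for year, month, day in _FILENAME_DATE.findall(filename):
        try:
            # Constructing a date rejects impossible values such as month 13.
            date(int(year), int(month), int(day))
        except ValueError:
            continue
        return year + month + day
    return None
```

Filenames without such a run, for example wsj_0918.tml, yield None, so the existing -c override would still be needed for them.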
As CI is now done using Travis rather than buildbot, docs aren't pushed to pling.org.uk anymore. We should set up docs to be published to Read the Docs (RTD).
Dates should be formatted with hyphens separating years and months/weeks from their sub-parts.
e.g:
199 => 199
1993 => 1993
199307 => 1993-07
BC0045 => BC0045
BC004508 => BC0045-08
BC00450829T16:00 => BC0045-08-29T16:00
200401 => 2004-01
20040101 => 2004-01-01
20040101TNI => 2004-01-01TNI
20040101T1802 => 2004-01-01T1802
200X04 => 200X-04
2003W32 => 2003-W32
2007W325 => 2007-W32-5
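The mappings listed above can be sketched with a single regular expression. This is an illustration under assumptions, not TERNIP's implementation: hyphenate_date is a hypothetical name, the year is taken to be four characters (digits or X) with an optional BC prefix, and anything from a T onwards is copied through unchanged.

```python
import re

_BASIC_DATE = re.compile(
    r'^(?P<year>(?:BC)?[0-9X]{4})'
    r'(?:(?P<week>W[0-9X]{2})(?P<dow>[0-9X])?'
    r'|(?P<month>[0-9X]{2})(?P<day>[0-9X]{2})?)?'
    r'(?P<time>T.*)?$'
)


def hyphenate_date(value):
    """Insert hyphens between the date parts; leave unrecognised values alone."""
    match = _BASIC_DATE.match(value)
    if match is None:
        return value
    parts = [match.group('year')]
    for name in ('week', 'dow', 'month', 'day'):
        if match.group(name):
            parts.append(match.group(name))
    # The time part, if any, is appended without a separating hyphen.
    return '-'.join(parts) + (match.group('time') or '')
```

Values that do not fit the pattern, such as the three-digit 199, are returned unchanged.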
In TimeBank, wsj_0918.tml:
WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:143:value", line 1, in
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''
From TimeBank, wsj_1011.tml
WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:102:value", line 1, in
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''
On TimeBank:
wsj_0612.tml
WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:148:value", line 1, in
ValueError: invalid literal for int() with base 10: ''
XML entities are replaced/inserted; " becomes &quot; and so on. For example, document WSJ910225-0066 in TimeBank:
Input:
spokesman Marlin Fitzwater said late yesterday that "the operation has been very successful."
TERNIP output:
spokesman Marlin Fitzwater said late <TIMEX3 tid="t23" type="DATE" value="19910224">yesterday</TIMEX3>
that &quot;the operation has been very successful.&quot;
When running ternip -t timeml -s APW19981205.0374.tml (from TimeBank), the following exception is thrown (also with APW19990312.0251.tml, APW19990122.0193.tml and some other APW* / CNN* docs):
Traceback (most recent call last):
File "/usr/local/bin/annotate_timex", line 76, in
sents = doc.get_sents()
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 734, in get_sents
(nodesents, ndsents, i) = self._nodes_to_sents(self._xml_body, [], [(sent, []) for sent in nltk.tokenize.sent_tokenize(self._get_text(self._xml_body))], 0)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 661, in _nodes_to_sents
(done_sents, nondone_sents, senti) = self._nodes_to_sents(child, done_sents, nondone_sents, senti)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 651, in _nodes_to_sents
(sent, snodes) = nondone_sents[0]
IndexError: list index out of range
When set to TimeML output, the "mod" attribute uses the TIMEX2 values "EARLY" and "LATE" instead of TIMEX3 values "START" and "END" (http://timeml.org/site/publications/timeMLdocs/timeml_1.2.1.html#timex3)
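A possible fix is to translate the value when writing TimeML. The sketch below covers only the two values named in this report; the mapping table and the timex3_mod name are assumptions for illustration, and any other value is passed through untouched.

```python
# Assumed mapping, limited to the two values mentioned in the report above.
TIMEX2_TO_TIMEX3_MOD = {
    'EARLY': 'START',
    'LATE': 'END',
}


def timex3_mod(mod):
    """Return the TIMEX3 form of a mod value, leaving unknown values as-is."""
    if mod is None:
        return None
    return TIMEX2_TO_TIMEX3_MOD.get(mod, mod)
```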
What license is this released under?
From TAC_2010_KBP_Source_Data/data/2010/wb/eng-WL-11-174596-12957493.sgm (http://pastebin.com/Wz2QKEAZ):
Traceback (most recent call last):
File "/usr/local/bin/annotate_timex", line 154, in
print str(doc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 662: ordinal not in range(128)
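The error arises because the text is converted with the default ASCII codec. A hedged sketch of the usual workaround is to encode explicitly as UTF-8 and write bytes. write_utf8 is a hypothetical helper, and UTF-8 is assumed to be the desired output encoding.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write text to a byte stream as UTF-8 instead of the default codec."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts bytes directly.
        stream = getattr(sys.stdout, 'buffer', sys.stdout)
    stream.write(text.encode('utf-8'))


# Demonstration with the character from the traceback (U+201C).
buffer = io.BytesIO()
write_utf8(u'\u201cquoted\u201d', buffer)
```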
In order to separate the evaluation of recognition and normalisation performance, and allow the use of non-integrated timex recognisers, it would be useful to have a "normalise only" module, that uses existing TIMEX3/TIMEX2 annotation boundaries and only provides attributes for those elements.
From TimeBank:
wsj_0586.tml
TERNIP: WARNING: Error whilst attempting to add TIMEX
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 640, in reconcile
self._add_timex(timex, sents[i], s_nodes[i])
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 521, in _add_timex
self._add_timex_child(timex, sent, s_node, start, end)
File "/usr/local/lib/python2.6/dist-packages/ternip/formats/xml_doc.py", line 487, in _add_timex_child
raise nesting_error('Can not tag TIMEX (' + str(timex) + ') without causing invalid XML nesting')
nesting_error: Can not tag TIMEX (<ternip.timex.timex instance at 0x235eb00>) without causing invalid XML nesting
when called with annotate_timex -s -t timeml
GATE's ANNIE tokeniser splits on different boundaries to TERNIP's (NLTK). This can cause many TERNIP rules to not match. For example,
nltk.word_tokenize('Example 31/12/2010 text.')
['Example', '31/12/2010', 'text', '.']
NLTK places the dd/mm/yyyy date into one token, whereas ANNIE gives a SpaceToken, followed by tokens of '31', '/', '12', '/', '2010', and another SpaceToken.
This should be fixed in the GATE plugin (the preprocessing/postprocessing JAPE), so that the ANNIE Tokeniser's output can be mapped slightly more closely to the results of the NLTK tokeniser. It may also help to specify (if not already done) the tokenisation scheme that NLTK expects, to help in other situations where the upstream tokeniser is switched out from the default.
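The remapping could look roughly like the following, shown in Python for illustration rather than JAPE. It assumes the ANNIE output has been reduced to (kind, text) pairs and only re-joins runs that look like a slash-separated date; both the representation and merge_date_tokens are hypothetical.

```python
import re

# Only slash-separated numeric dates are re-joined in this sketch.
_DATE_LIKE = re.compile(r'^\d{1,2}/\d{1,2}/\d{2,4}$')


def merge_date_tokens(tokens):
    """Merge tokens between SpaceTokens when together they form a date."""
    merged = []
    run = []

    def flush():
        joined = ''.join(run)
        if _DATE_LIKE.match(joined):
            merged.append(joined)
        else:
            merged.extend(run)
        del run[:]

    for kind, text in tokens:
        if kind == 'SpaceToken':
            flush()
        else:
            run.append(text)
    flush()
    return merged


annie_tokens = [
    ('Token', 'Example'), ('SpaceToken', ' '),
    ('Token', '31'), ('Token', '/'), ('Token', '12'),
    ('Token', '/'), ('Token', '2010'), ('SpaceToken', ' '),
    ('Token', 'text'), ('Token', '.'),
]
nltk_like = merge_date_tokens(annie_tokens)
```

For this input the result matches the NLTK tokenisation shown above. Other token shapes that TERNIP's rules depend on would need their own patterns.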
From TimeBank, wsj_0751.tml:
WARNING: Could not determine document creation time, use -c to override
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:89:value", line 1, in
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 203, in compute_offset_base
m = date_functions.month_to_num(match.group()) - int(ref_date[4:6])
ValueError: invalid literal for int() with base 10: ''
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:149:value", line 1, in
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 193, in compute_offset_base
t = day - date_functions.date_to_dow(int(ref_date[:4]), int(ref_date[4:6]), int(ref_date[6:8]))
ValueError: invalid literal for int() with base 10: ''
in TimeBank's wsj_0586.tml.
TERNIP: WARNING: Malformed rule expression
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_rule.py", line 139, in apply
timex.value = eval(self._value_exp)
File "/usr/local/lib/python2.6/dist-packages/ternip/rules/normalisation/gutime-date.ruleblock:99:value", line 1, in
File "/usr/local/lib/python2.6/dist-packages/ternip/rule_engine/normalisation_functions/relative_date_functions.py", line 193, in compute_offset_base
t = day - date_functions.date_to_dow(int(ref_date[:4]), int(ref_date[4:6]), int(ref_date[6:8]))
ValueError: invalid literal for int() with base 10: ''