universaldependencies / tools Goto Github PK

View Code? Open in Web Editor NEW

197.0 197.0 42.0 19.73 MB

Various utilities for processing the data.

License: GNU General Public License v2.0

Python 49.96% Perl 48.09% SystemVerilog 0.01% Shell 1.88% NewLisp 0.06%

tools's People

Contributors

Stargazers

Watchers

tools's Issues

warnings about punctuation and non-projectivity are wrong in validate.py

Take the sentence CF0883-1. After long discussion in LR-POR/cl-conllu#85. I believe that:

The script validate.py reports 3 warnings

[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-causes-nonproj] Punctuation must not cause non-projectivity of nodes [4, 23, 24]

token 4 is projective, so it doesn't make sense being reported as a token that became non-projective because of 22. Token 24 is a non-projective token but it is a punctuation, so it is already reported as punct-is-nonproj case below.

[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

token 21 should be included in the list.

[Line 30 Sent CF883-1 Node 24]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [22]

token 23 should be included in the list.

Validation for the "Copula is not AUX" constraint

The "Copula is not AUX" section of the syntax validation page (http://universaldependencies.org/svalidation.html#copula-is-not-aux) reports that UD_Greek has 437 hits.

Yet, when one clicks the "Go to search" link http://bionlp-www.utu.fi/dep_search/query?search=%28%21%28AUX%7CPRON%29%29+%3Ccop+_&db=UD_Greek-dev, all sentences contain AUX-tagged copulas (with typical verb features like VerbForm). Is this a bug?

Thanks.

Add a "goeswith" exception to some validation rules

The current validator emits a warning if the dependent of an advmod relation is different from all the UPOS: ADV, ADJ, CCONJ, PART and SYM except if the dependent is in a fixed construction.
I propose to add the same exception if the dependant is in a goeswith construction.

Some examples of invalid annotations that would become valid with this proposition:

3 examples in UD_English-GUM
3 examples in UD_French-GSD

Single-rooted trees

When preparing the Finnish release, we were bitten badly by the fact that the validator does not warn on forests. That is in the specs, so that's not a bug. Nevertheless, I added a --single-root argument which checks that the tree is single-rooted. Now the question is, whether this shouldn't be the default behavior, with the option to switch that check off if the analyses in some particular treebank are really meant to allow multiple roots.

TypeError: "validate.py" compatibility with Python 3.6

While using "validate.py" in UDtools to validate conllu-formatted GUM corpus, the following error message occurs to my computer:

"""
C:\Users\logan\Dropbox\GUM\UDtools>python validate.py --lang en ..\amir_gumdev_build\target\dep\ud\GUM_interview_ants.conllu
Traceback (most recent call last):
File "validate.py", line 735, in
validate(inp,out,args,tagsets,known_sent_ids)
File "validate.py", line 613, in validate
for comments,tree in trees(inp,tag_sets,args):
File "validate.py", line 74, in trees
for line_counter, line in enumerate(inp):
File "C:\Program Files\Python36\lib\codecs.py", line 644, in next
line = self.readline()
File "C:\Program Files\Python36\lib\codecs.py", line 557, in readline
data = self.read(readsize, firstline=True)
File "C:\Program Files\Python36\lib\codecs.py", line 499, in read
data = self.bytebuffer + newdata
TypeError: can't concat str to bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "validate.py", line 737, in
warn(u"Exception caught!",u"Format")
File "validate.py", line 49, in warn
print >> sys.stderr, (u"[%sLine %d]: %s"%(fn,curr_line,msg)).encode(args.err_enc)
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(, file=<output_stream>)"?
"""

I was using Python 3,6,4 with Windows 10 Home. The "conllu" file seems correctly-formatted. Interestingly, my adviser @amir-zeldes did not encounter this problem while running the same command on his computer.

I tried my luck with Python 2.7.14 on my own and it also ran through the document successfully. Not sure why it fails with bytes/strings issue.

Thank you for your support.

Best,
Logan

Validation rules for the det relation

In the current validator, the UPOS constraint rule about advmod dependant is not applied if this dependant is in a fixed construction.
I Propose to add the same exception to other relations and in particular to the det relation.

In several French corpora, n'importe quel is annotated as 3 tokens with fixed relations and it is used as a determiner (see Grew-match).
This construction is invalid and would be valid with the proposition above.

Validation rule for Foreign feature

Using validate.py for some French data, I had the following error:

[L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS X in language [br]

for the CoNLL line:

11	maen	maen	X	_	Foreign=Yes	10	appos	_	Lang=br|SpaceAfter=No

I think it would be sensible to allow the feature Foreign=Yes on the X tag whatever is the language.

validate.py complains about combining macrons

In UD_Coptic-Scriptorium we have combining macrons in the MISC field, indicating supra-linear strokes added to characters in the manuscript forms (a MISC annotation called orig, containing unnormalized word forms):

[Line 149 Sent shenoute_a22-MONB_YA_421_428_s0002]: Unicode not normalized: MISC.character[6] is COMBINING MACRON, should be COMBINING DOT BELOW

Why are combining macrons bad unicode? There are no combined glyphs for these characters. Why should they be DOT BELOW instead?

eval.py doesn't work on current UD treebanks

If I use eval.py to compare a gold treebank against itself, it does not work. For example, a current git repo of English EWT has the following error.

Actually, it's unclear to me if this is a problem with the eval script or the treebank... is it okay for a copy node to have no dependency information like this?

[john@localhost udtools]$ python3 eval.py ~/extern_data/ud2/git/UD_English-EWT/en_ewt-ud-train.conllu ~/extern_data/ud2/git/UD_English-EWT/en_ewt-ud-train.conllu
Traceback (most recent call last):
  File "eval.py", line 705, in <module>
    main()
  File "eval.py", line 670, in main
    evaluation = evaluate_wrapper(args)
  File "eval.py", line 650, in evaluate_wrapper
    gold_ud = load_conllu_file(args.gold_file,treebank_type)
  File "eval.py", line 637, in load_conllu_file
    return load_conllu(_file,treebank_type)
  File "eval.py", line 357, in load_conllu
    raise UDError("The collapsed CoNLL-U line still contains empty nodes: {}".format(_encode(line)))
__main__.UDError: The collapsed CoNLL-U line still contains empty nodes: 8.1    reported        report  VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     _       _       5:conj:and      CopyOf=5

Missing NumType=Range from feat_val?

Not sure if this an oversight or not, but I think NumType=Range (from here) is missing from feat_val.ud.

Can't locate namespace/autoclean.pm in @INC

Hi, I want to try the perl script enhanced_collapse_empty_nodes.pl with the following command:
perl enhanced_collapse_empty_nodes.pl ../train-dev/UD_English-EWT/en_ewt-ud-train.conllu > en_ewt-ud-train-collapsed.conllu
But I get the following error:

Can't locate namespace/autoclean.pm in @INC (you may need to install the namespace::autoclean module) (@INC contains: /home/wangxy/workspace/iwpt2020/tools /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /home/wangxy/workspace/iwpt2020/tools/Graph.pm line 4.
BEGIN failed--compilation aborted at /home/wangxy/workspace/iwpt2020/tools/Graph.pm line 4.
Compilation failed in require at enhanced_collapse_empty_nodes.pl line 38.
BEGIN failed--compilation aborted at enhanced_collapse_empty_nodes.pl line 38.

Any suggestion for this? Thank you.

Problem with conllu_to_conll.pl and restore_conllu_lines.pl files

Hello,
I think there is a bug with conllu_to_conll.pl and restore_conllu_lines.pl. Here is the code that I run for Swedish:
perl conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conll
Then I convert it back to 'conllu' format:
perl restore_conllu_lines.pl sv_talbanken-ud-test.conll sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conllu.merged

Then, I run the UD official evaluation script for "sv_talbanken-ud-test.conllu.merged" and "sv_talbanken-ud-test.conllu", but the code crashed with the following error:

main.UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: 'kbasbeloppetvidsamma' and system file: '_kbasbeloppetvidsamm'

The same thing happened with "tr-imst-ud-test.conllu" and "ru_syntagrus-ud-test.conllu".

Enhanced representation validation

I prepared the enhanced representation for ru_syntagrus UD2.0 release, but unfortunately I had to abandon it because validation fails on fractional ids in DEPS field.
Did I miss something important or did I find a bug?

I have these errors:
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Failed to parse DEPS: E4.2:case

But the data looks ok to me:
http://paste2.org/6w2WmvG0

broken link to the tools readme

Broken link to
https://github.com/UniversalDependencies/tools/blob/master/README.txt
on page https://universaldependencies.org/release_checklist.html#validation

It seems that it should be https://github.com/UniversalDependencies/tools/blob/master/README.md

Adding new auxiliary verb by specify_auxiliary.pl

I want to add word to auxiliary verb list for Japanese UDs for Modern and Spoken Japanese. (#71 )
I checked below site, but i cannot find how to add new auxiliary verb.
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_auxiliary.pl
Do you have any plans on adding new auxiliary verb the site in the future? @masayu-a

Validation requirements for a treebank to be released in 2.5

Following the discussion at the end of UDW 2019 in Paris, I tried to put together a proposal of the validation vs. release policy for the upcoming releases. The goal is to be able to add new tests and find more guideline violations, but without having to kick out older treebanks that do not pass the stricter tests (some of them are no longer maintained and there is no one who could fix the bugs soon; others have too many bugs and fixing them will take a lot of time).

The full proposal is currently available here and comments are welcome. In a nutshell: if a treebank was valid and released in UD 2.3, it can stay in the upcoming releases without passing tests that were added after UD 2.3. Newer treebanks must pass all tests that exist when the treebank is released for the first time.

I have modified the online validation page to reflect the proposal and identify treebanks with legacy status. There are 6 old treebanks that contain errors which were not tolerated even in UD 2.3 (that means, these errors were introduced in UD 2.4 and slipped attention of the release team). Errors of this type must be fixed before UD 2.5. The treebanks are Croatian-SET (@nljubesi), English-EWT (@manning @sebschu), French-Spoken (@sylvainkahane), Norwegian-Bokmaal, Norwegian-NynorskLIA (@liljao), Serbian-SET (@tsamardzic).

4 treebanks were released in UD 2.4 for the first time but contained errors that were already checked at that time. Hence I think they are not really legacy treebanks (the only reason why they made it into the release was that we ignored some error messages in order to save older treebanks). (Disclaimer: I’m actually looking at the current report, so it is possible that the errors were not there at release time and were introduced later.) The treebanks are Classical_Chinese-Kyoto (@KoichiYasuoka), German-HDT (@akoehn @EmanuelUHH), German-LIT (@a-salomoni), Old_Russian-RNC (@olesar).

Finally, issues are also reported for 4 new treebanks: Bhojpuri-BHTB (@shashwatup9k), Chinese-GSDSimp (@qipeng), Skolt_Sami-Giellagas (@rueter), Swiss_German-UZH (@noe-eva).

What do people think about this?

Python 3 support?

Consider https://pythonhosted.org/six/ .

Not all multiword tokens matched in restore_conllu_lines.pl

Should be m/^\d+-/

tools/restore_conllu_lines.pl

Line 26 in 3dc383c

m/^\d-/)

validate.py does not pick up presentential comments

Udpipe complains if there is a comment in front of a sentence, but the validator doesn't pick this up. Is this an issue with UDpipe (e.g. does the format allow pre-sentential comments) or is it an issue with the validator ? Here is an example:

# can't find this in common voice file
# sent_id = 349
# text = Uico a rak rang tuk.
# text_en = The dog was very fast.
1       Uico    uico    NOUN    _       _       4       nsubj   _       dog
2       a       a       PRON    _       Number=Sing|Person=3    4       expl    _       3SG
3       rak     rak     PART    _       _       4       discourse       _       PERF
4       rang    rang    VERB    _       _       0       root    _       fast
5       tuk     tuk     ADV     _       _       4       advmod  _       very|SpaceAfter=No
6       .       PUNCT   PUNCT   _       _       4       punct   _       _

# can't find this in common voice file
# sent_id = 350
# text = Mei a vun sen.
# text_en = The light immediately turned red.
1       Mei     mei     NOUN    _       _       4       nsubj   _       light
2       a       a       PRON    _       Number=Sing|Person=3    4       expl    _       3SG
3       vun     vun     ADV     _       _       4       advmod  _       immediately
4       sen     sen     VERB    _       _       0       root    _       red|SpaceAfter=No
5       .       PUNCT   PUNCT   _       _       4       punct   _       _

Here is the output from UDpipe:

[Line 1814 Sent 214]: [L1 Format misplaced-comment] Spurious comment line. Comments are only allowed before a sentence.

Report sent_id in errors

[Tree number 4093 on line 86571]: Mismatch between the text attribute and the FORM field.

You cat *.conllu to pipe to validate.py to check for duplicate ids and whatnot. After the script is done with the first file, line numbers become non-informative. # sent_ids would reliably id the erroneous sentence.

Is there a pipeline to convert from eaf to conllu, parse and put data back into eaf?

I'am looking for the easiest way to take an .eaf file (ELAN annotation file), parse relevant layers (transcripts of utterances) with certain UD parser, and insert annotations back into eaf (as additional layers).
Any suggestions about existing pipelines or just some pieces of the pipeline would be helpful.

Test that SpaceAfter=No corresponds with the text

@fginter: This is already fixed in fc5b73f9956
(I planned a pull request, but have forgotten and did direct push to master, sorry.)

@dan-zeman was concerned we should not make tests more strict few days before data freeze.
So please, consider adding --no-space-after to the script which generates http://universaldependencies.org/validation.html
Alternatively, I can provide a script which fixes the SpaceAfter=No based on the text.

Is there a script that can convert from PTB to UD in CoNLL-U style?

Hi,

Is there a script within this repository that can convert from PTB to UD in CoNLL-U style?
Thanks.

empty-node-without-dependent.conllu

Currently, runtests.sh fails on empty-node-without-dependent.conllu.
This is not ideal as we plan to publish this tools repo within the UD 2.0 release.

I can fix this, but I would prefer to test this within svalidation (rather than within validate.py).
First, it would be easier to implement there (and validate.py is terrible to maintain even now without adding more tests).
Second, I consider this a linguistic decision, we may want to keep in UD treebanks, but not a strict requirement of the CoNLL-U format.

Relaxed criteria

Allow a basic CoNLL-U format check without the extra character set and symbol list restrictions imposed by UD.

Two new lines result in multiple root error in iwpt20_xud_eval

tools/iwpt20_xud_eval.py

Line 344 in 925b829

raise UDError("There are multiple roots in a sentence")

If sentences are separated by two new lines instead of one, the script fails on this line because it finds that there is not exactly one root: there is none. Running into this error for this is a bit confusing, maybe two new lines should be caught beforehand, or the error message could be updated.

Is conllu-stats.py Python 3 compatible?

Because of validate.py I updated my system to Python 3. After that conllu-stats.py started to fail with this error message:

  File "conllu-stats.py", line 132
    print json.dumps(d)
             ^
SyntaxError: invalid syntax

Does this mean conllu-stats.py still need Python 2.7? Is there a way I can avoid needing two different Python versions to make UD release?

add readme in tools/data whether the stata are for training sets only or for all data

Users such as in shared tasks will need to know whether the stats and label sets in the tools/data folder are for just the training data or for all treebank data. A readme should be added.

No language specific subtyping?

We get a validate.py errors complaining about various relations. I thought these were allowed?

Incorrect string comparison in `check_files.pl`

tools/check_files.pl

Line 903 in a7cf45e

elsif(!defined($correct) && $claimed < $current_release)

This line validates treebank version numbers using orthographic comparison, which will no longer work properly now that there exist treebanks like Ancient Hebrew-PTNK which have Data available since: 2.10, since '2.10' < '2.9'.

Typo=Yes

Could it be that validator doesn't know about morpho feature "Typo=Yes", even though it is described in universal guidelines? I added a new version for Latvian and got lots of warnings [L4 Morpho unknown-feature-value] Unknown feature-value pair 'Typo=Yes'. Or is it some typo on my side?

Spec for the validator?

question: where is the documentation on which rules the validator checks, please?
I know it checks for single-roots, but what else?

validate.py online reports non-existing upos-feat combinations as not permitted

In UD_Russian-Taiga, I see 65 validation errors, neither of which is present in the data

online

./validate.sh --lang ru --max-err 0 UD_Russian-Taiga/ru_taiga-ud-dev.conllu
[Line 15 Sent instagram-2]: [L4 Morpho feature-value-upos-not-permitted] Value Imp of feature Aspect is not permitted with UPOS NOUN in language [ru].

offline

$ cat /UD_Russian-Taiga/*-dev.conllu | grep 'NOUN.*Aspect' -c
0
and freshly-updated validation.py on my side reports PASSED

@dan-zeman, could you look at this case?

bug in FixPunct?

My input is test.conllu file contains:

# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1       (       (       PUNCT   LET     _       0       root    _       _
2       *       *       PUNCT   LET     _       1       punct   _       _
3       )       )       PUNCT   LET     _       1       punct   _       _

after running

udapy -s ud.FixPunct < test.conllu I get:

# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1	(	(	PUNCT	LET	_	0	root	_	_
2	*	*	PUNCT	LET	_	0	punct	_	_
3	)	)	PUNCT	LET	_	0	punct	_	_

It is a pathological string, nevertheless, this is a nasty bug (it took me a while to realize it was not my conversion script that was wrong....)

Auxiliary or auxiliary verb?

I'm getting this error,

[Line 18810 Sent ...:18 Node 2]: [L5 Morpho aux-lemma] '...' is not an auxiliary verb in language [...] (there are no known approved auxiliaries in this language)

The item in question is not an auxiliary verb, but rather another kind of auxiliary word, should the message be changed to just "is not an auxiliary in language ..." ?

Fails to detect broken token ranges at the end of a sentence

74-76 ...
74

passes validation and shouldn't

validation and fixed: wrongly unallowed deprels

I am running the validate.py script, and I receive an error for this sentence (which, I am sorry, is very long... but so it goes with Medieval Latin):

# sent_id = 170
# text = Quare , cribellum cupientes deponere , ut residentiam cito visamus , dicimus Tridentum atque Taurinum nec non Alexandriam civitates metis Ytalie in tantum sedere propinquas quod puras nequeunt habere loquelas ; ita quod si etiam quod turpissimum habent vulgare , haberent pulcerrimum , propter aliorum commixtionem esse vere latium negaremus .
1	Quare	quare	ADV	r	_	12	discourse	_	_
2	,	,	PUNCT	Pu	_	4	punct	_	_
3	cribellum	cribellum	NOUN	sns2a	Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing	5	obj	_	_
4	cupientes	cupio	VERB	va3pppmn	Aspect=Imp|Case=Nom|Degree=Pos|Gender=Masc|NounClass=IndEurI|Number=Plur|Tense=Pres|VerbClass=LatX2|VerbForm=Part|Voice=Act	12	advcl:pred	_	_
5	deponere	depono	VERB	va3fp	Aspect=Imp|Tense=Pres|VerbClass=LatX|VerbForm=Inf|Voice=Act	4	ccomp	_	_
6	,	,	PUNCT	Pu	_	10	punct	_	_
7	ut	ut	SCONJ	cs	ConjType=Cmpr	10	mark	_	_
8	residentiam	residentia	NOUN	sfs1a	Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Sing	10	obj	_	_
9	cito	cito	ADV	r	_	10	advmod	_	_
10	visamus	viso	VERB	va3cpp1	Aspect=Imp|Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbClass=LatX|VerbForm=Fin|Voice=Act	4	advcl	_	_
11	,	,	PUNCT	Pu	_	12	punct	_	_
12	dicimus	dico	VERB	va3ipp1	Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbClass=LatX|VerbForm=Fin|Voice=Act	0	root	_	_
13	Tridentum	tridentum	PROPN	Sns2a	Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing	24	nsubj	_	_
14	atque	atque	CCONJ	co	_	15	cc	_	_
15	Taurinum	taurinum	PROPN	Sns2a	Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing	13	conj	_	_
16	nec	nec	CCONJ	co	Polarity=Neg	18	cc	_	_
17	non	non	PART	r	Polarity=Neg	16	fixed	_	_
18	Alexandriam	alexandria	PROPN	Sfs1a	Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Sing	13	conj	_	_
19	civitates	civitas	NOUN	sfp3a	Case=Acc|Gender=Fem|NounClass=IndEurX|Number=Plur	13	flat	_	_
20	metis	meta	NOUN	sfp1d	Case=Dat|Gender=Fem|NounClass=IndEurA|Number=Plur	25	obl:arg	_	_
21	Ytalie	italia	PROPN	Sfs1g	Case=Gen|Gender=Fem|NounClass=IndEurA|Number=Sing	20	nmod	_	_
22	in	in	ADP	e	AdpType=Prep	25	advmod	_	_
23	tantum	tantum	ADV	r	_	22	fixed	_	_
24	sedere	sedeo	VERB	va2fp	Aspect=Imp|Tense=Pres|VerbClass=LatE|VerbForm=Inf|Voice=Act	12	ccomp	_	_
25	propinquas	propinquus	ADJ	afp1a	Case=Acc|Degree=Pos|Gender=Fem|NounClass=IndEurA|Number=Plur	24	advcl:pred	_	_
26	quod	quod	SCONJ	cs	_	28	mark	_	_
27	puras	purus	ADJ	afp1a	Case=Acc|Degree=Pos|Gender=Fem|NounClass=IndEurA|Number=Plur	30	amod	_	_
28	nequeunt	nequeo	VERB	va5ipp3	Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbClass=LatAnom|VerbForm=Fin|Voice=Act	24	advcl	_	_
29	habere	habeo	VERB	va2fp	Aspect=Imp|Tense=Pres|VerbClass=LatE|VerbForm=Inf|Voice=Act	28	xcomp	_	_
30	loquelas	loquela	NOUN	sfp1a	Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Plur	29	obj	_	_
31	;	;	PUNCT	Pu	_	32	punct	_	_
32	ita	ita	ADV	r	_	24	parataxis	_	_
33	quod	quod	SCONJ	cs	_	50	mark	_	_
34	si	si	SCONJ	cs	_	41	mark	_	_
35	etiam	etiam	ADV	co	_	41	advmod	_	_
36	quod	qui	PRON	presna	Case=Acc|Gender=Neut|Number=Sing|PronType=Rel	38	obj	_	_
37	turpissimum	turpis	ADJ	ans2as	Case=Acc|Degree=Abs|Gender=Neut|NounClass=IndEurO|Number=Sing	36	amod	_	_
38	habent	habeo	VERB	va2ipp3	Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbClass=LatE|VerbForm=Fin|Voice=Act	39	acl:relcl	_	_
39	vulgare	vulgare	NOUN	sns3a	Case=Acc|Gender=Neut|NounClass=IndEurX|Number=Sing	41	obj	_	_
40	,	,	PUNCT	Pu	_	39	punct	_	_
41	haberent	habeo	VERB	va2cip3	Aspect=Imp|Mood=Sub|Number=Plur|Person=3|Tense=Past|VerbClass=LatE|VerbForm=Fin|Voice=Act	50	advcl	_	_
42	pulcerrimum	pulcher	ADJ	ans1as	Case=Acc|Degree=Abs|Gender=Neut|NounClass=IndEurO|Number=Sing	39	amod	_	_
43	,	,	PUNCT	Pu	_	50	punct	_	_
44	propter	propter	ADP	e	AdpType=Prep	46	case	_	_
45	aliorum	alius	DET	dpnpg	Case=Gen|Gender=Neut|NounClass=LatPron|Number=Plur|PronType=Ind	46	nmod	_	_
46	commixtionem	commixtio	NOUN	sfs3a	Case=Acc|Gender=Fem|NounClass=IndEurX|Number=Sing	50	obl	_	_
47	esse	sum	AUX	va5fp	Aspect=Imp|Tense=Pres|VerbClass=LatAnom|VerbForm=Inf|Voice=Act	49	cop	_	_
48	vere	vere	ADV	r	Degree=Pos	49	advmod	_	_
49	latium	latius	ADJ	ans1a	Case=Acc|Degree=Pos|Gender=Neut|NounClass=IndEurO|Number=Sing	50	ccomp	_	_
50	negaremus	nego	VERB	va1cip1	Aspect=Imp|Mood=Sub|Number=Plur|Person=1|Tense=Past|VerbClass=LatA|VerbForm=Fin|Voice=Act	32	orphan	_	_
51	.	.	PUNCT	Pu	_	12	punct	_	_

The problem is at token 22: we have a fixed adverbial expression, in tantum 'so much (so)', meaning that tantum depends as fixed from in, and then in bears the advmod relation with respect to propinquas (25). The validator complains:

'advmod' should be 'ADV' but it is 'ADP'

Should it not be so that fixed relations do not give an error for alleged POS/deprel mismatches of fixed heads? Because, since individual elements of fixed expressions retain their morphological analysis, but the whole expression can act differently than the single parts, there can be no restrictions.

I think I have other errors like that, but I still have to check.

The README file in this repo has some bad links - [404:NotFound]

The markup version of the readme that is displayed for the main page in this repo contains the following links:
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/conll_convert_tags_to_uposf.pl (conll_convert_tags_to_uposf.pl)
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/check_files.pl (check_files.pl)
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/conllu_align_tokens.pl (conllu_align_tokens.pl)

(The link in the readme’s raw markdown may appear in a different format to the link above)

Theses bad links were found by a tool I very recently created as part of an new experimental hobby project: https://github.com/MrCull/GitHub-Repo-ReadMe-Dead-Link-Finder
I (a human) verified that this link is broken and have manually logged this Issue (i.e. this Issue has not been created by a bot).
If this has been in any way helpful then please consider giving the above Repo a Star.

Limitation on the size of the text file for the conllu to text script

I tried to use the conllu_to_text.pl on a sizeable (1gb) conllu file, however it seems that the script has some limitations.

By doing perl conllu_to_text.pl <C:\Folder/Blabla.conllu> C:\Folder\blabla.txt

It seems the system is only able to create 70+ mb txt, I've trying other solutions on Windows. but it seems to be a limitation of the script.

Is there a way around this?

Thanks

[validate.py] ImportError: No module named file_util

I'm using Python 2.7.15rc1 that comes by default in Ubuntu 18.04
the only things I ever installed where pip and virtualenv
the code was run in a virtualenv where I installed the regex module.

I thought the file_util module was built-in, but apparently it isn't (at least in this python version). I tried pip install distutils to no avail. what's the name of the package I should install? (a requirements.txt file would be useful!)

ref in enhanced dependencies

I am working on adding enhanced dependencies.
For relative pronouns, the guidelines propose to add a 'ref' relation to the antecedent noun. However, validate.py stumbles over this, as it is not a known dependency label. Adding it to the language specific set does not help either, as it is not an extension of a known relation. Should it be added to deprel.ud?

Does UD provide conversion tools?

I am curious to compare UD models to others on a UD test set. The problem is, of course, that the others' labels are of a different tagset. Does UD provide conversion scripts to convert, for instance, the dependency labels of OntoNotes and Penn? Thanks in advance. (I am aware that conversion scripts will add noise, but I am fine with that.)

More than one copula.

We want to add more than one copula word.
Japanese has normal copula and honorific copula.

We also have contracted forms of the copula in speech corpora.

Hopefully, the restriction on the copula words should be relaxed in Japanese and other morphologically rich languages.
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_auxiliary.pl?lcode=ja

Test suite (runtests.sh) execution fails with error --lang is required

The script runtests.sh fails to execute the test suite as all invocations ofvalidate.py fail with

usage: validate.py [-h] [--noecho] [--echo] [--quiet] [--max-err MAX_ERR]
                   [--err-enc ERR_ENC] --lang LANG [--multiple-roots]
                   [input] [output]
validate.py: error: argument --lang is required

Can we either add a dummy language or something like a --nolang argument for the test suite? (@fginter)

Incorrect flagging of non-projective punctuation in validator

For this sentence, the validator claims that the punctuation mark node 14 is introducing non-projectivity. However, non-projective relations (introduced by the reparandum relation) would also exist without the punctuation mark and I think the structure of this tree adheres to the punctuation guidelines.

# sent_id = n01002058
# text = What she’s saying and what she’s doing, it — actually, it’s unbelievable.
1       What    what    PRON    WP      PronType=Int    4       obj     4:obj   _
2       she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4       nsubj   4:nsubj SpaceAfter=No
3       ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       aux     4:aux   _
4       saying  say     VERB    VBG     VerbForm=Ger    17      dislocated      17:dislocated   _
5       and     and     CCONJ   CC      _       9       cc      9:cc    _
6       what    what    PRON    WP      PronType=Int    9       obj     9:obj   _
7       she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   9       nsubj   9:nsubj SpaceAfter=No
8       ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   9       aux     9:aux   _
9       doing   do      VERB    VBG     VerbForm=Ger    4       conj    4:conj:and      SpaceAfter=No
10      ,       ,       PUNCT   ,       _       17      punct   17:punct        _
11      it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  15      reparandum      15:reparandum   _
12      —       —       PUNCT   :       _       11      punct   11:punct        _
13      actually        actually        ADV     RB      _       17      advmod  17:advmod       SpaceAfter=No
14      ,       ,       PUNCT   ,       _       17      punct   17:punct        _
15      it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  17      nsubj   17:nsubj        SpaceAfter=No
16      ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   17      cop     17:cop  _
17      unbelievable    unbelievable    ADJ     JJ      Degree=Pos      0       root    0:root  SpaceAfter=No
18      .       .       PUNCT   .       _       17      punct   17:punct        _

evaluate_treebank.pl "WARNING: Validator not found"

Hello,

I was hoping to get your help on an issue I have with the evaluate_treebank.pl script. Evaluation goes well and I get results the same as on the https://universaldependencies.org website in the case of many treebanks. However, I get a "WARNING: Validator not found. We will assume that all files are valid." message. So in the case of treebanks that do not score 1 in the validity test I get much different treebank ratings than those posted on https://universaldependencies.org.

I do not know Perl so I was not able to figure out what is the issue here. I would appreciate your help.

PS. validate.py script is present in the directory where it is supposed to be (I think I tried all possible combinations)

Validation is not exhaustive in checking non-projective punctuation

The current dev branch of the corpus UD_French-GSD is valid even if it contains a few cases of non-projective punctuation.

See 3 examples (the request takes some times to execute (about 15s) because the pattern is not connected…)

feats.ud used by conllu-stats.py removed from master

Not sure if this is expected or not but on the master branch, the data/feats.ud file is no longer present. It is there in the 2.7 release tag but seems to have been lost since then.

Validator, sorting and locale

I don't know Python, so not sure how sort works there, but in Perl - which I used to produce the treebank - it is dependent on the locale settings. And, although I sorted my features, the validator complained that they were unsorted, because of "|NumType=Card|Number=Plur".
It seems that Slovene locale puts capital letters before lower case ones.
Not sure if this is an issue of simply documenting the validator (e.g. "should be sorted according to C locale") or if this impacts the code.

universaldependencies / tools Goto Github PK

tools's People

Contributors

Stargazers

Watchers

Forkers

tools's Issues

online

offline

Recommend Projects

Recommend Topics

Recommend Org