universaldependencies / tools
Various utilities for processing the data.
License: GNU General Public License v2.0
Take the sentence CF0883-1. After the long discussion in LR-POR/cl-conllu#85, I believe that:
The script validate.py reports 3 warnings:
[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-causes-nonproj] Punctuation must not cause non-projectivity of nodes [4, 23, 24]
Token 4 is projective, so it doesn't make sense for it to be reported as a token that became non-projective because of 22. Token 24 is a non-projective token, but it is punctuation, so it is already reported in the punct-is-nonproj case below.
[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Token 21 should be included in the list.
[Line 30 Sent CF883-1 Node 24]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [22]
Token 23 should be included in the list.
The "Copula is not AUX" section of the syntax validation page (http://universaldependencies.org/svalidation.html#copula-is-not-aux) reports that UD_Greek has 437 hits.
Yet, when one clicks the "Go to search" link http://bionlp-www.utu.fi/dep_search/query?search=%28%21%28AUX%7CPRON%29%29+%3Ccop+_&db=UD_Greek-dev, all sentences contain AUX-tagged copulas (with typical verb features like VerbForm). Is this a bug?
Thanks.
The current validator emits a warning if the dependent of an advmod relation has a UPOS other than ADV, ADJ, CCONJ, PART, or SYM, except if the dependent is in a fixed construction.
I propose to add the same exception when the dependent is in a goeswith construction.
Some examples of invalid annotations that would become valid with this proposal:
When preparing the Finnish release, we were bitten badly by the fact that the validator does not warn on forests. That is in the specs, so it's not a bug. Nevertheless, I added a --single-root argument which checks that the tree is single-rooted. Now the question is whether this shouldn't be the default behavior, with an option to switch the check off if the analyses in some particular treebank are really meant to allow multiple roots.
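A single-root check of this kind is cheap to sketch. The helper below is hypothetical (not the validator's actual code) and assumes plain CoNLL-U lines with HEAD in column 7:

```python
def count_roots(sentence_lines):
    # Count tokens whose HEAD (column 7) is 0; a well-formed
    # single-rooted tree has exactly one such token.
    roots = 0
    for line in sentence_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        if cols[6] == "0":
            roots += 1
    return roots
```

With --single-root behavior, a sentence where this returns anything other than 1 would be reported.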
While using validate.py from UD tools to validate the conllu-formatted GUM corpus, the following error message occurs on my computer:
"""
C:\Users\logan\Dropbox\GUM\UDtools>python validate.py --lang en ..\amir_gumdev_build\target\dep\ud\GUM_interview_ants.conllu
Traceback (most recent call last):
File "validate.py", line 735, in
validate(inp,out,args,tagsets,known_sent_ids)
File "validate.py", line 613, in validate
for comments,tree in trees(inp,tag_sets,args):
File "validate.py", line 74, in trees
for line_counter, line in enumerate(inp):
File "C:\Program Files\Python36\lib\codecs.py", line 644, in next
line = self.readline()
File "C:\Program Files\Python36\lib\codecs.py", line 557, in readline
data = self.read(readsize, firstline=True)
File "C:\Program Files\Python36\lib\codecs.py", line 499, in read
data = self.bytebuffer + newdata
TypeError: can't concat str to bytes
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "validate.py", line 737, in
warn(u"Exception caught!",u"Format")
File "validate.py", line 49, in warn
print >> sys.stderr, (u"[%sLine %d]: %s"%(fn,curr_line,msg)).encode(args.err_enc)
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(, file=<output_stream>)"?
"""
I was using Python 3.6.4 with Windows 10 Home. The conllu file seems correctly formatted. Interestingly, my adviser @amir-zeldes did not encounter this problem while running the same command on his computer.
I tried my luck with Python 2.7.14 on my own machine and it ran through the document successfully. I'm not sure why it fails with this bytes/strings issue.
Thank you for your support.
Best,
Logan
In the current validator, the UPOS constraint on the advmod dependent is not applied if the dependent is in a fixed construction.
I propose to add the same exception to other relations, in particular to the det relation.
In several French corpora, n'importe quel is annotated as 3 tokens with fixed relations and it is used as a determiner (see Grew-match).
This construction is currently reported as invalid and would become valid with the proposal above.
Using validate.py on some French data, I got the following error:
[L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS X in language [br]
for the CoNLL line:
11 maen maen X _ Foreign=Yes 10 appos _ Lang=br|SpaceAfter=No
I think it would be sensible to allow the feature Foreign=Yes on the X tag regardless of the language.
In UD_Coptic-Scriptorium we have combining macrons in the MISC field, indicating supra-linear strokes added to characters in the manuscript forms (a MISC annotation called orig, containing unnormalized word forms):
[Line 149 Sent shenoute_a22-MONB_YA_421_428_s0002]: Unicode not normalized: MISC.character[6] is COMBINING MACRON, should be COMBINING DOT BELOW
Why are combining macrons bad unicode? There are no combined glyphs for these characters. Why should they be DOT BELOW instead?
If I use eval.py to compare a gold treebank against itself, it does not work. For example, the current git repo of English EWT produces the following error.
Actually, it's unclear to me whether this is a problem with the eval script or with the treebank... is it okay for a copy node to have no dependency information like this?
[john@localhost udtools]$ python3 eval.py ~/extern_data/ud2/git/UD_English-EWT/en_ewt-ud-train.conllu ~/extern_data/ud2/git/UD_English-EWT/en_ewt-ud-train.conllu
Traceback (most recent call last):
File "eval.py", line 705, in <module>
main()
File "eval.py", line 670, in main
evaluation = evaluate_wrapper(args)
File "eval.py", line 650, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file,treebank_type)
File "eval.py", line 637, in load_conllu_file
return load_conllu(_file,treebank_type)
File "eval.py", line 357, in load_conllu
raise UDError("The collapsed CoNLL-U line still contains empty nodes: {}".format(_encode(line)))
__main__.UDError: The collapsed CoNLL-U line still contains empty nodes: 8.1 reported report VERB VBN Tense=Past|VerbForm=Part|Voice=Pass _ _ 5:conj:and CopyOf=5
Not sure if this is an oversight or not, but I think NumType=Range (from here) is missing from feat_val.ud.
Hi, I want to try the Perl script enhanced_collapse_empty_nodes.pl with the following command:
perl enhanced_collapse_empty_nodes.pl ../train-dev/UD_English-EWT/en_ewt-ud-train.conllu > en_ewt-ud-train-collapsed.conllu
But I get the following error:
Can't locate namespace/autoclean.pm in @INC (you may need to install the namespace::autoclean module) (@INC contains: /home/wangxy/workspace/iwpt2020/tools /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /home/wangxy/workspace/iwpt2020/tools/Graph.pm line 4.
BEGIN failed--compilation aborted at /home/wangxy/workspace/iwpt2020/tools/Graph.pm line 4.
Compilation failed in require at enhanced_collapse_empty_nodes.pl line 38.
BEGIN failed--compilation aborted at enhanced_collapse_empty_nodes.pl line 38.
Any suggestion for this? Thank you.
Hello,
I think there is a bug with conllu_to_conllx.pl and restore_conllu_lines.pl. Here is the code that I run for Swedish:
perl conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conll
Then I convert it back to CoNLL-U format:
perl restore_conllu_lines.pl sv_talbanken-ud-test.conll sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conllu.merged
Then, I run the UD official evaluation script for "sv_talbanken-ud-test.conllu.merged" and "sv_talbanken-ud-test.conllu", but the code crashed with the following error:
main.UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: 'kbasbeloppetvidsamma' and system file: '_kbasbeloppetvidsamm'
The same thing happened with "tr-imst-ud-test.conllu" and "ru_syntagrus-ud-test.conllu".
I prepared the enhanced representation of ru_syntagrus for the UD 2.0 release, but unfortunately I had to abandon it because validation fails on fractional ids in the DEPS field.
Did I miss something important or did I find a bug?
I have these errors:
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Undefined ID in DEPS: E4.2
[Line 15187]: Failed to parse DEPS: E4.2:case
But the data looks ok to me:
http://paste2.org/6w2WmvG0
Broken link to https://github.com/UniversalDependencies/tools/blob/master/README.txt on the page https://universaldependencies.org/release_checklist.html#validation
It seems that it should be https://github.com/UniversalDependencies/tools/blob/master/README.md
I want to add words to the auxiliary verb list for the Japanese UDs for Modern and Spoken Japanese. (#71)
I checked the site below, but I cannot find how to add a new auxiliary verb.
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_auxiliary.pl
Do you have any plans to add new auxiliary verbs to the site in the future? @masayu-a
Following the discussion at the end of UDW 2019 in Paris, I tried to put together a proposal of the validation vs. release policy for the upcoming releases. The goal is to be able to add new tests and find more guideline violations, but without having to kick out older treebanks that do not pass the stricter tests (some of them are no longer maintained and there is no one who could fix the bugs soon; others have too many bugs and fixing them will take a lot of time).
The full proposal is currently available here and comments are welcome. In a nutshell: if a treebank was valid and released in UD 2.3, it can stay in the upcoming releases without passing tests that were added after UD 2.3. Newer treebanks must pass all tests that exist when the treebank is released for the first time.
I have modified the online validation page to reflect the proposal and identify treebanks with legacy status. There are 6 old treebanks that contain errors which were not tolerated even in UD 2.3 (that is, these errors were introduced in UD 2.4 and slipped the attention of the release team). Errors of this type must be fixed before UD 2.5. The treebanks are Croatian-SET (@nljubesi), English-EWT (@manning @sebschu), French-Spoken (@sylvainkahane), Norwegian-Bokmaal, Norwegian-NynorskLIA (@liljao), Serbian-SET (@tsamardzic).
4 treebanks were released in UD 2.4 for the first time but contained errors that were already checked at that time. Hence I think they are not really legacy treebanks (the only reason why they made it into the release was that we ignored some error messages in order to save older treebanks). (Disclaimer: I’m actually looking at the current report, so it is possible that the errors were not there at release time and were introduced later.) The treebanks are Classical_Chinese-Kyoto (@KoichiYasuoka), German-HDT (@akoehn @EmanuelUHH), German-LIT (@a-salomoni), Old_Russian-RNC (@olesar).
Finally, issues are also reported for 4 new treebanks: Bhojpuri-BHTB (@shashwatup9k), Chinese-GSDSimp (@qipeng), Skolt_Sami-Giellagas (@rueter), Swiss_German-UZH (@noe-eva).
What do people think about this?
Consider https://pythonhosted.org/six/ .
Should be m/^\d+-/ (line 26 in 3dc383c).
UDPipe complains if there is a comment in front of a sentence, but the validator doesn't pick this up. Is this an issue with UDPipe (e.g. does the format allow pre-sentential comments?) or is it an issue with the validator? Here is an example:
# can't find this in common voice file
# sent_id = 349
# text = Uico a rak rang tuk.
# text_en = The dog was very fast.
1 Uico uico NOUN _ _ 4 nsubj _ dog
2 a a PRON _ Number=Sing|Person=3 4 expl _ 3SG
3 rak rak PART _ _ 4 discourse _ PERF
4 rang rang VERB _ _ 0 root _ fast
5 tuk tuk ADV _ _ 4 advmod _ very|SpaceAfter=No
6 . PUNCT PUNCT _ _ 4 punct _ _
# can't find this in common voice file
# sent_id = 350
# text = Mei a vun sen.
# text_en = The light immediately turned red.
1 Mei mei NOUN _ _ 4 nsubj _ light
2 a a PRON _ Number=Sing|Person=3 4 expl _ 3SG
3 vun vun ADV _ _ 4 advmod _ immediately
4 sen sen VERB _ _ 0 root _ red|SpaceAfter=No
5 . PUNCT PUNCT _ _ 4 punct _ _
Here is the output from UDpipe:
[Line 1814 Sent 214]: [L1 Format misplaced-comment] Spurious comment line. Comments are only allowed before a sentence.
[Tree number 4093 on line 86571]: Mismatch between the text attribute and the FORM field.
You can cat *.conllu and pipe the result to validate.py to check for duplicate ids and whatnot. After the script is done with the first file, line numbers become non-informative. # sent_ids would reliably identify the erroneous sentence.
I'm looking for the easiest way to take an .eaf file (ELAN annotation file), parse the relevant layers (transcripts of utterances) with a UD parser, and insert the annotations back into the .eaf file (as additional layers).
Any suggestions about existing pipelines, or even just some pieces of the pipeline, would be helpful.
@fginter: This is already fixed in fc5b73f9956 (I planned a pull request, but forgot and pushed directly to master, sorry).
@dan-zeman was concerned that we should not make tests stricter a few days before the data freeze.
So please consider adding --no-space-after to the script which generates http://universaldependencies.org/validation.html
Alternatively, I can provide a script which fixes SpaceAfter=No based on the text.
Hi,
Is there a script within this repository that can convert from PTB to UD in CoNLL-U style?
Thanks.
Currently, runtests.sh fails on empty-node-without-dependent.conllu.
This is not ideal, as we plan to publish this tools repo with the UD 2.0 release.
I can fix this, but I would prefer to test it within svalidation rather than within validate.py.
First, it would be easier to implement there (and validate.py is terrible to maintain even now, without adding more tests).
Second, I consider this a linguistic decision that we may want to enforce in UD treebanks, but not a strict requirement of the CoNLL-U format.
Allow a basic CoNLL-U format check without the extra character set and symbol list restrictions imposed by UD.
Line 344 in 925b829
If sentences are separated by two blank lines instead of one, the script fails on this line because it finds that there is not exactly one root: there is none. Running into this error for that reason is a bit confusing; maybe double blank lines should be caught beforehand, or the error message could be updated.
Because of validate.py I updated my system to Python 3. After that, conllu-stats.py started to fail with this error message:
File "conllu-stats.py", line 132
print json.dumps(d)
^
SyntaxError: invalid syntax
Does this mean conllu-stats.py still needs Python 2.7? Is there a way I can avoid needing two different Python versions to make a UD release?
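The failing line is indeed Python 2 syntax: under Python 3, print is a function, not a statement. Assuming d is the dictionary being serialized (the value below is only an illustration), the line would need to become:

```python
import json

d = {"sentences": 2, "tokens": 37}  # hypothetical stats dictionary
print(json.dumps(d))  # Python 3: print is a function, parentheses required
```

Running the unchanged `print json.dumps(d)` under Python 3 raises exactly the SyntaxError shown above.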
Users, such as shared task participants, will need to know whether the stats and label sets in the tools/data folder cover just the training data or all treebank data. A readme should be added.
We get validate.py errors complaining about various relations. I thought these were allowed?
Line 903 in a7cf45e
This line validates treebank version numbers using string comparison, which will no longer work properly now that there exist treebanks like Ancient_Hebrew-PTNK with "Data available since: 2.10", since '2.10' < '2.9' as strings.
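A minimal illustration of the problem and one possible fix (the version_key helper is hypothetical, not the validator's actual code):

```python
# String comparison misorders release numbers: lexicographically '1' < '9',
# so '2.10' sorts before '2.9'.
assert '2.10' < '2.9'

# Comparing tuples of numeric components gives the intended ordering.
def version_key(v):
    return tuple(int(part) for part in v.split('.'))

assert version_key('2.10') > version_key('2.9')
assert sorted(['2.9', '2.10', '2.2'], key=version_key) == ['2.2', '2.9', '2.10']
```

Any comparison of "Data available since" values would need this numeric treatment once a second version component reaches two digits.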
Could it be that the validator doesn't know about the morphological feature Typo=Yes, even though it is described in the universal guidelines? I added a new version for Latvian and got lots of warnings: [L4 Morpho unknown-feature-value] Unknown feature-value pair 'Typo=Yes'.
Or is it some typo on my side?
Question: where is the documentation on which rules the validator checks, please?
I know it checks for a single root, but what else?
In UD_Russian-Taiga, I see 65 validation errors, none of which are present in the data:
./validate.sh --lang ru --max-err 0 UD_Russian-Taiga/ru_taiga-ud-dev.conllu
[Line 15 Sent instagram-2]: [L4 Morpho feature-value-upos-not-permitted] Value Imp of feature Aspect is not permitted with UPOS NOUN in language [ru].
$ cat /UD_Russian-Taiga/*-dev.conllu | grep 'NOUN.*Aspect' -c
0
and a freshly updated validate.py on my side reports PASSED.
@dan-zeman, could you look at this case?
My input file test.conllu contains:
# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1 ( ( PUNCT LET _ 0 root _ _
2 * * PUNCT LET _ 1 punct _ _
3 ) ) PUNCT LET _ 1 punct _ _
after running
udapy -s ud.FixPunct < test.conllu
I get:
# sent_id = wiki-11.p.10.s.1.xml
# text = (*)
1 ( ( PUNCT LET _ 0 root _ _
2 * * PUNCT LET _ 0 punct _ _
3 ) ) PUNCT LET _ 0 punct _ _
It is a pathological string; nevertheless, this is a nasty bug (it took me a while to realize it was not my conversion script that was wrong...).
I'm getting this error:
[Line 18810 Sent ...:18 Node 2]: [L5 Morpho aux-lemma] '...' is not an auxiliary verb in language [...] (there are no known approved auxiliaries in this language)
The item in question is not an auxiliary verb, but rather another kind of auxiliary word. Should the message be changed to just "is not an auxiliary in language ..."?
74-76 ...
74
passes validation and shouldn't.
I am running the validate.py script, and I receive an error for this sentence (which, I am sorry, is very long... but so it goes with Medieval Latin):
# sent_id = 170
# text = Quare , cribellum cupientes deponere , ut residentiam cito visamus , dicimus Tridentum atque Taurinum nec non Alexandriam civitates metis Ytalie in tantum sedere propinquas quod puras nequeunt habere loquelas ; ita quod si etiam quod turpissimum habent vulgare , haberent pulcerrimum , propter aliorum commixtionem esse vere latium negaremus .
1 Quare quare ADV r _ 12 discourse _ _
2 , , PUNCT Pu _ 4 punct _ _
3 cribellum cribellum NOUN sns2a Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing 5 obj _ _
4 cupientes cupio VERB va3pppmn Aspect=Imp|Case=Nom|Degree=Pos|Gender=Masc|NounClass=IndEurI|Number=Plur|Tense=Pres|VerbClass=LatX2|VerbForm=Part|Voice=Act 12 advcl:pred _ _
5 deponere depono VERB va3fp Aspect=Imp|Tense=Pres|VerbClass=LatX|VerbForm=Inf|Voice=Act 4 ccomp _ _
6 , , PUNCT Pu _ 10 punct _ _
7 ut ut SCONJ cs ConjType=Cmpr 10 mark _ _
8 residentiam residentia NOUN sfs1a Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Sing 10 obj _ _
9 cito cito ADV r _ 10 advmod _ _
10 visamus viso VERB va3cpp1 Aspect=Imp|Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbClass=LatX|VerbForm=Fin|Voice=Act 4 advcl _ _
11 , , PUNCT Pu _ 12 punct _ _
12 dicimus dico VERB va3ipp1 Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbClass=LatX|VerbForm=Fin|Voice=Act 0 root _ _
13 Tridentum tridentum PROPN Sns2a Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing 24 nsubj _ _
14 atque atque CCONJ co _ 15 cc _ _
15 Taurinum taurinum PROPN Sns2a Case=Acc|Gender=Neut|NounClass=IndEurO|Number=Sing 13 conj _ _
16 nec nec CCONJ co Polarity=Neg 18 cc _ _
17 non non PART r Polarity=Neg 16 fixed _ _
18 Alexandriam alexandria PROPN Sfs1a Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Sing 13 conj _ _
19 civitates civitas NOUN sfp3a Case=Acc|Gender=Fem|NounClass=IndEurX|Number=Plur 13 flat _ _
20 metis meta NOUN sfp1d Case=Dat|Gender=Fem|NounClass=IndEurA|Number=Plur 25 obl:arg _ _
21 Ytalie italia PROPN Sfs1g Case=Gen|Gender=Fem|NounClass=IndEurA|Number=Sing 20 nmod _ _
22 in in ADP e AdpType=Prep 25 advmod _ _
23 tantum tantum ADV r _ 22 fixed _ _
24 sedere sedeo VERB va2fp Aspect=Imp|Tense=Pres|VerbClass=LatE|VerbForm=Inf|Voice=Act 12 ccomp _ _
25 propinquas propinquus ADJ afp1a Case=Acc|Degree=Pos|Gender=Fem|NounClass=IndEurA|Number=Plur 24 advcl:pred _ _
26 quod quod SCONJ cs _ 28 mark _ _
27 puras purus ADJ afp1a Case=Acc|Degree=Pos|Gender=Fem|NounClass=IndEurA|Number=Plur 30 amod _ _
28 nequeunt nequeo VERB va5ipp3 Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbClass=LatAnom|VerbForm=Fin|Voice=Act 24 advcl _ _
29 habere habeo VERB va2fp Aspect=Imp|Tense=Pres|VerbClass=LatE|VerbForm=Inf|Voice=Act 28 xcomp _ _
30 loquelas loquela NOUN sfp1a Case=Acc|Gender=Fem|NounClass=IndEurA|Number=Plur 29 obj _ _
31 ; ; PUNCT Pu _ 32 punct _ _
32 ita ita ADV r _ 24 parataxis _ _
33 quod quod SCONJ cs _ 50 mark _ _
34 si si SCONJ cs _ 41 mark _ _
35 etiam etiam ADV co _ 41 advmod _ _
36 quod qui PRON presna Case=Acc|Gender=Neut|Number=Sing|PronType=Rel 38 obj _ _
37 turpissimum turpis ADJ ans2as Case=Acc|Degree=Abs|Gender=Neut|NounClass=IndEurO|Number=Sing 36 amod _ _
38 habent habeo VERB va2ipp3 Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbClass=LatE|VerbForm=Fin|Voice=Act 39 acl:relcl _ _
39 vulgare vulgare NOUN sns3a Case=Acc|Gender=Neut|NounClass=IndEurX|Number=Sing 41 obj _ _
40 , , PUNCT Pu _ 39 punct _ _
41 haberent habeo VERB va2cip3 Aspect=Imp|Mood=Sub|Number=Plur|Person=3|Tense=Past|VerbClass=LatE|VerbForm=Fin|Voice=Act 50 advcl _ _
42 pulcerrimum pulcher ADJ ans1as Case=Acc|Degree=Abs|Gender=Neut|NounClass=IndEurO|Number=Sing 39 amod _ _
43 , , PUNCT Pu _ 50 punct _ _
44 propter propter ADP e AdpType=Prep 46 case _ _
45 aliorum alius DET dpnpg Case=Gen|Gender=Neut|NounClass=LatPron|Number=Plur|PronType=Ind 46 nmod _ _
46 commixtionem commixtio NOUN sfs3a Case=Acc|Gender=Fem|NounClass=IndEurX|Number=Sing 50 obl _ _
47 esse sum AUX va5fp Aspect=Imp|Tense=Pres|VerbClass=LatAnom|VerbForm=Inf|Voice=Act 49 cop _ _
48 vere vere ADV r Degree=Pos 49 advmod _ _
49 latium latius ADJ ans1a Case=Acc|Degree=Pos|Gender=Neut|NounClass=IndEurO|Number=Sing 50 ccomp _ _
50 negaremus nego VERB va1cip1 Aspect=Imp|Mood=Sub|Number=Plur|Person=1|Tense=Past|VerbClass=LatA|VerbForm=Fin|Voice=Act 32 orphan _ _
51 . . PUNCT Pu _ 12 punct _ _
The problem is at token 22: we have a fixed adverbial expression, in tantum 'so much (so)', meaning that tantum depends as fixed on in, and in then bears the advmod relation with respect to propinquas (25). The validator complains:
'advmod' should be 'ADV' but it is 'ADP'
Shouldn't fixed relations exempt the head of a fixed expression from such alleged POS/deprel mismatches? Since the individual elements of a fixed expression retain their morphological analysis, while the whole expression can act differently from its single parts, there can be no such restrictions.
I think I have other errors like that, but I still have to check.
The README file in this repo has some bad links - [404:NotFound]
The markup version of the readme that is displayed for the main page in this repo contains the following links:
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/conll_convert_tags_to_uposf.pl (conll_convert_tags_to_uposf.pl)
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/check_files.pl (check_files.pl)
Status code [404:NotFound] - Link: https://github.com/UniversalDependencies/tools/blob/conllu_align_tokens.pl (conllu_align_tokens.pl)
(The link in the readme’s raw markdown may appear in a different format to the link above)
These bad links were found by a tool I recently created as part of a new experimental hobby project: https://github.com/MrCull/GitHub-Repo-ReadMe-Dead-Link-Finder
I (a human) verified that these links are broken and have manually logged this issue (i.e. this issue has not been created by a bot).
If this has been in any way helpful, then please consider giving the above repo a star.
I tried to use conllu_to_text.pl on a sizeable (1 GB) conllu file; however, it seems that the script has some limitations.
Running perl conllu_to_text.pl <C:\Folder/Blabla.conllu> C:\Folder\blabla.txt
only produces a 70+ MB txt file. I've tried other solutions on Windows, but it seems to be a limitation of the script.
Is there a way around this?
Thanks
I'm using Python 2.7.15rc1 (the version that comes by default in Ubuntu 18.04), pip and virtualenv, in a virtualenv where I installed the regex module. I thought the file_util module was built-in, but apparently it isn't (at least in this Python version). I tried pip install distutils to no avail. What's the name of the package I should install? (A requirements.txt file would be useful!)
I am working on adding enhanced dependencies.
For relative pronouns, the guidelines propose adding a 'ref' relation to the antecedent noun. However, validate.py stumbles over this, as it is not a known dependency label. Adding it to the language-specific set does not help either, as it is not an extension of a known relation. Should it be added to deprel.ud?
I am curious to compare UD models to others on a UD test set. The problem is, of course, that the others' labels are of a different tagset. Does UD provide conversion scripts to convert, for instance, the dependency labels of OntoNotes and Penn? Thanks in advance. (I am aware that conversion scripts will add noise, but I am fine with that.)
We want to add more than one copula word.
Japanese has a normal copula and an honorific copula.
We also have contracted forms of the copula in speech corpora.
We hope the restriction on copula words can be relaxed for Japanese and other morphologically rich languages.
http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_auxiliary.pl?lcode=ja
The script runtests.sh fails to execute the test suite, as all invocations of validate.py fail with:
usage: validate.py [-h] [--noecho] [--echo] [--quiet] [--max-err MAX_ERR]
[--err-enc ERR_ENC] --lang LANG [--multiple-roots]
[input] [output]
validate.py: error: argument --lang is required
Can we either add a dummy language or something like a --nolang argument for the test suite? (@fginter)
For this sentence, the validator claims that the punctuation mark at node 14 is introducing non-projectivity. However, the non-projective relations (introduced by the reparandum relation) would also exist without the punctuation mark, and I think the structure of this tree adheres to the punctuation guidelines.
# sent_id = n01002058
# text = What she’s saying and what she’s doing, it — actually, it’s unbelievable.
1 What what PRON WP PronType=Int 4 obj 4:obj _
2 she she PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 4 nsubj 4:nsubj SpaceAfter=No
3 ’s be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 aux 4:aux _
4 saying say VERB VBG VerbForm=Ger 17 dislocated 17:dislocated _
5 and and CCONJ CC _ 9 cc 9:cc _
6 what what PRON WP PronType=Int 9 obj 9:obj _
7 she she PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 9 nsubj 9:nsubj SpaceAfter=No
8 ’s be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 9 aux 9:aux _
9 doing do VERB VBG VerbForm=Ger 4 conj 4:conj:and SpaceAfter=No
10 , , PUNCT , _ 17 punct 17:punct _
11 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 15 reparandum 15:reparandum _
12 — — PUNCT : _ 11 punct 11:punct _
13 actually actually ADV RB _ 17 advmod 17:advmod SpaceAfter=No
14 , , PUNCT , _ 17 punct 17:punct _
15 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 17 nsubj 17:nsubj SpaceAfter=No
16 ’s be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 17 cop 17:cop _
17 unbelievable unbelievable ADJ JJ Degree=Pos 0 root 0:root SpaceAfter=No
18 . . PUNCT . _ 17 punct 17:punct _
Hello,
I was hoping to get your help with an issue I have with the evaluate_treebank.pl script. Evaluation goes well, and for many treebanks I get the same results as on the https://universaldependencies.org website. However, I get a "WARNING: Validator not found. We will assume that all files are valid." message. So for treebanks that do not score 1 in the validity test, I get treebank ratings much different from those posted on https://universaldependencies.org.
I do not know Perl, so I was not able to figure out what the issue is here. I would appreciate your help.
PS: the validate.py script is present in the directory where it is supposed to be (I think I tried all possible combinations).
The current dev branch of the corpus UD_French-GSD is valid even though it contains a few cases of non-projective punctuation.
See 3 examples (the request takes some time to execute (about 15s) because the pattern is not connected…)
Not sure if this is expected, but on the master branch the data/feats.ud file is no longer present. It is there in the 2.7 release tag but seems to have been lost since then.
I don't know Python, so I'm not sure how sort works there, but in Perl (which I used to produce the treebank) it depends on the locale settings. And although I sorted my features, the validator complained that they were unsorted, because of "|NumType=Card|Number=Plur".
It seems that the Slovene locale puts capital letters before lowercase ones.
I'm not sure whether this is simply a matter of documenting the validator (e.g. "features should be sorted according to the C locale") or whether it impacts the code.
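For illustration, here is how a byte-wise comparison and a case-insensitive comparison disagree on exactly this pair. This is a Python sketch, assuming the validator expects the case-insensitive order for FEATS; the validator's actual code may differ:

```python
feats = ["NumType=Card", "Number=Plur"]

# Byte-wise comparison: 'T' (0x54) sorts before 'b' (0x62),
# so NumType=Card comes first.
assert sorted(feats) == ["NumType=Card", "Number=Plur"]

# Case-insensitive comparison: "numb..." < "numt...",
# so Number=Plur comes first.
assert sorted(feats, key=str.lower) == ["Number=Plur", "NumType=Card"]
```

A locale-dependent Perl sort can land on either order, which would explain the mismatch with the validator.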