
glosstag's Introduction

Open Portuguese WordNet (OWN-PT)

This repository hosts Portuguese WordNet data in textual format; it is an experimental branch of http://openwordnet-pt.org. It is linked to (but independent from) the Open English WordNet.

You can also get the data in JSON and RDF format.

See the Wiki for how the data was generated, how it compares to Princeton WordNet, and what the syntax of the text files is. This data is validated and exported by the mill tool — see its repository for more information about validation, export formats, etc.

glosstag's People

Contributors

arademaker, fcbond, hmuniz, odanoburu


glosstag's Issues

Error in original data causes information loss

<synset id="r00003483" ofs="00003483" pos="r">
 <terms>
  <term>basically</term>
  <term>fundamentally</term>
  <term>essentially</term>
 </terms>
 <keys>
  <sk>basically%4:02:00::</sk>
  <sk>fundamentally%4:02:00::</sk>
  <sk>essentially%4:02:01::</sk>
 </keys>
 <gloss desc="orig">
  <orig>in essence; at bottom or by one's (or its) very nature; "He is basically dishonest"; "the argument was essentially a technical one"; "for all his bluster he is in essence a shy person"</orig>
 </gloss>
 <gloss desc="text">
  <text>in essence ; at bottom or by one's ( or its ) very nature ; “ He is basically dishonest ” ; “ the argument was essentially a technical one ” ; “ for all his bluster he is in essence a shy person ”</text>
 </gloss>
 <gloss desc="wsd">
  <def id="r00003483_d">
   <wf id="r00003483_wf1" lemma="in" pos="IN" tag="ignore">in</wf>
   <wf id="r00003483_wf2" lemma="essence%1" pos="NN" sep="" tag="un">essence</wf>
   <wf id="r00003483_wf3" pos=":" tag="ignore" type="punc">;</wf>
   <cf coll="a" id="r00003483_wf4" lemma="at" pos="IN" tag="ignore">
    <glob coll="a" glob="man" id="r00003483_coll.a" lemma="at_bottom%4" tag="man">
     <id coll="b" id="r00003483_id.6" lemma="at bottom" sk="at_bottom%4:02:00::"/>
   </glob>at</cf>
   <cf coll="a" id="r00003483_wf5" lemma="bottom%1|bottom%2|bottom%3" pos="NN" tag="un">bottom</cf>
   <wf id="r00003483_wf6" lemma="or" pos="CC" tag="ignore">or</wf>
   <wf id="r00003483_wf7" lemma="by" pos="IN" tag="ignore">by</wf>
....
(:ofs "00003483" :pos "r" :keys (("essentially%4:02:01::" . "essentially")
				 ("fundamentally%4:02:00::" . "fundamentally")
				 ("basically%4:02:00::" . "basically"))
      :gloss "in essence; at bottom or by one's (or its) very nature; 
         \"He is basically dishonest\"; \"the argument was essentially a technical one\";  
        \"for all his bluster he is in essence a shy person\""
      :tokens ((:kind :def :action :open)
	       (:kind :wf :form "in" :lemma "in" :pos "IN" :tag "ignore")
	       (:kind :wf :form "essence" :lemma "essence%1" :pos "NN" :tag "un" :sep "")
	       (:kind :wf :form ";" :pos ":" :tag "ignore" :type "punc")
	       (:kind (:glob . "a") :lemma "at_bottom%4" :tag "man" :glob "man")
	       (:kind (:cf "a") :form "at" :lemma "at" :pos "IN" :tag "ignore")
	       (:kind (:cf "a") :form "bottom" :lemma "bottom%1|bottom%2|bottom%3" 
                   :pos "NN" :tag "un")
	       (:kind :wf :form "or" :lemma "or" :pos "CC" :tag "ignore")
	       (:kind :wf :form "by" :lemma "by" :pos "IN" :tag "ignore")
...

The glob in the plist version is not annotated. Is that right, or is it a bug in the conversion?
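
Below is a minimal sketch (the file path and element layout are assumptions based on the example above) for listing the globs in the merged XML whose <id> children carry a sense key, so they can be compared against the plist export:

# Sketch: list globs in the merged XML that carry sense keys (sk) on their
# <id> children, to compare against the plist/JSON export.
import xml.etree.ElementTree as ET

def globs_with_senses(path):
    for g in ET.parse(path).iter("glob"):
        senses = [i.get("sk") for i in g.findall("id") if i.get("sk")]
        yield g.get("id"), g.get("tag"), senses

# hypothetical path; the merged files are assumed to follow the layout above
for glob_id, tag, senses in globs_with_senses("merged/adv.xml"):
    if senses:
        print(glob_id, tag, senses)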

inconsistency in pre-POS and lemmas

  1. tokens without lemmas, as below. It looks like all of them are punctuation; we could add a lemma for them, equal to the form (see the sketch after this list).
{'form': ';', 'kind': ['wf'], 'meta': {'pos': ':', 'type': 'punc'}, 'tag': 'ignore', 'begin': 46, 'end': 47}
  2. tokens with lemmas without %N

    1. some with meta {'form': 'to', 'kind': ['wf'], 'lemmas': ['to'], 'meta': {'pos': 'TO'}, 'tag': 'ignore', 'begin': 50, 'end': 52}
    2. some without meta {'form': 'of', 'kind': ['wf'], 'lemmas': ['of'], 'tag': 'ignore', 'begin': 15, 'end': 17}
    3. some proper nouns {'form': 'Edmund', 'kind': ['wf'], 'lemmas': ['Edmund'], 'tag': 'un', 'begin': 41, 'end': 47}
    4. some words missing from PWN {'glob': 'man', 'kind': ['glob', 'c'], 'lemmas': ['flour_moths'], 'tag': 'un'}
    5. some annotated {'glob': 'man', 'kind': ['glob', 'b'], 'lemmas': ['appellate_court'], 'senses': ['appellate_court%1:14:00::'], 'tag': 'man'}
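
A sketch for scanning the JSON-lines export for both kinds of tokens (field names follow the dicts above; the data/*.jl paths are an assumption based on the jq command used in a later issue):

# Sketch: categorize tokens by lemma shape in the JSON-lines export.
import glob
import json

no_lemma, no_percent = [], []

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                lemmas = tok.get("lemmas")
                if not lemmas:
                    no_lemma.append(tok)
                elif any("%" not in lemma for lemma in lemmas):
                    no_percent.append(tok)

print(len(no_lemma), "tokens without lemmas")
print(len(no_percent), "tokens with lemmas lacking %N")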

court of record

It has a Wikipedia page: https://en.wikipedia.org/wiki/Court_of_record

A court of record is a trial court or appellate court in which a record of the proceedings is captured and preserved, for the possibility of appeal.

In the glosstag data we have some occurrences of `court of public records` and `court record`. Should we add a synset for it?

standoff files

Releases v0.1 and v0.2 changed only the merged files. Does it make sense to keep the standoff files consistent, i.e. to generate new standoff files from the new merged files?

tokenization issues

After #9:

  1. we still have cases where names like 'A.B.Fulano' end up in a single token
  2. we may have other tokens that need to be split; we can search for `.` or `-` inside token forms (see the sketch after this list)
  3. we have some cases of WF tokens with an explicit sep set to a space, even though space is the default separator; we should remove those cases and check that the detokenization approach still reproduces the text field
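
A sketch of the search mentioned in item 2 (the data/*.jl layout and field names are assumptions):

# Sketch: flag token forms that may still need splitting (a '.' or '-' inside
# the form), skipping punctuation tokens.
import glob
import json
import re

pattern = re.compile(r"\w[.-]\w")   # e.g. 'A.B.Fulano'

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                if tok.get("type") != "punc" and pattern.search(tok.get("form", "")):
                    print(tok["form"])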

`cross court` vs `cross` `court`

(n) return | a tennis stroke that sends the ball back to the other player ; he won the point on a cross court return ;

I opted for globbing `cross court` because many results from Google suggest this is an MWE in the tennis domain. But this MWE is missing from PWN.

tokenization

Why do we get tokenizations like

(n) ruling,opinion | the reason for a court's judgment ( as opposed to the decision itself) ;

and

(v) wash | admit to testing or proof ; This silly excuse wo n't wash in traffic court ;

assessment of the quality of ERG analyses

Executive summary:

Consider the spans that group predications and tokens for each sentence. In total, we have 1842193 such groups, and in only 49793 of them did I find an apparent POS inconsistency between ERG and the sense annotation.

49793/1842193 ≈ 0.03

Note that I only consider the tokens that were sense tagged. If we count per sentence, 38883 out of 159614 sentences contain at least one error. If we ignore the a/r mismatches (adverbs analysed as adjectives) and the q/n mismatches (someone), we have 28358 sentences with at least one error. If we also ignore the mismatches caused by verb/adjective, we have 17401 sentences:

38883/159614 ≈ 0.24
28358/159614 ≈ 0.18
17401/159614 ≈ 0.11

The dataset contains 165994 sentences, but not all of them got a parse from ERG.

Details:

For all sentences, I join the tokens with the MRS predicates using the spans.
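
A minimal sketch of the span join (the input structures are assumptions; in the listings below, both tokens and predications appear as tuples whose second and third fields are character offsets):

# Sketch: group sense-tagged tokens and MRS predications by (begin, end) span.
from collections import defaultdict

def join_by_span(entries):
    """Group entries (tuples whose 2nd and 3rd fields are offsets) by span."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry[1], entry[2])].append(entry)
    return groups

# groups = join_by_span(list(sense_tagged_tokens) + list(mrs_predications))
# A span is suspicious when the POS implied by the sense key (e.g. %2 = verb)
# disagrees with the POS of every ERG predicate sharing the span.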

In the example below, I found no conflict between ERG and the annotation. For instance, affect%2 means the token was annotated as a verb, and ERG made it the predicate _affect_v_1. For hydrarthrosis, the token was annotated as a noun, and ERG preprocessing instantiated a generic predicate from the NNS POS tag.

> START def hydrarthrosis affecting the knee
 (0,32) 0 => [('unknown', 0, 32, 'e2', 'h1', None), ('udef_q', 0, 32, 'q4', 'h5', None)]
 (0,13) 1 => [('hydrarthrosis', 0, 13, ['hydrarthrosis%1:26:00::'], ['wf'], 'NN'), ('_hydrarthrosis/nns_u_unknown', 0, 13, 'x4', 'h8', None)]
 (14,23) 1 => [('affecting', 14, 23, ['affect%2:29:00::'], ['wf'], 'VBG'), ('_affect_v_1', 14, 23, 'e9', 'h8', None)]
 (24,27) 1 => [('the', 24, 27, None, ['wf'], 'DT'), ('_the_q', 24, 27, 'q10', 'h11', None)]
 (28,32) 1 => [('knee', 28, 32, ['knee%1:08:00::'], ['wf'], 'NN'), ('_knee_n_1', 28, 32, 'x10', 'h14', None)]

Next, excess was annotated as an adjective (%5) but analysed as a noun by ERG. See the line starting with "D>".

> START def an abnormality of pregnancy; accumulation of excess amniotic fluid
D> {'n', 'a'} [('excess', 45, 51, ['excess%5:00:00:unnecessary:00'], ['wf'], 'JJ'), ('udef_q', 45, 51, 'q29', 'h30', None), ('_excess_n_1', 45, 51, 'x29', 'h33', None)]
 (0,66) 0 => [('implicit_conj', 0, 66, 'e2', 'h1', None)]
  (0,28) 1 => [('unknown', 0, 28, 'e4', 'h1', None)]
   (0,2) 2 => [('an', 0, 2, None, ['wf'], 'DT'), ('_a_q', 0, 2, 'q6', 'h7', None)]
   (3,14) 2 => [('abnormality', 3, 14, None, ['wf'], 'NN'), ('_abnormality_n_1', 3, 14, 'x6', 'h10', None)]
   (15,17) 2 => [('of', 15, 17, None, ['wf'], 'IN'), ('_of_p', 15, 17, 'e11', 'h10', None)]
   (18,28) 2 => [('udef_q', 18, 28, 'q12', 'h13', None)]
    (18,27) 3 => [('pregnancy', 18, 27, ['pregnancy%1:26:00::'], ['wf'], 'NN'), ('_pregnancy_n_1', 18, 27, 'x12', 'h16', None)]
    (27,28) 3 => [(';', 27, 28, None, ['wf'], 'punc')]
  (29,66) 1 => [('unknown', 29, 66, 'e5', 'h1', None), ('udef_q', 29, 66, 'q17', 'h18', None)]
   (29,41) 2 => [('accumulation', 29, 41, ['accumulation%1:22:00::'], ['wf'], 'NN'), ('_accumulation_n_of', 29, 41, 'x17', 'h21', None)]
   (42,44) 2 => [('of', 42, 44, None, ['wf'], 'IN')]
   (45,66) 2 => [('udef_q', 45, 66, 'q22', 'h23', None)]
    (45,60) 3 => [('compound', 45, 60, 'e27', 'h26', None)]
     (45,51) 4 => [('excess', 45, 51, ['excess%5:00:00:unnecessary:00'], ['wf'], 'JJ'), ('udef_q', 45, 51, 'q29', 'h30', None), ('_excess_n_1', 45, 51, 'x29', 'h33', None)]
     (52,60) 4 => [('amniotic', 52, 60, None, ['cf', 'a'], 'JJ'), ('_amniotic/jj_u_unknown', 52, 60, 'e28', 'h26', None)]
    (61,66) 3 => [('fluid', 61, 66, None, ['cf', 'a'], 'NN'), ('_fluid_n_1', 61, 66, 'x22', 'h26', None)]

ERG analyses adverbs and adjectives alike as adjuncts, so another common mismatch is a vs r. Should the fragment after the first semicolon, "equally balanced", be an example?

> START def a state of being essentially equal or equivalent; equally balanced; 
D> {'a', 'r'} [('essentially', 17, 28, ['essentially%4:02:01::'], ['wf'], 'RB'), ('_essential_a_1', 17, 28, 'e17', 'h16', None)]
D> {'n', 'a'} [('equivalent', 38, 48, ['equivalent%1:09:00::'], ['wf'], 'JJ'), ('_equivalent_a_to', 38, 48, 'e22', 'h16', None)]
 (0,67) 0 => [('implicit_conj', 0, 67, 'e2', 'h1', None)]
  (0,49) 1 => [('unknown', 0, 49, 'e4', 'h1', None)]
   (0,1) 2 => [('a', 0, 1, None, ['wf'], 'DT'), ('_a_q', 0, 1, 'q6', 'h7', None)]
   (2,7) 2 => [('state', 2, 7, ['state%1:03:00::'], ['wf'], 'NN'), ('_state_n_of', 2, 7, 'x6', 'h10', None)]
   (8,10) 2 => [('of', 8, 10, None, ['wf'], 'IN')]
   (11,49) 2 => [('udef_q', 11, 49, 'q11', 'h12', None), ('nominalization', 11, 49, 'x11', 'h15', None)]
    (11,16) 3 => [('being', 11, 16, None, ['wf'], 'VBG')]
    (17,28) 3 => [('essentially', 17, 28, ['essentially%4:02:01::'], ['wf'], 'RB'), ('_essential_a_1', 17, 28, 'e17', 'h16', None)]
    (29,34) 3 => [('equal', 29, 34, None, ['wf'], 'JJ'), ('_equal_a_to', 29, 34, 'e18', 'h16', None)]
    (35,37) 3 => [('or', 35, 37, None, ['wf'], 'CC'), ('_or_c', 35, 37, 'e21', 'h16', None)]
    (38,48) 3 => [('equivalent', 38, 48, ['equivalent%1:09:00::'], ['wf'], 'JJ'), ('_equivalent_a_to', 38, 48, 'e22', 'h16', None)]
    (48,49) 3 => [(';', 48, 49, None, ['wf'], 'punc')]
  (50,67) 1 => [('unknown', 50, 67, 'e5', 'h1', None)]
   (50,57) 2 => [('equally', 50, 57, None, ['wf'], 'RB'), ('_equal_a_to', 50, 57, 'e25', 'h1', None)]
   (58,66) 2 => [('balanced', 58, 66, ['balance%2:42:00::'], ['wf'], 'VBN'), ('_balance_v_1', 58, 66, 'e26', 'h1', None)]
   (66,67) 2 => [(';', 66, 67, None, ['wf'], 'punc')]

Adjective vs verb:

> START def the condition of being reinstated; 
D> {'v', 'a'} [('reinstated', 23, 33, ['reinstate%2:41:00::'], ['wf'], 'VBN'), ('_instate_v_1', 23, 33, 'e15', 'h14', None), ('_re-_a_again', 23, 33, 'e18', 'h14', None)]
 (0,34) 0 => [('unknown', 0, 34, 'e2', 'h1', None)]
  (0,3) 1 => [('the', 0, 3, None, ['wf'], 'DT'), ('_the_q', 0, 3, 'q4', 'h5', None)]
  (4,13) 1 => [('condition', 4, 13, ['condition%1:26:00::'], ['wf'], 'NN'), ('_condition_n_of', 4, 13, 'x4', 'h8', None)]
  (14,16) 1 => [('of', 14, 16, None, ['wf'], 'IN')]
  (17,34) 1 => [('udef_q', 17, 34, 'q9', 'h10', None), ('nominalization', 17, 34, 'x9', 'h13', None)]
   (17,22) 2 => [('being', 17, 22, None, ['wf'], 'VBG')]
   (23,33) 2 => [('reinstated', 23, 33, ['reinstate%2:41:00::'], ['wf'], 'VBN'), ('_instate_v_1', 23, 33, 'e15', 'h14', None), ('_re-_a_again', 23, 33, 'e18', 'h14', None)]
   (33,34) 2 => [(';', 33, 34, None, ['wf'], 'punc')]

Someone vs person + some_q (1829 cases). I need to improve my check to remove these from the suspicious cases.

> START def a situation of being uncomfortably close to someone or something
D> {'a', 'r'} [('uncomfortably', 21, 34, ['uncomfortably%4:02:00::'], ['wf'], 'RB'), ('_uncomfortable_a_1', 21, 34, 'e16', 'h15', None)]
D> {'q', 'n'} [('someone', 44, 51, ['someone%1:03:00::'], ['wf'], 'NN'), ('person', 44, 51, 'x24', 'h23', None), ('_some_q', 44, 51, 'q24', 'h25', None)]
 (0,64) 0 => [('unknown', 0, 64, 'e2', 'h1', None)]
  (0,1) 1 => [('a', 0, 1, None, ['wf'], 'DT'), ('_a_q', 0, 1, 'q4', 'h5', None)]
  (2,11) 1 => [('situation', 2, 11, ['situation%1:15:00::'], ['wf'], 'NN'), ('_situation_n_1', 2, 11, 'x4', 'h8', None)]
  (12,14) 1 => [('of', 12, 14, None, ['wf'], 'IN'), ('_of_p', 12, 14, 'e9', 'h8', None)]
  (15,64) 1 => [('udef_q', 15, 64, 'q10', 'h11', None), ('nominalization', 15, 64, 'x10', 'h14', None)]
   (15,20) 2 => [('being', 15, 20, None, ['wf'], 'VBG')]
   (21,34) 2 => [('uncomfortably', 21, 34, ['uncomfortably%4:02:00::'], ['wf'], 'RB'), ('_uncomfortable_a_1', 21, 34, 'e16', 'h15', None)]
   (35,40) 2 => [('close', 35, 40, None, ['wf'], 'JJ'), ('_close_a_to', 35, 40, 'e17', 'h15', None)]
   (41,43) 2 => [('to', 41, 43, None, ['wf'], 'TO')]
   (44,64) 2 => [('udef_q', 44, 64, 'q19', 'h20', None)]
    (44,51) 3 => [('someone', 44, 51, ['someone%1:03:00::'], ['wf'], 'NN'), ('person', 44, 51, 'x24', 'h23', None), ('_some_q', 44, 51, 'q24', 'h25', None)]
    (52,54) 3 => [('or', 52, 54, None, ['wf'], 'CC'), ('_or_c', 52, 54, 'x19', 'h28', None)]
    (55,64) 3 => [('something', 55, 64, None, ['wf'], 'PRP'), ('thing', 55, 64, 'x29', 'h30', None), ('_some_q', 55, 64, 'q29', 'h31', None)]

What is 'especially' below? It is tagged as an adverb, but in the ERG analysis it gets part of speech x?

> START def the relative position or standing of things or especially persons in a society; 
D> {'x', 'r'} [('especially', 47, 57, ['especially%4:02:01::'], ['wf'], 'RB'), ('_especially_x_deg', 47, 57, 'e35', 'h34', None)]
 (0,79) 0 => [('unknown', 0, 79, 'e2', 'h1', None), ('udef_q', 0, 79, 'q4', 'h5', None)]
  (0,3) 1 => [('the', 0, 3, None, ['wf'], 'DT'), ('_the_q', 0, 3, 'q9', 'h8', None)]
  (4,12) 1 => [('relative', 4, 12, ['relative%3:00:00::'], ['wf'], 'JJ'), ('_relative_a_to', 4, 12, 'e13', 'h12', None)]
  (13,21) 1 => [('position', 13, 21, None, ['wf'], 'NN'), ('udef_q', 13, 21, 'q16', 'h15', None), ('_position_n_of', 13, 21, 'x16', 'h19', None)]
  (22,33) 1 => [('udef_q', 22, 33, 'q21', 'h20', None)]
   (22,24) 2 => [('or', 22, 24, None, ['wf'], 'CC'), ('_or_c', 22, 24, 'x9', 'h12', None)]
   (25,33) 2 => [('standing', 25, 33, None, ['wf'], 'NN'), ('_standing_n_1', 25, 33, 'x21', 'h24', None)]
  (34,36) 1 => [('of', 34, 36, None, ['wf'], 'IN'), ('_of_p', 34, 36, 'e25', 'h12', None)]
  (37,43) 1 => [('things', 37, 43, ['thing%1:06:01::'], ['wf'], 'NNS'), ('udef_q', 37, 43, 'q26', 'h27', None), ('_thing_n_of-about', 37, 43, 'x26', 'h30', None)]
  (44,46) 1 => [('or', 44, 46, None, ['wf'], 'CC'), ('_or_c', 44, 46, 'x4', 'h32', None)]
  (47,57) 1 => [('especially', 47, 57, ['especially%4:02:01::'], ['wf'], 'RB'), ('_especially_x_deg', 47, 57, 'e35', 'h34', None)]
  (58,79) 1 => [('udef_q', 58, 79, 'q33', 'h34', None)]
   (58,65) 2 => [('persons', 58, 65, ['person%1:03:00::'], ['wf'], 'NNS'), ('_person_n_1', 58, 65, 'x33', 'h39', None)]
   (66,68) 2 => [('in', 66, 68, None, ['wf'], 'IN'), ('_in_p_loc', 66, 68, 'e40', 'h39', None)]
   (69,70) 2 => [('a', 69, 70, None, ['wf'], 'DT'), ('_a_q', 69, 70, 'q41', 'h42', None)]
   (71,78) 2 => [('society', 71, 78, None, ['wf'], 'NN'), ('_society_n_of', 71, 78, 'x41', 'h45', None)]
   (78,79) 2 => [(';', 78, 79, None, ['wf'], 'punc')]

court dance

(n) courante | a court dance of the 16th century ; consisted of short advances and retreats ;
(n) pavan,pavane | a stately court dance of the 16th and 17th centuries ;
(n) saraband | a stately court dance of the 17th and 18th centuries ; in slow time ;
(n) minuet | a stately court dance in the 17th century ;

There are many cases; it looks like a fixed expression.

https://www.google.com/search?client=safari&rls=en&q=court+dance&ie=UTF-8&oe=UTF-8

It could be a missing expression under one of these:

http://wnpt.sl.res.ibm.com/wn/synset?id=00534849-n < social_dancing

http://wnpt.sl.res.ibm.com/wn/synset?id=08253141-n < party

http://wnpt.sl.res.ibm.com/wn/synset?id=07448717-n < party duplicated?

a subconcept of http://wnpt.sl.res.ibm.com/wn/synset?id=00428270-n, the act

http://wnpt.sl.res.ibm.com/wn/synset?id=00537682-n (another synset for the royal courts dancing)

http://wnpt.sl.res.ibm.com/wn/synset?id=00532110-n - this is the act

dark brown

(n) Pinus strobiformis,southwestern white pine | medium-size pine of northwestern Mexico ; bark is dark brown and furrowed when mature ;

Many cases of `dark brown` are marked as a glob and sense tagged to http://wnpt.sl.res.ibm.com/wn/synset?id=00372111-a, but this synset doesn't have the lemma `dark brown` (with a space).
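
One way to double-check is to look up the synset's lemmas, e.g. with NLTK's bundled WordNet 3.0 (this assumes the id above is a WordNet 3.0 offset):

# Sketch: print the lemmas of synset 00372111-a to confirm whether a
# 'dark brown' / 'dark_brown' lemma exists. Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

syn = wn.synset_from_pos_and_offset("a", 372111)
print(syn.name(), syn.lemma_names())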

punctuations

Punctuation tokens can be represented in a simpler way:

% jq -c -S ".tokens | .[] " data/*.jl | rg "\"punc\""  | sort | uniq -c | sort -nr
50891 {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
48354 {"form":"“","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
47610 {"form":"”","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
15710 {"form":";","kind":["wf"],"tag":"ignore","type":"punc"}
14024 {"form":"(","kind":["wf"],"pos":"(","sep":"","tag":"ignore","type":"punc"}
8606 {"form":")","kind":["wf"],"pos":")","sep":"","tag":"ignore","type":"punc"}

We could avoid repetition and save some space, considering that we have 210429 tokens of type punc. For example, (a) could become (b):

a. {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
b. {"form":";","kind":["wf"],"tag":"ignore","pos":"punc"}

glob error: add glob

(a) incumbent | lying or aleaning aon something else ; an incumbent bgeological bformation ;

@coll attribute in ID tag outside glob

adj:
WARNING: ==> id a00259820_id.3
WARNING: ==> id a00260323_id.1
WARNING: ==> id a00260430_id.3
WARNING: ==> id a00261735_id.3
WARNING: ==> id a01692222_id.5
WARNING: ==> id a01692512_id.3
WARNING: ==> id a02160291_id.2
WARNING: ==> id a02830223_id.2

noun:
WARNING: ==> id n00044900_id.3
WARNING: ==> id n00312266_id.4
WARNING: ==> id n00540701_id.7
WARNING: ==> id n00662017_id.8
WARNING: ==> id n00723547_id.7
WARNING: ==> id n00734107_id.3
WARNING: ==> id n00800121_id.1
WARNING: ==> id n01203277_id.4
WARNING: ==> id n01246334_id.7
WARNING: ==> id n01441117_id.7
WARNING: ==> id n01548865_id.7
WARNING: ==> id n01599269_id.3
WARNING: ==> id n01618356_id.2
WARNING: ==> id n01731137_id.4
WARNING: ==> id n01746565_id.2
WARNING: ==> id n02042759_id.2
WARNING: ==> id n02076402_id.4
WARNING: ==> id n02217050_id.2
WARNING: ==> id n02300797_id.6
WARNING: ==> id n02351686_id.4
WARNING: ==> id n02587051_id.3
WARNING: ==> id n02675885_id.3
WARNING: ==> id n02786984_id.1
WARNING: ==> id n03059366_id.2
WARNING: ==> id n03139640_id.3
WARNING: ==> id n03559373_id.3
WARNING: ==> id n03838899_id.5
WARNING: ==> id n03963645_id.2
WARNING: ==> id n04119478_id.6
WARNING: ==> id n04414675_id.5
WARNING: ==> id n04439840_id.5
WARNING: ==> id n04593524_id.2
WARNING: ==> id n04595028_id.2
WARNING: ==> id n04973669_id.5
WARNING: ==> id n04973816_id.5
WARNING: ==> id n06185748_id.8
WARNING: ==> id n06274760_id.4
WARNING: ==> id n06358159_id.3
WARNING: ==> id n06704115_id.7
WARNING: ==> id n07451903_id.3
WARNING: ==> id n07548978_id.3
WARNING: ==> id n07711683_id.2
WARNING: ==> id n07712959_id.4
WARNING: ==> id n07767344_id.4
WARNING: ==> id n08026539_id.15
WARNING: ==> id n08142370_id.16
WARNING: ==> id n08165866_id.3
WARNING: ==> id n08165866_id.1
WARNING: ==> id n08181658_id.3
WARNING: ==> id n08316564_id.1
WARNING: ==> id n08361720_id.3
WARNING: ==> id n08582157_id.4
WARNING: ==> id n08766988_id.1
WARNING: ==> id n09166902_id.8
WARNING: ==> id n09343266_id.4
WARNING: ==> id n09488584_id.4
WARNING: ==> id n09541125_id.4
WARNING: ==> id n09602484_id.3
WARNING: ==> id n09854708_id.2
WARNING: ==> id n09927305_id.2
WARNING: ==> id n09933842_id.3
WARNING: ==> id n10302700_id.4
WARNING: ==> id n10668024_id.9
WARNING: ==> id n10673296_id.7
WARNING: ==> id n10730820_id.8
WARNING: ==> id n11147924_id.9
WARNING: ==> id n11187930_id.6
WARNING: ==> id n11188123_id.5
WARNING: ==> id n11263180_id.3
WARNING: ==> id n11481209_id.3
WARNING: ==> id n11507797_id.3
WARNING: ==> id n11663449_id.3
WARNING: ==> id n11691332_id.7
WARNING: ==> id n11763142_id.3
WARNING: ==> id n11816336_id.12
WARNING: ==> id n11929880_id.1
WARNING: ==> id n12011838_id.9
WARNING: ==> id n12052787_id.12
WARNING: ==> id n12301445_id.4
WARNING: ==> id n12305819_id.4
WARNING: ==> id n12566627_id.3
WARNING: ==> id n12568506_id.4
WARNING: ==> id n12680125_id.5
WARNING: ==> id n12680125_id.3
WARNING: ==> id n13177048_id.9
WARNING: ==> id n13399379_id.4
WARNING: ==> id n13414554_id.9
WARNING: ==> id n13522485_id.2
WARNING: ==> id n13980288_id.3
WARNING: ==> id n14244003_id.4
WARNING: ==> id n14252184_id.4
WARNING: ==> id n14325006_id.7
WARNING: ==> id n14647907_id.5
WARNING: ==> id n14649775_id.13
WARNING: ==> id n14764715_id.3

verb:
WARNING: ==> id v00055142_id.2
WARNING: ==> id v00529411_id.3
WARNING: ==> id v00614444_id.3
WARNING: ==> id v01164081_id.2
WARNING: ==> id v01168259_id.2
WARNING: ==> id v01283893_id.3
WARNING: ==> id v01898769_id.1
WARNING: ==> id v02162310_id.3
WARNING: ==> id v02612762_id.7
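
These warnings flag <id> elements that carry a coll attribute but are not children of a <glob>. A minimal sketch of the check over the merged XML (file paths and layout are assumptions):

# Sketch: find <id> elements with a coll attribute whose parent is not <glob>.
import glob
import xml.etree.ElementTree as ET

for path in glob.glob("merged/*.xml"):
    root = ET.parse(path).getroot()
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter("id"):
        if elem.get("coll") is not None and parent_of[elem].tag != "glob":
            print("WARNING: ==> id", elem.get("id"))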

some glosses are repeated

We have repeated glosses in both PWN 3.0 and PWN 3.1:

  • 376 glosses in PWN 3.0 occur more than once; 289 of them occur exactly twice.
  • 363 glosses in PWN 3.1 occur more than once; 275 of them occur exactly twice.

One example:

  1. http://wn.mybluemix.net/synset?id=01156302-a (mellow)
  2. http://wn.mybluemix.net/synset?id=01492061-a (mellowed • mellow)

Counting duplicated glosses in the WordNet data files:

% cat ../WordNet-3.0/dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | head
  23  a variety of aster
  18  a branch of the Tai languages
  13  one species
  13  one of the British colonies that formed the United States
  11  a genus of bacteria
   9  an artificial language
   9  a radioactive transuranic element
   9  a genus of Mustelidae
   9  a Chadic language spoken south of Lake Chad
   8  a genus of Psittacidae

% cat ../WordNet-3.0/dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | awk '$1 > 1 {print $1}' | sort | uniq -c
   1 11
   2 13
   1 18
 289 2
   1 23
  40 3
  18 4
   9 5
   5 6
   5 7
   1 8
   4 9

For PWN 3.1

% cat ../WordNet-3.1-dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | head
  23  a variety of aster
  18  a branch of the Tai languages
  13  one species
  13  one of the British colonies that formed the United States
  11  a genus of bacteria
   9  an artificial language
   9  a radioactive transuranic element
   9  a genus of Mustelidae
   9  a Chadic language spoken south of Lake Chad
   8  a genus of Psittacidae

% cat ../WordNet-3.1-dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | awk '$1 > 1 {print $1}' | sort | uniq -c
   1 11
   2 13
   1 18
 275 2
   1 23
  40 3
  18 4
   9 5
   6 6
   5 7
   1 8
   4 9

plans

Ideas for the future:

  1. how to use the glosstag data to improve mappings from WN to PropBank and VerbNet
    ...

higher court is an MWE?

(n) appeal | ( law) a legal proceeding in which the appellant resorts to a higher court for the purpose of obtaining a review of a lower court decision and a reversal of the lower court's judgment or the granting of a new trial ; their appeal was denied in the superior court ;

(n) reversal | a judgment by a higher court that the judgment of a lower court was incorrect and should be set aside ;

(n) affirmation | a judgment by a higher court that the judgment of a lower court was correct and should stand ;

Is `higher court` a missing expression in http://wnpt.sl.res.ibm.com/wn/synset?id=08335751-n?

lemma `fresh water` vs `freshwater`

The lemma should be `fresh water`:

(:ofs "07776545" :pos "n" :keys
      (("freshwater_fish%1:13:00::" . "freshwater fish"))
      :gloss "flesh of fish from fresh water used as food" :tokens
      ((:kind :def :action :open)
       (:kind :wf :form "flesh" :lemma "flesh%1|flesh%2" :pos "NN" :tag "man" :senses
	       (("flesh%1:08:02::" . "flesh")))
       (:kind :wf :form "of" :lemma "of" :pos "IN" :tag "ignore")
       (:kind :wf :form "fish" :lemma "fish%1|fish%2" :pos "NN" :tag "man" :senses
	       (("fish%1:05:00::" . "fish")))
       (:kind :wf :form "from" :lemma "from" :pos "IN" :tag "ignore")
       (:kind
	 (:glob . "a")
	 :lemma "freshwater%1" :tag "man" :senses
	 (("freshwater%1:27:00::" . "freshwater"))
	 :glob "auto")
       (:kind
	 (:cf "a")
	 :form "fresh" :lemma "fresh%3|fresh%4" :pos "JJ" :tag "un")
       (:kind
	 (:cf "a")
	 :form "water" :lemma "water%1|water%2" :pos "NN" :tag "un")
       (:kind :wf :form "used" :lemma "use%2|used%3" :pos "VBN" :tag "un")
       (:kind :wf :form "as" :lemma "as" :pos "IN" :tag "ignore")
       (:kind :wf :form "food" :lemma "food%1" :pos "NN" :tag "un" :sep "")
       (:kind :wf :form ";" :pos ":" :tag "ignore" :type "punc")
       (:kind :def :action :close)))
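
A sketch for flagging similar cases in the JSON-lines export (field names follow the token dicts shown in earlier issues; paths are assumptions): report globs whose lemma contains no underscore even though the collocation spans more than one cf token.

# Sketch: find multi-word collocations whose glob lemma is a single word
# (e.g. 'freshwater%1' over the two tokens 'fresh water').
import glob
import json

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            tokens = json.loads(line).get("tokens", [])
            for tok in tokens:
                kind = tok.get("kind", [])
                if kind[:1] == ["glob"] and len(kind) > 1:
                    label = kind[1]
                    members = [t for t in tokens
                               if t.get("kind", [])[:1] == ["cf"] and label in t["kind"]]
                    lemmas = tok.get("lemmas", [])
                    if len(members) > 1 and lemmas and all("_" not in l for l in lemmas):
                        print(lemmas, [m.get("form") for m in members])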

export Mongo JSON to TSV error

Token ordinary at sentence adj.all-1044557 has inexisting sense ordinary%5:45:00:common:01

But

% grep "ordinary%" query-senses.csv
"ordinary",7,"ordinary%1:18:01::	ordinary%1:06:00::	ordinary%5:00:02:common:01	ordinary%1:26:00::	ordinary%1:18:00::ordinary%1:06:01::	ordinary%3:00:00::"

% grep "ordinary%" ~/work/wn/WordNet-3.0/dict/index.sense
ordinary%1:06:00:: 03853734 5 0
ordinary%1:06:01:: 03853924 4 0
ordinary%1:18:00:: 10382480 3 0
ordinary%1:18:01:: 10382380 1 3
ordinary%1:26:00:: 13942743 2 1
ordinary%3:00:00:: 01672607 1 28
ordinary%5:00:02:common:01 00486290 2 4

I checked the original glosstag files (merged dir) and 'ordinary' was not annotated in this sentence. So why did sensetion create this non-existent sense ordinary%5:45:00:common:01?
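
A sketch of a validation pass that checks every sense key in the export against WordNet's index.sense (paths and the JSON-lines layout are assumptions; the Mongo export could be checked the same way):

# Sketch: report sense keys in the export that do not exist in index.sense.
import glob
import json

with open("WordNet-3.0/dict/index.sense") as fh:
    known = {line.split()[0] for line in fh}

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                for sk in tok.get("senses") or []:
                    if isinstance(sk, (list, tuple)):   # some exports pair key and lemma
                        sk = sk[0]
                    if sk not in known:
                        print(path, tok.get("form"), sk)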
