
glosstag's Introduction

Open Portuguese WordNet (OWN-PT)

This repository hosts Portuguese WordNet data in textual format; it is an experimental branch of http://openwordnet-pt.org. It is linked to (but independent from) the Open English WordNet.

You can also get the data in JSON and RDF format.

See the Wiki for how the data was generated, how it compares to Princeton WordNet, and what the syntax of the text files is. This data is validated and exported by the mill tool — see its repository for more information about validation, export formats, etc.

glosstag's People

Contributors

arademaker, fcbond, hmuniz, odanoburu


glosstag's Issues

Error in original data causes information loss

<synset id="r00003483" ofs="00003483" pos="r">
 <terms>
  <term>basically</term>
  <term>fundamentally</term>
  <term>essentially</term>
 </terms>
 <keys>
  <sk>basically%4:02:00::</sk>
  <sk>fundamentally%4:02:00::</sk>
  <sk>essentially%4:02:01::</sk>
 </keys>
 <gloss desc="orig">
  <orig>in essence; at bottom or by one's (or its) very nature; "He is basically dishonest"; "the argument was essentially a technical one"; "for all his bluster he is in essence a shy person"</orig>
 </gloss>
 <gloss desc="text">
  <text>in essence ; at bottom or by one's ( or its ) very nature ; “ He is basically dishonest ” ; “ the argument was essentially a technical one ” ; “ for all his bluster he is in essence a shy person ”</text>
 </gloss>
 <gloss desc="wsd">
  <def id="r00003483_d">
   <wf id="r00003483_wf1" lemma="in" pos="IN" tag="ignore">in</wf>
   <wf id="r00003483_wf2" lemma="essence%1" pos="NN" sep="" tag="un">essence</wf>
   <wf id="r00003483_wf3" pos=":" tag="ignore" type="punc">;</wf>
   <cf coll="a" id="r00003483_wf4" lemma="at" pos="IN" tag="ignore">
    <glob coll="a" glob="man" id="r00003483_coll.a" lemma="at_bottom%4" tag="man">
     <id coll="b" id="r00003483_id.6" lemma="at bottom" sk="at_bottom%4:02:00::"/>
   </glob>at</cf>
   <cf coll="a" id="r00003483_wf5" lemma="bottom%1|bottom%2|bottom%3" pos="NN" tag="un">bottom</cf>
   <wf id="r00003483_wf6" lemma="or" pos="CC" tag="ignore">or</wf>
   <wf id="r00003483_wf7" lemma="by" pos="IN" tag="ignore">by</wf>
....
(:ofs "00003483" :pos "r" :keys (("essentially%4:02:01::" . "essentially")
				 ("fundamentally%4:02:00::" . "fundamentally")
				 ("basically%4:02:00::" . "basically"))
      :gloss "in essence; at bottom or by one's (or its) very nature; 
         \"He is basically dishonest\"; \"the argument was essentially a technical one\";  
        \"for all his bluster he is in essence a shy person\""
      :tokens ((:kind :def :action :open)
	       (:kind :wf :form "in" :lemma "in" :pos "IN" :tag "ignore")
	       (:kind :wf :form "essence" :lemma "essence%1" :pos "NN" :tag "un" :sep "")
	       (:kind :wf :form ";" :pos ":" :tag "ignore" :type "punc")
	       (:kind (:glob . "a") :lemma "at_bottom%4" :tag "man" :glob "man")
	       (:kind (:cf "a") :form "at" :lemma "at" :pos "IN" :tag "ignore")
	       (:kind (:cf "a") :form "bottom" :lemma "bottom%1|bottom%2|bottom%3" 
                   :pos "NN" :tag "un")
	       (:kind :wf :form "or" :lemma "or" :pos "CC" :tag "ignore")
	       (:kind :wf :form "by" :lemma "by" :pos "IN" :tag "ignore")
...

The glob in the plist version is not annotated. Is that right, or is it a bug in the conversion?
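
Below is a minimal sketch (the file path and element layout are assumptions based on the example above) for listing the globs in the merged XML whose <id> children carry a sense key, so they can be compared against the plist export:

# Sketch: list globs in the merged XML that carry sense keys (sk) on their
# <id> children, to compare against the plist/JSON export.
import xml.etree.ElementTree as ET

def globs_with_senses(path):
    for g in ET.parse(path).iter("glob"):
        senses = [i.get("sk") for i in g.findall("id") if i.get("sk")]
        yield g.get("id"), g.get("tag"), senses

# hypothetical path; the merged files are assumed to follow the layout above
for glob_id, tag, senses in globs_with_senses("merged/adv.xml"):
    if senses:
        print(glob_id, tag, senses)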

inconsistency in pre-POS and lemmas

  1. tokens without lemmas, as below. It looks like all of them are punctuation; we could add a lemma for them, equal to the form (see the sketch after this list).
{'form': ';', 'kind': ['wf'], 'meta': {'pos': ':', 'type': 'punc'}, 'tag': 'ignore', 'begin': 46, 'end': 47}
  2. tokens with lemmas without %N

    1. some with meta {'form': 'to', 'kind': ['wf'], 'lemmas': ['to'], 'meta': {'pos': 'TO'}, 'tag': 'ignore', 'begin': 50, 'end': 52}
    2. some without meta {'form': 'of', 'kind': ['wf'], 'lemmas': ['of'], 'tag': 'ignore', 'begin': 15, 'end': 17}
    3. some proper nouns {'form': 'Edmund', 'kind': ['wf'], 'lemmas': ['Edmund'], 'tag': 'un', 'begin': 41, 'end': 47}
    4. some words missing from PWN {'glob': 'man', 'kind': ['glob', 'c'], 'lemmas': ['flour_moths'], 'tag': 'un'}
    5. some annotated {'glob': 'man', 'kind': ['glob', 'b'], 'lemmas': ['appellate_court'], 'senses': ['appellate_court%1:14:00::'], 'tag': 'man'}
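
A sketch for scanning the JSON-lines export for both kinds of tokens (field names follow the dicts above; the data/*.jl paths are an assumption based on the jq command used in a later issue):

# Sketch: categorize tokens by lemma shape in the JSON-lines export.
import glob
import json

no_lemma, no_percent = [], []

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                lemmas = tok.get("lemmas")
                if not lemmas:
                    no_lemma.append(tok)
                elif any("%" not in lemma for lemma in lemmas):
                    no_percent.append(tok)

print(len(no_lemma), "tokens without lemmas")
print(len(no_percent), "tokens with lemmas lacking %N")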

court of record

It has a Wikipedia page: https://en.wikipedia.org/wiki/Court_of_record

A court of record is a trial court or appellate court in which a record of the proceedings is captured and preserved, for the possibility of appeal.

In the glosstag data we have some occurrences of `court of public records` and `court record`. Should we add a synset for it?

standoff files

Releases v0.1 and v0.2 changed only the merged files. Does it make sense to keep the standoff files consistent, i.e. to generate new standoff files from the new merged files?

tokenization issues

After #9:

  1. we still have cases where names like 'A.B.Fulano' end up in a single token
  2. we may have other tokens that need to be split; we can search for `.` or `-` inside token forms (see the sketch after this list)
  3. we have some cases of WF tokens with an explicit sep set to a space, even though space is the default separator; we should remove those cases and check that the detokenization approach still reproduces the text field
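
A sketch of the search mentioned in item 2 (the data/*.jl layout and field names are assumptions):

# Sketch: flag token forms that may still need splitting (a '.' or '-' inside
# the form), skipping punctuation tokens.
import glob
import json
import re

pattern = re.compile(r"\w[.-]\w")   # e.g. 'A.B.Fulano'

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                if tok.get("type") != "punc" and pattern.search(tok.get("form", "")):
                    print(tok["form"])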

`cross court` vs `cross` `court`

(n) return | a tennis stroke that sends the ball back to the other player ; he won the point on a cross court return ;

I opted for globbing `cross court` because many results from Google suggest this is an MWE in the tennis domain. But this MWE is missing from PWN.

tokenization

Why do we get tokenizations like

(n) ruling,opinion | the reason for a court's judgment ( as opposed to the decision itself) ;

and

(v) wash | admit to testing or proof ; This silly excuse wo n't wash in traffic court ;

assessment of the quality of ERG analyses

Executive summary:

Consider the spans that group predications and tokens for each sentence. In total, we have 1842193 such groups, and in only 49793 of them did I find an apparent POS inconsistency between ERG and the sense annotation.

49793/1842193 ≈ 0.03

Note that I only consider the tokens that were sense tagged. If we count per sentence, 38883 out of 159614 sentences contain at least one error. If we ignore the a/r mismatches (adverbs analysed as adjectives) and the q/n mismatches (someone), we have 28358 sentences with at least one error. If we also ignore the mismatches caused by verb/adjective, we have 17401 sentences:

38883/159614 ≈ 0.24
28358/159614 ≈ 0.18
17401/159614 ≈ 0.11

The dataset contains 165994 sentences, but not all of them got a parse from ERG.

Details:

For all sentences, I join the tokens with the MRS predicates using the spans.
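
A minimal sketch of the span join (the input structures are assumptions; in the listings below, both tokens and predications appear as tuples whose second and third fields are character offsets):

# Sketch: group sense-tagged tokens and MRS predications by (begin, end) span.
from collections import defaultdict

def join_by_span(entries):
    """Group entries (tuples whose 2nd and 3rd fields are offsets) by span."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry[1], entry[2])].append(entry)
    return groups

# groups = join_by_span(list(sense_tagged_tokens) + list(mrs_predications))
# A span is suspicious when the POS implied by the sense key (e.g. %2 = verb)
# disagrees with the POS of every ERG predicate sharing the span.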

In the example below, I found no conflict between ERG and the annotation. For instance, affect%2 means the token was annotated as a verb, and ERG made it the predicate _affect_v_1. For hydrarthrosis, the token was annotated as a noun, and ERG preprocessing instantiated a generic predicate from the NNS POS tag.

> START def hydrarthrosis affecting the knee
 (0,32) 0 => [('unknown', 0, 32, 'e2', 'h1', None), ('udef_q', 0, 32, 'q4', 'h5', None)]
 (0,13) 1 => [('hydrarthrosis', 0, 13, ['hydrarthrosis%1:26:00::'], ['wf'], 'NN'), ('_hydrarthrosis/nns_u_unknown', 0, 13, 'x4', 'h8', None)]
 (14,23) 1 => [('affecting', 14, 23, ['affect%2:29:00::'], ['wf'], 'VBG'), ('_affect_v_1', 14, 23, 'e9', 'h8', None)]
 (24,27) 1 => [('the', 24, 27, None, ['wf'], 'DT'), ('_the_q', 24, 27, 'q10', 'h11', None)]
 (28,32) 1 => [('knee', 28, 32, ['knee%1:08:00::'], ['wf'], 'NN'), ('_knee_n_1', 28, 32, 'x10', 'h14', None)]

Next, excess was annotated as an adjective (%5) but analysed as a noun by ERG. See the line starting with "D>".

> START def an abnormality of pregnancy; accumulation of excess amniotic fluid
D> {'n', 'a'} [('excess', 45, 51, ['excess%5:00:00:unnecessary:00'], ['wf'], 'JJ'), ('udef_q', 45, 51, 'q29', 'h30', None), ('_excess_n_1', 45, 51, 'x29', 'h33', None)]
 (0,66) 0 => [('implicit_conj', 0, 66, 'e2', 'h1', None)]
  (0,28) 1 => [('unknown', 0, 28, 'e4', 'h1', None)]
   (0,2) 2 => [('an', 0, 2, None, ['wf'], 'DT'), ('_a_q', 0, 2, 'q6', 'h7', None)]
   (3,14) 2 => [('abnormality', 3, 14, None, ['wf'], 'NN'), ('_abnormality_n_1', 3, 14, 'x6', 'h10', None)]
   (15,17) 2 => [('of', 15, 17, None, ['wf'], 'IN'), ('_of_p', 15, 17, 'e11', 'h10', None)]
   (18,28) 2 => [('udef_q', 18, 28, 'q12', 'h13', None)]
    (18,27) 3 => [('pregnancy', 18, 27, ['pregnancy%1:26:00::'], ['wf'], 'NN'), ('_pregnancy_n_1', 18, 27, 'x12', 'h16', None)]
    (27,28) 3 => [(';', 27, 28, None, ['wf'], 'punc')]
  (29,66) 1 => [('unknown', 29, 66, 'e5', 'h1', None), ('udef_q', 29, 66, 'q17', 'h18', None)]
   (29,41) 2 => [('accumulation', 29, 41, ['accumulation%1:22:00::'], ['wf'], 'NN'), ('_accumulation_n_of', 29, 41, 'x17', 'h21', None)]
   (42,44) 2 => [('of', 42, 44, None, ['wf'], 'IN')]
   (45,66) 2 => [('udef_q', 45, 66, 'q22', 'h23', None)]
    (45,60) 3 => [('compound', 45, 60, 'e27', 'h26', None)]
     (45,51) 4 => [('excess', 45, 51, ['excess%5:00:00:unnecessary:00'], ['wf'], 'JJ'), ('udef_q', 45, 51, 'q29', 'h30', None), ('_excess_n_1', 45, 51, 'x29', 'h33', None)]
     (52,60) 4 => [('amniotic', 52, 60, None, ['cf', 'a'], 'JJ'), ('_amniotic/jj_u_unknown', 52, 60, 'e28', 'h26', None)]
    (61,66) 3 => [('fluid', 61, 66, None, ['cf', 'a'], 'NN'), ('_fluid_n_1', 61, 66, 'x22', 'h26', None)]

ERG analyses adverbs and adjectives alike as adjuncts, so another common mismatch is a vs r. Should the fragment after the first semicolon, "equally balanced", be an example?

> START def a state of being essentially equal or equivalent; equally balanced; 
D> {'a', 'r'} [('essentially', 17, 28, ['essentially%4:02:01::'], ['wf'], 'RB'), ('_essential_a_1', 17, 28, 'e17', 'h16', None)]
D> {'n', 'a'} [('equivalent', 38, 48, ['equivalent%1:09:00::'], ['wf'], 'JJ'), ('_equivalent_a_to', 38, 48, 'e22', 'h16', None)]
 (0,67) 0 => [('implicit_conj', 0, 67, 'e2', 'h1', None)]
  (0,49) 1 => [('unknown', 0, 49, 'e4', 'h1', None)]
   (0,1) 2 => [('a', 0, 1, None, ['wf'], 'DT'), ('_a_q', 0, 1, 'q6', 'h7', None)]
   (2,7) 2 => [('state', 2, 7, ['state%1:03:00::'], ['wf'], 'NN'), ('_state_n_of', 2, 7, 'x6', 'h10', None)]
   (8,10) 2 => [('of', 8, 10, None, ['wf'], 'IN')]
   (11,49) 2 => [('udef_q', 11, 49, 'q11', 'h12', None), ('nominalization', 11, 49, 'x11', 'h15', None)]
    (11,16) 3 => [('being', 11, 16, None, ['wf'], 'VBG')]
    (17,28) 3 => [('essentially', 17, 28, ['essentially%4:02:01::'], ['wf'], 'RB'), ('_essential_a_1', 17, 28, 'e17', 'h16', None)]
    (29,34) 3 => [('equal', 29, 34, None, ['wf'], 'JJ'), ('_equal_a_to', 29, 34, 'e18', 'h16', None)]
    (35,37) 3 => [('or', 35, 37, None, ['wf'], 'CC'), ('_or_c', 35, 37, 'e21', 'h16', None)]
    (38,48) 3 => [('equivalent', 38, 48, ['equivalent%1:09:00::'], ['wf'], 'JJ'), ('_equivalent_a_to', 38, 48, 'e22', 'h16', None)]
    (48,49) 3 => [(';', 48, 49, None, ['wf'], 'punc')]
  (50,67) 1 => [('unknown', 50, 67, 'e5', 'h1', None)]
   (50,57) 2 => [('equally', 50, 57, None, ['wf'], 'RB'), ('_equal_a_to', 50, 57, 'e25', 'h1', None)]
   (58,66) 2 => [('balanced', 58, 66, ['balance%2:42:00::'], ['wf'], 'VBN'), ('_balance_v_1', 58, 66, 'e26', 'h1', None)]
   (66,67) 2 => [(';', 66, 67, None, ['wf'], 'punc')]

Adjective vs verb:

> START def the condition of being reinstated; 
D> {'v', 'a'} [('reinstated', 23, 33, ['reinstate%2:41:00::'], ['wf'], 'VBN'), ('_instate_v_1', 23, 33, 'e15', 'h14', None), ('_re-_a_again', 23, 33, 'e18', 'h14', None)]
 (0,34) 0 => [('unknown', 0, 34, 'e2', 'h1', None)]
  (0,3) 1 => [('the', 0, 3, None, ['wf'], 'DT'), ('_the_q', 0, 3, 'q4', 'h5', None)]
  (4,13) 1 => [('condition', 4, 13, ['condition%1:26:00::'], ['wf'], 'NN'), ('_condition_n_of', 4, 13, 'x4', 'h8', None)]
  (14,16) 1 => [('of', 14, 16, None, ['wf'], 'IN')]
  (17,34) 1 => [('udef_q', 17, 34, 'q9', 'h10', None), ('nominalization', 17, 34, 'x9', 'h13', None)]
   (17,22) 2 => [('being', 17, 22, None, ['wf'], 'VBG')]
   (23,33) 2 => [('reinstated', 23, 33, ['reinstate%2:41:00::'], ['wf'], 'VBN'), ('_instate_v_1', 23, 33, 'e15', 'h14', None), ('_re-_a_again', 23, 33, 'e18', 'h14', None)]
   (33,34) 2 => [(';', 33, 34, None, ['wf'], 'punc')]

Someone vs person + some_q (1829 cases). I need to improve my check to remove these from the suspicious cases.

> START def a situation of being uncomfortably close to someone or something
D> {'a', 'r'} [('uncomfortably', 21, 34, ['uncomfortably%4:02:00::'], ['wf'], 'RB'), ('_uncomfortable_a_1', 21, 34, 'e16', 'h15', None)]
D> {'q', 'n'} [('someone', 44, 51, ['someone%1:03:00::'], ['wf'], 'NN'), ('person', 44, 51, 'x24', 'h23', None), ('_some_q', 44, 51, 'q24', 'h25', None)]
 (0,64) 0 => [('unknown', 0, 64, 'e2', 'h1', None)]
  (0,1) 1 => [('a', 0, 1, None, ['wf'], 'DT'), ('_a_q', 0, 1, 'q4', 'h5', None)]
  (2,11) 1 => [('situation', 2, 11, ['situation%1:15:00::'], ['wf'], 'NN'), ('_situation_n_1', 2, 11, 'x4', 'h8', None)]
  (12,14) 1 => [('of', 12, 14, None, ['wf'], 'IN'), ('_of_p', 12, 14, 'e9', 'h8', None)]
  (15,64) 1 => [('udef_q', 15, 64, 'q10', 'h11', None), ('nominalization', 15, 64, 'x10', 'h14', None)]
   (15,20) 2 => [('being', 15, 20, None, ['wf'], 'VBG')]
   (21,34) 2 => [('uncomfortably', 21, 34, ['uncomfortably%4:02:00::'], ['wf'], 'RB'), ('_uncomfortable_a_1', 21, 34, 'e16', 'h15', None)]
   (35,40) 2 => [('close', 35, 40, None, ['wf'], 'JJ'), ('_close_a_to', 35, 40, 'e17', 'h15', None)]
   (41,43) 2 => [('to', 41, 43, None, ['wf'], 'TO')]
   (44,64) 2 => [('udef_q', 44, 64, 'q19', 'h20', None)]
    (44,51) 3 => [('someone', 44, 51, ['someone%1:03:00::'], ['wf'], 'NN'), ('person', 44, 51, 'x24', 'h23', None), ('_some_q', 44, 51, 'q24', 'h25', None)]
    (52,54) 3 => [('or', 52, 54, None, ['wf'], 'CC'), ('_or_c', 52, 54, 'x19', 'h28', None)]
    (55,64) 3 => [('something', 55, 64, None, ['wf'], 'PRP'), ('thing', 55, 64, 'x29', 'h30', None), ('_some_q', 55, 64, 'q29', 'h31', None)]

What is 'especially' below? It is tagged as an adverb, but in the ERG analysis it gets part of speech x?

> START def the relative position or standing of things or especially persons in a society; 
D> {'x', 'r'} [('especially', 47, 57, ['especially%4:02:01::'], ['wf'], 'RB'), ('_especially_x_deg', 47, 57, 'e35', 'h34', None)]
 (0,79) 0 => [('unknown', 0, 79, 'e2', 'h1', None), ('udef_q', 0, 79, 'q4', 'h5', None)]
  (0,3) 1 => [('the', 0, 3, None, ['wf'], 'DT'), ('_the_q', 0, 3, 'q9', 'h8', None)]
  (4,12) 1 => [('relative', 4, 12, ['relative%3:00:00::'], ['wf'], 'JJ'), ('_relative_a_to', 4, 12, 'e13', 'h12', None)]
  (13,21) 1 => [('position', 13, 21, None, ['wf'], 'NN'), ('udef_q', 13, 21, 'q16', 'h15', None), ('_position_n_of', 13, 21, 'x16', 'h19', None)]
  (22,33) 1 => [('udef_q', 22, 33, 'q21', 'h20', None)]
   (22,24) 2 => [('or', 22, 24, None, ['wf'], 'CC'), ('_or_c', 22, 24, 'x9', 'h12', None)]
   (25,33) 2 => [('standing', 25, 33, None, ['wf'], 'NN'), ('_standing_n_1', 25, 33, 'x21', 'h24', None)]
  (34,36) 1 => [('of', 34, 36, None, ['wf'], 'IN'), ('_of_p', 34, 36, 'e25', 'h12', None)]
  (37,43) 1 => [('things', 37, 43, ['thing%1:06:01::'], ['wf'], 'NNS'), ('udef_q', 37, 43, 'q26', 'h27', None), ('_thing_n_of-about', 37, 43, 'x26', 'h30', None)]
  (44,46) 1 => [('or', 44, 46, None, ['wf'], 'CC'), ('_or_c', 44, 46, 'x4', 'h32', None)]
  (47,57) 1 => [('especially', 47, 57, ['especially%4:02:01::'], ['wf'], 'RB'), ('_especially_x_deg', 47, 57, 'e35', 'h34', None)]
  (58,79) 1 => [('udef_q', 58, 79, 'q33', 'h34', None)]
   (58,65) 2 => [('persons', 58, 65, ['person%1:03:00::'], ['wf'], 'NNS'), ('_person_n_1', 58, 65, 'x33', 'h39', None)]
   (66,68) 2 => [('in', 66, 68, None, ['wf'], 'IN'), ('_in_p_loc', 66, 68, 'e40', 'h39', None)]
   (69,70) 2 => [('a', 69, 70, None, ['wf'], 'DT'), ('_a_q', 69, 70, 'q41', 'h42', None)]
   (71,78) 2 => [('society', 71, 78, None, ['wf'], 'NN'), ('_society_n_of', 71, 78, 'x41', 'h45', None)]
   (78,79) 2 => [(';', 78, 79, None, ['wf'], 'punc')]

court dance

(n) courante | a court dance of the 16th century ; consisted of short advances and retreats ;
(n) pavan,pavane | a stately court dance of the 16th and 17th centuries ;
(n) saraband | a stately court dance of the 17th and 18th centuries ; in slow time ;
(n) minuet | a stately court dance in the 17th century ;

There are many cases; it looks like a fixed expression.

https://www.google.com/search?client=safari&rls=en&q=court+dance&ie=UTF-8&oe=UTF-8

It could be a missing expression under one of these:

http://wnpt.sl.res.ibm.com/wn/synset?id=00534849-n < social_dancing

http://wnpt.sl.res.ibm.com/wn/synset?id=08253141-n < party

http://wnpt.sl.res.ibm.com/wn/synset?id=07448717-n < party duplicated?

a subconcept of http://wnpt.sl.res.ibm.com/wn/synset?id=00428270-n, the act

http://wnpt.sl.res.ibm.com/wn/synset?id=00537682-n (another synset for the royal courts dancing)

http://wnpt.sl.res.ibm.com/wn/synset?id=00532110-n - this is the act

dark brown

(n) Pinus strobiformis,southwestern white pine | medium-size pine of northwestern Mexico ; bark is dark brown and furrowed when mature ;

Many cases of `dark brown` are marked as a glob and sense tagged to http://wnpt.sl.res.ibm.com/wn/synset?id=00372111-a, but this synset doesn't have the lemma `dark brown` (with a space).
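
One way to double-check is to look up the synset's lemmas, e.g. with NLTK's bundled WordNet 3.0 (this assumes the id above is a WordNet 3.0 offset):

# Sketch: print the lemmas of synset 00372111-a to confirm whether a
# 'dark brown' / 'dark_brown' lemma exists. Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

syn = wn.synset_from_pos_and_offset("a", 372111)
print(syn.name(), syn.lemma_names())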

punctuations

Punctuation tokens can be represented in a simpler way:

% jq -c -S ".tokens | .[] " data/*.jl | rg "\"punc\""  | sort | uniq -c | sort -nr
50891 {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
48354 {"form":"“","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
47610 {"form":"”","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
15710 {"form":";","kind":["wf"],"tag":"ignore","type":"punc"}
14024 {"form":"(","kind":["wf"],"pos":"(","sep":"","tag":"ignore","type":"punc"}
8606 {"form":")","kind":["wf"],"pos":")","sep":"","tag":"ignore","type":"punc"}

We could avoid repetition and save some space, considering that we have 210429 tokens of type punc. For example, (a) could become (b):

a. {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
b. {"form":";","kind":["wf"],"tag":"ignore","pos":"punc"}

glob error: add glob

(a) incumbent | lying or aleaning aon something else ; an incumbent bgeological bformation ;

@coll attribute in ID tag outside glob

adj:
WARNING: ==> id a00259820_id.3
WARNING: ==> id a00260323_id.1
WARNING: ==> id a00260430_id.3
WARNING: ==> id a00261735_id.3
WARNING: ==> id a01692222_id.5
WARNING: ==> id a01692512_id.3
WARNING: ==> id a02160291_id.2
WARNING: ==> id a02830223_id.2

noun:
WARNING: ==> id n00044900_id.3
WARNING: ==> id n00312266_id.4
WARNING: ==> id n00540701_id.7
WARNING: ==> id n00662017_id.8
WARNING: ==> id n00723547_id.7
WARNING: ==> id n00734107_id.3
WARNING: ==> id n00800121_id.1
WARNING: ==> id n01203277_id.4
WARNING: ==> id n01246334_id.7
WARNING: ==> id n01441117_id.7
WARNING: ==> id n01548865_id.7
WARNING: ==> id n01599269_id.3
WARNING: ==> id n01618356_id.2
WARNING: ==> id n01731137_id.4
WARNING: ==> id n01746565_id.2
WARNING: ==> id n02042759_id.2
WARNING: ==> id n02076402_id.4
WARNING: ==> id n02217050_id.2
WARNING: ==> id n02300797_id.6
WARNING: ==> id n02351686_id.4
WARNING: ==> id n02587051_id.3
WARNING: ==> id n02675885_id.3
WARNING: ==> id n02786984_id.1
WARNING: ==> id n03059366_id.2
WARNING: ==> id n03139640_id.3
WARNING: ==> id n03559373_id.3
WARNING: ==> id n03838899_id.5
WARNING: ==> id n03963645_id.2
WARNING: ==> id n04119478_id.6
WARNING: ==> id n04414675_id.5
WARNING: ==> id n04439840_id.5
WARNING: ==> id n04593524_id.2
WARNING: ==> id n04595028_id.2
WARNING: ==> id n04973669_id.5
WARNING: ==> id n04973816_id.5
WARNING: ==> id n06185748_id.8
WARNING: ==> id n06274760_id.4
WARNING: ==> id n06358159_id.3
WARNING: ==> id n06704115_id.7
WARNING: ==> id n07451903_id.3
WARNING: ==> id n07548978_id.3
WARNING: ==> id n07711683_id.2
WARNING: ==> id n07712959_id.4
WARNING: ==> id n07767344_id.4
WARNING: ==> id n08026539_id.15
WARNING: ==> id n08142370_id.16
WARNING: ==> id n08165866_id.3
WARNING: ==> id n08165866_id.1
WARNING: ==> id n08181658_id.3
WARNING: ==> id n08316564_id.1
WARNING: ==> id n08361720_id.3
WARNING: ==> id n08582157_id.4
WARNING: ==> id n08766988_id.1
WARNING: ==> id n09166902_id.8
WARNING: ==> id n09343266_id.4
WARNING: ==> id n09488584_id.4
WARNING: ==> id n09541125_id.4
WARNING: ==> id n09602484_id.3
WARNING: ==> id n09854708_id.2
WARNING: ==> id n09927305_id.2
WARNING: ==> id n09933842_id.3
WARNING: ==> id n10302700_id.4
WARNING: ==> id n10668024_id.9
WARNING: ==> id n10673296_id.7
WARNING: ==> id n10730820_id.8
WARNING: ==> id n11147924_id.9
WARNING: ==> id n11187930_id.6
WARNING: ==> id n11188123_id.5
WARNING: ==> id n11263180_id.3
WARNING: ==> id n11481209_id.3
WARNING: ==> id n11507797_id.3
WARNING: ==> id n11663449_id.3
WARNING: ==> id n11691332_id.7
WARNING: ==> id n11763142_id.3
WARNING: ==> id n11816336_id.12
WARNING: ==> id n11929880_id.1
WARNING: ==> id n12011838_id.9
WARNING: ==> id n12052787_id.12
WARNING: ==> id n12301445_id.4
WARNING: ==> id n12305819_id.4
WARNING: ==> id n12566627_id.3
WARNING: ==> id n12568506_id.4
WARNING: ==> id n12680125_id.5
WARNING: ==> id n12680125_id.3
WARNING: ==> id n13177048_id.9
WARNING: ==> id n13399379_id.4
WARNING: ==> id n13414554_id.9
WARNING: ==> id n13522485_id.2
WARNING: ==> id n13980288_id.3
WARNING: ==> id n14244003_id.4
WARNING: ==> id n14252184_id.4
WARNING: ==> id n14325006_id.7
WARNING: ==> id n14647907_id.5
WARNING: ==> id n14649775_id.13
WARNING: ==> id n14764715_id.3

verb:
WARNING: ==> id v00055142_id.2
WARNING: ==> id v00529411_id.3
WARNING: ==> id v00614444_id.3
WARNING: ==> id v01164081_id.2
WARNING: ==> id v01168259_id.2
WARNING: ==> id v01283893_id.3
WARNING: ==> id v01898769_id.1
WARNING: ==> id v02162310_id.3
WARNING: ==> id v02612762_id.7
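
These warnings flag <id> elements that carry a coll attribute but are not children of a <glob>. A minimal sketch of the check over the merged XML (file paths and layout are assumptions):

# Sketch: find <id> elements with a coll attribute whose parent is not <glob>.
import glob
import xml.etree.ElementTree as ET

for path in glob.glob("merged/*.xml"):
    root = ET.parse(path).getroot()
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter("id"):
        if elem.get("coll") is not None and parent_of[elem].tag != "glob":
            print("WARNING: ==> id", elem.get("id"))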

some glosses are repeated

We have repeated glosses in both PWN 3.0 and PWN 3.1:

  • 376 glosses in PWN 3.0 occur more than once; 289 of them occur exactly twice.
  • 363 glosses in PWN 3.1 occur more than once; 275 of them occur exactly twice.

One example:

  1. http://wn.mybluemix.net/synset?id=01156302-a (mellow)
  2. http://wn.mybluemix.net/synset?id=01492061-a (mellowed • mellow)

Counting duplicated glosses in the WordNet data files:

% cat ../WordNet-3.0/dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | head
  23  a variety of aster
  18  a branch of the Tai languages
  13  one species
  13  one of the British colonies that formed the United States
  11  a genus of bacteria
   9  an artificial language
   9  a radioactive transuranic element
   9  a genus of Mustelidae
   9  a Chadic language spoken south of Lake Chad
   8  a genus of Psittacidae

% cat ../WordNet-3.0/dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | awk '$1 > 1 {print $1}' | sort | uniq -c
   1 11
   2 13
   1 18
 289 2
   1 23
  40 3
  18 4
   9 5
   5 6
   5 7
   1 8
   4 9

For PWN 3.1

% cat ../WordNet-3.1-dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | head
  23  a variety of aster
  18  a branch of the Tai languages
  13  one species
  13  one of the British colonies that formed the United States
  11  a genus of bacteria
   9  an artificial language
   9  a radioactive transuranic element
   9  a genus of Mustelidae
   9  a Chadic language spoken south of Lake Chad
   8  a genus of Psittacidae

% cat ../WordNet-3.1-dict/data.* | awk -F "|" '$0 ~ /^[0-9]/ {print $2}' | sort | uniq -c | sort -nr | awk '$1 > 1 {print $1}' | sort | uniq -c
   1 11
   2 13
   1 18
 275 2
   1 23
  40 3
  18 4
   9 5
   6 6
   5 7
   1 8
   4 9

plans

Ideas for the future:

  1. how to use the glosstag data to improve mappings from WN to PropBank and VerbNet
    ...

higher court is an MWE?

(n) appeal | ( law) a legal proceeding in which the appellant resorts to a higher court for the purpose of obtaining a review of a lower court decision and a reversal of the lower court's judgment or the granting of a new trial ; their appeal was denied in the superior court ;

(n) reversal | a judgment by a higher court that the judgment of a lower court was incorrect and should be set aside ;

(n) affirmation | a judgment by a higher court that the judgment of a lower court was correct and should stand ;

Is `higher court` a missing expression in http://wnpt.sl.res.ibm.com/wn/synset?id=08335751-n?

lemma `fresh water` vs `freshwater`

The lemma should be `fresh water`:

(:ofs "07776545" :pos "n" :keys
      (("freshwater_fish%1:13:00::" . "freshwater fish"))
      :gloss "flesh of fish from fresh water used as food" :tokens
      ((:kind :def :action :open)
       (:kind :wf :form "flesh" :lemma "flesh%1|flesh%2" :pos "NN" :tag "man" :senses
	       (("flesh%1:08:02::" . "flesh")))
       (:kind :wf :form "of" :lemma "of" :pos "IN" :tag "ignore")
       (:kind :wf :form "fish" :lemma "fish%1|fish%2" :pos "NN" :tag "man" :senses
	       (("fish%1:05:00::" . "fish")))
       (:kind :wf :form "from" :lemma "from" :pos "IN" :tag "ignore")
       (:kind
	 (:glob . "a")
	 :lemma "freshwater%1" :tag "man" :senses
	 (("freshwater%1:27:00::" . "freshwater"))
	 :glob "auto")
       (:kind
	 (:cf "a")
	 :form "fresh" :lemma "fresh%3|fresh%4" :pos "JJ" :tag "un")
       (:kind
	 (:cf "a")
	 :form "water" :lemma "water%1|water%2" :pos "NN" :tag "un")
       (:kind :wf :form "used" :lemma "use%2|used%3" :pos "VBN" :tag "un")
       (:kind :wf :form "as" :lemma "as" :pos "IN" :tag "ignore")
       (:kind :wf :form "food" :lemma "food%1" :pos "NN" :tag "un" :sep "")
       (:kind :wf :form ";" :pos ":" :tag "ignore" :type "punc")
       (:kind :def :action :close)))
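
A sketch for flagging similar cases in the JSON-lines export (field names follow the token dicts shown in earlier issues; paths are assumptions): report globs whose lemma contains no underscore even though the collocation spans more than one cf token.

# Sketch: find multi-word collocations whose glob lemma is a single word
# (e.g. 'freshwater%1' over the two tokens 'fresh water').
import glob
import json

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            tokens = json.loads(line).get("tokens", [])
            for tok in tokens:
                kind = tok.get("kind", [])
                if kind[:1] == ["glob"] and len(kind) > 1:
                    label = kind[1]
                    members = [t for t in tokens
                               if t.get("kind", [])[:1] == ["cf"] and label in t["kind"]]
                    lemmas = tok.get("lemmas", [])
                    if len(members) > 1 and lemmas and all("_" not in l for l in lemmas):
                        print(lemmas, [m.get("form") for m in members])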

export Mongo JSON to TSV error

Token ordinary at sentence adj.all-1044557 has inexisting sense ordinary%5:45:00:common:01

But

% grep "ordinary%" query-senses.csv
"ordinary",7,"ordinary%1:18:01::	ordinary%1:06:00::	ordinary%5:00:02:common:01	ordinary%1:26:00::	ordinary%1:18:00::ordinary%1:06:01::	ordinary%3:00:00::"

% grep "ordinary%" ~/work/wn/WordNet-3.0/dict/index.sense
ordinary%1:06:00:: 03853734 5 0
ordinary%1:06:01:: 03853924 4 0
ordinary%1:18:00:: 10382480 3 0
ordinary%1:18:01:: 10382380 1 3
ordinary%1:26:00:: 13942743 2 1
ordinary%3:00:00:: 01672607 1 28
ordinary%5:00:02:common:01 00486290 2 4

I checked the original glosstag files (merged dir) and 'ordinary' was not annotated in this sentence. So why did sensetion create this non-existent sense ordinary%5:45:00:common:01?
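
A sketch of a validation pass that checks every sense key in the export against WordNet's index.sense (paths and the JSON-lines layout are assumptions; the Mongo export could be checked the same way):

# Sketch: report sense keys in the export that do not exist in index.sense.
import glob
import json

with open("WordNet-3.0/dict/index.sense") as fh:
    known = {line.split()[0] for line in fh}

for path in glob.glob("data/*.jl"):
    with open(path) as fh:
        for line in fh:
            for tok in json.loads(line).get("tokens", []):
                for sk in tok.get("senses") or []:
                    if isinstance(sk, (list, tuple)):   # some exports pair key and lemma
                        sk = sk[0]
                    if sk not in known:
                        print(path, tok.get("form"), sk)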
