
mill's Introduction

Open Portuguese WordNet (OWN-PT)

This repository hosts Portuguese WordNet data in textual format; it is an experimental branch of http://openwordnet-pt.org. It is linked to (but independent from) the Open English WordNet.

You can also get the data in JSON and RDF format.

See the Wiki for how the data was generated, how it compares to Princeton WordNet, and what the syntax of the text files is. This data is validated and exported by the mill tool — see its repository for more information about validation, export formats, etc.

mill's People

Contributors

arademaker, hmuniz, odanoburu


mill's Issues

add JSON-LD export

this would help eliminate a python dependency and an export backend (since we'd get the RDF by joining the JSON output with the JSON-LD context)

this would probably require flattening the relations in the synset document -- having each of them at the top level instead of under the relations key. senses would be embedded in the synset (as now).

(we could use https://github.com/digitalbazaar/pyld for processing when needed, but this wouldn't be an actual dependency)
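
As a rough illustration of the flattening step, here is a minimal sketch using aeson (assuming aeson >= 2.0); it is not mill's actual code, and the key names are only illustrative:

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Object, Value(..))
import qualified Data.Aeson.KeyMap as KM

-- move every relation under the "relations" key up to the top level of the
-- synset object, leaving the other keys (senses, definition, ...) untouched
flattenRelations :: Object -> Object
flattenRelations synset =
  case KM.lookup "relations" synset of
    Just (Object rels) -> KM.union rels (KM.delete "relations" synset)
    _                  -> synset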

add nomlex data

these are in the original RDF but not yet in the text format

installation of haskell

I executed

$ stack build --copy-bins
Specified file "ChangeLog.md" for extra-source-files does not exist
Specified source-dir "test" does not exist
Multiple .cabal files found in directory /Users/ar/work/wn/mill/:
- mill.cabal
- wntext.cabal

what does it mean? Did it finish the installation or not?

support multiple languages

branch multilang has a simple implementation where we don't actually change anything about the representation of the synsets or their relations: we simply allow lexicographer files in different directories (one for each language). this approach is nice because it doesn't create any special cases for English (or any other language), but it raises the following questions:

- figure out export commands:
- [ ] for WNDB, should we strip out interlingual relations? (yes, but how to do it cleanly?)
- [ ] for JSON, should we lump everything in one output, or should we generate one file per language?

branch multilingual now implements an approach where each WN has a name and this name is part of {synset,wordsense} identifiers; when exporting we may choose to restrict output to one language or to output everything.
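
A rough sketch of what identifiers carrying the wordnet name could look like; the type and field names here are only illustrative, not the ones actually used in the multilingual branch:

type WNName  = String  -- e.g. "en" or "pt"
type LexFile = String

data SynsetKey = SynsetKey
  { wordnet   :: WNName
  , lexFile   :: LexFile
  , firstWord :: String  -- lexical form of the synset's first word sense
  , lexId     :: Int
  } deriving (Show, Eq, Ord)

-- restricting an export to one wordnet is then just a filter on the key
restrictTo :: WNName -> [SynsetKey] -> [SynsetKey]
restrictTo name = filter ((== name) . wordnet)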

validation

subsumes https://github.com/own-pt/own-en/issues/38

references

  • targets exist
  • pointers exist
  • frames exist (and only verbs specify them)
  • synset relations exist
  • word relations exist
  • no duplicate references (two different things with the same name -- if the lexical ids are correct, does this ever happen?)
  • check symmetric relations? (I think it's good to have the redundancy)
    • probably best left to RDF schema validation
  • check there's a head synset to adjective satellites (similarTo relation)
select ?as
where {
  ?as rdf:type ns1:AdjectiveSatelliteSynset .
  FILTER NOT EXISTS { ?as ns1:similarTo ?hs . }
} LIMIT 10
  • check if relation is used in the proper place (word relations aren't used in synset positions and vice-versa)

sorting

  • check if word senses are sorted
  • check if synsets are sorted
  • check if word relations are sorted
  • check if synset relations are sorted
  • check if frames are sorted
  • check if examples are sorted
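
The sortedness checks above could all share a single generic helper; a minimal sketch (not mill's actual code, and the key functions named in the comment are only examples):

-- True when the list is already sorted by the given key, e.g.
-- isSortedOn lexicalForm wordSenses or isSortedOn relationName synsetRelations
isSortedOn :: Ord b => (a -> b) -> [a] -> Bool
isSortedOn key xs = and (zipWith (<=) keys (drop 1 keys))
  where keys = map key xs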

other

  • lexical indexing is correct
    • word sense that comes first has lower lexical id
    • no two occurrences of the same word with the same lexical id (see the sketch below)

please add more suggestions.
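
For the lexical indexing checks, here is a minimal sketch of the duplicate detection; the WordSense type and field names are hypothetical, not mill's:

import Data.Function (on)
import Data.List (groupBy, sortOn)

data WordSense = WordSense
  { lexicalForm :: String
  , lexicalId   :: Int
  } deriving (Show, Eq)

-- groups of word senses within one lexicographer file that share both
-- lexical form and lexical id, i.e. clashes the validator should report
duplicateLexicalIds :: [WordSense] -> [[WordSense]]
duplicateLexicalIds =
  filter ((> 1) . length) . groupBy ((==) `on` key) . sortOn key
  where
    key ws = (lexicalForm ws, lexicalId ws)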

more flexible configuration

validation

  • allow user to specify configuration files or directory where to find them
  • improve relations.tsv to contain information about domain and counterdomain, and check those at validation phase

RDF generation

  • allow user to specify base IRI instead of hardcoding it
    • base IRI for each wordnet is specified in wn.tsv, but this information is not used. the base IRI for the predicates should be specified as a flag.

reading

  • probably better to improve the configuration data types to decrease duplication (instead of several Map a b, use a single Map a (b, c, d, e); see the sketch below)
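
A rough sketch of merging the parallel maps into a single record-valued map; the type and field names are illustrative, not mill's actual configuration types:

import qualified Data.Map as Map

type RelationName = String

data PoS = Noun | Verb | Adj | Adv
  deriving (Show, Eq)

-- instead of several parallel maps keyed by the relation name ...
type CodesByName   = Map.Map RelationName String
type DomainsByName = Map.Map RelationName [PoS]

-- ... keep a single map to a record holding all per-relation information,
-- which also gives a natural place for domain/counterdomain checks
data RelationInfo = RelationInfo
  { relCode          :: String  -- the code used in the text files
  , relDomain        :: [PoS]   -- parts of speech allowed as source
  , relCounterdomain :: [PoS]   -- parts of speech allowed as target
  } deriving (Show)

type Relations = Map.Map RelationName RelationInfo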

mill error in the export

leme:mill ar$ mill export --json data/ wnX.json
Wrong number of fields in relations.tsv
...
/Users/ar/work/wn/mill/data/verb.weather:13018:13019: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13189:13190: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13339:13340: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13464:13465: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13766:13767: unexpected "hypo"

/Users/ar/work/wn/mill/data/verb.weather:13791:13792: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13984:13985: unexpected "hypo"

/Users/ar/work/wn/mill/data/verb.weather:14003:14004: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14100:14101: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14269:14270: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14537:14538: unexpected "hypo"

correct wn2text sorting

  • output word senses in the proper order
  • output synsets in the proper order (within a lexicographer file)

the latter is questionable, since it prevents the lexicographer from writing the synsets in an order that might make more sense; however, it is useful for stability

add restricted comment syntax

we can't allow comments just anywhere, since that would break serialization (comments and other metadata will be present in the full versions of the export formats; we may also export leaner versions for end-users not interested in re-serializing to the original text format)

mill mode

  • fontification
  • indentation
  • go to definition
  • show related synsets in two different wordnets side-by-side
  • find references
  • checker
  • list
    • frames (if you press ENTER it inserts the frame number for you, several frames may be picked; press q to exit) M-x mill-list-frames
    • relations (if you press ENTER it inserts the relation code for you; press q to exit; by default this will only show you relations that match the current PoS and obj -- synset or wordsense) M-x mill-list-relations
  • interactive features:
    • new synset creation
    • add frame/relation (show help text)
  • fill definition/example automatically (normally bound to M-q) (how to implement it?)
  • syntax-highlight:
    • syntactic markers
    • comments (properly -- currently comments will be highlighted even where they are not allowed)

export swapping adjective and adjective satellite in sense_key

I used mill export over the OWN-EN files and it worked fine for most cases. When it deals with adjectives, though, it seems to swap the classification of adjective and adjective satellite in the sense key. This changes the sense_key significantly, mainly because of the head_word and head_id fields.

This effect on the sense_key makes it harder to map synset_ids between different versions of OWN (both EN and PT).
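
For reference, PWN sense keys follow the layout lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type is 3 for adjectives and 5 for adjective satellites, and head_word/head_id are only filled in for satellites. A minimal sketch of how the two cases differ (not mill's actual code):

import Text.Printf (printf)

-- ss_type 3 (adjective) has empty head fields; ss_type 5 (satellite) carries
-- the head word and head id, so swapping the two types changes several fields
data AdjKind
  = Adjective             -- ss_type 3
  | Satellite String Int  -- ss_type 5, with head_word and head_id

senseKey :: String -> Int -> Int -> AdjKind -> String
senseKey lemma lexFilenum lexId kind =
  case kind of
    Adjective ->
      printf "%s%%3:%02d:%02d::" lemma lexFilenum lexId
    Satellite headWord headId ->
      printf "%s%%5:%02d:%02d:%s:%02d" lemma lexFilenum lexId headWord headId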

Reflexivity

When it comes to relations that form inverse pairs, mill should expect only one member of the pair to be in the file, not both.

For instance, if A has B as a hypernym (hyper: B), we should have case 1 or case 2, but not case 3.

1)

w: A
d: xxxxx
hyper: B

w: B
d: yyyyyy

2)

w: A
d: xxxxx

w: B
d: yyyyyy
hypo: A

3)

w: A
d: xxxxx
hyper: B

w: B
d: yyyyyy
hypo: A

Should mill detect case 3, it should raise a warning, not an error.

This applies to inverse relation pairs such as hyper/hypo, mp/hp, etc.

This issue is probably related to #6.
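
A minimal sketch of such a check, representing relations as (source, relation, target) triples plus a table of inverse relation names; the types and the table contents are illustrative, not mill's actual ones:

import qualified Data.Map as Map
import qualified Data.Set as Set

type SynsetId = String
type Relation = (SynsetId, String, SynsetId)

-- pairs of relation names that are inverses of each other
inverses :: Map.Map String String
inverses = Map.fromList
  [("hyper", "hypo"), ("hypo", "hyper"), ("mp", "hp"), ("hp", "mp")]

-- relations whose inverse is also stated explicitly (case 3 above);
-- these should trigger a warning, not an error
redundant :: [Relation] -> [Relation]
redundant rels = filter isRedundant rels
  where
    stated = Set.fromList rels
    isRedundant (a, r, b) =
      case Map.lookup r inverses of
        Just r' -> (b, r', a) `Set.member` stated
        Nothing -> False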

text->RDF

Marked items are those whose IRI does not exist in the current RDF schema.

synset

  • lexicographerFile
  • definition (was: gloss)
  • example
  • frame
  • containsWordSense
  • synset relations

wordsense

  • lexicalForm
  • lemma (do we really need this? it's determinable from the lexicalForm)
  • frame (do we need to convert it to string?)
  • word relations
  • senseKey (do we really need this? it's determinable from other information)
  • syntactic marker

other

  • add missing symmetry (or just report it as an error in #6? or check in the RDF?)
  • use names from configuration file

sense key backwards compatibility

in mill we decided to have a uniform syntax for synset description (no different syntax for adjective cluster synsets), with each wordsense having a unique ID within the same wordnet, given by its lexicographer file, lexical form and lexical ID. in the PWN this ID scheme does not work; there are several adjective satellite wordsenses which share the same mill ID, as this query shows.

we 'solved' this problem by splitting the adjective satellites into their own lexicographer file (adjs.all; this is not hardcoded in mill, but controlled by a configuration file) and updating their lexical IDs, which broke the backward compatibility of sense keys as defined in the PWN documentation.

use syntactic marker as relation

see here for what a syntactic marker is.

instead of specifying a syntactic marker as w: afloat marker p, we would do w: afloat marker predicative 2 (where p is one of the hardcoded options, while the target of the marker relation could be any wordsense)

there are synsets already for predicative and attributive positions, but none for immediately postnominal.

this could be an issue at https://github.com/own-pt/openWordnet-PT/ too, but this one is more of a reminder to remove the markers from the parser if this proposal is accepted.

RDF->text

  • as of 2019-07-22, there are 7865 errors found by running mill --validate on the text files exported from the RDF by wn2text.py. all of them are due to missing targets in word relations. own-pt/openWordnet-PT#151 alone should be responsible for 6558 instances of the problem.
  • separate adjective satellites into their own lexfile (or else there'll be conflicting identifiers -- same lexical form, lexical id, and lexfile)

sorting

  • wordsenses
  • synsets
  • synset relations
  • word relations

change parser error message

The parser error messages are not in the same format as the validation error messages.
Should mill-flymake also deal with that format, or does the error message need to change?

handle file paths correctly

when validating a single file mill will read all files specified by lexnames.tsv and separate the file to be validated from the rest. paths are currently not being normalised, so this separation may go wrong.
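
A minimal sketch of comparing paths only after normalising them (assuming the files exist on disk; not mill's actual code):

import System.Directory (canonicalizePath)
import System.FilePath (equalFilePath)

-- True when the two paths refer to the same file, however they were spelled
-- on the command line or in lexnames.tsv
samePath :: FilePath -> FilePath -> IO Bool
samePath a b = do
  a' <- canonicalizePath a
  b' <- canonicalizePath b
  return (equalFilePath a' b')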

@hmuniz caught this error.

[mill-mode] wrong flymake error positions

problem: mill-mode uses flymake to highlight errors that make no sense for the current buffer.

cause: when mill validates a single file it still outputs parse errors in the other files, and mill-mode doesn't filter which errors to show.

solution: filter out the errors from the other files and show an error message in the first line similar to "there are syntax errors in files x, y, and z" -- this can be done in mill or in mill-mode.

proposal: get rid of lexical ids

#40 discusses the identification of satellite adjectives, which in PWN differs from the identification scheme used by other kinds of synsets. in mill we tried to make this identification scheme more regular by having the satellite adjectives use the same scheme as every other kind of synset (which ultimately led to the issue discussed in #40).

a more radical idea is to stop trying to have satellite adjectives behave like other synsets, and have the other synsets behave like satellite adjectives. what follows is a proposal for a new wordsense ID scheme.

lexical ids have no meaning whatsoever; they are solely an ad hoc way of preventing ID clashes, because the combination (lexical form, lexicographer file) is not enough to uniquely determine a wordsense. we could get rid of lexical ids by generalizing a version of the ID scheme formerly used by adjective satellites, which can be uniquely identified by (lexical form, head synset).

nouns and verbs could be identified by (lexfile, lexical form, hypernym) (or hyponym?)
pertainyms (adjectives or adverbs) could be identified by which wordsense they pertain to (plus lexfile and lexical form).

all in all, we define a 'core' relation for each 'kind' of wordsense/synset, and use the relation's target plus the lexical form of the source to identify the wordsense/synset. naturally, mill would have to be able to verify the uniqueness of this naming scheme, and we wouldn't have to identify the core target beyond its lexical form unless that's not sufficient to satisfy the uniqueness constraint.
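
An illustrative sketch of the proposed identifiers; the type and constructor names are hypothetical, not part of mill:

type LexFile = String
type LexForm = String

-- the 'core' relation depends on the kind of synset; its target is referred
-- to only by its lexical form unless that is not enough for uniqueness
data Core
  = Hypernym  LexForm  -- nouns and verbs
  | Head      LexForm  -- adjective satellites (similarTo head)
  | Pertainym LexForm  -- pertainym adjectives and adverbs
  deriving (Show, Eq, Ord)

data WordSenseId = WordSenseId
  { idLexFile :: LexFile
  , idLexForm :: LexForm
  , idCore    :: Core
  } deriving (Show, Eq, Ord)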

but this is all very radical, so I don't know if it should be implemented.

error in the export for dbfiles

% mill export -c config -l en --wndb src ~/Temp/tt
mill: No head or more than one head
CallStack (from HasCallStack):
  error, called at src/Export.hs:305:25 in mill-0.1.0.0-H0ExyKEE7Io8CVBxpaTfTx:Export

@odanoburu any idea? We need to generate the DB files to produce the Freeling files from OWN-EN... Can you help?

use less python

change mill.py to just read the JSON output by mill and produce RDF. (boostrap_legacy_rdf will have to create the same JSON from the RDF for bootstrapping, if we care about this)

we can then parse the JSON with Haskell and serialize it to text, with much better performance and better libraries for testing that nothing breaks (and then mill is much more self-contained).
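
A minimal sketch of reading mill's JSON output back in Haskell with aeson; the Synset type, its field names, and the file name are hypothetical and would have to match the actual JSON schema:

{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, eitherDecodeFileStrict)
import GHC.Generics (Generic)

data Synset = Synset
  { lexicographerFile :: String
  , definition        :: String
  , examples          :: [String]
  } deriving (Show, Generic)

instance FromJSON Synset

-- round-trip test entry point: decode the JSON and report how many synsets
-- were read (serialization back to the text format would go here)
main :: IO ()
main = do
  result <- eitherDecodeFileStrict "wn.json" :: IO (Either String [Synset])
  either putStrLn (print . length) result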

faster validation

ideally the validation should be fast enough for interactive use (e.g., in the emacs mode), but it is currently somewhat slow (good enough for validating in continuous integration, though)
