
mill's Introduction

Open Portuguese WordNet (OWN-PT)

This repository hosts Portuguese WordNet data in textual format; it is an experimental branch of http://openwordnet-pt.org. It is linked to (but independent from) the Open English WordNet.

You can also get the data in JSON and RDF format.

See the Wiki for how the data was generated, how it compares to Princeton WordNet, and what the syntax of the text files is. This data is validated and exported by the mill tool — see its repository for more information about validation, export formats, etc.

mill's People

Contributors

arademaker, hmuniz, odanoburu


mill's Issues

add JSON-LD export

this would help eliminate a python dependency and an export backend (since we'd get the RDF by joining the JSON output with the JSON-LD context)

this would probably require flattening the relations in the synset document -- having each of them at the top level instead of under the relations key. senses would be embedded in the synset (as now).

(we could use https://github.com/digitalbazaar/pyld for processing when needed, but this wouldn't be an actual dependency)
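
As a rough illustration of the flattening step, here is a minimal sketch using aeson (assuming aeson >= 2.0); it is not mill's actual code, and the key names are only illustrative:

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Object, Value(..))
import qualified Data.Aeson.KeyMap as KM

-- move every relation under the "relations" key up to the top level of the
-- synset object, leaving the other keys (senses, definition, ...) untouched
flattenRelations :: Object -> Object
flattenRelations synset =
  case KM.lookup "relations" synset of
    Just (Object rels) -> KM.union rels (KM.delete "relations" synset)
    _                  -> synset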

add nomlex data

these are in the original RDF but not yet in the text format

installation of haskell

I executed

$ stack build --copy-bins
Specified file "ChangeLog.md" for extra-source-files does not exist
Specified source-dir "test" does not exist
Multiple .cabal files found in directory /Users/ar/work/wn/mill/:
- mill.cabal
- wntext.cabal

what does it mean? Did it finish the installation or not?

support multiple languages

branch multilang has a simple implementation where we don't actually change anything about the representation of the synsets or their relations: we simply allow lexicographer files in different directories (one for each language). this approach is nice because it doesn't create any special cases for English (or any other language), but it raises the following questions:

- figure out export commands:
- [ ] for WNDB, should we strip out interlingual relations? (yes, but how to do it cleanly?)
- [ ] for JSON, should we lump everything in one output, or should we generate one file per language?

branch multilingual now implements an approach where each WN has a name and this name is part of {synset,wordsense} identifiers; when exporting we may choose to restrict output to one language or to output everything.
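
A rough sketch of what identifiers carrying the wordnet name could look like; the type and field names here are only illustrative, not the ones actually used in the multilingual branch:

type WNName  = String  -- e.g. "en" or "pt"
type LexFile = String

data SynsetKey = SynsetKey
  { wordnet   :: WNName
  , lexFile   :: LexFile
  , firstWord :: String  -- lexical form of the synset's first word sense
  , lexId     :: Int
  } deriving (Show, Eq, Ord)

-- restricting an export to one wordnet is then just a filter on the key
restrictTo :: WNName -> [SynsetKey] -> [SynsetKey]
restrictTo name = filter ((== name) . wordnet)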

validation

subsumes https://github.com/own-pt/own-en/issues/38

references

  • targets exist
  • pointers exist
  • frames exist (and only verbs specify them)
  • synset relations exist
  • word relations exist
  • no duplicate references (two different things with the same name -- if the lexical ids are correct, does this ever happen?)
  • check symmetric relations? (I think it's good to have the redundancy)
    • probably best left to RDF schema validation
  • check there's a head synset to adjective satellites (similarTo relation)
select ?as
where {
  ?as rdf:type ns1:AdjectiveSatelliteSynset .
  FILTER NOT EXISTS { ?as ns1:similarTo ?hs . }
} LIMIT 10
  • check if relation is used in the proper place (word relations aren't used in synset positions and vice-versa)

sorting

  • check if word senses are sorted
  • check if synsets are sorted
  • check if word relations are sorted
  • check if synset relations are sorted
  • check if frames are sorted
  • check if examples are sorted
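
The sortedness checks above could all share a single generic helper; a minimal sketch (not mill's actual code, and the key functions named in the comment are only examples):

-- True when the list is already sorted by the given key, e.g.
-- isSortedOn lexicalForm wordSenses or isSortedOn relationName synsetRelations
isSortedOn :: Ord b => (a -> b) -> [a] -> Bool
isSortedOn key xs = and (zipWith (<=) keys (drop 1 keys))
  where keys = map key xs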

other

  • lexical indexing is correct
    • word sense that comes first has lower lexical id
    • no two occurrences of the same word with the same lexical id (see the sketch below)

please add more suggestions.
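
For the lexical indexing checks, here is a minimal sketch of the duplicate detection; the WordSense type and field names are hypothetical, not mill's:

import Data.Function (on)
import Data.List (groupBy, sortOn)

data WordSense = WordSense
  { lexicalForm :: String
  , lexicalId   :: Int
  } deriving (Show, Eq)

-- groups of word senses within one lexicographer file that share both
-- lexical form and lexical id, i.e. clashes the validator should report
duplicateLexicalIds :: [WordSense] -> [[WordSense]]
duplicateLexicalIds =
  filter ((> 1) . length) . groupBy ((==) `on` key) . sortOn key
  where
    key ws = (lexicalForm ws, lexicalId ws)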

more flexible configuration

validation

  • allow user to specify configuration files or directory where to find them
  • improve relations.tsv to contain information about domain and counterdomain, and check those at validation phase

RDF generation

  • allow user to specify base IRI instead of hardcoding it
    • base IRI for each wordnet is specified in wn.tsv, but this information is not used. the base IRI for the predicates should be specified as a flag.

reading

  • probably better to improve the configuration data types to decrease duplication (instead of several Map a b, use a single Map a (b, c, d, e); see the sketch below)
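
A rough sketch of merging the parallel maps into a single record-valued map; the type and field names are illustrative, not mill's actual configuration types:

import qualified Data.Map as Map

type RelationName = String

data PoS = Noun | Verb | Adj | Adv
  deriving (Show, Eq)

-- instead of several parallel maps keyed by the relation name ...
type CodesByName   = Map.Map RelationName String
type DomainsByName = Map.Map RelationName [PoS]

-- ... keep a single map to a record holding all per-relation information,
-- which also gives a natural place for domain/counterdomain checks
data RelationInfo = RelationInfo
  { relCode          :: String  -- the code used in the text files
  , relDomain        :: [PoS]   -- parts of speech allowed as source
  , relCounterdomain :: [PoS]   -- parts of speech allowed as target
  } deriving (Show)

type Relations = Map.Map RelationName RelationInfo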

mill error in the export

leme:mill ar$ mill export --json data/ wnX.json
Wrong number of fields in relations.tsv
...
/Users/ar/work/wn/mill/data/verb.weather:13018:13019: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13189:13190: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13339:13340: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13464:13465: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13766:13767: unexpected "hypo"

/Users/ar/work/wn/mill/data/verb.weather:13791:13792: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:13984:13985: unexpected "hypo"

/Users/ar/work/wn/mill/data/verb.weather:14003:14004: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14100:14101: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14269:14270: unexpected "drf"

/Users/ar/work/wn/mill/data/verb.weather:14537:14538: unexpected "hypo"

correct wn2text sorting

  • output word senses in the proper order
  • output synsets in the proper order (within a lexicographer file)

the latter is questionable, since it prevents the lexicographer from writing the synsets in an order that might make more sense; however, it is useful for stability

add restricted comment syntax

we can't allow comments just anywhere, since that would break serialization (comments and other metadata will be present in the full versions of the export formats; we may also export leaner versions for end-users not interested in re-serializing to the original text format)

mill mode

  • fontification
  • indentation
  • go to definition
  • show related synsets in two different wordnets side-by-side
  • find references
  • checker
  • list
    • frames (if you press ENTER it inserts the frame number for you, several frames may be picked; press q to exit) M-x mill-list-frames
    • relations (if you press ENTER it inserts the relation code for you; press q to exit; by default this will only show you relations that match the current PoS and obj -- synset or wordsense) M-x mill-list-relations
  • interactive features:
    • new synset creation
    • add frame/relation (show help text)
  • fill definition/example automatically (normally bound to M-q) (how to implement it?)
  • syntax-highlight:
    • syntactic markers
    • comments (properly -- currently comments will be highlighted even where they are not allowed)

export swapping adjective and adjective satellite in sense_key

I used mill export over the OWN-EN files and it worked fine for most cases. When it deals with adjectives, though, it seems to swap the classification of adjective and adjective satellite in the sense key. This changes the sense_key significantly, mainly because of the head_word and head_id fields.

This effect on the sense_key makes it harder to map synset_ids between different versions of OWN (both EN and PT).
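
For reference, PWN sense keys follow the layout lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type is 3 for adjectives and 5 for adjective satellites, and head_word/head_id are only filled in for satellites. A minimal sketch of how the two cases differ (not mill's actual code):

import Text.Printf (printf)

-- ss_type 3 (adjective) has empty head fields; ss_type 5 (satellite) carries
-- the head word and head id, so swapping the two types changes several fields
data AdjKind
  = Adjective             -- ss_type 3
  | Satellite String Int  -- ss_type 5, with head_word and head_id

senseKey :: String -> Int -> Int -> AdjKind -> String
senseKey lemma lexFilenum lexId kind =
  case kind of
    Adjective ->
      printf "%s%%3:%02d:%02d::" lemma lexFilenum lexId
    Satellite headWord headId ->
      printf "%s%%5:%02d:%02d:%s:%02d" lemma lexFilenum lexId headWord headId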

Reflexivity

When it comes to relations that form inverse pairs, mill should expect only one member of the pair to be in the file, not both.

For instance, if A has B as a hypernym (hyper: B), we should have case 1 or case 2, but not case 3.

1)

w: A
d: xxxxx
hyper: B

w: B
d: yyyyyy

2)

w: A
d: xxxxx

w: B
d: yyyyyy
hypo: A

3)

w: A
d: xxxxx
hyper: B

w: B
d: yyyyyy
hypo: A

Should mill detect case 3, it should raise a warning, not an error.

This applies to inverse relation pairs such as hyper/hypo, mp/hp, etc.

This issue is probably related to #6.
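
A minimal sketch of such a check, representing relations as (source, relation, target) triples plus a table of inverse relation names; the types and the table contents are illustrative, not mill's actual ones:

import qualified Data.Map as Map
import qualified Data.Set as Set

type SynsetId = String
type Relation = (SynsetId, String, SynsetId)

-- pairs of relation names that are inverses of each other
inverses :: Map.Map String String
inverses = Map.fromList
  [("hyper", "hypo"), ("hypo", "hyper"), ("mp", "hp"), ("hp", "mp")]

-- relations whose inverse is also stated explicitly (case 3 above);
-- these should trigger a warning, not an error
redundant :: [Relation] -> [Relation]
redundant rels = filter isRedundant rels
  where
    stated = Set.fromList rels
    isRedundant (a, r, b) =
      case Map.lookup r inverses of
        Just r' -> (b, r', a) `Set.member` stated
        Nothing -> False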

text->RDF

Marked items are those whose IRI does not exist in the current RDF schema.

synset

  • lexicographerFile
  • definition (was: gloss)
  • example
  • frame
  • containsWordSense
  • synset relations

wordsense

  • lexicalForm
  • lemma (do we really need this? it's determinable from the lexicalForm)
  • frame (do we need to convert it to string?)
  • word relations
  • senseKey (do we really need this? it's determinable from other information)
  • syntactic marker

other

  • add missing symmetry (or just report it as an error in #6? or check in the RDF?)
  • use names from configuration file

sense key backwards compatibility

in mill we decided to have a uniform syntax for synset description (no different syntax for adjective cluster synsets), with each wordsense having a unique ID within the same wordnet, given by its lexicographer file, lexical form and lexical ID. in the PWN this ID scheme does not work; there are several adjective satellite wordsenses which share the same mill ID, as this query shows.

we 'solved' this problem by splitting the adjective satellites into their own lexicographer file (adjs.all; this is not hardcoded in mill, but controlled by a configuration file) and updating their lexical IDs, which broke the backward compatibility of sense keys as defined in the PWN documentation.

use syntactic marker as relation

see here for what a syntactic marker is.

instead of specifying a syntactic marker as w: afloat marker p, we would do w: afloat marker predicative 2 (where p is one of the hardcoded options, while the target of the marker relation could be any wordsense)

there are synsets already for predicative and attributive positions, but none for immediately postnominal.

this could be an issue at https://github.com/own-pt/openWordnet-PT/ too, but this one is more of a reminder to remove the markers from the parser if this proposal is accepted.

RDF->text

  • as of 2019-07-22, there are 7865 errors found by running mill --validate on the text files exported from the RDF by wn2text.py. all of them are due to missing targets in word relations. own-pt/openWordnet-PT#151 alone should be responsible for 6558 instances of the problem.
  • separate adjective satellites into their own lexfile (or else there'll be conflicting identifiers -- same lexical form, lexical id, and lexfile)

sorting

  • wordsenses
  • synsets
  • synset relations
  • word relations

change parser error message

The parser error messages are not in the same format as the validation error messages.
Should mill-flymake also deal with that format, or does the error message need to change?

handle file paths correctly

when validating a single file mill will read all files specified by lexnames.tsv and separate the file to be validated from the rest. paths are currently not being normalised, so this separation may go wrong.
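
A minimal sketch of comparing paths only after normalising them (assuming the files exist on disk; not mill's actual code):

import System.Directory (canonicalizePath)
import System.FilePath (equalFilePath)

-- True when the two paths refer to the same file, however they were spelled
-- on the command line or in lexnames.tsv
samePath :: FilePath -> FilePath -> IO Bool
samePath a b = do
  a' <- canonicalizePath a
  b' <- canonicalizePath b
  return (equalFilePath a' b')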

@hmuniz caught this error.

[mill-mode] wrong flymake error positions

problem: mill-mode uses flymake to highlight errors that make no sense for the current buffer.

cause: when mill validates a single file it still outputs parse errors in the other files, and mill-mode doesn't filter which errors to show.

solution: filter out the errors from the other files and show an error message in the first line similar to "there are syntax errors in files x, y, and z" -- this can be done in mill or in mill-mode.

proposal: get rid of lexical ids

#40 discusses the identification of satellite adjectives, which in PWN differs from the identification scheme used by other kinds of synsets. in mill we tried to make this identification scheme more regular by having the satellite adjectives use the same scheme as every other kind of synset (which ultimately led to the issue discussed in #40).

a more radical idea is to stop trying to have satellite adjectives behave like other synsets, and have the other synsets behave like satellite adjectives. what follows is a proposal for a new wordsense ID scheme.

lexical ids have no meaning whatsoever; they are solely an ad hoc way of preventing ID clashes, because the combination (lexical form, lexicographer file) is not enough to uniquely determine a wordsense. we could get rid of lexical ids by generalizing a version of the ID scheme formerly used by adjective satellites, which can be uniquely identified by (lexical form, head synset).

nouns and verbs could be identified by (lexfile, lexical form, hypernym) (or hyponym?)
pertainyms (adjectives or adverbs) could be identified by which wordsense they pertain to (plus lexfile and lexical form).

all in all, we define a 'core' relation for each 'kind' of wordsense/synset, and use the relation's target plus the lexical form of the source to identify the wordsense/synset. naturally, mill would have to be able to verify the uniqueness of this naming scheme, and we wouldn't have to identify the core target beyond its lexical form unless that's not sufficient to satisfy the uniqueness constraint.
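
An illustrative sketch of the proposed identifiers; the type and constructor names are hypothetical, not part of mill:

type LexFile = String
type LexForm = String

-- the 'core' relation depends on the kind of synset; its target is referred
-- to only by its lexical form unless that is not enough for uniqueness
data Core
  = Hypernym  LexForm  -- nouns and verbs
  | Head      LexForm  -- adjective satellites (similarTo head)
  | Pertainym LexForm  -- pertainym adjectives and adverbs
  deriving (Show, Eq, Ord)

data WordSenseId = WordSenseId
  { idLexFile :: LexFile
  , idLexForm :: LexForm
  , idCore    :: Core
  } deriving (Show, Eq, Ord)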

but this is all very radical, so I don't know if it should be implemented.

error in the export for dbfiles

% mill export -c config -l en --wndb src ~/Temp/tt
mill: No head or more than one head
CallStack (from HasCallStack):
  error, called at src/Export.hs:305:25 in mill-0.1.0.0-H0ExyKEE7Io8CVBxpaTfTx:Export

@odanoburu any idea? We need to generate the DB files to produce the Freeling files from OWN-EN... Can you help?

use less python

change mill.py to just read the JSON output by mill and produce RDF. (boostrap_legacy_rdf will have to create the same JSON from the RDF for bootstrapping, if we care about this)

we can then parse the JSON with Haskell and serialize it to text, with much better performance and better libraries for testing that nothing breaks (and then mill is much more self-contained).
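
A minimal sketch of reading mill's JSON output back in Haskell with aeson; the Synset type, its field names, and the file name are hypothetical and would have to match the actual JSON schema:

{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, eitherDecodeFileStrict)
import GHC.Generics (Generic)

data Synset = Synset
  { lexicographerFile :: String
  , definition        :: String
  , examples          :: [String]
  } deriving (Show, Generic)

instance FromJSON Synset

-- round-trip test entry point: decode the JSON and report how many synsets
-- were read (serialization back to the text format would go here)
main :: IO ()
main = do
  result <- eitherDecodeFileStrict "wn.json" :: IO (Either String [Synset])
  either putStrLn (print . length) result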

faster validation

ideally the validation should be fast enough for interactive use (e.g., in the emacs mode), but it is currently somewhat slow (good enough for validating in continuous integration, though)
