english-pos-dict's Introduction

languagetool-org

repo for GitHub pages

english-pos-dict's People

Contributors

azadehsafakish, danielnaber, jaumeortola

Forkers

igorsantos07

english-pos-dict's Issues

changes in the tagger dictionary

  • Removed from the tagger. They are useless because they contain a word separator (/). They could be added to multiwords.txt or just ignored. fcd9226
A/C	A/C	NNP
A/O	A/O	NNP
A/P	A/P	NNP
A/V	A/V	NNP
A/m	A/m	NNP
B/B	B/B	NNP
B/D	B/D	NNP
B/o	B/o	NNP
C/A	C/A	NNP
C/D	C/D	NNP
C/L	C/L	NNP
C/N	C/N	NNP
D/A	D/A	NNP
D/F	D/F	NNP
D/L	D/L	NNP
D/O	D/O	NNP
D/P	D/P	NNP
D/W	D/W	NNP
P/C	P/C	NNP
PL/1	PL/1	NNP
R/D	R/D	NNP
S/D	S/D	NNP
and/or	and/or	CC
b/l	b/l	NN
counts/minute	counts/minute	NN
cycles/second	cycles/second	NN
input/output	input/output	NN
km/h	km/h	NN
m/s	m/s	NN
roll-on/roll-off	roll-on/roll-off	JJ
s/n	s/n	NN
signal/noise	signal/noise	NN
tcp/ip	tcp/ip	NN
w/o	w/o	NN
  • Removed from the tagger. Just a typo? d5f3570
mari,nade	mari,nade	NN
  • Removed from the tagger. If these tags are needed, they should be introduced in disambiguation.xml.
,      ,       ,
.      .       .
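
The three removals above can be sketched as one filter. This is only an illustration: it assumes each tagger line is `form<TAB>lemma<TAB>tag`, and the function name `split_removable` is hypothetical, not part of the repository.

```python
def split_removable(lines):
    """Split tagger lines into (kept, removed).

    Assumes each line is 'form<TAB>lemma<TAB>tag'. An entry is removed
    when its form contains '/' or ',' (the cases listed in this issue)
    or consists only of punctuation.
    """
    kept, removed = [], []
    for line in lines:
        form = line.split("\t")[0]
        has_separator = "/" in form or "," in form
        only_punctuation = not any(ch.isalnum() for ch in form)
        if has_separator or only_punctuation:
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed
```

The removed lines are returned rather than discarded, so they can still be reviewed or moved to multiwords.txt.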

regular inflection for nouns and verbs

This is my proposal to generate the inflection of regular nouns and verbs: https://github.com/languagetool-org/english-pos-dict/blob/main/src-dict/inflection.py

This is not critical in any case. Everything that is not defined as a regular inflection will be treated as irregular. That means that the irregular forms have to be written explicitly.
Cases like potato/potatoes, life/lives are not considered regular now, but they could be. It seems better if we don't consider them regular.

That's all we need. Some small details could change when we check the whole dictionary.
Does it look good to you? @AzadehSafakish
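
A minimal sketch of the idea for nouns, not a copy of inflection.py: the three spelling rules below are assumptions chosen for illustration, and the function names are hypothetical. Any noun whose actual plural differs from the generated one is simply treated as irregular and written out explicitly.

```python
def regular_plural(noun):
    """Return the plural produced by a small set of assumed regular rules."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def is_regular(singular, plural):
    """True when the attested plural matches the generated one."""
    return regular_plural(singular) == plural
```

With these rules, `is_regular("potato", "potatoes")` and `is_regular("life", "lives")` are both false, so those pairs stay in the explicit (irregular) part of the dictionary, as proposed above.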

Tag words that are currently untagged

Tag words that are currently untagged (~97,000 entries)

This could be automated in some cases, but not all, by identifying:

  • regular verbs
  • regular nouns (with singular and plural)
  • proper nouns (capitalized words)
  • words with -ly (7862 entries), mostly adverbs
  • ...

Even with automation, some manual supervision would be convenient.
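
Two of the heuristics above (capitalized words and words ending in -ly) can be sketched as follows. The function name `guess_tag` is hypothetical, and the result is only a candidate tag for a reviewer to confirm.

```python
def guess_tag(word):
    """Return a candidate tag for an untagged word, or None if no rule fires.

    Assumed heuristics: a capitalized word is a proper-noun candidate (NNP),
    a word ending in -ly is an adverb candidate (RB).
    """
    if word[:1].isupper():
        return "NNP"
    if word.endswith("ly"):
        return "RB"
    return None
```

The -ly rule also fires on nouns such as "family", which is one reason the output needs manual supervision.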

Consolidate -isation / -ization words

The attached file lists all words in the dictionaries that end in -isation/-ization. They are sorted so that each -z- form follows its -s- counterpart (absolutization comes after absolutisation, and so on).
isation.txt

  • Define shorthands for nouns (#6)
  • Is the spelling -isation/-ization general for all words, without exceptions? If that is true, we could use this rule of thumb: For words -ization/-isation, add the -s- form to the GB, AU, NZ and ZA dicts, and the -z- form to US, CA, and GB (because of the Oxford spelling).
  • Is there a way to know if these nouns are :U or :UN?
  • Is there a way to know when a plural is required or not? Could we just create plural forms for all, even if they are unusual?
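
If the rule of thumb above holds without exceptions (which is still an open question), assigning variants becomes mechanical. A sketch under that assumption, with hypothetical names:

```python
S_FORM_VARIANTS = ["GB", "AU", "NZ", "ZA"]
Z_FORM_VARIANTS = ["US", "CA", "GB"]  # GB included because of Oxford spelling


def variants_for(word):
    """Return the variant labels suggested by the rule of thumb, or None."""
    if word.endswith("isation"):
        return S_FORM_VARIANTS
    if word.endswith("ization"):
        return Z_FORM_VARIANTS
    return None
```

Comparing this output with the labels currently in the dictionaries would list the entries that disagree with the rule.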

simplified format (shorthands) for dictionary entries

A simplified format makes it easier to edit and maintain the dictionary.

We need:

  • to define a format
  • rules to expand the inflected forms from the simplified format (inflected forms for regular verbs, plurals for nouns, etc.)
  • ways to write the exceptions (everything that doesn't fit the regular inflected forms).

To be sure that everything works as expected, we need scripts to convert from simplified format to expanded format, and vice versa. The results must be identical.

Verbs

simplified format: recharge=verb=all
expanded format: recharge=recharge/VB,recharged/VBD,recharging/VBG,recharged/VBN,recharge/VBP,recharges/VBZ=all
The rules are defined here (I will re-write and improve those rules)
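
A sketch of the round trip for the example above. It is not the rule set linked above: it assumes only two spelling rules (verbs ending in -e, and everything else), so it does not cover consonant doubling (stop/stopped), -y verbs (carry/carried) or -ee verbs (agree/agreeing). Function names are hypothetical.

```python
def regular_forms(lemma):
    """Return (form, tag) pairs for a regular verb under two assumed rules."""
    if lemma.endswith("e"):
        past, gerund = lemma + "d", lemma[:-1] + "ing"
    else:
        past, gerund = lemma + "ed", lemma + "ing"
    return [
        (lemma, "VB"), (past, "VBD"), (gerund, "VBG"),
        (past, "VBN"), (lemma, "VBP"), (lemma + "s", "VBZ"),
    ]


def expand(entry):
    """'recharge=verb=all' -> expanded format; other entries are returned as-is."""
    lemma, kind, variants = entry.split("=")
    if kind != "verb":
        return entry
    forms = ",".join(f"{form}/{tag}" for form, tag in regular_forms(lemma))
    return f"{lemma}={forms}={variants}"


def collapse(entry):
    """Expanded format -> 'lemma=verb=variants' when the forms are regular."""
    lemma, _forms, variants = entry.split("=")
    candidate = f"{lemma}=verb={variants}"
    return candidate if expand(candidate) == entry else entry
```

Because `collapse` only shortens an entry when `expand` reproduces it exactly, converting in both directions gives identical results, and anything irregular stays in expanded form.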

Nouns

All tagging possibilities for nouns are here: NN-counted.txt
If we come up with a format for the 8 most common cases, we cover 99% of the nouns in the dict. [But only those that are regular, or that can be derived with simple rules.]

  55516 NN,NNS
  23761 NN
   8560 NN:UN,NNS
   8260 NN:U,NNS
   4533 NN:UN
   2757 NN:U
   1218 NN,NNS,NNS
    531 NN,NN:U,NNS

For nouns with only one form and one tag (lines 2, 5 and 6 of the list above), we can use just the actual tag.

NN    Noun, singular count noun: bicycle, earthquake, zipper
NNS   Noun, plural: bicycles, earthquakes, zippers
NN:U  Nouns that are always uncountable		#new tag - deviation from Penn, examples: admiration, Afrikaans
NN:UN Nouns that might be used in the plural form and with an indefinite article, depending on their meaning	#new tag - deviation from Penn, examples: establishment, wax, afternoon
NNP   Proper noun, singular: Denver, DORAN, Alexandra
NNPS  Proper noun, plural: Buddhists, Englishmen
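
The tag-pattern counts listed earlier in this issue could be reproduced, and the coverage figure re-checked, along these lines. The input layout (a dict from lemma to its list of tags) and the function names are assumptions made for the sketch; the real dictionary format may differ.

```python
from collections import Counter


def count_tag_patterns(entries):
    """Count how many lemmas share each comma-joined tag pattern."""
    return Counter(",".join(tags) for tags in entries.values())


def coverage(counts, top_n):
    """Fraction of lemmas covered by the top_n most frequent patterns."""
    total = sum(counts.values())
    covered = sum(n for _pattern, n in counts.most_common(top_n))
    return covered / total
```

For example, `coverage(counts, 8)` gives the share of nouns that a format for the eight most common patterns would handle.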

Consolidate -ize/-ise words

Description

We need to clean up and consolidate -ize/-ise words across all English variants.
clean-verbs-ise.txt
clean-verbs-ize.txt
pending-verbs-ise.txt
pending-verbs-ize.txt

This entails the following tasks:

DoD

  • Establish general criteria for determining a word's variant(s). Wikipedia appears to be a good source.
  • Using these criteria, verify that all words have the correct variant labels and correct them where necessary.
  • Once the correct variant labels are in place, ensure that all entries have a complete and accurate list of inflections. Edit/remove entries if required.
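
The last task (complete inflection lists) can be checked by script. The sketch below assumes the entries use the expanded format `lemma=form/TAG,...=variants` from the shorthand proposal, which may not match the attached files; the name `missing_verb_tags` is hypothetical.

```python
REQUIRED_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def missing_verb_tags(expanded_entry):
    """Return the verb tags that have no form in an expanded entry."""
    _lemma, forms, _variants = expanded_entry.split("=")
    present = {item.rsplit("/", 1)[1] for item in forms.split(",")}
    return sorted(REQUIRED_VERB_TAGS - present)
```

An empty result means every verb tag has at least one form; it does not check that the forms themselves are spelled correctly.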

Consolidate -or/-our words

This list is short.
or-our.txt

  • Check consistency: -or forms for US English and -our for the rest.
  • There are some edge cases (to be removed?): glamor, tenour, ...
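
The consistency check could be scripted roughly as below. The input layout (a dict from word to its set of variant labels) is an assumption, as is the function name. Only words whose -or/-our counterpart is also in the list are checked, so words like "doctor" are ignored.

```python
def check_or_our(entries):
    """Return the words that break the rule '-or for US, -our for the rest'.

    entries: dict mapping word -> set of variant labels (hypothetical layout).
    """
    problems = []
    for word, variants in entries.items():
        if word.endswith("our") and word[:-3] + "or" in entries:
            if "US" in variants:  # -our form should not be labeled US
                problems.append(word)
        elif word.endswith("or") and word[:-2] + "our" in entries:
            if variants != {"US"}:  # -or form should be US only
                problems.append(word)
    return sorted(problems)
```

Words reported here are not necessarily errors; they are the candidates to review by hand.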

Checking the source dictionary

The first version of the source dictionary is here: https://github.com/languagetool-org/english-pos-dict/tree/main/src-dict

I will be adding some comments and ideas here. We can open new issues for some parts of the work.

  • We can proceed to separate the entries by groups: the ones that don't need review, the ones that need some manual review, and so on. For example:
    • words in all spelling dicts and tagged -> no need to review
    • words in all spelling dicts but not tagged -> maybe they can be tagged easily
    • words in US and GB and tagged -> maybe they can be accepted by all variants?
    • ...
  • Check that variant labels are coherent with en-US-GB.txt (use scripting).
  • Some sets of entries look suspicious: untagged words in GB with some prefixes (mis-, out-, over-, re-, under-) seem to be nonsense words. The same goes for some suffixes (see: survivorshipably... survivorshipry).
  • Words with the tag us-large come from a Hunspell US dictionary that we didn't use until now. It is mentioned in #2
  • We are using a simplified format for regular verbs: recharge=verb=all. (We use a few rules to cover more cases of regular verbs. See here). It would be useful to have something similar for nouns: a simple and quick way to tag a noun. We would need to define the format, and ways to write exceptions.
  • Which sources do we consider authoritative to determine whether a word is GB or US? And AU, CA, ZA, NZ? Are there dictionaries for those variants?
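
The grouping proposed in the first item of this list could be sketched as a small classifier. The variant labels come from this thread; the function name, the group names and the input shape (a set of variant labels plus a tagged flag) are assumptions.

```python
ALL_VARIANTS = {"US", "GB", "CA", "AU", "NZ", "ZA"}


def review_group(variants, tagged):
    """Assign an entry to a review group from its variants and tag status."""
    if variants == ALL_VARIANTS:
        return "no review" if tagged else "maybe easy to tag"
    if variants == {"US", "GB"} and tagged:
        return "maybe valid for all variants"
    return "manual review"
```

Counting entries per group would show how much of the source dictionary actually needs manual work.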
