languagetool-org / english-pos-dict
English POS and dictionary data
Heya,
Could someone update the GB dictionary for LanguageTool's release at the end of the month?
https://github.com/marcoagpinto/aoo-mozilla-en-dict
Thanks!
/
). They could be added to multiwords.txt or just ignored. fcd9226
A/C A/C NNP
A/O A/O NNP
A/P A/P NNP
A/V A/V NNP
A/m A/m NNP
B/B B/B NNP
B/D B/D NNP
B/o B/o NNP
C/A C/A NNP
C/D C/D NNP
C/L C/L NNP
C/N C/N NNP
D/A D/A NNP
D/F D/F NNP
D/L D/L NNP
D/O D/O NNP
D/P D/P NNP
D/W D/W NNP
P/C P/C NNP
PL/1 PL/1 NNP
R/D R/D NNP
S/D S/D NNP
and/or and/or CC
b/l b/l NN
counts/minute counts/minute NN
cycles/second cycles/second NN
input/output input/output NN
km/h km/h NN
m/s m/s NN
roll-on/roll-off roll-on/roll-off JJ
s/n s/n NN
signal/noise signal/noise NN
tcp/ip tcp/ip NN
w/o w/o NN
mari,nade mari,nade NN
, , ,
. . .
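A minimal sketch of how entries like the ones above could be located automatically. It assumes each entry is a whitespace-separated `form lemma tag` line, as in the listing; the function name `entries_with_chars` and its default character set are hypothetical.

```python
def entries_with_chars(lines, chars="/"):
    """Return the entries whose word form contains any character in `chars`.

    Assumes whitespace-separated "form lemma tag" lines (an assumption
    based on the listing above, not a documented format).
    """
    flagged = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form = parts[0]
        if any(c in form for c in chars):
            flagged.append(line)
    return flagged


sample = ["A/C A/C NNP", "cat cat NN", "mari,nade mari,nade NN"]
slash_entries = entries_with_chars(sample)        # forms containing "/"
punct_entries = entries_with_chars(sample, "/,")  # forms containing "/" or ","
```

The flagged lines could then be reviewed by hand and either moved to multiwords.txt or dropped.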
The a and n are switched in the word "dictionary" in the description.
This is my proposal to generate the inflection of regular nouns and verbs: https://github.com/languagetool-org/english-pos-dict/blob/main/src-dict/inflection.py
This is not critical in any case. Everything that is not defined as a regular inflection will be treated as irregular. That means that the irregular forms have to be written explicitly.
Cases like potato/potatoes, life/lives are not considered regular now, but they could be. It seems better if we don't consider them regular.
That's all we need. Some small details could change when we check the whole dictionary.
Does it look good to you? @AzadehSafakish
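To illustrate the regular/irregular split for nouns, here is a minimal sketch. It is not the linked inflection.py; the rules and the names `regular_plural` and `is_regular_noun` are assumptions chosen so that potato/potatoes and life/lives fall outside the regular cases, as described above.

```python
VOWELS = "aeiou"


def regular_plural(noun: str) -> str:
    """Build the plural using only the assumed 'regular' rules."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    return noun + "s"


def is_regular_noun(singular: str, plural: str) -> bool:
    """A noun is regular if its attested plural matches the generated one."""
    return regular_plural(singular) == plural


# Anything that fails this check would need its forms written explicitly.
checks = {
    ("bicycle", "bicycles"): is_regular_noun("bicycle", "bicycles"),
    ("potato", "potatoes"): is_regular_noun("potato", "potatoes"),
    ("life", "lives"): is_regular_noun("life", "lives"),
}
```

With these rules, widening what counts as regular (for example adding an -o/-oes rule) only means adding a branch to `regular_plural`; the irregular list shrinks accordingly.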
Check:
Tag words that are currently untagged (~97,000 entries)
This could be automated in some cases, though not all, by identifying:
Even with automation, some manual supervision would be convenient.
All words ending in -isation/-ization from the dictionaries. They are sorted alphabetically, so that absolutization comes after absolutisation, and so on.
isation.txt
For words -ization/-isation, add the -s- form to the GB, AU, NZ and ZA dicts, and the -z- form to US, CA, and GB (because of the Oxford spelling).
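The assignment rule above can be sketched as follows. The variant codes come from the text; the function name `variants_for` and the plain suffix test are assumptions, and a suffix test alone would also match words where the ending is not a true -ise/-ize derivation.

```python
# Variant lists as stated above: GB takes both forms because of Oxford spelling.
S_FORM_VARIANTS = ("GB", "AU", "NZ", "ZA")
Z_FORM_VARIANTS = ("US", "CA", "GB")


def variants_for(word: str) -> tuple:
    """Return the dictionary variants a -isation/-ization word should go into."""
    if word.endswith(("isation", "isations")):
        return S_FORM_VARIANTS
    if word.endswith(("ization", "izations")):
        return Z_FORM_VARIANTS
    raise ValueError(f"not an -isation/-ization word: {word!r}")


# Plain alphabetical order already puts each -s- form before its -z- form.
words = sorted(["absolutization", "absolutisation"])
assignment = {w: variants_for(w) for w in words}
```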
:U or :UN?
A simplified format makes it easier to edit and maintain the dictionary.
We need:
To be sure that everything works as expected, we need scripts to convert from simplified format to expanded format, and vice versa. The results must be identical.
simplified format: recharge=verb=all
expanded format: recharge=recharge/VB,recharged/VBD,recharging/VBG,recharged/VBN,recharge/VBP,recharges/VBZ=all
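A minimal sketch of the round trip between the two formats, for regular verbs only. The function names and the two spelling rules used here (final -e handling) are assumptions for illustration, not the project's actual rule set; consonant doubling and -es endings are deliberately left out.

```python
VERB_TAGS = ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")


def regular_verb_forms(lemma: str) -> list:
    """Forms in VERB_TAGS order, using a deliberately small rule set."""
    past = lemma + ("d" if lemma.endswith("e") else "ed")
    drop_e = lemma.endswith("e") and not lemma.endswith("ee")
    gerund = (lemma[:-1] if drop_e else lemma) + "ing"
    third = lemma + "s"
    return [lemma, past, gerund, past, lemma, third]


def expand(line: str) -> str:
    """simplified -> expanded, e.g. 'recharge=verb=all'."""
    lemma, pos, variants = line.split("=")
    if pos != "verb":
        raise ValueError(f"only 'verb' is handled in this sketch: {pos!r}")
    forms = ",".join(
        f"{form}/{tag}" for form, tag in zip(regular_verb_forms(lemma), VERB_TAGS)
    )
    return f"{lemma}={forms}={variants}"


def simplify(line: str) -> str:
    """expanded -> simplified; irregular entries are returned unchanged."""
    lemma, forms, variants = line.split("=")
    pairs = [tuple(item.split("/")) for item in forms.split(",")]
    if pairs != list(zip(regular_verb_forms(lemma), VERB_TAGS)):
        return line  # irregular: keep the explicit forms
    return f"{lemma}=verb={variants}"


expanded = expand("recharge=verb=all")
round_trip = simplify(expanded)
```

Running `simplify(expand(x)) == x` over every simplified entry, and `expand(simplify(y)) == y` over every expanded one, is one way to check that the two conversions are exact inverses.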
The rules are defined here (I will rewrite and improve those rules).
All tagging possibilities for nouns are here: NN-counted.txt
If we come up with a format for the 8 most common cases, we cover 99% of the nouns in the dict. [But only of those that are regular, or that can be derived with simple rules.]
55516 NN,NNS
23761 NN
8560 NN:UN,NNS
8260 NN:U,NNS
4533 NN:UN
2757 NN:U
1218 NN,NNS,NNS
531 NN,NN:U,NNS
For nouns with only one form and one tag (lines 2, 5 and 6: NN, NN:UN and NN:U), we can use just the actual tag.
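As a quick check on the figures listed above, the counts can be tallied directly. This only sums the eight listed patterns; the size of the remaining tail in NN-counted.txt is not known here, so no overall percentage is computed.

```python
# Counts copied from the list above (tag pattern -> number of nouns).
counts = {
    "NN,NNS": 55516,
    "NN": 23761,
    "NN:UN,NNS": 8560,
    "NN:U,NNS": 8260,
    "NN:UN": 4533,
    "NN:U": 2757,
    "NN,NNS,NNS": 1218,
    "NN,NN:U,NNS": 531,
}

listed_total = sum(counts.values())

# Patterns with one form and one tag (lines 2, 5 and 6 of the list).
single_tag = sum(n for pattern, n in counts.items() if "," not in pattern)

share_single = single_tag / listed_total
print(f"{single_tag} of {listed_total} listed nouns need only a bare tag "
      f"({share_single:.1%})")
```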
NN Noun, singular count noun: bicycle, earthquake, zipper
NNS Noun, plural: bicycles, earthquakes, zippers
NN:U Nouns that are always uncountable #new tag - deviation from Penn, examples: admiration, Afrikaans
NN:UN Nouns that might be used in the plural form and with an indefinite article, depending on their meaning #new tag - deviation from Penn, examples: establishment, wax, afternoon
NNP Proper noun, singular: Denver, DORAN, Alexandra
NNPS Proper noun, plural: Buddhists, Englishmen
US, AU and CA original Hunspell dictionaries have two versions, one named "large":
https://github.com/languagetool-org/english-pos-dict/tree/main/spell-data/hunspell
Let's see what the differences are, and which version is being used in the binary files.
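A minimal sketch of such a comparison. It relies on the usual Hunspell .dic layout (an entry count on the first line, then one `word/FLAGS` entry per line, with a literal slash escaped as `\/`); the function name `dic_stems` and the sample contents are made up for illustration, and morphological fields after the word are not handled.

```python
import re


def dic_stems(text: str) -> set:
    """Extract the bare words from the text of a Hunspell .dic file."""
    lines = text.splitlines()
    # The first line of a .dic file is an (approximate) entry count.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    stems = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop affix flags: everything after the first unescaped slash.
        stems.add(re.split(r"(?<!\\)/", line, maxsplit=1)[0])
    return stems


# Hypothetical file contents; in practice, read the normal and "large" .dic files.
normal = "3\ncat/S\ndog\nkm\\/h\n"
large = "4\ncat/S\ndog\nkm\\/h\nzyzzyva/S\n"

only_in_large = dic_stems(large) - dic_stems(normal)
only_in_normal = dic_stems(normal) - dic_stems(large)
```

This compares stems only. Two files could list the same stems with different affix flags, so a full comparison would also need to diff the flags or the expanded word lists.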
Description
We need to clean up and consolidate -ize/-ise words across all English variants.
clean-verbs-ise.txt
clean-verbs-ize.txt
pending-verbs-ise.txt
pending-verbs-ize.txt
This entails the following tasks:
DoD
This list is short.
or-our.txt
The first version of the source dictionary is here: https://github.com/languagetool-org/english-pos-dict/tree/main/src-dict
I will be adding some comments and ideas here. We can open new issues for some parts of the work.
survivorshipably... survivorshipry
).
us-large come from a Hunspell US dictionary that we didn't use until now. It is mentioned in #2
recharge=verb=all. (We use a few rules to cover more cases of regular verbs. See here.) It would be useful to have something similar for nouns: a simple and quick way to tag a noun. We would need to define the format, and ways to write exceptions.