The most popular spellchecking library.

Home Page: http://hunspell.github.io/

License: GNU Lesser General Public License v2.1

Shell 1.15% C 0.59% C++ 94.17% Makefile 0.41% M4 3.32% Perl 0.36% sed 0.01%

natural-language-processing spellcheck spellchecker stemming spell-check spell-checking-engine spell-checker

hunspell's Introduction

About Hunspell

Hunspell is a free spell checker and morphological analyzer library and command-line tool, licensed under an LGPL/GPL/MPL tri-license.

Hunspell is used by the LibreOffice office suite, by free browsers such as Mozilla Firefox and Google Chrome, and by other tools and operating systems, including Linux distributions and macOS. It is also a command-line tool for Linux, Unix-like and other operating systems.

It is designed for fast, high-quality spell checking and correction for languages with word-level writing systems, including languages with rich morphology, complex word compounding and character encoding.

Hunspell interfaces: Ispell-like terminal interface using Curses library, Ispell pipe interface, C++/C APIs and shared library, also with existing language bindings for other programming languages.

Hunspell's code base comes from OpenOffice.org's MySpell library, developed by Kevin Hendricks (originally a from-scratch C++ reimplementation of the spell checking and affixation of Geoff Kuenning's International Ispell, later extended with, e.g., n-gram suggestions). See http://lingucomponent.openoffice.org/MySpell-3.zip and its README, CONTRIBUTORS and license.readme (here: license.myspell) files.

Main features of Hunspell library, developed by László Németh:

Unicode support
Highly customizable suggestions: word-part replacement tables and stem-level phonetic and other alternative transcriptions to recognize and fix all typical misspellings, don't suggest offensive words etc.
Complex morphology: dictionary and affix homonyms; twofold affix stripping to handle inflectional and derivational morpheme groups for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish; 64 thousand affix classes with arbitrary number of affixes; conditional affixes, circumfixes, fogemorphemes, zero morphemes, virtual dictionary stems, forbidden words to avoid overgeneration etc.
Handling complex compounds (for example, for Finno-Ugric, German and Indo-Aryan languages): recognizing compounds made of arbitrary number of words, handle affixation within compounds etc.
Custom dictionaries with affixation
Stemming
Morphological analysis (in custom item and arrangement style)
Morphological generation
SPELLML XML API over plain spell() API function for easier integration of stemming, morphological generation and custom dictionaries with affixation
Language specific algorithms, like special casing of Azeri or Turkish dotted i and German sharp s, and special compound rules of Hungarian.

Main features of Hunspell command line tool, developed by László Németh:

Reimplementation of quick interactive interface of Geoff Kuenning's Ispell
Parsing formats: text, OpenDocument, TeX/LaTeX, HTML/SGML/XML, nroff/troff
Custom dictionaries with optional affixation, specified by a model word
Multiple dictionary usage (for example hunspell -d en_US,de_DE,de_medical)
Various filtering options (bad or good words/lines)
Morphological analysis (option -m)
Stemming (option -s)

See man hunspell, man 3 hunspell, man 5 hunspell for complete manual.

Translations: Hunspell has been translated into several languages already. If your language is missing or incomplete, please use Weblate to help translate Hunspell.

Dependencies

Build only dependencies:

g++ make autoconf automake autopoint libtool

Runtime dependencies:

                Mandatory           Optional
libhunspell
hunspell tool   libiconv gettext    ncurses readline

Compiling on GNU/Linux and Unixes

We first need to download the dependencies. On Linux, gettext and libiconv are part of the standard library. On other Unixes we need to manually install them.

For Ubuntu:

sudo apt install autoconf automake autopoint libtool

Then run the following commands:

autoreconf -vfi
./configure
make
sudo make install
sudo ldconfig

For dictionary development, use the --with-warnings option of configure.

For interactive user interface of Hunspell executable, use the --with-ui option.

Optional developer packages:

ncurses (needed for --with-ui), e.g. libncursesw5 for UTF-8
readline (for fancy input line editing, configure parameter: --with-readline)

In Ubuntu, the packages are:

libncurses5-dev libreadline-dev

Compiling on OSX and macOS

On macOS, always use clang rather than g++ as the compiler, because Homebrew dependencies are built with it.

brew install autoconf automake libtool gettext
brew link gettext --force

Then run:

autoreconf -vfi
./configure
make

Compiling on Windows

Compiling with Mingw64 and MSYS2

Download Msys2, update everything and install the following packages:

pacman -S base-devel mingw-w64-x86_64-toolchain mingw-w64-x86_64-libtool

Open Mingw-w64 Win64 prompt and compile the same way as on Linux, see above.

Compiling in Cygwin environment

Download and install Cygwin environment for Windows with the following extra packages:

make
automake
autoconf
libtool
gcc-g++ development package
ncurses, readline (for user interface)
iconv (character conversion)

Then compile the same way as on Linux. Cygwin builds depend on Cygwin1.dll.

Debugging

It is recommended to install a debug build of the standard library:

libstdc++6-6-dbg

For debugging we need to create a debug build and then we need to start gdb.

./configure CXXFLAGS='-g -O0 -Wall -Wextra'
make
./libtool --mode=execute gdb src/tools/hunspell

You can also pass the CXXFLAGS directly to make without calling ./configure, but we don't recommend this way during long development sessions.

If you like to develop and debug with an IDE, see documentation at https://github.com/hunspell/hunspell/wiki/IDE-Setup

Testing

Testing Hunspell (see tests in tests/ subdirectory):

make check

or with Valgrind debugger:

make check
VALGRIND=[Valgrind_tool] make check

For example:

make check
VALGRIND=memcheck make check

Documentation

features and dictionary format:

man 5 hunspell
man hunspell
hunspell -h

http://hunspell.github.io/

Usage

After compiling and installing (see INSTALL) you can run the Hunspell spell checker (compiled with user interface) with a Hunspell or Myspell dictionary:

hunspell -d en_US text.txt

or without interface:

hunspell
hunspell -d en_GB -l <text.txt

Dictionaries consist of an affix (.aff) and dictionary (.dic) file, for example, download American English dictionary files of LibreOffice (older version, but with stemming and morphological generation) with

wget -O en_US.aff  https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff?id=a4473e06b56bfe35187e302754f6baaa8d75e54f
wget -O en_US.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic?id=a4473e06b56bfe35187e302754f6baaa8d75e54f

and with command line input and output, it's possible to check its work quickly, for example with the input words "example", "examples", "teached" and "verybaaaaaaaaaaaaaaaaaaaaaad":

$ hunspell -d en_US
Hunspell 1.7.0
example
*

examples
+ example

teached
& teached 9 0: taught, teased, reached, teaches, teacher, leached, beached

verybaaaaaaaaaaaaaaaaaaaaaad
# verybaaaaaaaaaaaaaaaaaaaaaad 0

Where in the output, * and + mean correct (accepted) words (* = dictionary stem, + = affixed forms of the following dictionary stem), and & and # mean bad (rejected) words (& = with suggestions, # = without suggestions) (see man hunspell).
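The markers described above are easy to consume from another program. The following is a minimal sketch, not part of Hunspell itself: `CheckResult` and `parse_result_line` are hypothetical names, and only the four markers shown in the sample session are handled.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for one output line of `hunspell -d <dict>`.
struct CheckResult {
    bool accepted = false;                 // true for '*' and '+' lines
    std::string stem;                      // filled for '+' lines
    std::vector<std::string> suggestions;  // filled for '&' lines
};

// Parse one non-empty result line such as "*", "+ example",
// "& teached 9 0: taught, teased" or "# verybad 0".
CheckResult parse_result_line(const std::string& line) {
    assert(!line.empty());
    CheckResult r;
    switch (line[0]) {
    case '*':
        r.accepted = true;
        break;
    case '+':
        r.accepted = true;
        if (line.size() > 2) r.stem = line.substr(2);
        break;
    case '&': {
        std::string::size_type colon = line.find(": ");
        if (colon == std::string::npos) break;
        std::istringstream list(line.substr(colon + 2));
        std::string item;
        while (std::getline(list, item, ',')) {
            // Drop the space that follows each comma.
            if (!item.empty() && item[0] == ' ') item.erase(0, 1);
            if (!item.empty()) r.suggestions.push_back(item);
        }
        break;
    }
    default:  // '#' and anything else: treated as rejected in this sketch
        break;
    }
    return r;
}
```

Feeding it the line for "examples" from the session above would yield an accepted result whose stem is "example".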

Example for stemming:

$ hunspell -d en_US -s
mice
mice mouse

Example for morphological analysis (very limited with this English dictionary):

$ hunspell -d en_US -m
mice
mice  st:mouse ts:Ns

cats
cats  st:cat ts:0 is:Ns
cats  st:cat ts:0 is:Vs

Other executables

The src/tools directory contains the following executables after compiling.

The main executable:
- hunspell: main program for spell checking and others (see manual)
Example tools:
- analyze: example of spell checking, stemming and morphological analysis
- chmorph: example of automatic morphological generation and conversion
- example: example of spell checking and suggestion
Tools for dictionary development:
- affixcompress: dictionary generation from large (millions of words) vocabularies
- makealias: alias compression (Hunspell only, not back compatible with MySpell)
- wordforms: word generation (Hunspell version of unmunch)
- hunzip: decompressor of hzip format
- hzip: compressor of hzip format
- munch (DEPRECATED, use affixcompress): dictionary generation from vocabularies (it needs an affix file, too).
- unmunch (DEPRECATED, use wordforms): list all recognized words of a MySpell dictionary

Example for morphological generation:

$ ~/hunspell/src/tools/analyze en_US.aff en_US.dic /dev/stdin
cat mice
generate(cat, mice) = cats
mouse cats
generate(mouse, cats) = mice
generate(mouse, cats) = mouses

Using Hunspell library with GCC

Including in your program:

#include <hunspell.hxx>

Linking with Hunspell static library:

g++ example.cxx -lhunspell-1.7
# or better, use pkg-config
g++ example.cxx $(pkg-config --cflags --libs hunspell)

Installing Hunspell (vcpkg)

Alternatively, you can build and install hunspell using vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install hunspell

The hunspell port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Dictionaries

Hunspell (MySpell) dictionaries:

Aspell dictionaries (conversion: man 5 hunspell):

ftp://ftp.gnu.org/gnu/aspell/dict

László Németh, nemeth at numbertext org

hunspell's Issues

Heap corruption for UTF8

Hi,

Running multiple instances of Hunspell can corrupt the heap through the global variable ‘utf_tbl’ in csutil.cxx.

To avoid this, change

int initialize_utf_tbl() {
    utf_tbl = (unicode_info2 *) malloc(CONTSIZE * sizeof(unicode_info2));
    …

to

int initialize_utf_tbl() {
    free_utf_tbl(); // <-- new
    utf_tbl = (unicode_info2 *) malloc(CONTSIZE * sizeof(unicode_info2));
    …

and also change

void free_utf_tbl() {
    if (utf_tbl) free(utf_tbl);
}

to

void free_utf_tbl() {
    if (utf_tbl) {
        free(utf_tbl);
        utf_tbl = NULL; // <-- new
    }
}
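
The effect of the two proposed changes can be tried in isolation. This is a self-contained sketch under stated assumptions: the element type and `CONTSIZE` below are placeholders, not the real definitions from csutil.cxx, and the sketch does nothing about thread safety.

```cpp
#include <cassert>
#include <cstdlib>

// Placeholders standing in for the real definitions in csutil.cxx.
struct unicode_info2 { unsigned short cupper, clower; char cletter; };
static const int CONTSIZE = 65536;
static unicode_info2* utf_tbl = NULL;

void free_utf_tbl() {
    if (utf_tbl) {
        std::free(utf_tbl);
        utf_tbl = NULL;  // a second call is now a harmless no-op
    }
}

int initialize_utf_tbl() {
    free_utf_tbl();           // release a table left by an earlier instance
    assert(utf_tbl == NULL);  // nothing can leak past this point
    utf_tbl = static_cast<unicode_info2*>(
        std::malloc(CONTSIZE * sizeof(unicode_info2)));
    return utf_tbl ? 0 : 1;
}
```

With the pointer reset to NULL, repeated initialisation and repeated freeing are both safe, which is the property the report asks for.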

greets
Ingo

Original comment by: idb_winshell

Original Ticket: hunspell/bugs/21

Support for Dutch IJ

Could you add support for the Dutch letter IJ? See the Wikipedia articles for more details on the IJ:

http://en.wikipedia.org/wiki/Dutch_alphabet
http://en.wikipedia.org/wiki/IJ_%28letter%29

The basic problem is that IJ is written, just like in this request, as an I and a J, but together they need to be treated as one letter. This is most visible when it is capitalized.

Sample (also in wikipedia):
The Dutch word for ice is ijs, when at the beginning of a sentence it must be written as IJs.
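
The capitalization rule in the sample can be sketched as follows. `capitalize` and its `dutch` switch are hypothetical names for illustration; this is not how Hunspell implements casing, and only ASCII input is handled.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Capitalize the first letter of a word. With dutch == true, a leading
// "ij" is treated as a single letter and becomes "IJ".
std::string capitalize(const std::string& word, bool dutch) {
    std::string out = word;
    if (out.empty()) return out;
    if (dutch && out.size() >= 2 && out[0] == 'i' && out[1] == 'j') {
        out[0] = 'I';
        out[1] = 'J';
    } else {
        out[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(out[0])));
    }
    assert(out.size() == word.size());  // casing never changes the length
    return out;
}
```

A checker that accepts sentence-initial capitals would then compare the input against `capitalize(stem, true)` for Dutch dictionaries.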

Original comment by: ffes

Original Ticket: hunspell/feature-requests/8

a new Win32 DLL/Lib VS2005 solution

I made it based on the original win_api but changed it to the VS 2005 way. I am not sure whether it is still fully compatible with the existing Delphi example. Another small problem is that it cannot get config.h from autoconf, so I just copied config.h.in as it is. Obviously this loses some information, since the constants in config.h.in are all undefined.

Original comment by: b6s

Original Ticket: hunspell/bugs/36

capital ß may be SS

In German there are two ways of writing a
capital ß:

1. leave ß as it is to keep the uniqueness (for lists
of names e.g.)
2. write SS instead of ß (the typographically correct way)

So writing “Straße” in CAPITAL letters would either be

STRASSE or STRAßE.

Other spell checkers accept both ways of writing it, and
I think this is the right thing to do.

Do you think this would be possible to do inside hunspell?
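
One way to accept both spellings is to generate both upper-case forms of a dictionary word and compare the input against each. The sketch below assumes UTF-8 input and only upper-cases ASCII letters; `to_upper_de` and `matches_all_caps` are hypothetical helpers, not Hunspell functions.

```cpp
#include <cassert>
#include <string>

// UTF-8 encoding of the letter sharp s (U+00DF).
static const std::string SHARP_S = "\xC3\x9F";

// Upper-case ASCII letters; sharp s is either kept or written as "SS".
std::string to_upper_de(const std::string& word, bool expand_sharp_s) {
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        if (word.compare(i, SHARP_S.size(), SHARP_S) == 0) {
            out += expand_sharp_s ? "SS" : SHARP_S;
            i += SHARP_S.size() - 1;  // skip the second byte of sharp s
        } else if (word[i] >= 'a' && word[i] <= 'z') {
            out += static_cast<char>(word[i] - 'a' + 'A');
        } else {
            out += word[i];
        }
    }
    assert(out.size() == word.size());  // in UTF-8 both forms take two bytes
    return out;
}

// An all-caps token matches a dictionary word if it equals either form.
bool matches_all_caps(const std::string& token, const std::string& dict_word) {
    return token == to_upper_de(dict_word, true) ||
           token == to_upper_de(dict_word, false);
}
```

With the dictionary word "Straße", both all-caps spellings from the request match, while a misspelling such as "STRASE" does not.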

Original comment by: bjacke

Original Ticket: hunspell/feature-requests/1

create shared libraries for libhunspell

i.e. a .so not a .a

Patch attached does this

Original comment by: *anonymous

Original Ticket: hunspell/patches/1

c api for hunspell

Attached is a simple patch to add a c api to hunspell to enable it to be used from c programs. Quite simple really, and easy to extend with whatever other methods might be useful

Original comment by: caolan

Original Ticket: hunspell/patches/2

False negatives with leading capital letter and UTF-8

Summary:
Hunspell (using the example program) seems to mark
words that start with a capital letter and contain
non-ASCII characters as mis-spelled and then lists the
same word
as one of the suggestions. This seems to happen only when
using UTF-8 encoded dictionary and string.

Steps to reproduce:
1. Using the example program, and (for example) the
Estonian UTF-8 encoded affix and dict file
(mug.imo.ee/speller/et_EE.zip) check a properly spelled
wordlist where words start with capital letters (included
in that package)

Expected results:
The words should pass the spell check.

Actual results:
Every word is marked as mis-spelled and the suggestion
lists contain the same words.

Notes/Workarounds:
Everything is fine when using Latin-1 for example. The
problem is, there are some letters that don’t map in that
space in Estonian. :-) I haven’t tried this with other
languages, but one of the posts here about Russian seems
to maybe relate to the same problem.

Original comment by: filippl

Original Ticket: hunspell/bugs/11

Empty dictionary file not closed

If hunspell attempts to open a dictionary file with no words in it and a “0” at the start of the file, it returns, but does not close the file handle.

Original comment by: highjinx

Original Ticket: hunspell/bugs/30

alloc not checked for NULL

Dom Lachowicz, the maintainer for Enchant, advised to enter suggestions here since this is the upstream location for enchant’s myspell branch.

It was submitted as:
http://bugzilla.abisource.com/show_bug.cgi?id=11041

Please use what you find useful and discard the rest.
(not checked for errors).

Thanks,
Jose Da Silva

Original comment by: josedasilva

Original Ticket: hunspell/bugs/34

wildcards in character stripping

Would it be possible to have the “.” wildcard work not
only as a condition, but also to identify character
stripping? For example

SFX A Y 2
SFX A . suf1 .
SFX A . suf2 .

with a dic file like

fooa/A
barb/A
bazc/A

would create

foosuf1
foosuf2
barsuf1
barsuf2
bazsuf1
bazsuf2

much more economically than the current

SFX A Y 6
SFX A a suf1 .
SFX A b suf1 .
SFX A c suf1 .
SFX A a suf2 .
SFX A b suf2 .
SFX A c suf2 .

In the language I’m dealing with, the dozens of suffix
forms for future and conditional all apply to all verb
conjugations, yet the lemmas for these conjugations
differ in their final letter, which is stripped when
the suffixes are added. It seems a waste to have to
duplicate the list of suffixes for each conjugation,
when stripping the wildcard “.” would work for all of them.
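
The requested behaviour can be sketched outside Hunspell. The code below is an illustration of the proposal only: `SuffixRule` and `expand` are hypothetical, conditions are ignored, and a strip field of "." is given the proposed meaning "remove exactly one trailing character, whatever it is".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified suffix rule: what to strip and what to append.
struct SuffixRule {
    std::string strip;  // "0" = strip nothing, "." = proposed wildcard
    std::string add;
};

// Generate the suffixed forms of one stem.
std::vector<std::string> expand(const std::string& stem,
                                const std::vector<SuffixRule>& rules) {
    std::vector<std::string> forms;
    for (const SuffixRule& rule : rules) {
        assert(!rule.strip.empty());  // "0" is the way to strip nothing
        const std::string::size_type n = rule.strip.size();
        if (rule.strip == ".") {
            if (!stem.empty())
                forms.push_back(stem.substr(0, stem.size() - 1) + rule.add);
        } else if (rule.strip == "0") {
            forms.push_back(stem + rule.add);
        } else if (stem.size() >= n &&
                   stem.compare(stem.size() - n, n, rule.strip) == 0) {
            forms.push_back(stem.substr(0, stem.size() - n) + rule.add);
        }
    }
    return forms;
}
```

Two wildcard rules then cover fooa, barb and bazc alike, where literal strip fields need one rule per final letter and suffix.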

Thanks

Original comment by: *anonymous

Original Ticket: hunspell/feature-requests/4

make failed when configuring with "--with-ui" option

When I add "--with-ui" make fails with this error message:

Making all in tools
make[3]: Entering directory `/home/khaled/work/hunspell/hunspell-1.1.5/src/tools'
if gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../src/hunspell -I../../src/parsers -g -O2 -MT munch.o -MD -MP -MF ".deps/munch.Tpo" -c -o munch.o munch.c; \
then mv -f ".deps/munch.Tpo" ".deps/munch.Po"; else rm -f ".deps/munch.Tpo"; exit 1; fi
/bin/sh ../../libtool --mode=link --tag=CC gcc -g -O2 -o munch munch.o
mkdir .libs
gcc -g -O2 -o munch munch.o
if gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../src/hunspell -I../../src/parsers -g -O2 -MT unmunch.o -MD -MP -MF ".deps/unmunch.Tpo" -c -o unmunch.o unmunch.c; \
then mv -f ".deps/unmunch.Tpo" ".deps/unmunch.Po"; else rm -f ".deps/unmunch.Tpo"; exit 1; fi
/bin/sh ../../libtool --mode=link --tag=CC gcc -g -O2 -o unmunch unmunch.o
gcc -g -O2 -o unmunch unmunch.o
if g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I../../src/hunspell -I../../src/parsers -g -O2 -MT example.o -MD -MP -MF ".deps/example.Tpo" -c -o example.o example.cxx; \
then mv -f ".deps/example.Tpo" ".deps/example.Po"; else rm -f ".deps/example.Tpo"; exit 1; fi
/bin/sh ../../libtool --mode=link --tag=CXX g++ -g -O2 -o example example.o ../hunspell/libhunspell-1.1.la
g++ -g -O2 -o .libs/example example.o ../hunspell/.libs/libhunspell-1.1.so
creating example
if g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I../../src/hunspell -I../../src/parsers -g -O2 -MT hunspell.o -MD -MP -MF ".deps/hunspell.Tpo" -c -o hunspell.o hunspell.cxx; \
then mv -f ".deps/hunspell.Tpo" ".deps/hunspell.Po"; else rm -f ".deps/hunspell.Tpo"; exit 1; fi
hunspell.cxx: In function ‘int dialog(TextParser*, Hunspell*, char*, char*, char**, int, int)’:
hunspell.cxx:712: error: ‘class Hunspell’ has no member named ‘get_csconv’
make[3]: *** [hunspell.o] Error 1
make[3]: Leaving directory `/home/khaled/work/hunspell/hunspell-1.1.5/src/tools'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/khaled/work/hunspell/hunspell-1.1.5/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/khaled/work/hunspell/hunspell-1.1.5'
make: *** [all] Error 2

Otherwise make goes with no errors

Original comment by: khaledhosny

Original Ticket: hunspell/bugs/20

Incorrect letter case handling for Belarusian

HunSpell counts words at sentence beginning (i.e., with
the first letter capitalized) as errors and suggests
lowercase-only spellings.

A screenshot is attached.

Original comment by: gabix

Original Ticket: hunspell/bugs/3

segfaults with words of >100 characters

Hi Laci,

when there are words with more than 100 characters,
Hunspell (1.1.4) segfaults during spellchecking. With
“-” being a wordchar, 100 “-”’s in real life texts are
not so unusual as real words with 100 characters :-)

The hunmorph OOo component just refuses to check so
long words but crashes are at least not visible.

Bjoern

Original comment by: *anonymous

Original Ticket: hunspell/bugs/15

aspell dicts -> hunspell dicts

From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=324639:

--- snip ---

please add information how to build hunspell dictionaries from aspell
dicts. these dictionaries are mentioned on the hunspell home page.
--- snip ---

Original comment by: *anonymous

Original Ticket: hunspell/feature-requests/6

-g option like in ispell

From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=342791:

--- snip ---

There is one ispell option missing in hunspell. Porting it would allow Debian to completely migrate to hunspell/myspell:

-g The input file is in Debian control file format. Ispell will ignore everything outside the Description(s).
--- snip ---

Original comment by: *anonymous

Original Ticket: hunspell/feature-requests/5

input encoding support

From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=350022:

--- snip ---

I would like to have input encoding support. So that hunspell
can (automagically) correct files in any encoding and not only the
encoding of the dictionary.

Especially I have problems with UTF-8 encoded German LaTeX files.
--- snip ---

Original comment by: *anonymous

Original Ticket: hunspell/feature-requests/7

libparsers does not link on Mac for ppc of universal binary

Use this command to configure Universal Binary building on Mac OS X (no matter Intel or PowerPC):

env CFLAGS="-O -g -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc" LDFLAGS="-arch i386 -arch ppc" ./configure --disable-dependency-tracking

The error message:

g++ -g -O2 -arch i386 -arch ppc -o .libs/testparser firstparser.o htmlparser.o latexparser.o manparser.o testparser.o textparser.o -Wl,-bind_at_load ../hunspell/.libs/libhunspell-1.1.dylib
/usr/bin/ld: for architecture ppc
/usr/bin/ld: warning firstparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning htmlparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning latexparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning manparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning testparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning textparser.o cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: warning ../hunspell/.libs/libhunspell-1.1.dylib cputype (7, architecture i386) does not match cputype (18) for specified -arch flag: ppc (file not loaded)
/usr/bin/ld: Undefined symbols:
_main

Original comment by: b6s

Original Ticket: hunspell/bugs/31

tabs in rules

I tried using tabs in the *.aff file, but that resulted
in a segmentation error. It is only very implicitly
documented that tabs are not the normal format. I
think it would just be nice to allow for tabs, as well
as spaces, or at least to document it explicitly.
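
Accepting tabs as well as spaces mostly comes down to how a rule line is split into fields. The following is a minimal sketch of such a splitter; `split_fields` is a hypothetical helper and not the parser Hunspell actually uses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a line into fields separated by any run of spaces and/or tabs.
std::vector<std::string> split_fields(const std::string& line) {
    const char* blanks = " \t";
    const std::string::size_type npos = std::string::npos;
    std::vector<std::string> fields;
    std::string::size_type pos = 0;
    while ((pos = line.find_first_not_of(blanks, pos)) != npos) {
        std::string::size_type end = line.find_first_of(blanks, pos);
        assert(end == npos || end > pos);  // a field has at least one char
        fields.push_back(line.substr(pos, end == npos ? npos : end - pos));
        if (end == npos) break;
        pos = end;
    }
    return fields;
}
```

A line such as "SFX", tab, "A", tab, "0", two spaces, "e", tab, "." then yields the same five fields as its space-separated form.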

Thanks!

Original comment by: fouvry

Original Ticket: hunspell/bugs/6

COMPOUNDRULE accepts wrong words

with the following aff/dic file the word “arbeitsfarbige” should not be accepted but it’s not reported to be wrong:

#aff file:
SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxz

SFX A Y 5
SFX A 0 e .
SFX A 0 er .
SFX A 0 en .
SFX A 0 em .
SFX A 0 es .

COMPOUNDRULE 1
COMPOUNDRULE vw

#dic file:
3
arbeits/v
scheu/Aw
farbig/A

Original comment by: bjacke

Original Ticket: hunspell/bugs/26

a new Win32 DLL/Lib VS2005 solution (2)

Hi,

I also made a solution file for VS2005. Unfortunately, I could not attach it to the original report.

My code changes are based on the 1.1.7 release code.

I made changes in the src code for atypes.hxx and hunspell.hxx.

Any comments are welcome.

best regards,
Ingo

Original comment by: idb_winshell

Original Ticket: hunspell/bugs/37

Spellchecker efficiency

Hi,

With larger lexica (several 100,000 words), there are
efficiency problems. Profile information suggests that
the data structures and algorithms could be improved.
It might be worthwhile to read up on the topic.
(Sorry that this is rather vague: I didn’t profile the
code (colleagues did), but work with the programme very
often – and I am a computational linguist, i.e. aware
of some of the issues.)

Original comment by: *anonymous

Original Ticket: hunspell/bugs/12

Better compound word support by COMPOUNDRULE

This is the same issue I reported on the mailing list
a couple of weeks ago. Since the issue is still
present in Hunspell version 1.1.4 I’ll copy the
information here so it doesn’t get lost.

Hunmorph (in Hunspell versions 1.1.3 and 1.1.4)
crashes when I do the following:

harri@c2:~/tmp/hunspell-bug$ cat test.aff
COMPOUNDRULE 1
COMPOUNDRULE CC
SFX S Y 1
SFX S 0 s . +S
harri@c2:~/tmp/hunspell-bug$ cat test.dic
1
abc/CS [WORD]
harri@c2:~/tmp/hunspell-bug$ cat test.txt
abcabcs
harri@c2:~/tmp/hunspell-bug$ hunmorph test.aff test.dic test.txt
> abcabcs
Segmentation fault

Hunspell does not crash with the same input. We (at
the Finnish dictionary project) tried to figure out
what is causing this and it seems that something is
wrong at lines 2113-2118 in affixmgr.cxx (in version
1.1.4):

char * m = NULL;
if (compoundflag) m =
affix_check_morph((word+i),strlen(word+i),
compoundflag);
if ((!m || m == '\0') && compoundend)
m = affix_check_morph((word+i),strlen(word+i),
compoundend);
strcat(result, presult);
line_uniq(m);

The null pointer gets passed to line_uniq(). We just
can’t figure out what is the correct way to fix this.
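
The general shape of a guard against this kind of crash can be sketched as follows. This is not the fix that was applied in Hunspell: `no_result`, `checked_length` and `result_length` are hypothetical stand-ins, with `checked_length` playing the role of a routine that, like line_uniq(), must not receive a null pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when an analysis result is missing or empty. Note the dereference:
// comparing the pointer itself with '\0' would only repeat the null test.
bool no_result(const char* m) {
    return m == NULL || *m == '\0';
}

// Stand-in for a routine that requires a valid string.
std::size_t checked_length(const char* m) {
    assert(m != NULL);
    return std::strlen(m);
}

// Caller-side guard: only forward results that actually exist.
std::size_t result_length(const char* m) {
    return no_result(m) ? 0 : checked_length(m);
}
```

The same pattern applied to the quoted snippet would mean testing the result of the second affix_check_morph() call before handing it on.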

Original comment by: hatp

Original Ticket: hunspell/bugs/10

Incorrect handling of the letter “ё” in Ru

HunSpell ignores the letter “ё”. One explanation: that
letter is not obligatory in the modern Russian orthography
and many people normally use “е” instead. I am among
those that insist on using “ё” and I have the ru_RU_yo.*
module for Russian in OOo. Again, one simple example:
I edit a text with “е”-spelling and have, say, “ее” instead
of “её” that I prefer. “ее” is marked as error and MySpell
suggests “её” among other variants. HunSpell does not.

Original comment by: gabix

Original Ticket: hunspell/bugs/2

altlinuxHyph added code

Some suggested code for substring.c.
Add what looks okay, leave-out what doesn’t seem okay.

1) Moved messages to one area – should allow for messages to be re-used. Could possibly add other languages by using #ifdef (hard-coded) or by inserting messages into arrays (load on demand).

2) Used << and >> for CPUs that don’t have a divide or multiply (less code, plus faster for some CPUs).

3) Added a possible suggested way of dealing with UTF-8 at lines 42/43/44 by using if (c>=0)
Maybe better to code as if (c>=0 && ascii8bit) where ascii8bit=0 for UTF-8 and !=0 for ISO-885x text.

Original comment by: josedasilva

Original Ticket: hunspell/patches/4

"coverity" warnings

Coverity does some scanning for potential problems; here’s a patch that silences those warnings. They are fairly unlikely edge cases, but a patch is attached.

Original comment by: caolan

Original Ticket: hunspell/bugs/35

no % command in PIPE mode

In pipe mode, % is not recognised as a command
for “Exit terse mode”. This ispell compatibility
command is needed, for example, by Emacs’ ispell.el.

current behavior:
----------------------------------------
echo "%" | hunspell.exe -a
@(#) International Ispell Version 3.2.06 (but really
Hunspell HUNSPELL_VERSION)1
.1.4 – Magyar 1.1.1
*

expected behavior:
----------------------------------------
echo "%" | hunspell.exe -a
@(#) International Ispell Version 3.2.06 (but really
Hunspell HUNSPELL_VERSION)1
.1.4 – Magyar 1.1.1
----------------------------------------

Here is a patch that might solve the problem:
diff -c hunspell.cxx~ hunspell.cxx
----------------------------------------
*** hunspell.cxx~ Mon Nov 13 14:13:00 2006
--- hunspell.cxx Mon Nov 13 14:19:02 2006
***************
*** 314,319 ****
--- 314,320 ----
pos = 1;
switch (buf[0]) {
case '!': { break; }
+ case '%': { break; }
case '+': {
delete parser;
parser = new LaTeXParser(wordchars);
----------------------------------------

Original comment by: marot

Original Ticket: hunspell/bugs/17

HunSpell unaware of Esperanto

HunSpell seems to be unaware of Esperanto. Here’s
a line from my dictionary.lst:

DICT eo ANY eo_EO

This works for MySpell (with OOo pre-1.9/2.0), but not
with HunSpell.

Original comment by: gabix

Original Ticket: hunspell/bugs/4

src/tools/munch.h missing from 1.1.5 release

Building hunspell-1.1.5 failed with this error message:

then I found that munch.h is missing from src/tools

$ ls src/tools
example.cxx hunspell.cxx makealias Makefile.am munch.c
hunmorph.cxx hunstem.cxx Makefile Makefile.in unmunch.c

Original comment by: khaledhosny

Original Ticket: hunspell/bugs/19

Syntax error in configure.ac

configure.ac uses
EXPERIMENTAL = "…"
(note the spaces)

This is valid C/C++, but not valid shell code: a shell variable assignment must not have spaces around the equals sign.

Original comment by: bero

Original Ticket: hunspell/bugs/28

patch for possible memory leak

There’s a small memory leak (patch attached). The problem is…

1625 char * m = morph(word);

At conditional (1): “m == 0” taking false path

1626 if(!m) return 0;

Event leaked_storage: Returned without freeing storage “m”
Event pass_arg: Variable “m” not freed or pointed-to in function “line_tok(const char *, char **)” [model]
Also see events: [alloc_fn][var_assign][pass_arg]
At conditional (2): “out == 0” taking true path

1627 if (!out) return line_tok(m, out);

Original comment by: caolan

Original Ticket: hunspell/bugs/32

osx/cygwin: invalid conversion from char** to const char**

hunspell.cxx: In function ‘char* chenc(char*, const char*, const char*)’:
hunspell.cxx:171: error: invalid conversion from ‘char**’ to ‘const char**’
hunspell.cxx:171: error: initializing argument 2 of ‘size_t libiconv(void*, const char**, size_t*, char**, size_t*)’
hunspell.cxx: In function ‘TextParser* get_parser(int, char*, Hunspell*)’:
hunspell.cxx:210: error: invalid conversion from ‘char**’ to ‘const char**’
hunspell.cxx:210: error: initializing argument 2 of ‘size_t libiconv(void*, const char**, size_t*, char**, size_t*)’
hunspell.cxx:232: error: invalid conversion from ‘char**’ to ‘const char**’
hunspell.cxx:232: error: initializing argument 2 of ‘size_t libiconv(void*, const char**, size_t*, char**, size_t*)’
hunspell.cxx:260: error: invalid conversion from ‘char**’ to ‘const char**’
hunspell.cxx:260: error: initializing argument 2 of ‘size_t libiconv(void*, const char**, size_t*, char**, size_t*)’
make[3]: *** [hunspell.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Original comment by: b6s

Original Ticket: hunspell/bugs/33

words with leading - accepted with LANG de_DE

with “LANG de_DE” being set words beginning with “-” are falsely accepted when “-” is defined as WORDCHAR. For example:

SET ISO8859-1
TRY esianrtolcdugmphbyfvkwd|v_aij`bqESIANRTOLCDUGMPHBYFVKWD\V-
WORDCHARS -
LANG de_DE

1
test

results with hunspell 1.1.5 in:

> echo -Test -test | hunspell115 -l -d testumlaut
>

while with hunspell 1.1.4 it was (which looks more correct):

> echo -Test -test | hunspell114 -l -d testumlaut
-Test
-test
>

when the “LANG de_DE” definition is removed also hunspell 1.1.5 says the “leading dash words” are incorrect.

Original comment by: bjacke

Original Ticket: hunspell/bugs/24

pkg-config file

Related to 1610756; it makes sense to add a pkg-config file. Easy patch attached.

If 1610756 isn’t applied before this obviously the versioning of the lib has to go…

Original comment by: *anonymous

Original Ticket: hunspell/patches/3

-q/-v option to munch

It would be handy to have a quiet or verbose option to
munch. Currently, it’s spitting out tons of
information. For me, that’s annoying, because I need
to call it almost hundred times (it cannot handle our
lexicon at once: after 16 hours it wasn’t finished
yet). In a script, I sent the stderr output to
/dev/null, but then it took me a day to discover that
the path wasn’t set correctly and the permissions
weren’t right (because they were also sent to
/dev/null). -q would hide the parsing information on
request, but still show other errors.

Original comment by: fouvry

Original Ticket: hunspell/bugs/13

support for comments in .dic files

Comments are no problem in .aff files. In .dic files, comments would probably be a performance killer, but I think it would be okay to allow comments only at the start of the .dic file. How about ignoring everything at the top of the .dic file up to the first line starting with a digit, which is currently the first line of the file and gives the number of dictionary entries? An example .dic file could be:

--- snip ---
this is the xyz dictionary file for hunspell 1.1.6

author: Foo Faa

2
dictionaryentry1/FLAGS
dictionaryentry2/FLAGS
--- snap ---

Original comment by: bjacke

Original Ticket: hunspell/feature-requests/10

Move logic for locating the dictionaries to the library

Especially useful if the patch for 1731630 gets applied:

It would be useful to move the “locate the dictionary” code from the command line frontend to the library, so applications don’t need to reinvent the wheel to figure out what parameters to pass to the Hunspell constructor.

A new

Hunspell(const char * language=0)

constructor (with language being set to getenv(“LANG” / “LC_MESSAGES” / “LC_ALL”) if it’s 0) would be a nice way to add it to the API.
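
A minimal sketch of the environment lookup such a constructor might perform. The helper names are hypothetical, and checking LC_ALL first, then LC_MESSAGES, then LANG is an assumption based on the usual POSIX precedence; the ticket itself only names the three variables.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: turns a locale value such as "de_DE.UTF-8@euro"
// into a dictionary base name such as "de_DE" by cutting at '.' or '@'.
std::string locale_to_dict_name(const std::string& locale) {
    const std::string::size_type cut = locale.find_first_of(".@");
    return locale.substr(0, cut);  // cut == npos keeps the whole string
}

// Hypothetical helper: picks the language from the environment.
// The lookup order is an assumption (POSIX precedence).
std::string language_from_env() {
    const char* names[] = {"LC_ALL", "LC_MESSAGES", "LANG"};
    for (const char* name : names) {
        const char* value = std::getenv(name);
        if (value != nullptr && *value != '\0') {
            return locale_to_dict_name(value);
        }
    }
    return std::string();  // nothing set
}
```

A constructor taking `language=0` could call `language_from_env()` and then search the dictionary directories for the matching .aff/.dic pair. Values such as "C" or "POSIX" would need separate handling.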

Original comment by: bero

Original Ticket: hunspell/feature-requests/14

homonym bug with NEEDAFFIX flag and suggestions

It seems there is another small homonym bug in 1.1.5: when the dictionary contains two identical root words, the first with a PSEUDOROOT (NEEDAFFIX) flag and the second with other flags (but without PSEUDOROOT), the root word is accepted but never suggested. For example:

TRY esianrtolcdugmphbyfvkwESIANRTOLCDUGMPHBYFVKW
NEEDAFFIX h
SFX S Y 1
SFX S 0 s .

SFX e Y 1
SFX e 0 e .

2
Mull/he
Mull/S

results in:

> echo Mull Mulle Mulls Mall Malle Malls | hunspell -d testdict
Hunspell 1.1.5
*
+ Mull
+ Mull

# Mall 17
& Malle 1 22: Mulle
& Malls 1 28: Mulls

btw: when the pseudoroot dictionary entry comes last in the dictionary, the root word is suggested as well.

Original comment by: bjacke

Original Ticket: hunspell/bugs/23

Suggestion List throws unhandled exception

Hi,

for one dictionary (maybe also for others) I got an access violation in the function:

int SuggestMgr::replchars(char** wlst, const char * word, int ns, int cpdsuggest)

The reason: reptable[i].pattern2 and reptable[i].pattern were NULL. It seems that numrep is larger than the number of entries actually filled in reptable.

The workaround:
….
for (int i = 0; i < numrep; i++)
{
    if (reptable[i].pattern != NULL && reptable[i].pattern2 != NULL) // <-- new
    {
        r = word;
        lenr = strlen(reptable[i].pattern2);
….

greets
Ingo
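
For illustration, the guard described above can be tried out in isolation with a simplified stand-in. The `replentry` struct and `rep_candidates` function below are assumptions made for this sketch and are not the actual Hunspell definitions; only the NULL check mirrors the workaround.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Simplified stand-in for a replacement table entry; the real struct in
// Hunspell may have more fields.
struct replentry {
    const char* pattern;
    const char* pattern2;
};

// Hypothetical, simplified replacement loop: for every usable entry, each
// occurrence of pattern yields one candidate with that occurrence replaced
// by pattern2. Entries with a NULL side are skipped instead of dereferenced.
std::vector<std::string> rep_candidates(const replentry* reptable, int numrep,
                                        const std::string& word) {
    std::vector<std::string> out;
    for (int i = 0; i < numrep; i++) {
        if (reptable[i].pattern == NULL || reptable[i].pattern2 == NULL)
            continue;  // the guard from the workaround
        const std::size_t lenp = std::strlen(reptable[i].pattern);
        if (lenp == 0) continue;  // an empty pattern would match everywhere
        std::size_t pos = 0;
        while ((pos = word.find(reptable[i].pattern, pos)) != std::string::npos) {
            std::string cand = word;
            cand.replace(pos, lenp, reptable[i].pattern2);
            out.push_back(cand);
            ++pos;
        }
    }
    return out;
}
```

The guard only hides the symptom. If numrep really is larger than the number of filled entries, the table loading code is the place to look for the root cause.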

Original comment by: idb_winshell

Original Ticket: hunspell/bugs/25

munch is really slow

Hi,

from http://bugs.debian.org/428284:

- snip -

From: Kurt Roeckx <[email protected]>
To: [email protected]
Subject: hunspell-tools: munch is really slow
Date: Sun, 10 Jun 2007 14:27:28 +0200

Package: hunspell-tools
Version: 1.1.5-6

Hi,

The munch command is really slow. It does almost the same as
munchlist from the ispell package, which takes 7.5 seconds here, while
munch takes 22 minutes.

aspell also has something to do that, does it in 2 seconds,
but doesn’t seem to produce similar/compatible output.

All 3 basically do the same thing, so I wonder why hunspell’s is
so slow compared to the other 2.

Kurt

[…]
From: Thijs Kinkhorst <[email protected]>
To: [email protected]
Subject: Re: hunspell-tools: munch is really slow
Date: Sun, 10 Jun 2007 19:59:18 +0200

Hi,

> The munch command is really slow. It does almost the same as
> munchlist from the ispell package, which takes 7.5 seconds here, while
> munch takes 22 minutes.

Yes, I can reproduce this here. It’s particularly annoying since the tool is
used in the build process of our package ‘dutch’, and as such every build
takes very long.

I’d highly appreciate it if this could be addressed. Please let me know if you
need more information.

thanks!
Thijs
- snip -

Regards,

Rene

Original comment by: *anonymous

Original Ticket: hunspell/bugs/29

allow NOSUGGEST in affix rules

What do you think about allowing the NOSUGGEST flag inside affix rules, like this:

NOSUGGEST n

SFX Z Y 3
SFX Z 0 a b
SFX Z 0 a/n c
SFX Z 0 a/n d

With this, all three rules of the Z suffix class would be valid, but forms produced by the last two would never be suggested.
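
A rough sketch of the intended semantics, using a deliberately simplified rule model. The `SuffixRule`, `Form` and `expand` names are invented for this illustration and do not correspond to Hunspell internals; the condition is treated as a plain stem ending rather than a full condition pattern.

```cpp
#include <string>
#include <vector>

// Simplified suffix rule: append `add` when the stem ends with `condition`.
// `nosuggest` models the proposed per-rule NOSUGGEST flag ("a/n").
struct SuffixRule {
    std::string add;
    std::string condition;
    bool nosuggest;
};

// A generated word form and whether it may appear in suggestion lists.
struct Form {
    std::string word;
    bool suggestable;
};

bool ends_with(const std::string& s, const std::string& tail) {
    return s.size() >= tail.size() &&
           s.compare(s.size() - tail.size(), tail.size(), tail) == 0;
}

// Every matching rule yields an accepted form; only forms from rules
// without the flag are marked as suggestable.
std::vector<Form> expand(const std::string& stem,
                         const std::vector<SuffixRule>& rules) {
    std::vector<Form> forms;
    for (const SuffixRule& r : rules) {
        if (ends_with(stem, r.condition)) {
            forms.push_back(Form{stem + r.add, !r.nosuggest});
        }
    }
    return forms;
}
```

With the three Z rules from the example, a stem ending in "b" gives a form that is both accepted and suggestable, while stems ending in "c" or "d" give forms that are accepted but kept out of suggestions.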

Original comment by: bjacke

Original Ticket: hunspell/feature-requests/9

UTF-8 test failure due to TextParser

Hunspell fails the UTF-8 test. Example in the tests
directory:

$ echo $LC_ALL
en_US.UTF-8
$ hunspell -d utf8 < utf8.good
UTF-8 encoding error. Missing continuation byte in 0.
character position:

The reason is that the text parser returns only single
byte characters and thus the u8_u16 conversion fails
for multibyte UTF-8 characters.

TextParser::put_line()
returns a single character token which, when fed to
u8_u16(), results in the error message.
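
To illustrate the difference between byte-wise and character-wise tokens, here is a small sketch that groups UTF-8 bytes by the length encoded in the lead byte. The function names are hypothetical and this is not the TextParser code; it only shows the standard UTF-8 lead-byte rule.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Invalid lead bytes also return 1 so that the caller always advances.
std::size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Splits a UTF-8 string into whole characters instead of single bytes.
std::vector<std::string> utf8_chars(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8_seq_len(static_cast<unsigned char>(s[i]));
        if (i + n > s.size()) n = s.size() - i;  // truncated sequence at end
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}
```

A token that contains only the lead byte of a two-byte character is exactly the kind of input for which a UTF-8 decoder reports a missing continuation byte.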

Original comment by: jessbody

Original Ticket: hunspell/bugs/8

Please change all UTF-8 characters in CXX to escaped form

For example, please write the UTF-8 BOM as “\xEF\xBB\xBF”, since some compilers (e.g. VC++) in some locales (e.g. Chinese) complain about “newline constant” when they encounter raw UTF-8 characters.
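
As an illustration of the escaped form, a BOM check could look like the sketch below. `kUtf8Bom` and `strip_utf8_bom` are hypothetical names chosen for this example, not identifiers from the Hunspell sources.

```cpp
#include <cstddef>
#include <string>

// UTF-8 byte order mark written with hex escapes, so the source file
// itself stays pure ASCII regardless of the compiler's locale.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical helper: removes a leading BOM from a line, if present.
// Returns true when a BOM was stripped.
bool strip_utf8_bom(std::string& line) {
    const std::size_t n = sizeof(kUtf8Bom) - 1;  // exclude the trailing '\0'
    if (line.compare(0, n, kUtf8Bom) == 0) {
        line.erase(0, n);
        return true;
    }
    return false;
}
```

One detail to watch with hex escapes: a hex digit that directly follows an escape is read as part of it, so a literal such as "\xBF" followed by text starting with A-F or 0-9 should be split into two adjacent string literals.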

Thank you.

Original comment by: b6s

Original Ticket: hunspell/bugs/22

Problem with letter case for Russian

Russian uses abbreviations “MGc” and “mGc” for “MHz”
and “mHz” respectively. However, if I type “MGC”, it’s
marked as an error (although it should not be!), and if I
right-click, among some other suggestions I see
the word itself (i.e., exactly “MGC”)! And if I choose it in
the drop-down list, it is still marked as a mistake.

Screenshot attached.

Original comment by: gabix

Original Ticket: hunspell/bugs/9

munch and hunspell v unmunch

Hi,

You said that unmunch lacks some of the capabilities of
hunspell. What is the status with munch? We may have
to compile larger lexica, and would obviously need
munch for that.

And another question: if unmunch cannot generate all
strings from the lexicon, can hunspell find all
suggestions (since that involves the same procedure)?

Thanks,

Frederik

Original comment by: fouvry

Original Ticket: hunspell/support-requests/1

Too few suggestions for misspelled words

HunSpell produces far fewer suggestions for misspelled
words than MySpell. That means that on
some occasions you don’t find the variant you’d want to
use.
Here’s one example in Russian. Instead of “Фирма”
(“Firm”, “Company”) I write “Фрима”.
HunSpell suggests for that case:
Прима
Грима
Зрима
Obviously, none of those variants is what I need.
MySpell in turn suggests:
прима
зрима
грима
фирма
фриза
фрица
So, there’s the right variant, although MySpell ignores the
fact that the word should be written with the first letter
capitalized.

Original comment by: gabix

Original Ticket: hunspell/bugs/1

Problem installing dictionary

Hi!
I just got the newest Hunspell (German) for my OO 2.0.
After a lot of browsing I found out that I need a
reworked dictionary as well. I got it here:
http://j3e.de/hunspell/

Now, how to install the friggin thing? I loaded it with
the package manager, but it cannot be activated. It
doesn’t show up in the linguistics-section either. So
Hunspell is pretty much useless to me at this point…
Any suggestions?

cheers,
Tom

tomploeger “at” web.de

Original comment by: *anonymous

Original Ticket: hunspell/support-requests/3

not handling Hebrew dictionary well

(initially posted on https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=222213 but moved “upstream”)

Description of problem:
After reading the Aspell Hebrew .dic and .aff files (written in the ISO8859-8 charset), Hunspell ignores Hebrew text when run in UTF-8 locale

Version-Release number of selected component (if applicable):
hunspell-1.1.4-3

How reproducible:
always

Steps to Reproduce:
1. Take the Hebrew Aspell dictionary from ftp://ftp.gnu.org/gnu/aspell/dict/he/aspell6-he-1.0-0.tar.bz2
2. remove the initial space character from the .dic file, and change the SET option from ISO8859-8 to ISO8859-8-I in the .aff file
3. run hunspell -d he-IL in a UTF-8 locale

Actual results:
Latin words are recognized as errors. Hebrew words (in UTF8) are just ignored.
Hebrew words (in ISO8859-8-I) are treated well.

Expected results:
I would expect Hunspell to behave as Aspell6 (and probably myspell) does: no matter what encoding the dictionary is written in, the text to be checked is assumed to be in the user’s runtime locale. If Aspell is run in a UTF8 locale, it expects UTF8 text. And if it is run in the he_IL locale, it expects 8 bit text. I bet this happens in other non-unicode encodings, but I neither checked nor looked in the code.

Other than that, it would be nice if Hunspell recognized ISO8859-8 (without the -I, which stands for “implicit” directionality). Since ISO8859-8 is almost never used these days, it has become a synonym of ISO8859-8-I. Also, Hunspell could behave more like Aspell and ignore leading whitespace in the first line of the .dic file.
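
The expected behaviour amounts to converting input from the locale encoding to the dictionary encoding before lookup. A toy sketch for this specific pair of encodings follows; it relies on the fact that the Hebrew letters U+05D0..U+05EA are encoded in UTF-8 as 0xD7 0x90..0xAA and in ISO-8859-8 as 0xE0..0xFA. The function name is hypothetical, and a real implementation would more likely use a general conversion library.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-8 text to ISO-8859-8 bytes for ASCII and the Hebrew
// letters U+05D0..U+05EA only. Returns false on anything else.
bool utf8_hebrew_to_iso8859_8(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));  // ASCII is identical
        } else if (c == 0xD7 && i + 1 < in.size()) {
            const unsigned char t = static_cast<unsigned char>(in[i + 1]);
            if (t < 0x90 || t > 0xAA) return false;
            // U+05D0 (0xD7 0x90) maps to 0xE0, and so on up to 0xFA.
            out.push_back(static_cast<char>(0xE0 + (t - 0x90)));
            ++i;  // skip the continuation byte
        } else {
            return false;  // outside the supported subset
        }
    }
    return true;
}
```

With a step like this in front of the lookup, UTF-8 input could be checked against an ISO8859-8-I dictionary without converting the dictionary files themselves.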

Additional info:
This whole bug may disappear if you say that you simply do not support 8 bit encodings. This easy way out would require converting the dictionary files to UTF8 (which would make them almost twice as big), and would sadden those of us who keep ISO8859-8-I files.

See also http://ivrix.org.il/bugzilla/show_bug.cgi?id=83 (Ivrix is where the Hebrew Speller and Hebrew dictionary are made)

Original comment by: *anonymous

Original Ticket: hunspell/bugs/18

ispell.el

Hello!
First of all I thank you for improving the OOo
spellchecker! I’m frequently forced to typeset in
LaTeX, and it would be great if hunspell could do the
job there, in my xemacs as well, since the dictionaries
and recognition of German compounds would not be
changed. I tried a little bit but got stuck. First reason:

hunspell -v only returns
Hunspell 1.1.4
hunspell -vv as well.
It would be great if hunspell returned the complete
line, as it does when I actually spellcheck with a
parameter list like

hunspell -a -m -d de-DE
@(#) International Ispell Version 3.2.06 (but really
Hunspell HUNSPELL_VERSION)1.1.4
Thanks for your help and best regards

Original comment by: lokros

Original Ticket: hunspell/feature-requests/3

Make an educated guess at the right dictionary to use

When invoked from the command line without a -d parameter, hunspell tries to use “default.{aff,dic}” unconditionally.

On most Linux distributions (and probably other OSes that use hunspell), the dictionary directory contains files named after the supported language, following the same naming scheme as the LANG/LC_* environment variables.

It should be safe to assume the user wants a dictionary for $LANG if none was specified and default.{aff,dic} doesn’t exist.
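
The proposed fallback order could be sketched like this. `choose_dictionary` is a hypothetical name, and the `exists` callback is an assumption that stands in for whatever check the frontend uses to see whether both <name>.aff and <name>.dic are available.

```cpp
#include <functional>
#include <string>

// Hypothetical selection order: an explicit -d value wins, then "default"
// if it exists, then the name derived from $LANG. Returns an empty string
// when nothing suitable is found.
std::string choose_dictionary(
    const std::string& explicit_name,
    const std::string& lang,
    const std::function<bool(const std::string&)>& exists) {
    if (!explicit_name.empty()) return explicit_name;
    if (exists("default")) return "default";
    if (!lang.empty() && exists(lang)) return lang;
    return std::string();
}
```

Keeping "default" ahead of $LANG preserves the current behaviour for existing setups, so the guess only takes effect where hunspell would otherwise fail to find a dictionary.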

Original comment by: bero

Original Ticket: hunspell/feature-requests/13

Documentation bugs

It is unclear from the documentation
- whether an entry in the dictionary can have a slash
in it. A precise syntax description would certainly be
helpful.
- whether the word count at the beginning of the file
is compulsory. What happens if the value is incorrect?
- whether comments are possible in the dictionary

Thanks!

Original comment by: fouvry

Original Ticket: hunspell/bugs/7
