libpinyin / libpinyin

Library to deal with pinyin.

License: GNU General Public License v3.0

CMake 1.06% Shell 0.07% Python 5.45% C++ 86.71% C 4.25% Makefile 1.41% M4 1.05%

libpinyin's Introduction

The libpinyin project aims to provide the algorithms core for intelligent sentence-based Chinese pinyin input methods.

libpinyin's Issues

Are pinyin_get_candidates and pinyin_choose_candidate deprecated?

pinyin_get_candidates and pinyin_choose_candidate now seem to have corresponding "full_pinyin" versions (which work with both chewing and pinyin).

However, the old pinyin_get_candidates causes an infinite loop ("dead loop") in 0.6.91.

By the way, since the ABI is broken (at least for struct "pinyin_instance_t"), please bump the so version.

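Bug when undoing a committed word

For example, I type womendoushihaohaizi and press 2 to commit the word "我们". The first candidate then becomes "都是好孩子".
Suppose I then realize that "我们" was the wrong choice, for example because I wanted to type "他们都是好孩子". I have to delete "我们", so I press delete and type "他们". Unfortunately, I can only type "tamen" and cannot select the characters, because the candidate list still shows "都是好孩子".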

Any plan for alternatives to Berkeley DB?

This is a feature request for an alternative to Berkeley DB, as it has been marked as deprecated and may be dropped by Arch Linux in the long term.

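Zero-initial problem in the Xiaohe (小鹤) double pinyin scheme

For example, ang cannot be typed with ah.
ei cannot be typed with either ei or ew.

It seems the o* format has to be used, as in the Microsoft scheme.
But that is not how Xiaohe defines it.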

pinyin_translate_token has a meaningless return value

bool retval = context->m_phrase_index->get_phrase_item(token, item);

get_phrase_item does not return a bool. Should the check be something like "== ERROR_OK"?
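
To illustrate the concern, here is a minimal self-contained sketch. The enum values and fake_get_phrase_item are hypothetical stand-ins, and the sketch assumes the success code is 0, as is common; they are not libpinyin's actual definitions.

```cpp
#include <cassert>

// Hypothetical error codes, assuming success is 0 (not libpinyin's real enum).
enum DemoErrorResult { DEMO_ERROR_OK = 0, DEMO_ERROR_NO_ITEM = 1 };

// Hypothetical stand-in for a lookup that returns an int error code.
int fake_get_phrase_item(int token) {
    return token >= 0 ? DEMO_ERROR_OK : DEMO_ERROR_NO_ITEM;
}

// Buggy pattern: the error code is converted to bool,
// so success (0) becomes false and failure (non-zero) becomes true.
bool lookup_buggy(int token) {
    bool retval = fake_get_phrase_item(token);
    return retval;
}

// Fixed pattern: compare explicitly against the success code.
bool lookup_fixed(int token) {
    return fake_get_phrase_item(token) == DEMO_ERROR_OK;
}
```

Under that assumption, the buggy variant reports exactly the opposite of what the caller expects, while the explicit comparison does not depend on how the code converts to bool.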

Wrong pinyin

Typing ge suggests "名", typing guang suggests "观", and typing a suggests "拿".

data files: need to be moved under $(libdir)

Because the data files under /usr/share/libpinyin/data/ are all endianness sensitive, they need to be moved to /usr/lib. They would be very welcome at $(libdir)/libpinyin/data, since this would help a lot in satisfying the requirements of the Multiarch Specification [1], which makes files for different architectures co-installable within a single system.

Again, please make a release early so we can make the Debian Wheezy freeze, which is due this June.

[1] https://wiki.ubuntu.com/MultiarchSpec
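
As background on what "endianness sensitive" means here, the following sketch contrasts the host's in-memory byte order with an explicit little-endian encoding. It is only an illustration with made-up helper names; it does not describe how libpinyin actually writes its data files.

```cpp
#include <cassert>
#include <cstdint>

// Writes a 32-bit value in little-endian order regardless of host endianness.
void store_le32(std::uint32_t value, unsigned char out[4]) {
    out[0] = static_cast<unsigned char>(value & 0xFF);
    out[1] = static_cast<unsigned char>((value >> 8) & 0xFF);
    out[2] = static_cast<unsigned char>((value >> 16) & 0xFF);
    out[3] = static_cast<unsigned char>((value >> 24) & 0xFF);
}

// Reads a 32-bit little-endian value regardless of host endianness.
std::uint32_t load_le32(const unsigned char in[4]) {
    return static_cast<std::uint32_t>(in[0]) |
           (static_cast<std::uint32_t>(in[1]) << 8) |
           (static_cast<std::uint32_t>(in[2]) << 16) |
           (static_cast<std::uint32_t>(in[3]) << 24);
}

// Reports whether the host stores the least significant byte first.
// A file produced by dumping raw memory is only readable on hosts
// for which this returns the same answer as on the writing host.
bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}
```

Files written by dumping in-memory structures directly inherit the host byte order, which is why they are architecture-dependent and belong under $(libdir) rather than /usr/share.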

pinyin_save is useless without an instance.

Grepping through the code, the only place that sets context->modified to true is inside pinyin_train, which requires an instance as an argument.

add_phrase obviously modifies the dictionary, but if you do an offline import, pinyin_save does nothing.
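
The behavior being asked for can be sketched with a tiny "dirty flag" model. DemoContext, demo_add_phrase, and demo_save are hypothetical names invented for this illustration; they are not libpinyin's types or functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature of a context with a "modified" (dirty) flag.
struct DemoContext {
    std::vector<std::string> phrases;
    bool modified = false;
};

// Any operation that changes the stored data marks the context dirty ...
void demo_add_phrase(DemoContext& ctx, const std::string& phrase) {
    ctx.phrases.push_back(phrase);
    ctx.modified = true;
}

// ... so that saving does real work after an offline import too.
// Returns true when something was written, false when there was nothing to save.
bool demo_save(DemoContext& ctx) {
    if (!ctx.modified)
        return false;
    // A real implementation would write ctx.phrases to disk here.
    ctx.modified = false;
    return true;
}
```

In this model, saving right after adding a phrase reports that data was written, without any training call in between.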

Build fails: gen_unigram segfaults.

../utils/storage/gen_binary_files --table-dir ../data
../utils/storage/import_interpolation < ../data/interpolation.text
../utils/training/gen_unigram
make[2]: *** [gbk_char.bin] Segmentation fault
make[2]: *** Deleting file "gbk_char.bin"
make[2]: *** Waiting for unfinished jobs....
../utils/storage/import_interpolation < ../data/interpolation.text

** (process:30382): CRITICAL **: bool pinyin::SubPhraseIndex::load(pinyin::MemoryChunk*, pinyin::table_offset_t, pinyin::table_offset_t): assertion `*(buf_begin + index_three - 1) == c_separate' failed
../utils/storage/import_interpolation < ../data/interpolation.text
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30382 Aborted               ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [phrase_index.bin] Error 134
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30384 Aborted               ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [gb_char.bin] Error 134
make[2]: Leaving directory "/home/yangtse/aur/libpinyin-git/src/libpinyin-build/data"
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory "/home/yangtse/aur/libpinyin-git/src/libpinyin-build"
make: *** [all] Error 2
==> ERROR: A failure occurred in build().
    Aborting...

pinyin_add_phrase result looks strange.

Called with the following sequence:

import_iterator_t* iter = pinyin_begin_add_phrases(context, 15);
pinyin_iterator_add_phrase(iter, hz, libpinyin->inst->m_raw_full_pinyin, -1);
pinyin_end_add_phrases(iter);

hz is 灼眼的夏娜
The pinyin is zhuoyandexiana.

After that, "zhuoyandexiana" does not give the expected candidate, and the phrase does not seem to be saved.

A strange problem

I want to type:

我不清楚诶 (wobuqingchuei)

But the pinyin gets changed to:

我不轻吹 (wobuqingchui)

Where did the e go?

pinyin_save hangs if the user.db is corrupted.

libpinyin should handle errors more gracefully.

The hang is in pinyin::Bigram::save_db(), at the tmp_db->open() call.

Fails to build with CMake

Building the HEAD version(3f74e38) with CMake fails with the error:

[ 25%] Built target libpinyin
[ 27%] Building CXX object utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o
/home/alick/Documents/Program/libpinyin/utils/storage/import_interpolation.cpp:25:26: fatal error: utils_helper.h: No such file or directory
compilation terminated.
make[2]: *** [utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o] Error 1
make[1]: *** [utils/storage/CMakeFiles/import_interpolation.dir/all] Error 2
make: *** [all] Error 2

It seems the cmake stuff hasn't been updated for a long time.
The software version in CMakeLists.txt says 0.2.99.

clear_constraint was removed from the API, which breaks fcitx-libpinyin

As far as I can see, everything this function needs is still inside the internal library. Could you add it back?

I looked through the whole API and did not find any equivalent.

Add an API for looking up Chinese character pronunciation

Input: a UTF-8 string containing one character.
Return: a chewing key array with all possible pronunciations.

RFC: support for making a user's own table

Just as the title says. :-)

For example:
韻 yonh

There should also be a distinction between ShengMu (initials) and YunMu (finals), e.g. k yonh -> 廣韻.

Do not pollute the namespace.

This is only a coding style problem and not very important right now.

pinyin.h contains "using namespace pinyin", which pulls an additional C++ namespace into every file that includes it.

Could you provide a tarball without the data file?

Since every tarball contains the data file (which is also provided by model.text.tar.gz), it becomes really big, yet the data file rarely changes.

Could you provide a source-only tarball so we do not need to download the 10MB file every time we update?

ChewingKey's functions are not callable since they are not visible.

So basically there is some API that "looks" visible but actually is not.

Please solve this by exposing these functions or by adding a C API for them.

ibus cannot follow the cursor in GNOME 3.14

import_interpolation and gen_unigram: segfaults and manpages

While testing and trying to package libpinyin for Debian, I found that import_interpolation and gen_unigram appear to segfault when run.

I know this is by design and that gen_binary_files should be run beforehand, but it is getting in the way of libpinyin entering Debian. A command installed into /usr/bin must not segfault and must have a manpage, though the manpage can be very simple.

Please note that Debian Wheezy will freeze this June, and I need the fixed version well before then.
Reasons: 1) I need time to work on libpinyin itself; 2) fcitx-libpinyin and ibus-libpinyin need libpinyin.

pinyin_lookup_token returns the wrong result

The result might be SEARCH_OK | SEARCH_CONTINUE, in which case this function should return true.

Well, at least this can be worked around for now.
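
The difference between an equality test and a bit test on such a combined result can be sketched as follows. The flag values below are hypothetical and chosen only for illustration; the real constants are defined by libpinyin.

```cpp
#include <cassert>

// Hypothetical flag values for illustration; the real constants live in libpinyin.
enum DemoSearchResult {
    DEMO_SEARCH_NONE = 0x00,
    DEMO_SEARCH_OK = 0x01,
    DEMO_SEARCH_CONTINUE = 0x02
};

// Comparing with == misses combined results such as OK | CONTINUE.
bool found_by_equality(int result) {
    return result == DEMO_SEARCH_OK;
}

// Testing the OK bit accepts any result that has it set.
bool found_by_mask(int result) {
    return (result & DEMO_SEARCH_OK) != 0;
}
```

With these values, a result of OK | CONTINUE is rejected by the equality test but accepted by the bit test, which matches the behavior requested above.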

SQLite vs BDB

For the bigram class, we could use SQLite instead of BDB. This would be very convenient for Android and other mobile devices.

Issue about pinyin_train()

I use pinyin_train() to update word frequencies, hoping that the order of the candidate words changes so that the words the user uses most appear at the front of the candidate list.
But train_result2() does not seem to work properly. If constraint->m_type is always NO_CONSTRAINT, the frequency never changes.
I tried changing the variable train_next to true, and that works, but I am not sure it is a good solution.
Could you please update the frequency-updating code? Thank you very much!

Cannot find the character "单" with the reading shan4

I am using fcitx.

Bug with resplit table?

I'm using pinyin_set_options with USE_RESPLIT_TABLE.

With xian, I get the following candidates:

先，西安，西岸，锡安，……

When I call pinyin_choose_full_pinyin_candidate with 锡安, pinyin_guess_sentence returns 西安.

{Feature Request} BDB -> Kyoto Cabinet Migration Tool

(Split from #44)

Would it be possible to have a migration tool, so users using bdb can safely migrate their config folder? Preferably with auto-detection etc, but a simple command line tool could be enough if that's too complicated.

Thanks!

bug in in_chewing_scheme

The function should return true after the check if (search_chewing_tones(m_tone_table, key, &tone)):

     if (search_chewing_tones(m_tone_table, key, &tone)) {
         if (symbol)
             *symbol = chewing_tone_table[tone];
         return true;
     }

Failed to build with -Werror=format-security

libpinyin 0.5.0 failed to build when following C(XX)FLAGS are set:

CFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security
CXXFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security

Because I was building on another machine, the log messages have been lost, but the problem should be easy enough to reproduce.
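
Since the log is lost, the exact failing call is unknown. As a general illustration only, this flag typically rejects calls that pass a non-literal string as the format with no further arguments; the sketch below shows the usual safe form. format_message is a made-up helper, not code from libpinyin.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats text into a buffer using a constant format string.
// Passing `text` itself as the format, as in snprintf(buffer, size, text),
// is the kind of call that -Werror=format-security turns into an error,
// because any '%' sequences in `text` would be interpreted as conversions.
std::string format_message(const char* text) {
    char buffer[128];
    std::snprintf(buffer, sizeof(buffer), "%s", text);
    return std::string(buffer);
}
```

With the constant "%s" format, text containing percent signs is copied through unchanged instead of being interpreted.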

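By default, the single character "阿" is ranked before "啊".

Is this normal?

Generally speaking, although "阿" occurs frequently, it is rarely typed as a single character...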

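How do I back up my personal dictionary in ibus-libpinyin before a reinstall?

libpinyin's default dictionary is poor. How can I save my own dictionary and restore it after reinstalling the system?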

Some characters cannot be typed?

Some characters are in both GBK and UTF8, but they are not included in libpinyin.

Currently, I want to type 渃. Its pinyin is ruo. Its GBK is 9Cc, and UTF8 is %E6%B8%83

Could you please add the character? If you lack spare time, is there an easy way to contribute characters to the project?

configure error

I tried to install it on Ubuntu, and there is a configure error when running "./configure": Cannot find Berkeley DB library version 4. Actually, I have installed it with "apt-get install libdb4.7-dev".

How to build the library

I could use cmake to generate the Makefile, but I could not build the source code because <config.h> was missing.
So, what are the requirements of this library, and how do I build it?
Thanks

G++ will no longer support in-class initialization of non-integral static constants

http://gcc.gnu.org/onlinedocs/gcc/Deprecated-Features.html mentions:
G++ allows static data members of const floating-point type to be declared with an initializer in a class definition. The standard only allows initializers for static members of const integral types and const enumeration types so this extension has been deprecated and will be removed from a future version.

Both phrase_lookup.h and pinyin_lookup.h contain:
static const gfloat bigram_lambda = (LAMBDA_PARAMETER);
static const gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);

Because of this change in g++, the code needs to be changed as follows to compile with -std=c++0x:
static constexpr gfloat bigram_lambda = (LAMBDA_PARAMETER);
static constexpr gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);

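I noticed that apparently none of the .h files use these two constants. Perhaps the .h files could only declare them, with the assignment moved to a cpp file?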
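
Both approaches discussed in this issue can be sketched in a single self-contained file. The gfloat typedef stands in for GLib's type, the value of LAMBDA_PARAMETER is a placeholder rather than the project's real value, and the struct names are invented for the example.

```cpp
#include <cassert>

typedef float gfloat;             // stand-in for GLib's gfloat
#define LAMBDA_PARAMETER 0.25f    // placeholder value, not the project's real one

// Option 1: constexpr in-class initialization (requires C++11 or later).
struct LookupConstexpr {
    static constexpr gfloat bigram_lambda = (LAMBDA_PARAMETER);
    static constexpr gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);
};

// Option 2: only declare the constants in the class (header) ...
struct LookupOutOfClass {
    static const gfloat bigram_lambda;
    static const gfloat unigram_lambda;
};

// ... and define them in exactly one .cpp file.
const gfloat LookupOutOfClass::bigram_lambda = (LAMBDA_PARAMETER);
const gfloat LookupOutOfClass::unigram_lambda = (1 - LAMBDA_PARAMETER);
```

Option 2 also compiles with pre-C++11 compilers, which may matter if the headers must stay usable without -std=c++0x.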

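Strange finals in double pinyin

For example, wz becomes wuei instead of wei, and xr becomes xvan instead of xuan.

Candidate selection is still correct, but as a result fcitx's cloudpinyin gives wrong results.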

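Problem when deleting pinyin after typing it

I use ibus together with libpinyin.
After I type some pinyin and then delete it with the backspace key, the candidate window stays on the screen.
My libpinyin version is 0.6.92-1 and my ibus-libpinyin version is 1.4.1-1.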

Issue typing English words starting with 'v' or 'u' and pressing Enter

For example, when I type 'video' and press Enter, I get 'ideo'; when I type 'ubuntu' and press Enter, I get 'buntu'. That is wrong.

Some dictionary names are hard-coded in the generation step

This makes little sense.

If you want to generate from a special file, you should specify the file name on the command line instead.

Documentation needed.

Please provide some documentation on how to use this library. Something like Doxygen would be useful.

Add bindings for other programming languages

Having language bindings for Python, Java, ... would be nice.
BTW, the exported API is defined in src/pinyin.h, right?

don't use ENODATA in import_interpolation.cpp

It is undocumented and thus meaningless; exit(1) will work.
ENODATA does not exist on some systems, e.g. kfreebsd.

https://buildd.debian.org/status/fetch.php?pkg=libpinyin&arch=kfreebsd-i386&ver=0.5.92-1&stamp=1337480524
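
A portable alternative could look like the sketch below. The function names are hypothetical, and the EIO fallback is an arbitrary choice made for this example rather than anything the project uses.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Returns a process exit status for "input data missing".
// EXIT_FAILURE is defined by the C and C++ standards, unlike ENODATA,
// which some platforms do not provide.
int missing_data_exit_status() {
    return EXIT_FAILURE;
}

// If an errno-style value is still wanted, guard the non-portable macro.
int missing_data_errno() {
#ifdef ENODATA
    return ENODATA;
#else
    return EIO;  // arbitrary fallback chosen for this sketch
#endif
}
```

Either variant compiles on systems without ENODATA; the first is the closest equivalent to the exit(1) suggested above.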

provide separate data for Traditional Chinese

libpinyin should provide Traditional Chinese data, with its own language model and characters. Using opencc is not a perfect solution, because opencc cannot distinguish between different characters that are both valid in Traditional Chinese, such as 台 and 臺.

libpinyin 0.6.91 fails to build on armel

It builds fine on nearly every Debian architecture except armel.

https://buildd.debian.org/status/package.php?p=libpinyin

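Previous/next page arrows disappear in some cases

OS: Arch Linux (Linux dee 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 12:52:41 CEST 2013 i686 GNU/Linux)
Desktop: GNOME 3.8

Example: type women and press . to go to the next page. The previous-page and next-page arrows appear at the far right, but after the third press the arrows disappear.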

Wrong flags used in configure.in and Makefile.am files

In configure.ac, there is:

GLIB2_CPPFLAGS=`$PKG_CONFIG --cflags glib-2.0`
GLIB2_LDFLAGS=`$PKG_CONFIG --libs glib-2.0`

But this is not correct: libibus is actually a C-only library, so there is no use in specifying CPPFLAGS. Also, the macro PKG_CHECK_MODULES actually defines GLIB2_CFLAGS and GLIB2_LIBS, which can be used in the Makefile.am files.

Patch here:
http://svnweb.mageia.org/packages/cauldron/libpinyin/current/SOURCES/libpinyin-0.8.0-link.patch

Wrong PinyinKeyPos when using divide table

Type xianjiaotongdaxue, then choose "西安"

The second PinyinKeyPos automatically changes to begin=3 and end=5, which is wrong; it should be begin=2 and end=4. It seems libpinyin uses "xi'an" as the raw key, but the raw key actually passed in has no separator (').
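
The off-by-one in this report is consistent with offsets being measured on a string that contains the separator. The following sketch, using hypothetical names and a half-open [begin, end) convention assumed for the example, maps such offsets back to the raw input by discounting the separators that precede them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical half-open position [begin, end) of one pinyin key.
struct DemoKeyPos {
    std::size_t begin;
    std::size_t end;
};

// Counts separators (') that appear before index `limit`.
static std::size_t separators_before(const std::string& text, std::size_t limit) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < limit && i < text.size(); ++i) {
        if (text[i] == '\'')
            ++count;
    }
    return count;
}

// Maps a position measured on the separated string back to the raw input.
DemoKeyPos to_raw_pos(const std::string& separated, DemoKeyPos pos) {
    DemoKeyPos raw;
    raw.begin = pos.begin - separators_before(separated, pos.begin);
    raw.end = pos.end - separators_before(separated, pos.end);
    return raw;
}
```

For "xi'an", the key "an" sits at begin=3, end=5 in the separated string and maps back to begin=2, end=4 in the raw input "xian", which are the values expected in the report.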

Cannot parse "ㄓ" as "zhi" in chewing

"ㄓ" is parsed as "zh" rather than "zhi".

The same applies to "ㄔ" "ㄕ" "ㄖ" "ㄗ" "ㄘ" "ㄙ".
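
The expected readings for these seven symbols when typed on their own can be written down as a small lookup table. This is only a sketch of the expected mapping with a made-up function name; it says nothing about how libpinyin's chewing parser is implemented.

```cpp
#include <cassert>
#include <map>
#include <string>

// Returns the full pinyin syllable for a zhuyin initial typed on its own,
// or an empty string if the symbol is not one of the seven listed.
std::string standalone_zhuyin_to_pinyin(const std::string& symbol) {
    static const std::map<std::string, std::string> table = {
        {"ㄓ", "zhi"}, {"ㄔ", "chi"}, {"ㄕ", "shi"}, {"ㄖ", "ri"},
        {"ㄗ", "zi"},  {"ㄘ", "ci"},  {"ㄙ", "si"}
    };
    std::map<std::string, std::string>::const_iterator it = table.find(symbol);
    return it == table.end() ? std::string() : it->second;
}
```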

Supporting for "word prediction" in libpinyin

Currently, I cannot find any interface in libpinyin for "word prediction". By "word prediction" I mean the following:
When I input “Shang” and then select “上”, is there any way to get predicted next words (like “海”, “去”…)?
This is a normal feature in other pinyin input methods. Is there any plan to support it?

Thanks
Best Regards
Jason ZHANG

fcitx-libpinyin relies on the binary tools of libpinyin.

I hope they are not removed.

Removing them would remove a feature from fcitx-libpinyin. Thanks.