
libpinyin's Introduction

libpinyin
Library to deal with pinyin.

The libpinyin project aims to provide the core algorithms for intelligent sentence-based Chinese pinyin input methods.

libpinyin's People

Contributors

alebcay, alick, byvoid, cordlandwehr, epico, felixonmars, inokinoki, jserv, lantw44, leeight, matias-larsson-matthews, obache, puretryout, ryuanlu, t-chaik, wengxt, zhangyuannie


libpinyin's Issues

Are pinyin_get_candidates and pinyin_choose_candidate deprecated?

pinyin_get_candidates and pinyin_choose_candidate now seem to have corresponding "full_pinyin" versions (though these work with both chewing and pinyin).

But the old pinyin_get_candidates causes an infinite loop in 0.6.91.

By the way, since the ABI is broken (at least for struct "pinyin_instance_t"), please bump the so version.

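Bug when undoing a committed word

For example, I type womendoushihaohaizi and press 2 to commit the word "我们". The first candidate then becomes "都是好孩子".
Suppose I now realize that "我们" was a mistake and I actually wanted "他们都是好孩子". I have to delete "我们", so I press delete and try to type "他们". Unfortunately, I can only type "tamen" and cannot select the characters, because the candidate list still shows "都是好孩子".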

Plan for alternatives to Berkeley DB?

This is a feature request for alternatives to Berkeley DB, as it has been marked as deprecated and may be dropped by Arch Linux in the long term.

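Zero-initial problem in the Xiaohe double pinyin scheme

For example, ang cannot be typed with ah.
ei cannot be typed with either ei or ew.

It seems one has to use the o* form, as in the Microsoft scheme.
But that is not how Xiaohe defines it.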

Wrong pinyin

Typing ge suggests "名", typing guang suggests "观", and typing a suggests "拿".

data files: need to be moved under $(libdir)

Because the data files under /usr/share/libpinyin/data/ are all endianness sensitive, they need to be moved to /usr/lib. Ideally they would live in $(libdir)/libpinyin/data, since this would help a lot in satisfying the requirements of the Multiarch Specification [1], which makes files for different architectures co-installable within a single system.

Again, please make a release early so we can make the Debian Wheezy freeze, which is due this June.

[1] https://wiki.ubuntu.com/MultiarchSpec
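
The endianness concern can be illustrated with a minimal sketch. This is not libpinyin's storage code, and the helper names are hypothetical. Copying an integer's in-memory bytes gives a different result on big-endian and little-endian machines, whereas serializing byte by byte in a fixed order does not.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Host-dependent: copies the in-memory representation, so the byte order
// follows the CPU. A raw dump like this is what makes a file endianness sensitive.
std::array<std::uint8_t, 4> raw_bytes(std::uint32_t value) {
    std::array<std::uint8_t, 4> out{};
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}

// Host-independent: always emits the least significant byte first.
std::array<std::uint8_t, 4> little_endian_bytes(std::uint32_t value) {
    return {{static_cast<std::uint8_t>(value & 0xFF),
             static_cast<std::uint8_t>((value >> 8) & 0xFF),
             static_cast<std::uint8_t>((value >> 16) & 0xFF),
             static_cast<std::uint8_t>((value >> 24) & 0xFF)}};
}

// Inverse of little_endian_bytes; works the same on any host.
std::uint32_t from_little_endian(const std::array<std::uint8_t, 4>& b) {
    return static_cast<std::uint32_t>(b[0]) |
           (static_cast<std::uint32_t>(b[1]) << 8) |
           (static_cast<std::uint32_t>(b[2]) << 16) |
           (static_cast<std::uint32_t>(b[3]) << 24);
}
```

Files written with something like raw_bytes are architecture specific and belong under $(libdir); files written in a fixed byte order could, in principle, be shared.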

pinyin_save is useless without an instance.

Grepping through the code, the only place that sets context->modified to true is inside pinyin_train, which requires an instance as an argument.

add_phrase obviously modifies the dictionary, but if you try to do an offline import, pinyin_save will do nothing.
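
The behaviour being asked for can be sketched as a plain dirty-flag pattern. The types and function names below are hypothetical stand-ins, not libpinyin's real context or API; the point is only that every mutating call, not just training, should mark the context as modified.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary context; not libpinyin's real type.
struct SketchContext {
    std::vector<std::string> phrases;
    bool modified = false;
    int save_count = 0;  // counts how many times data was actually written
};

// Any mutation marks the context dirty, not only training.
void sketch_add_phrase(SketchContext& ctx, const std::string& phrase) {
    ctx.phrases.push_back(phrase);
    ctx.modified = true;
}

// Writes only when something changed; returns whether a write happened.
bool sketch_save(SketchContext& ctx) {
    if (!ctx.modified)
        return false;
    ++ctx.save_count;  // a real implementation would persist to disk here
    ctx.modified = false;
    return true;
}
```

With this arrangement an offline import followed by a save writes the data once, and a second save with no further changes is a no-op.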

Build fails: gen_unigram segfaults.

../utils/storage/gen_binary_files --table-dir ../data
../utils/storage/import_interpolation < ../data/interpolation.text
../utils/training/gen_unigram
make[2]: *** [gbk_char.bin] Segmentation fault
make[2]: *** Deleting file `gbk_char.bin'
make[2]: *** Waiting for unfinished jobs....
../utils/storage/import_interpolation < ../data/interpolation.text

** (process:30382): CRITICAL **: bool pinyin::SubPhraseIndex::load(pinyin::MemoryChunk*, pinyin::table_offset_t, pinyin::table_offset_t): assertion `*(buf_begin + index_three - 1) == c_separate' failed
../utils/storage/import_interpolation < ../data/interpolation.text
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30382 Aborted                 ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [phrase_index.bin] Error 134
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30384 Aborted                 ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [gb_char.bin] Error 134
make[2]: Leaving directory `/home/yangtse/aur/libpinyin-git/src/libpinyin-build/data'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/yangtse/aur/libpinyin-git/src/libpinyin-build'
make: *** [all] Error 2
==> ERROR: A failure occurred in build().
    Aborting...

pinyin_add_phrase result looks strange.

Called with the following sequence:

import_iterator_t* iter = pinyin_begin_add_phrases(context, 15);
pinyin_iterator_add_phrase(iter, hz, libpinyin->inst->m_raw_full_pinyin, -1);
pinyin_end_add_phrases(iter);

hz is 灼眼的夏娜
pinyin is zhuoyandexiana.

After that, "zhuoyandexiana" cannot give the correct candidate as usual, and the word doesn't seem to be saved.

A strange problem

I want to type:

我不清楚诶 (wobuqingchuei)

But the pinyin gets changed to:

我不轻吹 (wobuqingchui)

Where did the e go?

Fails to build with CMake

Building the HEAD version (3f74e38) with CMake fails with the error:

[ 25%] Built target libpinyin
[ 27%] Building CXX object utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o
/home/alick/Documents/Program/libpinyin/utils/storage/import_interpolation.cpp:25:26: fatal error: utils_helper.h: No such file or directory
compilation terminated.
make[2]: *** [utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o] Error 1
make[1]: *** [utils/storage/CMakeFiles/import_interpolation.dir/all] Error 2
make: *** [all] Error 2

It seems the CMake build files haven't been updated for a long time.
The software version in CMakeLists.txt says 0.2.99.

Do not pollute the namespace.

This is only a coding style problem and not very important right now.

pinyin.h contains a "using namespace pinyin" directive, which brings an additional C++ namespace into every file that includes the header.
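
A small sketch of why a header-level using directive is a problem, and of one alternative. The namespace and function names are hypothetical and stand in for a header's contents; this is not libpinyin's code.

```cpp
#include <string>

// Hypothetical library namespace, standing in for what a header declares.
namespace sketch_lib {
inline std::string describe() { return "library"; }
}

// The application's own global function with the same name. If the "header"
// above ended with `using namespace sketch_lib;`, an unqualified call to
// describe() from application code would become ambiguous.
inline std::string describe() { return "application"; }

// Callers opt in explicitly instead, for example through a short alias.
namespace sl = sketch_lib;
inline std::string library_description() { return sl::describe(); }
```

Without the directive both functions coexist, and each translation unit decides for itself which names to import.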

Could you provide a tarball without the data file?

Since every tarball contains the data file (which is also provided by model.text.tar.gz), it becomes really big, even though the data file rarely changes.

Could you provide a source-only tarball so we do not need to download the 10MB file every time we update?

import_interpolation and gen_unigram: segfaults and manpages

While testing and trying to package libpinyin for Debian, I found that import_interpolation and gen_unigram appear to segfault when run.

I know this is by design, in that gen_binary_files should be run beforehand, but it is still getting in the way of libpinyin entering Debian. A command installed into /usr/bin must not segfault and must have a manpage, though the manpage can be very simple.

Please note Debian Wheezy will freeze this June, and I need to get the fixed version a lot earlier.
Reasons: 1) I need time to work on libpinyin itself; 2) fcitx-libpinyin and ibus-libpinyin need libpinyin.
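
One way a tool can fail cleanly instead of crashing is to check its inputs up front. The sketch below is an assumption about how such a check might look, not code from import_interpolation or gen_unigram; check_input_file is a hypothetical helper.

```cpp
#include <fstream>
#include <string>

// Returns true if the file can be opened for reading. Otherwise fills
// `error` with a message the tool can print before exiting non-zero.
bool check_input_file(const std::string& path, std::string& error) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        error = "cannot open '" + path + "'; has gen_binary_files been run?";
        return false;
    }
    error.clear();
    return true;
}
```

A main function would call this for each required file and return a non-zero exit status with the message on failure.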

SQLite vs BDB

For the bigram class, we could use SQLite instead of BDB; this would be very convenient for Android and other mobile devices.

Issue about pinyin_train()

I use pinyin_train() to update word frequencies, hoping that the order of the candidate words changes so that the words the user uses most appear first in the candidate list.
But train_result2() does not seem to work properly. If constraint->m_type is always NO_CONSTRAINT, the frequency never changes.
I tried changing the variable train_next to true, and it works, but I'm not sure it is a good solution.
Could you please update the frequency-updating code? Thank you very much!

Bug with resplit table?

I'm using pinyin_set_options with USE_RESPLIT_TABLE.

And with xian, I get the following candidates:

先,西安,西岸,锡安,……

When I use pinyin_choose_full_pinyin_candidate with 锡安, pinyin_guess_sentence provides 西安.

{Feature Request} BDB -> Kyoto Cabinet Migration Tool

(Split from #44)

Would it be possible to have a migration tool, so that users on BDB can safely migrate their config folder? Preferably with auto-detection, but a simple command-line tool would be enough if that's too complicated.

Thanks!

Bug in in_chewing_scheme

It should return true after the check if (search_chewing_tones(m_tone_table, key, &tone)):

     if (search_chewing_tones(m_tone_table, key, &tone)) {
         if (symbol)
             *symbol = chewing_tone_table[tone];
         return true;
     }

Failed to build with -Werror=format-security

libpinyin 0.5.0 fails to build when the following C(XX)FLAGS are set:

CFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security
CXXFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security

Because I was building it on another machine, the log messages have been lost, but the problem should be easy enough to reproduce.
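
The issue does not include the failing call, so the following is only a generic sketch of the kind of code -Wformat-security rejects and the usual fix. format_message is a hypothetical helper, not a libpinyin function.

```cpp
#include <cstdio>
#include <string>

// Copies a message into a buffer through a constant format string.
// Passing the message itself as the format, as in the commented-out line,
// is what -Werror=format-security turns into a build failure.
std::string format_message(const char* msg) {
    char buf[256];
    // std::snprintf(buf, sizeof buf, msg);      // non-literal format: rejected
    std::snprintf(buf, sizeof buf, "%s", msg);   // literal format: accepted
    return std::string(buf);
}
```

With the literal "%s" format, conversion specifiers inside the message are copied as text rather than interpreted.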

Some characters cannot be input?

Some characters exist in both GBK and UTF-8 but are not included in libpinyin.

Currently, I want to input 渃. Its pinyin is ruo. Its GBK is 9Cc, and UTF8 is %E6%B8%83

Could you please add the character? If you lack spare time, is there an easy way to contribute characters to the project?

configure error

I tried to install it on Ubuntu, and running "./configure" fails with: Cannot find Berkeley DB library version 4. However, I have already installed it with "apt-get install libdb4.7-dev".

How to build the library

I could use CMake to generate the Makefile, but I could not build the source code because <config.h> was missing.
So, what are the requirements of this library, and how do I build it?
Thanks

G++ will no longer support in-class initialization of non-integral static constants

http://gcc.gnu.org/onlinedocs/gcc/Deprecated-Features.html says:
G++ allows static data members of const floating-point type to be declared with an initializer in a class definition. The standard only allows initializers for static members of const integral types and const enumeration types so this extension has been deprecated and will be removed from a future version.

Both phrase_lookup.h and pinyin_lookup.h contain:
static const gfloat bigram_lambda = (LAMBDA_PARAMETER);
static const gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);

Because of this change in g++, for this code to compile with -std=c++0x it needs to be changed to:
static constexpr gfloat bigram_lambda = (LAMBDA_PARAMETER);
static constexpr gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);

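I noticed that none of the .h files seem to use these two constants, so perhaps they could be only declared in the .h and given their values in the cpp file?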
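
The declare-in-header, define-in-source alternative suggested above can be sketched as follows. LookupSketch is a hypothetical class, float stands in for gfloat, and 0.25f is a placeholder rather than the library's LAMBDA_PARAMETER.

```cpp
// "Header" part: declarations only, with no in-class initializer, which is
// valid for floating-point static members in every C++ standard.
struct LookupSketch {
    static const float bigram_lambda;
    static const float unigram_lambda;
};

// "Source file" part: exactly one definition of each member, carrying the value.
// 0.25f is a placeholder chosen for illustration.
const float LookupSketch::bigram_lambda = 0.25f;
const float LookupSketch::unigram_lambda = 1.0f - 0.25f;
```

Unlike the constexpr version, this form does not require C++11, at the cost of the values no longer being usable in constant expressions.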

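Double pinyin finals look strange

For example, wz becomes wuei instead of wei, and xr becomes xvan instead of xuan.

Candidate selection is still correct, but this makes the results from fcitx's cloudpinyin wrong.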

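Problem when deleting pinyin after typing

I use ibus with libpinyin.
After I type some pinyin and then delete it with the Backspace key, the candidate window stays on screen.
My libpinyin version is 0.6.92-1 and my ibus-libpinyin version is 1.4.1-1.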

Documentation needed.

Please provide some documentation on how to use this library. Something like Doxygen would be useful.

Provide separate data for Traditional Chinese

libpinyin should provide Traditional Chinese data, with its own language model and characters. Using OpenCC is not a perfect solution, since OpenCC cannot provide distinct characters that are both valid in Traditional Chinese, such as 台 and 臺.

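Previous/next page arrows disappear in some cases

OS: Arch Linux (Linux dee 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 12:52:41 CEST 2013 i686 GNU/Linux)
Desktop: GNOME 3.8

Example: type women and press . to go to the next page. The previous-page and next-page arrows appear at the far right, but after the third press the arrows disappear.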

Wrong PinyinKeyPos when using divide table

Type xianjiaotongdaxue, then choose "西安".

The second PinyinKeyPos automatically changes to begin=3 and end=5, which is wrong; it should be begin=2 and end=4. It seems libpinyin uses "xi'an" as the raw key, but the raw key actually passed in has no ' separator.
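
The off-by-one can be reproduced with a small sketch that computes syllable positions from a raw key. KeySpan and locate_syllables are hypothetical names, not libpinyin's PinyinKeyPos or its parser, and the function assumes the given syllables really do appear in the raw string in order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical half-open span [begin, end) into the raw key string.
struct KeySpan {
    std::size_t begin;
    std::size_t end;
};

// Assigns a span to each syllable in order, skipping apostrophe separators
// that are actually present in `raw`. It does not verify the syllable text.
std::vector<KeySpan> locate_syllables(const std::string& raw,
                                      const std::vector<std::string>& syllables) {
    std::vector<KeySpan> spans;
    std::size_t pos = 0;
    for (const std::string& s : syllables) {
        while (pos < raw.size() && raw[pos] == '\'')
            ++pos;  // a separator shifts every later position by one
        spans.push_back({pos, pos + s.size()});
        pos += s.size();
    }
    return spans;
}
```

For the raw key "xian" the second syllable spans 2 to 4, while for "xi'an" it spans 3 to 5, which matches the wrong values reported above when positions are computed against a string with the separator inserted.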

Support for "word prediction" in libpinyin

Currently, I cannot find any interface in libpinyin for "word prediction". By "word prediction" I mean:
When I input "Shang" and then select "上", is there any way to get
predicted words (like "海", "去"…)?
This is a normal feature in other pinyin input methods; is there any plan to support it?

Thanks
Best Regards
Jason ZHANG
