libpinyin / libpinyin
Library to deal with pinyin.
License: GNU General Public License v3.0
The libpinyin project aims to provide the algorithmic core for intelligent sentence-based Chinese pinyin input methods.
pinyin_get_candidates and pinyin_choose_candidate now seem to have corresponding "full_pinyin" versions (though these work with both chewing and pinyin).
But the old pinyin_get_candidates causes an infinite loop in 0.6.91.
By the way, since the ABI is broken (at least for struct "pinyin_instance_t"), please bump the so version.
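For example, I type womendoushihaohaizi and press 2 to commit the word "我们". The first candidate then becomes "都是好孩子".
Suppose I now notice that "我们" was wrong, because I actually wanted "他们都是好孩子". I have to delete "我们", so I press delete and type "tamen". Unfortunately I can type "tamen" but cannot pick its characters, because the candidate list still shows "都是好孩子".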
This is a feature request for an alternative to Berkeley DB, since Arch Linux has decided to deprecate it and may drop it in the long term.
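For example, ang cannot be typed with ah.
ei cannot be typed with either ei or ew.
It seems the o* format has to be used, as in the Microsoft scheme.
But that is not how the Xiaohe (小鹤) scheme defines it.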
bool retval = context->m_phrase_index->get_phrase_item(token, item);
get_phrase_item doesn't return a bool. Shouldn't this be something like "== ERROR_OK"?
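The mismatch is easy to miss because an integer status converts to bool implicitly. Here is a minimal sketch, assuming the success code is zero as is conventional. The enum values and lookup_item are hypothetical stand-ins, not libpinyin's actual definitions:

```cpp
// Hypothetical status codes; assumed here that success is 0.
enum ErrorResult { ERROR_OK = 0, ERROR_NO_ITEM = 1 };

// Hypothetical stand-in for a lookup that reports its status as an int code.
inline int lookup_item(int token) {
    return token >= 0 ? ERROR_OK : ERROR_NO_ITEM;
}

// Buggy: the int is converted to bool, so success (0) becomes false.
inline bool found_buggy(int token) {
    bool retval = lookup_item(token);
    return retval;
}

// Fixed: compare against the success code explicitly.
inline bool found_fixed(int token) {
    return lookup_item(token) == ERROR_OK;
}
```

Under that assumption, found_buggy reports the opposite of what the caller expects, while found_fixed does not depend on the numeric value of the success code.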
Typing ge shows the hint "名", typing guang shows the hint "观", and typing a shows the hint "拿".
Because the data files under /usr/share/libpinyin/data/ are all endianness sensitive, they need to be moved to /usr/lib. Ideally they would be installed under $(libdir)/libpinyin/data, since this would go a long way toward satisfying the Multiarch Specification[1], which requires files for different architectures to be co-installable on a single system.
Again, please make a release soon so that we can make the Debian Wheezy freeze, which is due this June.
Grepping through the code, the only place that sets context->modified to true is inside pinyin_train, which requires an instance as an argument.
add_phrase obviously modifies the dictionary too, but if you do an offline import, pinyin_save does nothing.
../utils/storage/gen_binary_files --table-dir ../data
../utils/storage/import_interpolation < ../data/interpolation.text
../utils/training/gen_unigram
make[2]: *** [gbk_char.bin] Segmentation fault
make[2]: *** Deleting file 'gbk_char.bin'
make[2]: *** Waiting for unfinished jobs....
../utils/storage/import_interpolation < ../data/interpolation.text
** (process:30382): CRITICAL **: bool pinyin::SubPhraseIndex::load(pinyin::MemoryChunk*, pinyin::table_offset_t, pinyin::table_offset_t): assertion `*(buf_begin + index_three - 1) == c_separate' failed
../utils/storage/import_interpolation < ../data/interpolation.text
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30382 Aborted ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [phrase_index.bin] Error 134
import_interpolation: import_interpolation.cpp:168: bool parse_bigram(FILE*, pinyin::PhraseLargeTable*, pinyin::FacadePhraseIndex*, pinyin::Bigram*): Assertion `last_single_gram->insert_freq(token2, count)' failed.
/bin/sh: line 1: 30384 Aborted ../utils/storage/import_interpolation < ../data/interpolation.text
make[2]: *** [gb_char.bin] Error 134
make[2]: Leaving directory '/home/yangtse/aur/libpinyin-git/src/libpinyin-build/data'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/yangtse/aur/libpinyin-git/src/libpinyin-build'
make: *** [all] Error 2
==> ERROR: A failure occurred in build().
Aborting...
Call with the following sequence:
import_iterator_t* iter = pinyin_begin_add_phrases(context, 15);
pinyin_iterator_add_phrase(iter, hz, libpinyin->inst->m_raw_full_pinyin, -1);
pinyin_end_add_phrases(iter);
hz is 灼眼的夏娜.
The pinyin is zhuoyandexiana.
After that, "zhuoyandexiana" no longer gives the correct candidate as usual, and the word doesn't seem to be saved.
I want to type:
我不清楚诶 (wobuqingchuei)
But the pinyin gets changed to:
我不轻吹 (wobuqingchui)
Where did the e go?
libpinyin should handle errors more gracefully.
The hang is in pinyin::Bigram::save_db(), at tmp_db->open().
Building the HEAD version (3f74e38) with CMake fails with this error:
[ 25%] Built target libpinyin
[ 27%] Building CXX object utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o
/home/alick/Documents/Program/libpinyin/utils/storage/import_interpolation.cpp:25:26: fatal error: utils_helper.h: No such file or directory
compilation terminated.
make[2]: *** [utils/storage/CMakeFiles/import_interpolation.dir/import_interpolation.cpp.o] Error 1
make[1]: *** [utils/storage/CMakeFiles/import_interpolation.dir/all] Error 2
make: *** [all] Error 2
It seems the CMake build files haven't been updated for a long time.
The software version in CMakeLists.txt says 0.2.99.
As far as I can see, everything this function needs is still inside the internal library. Could you add it back?
I looked through the whole API and didn't find any equivalent.
input: UTF-8 string with one character
return: chewing key array with all possible pronunciations
Just like the title says. :-)
For example:
韻 yonh
There should also be a distinction between ShengMu and YunMu, like k yonh -> 廣韻.
This is only a coding style problem and not very important right now.
pinyin.h contains a using namespace pinyin directive, which introduces an additional C++ namespace into every file that includes it.
Since every tarball contains the data file (which is also provided by model.text.tar.gz), it becomes really big, even though the data file rarely changes.
Could you provide a source-only tarball so we don't need to download the 10MB file every time we update?
So basically there are some APIs that "look" visible but actually are not.
Please solve this by exposing them, or by adding a C API for them.
As the title says.
While testing and trying to package libpinyin for Debian, I found that import_interpolation and gen_unigram appear to segfault when run.
I know this is by design and that gen_binary_files should be run beforehand, but it is still blocking libpinyin's entry into Debian. A command installed into /usr/bin must not segfault and must have a manpage, though the manpage can be very simple.
Please note that Debian Wheezy will freeze this June, and I need the fixed version well before then.
Reasons: 1) I need time to work on libpinyin itself; 2) fcitx-libpinyin and ibus-libpinyin need libpinyin.
The result might be SEARCH_OK | SEARCH_CONTINUE, in which case this function should return true.
Well, it can at least be worked around for now.
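One way a combined result like this gets mishandled is an equality test against a single flag. The sketch below contrasts that with a bit test. The flag values and both helper functions are hypothetical and chosen only for illustration. They are not taken from libpinyin's headers.

```cpp
// Hypothetical flag values, chosen only for illustration.
enum SearchResult {
    SEARCH_NONE = 0x00,
    SEARCH_OK = 0x01,
    SEARCH_CONTINUE = 0x02
};

// Too strict: false whenever any other bit, such as SEARCH_CONTINUE, is set.
inline bool is_ok_strict(int result) {
    return result == SEARCH_OK;
}

// Tests only the SEARCH_OK bit, so extra flags do not matter.
inline bool is_ok_masked(int result) {
    return (result & SEARCH_OK) != 0;
}
```

With these values, is_ok_masked accepts SEARCH_OK | SEARCH_CONTINUE while is_ok_strict rejects it.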
For the Bigram class, we could use SQLite instead of BDB. This would be very convenient for Android and other mobile devices.
I use pinyin_train() to update word frequencies, hoping that the order of the candidates
changes so that the words a user types most often move to the front of the candidate list.
But train_result2() does not seem to work properly. If constraint->m_type is always NO_CONSTRAINT,
the frequency never changes.
I tried setting the variable train_next to true, and that works, but I'm not sure it is a good solution.
Could you please update the frequency-updating code? Thank you very much!
I'm using fcitx.
I'm using pinyin_set_options with USE_RESPLIT_TABLE.
With xian, I get the following candidates:
先, 西安, 西岸, 锡安, ……
When I call pinyin_choose_full_pinyin_candidate with 锡安, pinyin_guess_sentence still returns 西安.
(Split from #44)
Would it be possible to have a migration tool, so that users on bdb can safely migrate their config folder? Preferably with auto-detection and so on, but a simple command-line tool would be enough if that's too complicated.
Thanks!
This should return true after the check if (search_chewing_tones(m_tone_table, key, &tone)):
if (search_chewing_tones(m_tone_table, key, &tone)) {
    if (symbol)
        *symbol = chewing_tone_table[tone];
    return true;
}
libpinyin 0.5.0 fails to build when the following C(XX)FLAGS are set:
CFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security
CXXFLAGS=-g -O2 -Wformat -Wformat-security -Werror=format-security
Because I was building on another machine, the log messages have been lost, but the problem should be easy enough to reproduce.
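The failing call is not identified above. A common trigger for -Werror=format-security is passing a non-literal string as the format argument of a printf-style function. The sketch below shows the usual fix, which is to route the text through a literal "%s" format. The helper format_message is hypothetical and is not part of libpinyin.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: copies msg into a std::string through snprintf.
// Writing snprintf(buf, sizeof(buf), msg) would treat msg as a format string,
// which is what -Wformat-security warns about. A literal "%s" keeps msg as data.
// Messages longer than the buffer are truncated.
inline std::string format_message(const char *msg) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "%s", msg);
    return std::string(buf);
}
```

Because the message is passed as an argument rather than as the format, any % sequences inside it are copied through unchanged.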
Is this normal?
Generally speaking, although "阿" occurs frequently, it is rarely typed as a single character……
libpinyin's default dictionary is really poor. How can I save my own dictionary so that I can restore it after reinstalling the system?
Some characters are in both GBK and UTF8, but they are not included in libpinyin.
Currently, I want to input 渃. It is ruo. Its GBK is 9Cc, and UTF8 is %E6%B8%83
Could you please add the character? If you are short of spare time, is there an easy way to contribute characters to the project?
I tried to install it on Ubuntu, but running "./configure" fails with the error: Cannot find Berkeley DB library version 4. I have in fact installed it with "apt-get install libdb4.7-dev".
I could use cmake to generate the Makefile, but I could not build the source code because <config.h> was missing.
So, what are the requirements of this library? And how do I build it?
Thanks
http://gcc.gnu.org/onlinedocs/gcc/Deprecated-Features.html mentions:
G++ allows static data members of const floating-point type to be declared with an initializer in a class definition. The standard only allows initializers for static members of const integral types and const enumeration types so this extension has been deprecated and will be removed from a future version.
Both phrase_lookup.h and pinyin_lookup.h contain:
static const gfloat bigram_lambda = (LAMBDA_PARAMETER);
static const gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);
Because of this change in g++, the code has to be changed as follows to compile with -std=c++0x:
static constexpr gfloat bigram_lambda = (LAMBDA_PARAMETER);
static constexpr gfloat unigram_lambda = (1 - LAMBDA_PARAMETER);
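As far as I can tell, none of the .h files actually use these two constants. Perhaps the .h files could just declare them, with the initialization moved into a .cpp file?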
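A minimal sketch of the declare-in-header, define-in-source layout suggested above. The class name LookupSketch and the value given to LAMBDA_PARAMETER are hypothetical placeholders, and float stands in for glib's gfloat so that the example has no dependencies:

```cpp
// Hypothetical placeholder value; the real LAMBDA_PARAMETER comes from libpinyin.
#define LAMBDA_PARAMETER 0.25f

// Hypothetical class standing in for the lookup classes in the two headers.
class LookupSketch {
public:
    // Declarations only: with no in-class initializer, this compiles
    // both with and without -std=c++0x.
    static const float bigram_lambda;
    static const float unigram_lambda;
};

// Definitions. In a real project these two lines would live in one .cpp file.
const float LookupSketch::bigram_lambda = (LAMBDA_PARAMETER);
const float LookupSketch::unigram_lambda = (1 - LAMBDA_PARAMETER);
```

The trade-off is that the values are no longer compile-time constants visible to every includer, which does not matter if no header uses them.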
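For example, wz becomes wuei instead of wei, and xr becomes xvan instead of xuan.
The candidate selection is still correct, but it makes fcitx's cloudpinyin return wrong results.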
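I use ibus together with libpinyin.
After I type some pinyin and then delete it with Backspace, the candidate window stays on the screen.
My libpinyin version is 0.6.92-1 and my ibus-libpinyin version is 1.4.1-1.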
For example, if I type 'video' and press Enter, I get 'ideo'. If I type 'ubuntu' and press Enter, I get 'buntu'. That is wrong.
This makes little sense.
If you want to generate from a special file, you should specify the file name on the command line instead.
Please provide some documentation on how to use this library. Something like doxygen would be useful.
Language bindings for Python, Java, and so on would be nice.
BTW, the exported API is defined in src/pinyin.h, right?
libpinyin should provide Traditional Chinese data, with its own language model and characters. Using opencc is not a perfect solution, because opencc cannot offer the different characters that are both valid in Traditional Chinese, such as 台 and 臺.
It builds fine on nearly every Debian architecture except armel.
OS:Archlinux(Linux dee 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 12:52:41 CEST 2013 i686 GNU/Linux)
Desktop:Gnome3.8
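Example: type women and click . to go to the next page. Previous-page and next-page arrows appear at the far right, but after the third click the arrows disappear.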
In configure.ac, there is:
GLIB2_CPPFLAGS=$PKG_CONFIG --cflags glib-2.0
GLIB2_LDFLAGS=$PKG_CONFIG --libs glib-2.0
But this is not correct: libibus is actually a C-only library, so there is no point in specifying CPPFLAGS. In addition, the PKG_CHECK_MODULES macro already defines GLIB2_CFLAGS and GLIB2_LIBS, which can be used in the Makefile.am files.
Patch here:
http://svnweb.mageia.org/packages/cauldron/libpinyin/current/SOURCES/libpinyin-0.8.0-link.patch
Type xianjiaotongdaxue, then choose "西安".
The second PinyinKeyPos automatically changes to begin=3 and end=5, which is wrong. It should be begin=2 and end=4. It seems libpinyin uses "xi'an" as the raw key, but the raw key actually passed in has no ' separator.
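The off-by-one is what you would get from counting positions in a string that contains the separator. As a sketch only, this is how such an offset could be mapped back to the raw input. The helper raw_offset is hypothetical and is not part of libpinyin.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts an offset into a string that contains '
// separators into the matching offset in the raw input without separators.
inline std::size_t raw_offset(const std::string &with_sep, std::size_t pos) {
    std::size_t separators = 0;
    // Count the separators that appear before pos.
    for (std::size_t i = 0; i < pos && i < with_sep.size(); ++i) {
        if (with_sep[i] == '\'')
            ++separators;
    }
    return pos - separators;
}
```

For "xi'an", offsets 3 and 5 map to 2 and 4, which are the begin and end values expected in the report.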
"ㄓ" results in "zh" rather than "zhi".
The same goes for "ㄔ", "ㄕ", "ㄖ", "ㄗ", "ㄘ" and "ㄙ".
Currently I cannot find any interface in libpinyin for "word prediction". By "word prediction" I mean:
When I input "Shang" and then select "上", is there any way to get
predicted next words (like "海", "去"…)?
This is a normal feature in other pinyin input methods. Is there any plan to support it?
Thanks
Best Regards
Jason ZHANG
Hope it's not removed.
This would remove a feature from fcitx-libpinyin, thanks.