
coccoc-tokenizer's Introduction

C++ tokenizer for Vietnamese

This project provides a tokenizer library for the Vietnamese language, plus two command-line tools for tokenization and some simple Vietnamese-specific text operations (e.g. removing diacritics). It is used in the Cốc Cốc Search and Ads systems; the main goal of its development was to reach high performance while keeping quality reasonable for search-ranking needs.

Installing

Building from source and installing into a sandbox (or into the system):

$ mkdir build && cd build
$ cmake ..
# make install

To include Java bindings:

$ mkdir build && cd build
$ cmake -DBUILD_JAVA=1 ..
# make install

To include Python bindings, install the cython package and compile the wrapper code (only Python 3 is supported):

$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install

Building a Debian package can be done with the debhelper tools:

$ dpkg-buildpackage <options> # from source tree root

If you want to build and install everything into your sandbox, you can use something like this (it will build everything and install into ~/.local, which is treated as a standard sandbox PREFIX by many applications and frameworks):

$ mkdir build && cd build
$ cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=~/.local ..
$ make install

Using the tools

Both tools show their usage with the --help option. Both tools accept either command-line arguments or stdin as input (if both are provided, command-line arguments are preferred). If stdin is used, each line is treated as a separate argument. The output format is TAB-separated tokens of the original phrase (note that Vietnamese tokens can contain whitespace). There are a few usage examples below.

Tokenize a command-line argument:

$ tokenizer "Từng bước để trở thành một lập trình viên giỏi"
từng	bước	để	trở thành	một	lập trình	viên	giỏi

Note that it may take one or two seconds for the tokenizer to load, due to a comparatively big dictionary used to tokenize "sticky phrases" (when people write words without spacing). You can disable it with the -n option and the tokenizer will be up in no time. The default behaviour for "sticky phrases" is to only try to split them within URLs or domains. With -n you can disable this completely, and with -u you can force it for the whole text. Compare:

$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toisongohanoi	tôi	đăng ký	trên	the gioi	di dong	vn

$ tokenizer -n "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toisongohanoi	tôi	đăng ký	trên	thegioididong	vn

$ tokenizer -u "toisongohanoi, tôi đăng ký trên thegioididong.vn"
toi	song	o	ha noi	tôi	đăng ký	trên	the gioi	di dong	vn

To avoid reloading the dictionaries for every phrase, you can pass phrases via stdin. Here's an example (note that the first line of the output is empty - that means an empty result for the "/" input line):

$ echo -ne "/\nanh yêu em\nbún chả ở nhà hàng Quán Ăn Ngon ko ngon\n" | tokenizer

anh	yêu	em
bún	chả	ở	nhà hàng	quán ăn	ngon	ko	ngon

Whitespace and punctuation are ignored during normal tokenization, but they are kept during tokenization for transformation, which is used internally by the Cốc Cốc search engine. To keep punctuation during normal tokenization (except inside segmented URLs), use -k. To run tokenization for transformation, use -t; note that this formats the result by replacing spaces inside multi-syllable tokens with _ and original _ characters with ~.

$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn" -k
toisongohanoi   ,       tôi     đăng ký trên    the gioi        di dong vn

$ tokenizer "toisongohanoi, tôi đăng ký trên thegioididong.vn" -t
toisongohanoi   ,               tôi             đăng_ký         trên            the_gioi        di_dong vn

The usage of vn_lang_tool is quite similar; you can see the full list of options for both tools with:

$ tokenizer --help
$ vn_lang_tool --help

Using the library

Use the code of both tools as usage examples for the library; they are straightforward and easy to understand:

utils/tokenizer.cpp # for tokenizer tool
utils/vn_lang_tool.cpp # for vn_lang_tool

Here's a short code snippet from there:

// initialize tokenizer, exit in case of failure
if (0 > Tokenizer::instance().initialize(opts.dict_path, !opts.no_sticky))
{
	exit(EXIT_FAILURE);
}

// tokenize given text, two additional options are:
//   - bool for_transforming - this option is Cốc Cốc specific and kept for backwards compatibility
//   - int tokenize_options - TOKENIZE_NORMAL, TOKENIZE_HOST or TOKENIZE_URL,
//     just use Tokenizer::TOKENIZE_NORMAL if unsure
std::vector< FullToken > res = Tokenizer::instance().segment(text, false, opts.tokenize_option);

for (FullToken t : res)
{
	// do something with tokens
}

Note that you can call the segment() function of the same Tokenizer instance multiple times and in parallel from multiple threads.
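
Putting the pieces together, a minimal standalone program might look like the sketch below. It is not shipped with the repository; the header include path and the dictionary directory are assumptions about where make install placed the files on your system, so adjust both as needed.

// Minimal sketch (not part of the repository). The include path and the
// dictionary directory below are assumptions - point them at your install.
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include <tokenizer/tokenizer.hpp>

int main()
{
	// Load the dictionaries once per process; the second argument enables
	// "sticky phrase" splitting (the tool's -n option disables it).
	if (0 > Tokenizer::instance().initialize("/usr/local/share/tokenizer/dicts", true))
	{
		std::cerr << "failed to load dictionaries" << std::endl;
		return EXIT_FAILURE;
	}

	std::string text = "Từng bước để trở thành một lập trình viên giỏi";

	// As noted above, segment() can be called repeatedly and from multiple threads.
	std::vector< FullToken > res =
		Tokenizer::instance().segment(text, false, Tokenizer::TOKENIZE_NORMAL);

	for (const FullToken & t : res)
	{
		std::cout << t.text << '\t';
	}
	std::cout << std::endl;
	return EXIT_SUCCESS;
}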

Here's a short explanation of the fields in the FullToken structure:

struct Token {
	// position of the start of normalized token (in chars)
	int32_t normalized_start;
	// position of the end of normalized token (in chars)
	int32_t normalized_end;
	// position of the start of token in original text (in bytes)
	int32_t original_start;
	// position of the end of token in original text (in bytes)
	int32_t original_end;
	// token type (WORD, NUMBER, SPACE or PUNCT)
	int32_t type;
	// token segmentation type (this field is Cốc Cốc specific and kept for backwards compatibility)
	int32_t seg_type;
};

struct FullToken : Token {
	// normalized token text
	std::string text;
};
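
Since original_start and original_end are byte offsets into the untouched input while text holds the normalized form, a caller that needs a token's original-case surface form can simply slice the input string. Here is a minimal sketch, assuming the offsets behave exactly as documented above (this helper is not part of the library):

#include <string>
// FullToken comes from the tokenizer header included in your project.

// Recover the original-case form of a token from the input string using the
// byte offsets documented above.
std::string original_form(const std::string & text, const FullToken & t)
{
	return text.substr(t.original_start, t.original_end - t.original_start);
}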

Using Java bindings

A Java interface is provided for use in Java projects. Internally it uses JNI and the Unsafe API to connect Java and C++. You can find an example of its usage in the Tokenizer class's main function:

java/src/java/Tokenizer.java

To run this test class from the source tree, use the following command:

$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"

Normally LD_LIBRARY_PATH should point to the directory containing the libcoccoc_tokenizer_jni.so binary. If you have already installed the deb package or run make install into your system, LD_LIBRARY_PATH is not needed, as the binary will be picked up from your system (/usr/lib or similar).

Using Python bindings

from CocCocTokenizer import PyTokenizer

# load_nontone_data is True by default
T = PyTokenizer(load_nontone_data=True)

# tokenize_option:
# 	0: TOKENIZE_NORMAL (default)
#	1: TOKENIZE_HOST
#	2: TOKENIZE_URL
print(T.word_tokenize("xin chào, tôi là người Việt Nam", tokenize_option=0))

# output: ['xin', 'chào', ',', 'tôi', 'là', 'người', 'Việt_Nam']

Other languages

Bindings for other languages are not yet implemented, but it would be nice if someone could help write them.

Benchmark

The library provides high-speed tokenization, which is a requirement for performance-critical applications.

The benchmark was done on a typical laptop with an Intel Core i5-5200U processor:

  • Dataset: 1,203,165 Vietnamese Wikipedia articles (Link)
  • Output: 106,042,935 tokens out of 630,252,179 characters
  • Processing time: 41 seconds
  • Speed: ~15M characters / second, or ~2.5M tokens / second
  • RAM consumption: around 300 MB

Quality Comparison

The tokenizer tool has a special output format similar to that of other existing tools for tokenization of Vietnamese text: it preserves all of the original text and just joins multi-syllable tokens with underscores instead of spaces. Compare:

$ tokenizer 'Lan hỏi: "điều kiện gì?".'
lan     hỏi     điều kiện       gì

$ tokenizer -f original 'Lan hỏi: "điều kiện gì?".'
Lan hỏi: "điều_kiện gì?".

Using the following testset for comparison with underthesea and RDRsegmenter, we get significantly lower results, but in most cases the observed differences are not important for search-ranking quality. Below you can find a few examples of such differences. Please be aware of them when using this library.

original         : Em út theo anh cả vào miền Nam.
coccoc-tokenizer : Em_út theo anh_cả vào miền_Nam.
underthesea      : Em_út theo anh cả vào miền Nam.
RDRsegmenter     : Em_út theo anh_cả vào miền Nam.

original         : kết quả cuộc thi phóng sự - ký sự 2004 của báo Tuổi Trẻ.
coccoc-tokenizer : kết_quả cuộc_thi phóng_sự - ký_sự 2004 của báo Tuổi_Trẻ.
underthesea      : kết_quả cuộc thi phóng_sự - ký_sự 2004 của báo Tuổi_Trẻ.
RDRsegmenter     : kết_quả cuộc thi phóng_sự - ký_sự 2004 của báo Tuổi_Trẻ.

original         : cô bé lớn lên dưới mái lều tranh rách nát, trong một gia đình có bốn thế hệ phải xách bị gậy đi ăn xin.
coccoc-tokenizer : cô_bé lớn lên dưới mái lều tranh rách_nát, trong một gia_đình có bốn thế_hệ phải xách bị gậy đi ăn_xin.
underthesea      : cô bé lớn lên dưới mái lều tranh rách_nát, trong một gia_đình có bốn thế_hệ phải xách bị_gậy đi ăn_xin.
RDRsegmenter     : cô bé lớn lên dưới mái lều tranh rách_nát, trong một gia_đình có bốn thế_hệ phải xách bị_gậy đi ăn_xin.

We also don't apply any named-entity-recognition mechanisms within the tokenizer, and there are a few rare cases where we fail to resolve ambiguity correctly. We therefore didn't want to publish exact quality-comparison numbers: the goals and potential use cases of this library and of the similar ones mentioned above are probably different, so a precise comparison doesn't make much sense.

Future Plans

We'd love to introduce bindings for other languages later, and we'd be happy if somebody could help us do that. We are also thinking about adding a POS tagger and more complex linguistic features later.

If you find any issues or have suggestions for further improvements, please report them here or write to us through GitHub.

coccoc-tokenizer's People

Contributors

0xflotus, anhducle98, bachan, duydo, thphuong, tranhieudev23, txdat

coccoc-tokenizer's Issues

Error when installing the Python version on Mac

I got this error. Please help me fix it:

running install
running build
running build_ext
skipping 'CocCocTokenizer.cpp' Cython extension (up-to-date)
building 'CocCocTokenizer' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/lap00986/anaconda3/include -arch x86_64 -I/Users/lap00986/anaconda3/include -arch x86_64 -I. -I/Users/lap00986/Documents/product-matching/env/include -I/Users/lap00986/anaconda3/include/python3.7m -c CocCocTokenizer.cpp -o build/temp.macosx-10.7-x86_64-3.7/CocCocTokenizer.o -Wno-cpp -Wno-unused-function -O2 -march=native
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
CocCocTokenizer.cpp:610:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1

Python with coccoc_tokenizer

In the README file, I found this line:
from CocCocTokenizer import PyTokenizer
but how can I install CocCocTokenizer? (I tried copy-pasting the example and got a "No module named 'CocCocTokenizer'" error.)

Need help installing the library for Python in a conda environment

Dear all,
I have some issues when trying to install the CocCocTokenizer library for Python 3.8 in a conda environment.
I've tried:

  • activating the conda environment before building with CMake
git clone git@github.com:coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer/
mkdir build && cd build
conda activate py3.8
sudo  cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 ..
sudo make install
  • using the -DCMAKE_INSTALL_PREFIX flag and pointing it to the site-packages directory
sudo  cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=/home/huyqnguyen/anaconda3/envs/py3.8 ..
  • inserting conda activate py3.8 into the /python/build_python.sh file

but none of those solutions lets me import CocCocTokenizer in Python:

from CocCocTokenizer import PyTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PyTokenizer' from 'CocCocTokenizer' (unknown location)

Using the library for Python

I have finished installing this part:

$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install

but I still cannot use:

from CocCocTokenizer import PyTokenizer

Can the tokenizer produce tokens in their original case?

I've been using the coccoc-tokenizer Java bindings for a Vietnamese Elasticsearch analysis plugin, and we want to keep tokens in their original case. For example, the text "Cộng hòa Xã hội chủ nghĩa Việt Nam" is tokenized as cộng_hòa xã_hội chủ_nghĩa việt_nam,
while what we expect is Cộng_hòa Xã_hội chủ_nghĩa Việt_Nam.

Can we somehow make the tokenizer produce tokens in their original case?

Error openning file, alphabetic

Hi everyone,

The first time I installed coccoc-tokenizer it succeeded and everything worked fine.
But when I built it a second time, ran make install, and then ran the command
/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"

it produced this error:
Error openning file, alphabetic

I manually deleted everything and rebuilt, but the error persists.
OS: Ubuntu Server 20.04
I haven't found a fix yet; if anyone knows how to remove coccoc-tokenizer completely, please help me.

Thanks, everyone.

build errors on Ubuntu 18.04

I followed the instructions for building this tool from README.md and encountered this error:

In file included from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/syllable_da_trie.hpp:10,

             from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie.hpp:5,
             from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/tokenizer.hpp:10,
             from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/utils/tokenizer.cpp:3:

/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp: In member function ‘int DATrie<HashNode, Node>::read_from_file(const string&) [with HashNode = MultitermHashTrieNode; Node = MultitermDATrieNode]’:

/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp:237:8: error: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Werror=unused-result]

fread(&alphabet_size, sizeof(alphabet_size), 1, in_file);


This happened several times in the tokenizer.hpp file.

Adding new Vietnamese compound words

Hello!

Please show me how to add new Vietnamese compound words and rebuild the lib for my own use. What do I need to do?

Thanks a lot!

Misunderstanding about segment

I have some misunderstandings about segmentation:

  • Why does it consider " " a token? I think it is meaningless.
  • With the segment method, it keeps the case but removes punctuation.
  • With the segment_original method, it lowercases the text but keeps punctuation.
  • " " inside a token cannot be changed to "_" (space_positions is empty).

Can you explain? Thanks!

Re: preserving upper/lower case in the Java wrapper

Thank you, Cốc Cốc, for developing a word-segmentation toolkit with high accuracy and very high speed.

While trying it out, I noticed that when building for Java the text is converted entirely to lowercase. I also looked through the Java code but couldn't find where this happens or how to change it.

$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"
một	 	câu_văn	 	tiếng_việt	.

Looking forward to your help!

Error when running make install

Hi, when I did the first installation step on Ubuntu LTS and ran make install, I got the following error:
Scanning dependencies of target dict_compiler
[ 12%] Building CXX object CMakeFiles/dict_compiler.dir/utils/dict_compiler.cpp.o
[ 25%] Linking CXX executable dict_compiler
[ 25%] Built target dict_compiler
Scanning dependencies of target vn_lang_tool
[ 37%] Building CXX object CMakeFiles/vn_lang_tool.dir/utils/vn_lang_tool.cpp.o
[ 50%] Linking CXX executable vn_lang_tool
[ 50%] Built target vn_lang_tool
Scanning dependencies of target tokenizer
[ 62%] Building CXX object CMakeFiles/tokenizer.dir/utils/tokenizer.cpp.o
[ 75%] Linking CXX executable tokenizer
[ 75%] Built target tokenizer
Scanning dependencies of target compile_dict
[ 87%] Generating multiterm_trie.dump, syllable_trie.dump, nontone_pair_freq_map.dump
[ 87%] Built target compile_dict
Scanning dependencies of target compile_java
[100%] Generating coccoc-tokenizer.jar
: not foundld_java.sh: 2: ../java/build_java.sh:
../java/build_java.sh: 36: ../java/build_java.sh: Syntax error: end of file unexpected (expecting "then")
CMakeFiles/compile_java.dir/build.make:60: recipe for target 'coccoc-tokenizer.jar' failed
make[2]: *** [coccoc-tokenizer.jar] Error 2
CMakeFiles/Makefile2:215: recipe for target 'CMakeFiles/compile_java.dir/all' failed
make[1]: *** [CMakeFiles/compile_java.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

I ran it with the parameter: cmake -DBUILD_JAVA=1 ..

Hope you can help me ^^

ERROR ON ELASTIC SEARCH 7.13.1

--------------- S U M M A R Y ------------

Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-847774562637708686 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms12g -Xmx12g -XX:MaxDirectMemorySize=6442450944 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true -Enode.name=esnode1 -Ecluster.initial_master_nodes=10.10.2.1, 10.10.2.2 -Enode.data=true -Ediscovery.seed_hosts=10.10.2.1, 10.10.2.2, 10.10.2.3 -Ecluster.name=es-docker-cluster -Enode.master=true

Host: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 8 cores, 23G, CentOS Linux release 8.4.2105
Time: Thu Aug 5 16:31:14 2021 UTC elapsed time: 65.623954 seconds (0d 0h 1m 5s)

--------------- T H R E A D ---------------

Current thread (0x00007f1e7c028e80): JavaThread "elasticsearch[esnode1][write][T#7]" daemon [_thread_in_native, id=286, stack(0x00007f1d17afb000,0x00007f1d17bfc000)]

Stack: [0x00007f1d17afb000,0x00007f1d17bfc000], sp=0x00007f1d17bf93b0, free space=1016k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x86101] cfree+0x21

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.coccoc.Tokenizer.initialize(Ljava/lang/String;)I+0
j com.coccoc.Tokenizer.<init>(Ljava/lang/String;)V+6
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.lambda$new$0(Lorg/elasticsearch/analysis/VietnameseConfig;)Lcom/coccoc/Tokenizer;+8
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl$$Lambda$5942+0x0000000801994660.run()Ljava/lang/Object;+4
J 2859 c2 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; java.base@16 (9 bytes) @ 0x00007f1f5bb8bfd0 [0x00007f1f5bb8bfa0+0x0000000000000030]
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;Ljava/io/Reader;)V+73
j org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;)V+71
j org.elasticsearch.index.analysis.VietnameseTokenizerFactory.create()Lorg/apache/lucene/analysis/Tokenizer;+8
j org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+4
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.Analyzer.tokenStream(Ljava/lang/String;Ljava/lang/String;)Lorg/apache/lucene/analysis/TokenStream;+58
J 8816 c1 org.apache.lucene.document.Field.tokenStream(Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/analysis/TokenStream;)Lorg/apache/lucene/analysis/TokenStream; (188 bytes) @ 0x00007f1f5511d944 [0x00007f1f5511abc0+0x0000000000002d84]
j org.apache.lucene.index.DefaultIndexingChain$PerField.invert(ILorg/apache/lucene/index/IndexableField;Z)V+91
j org.apache.lucene.index.DefaultIndexingChain.processField(ILorg/apache/lucene/index/IndexableField;JI)I+113
J 10620 c2 org.apache.lucene.index.DefaultIndexingChain.processDocument(ILjava/lang/Iterable;)V (181 bytes) @ 0x00007f1f5c12b094 [0x00007f1f5c12ae20+0x0000000000000274]
J 10633 c2 org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Lorg/apache/lucene/index/DocumentsWriter$FlushNotifications;)J (180 bytes) @ 0x00007f1f5c132aec [0x00007f1f5c132980+0x000000000000016c]
J 10436 c1 org.apache.lucene.index.DocumentsWriter.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;)J (280 bytes) @ 0x00007f1f54d4573c [0x00007f1f54d455a0+0x000000000000019c]
j org.apache.lucene.index.IndexWriter.updateDocuments(Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Ljava/lang/Iterable;)J+13
J 10062 c1 org.elasticsearch.index.engine.InternalEngine.addDocs(Ljava/util/List;Lorg/apache/lucene/index/IndexWriter;)V (49 bytes) @ 0x00007f1f5496e12c [0x00007f1f5496d960+0x00000000000007cc]
j org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(Lorg/elasticsearch/index/engine/Engine$Index;Lorg/elasticsearch/index/engine/InternalEngine$IndexingStrategy;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+272
j org.elasticsearch.index.engine.InternalEngine.index(Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+418
J 10965 c2 org.elasticsearch.index.shard.IndexShard.index(Lorg/elasticsearch/index/engine/Engine;Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult; (316 bytes) @ 0x00007f1f5c1e258c [0x00007f1f5c1e22e0+0x00000000000002ac]
j org.elasticsearch.index.shard.IndexShard.applyIndexOperation(Lorg/elasticsearch/index/engine/Engine;JJJLorg/elasticsearch/index/VersionType;JJJZLorg/elasticsearch/index/engine/Engine$Operation$Origin;Lorg/elasticsearch/index/mapper/SourceToParse;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+230
j org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(JLorg/elasticsearch/index/VersionType;Lorg/elasticsearch/index/mapper/SourceToParse;JJJZ)Lorg/elasticsearch/index/engine/Engine$IndexResult;+49
j org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(Lorg/elasticsearch/action/bulk/BulkPrimaryExecutionContext;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;)Z+456
j org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun()V+45
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;Lorg/elasticsearch/threadpool/ThreadPool;Ljava/lang/String;)V+21
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+71
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/support/replication/ReplicatedWriteRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+7
j org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun()V+16
j org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V+24
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@16
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@16
j java.lang.Thread.run()V+11 java.base@16
v ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xfffffffffffffff7

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007f1e4c00a620 points into unknown readable memory: 0xffffffffffffffff | ff ff ff ff ff ff ff ff
RCX=0x00007f1f71c90999: <offset 0x0000000000080999> in /lib64/libc.so.6 at 0x00007f1f71c10000
RDX=0x00007f1e4c000080 points into unknown readable memory: 0x00007f1e4c034220 | 20 42 03 4c 1e 7f 00 00
RSP=0x00007f1d17bf93b0 is pointing into the stack for thread: 0x00007f1e7c028e80
RBP=0x00007f1e9413a040 points into unknown readable memory: 0x00007f1e40009110 | 10 91 00 40 1e 7f 00 00
RSI=0x0 is NULL
RDI=0xffffffffffffffff is an unknown value
R8 =0x00007f1e5408319e points into unknown readable memory: 07 00
R9 =0x0000000000000007 is an unknown value
R10=0x0 is NULL
R11=0x0000000000000202 is an unknown value
R12=0x0 is NULL
R13=0x00007f1e54083ea0 points into unknown readable memory: 0x00000000fbad2488 | 88 24 ad fb 00 00 00 00
R14=0x0000000000000018 is an unknown value
R15=0x00007f1d17bf9460 is pointing into the stack for thread: 0x00007f1e7c028e80

Complete guide to installing the C++ Tokenizer & the ES 7.12.1 Vietnamese analysis plugin

*** Environment: Ubuntu 18.04 (or whatever). You must install the Java JDK, not just the JRE, because javac is needed for the C++ Tokenizer. Prepare the .yml files properly yourself following each repo's instructions. The same goes for Docker or a VM; it's simply like this.

sudo su
apt-get update -y
apt-get upgrade -y
apt-get install build-essential cmake unzip pkg-config gcc-7 g++-7 -y
apt-get install wget curl nano git default-jdk maven -y

cd /

*** Download Elasticsearch 7.12.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.12.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.12.1-linux-x86_64.tar.gz
mv elasticsearch-7.12.1-linux-x86_64 /es

** Download the ES Vietnamese analysis plugin
git clone https://github.com/duydo/elasticsearch-analysis-vietnamese.git
cd elasticsearch-analysis-vietnamese
mvn package

** Download the C++ Tokenizer
git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer
mkdir build
cd build
cmake -DBUILD_JAVA=1 ..
make install

** Install the plugin:
cd /es
echo "Y" | ./bin/elasticsearch-plugin install file:///elasticsearch-analysis-vietnamese/target/releases/elasticsearch-analysis-vietnamese-7.12.1.zip

*** Preparation
groupadd -g 999 nqrt && useradd -r -u 999 -g nqrt nqrt
usermod -aG sudo nqrt
chown nqrt:nqrt /es -R
sysctl -w vm.max_map_count=262144

su nqrt

** Run
export ES_JAVA_OPTS="-Xms2048m -Xmx2048m -Djava.library.path=/usr/local/lib"
cd /es
./bin/elasticsearch

Missed tokenization of entity names

Thank you for open-sourcing one of the best and fastest Vietnamese tokenizers 💯

Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes misses tokenization of entity names.

For example:

>>> T.word_tokenize("Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50")
['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
# Expected result : ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']

What can I do to help the tokenizer perform better in these cases?

Error when building the project on Ubuntu

I get this error when running make install on Ubuntu 20.04:

CMake Error at cmake_install.cmake:105 (file):
file INSTALL cannot make directory "/usr/local/share/tokenizer/dicts_text":
No such file or directory.

make: *** [Makefile:74: install] Error 1
