Comments (1)
Thanks for the report.
The problem with `" "` inside a token has been fixed with commit 18a9b93.
The idea of `space_positions` is to save the result of sticky-text segmentation; it has nothing to do with normal tokens.
Could you provide examples for the other issues?
- `" "` should not be considered a token (except in `for_transforming` mode, which is CocCoc specific). If that happens, it may be a bug.
- Both `segment` and `segment_original` shouldn't return any punctuation marks, though `segment_original` is expected to keep the original text format (case-sensitive). For instance, I have:
  - `segment("Tôi Đăng ký trên theGioididong.vn")` => `{"tôi", "đăng ký", "trên", "the gioi", "di dong", "vn"}`
  - `segment_original("Tôi Đăng ký trên theGioididong.vn")` => `{"Tôi", "Đăng_ký", "trên", "the_Gioi", "di_dong", "vn"}`
- Punctuation marks are included in the output of the tokenizer tool for convenience, but they are excluded from `segment_original`'s return vector. It also shows that one can recover token separators from token endpoints.
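The last point, recovering separators from token endpoints, can be sketched roughly as follows. This is a plain-Python illustration and not the coccoc-tokenizer API: the helper names `find_spans` and `separators_from_endpoints` are hypothetical, the endpoints are assumed to be `(start, end)` character offsets with an exclusive end, and the spans are obtained by matching surface forms by hand rather than by calling the tokenizer.

```python
def find_spans(text, pieces):
    """Locate each piece in order and return (start, end) character offsets.

    Stand-in for tokenizer-provided endpoints (an assumption of this sketch).
    """
    spans = []
    cursor = 0
    for piece in pieces:
        start = text.find(piece, cursor)
        if start < 0:
            raise ValueError(f"piece not found: {piece!r}")
        end = start + len(piece)
        spans.append((start, end))
        cursor = end
    return spans


def separators_from_endpoints(text, spans):
    """Return the text lying between each pair of consecutive token spans."""
    return [
        text[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]


text = "Tôi Đăng ký trên theGioididong.vn"
# Surface forms corresponding to the six tokens in the example above.
pieces = ["Tôi", "Đăng ký", "trên", "theGioi", "didong", "vn"]

spans = find_spans(text, pieces)
tokens = [text[start:end] for start, end in spans]
seps = separators_from_endpoints(text, spans)

for token, sep in zip(tokens, seps + [""]):
    print(repr(token), "followed by", repr(sep))
```

In this sketch, an empty separator marks a split inside sticky text (between "theGioi" and "didong"), while "." appears as a separator instead of as a token, which is consistent with punctuation being left out of the returned vector.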
from coccoc-tokenizer.