qinwf / jiebaR
Chinese text segmentation with R. (Documentation updated 🎉: https://qinwenfeng.com/jiebaR/)
License: Other
Segmenting with jiebaR produces Error: Cannot open file
library(jiebaR)
seg=worker()
segment('..',seg)
This arose while handling strings like 2014.01.23: after stripping the digits with tm_map, only `..` was left, and that string directly triggers the error above. It seems to come from detect.cpp — my guess is that it is being treated as a file path?!
Also, distance only seems to compute distances between strings, but I would rather keep just the simhash and write my own function to compute the Hamming distance, to save time and space; that is when I hit the problem above. Any advice?
Incidentally, could a text-clustering feature based on Hamming distance be added?
Thanks
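The Hamming distance the poster wants can be computed in base R from two equal-length bit strings (for example, as produced by jiebaR's tobin() on a simhash value). A minimal sketch — `hamming` is a hypothetical helper, not part of jiebaR:

```r
# Hamming distance between two equal-length bit strings,
# e.g. as produced by jiebaR's tobin() on a simhash value
hamming <- function(a, b) {
  av <- strsplit(a, "")[[1]]
  bv <- strsplit(b, "")[[1]]
  stopifnot(length(av) == length(bv))  # only defined for equal lengths
  sum(av != bv)                        # count of differing bit positions
}
hamming("10101100", "10011100")  # 2
```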
The error message is as follows:
install.packages("jiebaR",dep=T)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://mirrors.xmu.edu.cn/CRAN/src/contrib/jiebaR_0.4.tar.gz'
Content type 'application/x-gzip' length 63702 bytes (62 Kb)
opened URL
downloaded 62 Kb
The downloaded source packages are in
‘/tmp/RtmpGFXG2h/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("jiebaR", dep = T) :
installation of package ‘jiebaR’ had non-zero exit status
See these two related issues/PRs:
yanyiwu/cppjieba#42
yanyiwu/cppjieba#41
tagger <= "我刚刚在桌子旁边不小心摔坏了一只装水的杯子"
r d p n f d n v ul m x
"我" "刚刚" "在" "桌子" "旁边" "不" "小心" "摔坏" "了" "一只" "装水"
uj n
"的" "杯子"
In the example above, the tag for "装水" is unclear. Also, for "摔坏" "了", would splitting it as "摔" + "坏了" be more appropriate?
install.packages("jiebaR")
Installing package into ‘/home/enn_james/R/x86_64-unknown-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
also installing the dependency ‘jiebaRD’
trying URL 'http://cran.rstudio.com/src/contrib/jiebaRD_0.1.tar.gz'
downloaded 4.8 MB
trying URL 'http://cran.rstudio.com/src/contrib/jiebaR_0.5.tar.gz'
downloaded 62 KB
The downloaded source packages are in
‘/tmp/Rtmp9D32gT/downloaded_packages’
install.package("jiebaR")
Error: could not find function "install.package"
install.packages("jiebaR")
Installing package into ‘/home/enn_james/R/x86_
Hello,
I have recently been trying to do word segmentation and sentiment analysis in R, and have collected quite a few segmentation and sentiment dictionaries.
I found another lexicon shared online, but I cannot quite make sense of its field format — could anyone shed some light on it?
Download URL:
http://down.51cto.com/data/269758
The file's fields look like this:
1 扭在 nz 6ff026e67cc327c2 2 930 1 0 3
2 拟在 nz 3ad73d9dc29b7c54 2 10092 0 0 3
3 捻针 nz 52w76148h1f9cei9 2 308 1 0 3
4 怒发冲冠 nfcg 9jue6c3a96b5eoif 4 9313 1 0 3
5 农副产品 nfcp adc3aa31df8f47dd 4 7450 1 0 3
6 女房东 nfd 78foi563e45ga896 3 7108 1 0 3
7 暖风机 nfj bbe96g73c89c3298 3 5116 1 0 3
8 年富力强 nflq 6df5a2e8ba64c9a3 4 13740 1 0 3
9 逆耳忠言 nezy 8h65g473e5e5g52e 4 2285 1 0 3
10 难分难解 nfnj 47a6ce306f3i3d2w 4 7382 1 0 3
11 难分难舍 nfns 7i3eb71865g69aa5 4 6718 1 0 3
12 闹翻天 nft cbe4d1c47ie345a2 3 2694 1 0 3
13 女服务员 nfwy a9cc81f8f08fac43 4 12386 1 0 3
14 逆反心理 nfxl a3i3ba1d2a8ed348 4 6096 1 0 3
15 农副业 nfy c1969cd63ic682bb 3 5468 1 0 3
16 年复一年 nfyn fd18eb2b7afbc1ed 4 27804 1 0 3
1. Segmenting the 2015 government work report
library(jiebaR)
keys33=worker()
keywd=keys33 <= "I:/report15.txt"
tt1=read.table('I:/report15.segment1438142498.5306.txt',header=FALSE,sep=' ')
The result is as follows:
> head(tt1)
V1
1 宸ヤ綔
2 鎶ュ憡
3 2015
4 骞\xb4\n3
5 鏈\x88\n5
6 鏃\xa5\n鍦\xa8\n绗崄浜屽眾
How do I split these columns correctly and extract the useful information — for example, to count word frequencies with table()?
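Once the segmented file is read back with the right encoding, word frequencies fall out of table() directly. A sketch with ASCII stand-in tokens (the real output would be UTF-8 Chinese; pass encoding = "UTF-8" when reading it):

```r
# Simulate a segmented output file: one run of space-separated tokens
tmp <- tempfile(fileext = ".txt")
writeLines("gov work report gov plan work gov", tmp)

# Read tokens individually (add encoding = "UTF-8" for real Chinese output)
words <- scan(tmp, what = character(), sep = " ", quiet = TRUE)

# Word-frequency table, most frequent first
freq <- sort(table(words), decreasing = TRUE)
freq
```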
A quick question:
tokenEngine <- worker("keywords", idf = "idf.test.txt")
vector_keywords(c("蘋果","蘋果","蘋果","柳丁","柳丁"), tokenEngine)
In my custom idf:
蘋果 0.7595766
柳丁 4.2784980
The term frequencies here should be:
蘋果 0.6
柳丁 0.4
So the TF-IDF of 蘋果 and 柳丁 should be 0.455746 and 1.711399 respectively, but the results computed by the code seem completely unrelated; it outputs:
蘋果 30.4548
柳丁 20.3032
I am not asking for the exact algorithm, but I would like to roughly understand your logic. Please advise.
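The numbers the poster expected come from a plain proportional tf-idf (tf as the token's share of the vector, times the custom idf weight). This is the poster's formula, not necessarily what jiebaR computes internally; a base-R sketch with ASCII stand-ins for 蘋果 / 柳丁:

```r
# Plain proportional tf-idf: tf (share of tokens) times an idf weight
tfidf <- function(tokens, idf) {
  tf <- table(tokens) / length(tokens)  # term frequency as a proportion
  tf * idf[names(tf)]                   # align idf weights by name
}

idf_w <- c(apple = 0.7595766, orange = 4.2784980)
res <- tfidf(c("apple", "apple", "apple", "orange", "orange"), idf_w)
round(res, 6)  # apple 0.455746, orange 1.711399 — the numbers from the post
```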
library(jiebaR)
# keyword extraction
kseg = worker(type = "keywords")
> segment("我爱北京***",kseg)
Error: "segment" %in% class(jiebar) is not TRUE
> kseg["我爱北京***"]
8.9954 4.6674
"***" "北京"
> kseg <= "我爱北京***"
8.9954 4.6674
"***" "北京"
> mseg = worker()
> segment("我爱北京***",mseg)
[1] "我" "爱" "北京" "***"
> mseg["我爱北京***"]
[1] "我" "爱" "北京" "***"
> mseg <= "我爱北京***"
[1] "我" "爱" "北京" "***"
> devtools::session_info()
Session info ------------------------------------------------------------
setting value
version R version 3.2.4 (2016-03-10)
system x86_64, mingw32
ui RStudio (0.99.879)
language (EN)
collate Chinese (Simplified)_China.936
tz Asia/Shanghai
date 2016-03-21
Packages -----------------------------------------------------------------
package * version date source
...
jiebaR * 0.8 2016-01-30 CRAN (R 3.2.3)
jiebaRD * 0.1 2015-01-04 CRAN (R 3.2.3)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.2.3)
Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
...
I am using jiebaR 0.8 on Ubuntu 15.10; the problem is as follows:
library(jiebaR)
Loading required package: jiebaRD
seg=worker()
new_user_word(seg,"北京大学","n")
[1] TRUE
seg["北京大学"]
[1] "北京" "大学"
Here "北京大学" was not segmented out as a single word; it was split into two.
Hi,
When compiling on OS X with clang++ (clang-700.1.76), the
#include <Rinternals.h>
causes an error.
Could this be changed to:
#define R_NO_REMAP
#include <Rinternals.h>
to avoid the related errors?
Thanks
segmenter <- worker(type = "query", dict = "dict/scel.dict.utf8")
When this line is run, R freezes. I have run other kinds of workers specifying user instead of dict, and no problems occur.
Would you kindly explain the difference between specifying user and dict?
Also, could you reproduce the bug?
My jiebaR library is the development version from here, on Ubuntu 14.04.
Thank you very much.
Why, after defining a custom dictionary, can I not segment with my custom words?
test = worker(user = "selfdict_160115.txt")
still has no effect afterwards.
In the Rwordseg package, insertWords(selfdict) makes segmentation respect newly added words, but Rwordseg segments too finely. Can jiebaR do the same?
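For reference, a user dictionary for worker(user = ...) is a plain text file with one word per line, optionally followed by a tag. A minimal sketch of building one ("selfword" is a stand-in entry; the worker() call is commented out since it needs jiebaR installed):

```r
# Build a one-entry user dictionary file: word, then an optional tag
dict <- tempfile(fileext = ".txt")
writeLines("selfword n", dict)
readLines(dict)
# seg <- worker(user = dict)   # jiebaR would then keep "selfword" unsplit
```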
There are some characters which are valid UTF-8 Unicode, but after I add them to the dictionary, this error occurs.
../inst/include/lib/DictTrie.hpp:130 ERROR Decode 的𠝹刀 failed.
This character is
General category: Lo - Letter, other
Canonical combining class: 0 - Spacing, split, enclosing, reordrant, & Tibetan subjoined
Bidirectional category: L - Left-to-right
As text: 𠝹
Decimal: 132985
HTML escape: 𠝹
URL escape: %F0%A0%9D%B9
Unicode block: CJK Unified Ideographs Extension B
Script group: undefined
Hello — after calling the get_idf function, the values are all 0. help() shows it calls get_idf_cpp internally; could you share that function's source code?
I used an hmm worker to segment the poem "旌竿幖幖旗㸌㸌,意气横鞭归故乡。", and the result was "旌竿" "幖" "幖" "旗" "意气" "横" "鞭" "归" "故乡" — where did "㸌㸌" go?
# init jieba
library(jiebaR)
seg_local=worker()
# init cluster
library(parallel)
cl=makeCluster(3)
# init args and functions
args=c('abc def','abd efg','ah gs fhg')
get_seg_local=function(d) segment(d,seg_local)
get_seg_remote=function(d) segment(d,seg_remote)
clusterEvalQ(cl,library(jiebaR))
# ======================
# define worker() locally and export it
# ======================
clusterExport(cl,'seg_local')
# clusterExport(cl,'get_seg_local')
parLapply(cl,args,get_seg_local)
# Error in checkForRemoteErrors(val) :
# 3 nodes produced errors; first error: Please create a new worker after jiebaR is reloaded.
# ========================
# define worker() remotely on the nodes
# ========================
clusterCall(cl,function(){
seg_remote=worker()
})
parLapply(cl,args,get_seg_remote)
# Error in checkForRemoteErrors(val) :
# 3 nodes produced errors; first error: object 'seg_remote' not found
The error with the locally-defined worker is mainly caused by a timestamp mismatch — line 42 of
https://github.com/qinwf/jiebaR/blob/master/R/segment.R
Of course, the error in the second approach is not jiebaR's fault (I searched quite a bit of related material but never solved it). I would like to ask whether there is a better way to use jiebaR in parallel computation. Thanks!
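The second attempt fails because clusterCall assigns seg_remote inside a throwaway function environment; clusterEvalQ evaluates in each node's global environment, so the object survives for later parLapply calls. A runnable sketch of that pattern, with toupper standing in for worker() so the example runs without jiebaR (worker() wraps a C++ pointer and cannot be exported across nodes):

```r
library(parallel)

cl <- makeCluster(2)

# Build the non-exportable object on each node, not on the master.
# clusterEvalQ (unlike an assignment inside clusterCall's function body)
# evaluates in each node's global environment.
clusterEvalQ(cl, {
  seg_remote <- toupper                  # real use: seg_remote <- worker()
  get_seg_remote <- function(d) seg_remote(d)
  NULL                                   # avoid shipping objects back
})

res <- parLapply(cl, c("abc def", "abd efg"), function(d) get_seg_remote(d))
stopCluster(cl)
unlist(res)  # "ABC DEF" "ABD EFG"
```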
Replanning a new API to make things easier to use. Some ideas:
1. Split Cppjieba's segmentation, keyword extraction, and Simhash methods into small modules with no mutual dependencies. Cppjieba 5.0 added a TextRank module; bolting it onto the existing interface would probably be awkward to use.
In the original Cppjieba code, keyword extraction and Simhash both include the segmentation step, yet those two steps can actually be made independent: the user segments first and then runs the later steps, for example:
text %>%
fenci() %>%
key_tfidf() %>% # key_textrank()
simhash()
2. Split punctuation filtering, stop-word filtering, file reading, bylines, and similar logic into separate functions, so users can compose exactly the steps they need and the overhead of if/else branching is reduced.
text %>%
rm_sym() %>%
rm_stopwords %>%
fenci()
Composing function chains, for example:
rm_sym_stopwords = function(txt){
txt %>%
rm_sym %>%
rm_stopwords()
}
text %>%
rm_sym_stopwords %>%
fenci()
checking tests ... ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
12: test("./CPP_API")
13: load_all(pkg, quiet = TRUE) at /Users/hadley/Documents/devtools/devtools/R/test.r:50
14: check_suggested("roxygen2") at /Users/hadley/Documents/devtools/devtools/R/load.r:88
15: check_dep_version(pkg, version, compare) at /Users/hadley/Documents/devtools/devtools/R/utils.r:63
16: stop("Dependency package ", dep_name, " not available.") at /Users/hadley/Documents/devtools/devtools/R/package-deps.r:56
testthat results ================================================================
OK: 14 SKIPPED: 0 FAILED: 2
1. Error: C_API
2. Error: CPP_API
Error: testthat unit tests failed
Execution halted
Could you please look into this as soon as possible? I really want to release devtools to CRAN in the next couple of days so that I can use the latest version in class next week.
From the error, I'm guessing it's something to do with roxygen2, which devtools now only suggests instead of requiring.
> devtools::install_github("qinwf/jiebaR")
Downloading github repo qinwf/jiebaR@master
Installing jiebaR
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \
'/private/var/folders/sd/2_0mf8p100v7dl3p_2cs5m_80000gn/T/RtmpiBlUkZ/devtools53cf225d0fdf/qinwf-jiebaR-6c1d204' \
--library='/Users/13k/R' --install-tests
* installing *source* package ‘jiebaR’ ...
** libs
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c detect.cpp -o detect.o
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c segtype.cpp -o segtype.o
clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o jiebaR.so RcppExports.o detect.o segtype.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
installing to /Users/13k/R/jiebaR/libs
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
sh: line 1: 21527 Segmentation fault: 11 '/Library/Frameworks/R.framework/Resources/bin/R' --no-save --slave 2>&1 < '/var/folders/sd/2_0mf8p100v7dl3p_2cs5m_80000gn/T//RtmpecMLgy/file54083f9095e3'
*** caught segfault ***
address 0x0, cause 'unknown'
Traceback:
1: .Call(Module__classes_info, xp)
2: Module(module, mustStart = TRUE, where = env)
3: doTryCatch(return(expr), name, parentenv, handler)
4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
5: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch(Module(module, mustStart = TRUE, where = env), error = function(e) e)
7: loadModule("mod_mpseg", TRUE)
8: (function (ns) { loadModule("mod_mpseg", TRUE) loadModule("mod_mixseg", TRUE) loadModule("mod_query", TRUE) loadModule("mod_hmmseg", TRUE) loadModule("mod_tag", TRUE) loadModule("mod_key", TRUE) loadModule("mod_sim", TRUE)})(<environment>)
9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch((function (ns) { loadModule("mod_mpseg", TRUE) loadModule("mod_mixseg", TRUE) loadModule("mod_query", TRUE) loadModule("mod_hmmseg", TRUE) loadModule("mod_tag", TRUE) loadModule("mod_key", TRUE) loadModule("mod_sim", TRUE)})(<environment>), error = function(e) e)
13: eval(expr, envir, enclos)
14: eval(substitute(tryCatch(FUN(WHERE), error = function(e) e), list(FUN = f, WHERE = where)), where)
15: .doLoadActions(where, attach)
16: methods:::cacheMetaData(ns, TRUE, ns)
17: loadNamespace(package, c(which.lib.loc, lib.loc))
18: doTryCatch(return(expr), name, parentenv, handler)
19: tryCatchOne(expr, names, parentenv, handlers[[1L]])
20: tryCatchList(expr, classes, parentenv, handlers)
21: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
22: try({ ns <- loadNamespace(package, c(which.lib.loc, lib.loc)) env <- attachNamespace(ns, pos = pos, deps)})
23: library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)
24: withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
25: suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE))
26: doTryCatch(return(expr), name, parentenv, handler)
27: tryCatchOne(expr, names, parentenv, handlers[[1L]])
28: tryCatchList(expr, classes, parentenv, handlers)
29: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
30: try(suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)))
31: tools:::.test_load_package("jiebaR", "/Users/13k/R")
aborting ...
ERROR: loading failed
* removing ‘/Users/13k/R/jiebaR’
Error: Command failed (1)
By default, it seems runs of English letters and digits are split apart — for example AK47, N95, and X-900 have their digits and letters separated.
In practice, these joined letter-digit(-hyphen) strings are usually model numbers and the like, which are hard to collect exhaustively into a dictionary. I therefore suggest not splitting tokens where letters and digits are directly joined.
I wonder whether that would be appropriate?
The error says the new grep call uses a PCRE pattern with Unicode properties that this build of PCRE does not support.
Details:
> test_worker <- worker('tag')
> test_worker <= '这是一个测试句子。'
Error in grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
invalid regular expression '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'
In addition: Warning message:
In grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
PCRE pattern compilation error
'this version of PCRE is not compiled with Unicode property support'
at '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'
As is known, if we want to use the tm library to build a Corpus, the input should be a string separated by spaces, for example: "I am a student from DUT."
But jiebaR returns a vector like c("I", "am", "a", "student", "from", "DUT") — how can I use it as input to tm?
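Collapsing the token vector back into one space-separated string bridges the two:

```r
# jiebaR-style token vector -> single space-separated document string
tokens <- c("I", "am", "a", "student", "from", "DUT")
doc <- paste(tokens, collapse = " ")
doc  # "I am a student from DUT"
```

The resulting string can then be wrapped for tm, e.g. tm::Corpus(tm::VectorSource(doc)).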
freq(wk[kexue])
char freq
1 <U+953B><U+70BC> 1
2 <U+591A><U+52A0> 1
3 <U+6709><U+6240> 1
4 <U+672C><U+9886> 1
5 <U+8FD9> 1
6 <U+4E4B> 1
7 <U+7ECF><U+6D4E><U+5B66><U+5BB6> 1
8 <U+7ED3><U+8BBA> 1
9 <U+559C><U+6076> 1
10 <U+4E00><U+8D77> 1
The following procedure could be very useful for many who would like to use jiebaR, but it runs into the error message below. Please help to streamline it.
Error in segment(code, jiebar) : Argument 'code' must be an string.
Here is the code:
library(tm)
library(jiebaR)
xdir1 = "~/R/all_ANSI/"
xdir2 = "~/R/all_ANSI_out/"
mixseg = worker()
raw <- list.files(path = xdir1, pattern = "*.txt")
for (f in raw)
{
xpath = paste(xdir1,f,sep="/")
xdata = read.table(xpath,stringsAsFactors=FALSE, sep="\t")
M = lapply(1:length(xdata$text),function(i) mixseg <= xdata$text[[i]])
xpath2 = paste(xdir2,f,sep="/")
write(M,xpath2)
}
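segment() insists on a length-one character argument, so the error likely means some element of xdata$text is not a plain non-empty string (an assumption — the actual data isn't shown). A defensive coercion in base R, with hypothetical sample data:

```r
# Coerce each element to a single character string, then drop NAs and empties
texts <- list("first line", 42, "", NA)          # hypothetical messy input
chr <- vapply(texts, function(x) as.character(x)[1], character(1))
chr <- chr[!is.na(chr) & nzchar(chr)]
chr  # "first line" "42"
```

Each surviving element is then a valid scalar argument for segment().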
Hi,
I am the author of FeatureHashing. FeatureHashing currently supports splitting English sentences and then applying the hashing trick directly to produce a sparse matrix for downstream machine-learning packages (e.g. https://github.com/wush978/FeatureHashing/blob/master/vignettes/SentimentAnalysis.Rmd).
I would be very interested in offering jiebaR's segmentation inside FeatureHashing as well. Since both cores are developed with Rcpp, would you be interested in providing an interface (roughly along the lines of eddelbuettel/digest#10) so that FeatureHashing or other third-party packages can call jiebaR's segmentation directly from C/C++?
Wush
decode_scel(scel = "D:/Sougou/SogouInput/8.0.0.8381/scd/14108.scel",
output = "D:/R/R-3.3.1/library/jiebaRD/dict/sougou.dict",
cpp = TRUE)
Error in eval(substitute(expr), envir, enclos) : not a valid .scel file?
Is anything wrong in the code?
I see the user dictionary contains the word 江大桥, but running the example shows 江大桥 is no longer recognized 😒
> mixseg = worker()
> mixseg["江州市长江大桥参加了长江大桥的通车仪式"]
[1] "江州" "市" "长江大桥" "参加" "了" "长江大桥" "的"
[8] "通车" "仪式"
# test the user dictionary
> mixseg["江州市长弗洛格参加了长江大桥的通车仪式"]
[1] "江州" "市长" "弗洛" "格" "参加" "了"
[7] "长江大桥" "的" "通车" "仪式"
> new_user_word(mixseg,'弗洛格','nz')
[1] TRUE
> mixseg["江州市长弗洛格参加了长江大桥的通车仪式"]
[1] "江州" "市长" "弗洛格" "参加" "了" "长江大桥" "的"
[8] "通车" "仪式"
# version info
> devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.4 (2016-03-10)
system x86_64, mingw32
ui RStudio (0.99.879)
language (EN)
collate Chinese (Simplified)_China.936
tz Asia/Shanghai
date 2016-03-17
Packages -----------------------------------------------------------------------
package * version date source
devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
jiebaR * 0.8 2016-01-30 CRAN (R 3.2.3)
jiebaRD * 0.1 2015-01-04 CRAN (R 3.2.3)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
Testing confirms the user dictionary does take effect, but earlier versions could segment out 江大桥 from the user dictionary. Has a weight been changed somewhere?
cc @qinwf
tl;dr
error: redefinition of default argument for ‘class _Hash’
class unordered_map
^
In file included from /usr/include/c++/4.9/tr1/unordered_map:42:0,
==> devtools::check(args = c('--no-build-vignettes'))
Updating jiebaR documentation
Loading jiebaR
Re-compiling jiebaR
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD \
INSTALL '/home/out/桌面/jiebaR' \
--library='/tmp/RtmpD6k3xl/devtools_install_a59564f2ca2' --no-R --no-data \
--no-help --no-demo --no-inst --no-docs --no-exec --no-multiarch \
--no-test-load --preclean
* installing *source* package ‘jiebaR’ ...
g++ -I/usr/share/R/include -DNDEBUG -std=gnu++14 -I../inst/include -DLOGGER_LEVEL=LL_WARN -I"/home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
** libs
In file included from /usr/include/c++/4.9/unordered_map:48:0,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/platform/compiler.h:153,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/r/headers.h:48,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:29,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp.h:27,
from ../inst/include/lib/Trie.hpp:7,
from ../inst/include/lib/DictTrie.hpp:14,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/bits/unordered_map.h:98:11: error: redefinition of default argument for ‘class _Hash’
class unordered_map
^
In file included from /usr/include/c++/4.9/tr1/unordered_map:42:0,
from ../inst/include/lib/Limonp/StdExtension.hpp:10,
from ../inst/include/lib/Limonp/StringUtil.hpp:24,
from ../inst/include/lib/DictTrie.hpp:11,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/tr1/unordered_map.h:177:5: note: original definition appeared here
class _Hash = hash<_Key>,
^
In file included from /usr/include/c++/4.9/unordered_set:48:0,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/platform/compiler.h:162,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/r/headers.h:48,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:29,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp.h:27,
from ../inst/include/lib/Trie.hpp:7,
from ../inst/include/lib/DictTrie.hpp:14,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/bits/unordered_set.h:93:11: error: redefinition of default argument for ‘class _Hash’
class unordered_set
^
In file included from /usr/include/c++/4.9/tr1/unordered_set:42:0,
from ../inst/include/lib/Limonp/StdExtension.hpp:11,
from ../inst/include/lib/Limonp/StringUtil.hpp:24,
from ../inst/include/lib/DictTrie.hpp:11,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/tr1/unordered_set.h:170:5: note: original definition appeared here
class _Hash = hash<_Value>,
^
/usr/lib/R/etc/Makeconf:143: recipe for target 'RcppExports.o' failed
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘jiebaR’
* removing ‘/tmp/RtmpD6k3xl/devtools_install_a59564f2ca2/jiebaR’
Error: Command failed (1)
Execution halted
Exited with status 1.
The Hamming distance I compute with the stringdist package does not match the distance from jiebaR's distance. Am I using it incorrectly?
Reference: https://qinwenfeng.com/jiebaR/sim.html
Test 1:
library(jiebaR)
words = "hello world!"
simhasher = worker("simhash",topn=2)
distance(words, "江州市长江大桥参加了长江大桥的通车仪式",simhasher)
a <- tobin(simhasher[words]$simhash)
b <- tobin(simhasher["江州市长江大桥参加了长江大桥的通车仪式"]$simhash)
stringdist::stringdist(a,b,method = "hamming")
Test 2:
sim = worker("simhash")
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("真的","十分","不错","的","感觉"),sim)
res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
a<-tobin(res$simhash)
res1 = vector_simhash(c("真的","十分","不错","的","感觉"),sim)
b<-tobin(res1$simhash)
stringdist::stringdist(a,b,method = "hamming")
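One possible source of the mismatch (an assumption, not verified against jiebaR's tobin()): stringdist's "hamming" method requires both strings to have equal length, so if the binary forms lose leading zeros the comparison runs over different lengths. Left-padding both operands to 64 bits first would rule that out:

```r
# Left-pad a bit string to 64 characters so both operands compare bitwise
pad64 <- function(bits) paste0(strrep("0", 64 - nchar(bits)), bits)
nchar(pad64("101"))  # 64
# stringdist::stringdist(pad64(a), pad64(b), method = "hamming")
```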
Hi @qinwf ,
I installed the current GitHub version of jiebaR with RStudio's devtools, but the installation fails with the following error: "'stoull' was not declared in this scope".
The machine's gcc version is shown in the screenshot below:
@qinwf Hello, I ran into the following problem while segmenting:
wk["三维工程()排污费征收标准将提高一倍 四类环保企业望受益 ://..////."]
Error in file_coding(code[1]) : Cannot open file
When I remove the final "." it works:
wk["三维工程()排污费征收标准将提高一倍 四类环保企业望受益 ://..////"]
[1] "排污费" "征收" "标准" "提高" "一倍" "四类" "环保" "企业" "受益"
What is the reason for this?
Computing tf-idf uses the idf.utf8 file. How is the weight of each word in this file derived?
Will this dictionary be updated periodically as the language grows?
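For context, idf weights of this kind are conventionally derived as log(N / df) over a large background corpus, where df is the number of documents containing the word — this is the standard formula; whether idf.utf8 was built exactly this way is not confirmed here. A toy sketch:

```r
# Standard idf: log(N / df), df = number of documents containing the word
docs <- list(c("we", "like", "idf"),
             c("we", "count", "documents"),
             c("idf", "weights"))

idf_of <- function(word, docs) {
  df <- sum(vapply(docs, function(d) word %in% d, logical(1)))
  log(length(docs) / df)
}

round(idf_of("we", docs), 4)  # in 2 of 3 docs: log(3/2) ≈ 0.4055
```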
Hi @qinwf ,
First off, thanks a lot for the wonderful package :)
I'd love to know if there's a way to nature-tag already-tokenized words (say, in a vector).
Currently when I run the tagger, it breaks down my already-tokenized vector of words. My use case is tagging the nature of words in my user dictionary, so they have the right nature instead of the ones I assigned at dictionary creation.
Thanks again and look forward to your insights.
Cheers,
Andrew
keys=worker(type='keywords',topn=30)
keywd=keys <= "I:/rwork/cnseg/report15.txt"
keywd
attr(keywd,'names')
[1] "425.433" "351.371" "343.874" "286.865" "241.128" "218.025" "204.13" "200.926" "196.841"
[10] "196.479" "192.414" "184.07" "180.495" "163.491" "160.962" "160.903" "140.446" "139.77"
[19] "138.752" "131.364" "129.173" "127.921" "126.653" "117.667" "114.576" "109.752" "109.271"
[28] "107.422" "106.618" "101.501"
How do I extract the keywords and tf-idf values into a data.frame format like the following?
keywords | tf-idf |
---|---|
政府 | 192.414 |
创新 | 351.371 |
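Judging from the printed output above (the scores appear in names(keywd)), the result is a character vector of keywords whose names hold the tf-idf scores, so a data.frame can be assembled like this — with English stand-in keywords, since the actual ones are not shown:

```r
# Stand-in for the keywords-worker result: keyword values, score names
keywd <- c("425.433" = "government", "351.371" = "innovation")

df <- data.frame(keywords = unname(keywd),
                 tf_idf   = as.numeric(names(keywd)),
                 stringsAsFactors = FALSE)
df
```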
Whether installing from the CRAN repo or via install_github, and whether into the user library path or the system library path, I get the same error:
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
The detailed error output is as follows:
* installing *source* package ‘jiebaR’ ...
** libs
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c RcppExports.cpp -o RcppExports.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'RcppExports.o' failed
make: [RcppExports.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c detect.cpp -o detect.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'detect.o' failed
make: [detect.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c get_idf.cpp -o get_idf.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'get_idf.o' failed
make: [get_idf.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c get_tuple.cpp -o get_tuple.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'get_tuple.o' failed
make: [get_tuple.o] Error 127 (ignored)
gcc -std=gnu99 -I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -fpic -DU_STATIC_IMPLEMENTATION -O2 -g -c init.c -o init.o
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c segtype-v4.cpp -o segtype-v4.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'segtype-v4.o' failed
make: [segtype-v4.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c util.cpp -o util.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'util.o' failed
make: [util.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c word_freq.cpp -o word_freq.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'word_freq.o' failed make: [word_freq.o] Error 127 (ignored)
-shared -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -o jiebaR.so RcppExports.o detect.o get_idf.o get_tuple.o init.o segtype-v4.o util.o word_freq.o -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -lR
sh: line 2: -shared: command not found
/usr/lib64/microsoft-r/3.3/lib64/R/share/make/shlib.mk:6: recipe for target 'jiebaR.so' failed
make: *** [jiebaR.so] Error 127
ERROR: compilation failed for package ‘jiebaR’
* removing ‘/home/da/R/x86_64-pc-linux-gnu-library/3.3/jiebaR’
Error: Command failed (1)
session_info():
Session info ------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, linux-gnu
ui RStudio (0.99.902)
language (EN)
collate zh_CN.UTF-8
tz PRC
date 2016-10-26
Packages ----------------------------------------------------------------------------------------------------------------------
package * version date source
colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
devtools * 1.12.0 2016-06-24 CRAN (R 3.3.1)
digest 0.6.9 2016-01-08 CRAN (R 3.2.5)
ggplot2 2.1.0 2016-03-01 CRAN (R 3.3.0)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.5)
munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
Rcpp 0.12.5 2016-05-14 CRAN (R 3.3.0)
RevoUtils 10.0.1 2016-08-24 local
RevoUtilsMath * 10.0.0 2016-06-15 local
scales 0.4.0 2016-02-26 CRAN (R 3.3.0)
withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
sessionInfo():
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=zh_CN.UTF-8 LC_COLLATE=zh_CN.UTF-8
[5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=zh_CN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] devtools_1.12.0 RevoUtilsMath_10.0.0
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 scales_0.4.0 plyr_1.8.4 RevoUtils_10.0.1 tools_3.3.1 withr_1.0.2 gtable_0.2.0
[8] rstudioapi_0.6 memoise_1.0.0 Rcpp_0.12.5 ggplot2_2.1.0 grid_3.3.1 digest_0.6.9 munsell_0.4.3
The system gcc version is 5.4.0 20160609.
On Windows 7 I used jiebaR to segment a text file (UTF-8 encoded). Because of an incomplete final line, a warning is returned, and the file name inside that message seems to be encoded in GBK.
For example, running
segmentParSet1 <= "寂寞像一只蚊子.txt"
returns
In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on '瀵傚癁鍍忎竴鍙殜瀛?txt'
Although the file manager shows the newly generated file as
寂寞像一只蚊子.segment.2015-12-09_18_34_49,
why does the R console print that garbled string — the original file name output as GBK? It does not affect the result, but could this be fixed?
Thanks
I plan to merge https://github.com/qinwf/THULACR in, so there will be two segmentation engines to choose from. THULAC's tagging is much better.
This needs a fairly unified, simple interface; it still needs some thought.
It is almost like rewriting all the interfaces. The existing interfaces will be kept.
According to the introduction to Simhash and Hamming distance, the procedure is: segment first, then extract keywords, then compute the simhash value and the Hamming distance.
Questions: after segmentation, is keyword extraction based on the tf_idf value? If so, is tf_idf = tf ratio × idf value (taken directly from jiebaRD/dict/idf.utf8)? And if a keyword does not exist in idf.utf8, does that mean it has no tf_idf value and will never appear among the keywords?
library(jiebaR)
cutter=worker(type='keywords',user = 'D:/R/soft/library/jiebaRD/dict/usrdic_20161102.utf8', stop_word = 'D:/R/soft/library/jiebaRD/dict/stop_words.utf8', bylines = TRUE)
563.482 518.433 208.951 199.566 190.731
"360" "手机" "数据线" "差评" "客服"
The result is the keywords for the whole document; how do I configure it to extract keywords per line?
Also, if type = 'mix' is set, how do I filter out stop words? Below is the code I tried for filtering, but it seems to have no effect — please help me fix it. Thanks.
removewords <- function(target_words, stop_words) {
  target_words[!target_words %in% stop_words]
}
stopwd=readLines('D:/R/soft/library/jiebaRD/dict/stop_words.utf8',encoding = 'UTF-8')
class(stopwd)
[1] "character"
content3=sapply(content2,FUN = removewords,stopwd)
I am using jiebaR on Linux and Windows, and I get different results:
cc2 = worker()
cc2['测试停词abd we']
on Linux :
[1] "测试" "停词" "abd" "we"
on Windows :
[1] "测试" "停" "词"
What is the problem?
cutter = worker(type='mix')
result_segment = cutter["我是你的测试文本,用于测试过滤分词效果。"]
result_segment
[1] "我" "是" "你" "的" "测试" "文本" "用于" "测试" "过滤" "分词" "效果
Questions:
The default stop_word dictionary already contains words of little practical meaning such as '我', '是', '你', '的'.
Why do these words still appear in the segmentation result, so that filter_segment() is needed to remove them?
What, then, is the point of the stop-word list?