qinwf / jiebaR
Chinese text segmentation with R. (Documentation updated 🎉: https://qinwenfeng.com/jiebaR/)
License: Other
Segmenting with jiebaR produces Error: Cannot open file
library(jiebaR)
seg=worker()
segment('..',seg)
This arose while handling strings like 2014.01.23: after stripping the digits with tm_map, only `..` was left, and that string directly triggers the error above. It seems to come from detect.cpp — my guess is that it is being treated as a file path?!
Also, distance only seems to compute distances between strings, but I would rather keep just the simhash and write my own function to compute the Hamming distance, to save time and space; that is when I hit the problem above. Any advice?
Incidentally, could a text-clustering feature based on Hamming distance be added?
Thanks
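The Hamming distance the poster wants can be computed in base R from two equal-length bit strings (for example, as produced by jiebaR's tobin() on a simhash value). A minimal sketch — `hamming` is a hypothetical helper, not part of jiebaR:

```r
# Hamming distance between two equal-length bit strings,
# e.g. as produced by jiebaR's tobin() on a simhash value
hamming <- function(a, b) {
  av <- strsplit(a, "")[[1]]
  bv <- strsplit(b, "")[[1]]
  stopifnot(length(av) == length(bv))  # only defined for equal lengths
  sum(av != bv)                        # count of differing bit positions
}
hamming("10101100", "10011100")  # 2
```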
The error message is as follows:
install.packages("jiebaR",dep=T)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://mirrors.xmu.edu.cn/CRAN/src/contrib/jiebaR_0.4.tar.gz'
Content type 'application/x-gzip' length 63702 bytes (62 Kb)
opened URL
downloaded 62 Kb
The downloaded source packages are in
‘/tmp/RtmpGFXG2h/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("jiebaR", dep = T) :
installation of package ‘jiebaR’ had non-zero exit status
See these two related issues/PRs:
yanyiwu/cppjieba#42
yanyiwu/cppjieba#41
tagger <= "我刚刚在桌子旁边不小心摔坏了一只装水的杯子"
r d p n f d n v ul m x
"我" "刚刚" "在" "桌子" "旁边" "不" "小心" "摔坏" "了" "一只" "装水"
uj n
"的" "杯子"
In the example above, the tag for "装水" is unclear. Also, for "摔坏" "了", would splitting it as "摔" + "坏了" be more appropriate?
install.packages("jiebaR")
Installing package into ‘/home/enn_james/R/x86_64-unknown-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
also installing the dependency ‘jiebaRD’
trying URL 'http://cran.rstudio.com/src/contrib/jiebaRD_0.1.tar.gz'
downloaded 4.8 MB
trying URL 'http://cran.rstudio.com/src/contrib/jiebaR_0.5.tar.gz'
downloaded 62 KB
The downloaded source packages are in
‘/tmp/Rtmp9D32gT/downloaded_packages’
install.package("jiebaR")
Error: could not find function "install.package"
install.packages("jiebaR")
Installing package into ‘/home/enn_james/R/x86_
Hello,
I have recently been trying to do word segmentation and sentiment analysis in R, and have collected quite a few segmentation and sentiment dictionaries.
I found another lexicon shared online, but I cannot quite make sense of its field format — could anyone shed some light on it?
Download URL:
http://down.51cto.com/data/269758
The file's fields look like this:
1 扭在 nz 6ff026e67cc327c2 2 930 1 0 3
2 拟在 nz 3ad73d9dc29b7c54 2 10092 0 0 3
3 捻针 nz 52w76148h1f9cei9 2 308 1 0 3
4 怒发冲冠 nfcg 9jue6c3a96b5eoif 4 9313 1 0 3
5 农副产品 nfcp adc3aa31df8f47dd 4 7450 1 0 3
6 女房东 nfd 78foi563e45ga896 3 7108 1 0 3
7 暖风机 nfj bbe96g73c89c3298 3 5116 1 0 3
8 年富力强 nflq 6df5a2e8ba64c9a3 4 13740 1 0 3
9 逆耳忠言 nezy 8h65g473e5e5g52e 4 2285 1 0 3
10 难分难解 nfnj 47a6ce306f3i3d2w 4 7382 1 0 3
11 难分难舍 nfns 7i3eb71865g69aa5 4 6718 1 0 3
12 闹翻天 nft cbe4d1c47ie345a2 3 2694 1 0 3
13 女服务员 nfwy a9cc81f8f08fac43 4 12386 1 0 3
14 逆反心理 nfxl a3i3ba1d2a8ed348 4 6096 1 0 3
15 农副业 nfy c1969cd63ic682bb 3 5468 1 0 3
16 年复一年 nfyn fd18eb2b7afbc1ed 4 27804 1 0 3
1. Segmenting the 2015 government work report
library(jiebaR)
keys33=worker()
keywd=keys33 <= "I:/report15.txt"
tt1=read.table('I:/report15.segment1438142498.5306.txt',header=FALSE,sep=' ')
The result is as follows:
> head(tt1)
V1
1 宸ヤ綔
2 鎶ュ憡
3 2015
4 骞\xb4\n3
5 鏈\x88\n5
6 鏃\xa5\n鍦\xa8\n绗崄浜屽眾
How do I split these columns correctly and extract the useful information — for example, to count word frequencies with table()?
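Once the segmented file is read back with the right encoding, word frequencies fall out of table() directly. A sketch with ASCII stand-in tokens (the real output would be UTF-8 Chinese; pass encoding = "UTF-8" when reading it):

```r
# Simulate a segmented output file: one run of space-separated tokens
tmp <- tempfile(fileext = ".txt")
writeLines("gov work report gov plan work gov", tmp)

# Read tokens individually (add encoding = "UTF-8" for real Chinese output)
words <- scan(tmp, what = character(), sep = " ", quiet = TRUE)

# Word-frequency table, most frequent first
freq <- sort(table(words), decreasing = TRUE)
freq
```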
A quick question:
tokenEngine <- worker("keywords", idf = "idf.test.txt")
vector_keywords(c("蘋果","蘋果","蘋果","柳丁","柳丁"), tokenEngine)
In my custom idf:
蘋果 0.7595766
柳丁 4.2784980
The term frequencies here should be:
蘋果 0.6
柳丁 0.4
So the TF-IDF of 蘋果 and 柳丁 should be 0.455746 and 1.711399 respectively, but the results computed by the code seem completely unrelated; it outputs:
蘋果 30.4548
柳丁 20.3032
I am not asking for the exact algorithm, but I would like to roughly understand your logic. Please advise.
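The numbers the poster expected come from a plain proportional tf-idf (tf as the token's share of the vector, times the custom idf weight). This is the poster's formula, not necessarily what jiebaR computes internally; a base-R sketch with ASCII stand-ins for 蘋果 / 柳丁:

```r
# Plain proportional tf-idf: tf (share of tokens) times an idf weight
tfidf <- function(tokens, idf) {
  tf <- table(tokens) / length(tokens)  # term frequency as a proportion
  tf * idf[names(tf)]                   # align idf weights by name
}

idf_w <- c(apple = 0.7595766, orange = 4.2784980)
res <- tfidf(c("apple", "apple", "apple", "orange", "orange"), idf_w)
round(res, 6)  # apple 0.455746, orange 1.711399 — the numbers from the post
```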
library(jiebaR)
# keyword extraction
kseg = worker(type = "keywords")
> segment("我爱北京***",kseg)
Error: "segment" %in% class(jiebar) is not TRUE
> kseg["我爱北京***"]
8.9954 4.6674
"***" "北京"
> kseg <= "我爱北京***"
8.9954 4.6674
"***" "北京"
> mseg = worker()
> segment("我爱北京***",mseg)
[1] "我" "爱" "北京" "***"
> mseg["我爱北京***"]
[1] "我" "爱" "北京" "***"
> mseg <= "我爱北京***"
[1] "我" "爱" "北京" "***"
> devtools::session_info()
Session info ------------------------------------------------------------
setting value
version R version 3.2.4 (2016-03-10)
system x86_64, mingw32
ui RStudio (0.99.879)
language (EN)
collate Chinese (Simplified)_China.936
tz Asia/Shanghai
date 2016-03-21
Packages -----------------------------------------------------------------
package * version date source
...
jiebaR * 0.8 2016-01-30 CRAN (R 3.2.3)
jiebaRD * 0.1 2015-01-04 CRAN (R 3.2.3)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.2.3)
Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
...
I am using jiebaR 0.8 on Ubuntu 15.10; the problem is as follows:
library(jiebaR)
Loading required package: jiebaRD
seg=worker()
new_user_word(seg,"北京大学","n")
[1] TRUE
seg["北京大学"]
[1] "北京" "大学"
Here "北京大学" was not segmented out as a single word; it was split into two.
Hi,
When compiling on OS X with clang++ (clang-700.1.76), the
#include <Rinternals.h>
causes an error.
Could this be changed to:
#define R_NO_REMAP
#include <Rinternals.h>
to avoid the related errors?
Thanks
segmenter <- worker(type = "query", dict = "dict/scel.dict.utf8")
When this line is run, R freezes. I have run other kinds of workers specifying user instead of dict, and no problems occur.
Would you kindly explain the difference between specifying user and dict?
Also, could you reproduce the bug?
My jiebaR library is the development version from here, on Ubuntu 14.04.
Thank you very much.
Why, after defining a custom dictionary, can I not segment with my custom words?
test = worker(user = "selfdict_160115.txt")
still has no effect afterwards.
In the Rwordseg package, insertWords(selfdict) makes segmentation respect newly added words, but Rwordseg segments too finely. Can jiebaR do the same?
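For reference, a user dictionary for worker(user = ...) is a plain text file with one word per line, optionally followed by a tag. A minimal sketch of building one ("selfword" is a stand-in entry; the worker() call is commented out since it needs jiebaR installed):

```r
# Build a one-entry user dictionary file: word, then an optional tag
dict <- tempfile(fileext = ".txt")
writeLines("selfword n", dict)
readLines(dict)
# seg <- worker(user = dict)   # jiebaR would then keep "selfword" unsplit
```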
There are some characters which are valid UTF-8 Unicode, but after I add them to the dictionary, this error occurs.
../inst/include/lib/DictTrie.hpp:130 ERROR Decode 的𠝹刀 failed.
This character is
General category: Lo - Letter, other
Canonical combining class: 0 - Spacing, split, enclosing, reordrant, & Tibetan subjoined
Bidirectional category: L - Left-to-right
As text: 𠝹
Decimal: 132985
HTML escape: 𠝹
URL escape: %F0%A0%9D%B9
Unicode block: CJK Unified Ideographs Extension B
Script group: undefined
Hello — after calling the get_idf function, the values are all 0. help() shows it calls get_idf_cpp internally; could you share that function's source code?
I used an hmm worker to segment the poem "旌竿幖幖旗㸌㸌,意气横鞭归故乡。", and the result was "旌竿" "幖" "幖" "旗" "意气" "横" "鞭" "归" "故乡" — where did "㸌㸌" go?
# init jieba
library(jiebaR)
seg_local=worker()
# init cluster
library(parallel)
cl=makeCluster(3)
# init args and functions
args=c('abc def','abd efg','ah gs fhg')
get_seg_local=function(d) segment(d,seg_local)
get_seg_remote=function(d) segment(d,seg_remote)
clusterEvalQ(cl,library(jiebaR))
# ======================
# define worker() locally and export it
# ======================
clusterExport(cl,'seg_local')
# clusterExport(cl,'get_seg_local')
parLapply(cl,args,get_seg_local)
# Error in checkForRemoteErrors(val) :
# 3 nodes produced errors; first error: Please create a new worker after jiebaR is reloaded.
# ========================
# define worker() remotely on the nodes
# ========================
clusterCall(cl,function(){
seg_remote=worker()
})
parLapply(cl,args,get_seg_remote)
# Error in checkForRemoteErrors(val) :
# 3 nodes produced errors; first error: object 'seg_remote' not found
The error with the locally-defined worker is mainly caused by a timestamp mismatch — line 42 of
https://github.com/qinwf/jiebaR/blob/master/R/segment.R
Of course, the error in the second approach is not jiebaR's fault (I searched quite a bit of related material but never solved it). I would like to ask whether there is a better way to use jiebaR in parallel computation. Thanks!
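The second attempt fails because clusterCall assigns seg_remote inside a throwaway function environment; clusterEvalQ evaluates in each node's global environment, so the object survives for later parLapply calls. A runnable sketch of that pattern, with toupper standing in for worker() so the example runs without jiebaR (worker() wraps a C++ pointer and cannot be exported across nodes):

```r
library(parallel)

cl <- makeCluster(2)

# Build the non-exportable object on each node, not on the master.
# clusterEvalQ (unlike an assignment inside clusterCall's function body)
# evaluates in each node's global environment.
clusterEvalQ(cl, {
  seg_remote <- toupper                  # real use: seg_remote <- worker()
  get_seg_remote <- function(d) seg_remote(d)
  NULL                                   # avoid shipping objects back
})

res <- parLapply(cl, c("abc def", "abd efg"), function(d) get_seg_remote(d))
stopCluster(cl)
unlist(res)  # "ABC DEF" "ABD EFG"
```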
Replanning a new API to make things easier to use. Some ideas:
1. Split Cppjieba's segmentation, keyword extraction, and Simhash methods into small modules with no mutual dependencies. Cppjieba 5.0 added a TextRank module; bolting it onto the existing interface would probably be awkward to use.
In the original Cppjieba code, keyword extraction and Simhash both include the segmentation step, yet those two steps can actually be made independent: the user segments first and then runs the later steps, for example:
text %>%
fenci() %>%
key_tfidf() %>% # key_textrank()
simhash()
2. Split punctuation filtering, stop-word filtering, file reading, bylines, and similar logic into separate functions, so users can compose exactly the steps they need and the overhead of if/else branching is reduced.
text %>%
rm_sym() %>%
rm_stopwords %>%
fenci()
Composing function chains, for example:
rm_sym_stopwords = function(txt){
txt %>%
rm_sym %>%
rm_stopwords()
}
text %>%
rm_sym_stopwords %>%
fenci()
checking tests ... ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
12: test("./CPP_API")
13: load_all(pkg, quiet = TRUE) at /Users/hadley/Documents/devtools/devtools/R/test.r:50
14: check_suggested("roxygen2") at /Users/hadley/Documents/devtools/devtools/R/load.r:88
15: check_dep_version(pkg, version, compare) at /Users/hadley/Documents/devtools/devtools/R/utils.r:63
16: stop("Dependency package ", dep_name, " not available.") at /Users/hadley/Documents/devtools/devtools/R/package-deps.r:56
testthat results ================================================================
OK: 14 SKIPPED: 0 FAILED: 2
1. Error: C_API
2. Error: CPP_API
Error: testthat unit tests failed
Execution halted
Could you please look into this as soon as possible? I really want to release devtools to CRAN in the next couple of days so that I can use the latest version in class next week.
From the error, I'm guessing it's something to do with roxygen2, which devtools now only suggests instead of requiring.
> devtools::install_github("qinwf/jiebaR")
Downloading github repo qinwf/jiebaR@master
Installing jiebaR
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \
'/private/var/folders/sd/2_0mf8p100v7dl3p_2cs5m_80000gn/T/RtmpiBlUkZ/devtools53cf225d0fdf/qinwf-jiebaR-6c1d204' \
--library='/Users/13k/R' --install-tests
* installing *source* package ‘jiebaR’ ...
** libs
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c detect.cpp -o detect.o
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -DLOGGER_LEVEL=LL_WARN -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/13k/R/Rcpp/include" -fPIC -Wall -mtune=core2 -g -O2 -c segtype.cpp -o segtype.o
clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o jiebaR.so RcppExports.o detect.o segtype.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
installing to /Users/13k/R/jiebaR/libs
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
sh: line 1: 21527 Segmentation fault: 11 '/Library/Frameworks/R.framework/Resources/bin/R' --no-save --slave 2>&1 < '/var/folders/sd/2_0mf8p100v7dl3p_2cs5m_80000gn/T//RtmpecMLgy/file54083f9095e3'
*** caught segfault ***
address 0x0, cause 'unknown'
Traceback:
1: .Call(Module__classes_info, xp)
2: Module(module, mustStart = TRUE, where = env)
3: doTryCatch(return(expr), name, parentenv, handler)
4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
5: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch(Module(module, mustStart = TRUE, where = env), error = function(e) e)
7: loadModule("mod_mpseg", TRUE)
8: (function (ns) { loadModule("mod_mpseg", TRUE) loadModule("mod_mixseg", TRUE) loadModule("mod_query", TRUE) loadModule("mod_hmmseg", TRUE) loadModule("mod_tag", TRUE) loadModule("mod_key", TRUE) loadModule("mod_sim", TRUE)})(<environment>)
9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch((function (ns) { loadModule("mod_mpseg", TRUE) loadModule("mod_mixseg", TRUE) loadModule("mod_query", TRUE) loadModule("mod_hmmseg", TRUE) loadModule("mod_tag", TRUE) loadModule("mod_key", TRUE) loadModule("mod_sim", TRUE)})(<environment>), error = function(e) e)
13: eval(expr, envir, enclos)
14: eval(substitute(tryCatch(FUN(WHERE), error = function(e) e), list(FUN = f, WHERE = where)), where)
15: .doLoadActions(where, attach)
16: methods:::cacheMetaData(ns, TRUE, ns)
17: loadNamespace(package, c(which.lib.loc, lib.loc))
18: doTryCatch(return(expr), name, parentenv, handler)
19: tryCatchOne(expr, names, parentenv, handlers[[1L]])
20: tryCatchList(expr, classes, parentenv, handlers)
21: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
22: try({ ns <- loadNamespace(package, c(which.lib.loc, lib.loc)) env <- attachNamespace(ns, pos = pos, deps)})
23: library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)
24: withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
25: suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE))
26: doTryCatch(return(expr), name, parentenv, handler)
27: tryCatchOne(expr, names, parentenv, handlers[[1L]])
28: tryCatchList(expr, classes, parentenv, handlers)
29: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
30: try(suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)))
31: tools:::.test_load_package("jiebaR", "/Users/13k/R")
aborting ...
ERROR: loading failed
* removing ‘/Users/13k/R/jiebaR’
Error: Command failed (1)
By default, it seems runs of English letters and digits are split apart — for example AK47, N95, and X-900 have their digits and letters separated.
In practice, these joined letter-digit(-hyphen) strings are usually model numbers and the like, which are hard to collect exhaustively into a dictionary. I therefore suggest not splitting tokens where letters and digits are directly joined.
I wonder whether that would be appropriate?
The error says the new grep call uses a PCRE pattern with Unicode properties that this build of PCRE does not support.
Details:
> test_worker <- worker('tag')
> test_worker <= '这是一个测试句子。'
Error in grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
invalid regular expression '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'
In addition: Warning message:
In grep("(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$", result, perl = TRUE, :
PCRE pattern compilation error
'this version of PCRE is not compiled with Unicode property support'
at '(*UCP)^[^⺀- 〡-﹏a-zA-Z0-9]*$'
As is known, if we want to use the tm library to build a Corpus, the input should be a string separated by spaces, for example: "I am a student from DUT."
But jiebaR returns a vector like c("I", "am", "a", "student", "from", "DUT") — how can I use it as input to tm?
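Collapsing the token vector back into one space-separated string bridges the two:

```r
# jiebaR-style token vector -> single space-separated document string
tokens <- c("I", "am", "a", "student", "from", "DUT")
doc <- paste(tokens, collapse = " ")
doc  # "I am a student from DUT"
```

The resulting string can then be wrapped for tm, e.g. tm::Corpus(tm::VectorSource(doc)).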
freq(wk[kexue])
char freq
1 <U+953B><U+70BC> 1
2 <U+591A><U+52A0> 1
3 <U+6709><U+6240> 1
4 <U+672C><U+9886> 1
5 <U+8FD9> 1
6 <U+4E4B> 1
7 <U+7ECF><U+6D4E><U+5B66><U+5BB6> 1
8 <U+7ED3><U+8BBA> 1
9 <U+559C><U+6076> 1
10 <U+4E00><U+8D77> 1
The following procedure could be very useful for many who would like to use jiebaR, but it runs into the error message below. Please help to streamline it.
Error in segment(code, jiebar) : Argument 'code' must be an string.
Here is the code:
library(tm)
library(jiebaR)
xdir1 = "~/R/all_ANSI/"
xdir2 = "~/R/all_ANSI_out/"
mixseg = worker()
raw <- list.files(path = xdir1, pattern = "*.txt")
for (f in raw)
{
xpath = paste(xdir1,f,sep="/")
xdata = read.table(xpath,stringsAsFactors=FALSE, sep="\t")
M = lapply(1:length(xdata$text),function(i) mixseg <= xdata$text[[i]])
xpath2 = paste(xdir2,f,sep="/")
write(M,xpath2)
}
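segment() insists on a length-one character argument, so the error likely means some element of xdata$text is not a plain non-empty string (an assumption — the actual data isn't shown). A defensive coercion in base R, with hypothetical sample data:

```r
# Coerce each element to a single character string, then drop NAs and empties
texts <- list("first line", 42, "", NA)          # hypothetical messy input
chr <- vapply(texts, function(x) as.character(x)[1], character(1))
chr <- chr[!is.na(chr) & nzchar(chr)]
chr  # "first line" "42"
```

Each surviving element is then a valid scalar argument for segment().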
Hi,
I am the author of FeatureHashing. FeatureHashing currently supports splitting English sentences and then applying the hashing trick directly to produce a sparse matrix for downstream machine-learning packages (e.g. https://github.com/wush978/FeatureHashing/blob/master/vignettes/SentimentAnalysis.Rmd).
I would be very interested in offering jiebaR's segmentation inside FeatureHashing as well. Since both cores are developed with Rcpp, would you be interested in providing an interface (roughly along the lines of eddelbuettel/digest#10) so that FeatureHashing or other third-party packages can call jiebaR's segmentation directly from C/C++?
Wush
decode_scel(scel = "D:/Sougou/SogouInput/8.0.0.8381/scd/14108.scel",
output = "D:/R/R-3.3.1/library/jiebaRD/dict/sougou.dict",
cpp = TRUE)
Error in eval(substitute(expr), envir, enclos) : not a valid .scel file?
Is anything wrong in the code?
I see the user dictionary contains the word 江大桥, but running the example shows 江大桥 is no longer recognized 😒
> mixseg = worker()
> mixseg["江州市长江大桥参加了长江大桥的通车仪式"]
[1] "江州" "市" "长江大桥" "参加" "了" "长江大桥" "的"
[8] "通车" "仪式"
# test the user dictionary
> mixseg["江州市长弗洛格参加了长江大桥的通车仪式"]
[1] "江州" "市长" "弗洛" "格" "参加" "了"
[7] "长江大桥" "的" "通车" "仪式"
> new_user_word(mixseg,'弗洛格','nz')
[1] TRUE
> mixseg["江州市长弗洛格参加了长江大桥的通车仪式"]
[1] "江州" "市长" "弗洛格" "参加" "了" "长江大桥" "的"
[8] "通车" "仪式"
# version info
> devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.4 (2016-03-10)
system x86_64, mingw32
ui RStudio (0.99.879)
language (EN)
collate Chinese (Simplified)_China.936
tz Asia/Shanghai
date 2016-03-17
Packages -----------------------------------------------------------------------
package * version date source
devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
jiebaR * 0.8 2016-01-30 CRAN (R 3.2.3)
jiebaRD * 0.1 2015-01-04 CRAN (R 3.2.3)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
Testing confirms the user dictionary does take effect, but earlier versions could segment out 江大桥 from the user dictionary. Has a weight been changed somewhere?
cc @qinwf
tl;dr
error: redefinition of default argument for ‘class _Hash’
class unordered_map
^
In file included from /usr/include/c++/4.9/tr1/unordered_map:42:0,
==> devtools::check(args = c('--no-build-vignettes'))
Updating jiebaR documentation
Loading jiebaR
Re-compiling jiebaR
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore CMD \
INSTALL '/home/out/桌面/jiebaR' \
--library='/tmp/RtmpD6k3xl/devtools_install_a59564f2ca2' --no-R --no-data \
--no-help --no-demo --no-inst --no-docs --no-exec --no-multiarch \
--no-test-load --preclean
* installing *source* package ‘jiebaR’ ...
g++ -I/usr/share/R/include -DNDEBUG -std=gnu++14 -I../inst/include -DLOGGER_LEVEL=LL_WARN -I"/home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
** libs
In file included from /usr/include/c++/4.9/unordered_map:48:0,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/platform/compiler.h:153,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/r/headers.h:48,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:29,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp.h:27,
from ../inst/include/lib/Trie.hpp:7,
from ../inst/include/lib/DictTrie.hpp:14,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/bits/unordered_map.h:98:11: error: redefinition of default argument for ‘class _Hash’
class unordered_map
^
In file included from /usr/include/c++/4.9/tr1/unordered_map:42:0,
from ../inst/include/lib/Limonp/StdExtension.hpp:10,
from ../inst/include/lib/Limonp/StringUtil.hpp:24,
from ../inst/include/lib/DictTrie.hpp:11,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/tr1/unordered_map.h:177:5: note: original definition appeared here
class _Hash = hash<_Key>,
^
In file included from /usr/include/c++/4.9/unordered_set:48:0,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/platform/compiler.h:162,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/r/headers.h:48,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:29,
from /home/out/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp.h:27,
from ../inst/include/lib/Trie.hpp:7,
from ../inst/include/lib/DictTrie.hpp:14,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/bits/unordered_set.h:93:11: error: redefinition of default argument for ‘class _Hash’
class unordered_set
^
In file included from /usr/include/c++/4.9/tr1/unordered_set:42:0,
from ../inst/include/lib/Limonp/StdExtension.hpp:11,
from ../inst/include/lib/Limonp/StringUtil.hpp:24,
from ../inst/include/lib/DictTrie.hpp:11,
from ../inst/include/lib/MPSegment.hpp:9,
from ../inst/include/lib/MixSegment.hpp:5,
from ../inst/include/segtype.hpp:9,
from ../inst/include/jiebaR.h:5,
from RcppExports.cpp:4:
/usr/include/c++/4.9/tr1/unordered_set.h:170:5: note: original definition appeared here
class _Hash = hash<_Value>,
^
/usr/lib/R/etc/Makeconf:143: recipe for target 'RcppExports.o' failed
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘jiebaR’
* removing ‘/tmp/RtmpD6k3xl/devtools_install_a59564f2ca2/jiebaR’
Error: Command failed (1)
Execution halted
Exited with status 1.
The Hamming distance I compute with the stringdist package does not match the distance from jiebaR's distance. Am I using it incorrectly?
Reference: https://qinwenfeng.com/jiebaR/sim.html
Test 1:
library(jiebaR)
words = "hello world!"
simhasher = worker("simhash",topn=2)
distance(words, "江州市长江大桥参加了长江大桥的通车仪式",simhasher)
a <- tobin(simhasher[words]$simhash)
b <- tobin(simhasher["江州市长江大桥参加了长江大桥的通车仪式"]$simhash)
stringdist::stringdist(a,b,method = "hamming")
Test 2:
sim = worker("simhash")
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("真的","十分","不错","的","感觉"),sim)
res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)
a<-tobin(res$simhash)
res1 = vector_simhash(c("真的","十分","不错","的","感觉"),sim)
b<-tobin(res1$simhash)
stringdist::stringdist(a,b,method = "hamming")
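One possible source of the mismatch (an assumption, not verified against jiebaR's tobin()): stringdist's "hamming" method requires both strings to have equal length, so if the binary forms lose leading zeros the comparison runs over different lengths. Left-padding both operands to 64 bits first would rule that out:

```r
# Left-pad a bit string to 64 characters so both operands compare bitwise
pad64 <- function(bits) paste0(strrep("0", 64 - nchar(bits)), bits)
nchar(pad64("101"))  # 64
# stringdist::stringdist(pad64(a), pad64(b), method = "hamming")
```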
Hi @qinwf ,
I installed the current GitHub version of jiebaR with RStudio's devtools, but the installation fails with the following error: "'stoull' was not declared in this scope".
The machine's gcc version is shown in the screenshot below:
@qinwf Hello, I ran into the following problem while segmenting:
wk["三维工程()排污费征收标准将提高一倍 四类环保企业望受益 ://..////."]
Error in file_coding(code[1]) : Cannot open file
When I remove the final "." it works:
wk["三维工程()排污费征收标准将提高一倍 四类环保企业望受益 ://..////"]
[1] "排污费" "征收" "标准" "提高" "一倍" "四类" "环保" "企业" "受益"
What is the reason for this?
Computing tf-idf uses the idf.utf8 file. How is the weight of each word in this file derived?
Will this dictionary be updated periodically as the language grows?
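For context, idf weights of this kind are conventionally derived as log(N / df) over a large background corpus, where df is the number of documents containing the word — this is the standard formula; whether idf.utf8 was built exactly this way is not confirmed here. A toy sketch:

```r
# Standard idf: log(N / df), df = number of documents containing the word
docs <- list(c("we", "like", "idf"),
             c("we", "count", "documents"),
             c("idf", "weights"))

idf_of <- function(word, docs) {
  df <- sum(vapply(docs, function(d) word %in% d, logical(1)))
  log(length(docs) / df)
}

round(idf_of("we", docs), 4)  # in 2 of 3 docs: log(3/2) ≈ 0.4055
```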
Hi @qinwf ,
First off, thanks a lot for the wonderful package :)
I'd love to know if there's a way to nature-tag already-tokenized words (say, in a vector).
Currently when I run the tagger, it breaks down my already-tokenized vector of words. My use case is tagging the nature of words in my user dictionary, so they have the right nature instead of the ones I assigned at dictionary creation.
Thanks again and look forward to your insights.
Cheers,
Andrew
keys=worker(type='keywords',topn=30)
keywd=keys <= "I:/rwork/cnseg/report15.txt"
keywd
attr(keywd,'names')
[1] "425.433" "351.371" "343.874" "286.865" "241.128" "218.025" "204.13" "200.926" "196.841"
[10] "196.479" "192.414" "184.07" "180.495" "163.491" "160.962" "160.903" "140.446" "139.77"
[19] "138.752" "131.364" "129.173" "127.921" "126.653" "117.667" "114.576" "109.752" "109.271"
[28] "107.422" "106.618" "101.501"
How do I extract the keywords and tf-idf values into a data.frame format like the following?
keywords | tf-idf |
---|---|
政府 | 192.414 |
创新 | 351.371 |
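Judging from the printed output above (the scores appear in names(keywd)), the result is a character vector of keywords whose names hold the tf-idf scores, so a data.frame can be assembled like this — with English stand-in keywords, since the actual ones are not shown:

```r
# Stand-in for the keywords-worker result: keyword values, score names
keywd <- c("425.433" = "government", "351.371" = "innovation")

df <- data.frame(keywords = unname(keywd),
                 tf_idf   = as.numeric(names(keywd)),
                 stringsAsFactors = FALSE)
df
```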
Whether installing from the CRAN repo or via install_github, and whether into the user library path or the system library path, I get the same error:
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
The detailed error output is as follows:
* installing *source* package ‘jiebaR’ ...
** libs
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c RcppExports.cpp -o RcppExports.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'RcppExports.o' failed
make: [RcppExports.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c detect.cpp -o detect.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'detect.o' failed
make: [detect.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c get_idf.cpp -o get_idf.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'get_idf.o' failed
make: [get_idf.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c get_tuple.cpp -o get_tuple.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'get_tuple.o' failed
make: [get_tuple.o] Error 127 (ignored)
gcc -std=gnu99 -I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -fpic -DU_STATIC_IMPLEMENTATION -O2 -g -c init.c -o init.o
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c segtype-v4.cpp -o segtype-v4.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'segtype-v4.o' failed
make: [segtype-v4.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c util.cpp -o util.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'util.o' failed
make: [util.o] Error 127 (ignored)
I/usr/lib64/microsoft-r/3.3/lib64/R/include -DNDEBUG -I../inst/include -DLOGGING_LEVEL=LL_WARNING -DU_STATIC_IMPLEMENTATION -I"/usr/lib64/microsoft-r/3.3/lib64/R/library/Rcpp/include" -c word_freq.cpp -o word_freq.o
sh: I/usr/lib64/microsoft-r/3.3/lib64/R/include: No such file or directory
/usr/lib64/microsoft-r/3.3/lib64/R/etc/Makeconf:141: recipe for target 'word_freq.o' failed make: [word_freq.o] Error 127 (ignored)
-shared -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -o jiebaR.so RcppExports.o detect.o get_idf.o get_tuple.o init.o segtype-v4.o util.o word_freq.o -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -lR
sh: line 2: -shared: command not found
/usr/lib64/microsoft-r/3.3/lib64/R/share/make/shlib.mk:6: recipe for target 'jiebaR.so' failed
make: *** [jiebaR.so] Error 127
ERROR: compilation failed for package ‘jiebaR’
* removing ‘/home/da/R/x86_64-pc-linux-gnu-library/3.3/jiebaR’
Error: Command failed (1)
session_info():
Session info ------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, linux-gnu
ui RStudio (0.99.902)
language (EN)
collate zh_CN.UTF-8
tz PRC
date 2016-10-26
Packages ----------------------------------------------------------------------------------------------------------------------
package * version date source
colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
devtools * 1.12.0 2016-06-24 CRAN (R 3.3.1)
digest 0.6.9 2016-01-08 CRAN (R 3.2.5)
ggplot2 2.1.0 2016-03-01 CRAN (R 3.3.0)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.5)
munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
Rcpp 0.12.5 2016-05-14 CRAN (R 3.3.0)
RevoUtils 10.0.1 2016-08-24 local
RevoUtilsMath * 10.0.0 2016-06-15 local
scales 0.4.0 2016-02-26 CRAN (R 3.3.0)
withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
sessionInfo():
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=zh_CN.UTF-8 LC_COLLATE=zh_CN.UTF-8
[5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=zh_CN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] devtools_1.12.0 RevoUtilsMath_10.0.0
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 scales_0.4.0 plyr_1.8.4 RevoUtils_10.0.1 tools_3.3.1 withr_1.0.2 gtable_0.2.0
[8] rstudioapi_0.6 memoise_1.0.0 Rcpp_0.12.5 ggplot2_2.1.0 grid_3.3.1 digest_0.6.9 munsell_0.4.3
The system gcc version is 5.4.0 20160609.
On Windows 7 I used jiebaR to segment a text file (UTF-8 encoded). Because of an incomplete final line, a warning is returned, and the file name inside that message seems to be encoded in GBK.
For example, running
segmentParSet1 <= "寂寞像一只蚊子.txt"
returns
In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on '瀵傚癁鍍忎竴鍙殜瀛?txt'
Although the file manager shows the newly generated file as
寂寞像一只蚊子.segment.2015-12-09_18_34_49,
why does the R console print that garbled string — the original file name output as GBK? It does not affect the result, but could this be fixed?
Thanks
I plan to merge https://github.com/qinwf/THULACR in, so there will be two segmentation engines to choose from. THULAC's tagging is much better.
This needs a fairly unified, simple interface; it still needs some thought.
It is almost like rewriting all the interfaces. The existing interfaces will be kept.
According to the introduction to Simhash and Hamming distance, the procedure is: segment first, then extract keywords, then compute the simhash value and the Hamming distance.
Questions: after segmentation, is keyword extraction based on the tf_idf value? If so, is tf_idf = tf ratio × idf value (taken directly from jiebaRD/dict/idf.utf8)? And if a keyword does not exist in idf.utf8, does that mean it has no tf_idf value and will never appear among the keywords?
library(jiebaR)
cutter=worker(type='keywords',user = 'D:/R/soft/library/jiebaRD/dict/usrdic_20161102.utf8', stop_word = 'D:/R/soft/library/jiebaRD/dict/stop_words.utf8', bylines = TRUE)
563.482 518.433 208.951 199.566 190.731
"360" "手机" "数据线" "差评" "客服"
The result is the keywords for the whole document; how do I configure it to extract keywords per line?
Also, if type = 'mix' is set, how do I filter out stop words? Below is the code I tried for filtering, but it seems to have no effect — please help me fix it. Thanks.
removewords <- function(target_words, stop_words) {
  target_words[!target_words %in% stop_words]
}
stopwd=readLines('D:/R/soft/library/jiebaRD/dict/stop_words.utf8',encoding = 'UTF-8')
class(stopwd)
[1] "character"
content3=sapply(content2,FUN = removewords,stopwd)
I am using jiebaR on Linux and Windows, and I get different results:
cc2 = worker()
cc2['测试停词abd we']
on Linux :
[1] "测试" "停词" "abd" "we"
on Windows :
[1] "测试" "停" "词"
What is the problem?
cutter = worker(type='mix')
result_segment = cutter["我是你的测试文本,用于测试过滤分词效果。"]
result_segment
[1] "我" "是" "你" "的" "测试" "文本" "用于" "测试" "过滤" "分词" "效果
Questions:
The default stop_word dictionary already contains words of little practical meaning such as '我', '是', '你', '的'.
Why do these words still appear in the segmentation result, so that filter_segment() is needed to remove them?
What, then, is the point of the stop-word list?