Git Product home page Git Product logo

jiebar's Introduction

jiebaR 中文分词

Build Status Build status codecov DOI

"结巴"中文分词的R语言版本,支持多种分词模式,同时有词性标注,关键词提取,文本Simhash相似度比较等功能。项目使用了RcppCppJieba进行开发。

细胞词库转换可以使用 cidian 包 :https://github.com/qinwf/cidian/

特性

  • 支持 Windows,Linux,Mac 操作系统。
  • 通过 Rcpp 实现同时加载多个分词系统,可以分别使用不同的分词模式和词库。
  • 支持多种分词模式、中文姓名识别、关键词提取、词性标注以及文本Simhash相似度比较等功能。
  • 支持加载自定义用户词库,设置词频、词性。
  • 同时支持简体中文、繁体中文分词。
  • 支持自动判断编码模式。
  • 比原"结巴"中文分词速度快,是其他R分词包的5-20倍。
  • 安装简单,无需复杂设置。
  • 可以通过Rpy2jvmr等被其他语言调用。
  • 基于MIT协议。

安装

通过CRAN安装:

install.packages("jiebaR")
library("jiebaR")

cc = worker()
cc["这是一个测试"] # or segment("这是一个测试", cc)

# [1] "这是" "一个" "测试"

同时还可以通过Github安装开发版,建议使用 gcc >= 4.9 编译,Windows需要安装 Rtools

library(devtools)
install_github("qinwf/jiebaRD")
install_github("qinwf/jiebaR")
library("jiebaR")

使用指南 与 演示

使用指南http://qinwenfeng.com/jiebaR/

正在撰写的文档 : https://jiebaR.qinwf.com/

Shiny 演示https://qinwf.shinyapps.io/jiebaR-shiny/

细胞词库转换https://github.com/qinwf/cidian/

问题

使用中遇到的任何问题,都可以:

jiebaR

This is a package for Chinese text segmentation, keyword extraction and speech tagging. jiebaR supports four types of segmentation modes: Maximum Probability, Hidden Markov Model, Query Segment and Mix Segment.

Features

  • Support Windows, Linux,and Mac.
  • Using Rcpp to load different segmentation worker at the same time.
  • Support Chinese text segmentation, keyword extraction, speech tagging and simhash computation.
  • Custom dictionary path.
  • Support simplified Chinese and traditional Chinese.
  • New words identification.
  • Auto encoding detection.
  • Fast text segmentation.
  • Easy installation.
  • MIT license.

Installation

Install the latest development version from GitHub:

devtools::install_github("qinwf/jiebaR")

Install from CRAN:

install.packages("jiebaR")

jiebar's People

Contributors

qinwf avatar qinwf-nuan avatar yanyiwu avatar

Watchers

James Chang avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.