Git Product home page Git Product logo

textcluster's Introduction

短文本聚类

项目介绍

短文本聚类是常用的文本预处理步骤,可以用于洞察文本常见模式、分析设计语义解析规范、加速相似句子查询等。本项目实现了内存友好的短文本聚类方法,并提供了相似句子查询接口。

依赖库

pip install tqdm jieba

使用方法

聚类

python cluster.py --infile ./data/infile \
--output ./data/output

具体参数设置可以参考cluster.py文件内_get_parser()函数参数说明,包含设置分词词典、停用词、匹配采样数、匹配度阈值等。

查询

参考search.py代码里Searcher类的使用方法,如果用于查询标注数据的场景,使用分隔符:::将句子与标注信息拼接起来。如我是海贼王:::(λx.海贼王),处理时会只对句子进行匹配。

算法原理

算法原理

文件路径

TextCluster
|      README.md
|      LICENSE
|      cluster.py                    聚类程序
|      search.py                     查询程序
|      
|------utils                         公共功能模块
|    |    __init__.py
|    |    segmentor.py               分词器封装
|    |    similar.py                 相似度计算函数
|    |    utils.py                   文件处理模块
|
|------data
|    |    infile                     默认输入文本路径,用于测试中文模式
|    |    infile_en                  默认输入文本路径,用于测试英文模式
|    |    seg_dict                   默认分词词典
|    |    stop_words                 默认停用词路径

注:本方法仅面向短文本,长文本聚类可根据需求选用SimHash, LDA等其他算法。

Text Cluster

Introduction

Text cluster is a normal preprocess procedure to analysis text feature. This project implements a memory friendly method only for short text cluster. For long text, it is preferable to choose SimHash or LDA or others according to demand.

Requirements

pip install tqdm spacy

Usage

Clustering

python cluster.py --infile ./data/infile_en \
--output ./data/output \
--lang en

For more configure arguments description, see _get_parser() in cluster.py, including stop words setting, sample number.

Search

Basic Idea

Algorithm_en

File Structure

TextCluster
|      README.md
|      LICENSE
|      cluster.py                    clustering function
|      search.py                     search function
|      
|------utils                         utilities
|    |    __init__.py
|    |    segmentor.py               tokenizer wrapper
|    |    similar.py                 similarity calculator
|    |    utils.py                   file process module
|
|------data
|    |    infile                     default input file path, to test Chinese mode
|    |    infile_en                  default input file path, to test English mode
|    |    seg_dict                   default tokenizer dict path
|    |    stop_words                 default stop words path

Other Language

For other specific language, modify tokenizer wrapper in ./utils/segmentor.py.

textcluster's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

textcluster's Issues

Can't process sentences made up of words in a different order

从AINLP上看到推送了你的项目,项目有个bug,无法将不同顺序单词组成的句子归到同一类中,建议加入原有的聚类簇之后,不要break,只是不要再进行聚类类别判定或者新建,而对该样本的每个单词的p_bucket列表都加入该类,同时需要调整新建类别的逻辑,如果第一个单词匹配不通过,建立一个tmp词汇,将该单词放入,然后对下一个单词进行判定,直到所有的都不通过,才建立新的类别,并将tmp词汇里的p_bucket都加上这个新的类别。
以上

threshold

parser.add_argument('--threshold', type=float, default=0.3, help='Threshold for matching.')

这个threshold的作用是什么呀

短文本 or 长文本

题主,请问短文本和长文本在数量上是如何区分的?
比如150词左右的句子是长文本还是短文本呀?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.