Git Product home page Git Product logo

cn-text-classifier's Introduction

cn-text-classifier

中文文本聚类实验

Chinese Text Cluster Experiments

实验数据

实验数据来源于多个新闻网站爬取的新闻, 包含教育类510篇, 游戏类231篇, 医疗类388篇, 体育类412篇. 其中, 教育类及医疗类同时是投融资新闻中的细分类别, 用于测试细粒度的聚类能否区分.

有关于新闻内容来源的获取, 请参阅这个仓库: Finance and Investment Info Spider Collections - 投融资信息爬虫集合

实验步骤

文本聚类的一般步骤是:

  1. 文本预处理 包含分段分句, 分词及去停用词等
  2. 语料向量化或词袋化 本实验使用了 sklearn 的 TF-IDF 相关包
  3. 文本降维 常见的降维有 PCA 主成分降维, TSVD 截断奇异值分解降维, t-SNE降维等, 本实验使用 PCA 降维, t-SNE 更适用于图像和视频等降维, 速度较慢.
  4. 应用聚类算法并调参
  5. (可选)结果可视化及聚类效果评判 聚类结果可视化已在 tools 文件夹中的 visualizer.py 中实现, 鉴于 DBSCAN 有识别噪声的能力, 在该实验中单独加入噪声可视化. 聚类效果评判分为外部信息指标和内部信息指标, 外部信息指标依靠标注好的数据 src/labeled_data.csv, 相关知识请参阅: 无监督学习 - 聚类度量指标

实验结果预览

K-Means 聚类实验 K-Means

------K-Means Experiment-------
adjusted_rand_score: 0.993424
FMI: 0.993424
Silhouette: 0.392882
 CHI: 610.273556
------End------

Birch 聚类实验 Birch

-------Birch Experiment-------
adjusted_rand_score: 0.978233
FMI: 0.978233
Silhouette: 0.392189
 CHI: 605.710339
------End------

DBSCAN 聚类实验 DBSCAN

------DBSCAN Experiment-------
adjusted_rand_score: 0.905969
FMI: 0.905969
Silhouette: 0.379187
CHI: 366.856356
Estimated number of noise points: 102 
------End------

公众号: 程序员的碎碎念

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.