
hongloumeng-nlp's Introduction

Text analysis of Dream of the Red Chamber (红楼梦), for personal study only

Text source

https://github.com/EaconTang/gitbook-hongloumeng

Merging and cleaning the text

# First 80 chapters (files 00x through 07x, plus 080); the sed call strips HTML tags
cat text/0[0-7]*.md text/080*.md | sed 's/<[^>]*>//g' > hongloumeng_80.md

# All 120 chapters
cat text/*.md | sed 's/<[^>]*>//g' > hongloumeng_120.md

Keyword extraction

The top 20 keywords extracted from the full 120-chapter text:

   宝玉:    200.181
   笑道:    131.055
   如今:     91.042
   贾母:     83.691
   凤姐:     81.403
   众人:     79.472
   黛玉:     76.535
   起来:     74.356
   宝钗:     69.825
   说道:     69.268
   知道:     67.191
  王夫人:     66.828
   只见:     64.565
   不知:     61.469
   没有:     57.878
  老太太:     57.734
   贾政:     57.468
   听见:     51.813
   丫头:     51.105
   姑娘:     50.802

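These results come from the TextRank algorithm. As the protagonist, 宝玉 scores far above everyone else. The remaining characters rank in this order: 贾母, 凤姐, 黛玉, 宝钗, 王夫人, 贾政. Somewhat unexpectedly, 贾政 makes the top 20 while 探春 and 史湘云 do not. For comparison, here are the keywords for the first 80 chapters: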

 0     宝玉:    176.243
 1     笑道:    139.965
 2     如今:     77.360
 3     贾母:     72.228
 4     众人:     70.671
 5     凤姐:     67.626
 6     黛玉:     64.076
 7     起来:     58.345
 8     宝钗:     58.306
 9     只见:     53.129
10     说道:     53.087
11     不知:     52.603
12     知道:     52.576
13    王夫人:     50.951
14     一面:     50.016
15     丫头:     43.507
16    老太太:     41.776
17     姑娘:     41.627
18    凤姐儿:     40.716
19     东西:     40.491
20     不能:     39.475
21     一时:     38.776
22     太太:     38.699
23     贾珍:     37.971
24     出去:     36.900
25     只得:     36.242
26     探春:     34.916
27     湘云:     34.755
28     没有:     34.395
29     奶奶:     34.341
30     进来:     34.333
31     听见:     34.137
32     贾琏:     33.874
33     姐姐:     32.682
34     今日:     32.466
35     贾政:     32.174
36     回来:     31.704
37     晴雯:     30.928
38     尤氏:     30.453
39     原来:     29.847

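In the first 80 chapters, 贾政 does indeed drop out of the top 20: he ranks 36th. This suggests that he becomes more prominent in the last 40 chapters, although it also reflects that he is often away on official business during the first 80. Character importance may also be more long-tailed here: the score for 宝玉 falls (from 200.181 to 176.243), and most other scores fall with it (笑道 is the exception). 探春 and 史湘云 (listed as 湘云) both rank within the top 30.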
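The rank comparison above can be checked mechanically. The sketch below is illustrative only: the two word lists are copied from the tables in this README in rank order, while the helper names `rank_of` and `compare_ranks` are hypothetical and not part of the project.

```python
# Keyword lists copied from the two tables in this README, in rank order.
FULL_TOP20 = [
    "宝玉", "笑道", "如今", "贾母", "凤姐", "众人", "黛玉", "起来", "宝钗", "说道",
    "知道", "王夫人", "只见", "不知", "没有", "老太太", "贾政", "听见", "丫头", "姑娘",
]
FIRST80_TOP40 = [
    "宝玉", "笑道", "如今", "贾母", "众人", "凤姐", "黛玉", "起来", "宝钗", "只见",
    "说道", "不知", "知道", "王夫人", "一面", "丫头", "老太太", "姑娘", "凤姐儿", "东西",
    "不能", "一时", "太太", "贾珍", "出去", "只得", "探春", "湘云", "没有", "奶奶",
    "进来", "听见", "贾琏", "姐姐", "今日", "贾政", "回来", "晴雯", "尤氏", "原来",
]


def rank_of(word, ranking):
    """Return the 1-based rank of `word` in `ranking`, or None if it is not listed."""
    try:
        return ranking.index(word) + 1
    except ValueError:
        return None


def compare_ranks(words, first, second):
    """Map each word to its (rank in `first`, rank in `second`) pair."""
    return {w: (rank_of(w, first), rank_of(w, second)) for w in words}


if __name__ == "__main__":
    names = ["贾政", "探春", "湘云"]
    for name, (full, first80) in compare_ranks(names, FULL_TOP20, FIRST80_TOP40).items():
        print(name, full, first80)
```

A `None` in the full-text column only means the word falls outside the 20 entries listed for the full text, not that it is absent from the book.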

However, this alone does not show that the last 40 chapters were written by a different author.

One limitation: the interface does not yet support filtering keywords by part of speech.
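For readers unfamiliar with TextRank, here is a minimal pure-Python sketch of the idea: build a word co-occurrence graph, then run a weighted PageRank over it. It is not the interface used in this project. The function name `textrank_keywords`, the default `window`, `damping`, and `iterations` values, and the `keep` predicate are all assumptions made for illustration. The input is assumed to be already segmented into words, and `keep` stands in for the part-of-speech filter mentioned above.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=5, damping=0.85, iterations=50,
                      top_k=20, keep=None):
    """Rank words with a simplified TextRank (illustrative sketch only).

    tokens: list of already segmented words.
    keep:   optional predicate; words for which it returns False are ignored
            (a hypothetical stand-in for a part-of-speech filter).
    Returns a list of (word, score) pairs, highest score first.
    """
    if keep is None:
        keep = lambda word: True

    # Undirected co-occurrence graph: two words are linked when they appear
    # together inside a sliding window of `window` consecutive tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        if not keep(word):
            continue
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other == word or not keep(other):
                continue
            weights[word][other] += 1.0
            weights[other][word] += 1.0

    # Weighted PageRank: each word passes its score to its neighbours in
    # proportion to the edge weights.
    scores = {word: 1.0 for word in weights}
    out_sum = {word: sum(nbrs.values()) for word, nbrs in weights.items()}
    for _ in range(iterations):
        new_scores = {}
        for word, nbrs in weights.items():
            rank = sum(w / out_sum[nbr] * scores[nbr] for nbr, w in nbrs.items())
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    sample = ["宝玉", "笑道", "黛玉", "宝玉", "笑道", "宝钗", "宝玉", "说道", "贾母"]
    for word, score in textrank_keywords(sample, window=2, top_k=3):
        print(word, round(score, 3))
```

Passing something like `keep=lambda w: w not in {"笑道", "说道"}` shows how a filter would push speech verbs out of the ranking. A real part-of-speech filter would need a tagger, which this sketch does not include.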

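Character importance

Person names are identified through named entity recognition, and each name is then checked for presence in every chapter. A character's importance is defined as the total number of chapters in which that character appears.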
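The counting step of this definition can be sketched as follows. This is a simplified illustration, not the project's code: it assumes the names are supplied as a fixed list rather than produced by named entity recognition, and it uses plain substring matching, so 凤姐 would also match 凤姐儿. The function name `chapter_counts` is hypothetical.

```python
def chapter_counts(chapters, names):
    """Count, for each name, the number of chapters in which it appears.

    chapters: list of chapter texts (one string per chapter).
    names:    list of character names (assumed given; no NER is done here).
    Returns (name, count) pairs sorted by count, highest first.
    """
    counts = {name: 0 for name in names}
    for text in chapters:
        for name in names:
            # A chapter counts once, however often the name occurs in it.
            if name in text:
                counts[name] += 1
    # sorted() is stable, so ties keep the order of `names`.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    demo_chapters = ["宝玉笑道", "黛玉与宝玉", "贾母说道"]
    print(chapter_counts(demo_chapters, ["宝玉", "黛玉", "贾母", "贾政"]))
```

Because a chapter contributes at most one to each count, this measure rewards characters who appear throughout the book rather than those mentioned many times in only a few chapters.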

Contributors

cyy0523xc
