spider_cnki's Introduction

Spider_CNKI

A CNKI spider written in Python

File

  • spider_cnki.py

    python spider_cnki.py

    Crawls article data from CNKI.

    When this step finishes, the article records are stored in the database, in the tables articles and resort_articles.

  • get_slink.py

    python get_slink.py

    Builds the citation network.

    When this step finishes, the citation relationships are stored in the table slink, and the file slink.net is generated.

  • data_cnki_py_db

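    SQL file

    Before running, make sure the required table structure exists in your MySQL database (it is defined in data_cnki_py_db/cnki_py1_db.sql).

    The simplest way is to import data_cnki_py_db/cnki_py1_db.sql directly.

    Then edit the database settings (database name, user name, password) in main in spider_cnki.py: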

    def main():
        conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='', db='cnki_py1_db', charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
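
  • in.txt: crawler entry points

    The article links in in.txt are the crawler's entry points. Each link spawns a citation tree; once enough trees have been collected, they overlap and form a citation network.

    The flow is:

    entry article link -> reference list -> visit each referenced article link -> ...

    As shown in the figure: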

    qq20161013-0

    The number of rounds is set by the global variable times in spider_cnki.py. Here it is set to 5, i.e. 5 rounds.

    times = 5

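    Add or remove links in in.txt as needed. My run took a few dozen minutes (or possibly one to two hours; I no longer remember exactly).

    If you need less data, reduce the number of entry links or lower the global round count; if you need more, do the opposite.

  • slink.net: citation network

    The amount of data collected, as displayed in Gephi: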

    qq20161013-1
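The flow described above (entry links, then reference lists, then the referenced articles, repeated for a fixed number of rounds) can be sketched roughly as follows. This is a minimal illustration and not the code in spider_cnki.py or get_slink.py: the names crawl_citations, to_pajek, get_references, and the toy data are all hypothetical, and writing slink.net in Pajek NET format is an assumption based only on the .net extension and the fact that Gephi can open such files.

```python
def crawl_citations(entry_links, get_references, times=5):
    """Expand each entry link into a citation tree, `times` rounds deep.

    `get_references` is a hypothetical callable mapping an article link to
    the list of links that article cites. Returns a set of
    (citing, cited) edges.
    """
    edges = set()
    seen = set(entry_links)
    frontier = list(entry_links)
    for _ in range(times):
        next_frontier = []
        for article in frontier:
            for ref in get_references(article):
                edges.add((article, ref))
                if ref not in seen:  # visit each article only once
                    seen.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
        if not frontier:  # nothing new to visit, stop early
            break
    return edges


def to_pajek(edges):
    """Render the edges as Pajek NET text (assumed format of slink.net)."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {name: i for i, name in enumerate(nodes, start=1)}
    lines = ["*Vertices %d" % len(nodes)]
    lines += ['%d "%s"' % (index[name], name) for name in nodes]
    lines.append("*Arcs")
    lines += ["%d %d" % (index[a], index[b]) for a, b in sorted(edges)]
    return "\n".join(lines) + "\n"


# Toy citation data standing in for CNKI pages (made up for illustration).
TOY_REFS = {"A": ["B", "C"], "B": ["C", "D"], "C": [], "D": ["E"]}

edges = crawl_citations(["A"], lambda link: TOY_REFS.get(link, []), times=2)
net_text = to_pajek(edges)
```

With times=2 the toy crawl stops before following D's references, so E never enters the network. This is the same trade-off as above: fewer rounds or fewer entry links give a smaller network.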

spider_cnki's People

Contributors

zyt01

spider_cnki's Issues

Bug report

Running python spider_cnki.py

produces the following error:

select error
Traceback (most recent call last):
  File "spider_cnki.py", line 321, in <module>
    main()
  File "spider_cnki.py", line 316, in main
    reorder_data(conn)
  File "spider_cnki.py", line 95, in reorder_data
    for row in select_cur:
  File "/usr/local/lib/python2.7/dist-packages/pymysql/cursors.py", line 277, in fetchone
    self._check_executed()
  File "/usr/local/lib/python2.7/dist-packages/pymysql/cursors.py", line 76, in _check_executed
    raise err.ProgrammingError("execute() first")
