
toutiaocrawler's Introduction

ToutiaoCrawler

API example:

Updated 2018-06-05
https://toutiao.com/search_content/?offset=0&format=json&keyword=手机&autoload=true&count=20&cur_tab=1&from=search_tab

Parameters:

keyword: the search keyword
count: number of articles on this page
cur_tab: current page number
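The parameters above can be assembled programmatically. The following is a minimal sketch, assuming the endpoint and query parameters from the example URL are still accepted by the server (they may not be). The function name `build_search_url` is hypothetical and not part of the project.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL above; it may no longer respond.
SEARCH_ENDPOINT = "https://toutiao.com/search_content/"


def build_search_url(keyword, offset=0, count=20, cur_tab=1):
    """Assemble a search URL with the same query parameters as the example."""
    params = {
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": count,
        "cur_tab": cur_tab,
        "from": "search_tab",
    }
    # urlencode percent-encodes non-ASCII keywords such as 手机 as UTF-8.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("手机"))
```

Building the query from a dict keeps the keyword correctly escaped, which matters for Chinese search terms.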

Debugging:

Press F12, select Network/All, then open the Preview tab and inspect the data node

Demo:

ToutiaoCrawler\ToutiaoCrawler\demo.py: here you can extract article titles, tags, and content links as needed

Demo output and debugging example:
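As a rough illustration of pulling titles, tags, and links out of the response, here is a sketch that is not taken from demo.py. It assumes the decoded JSON has a top-level `data` list (the node mentioned in the debugging step); the field names `title`, `tag`, and `article_url` are assumptions and should be checked against the real response in the browser's Preview tab.

```python
import json


def extract_articles(payload):
    """Return (title, tag, link) tuples from a decoded search response.

    The keys "title", "tag" and "article_url" are assumed field names.
    """
    articles = []
    for item in payload.get("data") or []:
        title = item.get("title")
        if not title:
            # Skip entries that are not articles (no title present).
            continue
        articles.append((title, item.get("tag"), item.get("article_url")))
    return articles


if __name__ == "__main__":
    # Hand-written sample, not real API output.
    sample = json.loads(
        '{"data": [{"title": "Example", "tag": "news_tech", '
        '"article_url": "https://example.com/a"}, {"cell_type": 1}]}'
    )
    for row in extract_articles(sample):
        print(row)
```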


--------------------The project code follows; some APIs no longer work--------------------

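  • Requires Python 3.6
  • First install the required packages; opening the project in PyCharm installs them automatically
  1. Create the database and tables with ToutiaoCrawler/toutiao.sql; configure the MySQL connection in ToutiaoCrawler/ToutiaoCrawler/Utils/Util.py
  2. Run Crawler/get_toutiao_news_byapi.py to fetch the news list [this API was developed in 2016; parts of it no longer work]
  3. Run Crawler/get_toutiao_content_byapi.py to fetch the news content
  • (At this point the database already contains data)
  1. Run Analysis/levenshtein.py to compute edit distances
  2. Run svd/svd.py to perform singular value decomposition
  3. Run svd/test_kmeans.py to run cluster analysis and plot the results
  • If you need txt files, run Utils/list_to_txt.py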
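
For readers unfamiliar with the edit-distance step, the following is a minimal, self-contained sketch of the standard Levenshtein distance. It is not the contents of Analysis/levenshtein.py, which may be implemented differently; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed with two DP rows."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

A small distance between two news titles suggests near-duplicate articles, which is one plausible use of this step before clustering.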

toutiaocrawler's People

Contributors

haibincoder


toutiaocrawler's Issues

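The article list cannot be fetched; the request returns HTTP 301

get_toutiao_news_byapi.py currently uses this URL:
http://www.toutiao.com/api/pc/feed/?category=__all__&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A1B5D9F152FBC03&cp=59123B3CE0B3FE1
Search results suggest that a 301 means the resource has moved.
Has the API URL changed? Where can I find the new API endpoint? Thanks.

Also, accessing the URL directly works and returns a JSON result.
It begins with {"has_more": false, "message": "success".
Could this be caused by the crawler's configuration? Thanks.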
