newspaper-crawler-scripts's Introduction

Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below.

Setup

pip3 install -r requirements.txt

Todo

[ ] Extract common code into a decorator

Contribute

Scripts for more news websites are welcome. Please save the scraped text in UTF-8 encoding. To choose a site, refer to the newspapers list file and pick one to scrape.
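As a minimal sketch of the UTF-8 requirement above, a new crawler could write its output through a small helper like the one below. The helper name `save_text` and its arguments are hypothetical and are not taken from the existing scripts.

```python
from pathlib import Path


def save_text(path, text):
    """Write scraped text to `path` as UTF-8 (hypothetical helper)."""
    path = Path(path)
    # Create the year/month directories if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Pass the encoding explicitly so the result does not depend on the
    # platform's default encoding.
    path.write_text(text, encoding="utf-8")
    return path
```

Setting `encoding="utf-8"` explicitly matters most on systems whose default encoding is not UTF-8, where Tamil, Malayalam, Bengali, Konkani, or Marathi text could otherwise fail to write or be stored incorrectly.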

Latest Script

crawler-oneindia.py under malayalam has the latest code; you can use it as a template for future crawlers.

Directory structure

<newspaper_name>
  title.list --> acts as an index for other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan  
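The layout above can be sketched as a small path-building helper. This is only an illustration under stated assumptions: the function name `output_paths`, the `root` directory, and the per-article `filename` are hypothetical, since the layout above does not say how files inside the month directories are named.

```python
from datetime import date
from pathlib import Path

# Fixed English abbreviations, matching the month directory names shown
# above (Jan, May, Jun, ...), independent of the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def output_paths(root, newspaper, published, filename):
    """Return the index, article, and abstract paths for one article.

    `root` and `filename` are assumptions made for this sketch.
    """
    base = Path(root) / newspaper
    year = str(published.year)
    month = MONTHS[published.month - 1]
    return {
        "index": base / "title.list",
        "articles": base / "articles" / year / month / filename,
        "abstracts": base / "abstracts" / year / month / filename,
    }


paths = output_paths("data", "oneindia", date(2018, 12, 5), "example.txt")
print(paths["articles"])
```

Using a fixed tuple of month names rather than `strftime("%b")` keeps the directory names stable regardless of the locale the crawler runs under.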

Available scripts

Tamil

| Site | URL | Script |
| --- | --- | --- |
| Nakkheeran | http://nakkheeran.in/ | tamil/crawler-nakkheeran.py |
| Dailythanthi | http://dailythanthi.com/ | tamil/crawler-dailythanthi.py |
| Tamil The Hindu | http://tamil.thehindu.com/ | tamil/crawler-tamil-hindu.py |
| Puthiyathalaimurai | http://puthiyathalaimurai.com/ | tamil/crawler-puthiyathalaimurai.py |
| Dinamani | http://dinamani.com/ | tamil/crawler-dinamani.py |

Malayalam

| Site | URL | Script |
| --- | --- | --- |
| Manorama | http://www.manoramaonline.com/ | malayalam/crawler-manorama.py |
| Asianet News | https://www.asianetnews.com/ | malayalam/crawler-asianet.py |
| One India | https://malayalam.oneindia.com/ | malayalam/crawler-oneindia.py |

Bengali

| Site | URL | Script |
| --- | --- | --- |
| Anandabazar | https://www.anandabazar.com | Bengali/crawler-anandabazar.py |
| Aajkaal | https://www.aajkaal.in | Bengali/crawler-aajkal.py |

Konkani

| Site | URL | Script |
| --- | --- | --- |
| Konkani Kaniyo | http://konkani-kaniyo-in-nagri.blogspot.com | konkani/crawler-konkani-kaniyo.py |

Marathi

| Site | URL | Script |
| --- | --- | --- |
| Lokmat | http://www.lokmat.com/ | marathi/crawler-lokmat.py |
| Maharashtratimes | https://maharashtratimes.indiatimes.com/ | marathi/crawler-maharashtratimes.py |
| Loksatta | https://www.loksatta.com | marathi/crawler-loksatta.py |
| ABPmajha | https://abpmajha.abplive.in | marathi/crawler-abpmajha.py |

newspaper-crawler-scripts's People

Contributors

vanangamudi, adamshamsudeen, athj, vipulchodankar, hardipinders, jaseemck, meain, rudrakshk, nike47, simmranvermaa, husain-zaidi, anoopmsivadas, subins2000, sayoni26, srihari-palivela, utkarsh1800, aswindinesh, pythagaurang, pranshul972, danimg95, dependabot[bot]
