bigcows's Introduction

Scholar Scraper

I wrote this simple utility to scrape citation statistics of researcher profiles on Google Scholar, using it as an opportunity to learn node.js. I began with a list of information retrieval researchers, but have since expanded to include a separate list of researchers in human-computer interaction. The results are here.

Editorial note: This list contains only researchers who have a Google Scholar profile; names were identified by snowball sampling and various other ad hoc techniques. If you wish to see a name added, please email me or send a pull request. I will endeavor to periodically run the crawl to gather updated statistics. Of course, scholarly achievement is only partially measured by citation counts, which are known to be flawed in many ways. Evaluations of scholars should include comprehensive examination of their research contributions.

Rerunning the Scraper

Assuming you have node.js installed, rerun the scraper as follows:

$ npm install request cheerio async
$ node scrape.js ./people-ir.json > stats-ir.js
$ node scrape.js ./people-db.json > stats-db.js
$ node scrape.js ./people-nlp.json > stats-nlp.js
$ node scrape.js ./people-hci.json > stats-hci.js
$ node scrape.js ./people-stratosphere.json > stats-stratosphere.js

To scrape the images:

$ node download-images.js ./stats-ir.js
$ node download-images.js ./stats-db.js
$ node download-images.js ./stats-nlp.js
$ node download-images.js ./stats-hci.js
$ node download-images.js ./stats-stratosphere.js

Then open up index.html and it should display the new statistics.
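
The commands above repeat the same pattern for each list, so they can be generated rather than typed out. The following is only a sketch: the `fields` array and the `buildCommands` helper are hypothetical names, not part of the repository, and the script prints the command lines rather than running them.

```javascript
// Hypothetical helper script; not part of the repository.
// The list names are taken from the commands in this README.
const fields = ['ir', 'db', 'nlp', 'hci', 'stratosphere'];

// Build the scrape and image-download command lines for one list.
function buildCommands(field) {
  return [
    `node scrape.js ./people-${field}.json > stats-${field}.js`,
    `node download-images.js ./stats-${field}.js`,
  ];
}

// Print every command; nothing is executed here.
for (const field of fields) {
  for (const command of buildCommands(field)) {
    console.log(command);
  }
}
```

Each image download depends only on the stats file for its own list, so running the two steps per list gives the same result as running all scrapes first.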

bigcows's People

Contributors

jimmy0017, lintool, mbernst, paulmcnamee

bigcows's Issues

Paper citations?

Does this also grab the # of citations for each of the author's papers?

"year" column is not accurate

Noted by @dragomirradev

The "year" column is based on the earliest year in the citation count histogram, which in fact is not the earliest year in terms of publications.

For example:
[Screenshot: Screen Shot 2019-08-24 at 10 26 34 AM]

But see:

[Screenshot: Screen Shot 2019-08-24 at 10 26 57 AM]

One reasonable hypothesis is that the histogram is capped at 20 years... but here's a counterexample:

[Screenshot: Screen Shot 2019-08-24 at 10 28 54 AM]

No idea what's going on.

From a crawling perspective, the histogram is easy to get. Getting the actual earliest publication year requires sorting the publications by date and then "scrolling" through the list.
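
To make the difference concrete, here is a minimal sketch of the two ways of deriving a "year". The data shapes (`{ year, citations }` for histogram bars, `{ title, year }` for publications), the function names, and the sample values are all assumptions for illustration; they are not taken from scrape.js or from any real profile.

```javascript
// Hypothetical data shapes; the scraper's real field names may differ.

// Earliest year that appears in the citation histogram
// (the basis of the "year" column described above).
function earliestHistogramYear(histogram) {
  const years = histogram.map((bar) => bar.year);
  return years.length > 0 ? Math.min(...years) : null;
}

// Earliest publication year, ignoring entries with no usable year.
function earliestPublicationYear(publications) {
  const years = publications
    .map((pub) => pub.year)
    .filter((year) => Number.isInteger(year));
  return years.length > 0 ? Math.min(...years) : null;
}

// Made-up sample data, for illustration only.
const histogram = [
  { year: 2001, citations: 12 },
  { year: 2002, citations: 30 },
];
const publications = [
  { title: 'Paper A', year: 1995 },
  { title: 'Paper B', year: 2003 },
  { title: 'Undated report', year: null },
];

console.log(earliestHistogramYear(histogram));      // 2001
console.log(earliestPublicationYear(publications)); // 1995
```

In this made-up case the histogram starts six years after the first publication, which is the kind of gap the "year" column cannot show.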
