Crunching Hacker News

An analysis of Hacker News based on the entire HN history (up until Q1 2015), made with command line tools only. See the result here

Description

This project was started with the intention of getting better with both command line tools and d3.js. Hacker News data proved to be great material for reaching these goals, as it is easily collectible and of relatively good quality.

This project is made of 3 different parts:

Hacker News Data Dumps

The dump is split into 2 files: stories.csv and comments.csv. Both files are compressed with 7-Zip to abide by GitHub's policy against data warehousing.

As HN item ids are sequential, it is easy to see that some items are missing from the dumps. After some quick research, it appears that the missing items correspond to deleted posts, which are not returned by the API.

Stories.csv

The file contains 1553934 entries, is 171MB uncompressed and uses the following columns:

id, created_at, created_at_i, author, points, url_hostname, num_comments, title

Comments.csv

The file contains 7111949 entries, is 959MB uncompressed and uses the following columns:

id, created_at, created_at_i, author, points, story_id, parent_id, url_hostname, comment_text
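
Once the archives are extracted, the entry counts above can be sanity-checked with a simple line count (the data/ paths match the ones used by the commands further down):

$ wc -l data/stories.csv data/comments.csv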

Crawler

The current dataset contains every post and comment from HN's start in 2006 up until the end of Q1 2015 (March 31st). The data was retrieved via the HN Algolia API rather than the Firebase one, mostly because of rate limits. A simple CLI crawler is included in the project so anybody can bring the dataset up to date.

Stories and comments need to be updated separately as the fields fetched for each differ. The crawler retrieves every story or comment posted after a given timestamp. Here is how to use it:

$ node crawler/crawler.js -f [output_filename.csv] -d [data to be retrieved ( 'story' || 'comment')] -t [timestamp]

It will output a CSV file with the following header:

stories.csv:
id, created_at, created_at_i, author, points, url_hostname, num_comments, title
comments.csv:
id, created_at, created_at_i, author, points, story_id, parent_id, url_hostname, comment_text
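
For example, to fetch every story posted after the end of Q1 2015 (the output filename is just an example, and 1427846400 is April 1st 2015 UTC):

$ node crawler/crawler.js -f stories_update.csv -d story -t 1427846400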

This tool is functional but quick and dirty. There is a lot of room for optimization, but since the main bottleneck here is the API rate limit, it is probably not worth spending too much time on it.

Scripts

All the data used in the data visualization comes directly from the crawler's output. It is then parsed and formatted using CLI tools only.

Active users

A simple awk script runs through the dataset and records each user's activity on a quarterly basis. The script then outputs the number of active users per quarter.

$ awk -F, -v OFS="," -f ./bin/active_users.awk ./data/stories.csv ./data/comments.csv | sort >> ./output/active_users.csv
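
For reference, here is a minimal sketch of the quarterly counting logic (the actual ./bin/active_users.awk may differ; the column positions are taken from the headers above and created_at is assumed to be an ISO date like 2015-03-31T12:00:00Z):

# created_at is column 2, author is column 4
{
    quarter = substr($2, 1, 4) "-Q" int((substr($2, 6, 2) - 1) / 3 + 1)
    seen[quarter SUBSEP $4] = 1
}
# count the distinct users seen in each quarter
END {
    for (key in seen) {
        split(key, parts, SUBSEP)
        active[parts[1]]++
    }
    for (q in active) print q, active[q]
}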

Submissions

This script is very similar to the active users one. Just run it as follows:

$ awk -F, -v OFS="," -f ./bin/submissions.awk ./data/stories.csv ./data/comments.csv | sort >> ./output/submissions.csv

Top Stories

First we sort the stories by score in descending order using the sort tool. Then we filter out most of the results to keep only the top 10 stories per year.

$ ./bin/top_stories.sh
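
The gist of it is something like the following one-liner (a sketch only; the actual ./bin/top_stories.sh and its output path may differ — points is column 5 and the year is read from created_at in column 2):

$ sort -t, -k5,5nr data/stories.csv | awk -F, '{ if (top[substr($2, 1, 4)]++ < 10) print }' > output/top_stories.csv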

Cohort Analysis

This is probably the most convoluted part of the analysis. The main idea behind the script is to get, for each user, the dates (i.e. year + quarter) of their first and last contributions to HN.

To do this, we first concatenate the stories and comments and sort them by date. Then, for each user, we store the corresponding dates in a hash map as concatenated strings.

We then iterate over this hash map and aggregate the cohort figures.

To launch the script, simply run:

$ ./bin/cohort.sh
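
As a rough, hypothetical sketch of that idea (the real ./bin/cohort.sh may differ; created_at is column 2, created_at_i column 3 and author column 4):

$ sort -t, -k3,3n data/stories.csv data/comments.csv | awk -F, -v OFS="," '$3 ~ /^[0-9]+$/ { q = substr($2, 1, 4) "-Q" int((substr($2, 6, 2) - 1) / 3 + 1); if (!($4 in first)) first[$4] = q; last[$4] = q } END { for (u in first) print first[u], last[u] }' | sort | uniq -c

Each line of the output is then a cohort: the number of users whose first contribution falls in the first quarter and whose last contribution falls in the second.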

Word Count

The word frequency analysis is done by looping through each post title, splitting it into lowercase words and keeping only the most frequent ones.

$ cat data/stories.csv | awk -F, -v OFS="," -v timestamp=1325376000 -f ./scripts/word_freq.awk | sort -nr --field-separator="," --key=2 > output/word_freq.csv
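
The tokenizing step looks roughly like this (a sketch only, the real ./scripts/word_freq.awk may differ; created_at_i is column 3, the title starts at column 8 and may itself contain commas, and the timestamp variable is assumed to act as a lower bound):

# keep only stories posted after the timestamp passed with -v
$3 ~ /^[0-9]+$/ && $3 >= timestamp {
    title = $8
    for (i = 9; i <= NF; i++) title = title "," $i
    n = split(tolower(title), words, "[^a-z0-9']+")
    for (i = 1; i <= n; i++) if (length(words[i]) > 1) count[words[i]]++
}
END { for (w in count) print w, count[w] }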

I then used a list of the most common words to filter out the irrelevant ones (I manually kept a few interesting ones in the context of HN). This post-filtering was done with the following script:

$ awk -F, -v OFS="," 'NF==1 { common[$1] = 1 }  { if (common[$1] != 1) { print $0 } }' utils/common_words.csv output/word_freq.csv > output/word_freq_filtered.csv

NB: this part is the only opinionated part of the whole analysis, as I had to filter and cluster the words manually for the sake of the visualization.

Top Domains

Similar to the top stories, we extract each story's domain and keep only the top 10 domains per year:

$ awk -F, -v OFS="," -f ./bin/sources.awk data/stories.csv | sort --field-separator="," -k1,1nr -k5,5nr | awk -F, -v OFS="," '{ if (year[$1] < 10 && length($2) > 2) { print $0; year[$1]++ } }' > output/sources.csv

Karma

First we need to get every user's karma. To do so, we loop through the posts and comments and sum up each user's points. The data is not entirely accurate for 2 reasons: the HN API no longer provides comment scores, and deleted posts can no longer be retrieved even though their score still counts towards their author's karma.

Once we have each user's karma, we only need to cluster the data by karma range.

$ awk -F, -v OFS="," -f ./bin/karma.awk data/stories.csv data/comments.csv | sort -t, -n -k1,1n > output/karma.csv
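
A minimal sketch of the aggregation step (the actual ./bin/karma.awk may differ, and the 100-point bucket size below is an arbitrary assumption):

# author is column 4 and points column 5 in both files
$5 ~ /^[0-9]+$/ { karma[$4] += $5 }
END {
    for (u in karma) buckets[int(karma[u] / 100) * 100]++
    for (b in buckets) print b, buckets[b]
}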

Data-visualization

The dataviz was made using d3.js and CSS3 only.

If you wish to make changes to it, note that npm is used as a build tool. The available scripts are listed in the package.json:

{
    "test": "echo \"Error: no test specified\" && exit 1",
    "jade": "jade src/**.jade --out dist -P",
    "stylus": "stylus src/stylesheets/style.styl --out dist/css",
    "uglify": "uglifyjs src/js/*.js -o dist/js/main.min.js",
    "concat": "uglifyjs src/js/*.js -o dist/js/main.js -b",
    "watch": "watch 'npm run jade' src/ & watch 'npm run stylus' src/stylesheets/ & watch 'npm run concat' src/js/"
}
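
For instance, to rebuild the templates, stylesheets and scripts whenever a source file changes:

$ npm run watch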
