liu431 / big-data-project

CAPP 30123 Class Project

Python 10.03% Jupyter Notebook 89.69% Shell 0.28%

hadoop mrjob-dataproc stackoverflow programming-language sentiment-analysis

big-data-project's Introduction

Hi 👋, I'm Li Liu

Data & Software Engineer

🔭 I’m currently working on being an expert in data and ML engineering
📝 I regularly write tech blogs on https://medium.com/@liliu.data
💬 Ask me about Python, AWS, Google Cloud, SQL, Data Engineering, Machine Learning, Econometrics, Running
📫 How to reach me [email protected]

big-data-project's People

sanittawan tonofshell dhruvalb curioustauseef

big-data-project's Issues

Data Download

I got all of the compressed data downloaded. It looks like many of the files are about 1/5 of their original size when compressed.

Week 6 - Check in with Prof. Wachs

The goals of this check-in are:

Done with the first stage of data cleaning
Propose an improved analytical question

Data description

Relational Diagram of the files (@dhruvalb)
MPI run time experiment

Exploratory Analysis

Top 15 tags (@tonofshell) - this file OR this file (I'm confused. Are they the same?)
Users Activities (@tonofshell) - which users are most active - this file
Questions with most answers per year (@tonofshell) - this file
Users with gold answer badges locations (@tonofshell) - this file
2-grams of tags that appear together (network of tags) (@tonofshell) - this file
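
As a rough illustration of the tag 2-grams item, here is a minimal local sketch, not the project's MapReduce job. It assumes the tags of a post are stored in one string such as "<python><pandas>" and treats a "2-gram" as an unordered pair of tags on the same post. The function names are made up for this example.

```python
from collections import Counter
from itertools import combinations
import re


def parse_tags(raw):
    """Split a tag string such as '<python><pandas>' into ['python', 'pandas']."""
    return re.findall(r"<([^<>]+)>", raw)


def count_tag_pairs(tag_strings):
    """Count how often each unordered pair of tags appears on the same post."""
    pairs = Counter()
    for raw in tag_strings:
        # Sort and de-duplicate so (a, b) and (b, a) land on the same key.
        tags = sorted(set(parse_tags(raw)))
        pairs.update(combinations(tags, 2))
    return pairs


if __name__ == "__main__":
    sample = ["<python><pandas>", "<python><pandas><numpy>", "<java>"]
    for pair, n in count_tag_pairs(sample).most_common(3):
        print(pair, n)
```

The pair counts can serve as edge weights if the tags are later drawn as a network.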

Main Analysis

Time series plots of each language

Please ask @liu431 for the output

Week 8 - Check in with Prof. Wachs

The goals of this check-in are:

Meeting agenda for Week 8

Hi everyone,

Since Adam must be busy getting us access to the Google Cloud buckets, I thought I would help him by drafting the agenda for tomorrow's meeting. Please feel free to add to it.

Agenda

Adam & Nikki briefly report on uploading and cleaning the datasets
Discuss how to implement the main analysis: "For each programming language, as it becomes more popular/commonplace, do answer providers become more hostile, meaner, or more negative towards question askers?" The underlying assumption is that answer providers should be more positive and more willing to give good and kind answers to questions related to nascent programming languages.
Li briefly reports on his sentiment analysis code
Dhruval reports on her findings on the datasets

Goal

Nail down which columns from which datasets we need for the analysis
How to operationalize it using MapReduce and/or MPI?
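
One possible way to frame the MapReduce question is sketched below in plain Python, with the shuffle step simulated locally. Everything here is an assumption for discussion: the record fields (language, year, body), the toy lexicon, and the choice of averaging a word-count score per (language, year). The real job would use whatever dictionary and columns we agree on.

```python
from collections import defaultdict

# Hypothetical toy lexicon; the actual sentiment dictionary is still undecided.
LEXICON = {"thanks": 1, "great": 1, "helpful": 1,
           "stupid": -1, "wrong": -1, "useless": -1}


def mapper(record):
    """Emit ((language, year), (score, 1)) for one answer record."""
    score = sum(LEXICON.get(word, 0) for word in record["body"].lower().split())
    yield (record["language"], record["year"]), (score, 1)


def reducer(key, values):
    """Average the sentiment scores collected for one (language, year) key."""
    total = count = 0
    for score, n in values:
        total += score
        count += n
    return key, total / count


def run(records):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    answers = [
        {"language": "python", "year": 2010, "body": "Thanks great answer"},
        {"language": "python", "year": 2010, "body": "this is wrong"},
    ]
    print(run(answers))
```

Emitting (score, 1) pairs instead of bare scores keeps the reducer's sum-and-count step usable as a combiner too, since partial sums and counts can be merged before the final division.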

How to Export

Hello Devs,
I want all of the StackOverflow data in CSVs, with columns like
Post Title | Answer

Is there any way to do this?

For Adam to run on Dataproc

These can be done in Dataproc.

Please follow these steps:

Run decrs_max_ans_q.py (questions with the highest number of answers per year) with Posts.csv
Run decrs_users_activities.py (users who asked or answered the most questions) with Posts.csv
Add a column ",badges" to Badges.csv and save it to a new file in the bucket
Add a column ",users" to Users.csv and save it to a new file in the bucket
Cat Badges.csv and Users.csv, and save the result to a new file "badges_users.csv"
Run decrs_users_gold_ans.py (locations of users with gold answer badges) with "badges_users.csv"
Run decrs_n_grams_tags.py (2-grams of tags that are usually tagged together) with Posts.csv

I will add more to the list as we have more. Thanks!
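
In case it helps with the "add a column, then cat" steps, here is a minimal sketch of one way to do them in Python. It assumes the point of the extra column is to let a later job tell which file each row came from, and that it is the labelled copies that get concatenated. The helper names and the sample rows are invented for illustration.

```python
import csv
import io


def tag_rows(lines, label):
    """Yield each CSV row with a trailing column naming its source file."""
    for row in csv.reader(lines):
        if row:  # skip blank lines
            yield row + [label]


def concat_tagged(sources, out):
    """Write the rows of every (lines, label) source into one CSV stream."""
    writer = csv.writer(out)
    for lines, label in sources:
        writer.writerows(tag_rows(lines, label))


if __name__ == "__main__":
    # Made-up rows standing in for Badges.csv and Users.csv.
    badges = io.StringIO("1,42,Great Answer\n")
    users = io.StringIO("42,alice,Chicago\n")
    out = io.StringIO()
    concat_tagged([(badges, "badges"), (users, "users")], out)
    print(out.getvalue())
```

Going through the csv module rather than appending text to each line keeps fields that contain commas or quotes intact.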

Subset and clean data

This task is due Wednesday, May 8, 2019.

What to do:

Subset 500 lines of files that you're assigned
Parse XML to CSV
Save two CSV files to ./data directory on Github repo
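
A minimal sketch of the subset-and-convert step, assuming each record in the dump is a `<row>` element whose attributes hold the fields. The column names are passed in by the caller, and the ones in the usage example are placeholders, not a confirmed schema.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import islice


def iter_rows(source):
    """Stream the attributes of each <row> element without loading the whole file."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the element once its attributes are copied


def xml_to_csv(source, out, columns, limit=500):
    """Write the first `limit` rows of `source` to `out` as CSV; return the row count."""
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    rows = list(islice(iter_rows(source), limit))
    writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    xml = b'<tags><row Id="1" TagName="python" Count="5" /></tags>'
    out = io.StringIO()
    xml_to_csv(io.BytesIO(xml), out, ["Id", "TagName", "Count"])
    print(out.getvalue())
```

`source` can be a file path or an open binary file. Attributes missing from a row are written as empty cells, and attributes not listed in `columns` are dropped.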

Here's the assignment:

Files	Name
badges	Adam
comments	Adam
post history	Dhruval
post links	Dhruval
posts	Li
tags	Li
users	Nikki
votes	Nikki

Data Problems and Questions

I have found a few issues that will add to the complexity of our analysis that I think we should start thinking about. Feel free to add any questions or problems you find with the data to this issue.

The data is in XML format NOT raw text files
- Can we parse XML line by line?
- Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file.
  - Each row is one user or comment
Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format
- Do we simply drop all HTML tags and how would we do this?
- Do we process the data and save it to disk in another format to use for our analysis?
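
To make the two questions above concrete, here is a minimal sketch using only the standard library. It assumes every record sits on its own line as a self-closing `<row ... />` element, which is what would make line-by-line parsing possible, and it drops HTML tags by keeping only the text nodes. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def parse_row_line(line):
    """Return the attributes of one '<row ... />' line, or None for other lines."""
    line = line.strip()
    if not line.startswith("<row"):
        return None  # XML declaration, opening or closing wrapper tag, blank line
    return dict(ET.fromstring(line).attrib)


if __name__ == "__main__":
    line = '<row Id="7" Text="&lt;p&gt;Use &lt;code&gt;len()&lt;/code&gt; here.&lt;/p&gt;" />'
    row = parse_row_line(line)
    print(row["Id"], strip_html(row["Text"]))
```

Because each line is parsed independently, the same two functions could be called from a mapper that receives one line at a time.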

Sample data files missing

Hi guys,

I want to point out that, for some reason, the sample processed data files are gone. Is that on purpose?

Does anybody else have trouble with scp? For some reason, I could not transfer files from remote to local. Is there a better way to do it? For now, I created a private Git repo that I push my work to and pull from on my local machine, but it is super inconvenient.

test

Week 7 plan - Division of Labor

Task	Name	Date
Data uploading
convert to CSV (Adam's code, CSV module, get rid of tags)	Nikki, Adam	Fri May 17
Upload to buckets	Adam	Fri May 17
Figure out sharing/access	Adam	Fri May 17
Data prep
Decide necessary vars	Dhruval	Fri May 17
Find data keys	Dhruval	Fri May 17
Decide on data structure	Dhruval	Fri May 17
Join data using MapReduce	Dhruval, Nikki	Fri May 17
Sentiment logistics
Decide on a dictionary	Li	Fri May 17
Decide on sentiments	Li	Fri May 17
Define sentiments (n-grams)	Li	Fri May 17
Specify inputs for models	Li	Fri May 17
Data analysis
NLTK	Li	Fri May 17
what other packages to use	??
split up analysis on clusters	??
Pres/Viz
??	??
??	??