big-data-project's Introduction

Hi 👋, I'm Li Liu

Data & Software Engineer

  • 🔭 I'm currently working on being an expert in data and ML engineering

  • 📝 I regularly write tech blogs on https://medium.com/@liliu.data

  • 💬 Ask me about Python, AWS, Google Cloud, SQL, Data Engineering, Machine Learning, Econometrics, Running

  • 📫 How to reach me [email protected]

Connect with me:

liu431 @liliu.data liu431

Languages and Tools:

aws azure css3 docker flask git hadoop html5 linux mssql mysql pandas postgresql python scikit_learn seaborn tensorflow

big-data-project's People

Contributors

dhruvalb, liu431, sanittawan, tonofshell

big-data-project's Issues

Data Download

I downloaded all of the compressed data. Many of the files appear to be about 1/5 of their original size when compressed.

Visualizations to make

Data description

  1. Relational Diagram of the files (@dhruvalb)
  2. MPI run time experiment

Exploratory Analysis

  1. Top 15 tags (@tonofshell) - this file OR this file (I'm confused. Are they the same?)
  2. Users Activities (@tonofshell) - which users are most active - this file
  3. Questions with most answers per year (@tonofshell) - this file
  4. Users with gold answer badges locations (@tonofshell) - this file
  5. 2-grams of tags that appear together (network of tags) (@tonofshell) - this file
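
For the tag-pair item, here is a minimal sketch of one way to count tags that appear together. It assumes each question's Tags field is a string like `<python><pandas>` (the format used in the Stack Exchange XML dump) and treats a "2-gram" as any unordered pair of tags on the same question. The function names and sample strings are made up for illustration, and this is not necessarily what decrs_n_grams_tags.py does.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: tags are stored as "<tag1><tag2>..." in a single field.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def tag_pairs(tags_field):
    """Return every unordered pair of distinct tags in one Tags field."""
    tags = sorted(set(TAG_PATTERN.findall(tags_field)))
    return combinations(tags, 2)


def count_tag_pairs(tags_fields):
    """Count how often each pair of tags appears on the same question."""
    counts = Counter()
    for field in tags_fields:
        counts.update(tag_pairs(field))
    return counts


# Made-up sample input, for illustration only.
pair_counts = count_tag_pairs([
    "<python><pandas><csv>",
    "<python><pandas>",
    "<sql>",
])
```

Each pair is sorted so that (pandas, python) and (python, pandas) count as the same edge in the tag network. `pair_counts.most_common(15)` would give the most frequent pairs.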

Main Analysis

  1. Time series plots of each language
  • Please ask @liu431 for the output

Meeting agenda for Week 8

Hi everyone,

Since Adam is probably busy getting us access to the Google Cloud buckets, I thought I would help by drafting the agenda for tomorrow's meeting. Please feel free to add to it.

Agenda

  1. Adam & Nikki briefly report on uploading and cleaning the datasets

  2. Discuss how to implement the main analysis: "For each programming language, as it becomes more popular/commonplace, do answer providers become more hostile, meaner, negative towards question askers?" The underlying assumption is that answer providers should be more positive/willing to give good and kind answers to questions related to nascent programming languages.

  3. Li briefly reports on his sentiment analysis code

  4. Dhruval reports on her findings on the datasets

Goal

  1. Nail down which columns from which datasets we need for the analysis

  2. Decide how to operationalize it using MapReduce and/or MPI
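
As a starting point for the MapReduce discussion, here is a rough sketch of how the main analysis could be split into a map step and a reduce step, written in plain Python with no framework. Everything in it is an assumption for illustration: the record fields (`language`, `year`, `text`), the toy word lexicon, and the scoring rule are hypothetical placeholders, not the project's actual sentiment code.

```python
from collections import defaultdict

# Hypothetical toy lexicon; a real run would use a proper sentiment dictionary.
LEXICON = {"thanks": 1, "great": 1, "wrong": -1, "stupid": -1}


def score(text):
    """Sum the lexicon values of the words in one answer."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def mapper(record):
    """Emit ((language, year), (score, 1)) for one answer record."""
    yield (record["language"], record["year"]), (score(record["text"]), 1)


def reducer(key, values):
    """Combine (score_sum, count) pairs into a mean score for the key."""
    total = count = 0
    for partial_sum, partial_count in values:
        total += partial_sum
        count += partial_count
    return key, total / count


def run(records):
    """Run the map, group-by-key, and reduce steps in a single process."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up records, for illustration only.
means = run([
    {"language": "python", "year": 2010, "text": "great question thanks"},
    {"language": "python", "year": 2010, "text": "this is wrong"},
    {"language": "rust", "year": 2018, "text": "thanks"},
])
```

Emitting (sum, count) pairs rather than raw scores means partial results can be combined in any order, which should also make the same logic easy to port to an MPI reduction.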

How to Export

Hello Devs,
I want all of the Stack Overflow data as CSVs, in a format like:
Post Title | Answer

Is there any way to do that?
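
One possible approach, sketched under assumptions: in the Stack Exchange dump, questions have PostTypeId 1, answers have PostTypeId 2, and each answer's ParentId points to its question. The sketch below assumes the posts have already been parsed into dictionaries with those attribute names. The function name and sample rows are made up for illustration.

```python
import csv
import io


def title_answer_pairs(posts):
    """Yield (question title, answer body) for each answer with a known question."""
    posts = list(posts)
    # First pass: remember the title of every question by its Id.
    titles = {
        p["Id"]: p.get("Title", "")
        for p in posts
        if p.get("PostTypeId") == "1"
    }
    # Second pass: match each answer to its parent question.
    for p in posts:
        if p.get("PostTypeId") == "2" and p.get("ParentId") in titles:
            yield titles[p["ParentId"]], p.get("Body", "")


# Made-up sample rows, for illustration only.
posts = [
    {"Id": "1", "PostTypeId": "1", "Title": "How do I read a CSV?"},
    {"Id": "2", "PostTypeId": "2", "ParentId": "1", "Body": "Use the csv module."},
    {"Id": "3", "PostTypeId": "2", "ParentId": "99", "Body": "Orphaned answer."},
]
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Post Title", "Answer"])
writer.writerows(title_answer_pairs(posts))
```

This keeps every post in memory, so for the full dump it would probably need to be reworked as two streaming passes or a join in MapReduce.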

For Adam to run on Dataproc

These can be done in Dataproc.

Please follow these steps:

  1. Questions with the highest number of answers per year: run decrs_max_ans_q.py with Posts.csv

  2. Users who answered/asked questions the most: run decrs_users_activities.py with Posts.csv

  3. Add a column ",badges" to Badges.csv and save it to a new file in the bucket

  4. Add a column ",users" to Users.csv and save it to a new file in the bucket

  5. Concatenate (cat) Badges.csv and Users.csv and save the result to a new file, "badges_users.csv"

  6. Locations of users with gold answer badges: run decrs_users_gold_ans.py with "badges_users.csv"

  7. 2-grams of tags that are usually tagged together: run decrs_n_grams_tags.py with Posts.csv

I will add more to the list as we have more. Thanks!
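
Steps 3 to 5 could be done in one pass with something like the sketch below. It assumes one record per line with no header row and no quoted fields that span multiple lines. The function names and sample rows are made up, and in-memory files stand in for the files in the bucket.

```python
import io


def append_label(lines, label):
    """Yield each line with `,label` appended as a trailing column."""
    for line in lines:
        yield line.rstrip("\r\n") + "," + label + "\n"


def label_and_concat(sources, target):
    """Write every labelled line from each (lines, label) pair to `target`."""
    for lines, label in sources:
        target.writelines(append_label(lines, label))


# Made-up sample rows, for illustration only.
badges = io.StringIO("1,42,Great Answer\n")
users = io.StringIO("42,Chicago\n")
combined = io.StringIO()
label_and_concat([(badges, "badges"), (users, "users")], combined)
```

The trailing label lets a later mapper tell which file a line came from. If any field contains an embedded newline, the csv module would be a safer way to rewrite the rows.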

Subset and clean data

This task is due Wednesday, May 8, 2019.

What to do:

  • Take a 500-line subset of each file you're assigned
  • Parse the XML to CSV
  • Save the two CSV files to the ./data directory in the GitHub repo
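
A minimal sketch of the subset-and-convert step, assuming each record in the dump is a `<row ... />` element whose attributes are the columns. It uses only the standard library. The function name and the sample XML are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_source, csv_target, limit=500):
    """Stream <row .../> elements and write up to `limit` of them as CSV rows."""
    rows = []
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        rows.append(dict(elem.attrib))
        elem.clear()  # release memory held by the parsed element
        if len(rows) >= limit:
            break
    # Rows may omit optional attributes, so collect the union of column names.
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(csv_target, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Made-up sample input, for illustration only.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<tags>\n"
    b'  <row Id="1" TagName="python" Count="10" />\n'
    b'  <row Id="2" TagName="sql" />\n'
    b"</tags>\n"
)
out = io.StringIO()
written = xml_rows_to_csv(sample, out, limit=500)
```

With real files, `xml_source` could be a path such as a local copy of one of the dump files, and `csv_target` a file opened with `newline=""`.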

Here's the assignment:

File | Name
badges | Adam
comments | Adam
post history | Dhruval
post links | Dhruval
posts | Li
tags | Li
users | Nikki
votes | Nikki

Data Problems and Questions

I have found a few issues that will add to the complexity of our analysis that I think we should start thinking about. Feel free to add any questions or problems you find with the data to this issue.

  • The data is in XML format, not raw text files
    • Can we parse XML line by line?
    • Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file.
      • Each row is one user or comment
  • Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format
    • Do we simply drop all HTML tags? If so, how would we do this?
    • Do we process the data and save it to disk in another format to use for our analysis?
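
On the HTML question, one option is to drop the tags with the standard library's `html.parser`, as in the sketch below. The class and function names are made up for illustration, and whether dropping all tags (including code blocks) is right for the sentiment analysis is still an open question.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of `fragment` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up sample input, for illustration only.
plain = strip_html("<p>Use <code>df.merge()</code> &amp; check the docs.</p>")
```

Text from adjacent tags is joined without a separator, so inline markup such as bold text inside a word stays intact. Paragraphs with no whitespace between them would run together.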

Sample data files missing

Hi guys,

I want to point out that, for some reason, the sample processed data files are gone. Is that on purpose?

Problem with `scp`

Is anybody else having trouble with scp? For some reason, I could not transfer files from the remote machine to my local one. Is there a better way to do it? For now, I have created a private Git repo that I push my work to and then pull from locally, but it is super inconvenient.

Week 7 plan - Division of Labor

Task | Name | Date

Data uploading
convert to CSV (Adam's code, CSV module, get rid of tags) | Nikki, Adam | Fri May 17
Upload to buckets | Adam | Fri May 17
Figure out sharing/access | Adam | Fri May 17

Data prep
Decide necessary vars | Dhruval | Fri May 17
Find data keys | Dhruval | Fri May 17
Decide on data structure | Dhruval | Fri May 17
Join data using MapReduce | Dhruval, Nikki | Fri May 17

Sentiment logistics
Decide on a dictionary | Li | Fri May 17
Decide on sentiments | Li | Fri May 17
Define sentiments (n-grams) | Li | Fri May 17
Specify inputs for models | Li | Fri May 17

Data analysis
NLTK | Li | Fri May 17
what other packages to use | ??
split up analysis on clusters | ??

Pres/Viz
?? | ??
?? | ??
