Main Challenge: Generate Twitter CVE Metrics

Overview

Get ready to have some fun with data and explore how we can process it using Python!

This challenge may look a little daunting, but just have fun with it and do what you can using the starter code provided.

The goal of this challenge is to generate a number of metrics from Twitter CVE data, similar to what we saw used in the last challenge.

In this challenge there's a full month of data, with each day in its own file in the data/yearup folder.

The starter code will load each of the log files in the data/yearup folder, filter out invalid tweets, and extract certain data fields from each tweet.

There is also a framework for a simple report generator and example metrics functions to get you started.

For the bonus challenge, this program will also write out a CSV file, "twitter_data.csv", to the current directory.

ℹ️ Tips

  • Use Python 3 to run yearup_challenge2.py (python3 yearup_challenge2.py)
  • Run yearup_challenge2.py from inside its own directory (otherwise you'll have trouble accessing the data)

Main Challenge: Metrics

ℹ️ Places where you need to make updates are marked with TODO in the comments; anything marked with "INFO" comments is optional and intended for debugging.

  1. First, I need your help fixing the filter_tweet_data function by adding some code here (a rough sketch follows this list)
  2. Next, I need you to update the print_tweet_data_metrics function by writing new functions to compute the metrics and adding a call to that code here
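
For #1, the exact fix depends on what the starter code considers an invalid tweet, but the general shape might look like the sketch below. The parameter name, the return convention (True means keep the tweet), and the specific checks are all assumptions, not the actual starter code.

    def filter_tweet_data(tweet):
        """Sketch: return True for tweets worth keeping (checks are placeholders)."""
        if not tweet:              # drop empty or unparseable entries
            return False
        if not tweet.get("cve"):   # hypothetical field name; drop tweets with no CVE reference
            return False
        return True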

For #2, I've added 3 examples above the TODO that you can use/copy/modify as you see fit. You should notice a pattern for each section (a rough sketch of the pattern follows this list):

  1. call a function to get the computed metrics in a dictionary
  2. print the title of the metric for the report
  3. print the metrics (a pretty_print function has been provided to handle formatting)
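
Putting those three steps together, one report section inside print_tweet_data_metrics might look roughly like the sketch below. The get_tweet_count_by_cve function and the tweet_data variable are hypothetical names, and it assumes pretty_print takes the metrics dictionary as its argument.

    cve_counts = get_tweet_count_by_cve(tweet_data)  # 1. compute the metrics as a dictionary
    print("Tweet count by CVE")                      # 2. print the title of the metric
    pretty_print(cve_counts)                         # 3. print the metrics with the provided helper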

Each of your metrics functions should return a dictionary where the KEY is the name of the metric and the VALUE is the value of the metric.
One example of nested dictionaries is shown in get_weekday_metrics_for_by_cve in case you want to try that.
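
As an illustration of that contract, the hypothetical get_tweet_count_by_cve from the sketch above might look like this. The "cve" field name on each extracted tweet is an assumption; check the extracted data for the real key.

    from collections import Counter

    def get_tweet_count_by_cve(tweet_data):
        """Sketch: return {metric name: value}, here one count per CVE."""
        # A nested version (like get_weekday_metrics_for_by_cve) would instead
        # return something shaped like {cve: {weekday: count}}.
        return dict(Counter(tweet["cve"] for tweet in tweet_data))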

Review the metrics functions in the examples; you should be able to copy and modify them to create new metrics for the report.

For the main challenge, add as many of the additional metrics below as you can (they get harder further down the list; a rough sketch of two of them follows the list):

  • number of tweets for each CVE (like challenge #1)
  • most popular day of the week for all tweets
  • total number (sum) of "followers" who could have seen any tweet for a CVE (use the user follower count in the tweet)
  • number of CVEs not from 2020 (remember the year is embedded in the CVE ID format, CVE-YYYY-...; date_year is the year the tweet was sent)
  • count of CVEs from each CVE release year (using the year in the CVE number)
  • average number of tweets for each user (user_status_count is the number of tweets they've sent)
  • date that each CVE was first seen in a tweet and the date it was last seen in a tweet
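
As a hedged sketch of two of the items above (CVE release-year counts and first/last-seen dates), the code below assumes each extracted tweet is a dictionary with "cve" and "date" keys, and that the date is stored in a sortable (ISO-style) format; the real field names may differ, so check the extracted tweet data first.

    from collections import Counter

    def get_cve_count_by_release_year(tweet_data):
        """Sketch: count unique CVEs by the year embedded in the CVE ID (CVE-YYYY-...)."""
        unique_cves = {tweet["cve"] for tweet in tweet_data}
        return dict(Counter(cve.split("-")[1] for cve in unique_cves))

    def get_first_and_last_seen_by_cve(tweet_data):
        """Sketch: nested dict of the earliest and latest tweet date for each CVE."""
        seen = {}
        for tweet in tweet_data:
            cve, date = tweet["cve"], tweet["date"]   # "date" is an assumed field name
            if cve not in seen:
                seen[cve] = {"first_seen": date, "last_seen": date}
            else:
                seen[cve]["first_seen"] = min(seen[cve]["first_seen"], date)
                seen[cve]["last_seen"] = max(seen[cve]["last_seen"], date)
        return seen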

Feel free to get creative and add any additional metrics you like. If you get REALLY adventurous, you can extract additional fields in the extract_data_from_tweet_json function
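
If you do extend extract_data_from_tweet_json, the addition is usually just another key pulled out of the raw tweet JSON. The lines below are a sketch assuming the raw data follows the standard Twitter v1.1 tweet layout (retweet_count at the top level, screen_name nested under user); the extracted and tweet_json names are illustrative.

    # Inside extract_data_from_tweet_json (sketch; names are illustrative):
    extracted["retweet_count"] = tweet_json.get("retweet_count", 0)
    extracted["user_screen_name"] = tweet_json.get("user", {}).get("screen_name", "")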

Be sure to review an example of the extracted tweet data to help you find the data you'll need

⚠️ Notice that I've marked a comment "# NO NEED TO EDIT BELOW THIS LINE, FEEL FREE TO MAKE CHANGES BUT BE CAREFUL". As this says, everything below is starter code and there should be no need to update it, but you're welcome to make changes if you like.

Bonus Challenge: SQL

⚠️ This bonus challenge may be quite difficult and I haven't had the time to provide any starter code for it; check back and I may provide some tips later this week. We may also have a chance to do some demos on Thursday.

Your Challenge

Use the generated CSV file, twitter_data.csv.

  • Option 1: If you have access to an AWS account, set up a new Athena database and table to run SQL commands on the data in S3.
  • Option 2: If you have access to an AWS account, you could also set up a free-tier RDS instance.
  • Option 3: Set up a SQL database on your laptop (e.g., MySQL, SQLite), create a new database and table, then load the data from the CSV file and run SQL commands on it (a rough sqlite3 sketch follows below).

⚠️ If you choose option #1 with Athena, be sure to gzip the CSV file (gzip twitter_data.csv) before uploading it to S3 to save on costs. This will reduce it from ~16MB to ~3MB, so each Athena query should only cost about 2 cents.
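
For option #3 with SQLite, a minimal sketch using Python's built-in sqlite3 and csv modules might look like the following. It assumes twitter_data.csv has a header row, and the "cve" column in the example query is illustrative; substitute the column names actually written by the starter code.

    import csv
    import sqlite3

    # Load twitter_data.csv into a local SQLite database (column names come from the header row).
    with open("twitter_data.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    conn = sqlite3.connect("twitter_data.db")
    cols = ", ".join(f'"{c}"' for c in header)      # quote column names in case of odd characters
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS tweets ({cols})")
    conn.executemany(f"INSERT INTO tweets VALUES ({placeholders})", rows)
    conn.commit()

    # Example query: top 10 CVEs by tweet count (assumes a "cve" column exists).
    for cve, count in conn.execute(
            'SELECT "cve", COUNT(*) FROM tweets GROUP BY "cve" ORDER BY COUNT(*) DESC LIMIT 10'):
        print(cve, count)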

Beyond the Challenge

Feel free to have fun and explore what you can do with this data.

If you have different ideas than what's in the challenge, feel free to submit and demo them as well.
