Sparkify_Churn_Project_with_AWS_EMR

Motivation

In our previous repository, we explored the medium-sized Sparkify dataset using pyspark. In this project, we use Amazon Web Services' EMR together with Spark to investigate the full dataset (roughly 12 GB).

Introduction

To analyze the full dataset efficiently, we created EMR clusters with Spark on AWS and attached the EMR notebook to this Git repository. The complete process is documented in a Medium post. We then use the pyspark kernel and Jupyter notebook for coding and visualization. The goal of this project is to predict whether a user will churn on the basis of past events and activity logs. If we can accurately identify users who are likely to churn, we can offer them promotions or special offers before they decide to leave.
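As a rough illustration of how a per-user churn label could be derived from the event log, here is a minimal plain-Python sketch. It assumes that a user counts as churned once a "Cancellation Confirmation" value appears in the `page` column. This README does not state the rule, so treat it as an assumption. The `label_churn` helper and the sample events are hypothetical, and the notebook itself works with pyspark dataframes rather than Python dicts.

```python
from collections import defaultdict

# Assumption: churn is signalled by this value in the `page` column.
# The README does not state the rule the notebook actually uses.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if churned else 0} from an iterable of event dicts."""
    labels = defaultdict(int)
    for event in events:
        user_id = event["userId"]
        if not user_id:
            continue  # skip events that carry no user id
        # Once a churn event is seen, the label stays at 1.
        labels[user_id] |= int(event["page"] == CHURN_PAGE)
    return dict(labels)


# Hypothetical toy log using the `userId` and `page` columns.
sample_events = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "20", "page": "NextSong"},
    {"userId": "", "page": "Home"},
]

churn_labels = label_churn(sample_events)
print(churn_labels)
```

The same idea carries over to a Spark dataframe by grouping on `userId` and taking the maximum of a 0/1 churn flag.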

Dataset

In this notebook, we use a subset of the full dataset for modeling. The schema of this dataframe is shown below.

|-- artist: string (nullable = true) 
|-- auth: string (nullable = true) 
|-- firstName: string (nullable = true) 
|-- gender: string (nullable = true) 
|-- itemInSession: long (nullable = true) 
|-- lastName: string (nullable = true) 
|-- length: double (nullable = true) 
|-- level: string (nullable = true) 
|-- location: string (nullable = true) 
|-- method: string (nullable = true) 
|-- page: string (nullable = true) 
|-- registration: long (nullable = true) 
|-- sessionId: long (nullable = true) 
|-- song: string (nullable = true) 
|-- status: long (nullable = true) 
|-- ts: long (nullable = true) 
|-- userAgent: string (nullable = true) 
|-- userId: string (nullable = true) 

Libraries Used

  • re (Regular Expression)
  • pyspark (Spark interface for Python)
  • pandas (Dataframe manipulation)
  • matplotlib (Plotting)
  • seaborn (Advanced Plotting)

Files and Folders

  • images (image storage)
  • Sparkify_Pyspark.ipynb (Notebook for the project)
  • Sparkify_Pyspark.html (HTML for the Notebook)

Summary

After analyzing the full dataset, we found that most of the variables we investigated do not differ significantly from those in the medium dataset. This may mean that the medium dataset was sampled homogeneously from the full dataset, so analyzing the medium dataset may already be acceptable. The regression result on the full dataset is slightly better for test-set evaluation (up to 0.78 F1 score). When evaluated on the whole dataset, however, the best F1 score dropped to 0.61. In my opinion, our handling of the imbalanced dataset is still not sufficient. Most unchurned users are not recognized by the model, leading to false positives.
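To make the effect of false positives on the F1 score concrete, here is a small sketch that computes precision, recall, and F1 from confusion-matrix counts. It assumes churn is the positive class and uses the plain binary F1 definition. The notebook's evaluator may compute a different variant (for example a weighted F1), and the counts below are hypothetical, not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute binary precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced case: 100 churned users and 900 unchurned users.
# The model finds 90 churners (tp=90, fn=10) but also flags 200 unchurned
# users as churners (fp=200).
precision, recall, f1 = precision_recall_f1(tp=90, fp=200, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case recall stays high while the false positives pull precision, and therefore F1, down. That is the pattern described above when unchurned users are misclassified.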

Acknowledgement

Special thanks to Udacity for providing the dataset and to AWS for providing amazing tools for big data analysis.
