ubc-mds / dsci_525_group_6

MDS DSCI 525 Group 6 Web and Cloud Computing Project

License: MIT License

dsci_525_group_6's Introduction

raynMan

Using big data to predict daily Australian rainfall!

Project Goals

Our goal is to develop and deploy cloud-based ensemble machine learning models to predict daily Australian rainfall. The data consists of modelled and observed daily rainfall over NSW, Australia, from 1889 to 2014, originally accessed from the figshare platform. The modelled data has been kindly provided by CMIP6, an international collaboration that gathers climate model outputs from different groups around the world. We will gather, process, and consolidate the outputs of the separate climate models into a big data machine learning application that predicts future target rainfall measurements. The final model will be deployed for others to use in their own analyses!

Usage

1. Clone the GitHub repository.
2. From the project root directory, navigate to the notebooks folder and open rainfall_analysis.ipynb.
3. Click on the Run menu and then click on Run All Cells.

Dependencies

* R
* Python
* pandas
* rpy2
* dask
* pyarrow
* dplyr

Please note that this notebook is resource-intensive and may not run on machines with limited memory.

Contributors

Group 6 Members:

Kangbo Lu - @KangboLu
Craig McLaughlin - @cmmclaug
Debananda Sarkar - @debanandasarkar
Kevin Shahnazari - @kshahnazari1998

Attributions

MDS DSCI 525 Instructor Gittu George - @ggeorg02
Data compiled by MDS Instructor Tom Beuzen - @tbeuzen

Modelled data provided by CMIP6: https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6
Data is supported by the Pangeo project: https://pangeo-data.github.io/pangeo-cmip6-cloud/
Observed data is supplied by the Australian SILO database: https://www.longpaddock.qld.gov.au/silo/

dsci_525_group_6's Issues

Combining data CSVs

rubric={correctness:10,reasoning:10}

Use one of the following options to combine the data CSVs into a single CSV:
* Pandas
* Dask
When combining the CSV files, make sure to add an extra column called "model" that identifies the model. Tip: you can populate this column from the file name. For example, for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON.
Compare the run times and memory usage of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.
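
The combining step described above could be sketched with pandas roughly as follows. This is a minimal sketch, not the team's actual notebook code: the function names, the input directory, and the output path are hypothetical, and the only assumption about the data is that every file follows the "<model>_daily_rainfall_NSW.csv" naming pattern mentioned in the tip.

```python
from pathlib import Path

import pandas as pd


def model_name_from_path(path):
    """Return the model name encoded in a file name.

    'SAM0-UNICON_daily_rainfall_NSW.csv' -> 'SAM0-UNICON'
    """
    return Path(path).name.split("_daily_rainfall_")[0]


def combine_csvs(input_dir, output_path):
    """Concatenate every '*_daily_rainfall_NSW.csv' file in input_dir.

    Each file's rows get an extra 'model' column taken from the file name.
    The combined frame is written to output_path and also returned.
    Raises ValueError (from pd.concat) if no matching files are found.
    """
    files = sorted(Path(input_dir).glob("*_daily_rainfall_NSW.csv"))
    frames = [
        pd.read_csv(f).assign(model=model_name_from_path(f))
        for f in files
    ]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined
```

Because this version holds every file in memory before concatenating, it is the variant most likely to run out of memory on a laptop, which is one of the observations the task asks you to discuss.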

Downloading the data

rubric={correctness:10}

Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
Extract the zip file programmatically as well, similar to how we did it in class.
You could download and unzip the data manually, but since we learned about APIs, do it in a reproducible way with the requests library.

There are 5 files in the figshare repo. The one we want is: data.zip
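
A reproducible download and extraction could look roughly like the sketch below. It uses only the Python standard library (urllib.request) so that it runs without extra installs; the requests library suggested above would work equivalently. The article ID is not given in this document, so it is left as a parameter, and the function names are hypothetical. The sketch assumes the figshare article endpoint returns JSON with a "files" list whose entries carry "name" and "download_url" fields.

```python
import json
import urllib.request
import zipfile
from pathlib import Path

# Public figshare endpoint for article metadata (assumed response layout:
# a JSON object with a "files" list of {"name": ..., "download_url": ...}).
FIGSHARE_API = "https://api.figshare.com/v2/articles/{article_id}"


def find_download_url(files, file_name):
    """Return the download URL of the entry named file_name."""
    for entry in files:
        if entry["name"] == file_name:
            return entry["download_url"]
    raise FileNotFoundError(f"{file_name} not found in article files")


def download_file(article_id, file_name, output_dir):
    """Download one file of a figshare article into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    url = FIGSHARE_API.format(article_id=article_id)
    with urllib.request.urlopen(url) as response:
        metadata = json.load(response)
    target = output_dir / file_name
    urllib.request.urlretrieve(
        find_download_url(metadata["files"], file_name), target
    )
    return target


def extract_zip(zip_path, output_dir):
    """Extract every member of zip_path into output_dir; return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()
```

With the real article ID filled in, calling download_file(article_id, "data.zip", "data") followed by extract_zip("data/data.zip", "data") would fetch and unpack the one file this task needs.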

Create Repository and Project Structure

Creating repository and project structure
rubric={mechanics:10}

Similar to previous project courses, create a public repository under the UBC-MDS org for your project.
Write a brief introduction of the project in the README.
Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Team Work Contract

Team-work contract
rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you are committed to work together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and include a link to it, or a copy, in your submission to Canvas to prove you did this.

Load the combined CSV to memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).

Changing the dtype of your data
Loading just the columns we want
Loading in chunks
Using Dask

Discuss your observations.
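
Three of these approaches can be sketched with pandas.read_csv as shown below. This is an illustrative sketch only: the function names are hypothetical, and the column names passed in are assumptions about the combined CSV (apart from "model", which the combining task adds). Note that float32 halves the memory of float64 columns at the cost of precision.

```python
import pandas as pd


def load_reduced(csv_path, columns, float_columns):
    """Load only the listed columns, storing float_columns as float32.

    usecols skips the unwanted columns while parsing, and float32 uses
    half the memory of pandas' default float64.
    """
    dtypes = {col: "float32" for col in float_columns}
    return pd.read_csv(csv_path, usecols=columns, dtype=dtypes)


def count_values_in_chunks(csv_path, column, chunksize=1_000_000):
    """Compute value_counts for one column without loading the whole file.

    Only chunksize rows are held in memory at a time; the per-chunk
    counts are summed into a running total.
    """
    total = None
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        counts = chunk[column].value_counts()
        total = counts if total is None else total.add(counts, fill_value=0)
    if total is None:  # the file had a header but no rows
        return pd.Series(dtype="int64")
    return total.astype("int64")
```

Comparing df.memory_usage(deep=True).sum() for a default read_csv call against load_reduced is one way to quantify the saving for the discussion.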

Perform a simple EDA in R

rubric={correctness:15,reasoning:10}

Pick an approach to transfer the dataframe from Python to R.

Parquet file
Feather file
Pandas exchange
Arrow exchange

Discuss why you chose this approach over others.
