
dsci_525_group_6's Introduction

raynMan

Using big data to predict daily Australian rainfall!

Project Goals

Our goal is to develop and deploy cloud-based ensemble machine learning models that predict daily Australian rainfall. The data consist of modelled and observed daily rainfall over New South Wales (NSW), Australia, from 1889 to 2014, originally accessed from the figshare platform. The modelled data have been kindly provided by CMIP6, an international collaboration that collects climate model outputs from different groups around the world. We will gather and process the outputs of the separate climate models, then consolidate them in a big data machine learning application that predicts future target rainfall measurements. The final model will be deployed for others to use in their own analyses!

Usage

  • Clone the GitHub repository.
  • From the project root directory, navigate to the notebooks folder and open rainfall_analysis.ipynb.
  • Open the Run menu and click Run All Cells.

Dependencies

* R
* Python
* pandas
* rpy2
* dask
* pyarrow
* dplyr

Please note that this notebook is resource-intensive and may not run on some machines.

Contributors

Group 6 Members:

  • Kangbo Lu - @KangboLu
  • Craig McLaughlin - @cmmclaug
  • Debananda Sarkar - @debanandasarkar
  • Kevin Shahnazari - @kshahnazari1998

Attributions

MDS DSCI 525 Instructor Gittu George - @ggeorg02
Data compiled by MDS Instructor Tom Beuzen - @tbeuzen

Modelled data provided by CMIP6: https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6
Data is supported by the Pangeo project: https://pangeo-data.github.io/pangeo-cmip6-cloud/
Observed data is supplied by the Australian SILO database: https://www.longpaddock.qld.gov.au/silo/

dsci_525_group_6's Issues

Combining data CSVs

  1. Combining data CSVs
    rubric={correctness:10,reasoning:10}
  • Use one of the following options to combine the data CSVs into a single CSV:
    • Pandas
    • Dask
  • When combining the CSV files, add an extra column called "model" that identifies the model. Tip: you can populate this column from the file name; for example, for the file "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON.

  • Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.
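The combining step could be sketched with pandas roughly as follows. This is a minimal illustration, not the team's actual notebook code: the helper names `model_name_from_path` and `combine_csvs` are hypothetical, and it assumes every per-model file follows the "<model>_daily_rainfall_NSW.csv" naming pattern from the tip above.

```python
from pathlib import Path

import pandas as pd

# Assumed file-name suffix, taken from the example name in the task text.
SUFFIX = "_daily_rainfall_NSW.csv"


def model_name_from_path(path):
    """Return the model name encoded at the start of the file name."""
    return Path(path).name.split(SUFFIX)[0]


def combine_csvs(input_dir, output_path):
    """Read every per-model CSV, tag its rows with the model name,
    and write all rows to a single combined CSV."""
    files = sorted(Path(input_dir).glob(f"*{SUFFIX}"))
    frames = [
        pd.read_csv(f).assign(model=model_name_from_path(f)) for f in files
    ]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined
```

Because this version holds every file in memory at once, it is the variant most likely to hit the memory limits mentioned in the warning. Wrapping the call in a timer and a memory profiler gives the numbers needed for the comparison with Dask.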

Downloading the data

  1. Downloading the data
    rubric={correctness:10}
  • Download the data from figshare to your local computer using the figshare API (you can use the requests library).
  • Extract the zip file programmatically, similar to how we did it in class.
    You could download and unzip the data manually, but since we learned about APIs, do it reproducibly with the requests library instead.

There are 5 files in the figshare repo. The one we want is: data.zip
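One possible shape for the download step is sketched below. It assumes the public figshare v2 endpoint for article metadata, whose JSON response lists each file with a `name` and a `download_url`. The article ID is not stated here, so `article_id` is left as a parameter, and all function names are hypothetical.

```python
import zipfile
from pathlib import Path

import requests

FIGSHARE_API = "https://api.figshare.com/v2/articles"


def find_download_url(files, wanted_name="data.zip"):
    """Pick the download URL of the wanted file from the file metadata list."""
    for entry in files:
        if entry["name"] == wanted_name:
            return entry["download_url"]
    raise FileNotFoundError(f"{wanted_name} is not listed in the article")


def download_file(url, destination, chunk_size=1 << 20):
    """Stream a large file to disk instead of holding it in memory."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(destination, "wb") as handle:
            for chunk in response.iter_content(chunk_size=chunk_size):
                handle.write(chunk)


def extract_zip(zip_path, output_dir):
    """Extract the archive and return the names of its members."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()


def download_and_extract(article_id, output_dir, wanted_name="data.zip"):
    """Look up the article, download the wanted file, and unzip it."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(f"{FIGSHARE_API}/{article_id}", timeout=30)
    response.raise_for_status()
    url = find_download_url(response.json()["files"], wanted_name)
    zip_path = output_dir / wanted_name
    download_file(url, zip_path)
    return extract_zip(zip_path, output_dir)
```

Selecting the file by name rather than by position keeps the script working even if the order of the five files in the repository changes.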

Create Repository and Project Structure

  1. Creating repository and project structure
    rubric={mechanics:10}
  • Similar to previous project courses, create a public repository under the UBC-MDS org for your project.
  • Write a brief introduction of the project in the README.
  • Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Team Work Contract

  1. Team-work contract
    rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you commit to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and include a link to it or a copy of it in your Canvas submission to prove you did this.

Load the combined CSV to memory and perform a simple EDA

  1. Load the combined CSV to memory and perform a simple EDA
    rubric={correctness:10,reasoning:10}
  • Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts):
    • Changing the dtype of your data
    • Loading only the columns you need
    • Loading in chunks
    • Dask
  • Discuss your observations.
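Three of the approaches listed above (changing dtypes, loading selected columns, and loading in chunks) can be illustrated with plain pandas. The following is a minimal sketch with hypothetical helper names; which columns to keep and which to downcast are left as parameters because the combined file's schema is not described here.

```python
import pandas as pd


def memory_mb(df):
    """Total memory used by a DataFrame, in megabytes."""
    return df.memory_usage(deep=True).sum() / 1e6


def load_reduced(csv_path, columns, float_columns):
    """Load only the requested columns and store the given ones as float32."""
    dtypes = {name: "float32" for name in float_columns}
    return pd.read_csv(csv_path, usecols=columns, dtype=dtypes)


def value_counts_in_chunks(csv_path, column, chunksize=1_000_000):
    """Compute value_counts for one column while holding only one
    chunk of the file in memory at a time."""
    partial = [
        chunk[column].value_counts()
        for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize)
    ]
    # Sum the per-chunk counts that share the same value.
    return pd.concat(partial).groupby(level=0).sum().sort_values(ascending=False)
```

Comparing `memory_mb` for the default load against `load_reduced` gives a concrete figure to include when discussing the observations.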

Perform a simple EDA in R

  1. Perform a simple EDA in R
    rubric={correctness:15,reasoning:10}
  • Pick an approach to transfer the dataframe from Python to R:
    • Parquet file
    • Feather file
    • Pandas exchange
    • Arrow exchange
  • Discuss why you chose this approach over others.
