RAPIDS Notebooks-Contrib

Intro

Welcome to the contributed notebooks repo! (formerly known as Notebooks-Extended)

The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer; learn why, how, and when including RAPIDS in a data science pipeline makes sense; and share community contributions of RAPIDS knowledge. The differences between this repo and the Notebooks Repo are:

These are vetted, community-contributed notebooks (includes RAPIDS team member contributions).
These notebooks won't run on airgapped systems, whereas running airgapped is one of our container requirements. Many RAPIDS notebooks use additional PyData ecosystem packages and include code for downloading datasets, so they require network connectivity. If you are running on a system with no network access, please download all the data that you plan to use ahead of time, or simply use the core notebooks repo.
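For the "download ahead of time" route, a minimal sketch of a caching download helper is shown below. It is not part of this repo: the function name `fetch_dataset` and the `data` destination folder are assumptions made for illustration, and you would substitute the dataset URLs that the notebooks you plan to run actually use.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dataset(url: str, dest_dir: str = "data") -> Path:
    """Download `url` into `dest_dir` unless a file with that name is already there.

    Hypothetical helper for illustration; it is not shipped with notebooks-contrib.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Use the last path component of the URL as the local file name.
    target = dest / url.rstrip("/").split("/")[-1]
    if target.exists():
        return target  # already downloaded, so no network access is needed

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

Run this once per dataset on a machine with network access, then copy the resulting folder to the airgapped system. Calling `fetch_dataset` again with the same URL returns the cached path without touching the network.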

Installation

Please use the BUILD.md to check the pre-requisite packages and installation steps.

Contributing

Please see our guide for contributing to notebooks-contrib.

Exploring the Repo

getting_started_notebooks - “how to start using RAPIDS”. Contains notebooks showing "hello worlds", getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
intermediate_notebooks - “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
advanced_notebooks - "how to master RAPIDS". Contains notebooks showing kernel customization and advanced end-to-end workflows.
colab_notebooks - contains Colab versions of popular notebooks that you can quickly try out in the browser
blog notebooks - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities
conference notebooks - contains notebooks used in conferences, such as GTC
competition notebooks - contains notebooks used in competitions, such as Kaggle

/data contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites.

The /data folder is also symlinked into /rapids/notebooks/extended/data so you can browse it from JupyterLab's UI.

Our Notebooks

Getting Started Notebooks:

Folder	Notebook Title	Description
basics	Dask_Hello_World	This notebook shows how to quickly set up Dask and run a "Hello World" example.
basics	Getting_Started_with_cuDF	This notebook shows how to get started with GPU DataFrames using cuDF in RAPIDS.
basics	hello_streamz	This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API.
basics	streamz_weblogs	This notebook provides an example of how to do streaming web-log processing with RAPIDS, Dask, and Streamz.
intro_tutorials	01_Introduction_to_RAPIDS	This notebook shows at a high level what each of the packages in RAPIDS is and what it does.
intro_tutorials	02_Introduction_to_cuDF	This notebook shows how to work with cuDF DataFrames in RAPIDS.
intro_tutorials	03_Introduction_to_Dask	This notebook shows how to work with Dask using basic Python primitives like integers and strings.
intro_tutorials	04_Introduction_to_Dask_using_cuDF_DataFrames	This notebook shows how to work with cuDF DataFrames using Dask.
intro_tutorials	05_Introduction_to_Dask_cuDF	This notebook shows how to work with cuDF DataFrames distributed across multiple GPUs using Dask.
intro_tutorials	06_Introduction_to_Supervised_Learning	This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS.
intro_tutorials	07_Introduction_to_XGBoost	This notebook shows how to work with GPU accelerated XGBoost in RAPIDS.
intro_tutorials	08_Introduction_to_Dask_XGBoost	This notebook shows how to work with Dask XGBoost in RAPIDS.
intro_tutorials	09_Introduction_to_Dimensionality_Reduction	This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS.
intro_tutorials	10_Introduction_to_Clustering	This notebook shows how to do GPU accelerated Clustering in RAPIDS.

Intermediate Notebooks:

Folder	Notebook Title	Description
examples	linear_regression_demo.ipynb	In this notebook we will show how to use linear regression and its GPU accelerated implementation present in RAPIDS.
examples	ridge_regression_demo	Demonstration of using both NetworkX and cuGraph to compute the number of Triangles in our test dataset.
examples	umap_demo	In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS.
examples	rf_demo	Demonstration of using both cuml and sklearn to train a RandomForestClassifier on the Higgs dataset.
examples	weather	Demonstration of using Dask and cuDF to process and analyze weather history.
E2E-> mortgage	mortgage_e2e	This is an end to end notebook consisting of `ETL`, `data conversion` and `machine learning for training` operations performed on the mortgage dataset.
E2E-> mortgage	mortgage_e2e_deep_learning	This notebook combines the RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency.
E2E-> taxi	NYCTaxi	Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost. Blog
E2E-> synthetic_3D	rapids_ml_workflow_demo	A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU].
E2E-> census	census_education2income_demo	In this notebook we use 50 years of census data to see how education affects income.
E2E-> gdelt	Ridge_regression_with_feature_encoding	An end to end example using ridge regression on the gdelt dataset. Includes ETL with `cuDF`, feature scaling/encoding, and model training and evaluation with `cuML`
benchmarks	cuml_benchmarks	The purpose of this notebook is to benchmark all of the single GPU cuML algorithms against their scikit-learn counterparts, while also providing the ability to find and verify upper bounds.
benchmarks-> cugraph_benchmarks	louvain_benchmark	This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX.
benchmarks-> cugraph_benchmarks	pagerank_benchmark	This notebook benchmarks performance improvement of running PageRank within cuGraph against NetworkX.

Advanced Notebooks:

Folder	Notebook Title	Description
tutorials	rapids_customized_kernels	This notebook shows how to create customized kernels using CUDA to make your workflow in RAPIDS even faster.

Blog Notebooks:

Folder	Notebook Title	Description
cyber -> flow_classification	flow_classification_rapids	The `cyber` folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost.
cyber -> network_mapping	lanl_network_mapping_using_rapids	The `cyber` folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw Windows event logs using cuDF and uses cuGraph's PageRank model to build a network graph.
cyber -> raw_data_generator	run_raw_data_generator	The `cyber` folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. The notebook is used to showcase how to generate raw logs from the parsed LANL 2017 JSON data. The intent is to use the raw data to demonstrate parsing capabilities using cuDF.
databricks	RAPIDS_PCA_demo_avro_read	The `databricks` folder is the companion file repository to the blog RAPIDS can now be accessed on Databricks Unified Analytics Platform by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebook's purpose is to showcase RAPIDS on Databricks using their sample datasets and to show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import.
regression	regression_blog_notebook	This is the companion notebook for the blog Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series by Paul Mahler. It showcases an end to end notebook using the try_this dataset and cuML's implementation of ridge regression.
nlp -> show_me_the_word_count_gutenberg	show_me_the_word_count_gutenberg	This is the notebook for the blog Show Me The Word Count by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases the NLP pre-processing capabilities of nvstrings+cudf on the Gutenberg dataset.

Conference Notebooks:

Folder	Notebook Title	Description
GTC_SJ_2019	GTC_tutorial_instructor	This is the instructor notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019. It contains all the demonstrated solutions.
GTC_SJ_2019	GTC_tutorial_student	This is the exercise-filled student notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019.

Competition Notebooks:

Folder	Notebook Title	Description
kaggle-> landmark	cudf_stratifiedKfold_1000x_speedup	This notebook demonstrates the cuDF implementation of a stratified kfold operation that achieved a 1000x speedup for the Google Landmark Recognition competition.
kaggle-> malware	malware_time_column_explore	This notebook studies the difference between train and test datasets in order to develop a robust validation scheme.
kaggle-> malware	rapids_solution_gpu_only	This notebook contains the GPU based RAPIDS solution to achieve 0.695 private LB in 12 minutes.
kaggle-> malware	rapids_solution_gpu_vs_cpu	This notebook compares the CPU versus the GPU solution to achieve 0.695 private LB.
kaggle-> plasticc-> notebooks	rapids_lsst_full_demo	This notebook demos the full CPU and GPU implementation of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog
kaggle-> plasticc-> notebooks	rapids_lsst_gpu_only_demo	This GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog
kaggle-> santander	cudf_tf_demo	This financial industry facing notebook is the cudf-tensorflow approach from the RAPIDS.ai team for Santander Customer Transaction Prediction. Placed 17/8808. Blog
kaggle-> santander	E2E_santander_pandas	This financial data modelling notebook is the Pandas based version of the RAPIDS.ai team's best single model for the Santander Customer Transaction Prediction competition. Placed 17/8808. Blog
kaggle-> santander	E2E_santander	This financial data modelling notebook is the cuDF based version of the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. It allows you to compare cuDF performance to the Pandas version. Placed 17/8808. Blog.
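
For readers unfamiliar with the stratified kfold operation mentioned for the landmark notebook above, here is a minimal CPU-only sketch of the idea in plain Python. It is not the cuDF implementation from that notebook; the function name `stratified_kfold` and its parameters are assumptions chosen for illustration. The point is only that each test fold keeps roughly the same share of every label as the full dataset.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds preserve label shares.

    Illustrative sketch only; the notebook's cuDF version works differently.
    """
    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each label's rows round-robin so every fold
        # receives a near-equal number of rows with that label.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx
```

With 10 rows of one label and 5 of another and `n_splits=5`, every test fold holds 2 rows of the first label and 1 of the second. When a label's count is not divisible by `n_splits`, the earlier folds receive the extra rows.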

Additional Information

The intermediate_notebooks folder also includes a small subset of the Mortgage Dataset used in the notebooks and the full image set from the Fashion MNIST dataset.
utils: contains a set of useful scripts for interacting with RAPIDS
For our notebook examples and tutorials found in our standard containers, please see the Notebooks Repo.
