
workbench-example-rapids-cudf's Introduction

NVIDIA AI Workbench: Introduction

This is an NVIDIA AI Workbench example Project that provides a short introduction to the cuDF library, a Python GPU-accelerated DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers and data scientists, so they can use it to accelerate their workflows without going into the details of CUDA programming. Users in the AI Workbench Beta Program can get up and running with this Project in minutes.

Have questions? Please direct any issues, fixes, suggestions, and discussion on this project to the DevZone Members Only Forum thread here.

Project Description

Included in this project are nine tutorial notebooks. The first six are relatively easy to run; the last three (*) may require users with less than 16 GB of GPU memory to push the project to heavier hardware in order to run all of the performance benchmarks. Good news: Workbench makes this easy!

  • cudf-pandas-demo: This notebook demonstrates the acceleration that cudf.pandas gives over vanilla pandas. The example loads some data with pandas and collects performance numbers, then runs the same code again with the cudf.pandas plugin to show the speedup that is possible with NVIDIA hardware.

  • rapids_cudf_pandas_accelerator_mode: This notebook shows you how to GPU-accelerate your existing workflow with zero code changes.

  • 10min: This is a short introduction to cuDF and Dask-cuDF, geared mainly towards new users.

    cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame-style API modeled on pandas.

    Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.

    Dask-cuDF extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call dask_cudf.read_csv(...), your cluster’s GPUs do the work of parsing the CSV file(s) by calling cudf.read_csv().

    Which libraries do I use? If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF.

  • cupy-interop: This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations).

  • missing-data: This notebook discusses missing (also referred to as NA) values in cuDF. cuDF supports missing values in all dtypes. These missing values are represented by <NA> and are also referred to as “null values”.

  • Introduction_to_Strings: This notebook shows how to manipulate strings with cuDF DataFrames.

  • Introduction_to_Exploratory_Data_Analysis_using_cuDF: This notebook shows how to perform basic exploratory data analysis (EDA) with cuDF DataFrames.

  • Introduction_to_Time_Series_Data_Analysis_using_cuDF: This notebook shows how to do EDA on time-series DataFrames with cuDF.

  • performance-comparisons (*): This notebook compares the performance of cuDF and pandas on identical data sizes. It primarily showcases the speedup factors users can see when equivalent pandas APIs are run on GPUs using cuDF. The notebook is written to measure performance on NVIDIA GPUs with large memory. Performance results may vary by data size, as well as by the CPU and GPU used.
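
The speedup factor that the performance notebooks report can be measured in a simple way: time the same operation with two implementations and divide the baseline time by the candidate time. The sketch below is not taken from the notebooks; it uses only the Python standard library, and the names best_time, speedup, slow_sum, and fast_sum are hypothetical stand-ins. In practice the two callables would wrap the pandas and cuDF versions of the same DataFrame operation.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def speedup(baseline_fn, candidate_fn, repeats=3):
    """Speedup factor: baseline time divided by candidate time."""
    baseline = best_time(baseline_fn, repeats)
    candidate = best_time(candidate_fn, repeats)
    # Guard against a zero reading from the timer on very fast calls.
    return baseline / max(candidate, 1e-12)


# Stand-in workloads (assumptions for illustration only): an explicit
# Python loop versus the built-in sum over the same data.
data = list(range(200_000))


def slow_sum():
    total = 0
    for x in data:
        total += x
    return total


def fast_sum():
    return sum(data)


factor = speedup(slow_sum, fast_sum)
print(f"speedup: {factor:.1f}x")
```

Taking the best of several repeats reduces noise from one-off costs such as warm-up. The measured factor depends on the data size and the hardware, as noted above.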


Important Considerations:

  • The notebook titled performance-comparisons.ipynb may take a long time to execute on laptop or workstation hardware, because it runs benchmarks and DataFrame operations on massive datasets using both pandas and cuDF. Feel free to adjust the num_rows variable as needed.

  • If working locally on a laptop or workstation, also consider pushing this project to heavier hardware (original notebook authors used 2x H100 GPUs) to run this notebook. Good news: NVIDIA AI Workbench makes this push easy!
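
When adjusting num_rows for a smaller GPU, a rough back-of-the-envelope estimate can help pick a starting value. The sketch below is an assumption-laden illustration, not the notebook's own logic: it assumes every column holds 8-byte values (such as int64 or float64), and it reserves only a fraction of GPU memory for the input data so that intermediate results have room. The function names, the column count, and the 0.25 headroom fraction are all hypothetical.

```python
def estimate_bytes(num_rows, num_columns, bytes_per_value=8):
    """Approximate in-memory size of a table of fixed-width numeric columns."""
    return num_rows * num_columns * bytes_per_value


def max_rows_for_budget(memory_gb, num_columns, bytes_per_value=8, headroom=0.25):
    """Largest row count whose estimated size fits in `headroom` of the memory.

    `headroom` is the fraction of total memory given to the input data;
    the rest is left for intermediate results. The default is an assumption.
    """
    budget_bytes = memory_gb * 1024**3 * headroom
    return int(budget_bytes // (num_columns * bytes_per_value))


# Example: a 16 GB GPU and a hypothetical 10-column numeric table.
num_rows = max_rows_for_budget(memory_gb=16, num_columns=10)
print(num_rows, estimate_bytes(num_rows, 10))
```

The estimate ignores string columns, null masks, and library overhead, so treat the result as an upper bound to tune downward if the notebook runs out of memory.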


System Requirements:

  • Operating System: Ubuntu 22.04
  • CPU requirements: None, tested with Intel® Xeon® Gold 6240R CPU @ 2.40GHz
  • GPU requirements: Any NVIDIA training GPU, tested with NVIDIA A100-40GB
  • NVIDIA driver requirements: Latest driver version
  • Storage requirements: 40GB

Quickstart

The notebook(s) in this project were adapted from the RAPIDS cuDF GitHub repository, which can be found here.

If you have NVIDIA AI Workbench already installed, you can use this Project in AI Workbench on your choice of machine by:

  1. Forking this Project to your own GitHub namespace and copying the clone link

    https://github.com/[your_namespace]/<project_name>.git

  2. Opening a shell and activating the Context you want to clone into by

    $ nvwb list contexts
    
    $ nvwb activate <desired_context>
    
  3. Cloning this Project onto your desired machine by running

    $ nvwb clone project <your_project_url>
    
  4. Opening the Project by

    $ nvwb list projects
    
    $ nvwb open <project_name>
    
  5. Starting JupyterLab by

    $ nvwb start jupyterlab
    
  6. Navigating to the code directory of the project, then opening the notebooks provided and working through them at your own pace. Happy coding!


Tip: Use nvwb help to see a full list of commands.


Tested On

This notebook has been tested with an NVIDIA A100-40GB GPU and an Intel® Xeon® Gold 6240R CPU (2.40GHz) on the following version of NVIDIA AI Workbench: nvwb 0.2.66 (internal; linux; amd64; go1.18.10; Tue Sep 12 18:50:21 UTC 2023)

License

This NVIDIA AI Workbench example project is under the Apache 2.0 License.

workbench-example-rapids-cudf's People

Contributors

nv-edwli, jacobtomlinson, taureandyernv
