spark-rapids-examples

A repo for Spark-related utilities and examples using the RAPIDS Accelerator, including ETL, ML/DL, etc.

Enterprise AI is built on ETL pipelines and relies on AI infrastructure to integrate and process large amounts of data. One of the fundamental purposes of the RAPIDS Accelerator is to integrate large ETL and ML/DL pipelines. The RAPIDS Accelerator for Apache Spark integrates with machine learning frameworks and algorithms such as XGBoost and PCA. Users can leverage an Apache Spark cluster with NVIDIA GPUs to accelerate ETL pipelines, then use the same infrastructure to load the data frame into single or multiple GPUs across multiple nodes and train with GPU-accelerated XGBoost or PCA.

In addition, if you are using a deep learning framework to train on tabular data with the same Apache Spark cluster, we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs. NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep learning based recommender systems. We also add MIG support to YARN so that CSPs can split an A100/A30 into multiple MIG devices, each of which appears like a normal GPU.
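As a rough illustration of what "a Spark cluster with NVIDIA GPUs" means in configuration terms, the sketch below assembles the Spark settings commonly used to enable the RAPIDS Accelerator plugin and to schedule GPUs. The helper names `rapids_spark_conf` and `to_spark_submit_args` are hypothetical and not part of this repo, and the default GPU and task counts are assumptions. Consult the RAPIDS Accelerator documentation for the values appropriate to your cluster.

```python
from typing import Dict, List


def rapids_spark_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> Dict[str, str]:
    """Return Spark settings that enable the RAPIDS Accelerator plugin.

    The defaults (1 GPU per executor, 4 concurrent tasks per GPU) are
    illustrative assumptions, not recommendations from this repo.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Load the RAPIDS Accelerator SQL plugin into Spark.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Turn GPU acceleration of SQL/DataFrame operations on.
        "spark.rapids.sql.enabled": "true",
        # Number of GPUs Spark schedules for each executor.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(gpus_per_executor / tasks_per_gpu),
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a settings dict into `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args
```

The resulting list can be spliced into a `spark-submit` command line, or the dictionary can be applied to a `SparkSession` builder one key at a time.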

Please see the RAPIDS Accelerator for Apache Spark documentation for supported Spark versions and requirements. We recommend setting up the Spark cluster with JDK 8.

Getting Started Guides

1. Microbenchmark guide

The microbenchmark on the RAPIDS Accelerator for Apache Spark identifies, tests and analyzes the queries that are best accelerated on the GPU. For detailed information, please refer to this guide.

2. XGBoost examples guide

We provide three similar XGBoost benchmarks: Mortgage, Taxi and Agaricus. Try one of the "Getting Started Guides". Please note that the guides target the Mortgage dataset as written, but with a few changes to EXAMPLE_CLASS and dataPath they can easily be adapted to the other datasets.
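To make the EXAMPLE_CLASS and dataPath switch concrete, here is a minimal sketch that builds a `spark-submit` command from those two values. Everything except the `spark-submit --class` option is an assumption made for illustration: the function name `submit_command`, the placeholder class names, the jar name, the data paths, and the `-dataPath=` argument form. Take the real values from the getting started guide you are following.

```python
from typing import List


def submit_command(example_class: str, jar: str, data_path: str) -> List[str]:
    """Build a spark-submit argv for one example.

    Only `example_class` and `data_path` change between datasets; the
    `-dataPath=` application argument form is an assumption here.
    """
    return [
        "spark-submit",
        "--class", example_class,      # EXAMPLE_CLASS in the guides
        jar,                           # application jar (hypothetical name below)
        f"-dataPath={data_path}",      # dataPath in the guides
    ]


# Hypothetical placeholders, not the repo's actual class names or paths.
mortgage = submit_command("example.mortgage.Main", "examples.jar", "/data/mortgage")
taxi = submit_command("example.taxi.Main", "examples.jar", "/data/taxi")
```

Comparing `mortgage` and `taxi` shows that only the class and the data path differ, which is the adaptation the note above describes.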

3. TensorFlow training on Horovod Spark example guide

We provide a Criteo benchmark to demonstrate ETL and deep learning training on Horovod Spark. Please refer to this guide.

4. PCA example guide

This is an example of the GPU-accelerated PCA algorithm running on Spark. For detailed information, please refer to this guide.

5. YARN 3.3.0+ MIG support

6. YARN 3.1.2 until YARN 3.3.0 MIG support

API

1. XGBoost examples API

These guides focus on the GPU-related Scala and Python API interfaces.

Troubleshooting

You can troubleshoot issues using the following guides.

Contributing

See the Contributing guide.

Contact Us

Please see the RAPIDS website for contact information.

License

This content is licensed under the Apache License 2.0.
