
a-unified-cpu-gpu-protocol-for-gnn-training

Setup

  1. Set up a Python environment (>=3.11). Install PyTorch (>=2.0.1) and Deep Graph Library (DGL, >=1.1).
  2. Install TCMalloc: https://google.github.io/tcmalloc/quickstart.html
    Note: Bazel must be installed first (this is also covered in the link above).
    Note 2: An alternative way to install TCMalloc is described in AUTOMATIC1111/stable-diffusion-webui#10117.
  3. After installation, preload TCMalloc when running your code. For example:
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.3 python main.py
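A quick sanity check can help confirm that the preload was actually requested before a long training run starts. The following is a minimal sketch, not part of this repository: the helper name tcmalloc_preloaded is hypothetical, and it only inspects the LD_PRELOAD environment variable, so it cannot tell whether the dynamic loader found the library at that path.

```python
import os


def tcmalloc_preloaded(environ=None):
    """Return True if LD_PRELOAD names a library whose file name contains 'tcmalloc'.

    Hypothetical helper for illustration. It checks the environment only,
    not whether the loader actually mapped the library into the process.
    """
    env = os.environ if environ is None else environ
    # LD_PRELOAD entries may be separated by spaces or colons.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    return any("tcmalloc" in os.path.basename(path) for path in entries)
```

Calling tcmalloc_preloaded() near the top of a script and printing a warning when it returns False is one way to catch a forgotten LD_PRELOAD.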

Usage

1. Download the dataset:

python dataset.py --dataset ogbn-products --data_path your_data_path

2. (Optional) Load the feature matrix into shared memory:

The MAG240M dataset has a large feature matrix (380G). To reduce memory consumption and feature loading time, we first load the required data into shared memory, and then run one or more trials at the same time.

python load_mag_to_shm.py --data_path your_data_path

Note that load_mag_to_shm.py requires a large amount of available memory (roughly 800G) while copying data from disk to shared memory, and a smaller amount (roughly 400G) once the copy completes. Make sure enough memory is free for the duration of the data movement.
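Because the copy needs far more memory than the final shared-memory footprint, it can be worth checking free memory before starting it. The sketch below is an illustration only and is not shipped with this repository: the function names are hypothetical, it assumes a Linux host with /proc/meminfo, and it treats the "800G" figure above as 800 GiB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields, including MemAvailable, are reported in kB.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def enough_memory(meminfo_text, required_gib):
    """Return True if MemAvailable covers required_gib (GiB, an assumed unit)."""
    available_kib = parse_meminfo(meminfo_text).get("MemAvailable", 0)
    return available_kib >= required_gib * 1024 * 1024


def host_has_enough_memory(required_gib=800, path="/proc/meminfo"):
    """Check the current Linux host; 800 is the rough peak quoted above."""
    with open(path) as f:
        return enough_memory(f.read(), required_gib)
```

Running host_has_enough_memory() before load_mag_to_shm.py gives an early warning instead of a failure partway through the copy.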

3. Preprocess the training node workload:

python workload.py --sampler neighbor --dataset ogbn-products --data_path your_data_path 

4. Train GNNs on CPUs and GPUs:

python main.py --dataset ogbn-products --data_path your_data_path \
               --cpu_process 2 --gpu_process 1 \
               --batch_type dynamic --cached_ratio 0.2 \
               --sampler neighbor --model sage

Important Arguments:

  • --dataset: Training dataset. Available choices [reddit, ogbn-products, mag240M]
  • --cpu_process: Number of CPU computing processes used in training. Available choices [0, 1, 2, 4]
  • --gpu_process: Number of GPU computing processes (devices) used in training. Available choices [0, 1]
  • --batch_type: Workload assignment strategy. Available choices ['none', 'static', 'dynamic', 'dynamic_hard']
  • --cached_ratio: Ratio of node features cached on the GPU.
  • --sampler: Mini-batch sampling algorithm. Available choices [shadow, neighbor]
  • --model: GNN model. Available choices [gcn, sage]

Hint: the arguments --cpu_process and --gpu_process can be set to 0 for baseline comparison.
Note: although we have only tested our library with three datasets, two samplers, and two types of GNN model, other setups should also work because the library is compatible with DGL. Please refer to the DGL documentation for a full list of available samplers, GNN models, etc.
Note 2: a large amount of memory (512 GB or above) is highly recommended.
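The documented arguments and their choices can be summarized as an argparse parser. This is only a sketch of the interface described above, not the repository's actual main.py. The default values (taken from the example command), the required flags, and the restriction of --cached_ratio to the range [0, 1] are all assumptions.

```python
import argparse


def ratio(value):
    """Parse a float and require 0 <= value <= 1 (the range is an assumption)."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("cached_ratio must be between 0 and 1")
    return number


def build_parser():
    """Build a parser mirroring the flags listed under Important Arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    parser.add_argument("--dataset", required=True,
                        choices=["reddit", "ogbn-products", "mag240M"])
    parser.add_argument("--data_path", required=True)
    # Defaults below come from the example command; they are assumptions.
    parser.add_argument("--cpu_process", type=int, choices=[0, 1, 2, 4], default=2)
    parser.add_argument("--gpu_process", type=int, choices=[0, 1], default=1)
    parser.add_argument("--batch_type", default="dynamic",
                        choices=["none", "static", "dynamic", "dynamic_hard"])
    parser.add_argument("--cached_ratio", type=ratio, default=0.2)
    parser.add_argument("--sampler", choices=["shadow", "neighbor"], default="neighbor")
    parser.add_argument("--model", choices=["gcn", "sage"], default="sage")
    return parser
```

For a baseline run as described in the hint, one of the process counts is set to 0, for example build_parser().parse_args(["--dataset", "reddit", "--data_path", "your_data_path", "--cpu_process", "0"]).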

Contributors

hydrapse, jasonlin316
