gpuservers's Introduction

iBanks NVIDIA powered GPU cluster

Nodes and Access

The head node ibanks.hep.caltech.edu runs the NFS home and data server. It has a public (regular network) IP and a private (10G) IP.

Worker nodes are as follows, in chronological order of creation:

  • titans.hep.caltech.edu is an MSI desktop with 2 TB of local disk, and runs 2 NVIDIA GeForce GTX Titan X
  • passed-pawn-klmx.hep.caltech.edu is a Cocolink server with 200 GB of local disk, and runs 8 NVIDIA Titan X (Pascal)
  • culture-plate-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, and runs 8 NVIDIA GeForce GTX 1080
  • imperium-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, and runs 8 NVIDIA GeForce GTX 1080
  • flere-imsaho-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, and runs 6 NVIDIA Titan Xp (Pascal)

All servers have a public (regular network) IP and a private (10G) IP. SSH key authentication is preferred. Please let the admins know if you need help setting this up.
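One way to make key-based login convenient is a set of ~/.ssh/config entries. The sketch below prints one stanza per node listed above. The helper name ibanks_ssh_config, the login name jdoe, and the key path ~/.ssh/id_ed25519 are illustrative assumptions, not cluster conventions.

```shell
# Hypothetical helper: print ~/.ssh/config stanzas for the cluster nodes.
# $1 is your cluster login name.
ibanks_ssh_config() {
    for node in ibanks titans passed-pawn-klmx culture-plate-sm imperium-sm flere-imsaho-sm; do
        printf 'Host %s\n' "$node"
        printf '    HostName %s.hep.caltech.edu\n' "$node"
        printf '    User %s\n' "$1"
        # Assumed key location; adjust to wherever your private key lives.
        printf '    IdentityFile ~/.ssh/id_ed25519\n\n'
    done
}

# Review the output first, then append it yourself, for example:
#   ibanks_ssh_config jdoe >> ~/.ssh/config
```

With such entries in place, "ssh titans" would be enough to reach titans.hep.caltech.edu with the configured user and key.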

Linux passwords are not centralized yet, so you have to change yours locally on each machine to secure access to Jupyter notebooks.

Credits

If you are a user of the cluster, we are happy to help you make progress. When producing a publication or public presentation, please be so kind as to notify Prof. Spiropulu and Dr. Vlimant, for accounting purposes. Please include the following LaTeX acknowledgement of support:

Part of this work was conducted at "\textit{iBanks}", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "\textit{iBanks}".

Data Storage

The home directory should be used for software. Although there is room, please avoid storing large amounts of data in your home directory.

The /bigdata/ volume is mounted on all nodes. It is a 20 TB RAID array mounted over NFS. Please use the /bigdata/shared/ directory, and contact the admins if you need a private directory.

The /data/ volume is mounted on some nodes, and it is not on SSD on all of them. This is the preferred temporary location for data that needs intensive I/O.

The

Setup

It is important to note that I/O on the NFS-mounted volumes is not as efficient as on local disk, so please take care and monitor the performance of your applications.

For IPython, the following directory should be on local disk:

mkdir -p /tmp/$USER/ipython
ln -s /tmp/$USER/ipython ~/.ipython

For CUDA, the same applies to the compute cache:

mkdir -p /tmp/$USER/cuda/
export CUDA_CACHE_PATH=/tmp/$USER/cuda/

It is recommended to put the export CUDA_CACHE_PATH line in your login file.
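As a minimal sketch, the lines below could go into a login file such as ~/.bashrc. The variable LOCAL_SCRATCH is a hypothetical helper introduced here for illustration. Because /tmp is local to each node, the directory is recreated at every login.

```shell
# Sketch for ~/.bashrc (or whichever login file your shell reads).
# LOCAL_SCRATCH is a hypothetical helper variable, not a cluster convention.
LOCAL_SCRATCH="${LOCAL_SCRATCH:-/tmp/${USER:-$(id -un)}}"

# /tmp is local to each node, so make sure the directory exists on this one.
mkdir -p "$LOCAL_SCRATCH/cuda"

# Point the CUDA compilation cache at local disk instead of the NFS home.
export CUDA_CACHE_PATH="$LOCAL_SCRATCH/cuda/"
```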

To use only a selected GPU, run nvidia-smi or gpustat to see GPU utilization, then set export CUDA_VISIBLE_DEVICES=n, where n is the index of the GPU you want to use. In Python, one can either set the environment variable or use import setGPU (which picks one device automatically).
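The manual selection described above can be scripted. The sketch below assumes that picking the GPU with the lowest reported utilization is good enough; it does not look at memory use, and the function name least_busy_gpu is hypothetical. The nvidia-smi query prints one "index, utilization" line per device.

```shell
# Hypothetical helper: read "index, utilization" lines on stdin and
# print the index of the GPU with the lowest utilization (first one wins ties).
least_busy_gpu() {
    awk -F', *' '
        NR == 1 || ($2 + 0) < min { min = $2 + 0; idx = $1 }
        END { if (NR) print idx }
    '
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu="$(nvidia-smi --query-gpu=index,utilization.gpu \
                      --format=csv,noheader,nounits | least_busy_gpu)"
    # Leave the variable untouched if the query returned nothing.
    if [ -n "$gpu" ]; then
        export CUDA_VISIBLE_DEVICES="$gpu"
    fi
fi
```

Utilization is a snapshot, so another user may start a job on the same device a moment later; checking nvidia-smi again after your job starts remains a good habit.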

Software

Theano

The compilation directory is better kept on local disk, for example under /tmp:

mkdir -p /tmp/$USER/theano_compile
rm -r ~/.theano
ln -s /tmp/$USER/theano_compile ~/.theano

The theano rc file should look like

[nvcc]
[global]
device=cuda

One can add base_compiledir=/tmp/$USER/theano_compile/ directly in the global section if need be.
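As an alternative sketch, the same two settings can be passed through Theano's THEANO_FLAGS environment variable instead of the rc file. The variable THEANO_COMPILEDIR is a hypothetical name used only to avoid repeating the path.

```shell
# THEANO_COMPILEDIR is a hypothetical helper variable, not a Theano setting.
THEANO_COMPILEDIR="/tmp/${USER:-$(id -un)}/theano_compile"

# The directory lives on node-local disk, so create it on the current node.
mkdir -p "$THEANO_COMPILEDIR"

# THEANO_FLAGS takes comma-separated key=value pairs.
export THEANO_FLAGS="device=cuda,base_compiledir=$THEANO_COMPILEDIR"
```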

TensorFlow

TensorFlow is greedy in its use of GPUs: by default it claims every device it can see. Unless device usage is explicitly controlled within the application, it is mandatory to set export CUDA_VISIBLE_DEVICES=n (where n is a device index, or a comma-separated list of indices) so that only the selected devices are used.
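To limit the restriction to a single command rather than the whole shell session, a small wrapper can help. This is a sketch; the function name run_on_gpus and the script name train.py are hypothetical.

```shell
# Hypothetical wrapper: run a command with only the listed GPUs visible.
# Usage: run_on_gpus 0,3 python train.py
run_on_gpus() {
    gpu_list="$1"
    shift
    # env sets the variable for the child process only, not for this shell.
    env CUDA_VISIBLE_DEVICES="$gpu_list" "$@"
}
```

Inside the process, the visible devices are renumbered from 0, so with run_on_gpus 0,3 the application sees two devices, indexed 0 and 1.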

Jupyter Hub

We are building Docker images, which are posted on Docker Hub: https://hub.docker.com/u/caltechcms/

Jupyter Notebook

Users can start a Jupyter notebook server on each machine with the command

source /bigdata/shared/Software/jupyter/restart.sh

This prints a URL to which you can connect. The password is your Linux password. If you connect using an SSH key, as recommended, please contact an admin to get a password. The port assigned to you is defined in /bigdata/shared/Software/jupyter/ports; if you are not listed there, please contact an admin.

MPI

MPI is installed on the cluster and allows jobs to run across the nodes. The main requirement is password-less access to all machines. Set up your SSH key for authentication to the hosts

mpi-ibanks
mpi-culture-plate-sm
mpi-imperium-sm
mpi-passed-pawn-klmx
mpi-titans
mpi-flere-imsaho-sm

which run over the private IP.
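Before launching an MPI job, it can help to confirm that every host in the list above accepts key-based login. The sketch below is one way to do that; the function name check_mpi_hosts and the SSH override variable are hypothetical. BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
# Hypothetical check: report which MPI hosts accept key-based login.
# Set SSH to another command to override the ssh binary (used for dry runs).
check_mpi_hosts() {
    failed=0
    for host in mpi-ibanks mpi-culture-plate-sm mpi-imperium-sm \
                mpi-passed-pawn-klmx mpi-titans mpi-flere-imsaho-sm; do
        # </dev/null keeps ssh from consuming the caller's standard input.
        if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    return "$failed"
}

# Run it from a cluster node with:
#   check_mpi_hosts
```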

Create the .openmpi directory in your home directory and link the file /bigdata/shared/Software/mpi/mca-params.conf into it.
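A minimal sketch of that step, assuming the directory belongs in your home directory (where Open MPI looks for per-user parameters); the variable name conf is illustrative. The link is only created when nothing exists at that path, so an existing personal configuration is left alone.

```shell
# Per-user Open MPI parameter file location.
conf="$HOME/.openmpi/mca-params.conf"

mkdir -p "$HOME/.openmpi"

# Create the link only if no file or (possibly dangling) link is already there.
if [ ! -e "$conf" ] && [ ! -L "$conf" ]; then
    ln -s /bigdata/shared/Software/mpi/mca-params.conf "$conf"
fi
```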

Run an MPI test with the commands

mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 18 /bigdata/shared/Software/mpi/mpi4py-examples/03-scatter-gather

mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 18 /bigdata/shared/Software/mpi/mpi4py-examples/08-matrix-matrix-product

mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 10 /bigdata/shared/Software/mpi/keras_mnist.py

which should all run. Contact the admins if they do not (the debugging options are --mca odls_base_verbose 100 --mca btl_base_verbose 100).

The default number of slots per node is the number of GPUs (as this is the primary usage). One can override this limit by running

mpirun --map-by node --hostfile /bigdata/shared/Software/mpi/hostfile -n 100 /bigdata/shared/Software/mpi/mpi4py-examples/08-matrix-matrix-product

Contributors

dkcira, p234a137, vlimant
