songe

Son of Grid Engine with GPU support

This is based on Son of Grid Engine 8.1.9, with added support for GPU scheduling. It works for serial, parallel, MPI, and array jobs. (See also this repository: https://github.com/prod-feng/sge-gpu/tree/master)

It supports mixing GPU and CPU jobs on the same nodes, so that users can share a node's resources at the same time.

Compilation was tested using the following commands:

#go to the source folder

cd $SGE_CODE/source

./aimk -no-java -no-jni -no-secure -spool-classic -only-depend -no-qmon -debug

./scripts/zerodepend

./aimk -no-java -no-jni -no-secure -spool-classic  -no-qmon -debug depend

./aimk -no-java -no-jni -no-secure -spool-classic  -no-qmon -debug

#to install:

export SGE_ROOT=/where/you/want

scripts/distinst -all -local -noexit
(installs both sgemaster and execd, good for testing)

#debug:

source ./dist/util/dl.sh
 
dl 10

(then restart sgemaster, and/or execd services)

(also see here: https://github.com/prod-feng/sge-gpu/tree/master)

To set up a hybrid CPU and GPU cluster

To run CPU jobs and GPU jobs simultaneously on the same node, it is best to reserve 1 CPU core per GPU. This is how I do it (suppose nodeA has 40 CPU cores and 2 GPUs; there should be other ways):

  1. Define an "rcpus" consumable:
qconf -sc|grep rcpu
rcpus               rcpus      INT         <=    YES         YES        10       0

Here the default debit is "10".

  2. Set the value of "rcpus" as follows:
(40 cores - 2 cores reserved for GPUs) x 10 = 38 x 10 = 380

rcpus = 380 + 2 = 382

  3. Update the node so that its configuration reads:
qconf -se nodeA
...
complex_values        slots=40,rcpus=382,ngpus=2
...
  4. Name CPU queues as "cpu*", and GPU queues as "gpu*".
  5. Add a JSV to the SGE configuration:
qconf -sconf

...
jsv_url                      /somewhere/mygpujsv.pl
...

In this "mygpujsv.pl" file, add code such as:


        if ( $key=~/gpu/) {
          jsv_sub_add_param('l_hard', 'rcpus',1);
        }
.......
(or also add the following, to be sure)

        if ( $key=~/cpu/) {
          jsv_sub_add_param('l_hard', 'rcpus',10);
        }

In this way, CPU jobs debit 10 "rcpus" per slot, so they can use at most 38 cores (380 of the 382 units), while GPU jobs debit 1 "rcpus" each. This reserves 2 CPU cores for GPU jobs.
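
The arithmetic behind the "rcpus" value can be checked with a small shell sketch. The helper names below are hypothetical and are not part of SGE or of this patch. The queue matching only mirrors the "cpu*"/"gpu*" naming convention described above, and the fallback to 10 for other queue names is an assumption based on the consumable's default.

```shell
#!/bin/sh
# Sketch of the "rcpus" bookkeeping (hypothetical helper names).

# rcpus_total CORES GPUS DEBIT
# One core is reserved per GPU. Each remaining core is worth DEBIT units,
# and each reserved core is worth 1 unit.
rcpus_total() {
    cores=$1; gpus=$2; debit=$3
    echo $(( (cores - gpus) * debit + gpus ))
}

# rcpus_request QUEUE_NAME
# The "rcpus" amount the JSV would add, keyed on the queue name.
rcpus_request() {
    case "$1" in
        gpu*) echo 1 ;;
        cpu*) echo 10 ;;
        *)    echo 10 ;;   # assumption: same as the consumable's default
    esac
}

# max_cpu_slots TOTAL DEBIT
# How many CPU slots fit into TOTAL units.
max_cpu_slots() {
    echo $(( $1 / $2 ))
}

total=$(rcpus_total 40 2 10)
cpu_debit=$(rcpus_request cpu.q)
echo "rcpus=$total"
echo "max CPU slots: $(max_cpu_slots "$total" "$cpu_debit")"
echo "units left for GPU jobs: $(( total % cpu_debit ))"
```

For the 40-core, 2-GPU node used in the example, this gives rcpus=382, at most 38 CPU slots, and 2 units left over for GPU jobs.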

====================================

Update 04/10/2024

Improved support for fractional requests such as -l ngpus=0.5. This is useful for multithreaded GPU jobs which need multiple CPU cores but only 1 or a few GPUs.

-pe openmp 10    # requests 10 cpu cores

-l ngpus=0.2     # so here 10 x 0.2 = 2 GPUs for this job on the same node.
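
The slot-times-fraction rule can be tried out with a short shell sketch. The function name is hypothetical. The source does not say how the patch treats products that are not whole numbers, so the rounding here is an assumption made only to print an integer.

```shell
#!/bin/sh
# gpus_for_job SLOTS NGPUS_PER_SLOT
# Multiply the number of slots by the per-slot ngpus request.
# Rounding to the nearest integer is an assumption of this sketch.
gpus_for_job() {
    awk -v slots="$1" -v per_slot="$2" \
        'BEGIN { printf "%.0f\n", slots * per_slot }'
}

gpus_for_job 10 0.2   # -pe openmp 10 -l ngpus=0.2
gpus_for_job 4 0.5    # 4 slots sharing 2 GPUs
gpus_for_job 4 1      # one GPU per slot
```

The three calls print 2, 2 and 4.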

Added protection for multithreading (MT) with multiple worker threads. Alternatively, set the following in $SGE_ROOT/default/common/bootstrap:

listener_threads       1
worker_threads         1

Then restart the sgemaster service. GPU jobs that are already running need to be resubmitted (the GPU info in SGE is not spooled).
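
A quick way to confirm the two thread settings is to read them back from the bootstrap file. This is a minimal sketch with hypothetical helper names. It assumes the file holds one "key value" pair per line, as in the two lines shown above, and that the cell is named "default".

```shell
#!/bin/sh
# bootstrap_value FILE KEY
# Print the value stored for KEY in a "key value" style file.
bootstrap_value() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# single_threaded FILE
# Succeed only if both listener_threads and worker_threads are 1.
single_threaded() {
    [ "$(bootstrap_value "$1" listener_threads)" = "1" ] &&
    [ "$(bootstrap_value "$1" worker_threads)" = "1" ]
}

# Only inspect the real file if it exists and is readable.
file="${SGE_ROOT:-/nonexistent}/default/common/bootstrap"
if [ -r "$file" ]; then
    if single_threaded "$file"; then
        echo "bootstrap uses 1 listener thread and 1 worker thread"
    else
        echo "bootstrap uses more than one listener or worker thread"
    fi
fi
```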

NO Guarantee!

======================================

Update: Oct., 2018

Added patched_files_ge2011.11.p1.0.1.tar.gz.

Fixed a bug so that GPU array jobs are supported properly. Only for GE2011.p1 now.

NO Guarantee!

======================================

Update: some bug fixes. April, 2018.

======================================

This package is designed to add GPU scheduling capability to GE2011.11p1 (not applicable to other versions yet). With this patch, SGE can support GPUs without any external wrapper tools.

A patch for Son of Grid Engine 8.18 is also available at the following link.

First, recompile and rebuild the source code.

Second, you need to set up a consumable named "ngpus" (the name is hard-coded in the patched files) and assign a value for it to each node, like the following:

$ qconf -sc

#name               shortcut   type        relop requestable consumable default  urgency 
 ...
ngpus               gpu        INT         <=    YES         YES        0        5000

and,

qconf -se node1
...
complex_values        slots=12,ngpus=2,...

When you submit a GPU job, you need to run the command:

qsub -l ngpus=1 ...

This also works for parallel jobs.

qsub -pe openmpi 4 -l ngpus=1 ...

Here, "-l ngpus=1" requests 1 GPU per process.

It also supports scheduling multiple GPUs across multiple nodes for parallel jobs (MPI, etc.).

For example, suppose node1 and node2 each have 4 GPUs installed.

On node1, JobA uses GPU0 and JobB uses GPU2;

On node2, JobC uses GPU1 and GPU2.

If JobZ then requests 4 GPUs, the patched SGE can dispatch GPU1 and GPU3 on node1 and GPU0 and GPU3 on node2 to JobZ, and set the environment for the job on node1 as:

CUDA_VISIBLE_DEVICES=1,3  (0,2 are already used by JobA and JobB)

and the environment for the same job on node2 as:

CUDA_VISIBLE_DEVICES=0,3  (1,2 are already used by JobC)
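
The bookkeeping in this example can be reproduced with a small shell sketch. It only illustrates which device IDs remain free on each node. It is not the patch's actual implementation, and the function name is hypothetical.

```shell
#!/bin/sh
# free_gpus TOTAL USED_LIST
# Print the comma-separated device IDs in 0..TOTAL-1 that do not
# appear in USED_LIST (itself a comma-separated list, possibly empty).
free_gpus() {
    total=$1
    used=",$2,"
    free=""
    id=0
    while [ "$id" -lt "$total" ]; do
        case "$used" in
            *",$id,"*) ;;                        # already assigned to another job
            *) free="${free:+$free,}$id" ;;      # still available
        esac
        id=$((id + 1))
    done
    echo "$free"
}

echo "node1: CUDA_VISIBLE_DEVICES=$(free_gpus 4 0,2)"
echo "node2: CUDA_VISIBLE_DEVICES=$(free_gpus 4 1,2)"
```

With 4 GPUs per node, this prints 1,3 for node1 and 0,3 for node2, matching the values above.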

For non-GPU jobs, CUDA_VISIBLE_DEVICES is set to be empty.
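
A job script can check what it was given by inspecting CUDA_VISIBLE_DEVICES. The script below is a minimal sketch. The job name "gpu_demo" and the helper function are hypothetical, and the "openmpi" parallel environment is assumed to exist on the cluster, as in the qsub example above.

```shell
#!/bin/sh
#$ -N gpu_demo
#$ -cwd
#$ -pe openmpi 4
#$ -l ngpus=1

# count_visible_gpus LIST
# Number of comma-separated device IDs in LIST (0 if LIST is empty).
count_visible_gpus() {
    if [ -z "$1" ]; then
        echo 0
    else
        echo "$1" | awk -F, '{ print NF }'
    fi
}

devices="${CUDA_VISIBLE_DEVICES:-}"
echo "host:    $(uname -n)"
echo "slots:   ${NSLOTS:-unknown}"
echo "devices: ${devices:-none}"
echo "count:   $(count_visible_gpus "$devices")"
```

Run outside of SGE, or as a non-GPU job, the script reports no devices and a count of 0.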

With this patch, you no longer need a wrapper or load sensor to schedule multiple GPU jobs.

Check the link too: http://sourceforge.net/projects/ge-gpu/?source=directory
