
drmaa-python's People

Contributors

bitdeli-chef, chenliangomc, dan-blanchard, danielskatz, davidr, desilinguist, esirola, jakirkham, jan-janssen, moisesluza, stverhae, tomgreen66, willfurnass


drmaa-python's Issues

Specifying the Interpreter

When using qsub with SGE, there is a flag -S that specifies the interpreter to use when running the script. Is there a way to do that in drmaa, or should I use nativeSpecification?
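
For example, would something like this be the intended way? (A sketch that assumes the SGE DRMAA implementation passes native options straight through to qsub; the script name is a placeholder.)

    import drmaa

    with drmaa.Session() as s:
        jt = s.createJobTemplate()
        jt.remoteCommand = 'myscript.py'               # placeholder script
        jt.nativeSpecification = '-S /usr/bin/python'  # same flag you would pass to qsub
        jobid = s.runJob(jt)
        s.deleteJobTemplate(jt)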

Error installing via pip on Python 3

Hello drmaa-python developers,

I've recently tried to install the package via pip. It worked fine on Python 2, but on Python 3 I got the following error:

Collecting drmaa
  Could not fetch URL https://pypi.python.org/simple/drmaa/: There was a problem confirming the ssl certificate: [Errno 2] No such file or directory - skipping
  Could not find a version that satisfies the requirement drmaa (from versions: )
No matching distribution found for drmaa

I tried this with pip install drmaa and pip install drmaa==0.7.6. It seems like something is either off in the PyPI metadata or for some reason the actual package is not uploaded correctly.

Any ideas what may be going wrong there?

Thanks in advance for looking into this :),

Ability to Access Job Ids Across Sessions

Is there any way that I can save the job id of a long-running task and then check its status from another Session? I understand that I can restart the session as long as I have the contact string for that session, but that requirement seems silly given that with just the job id I can run qstat.
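
What I have in mind is roughly this (a sketch; whether a job id from an earlier session is still queryable after reconnecting presumably depends on the DRMAA implementation):

    import drmaa

    # First process: submit, then record both the contact string and the job id.
    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sleep'
    jt.args = ['600']
    contact = s.contact          # save this somewhere persistent
    jobid = s.runJob(jt)         # save this too
    s.deleteJobTemplate(jt)
    s.exit()

    # Later process: reconnect using the saved contact and query the saved job id.
    s2 = drmaa.Session()
    s2.initialize(contact)
    print(s2.jobStatus(jobid))
    s2.exit()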

DrmCommunicationException error on initialization through uwsgi service

Hi dan-blanchard,

I am using your package to write a web API interface for the grid engine, but there is an error message on which I hope you can share some insights.

Here is the code that causes the error:

#!/usr/bin/python
import os
os.environ['DRMAA_LIBRARY_PATH'] = "/opt/sge6/lib/linux-x64/libdrmaa.so"
os.environ['SGE_ROOT'] = "/opt/sge6"
import drmaa
# drmaa session init
s = drmaa.Session()
s.initialize()

And here is the error message from the uwsgi service:

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    s.initialize()
  File "/usr/local/lib/python2.7/dist-packages/drmaa/__init__.py", line 274, in initialize
    _w.init(contactString)
  File "/usr/local/lib/python2.7/dist-packages/drmaa/wrappers.py", line 59, in init
    return _lib.drmaa_init(contact, error_buffer, sizeof(error_buffer))
  File "/usr/local/lib/python2.7/dist-packages/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
drmaa.errors.DrmCommunicationException: code 2: unable to send message to qmaster using port 6444 on host "master": got send error

Import constants without DRMAA C library

In the application I develop on, we delay importing drmaa so that $DRMAA_LIBRARY_PATH can be set in the application config. We also map the state constants to corresponding strings.

We're going to be expanding that state mapping to include severity ordering, and I'm planning to wrap all of these (strings, severities, constants) up in a class so we can define comparison functions for the severity ordering and simplify state stringification.

That would be more easily done outside the scope of "DRMAA job runner" initialization linked above, but unfortunately, drmaa may not be importable outside of that scope. This leads me to two questions:

  1. Would it be possible to make drmaa.const importable without a libdrmaa? Or...
  2. Would you be interested in incorporating the higher level state wrapping I am planning to write for Galaxy directly in to the drmaa library?

I'm happy to submit a PR for either, if either sound good to you.
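
For context, the delayed-import pattern I mean is roughly this (a simplified sketch with placeholder paths, not our actual code):

    import os

    def load_drmaa(library_path=None):
        """Import drmaa only after the library path is known, since libdrmaa is loaded at import time."""
        if library_path:
            os.environ['DRMAA_LIBRARY_PATH'] = library_path
        import drmaa
        return drmaa

    drmaa = load_drmaa('/opt/sge/lib/lx-amd64/libdrmaa.so')  # hypothetical path from app config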

Session returns empty contact information.

I'm running drmaa-python on a Torque/PBS system. I'm using the latest pbs-drmaa library, version 1.0.19. I'm running the following piece of code:

import drmaa

def main():
    with drmaa.Session() as s:
        print('A DRMAA object was created')
        print('Supported contact strings: %s' % s.contact)  # returns an empty string

if __name__ == '__main__':
    main()

Unfortunately s.contact and s.contactString are both empty. Any ideas? Is this a pbs-drmaa related issue? s.drmaaImplementation returns:

DRMAA for PBS Pro v. 1.0.19 <http://sourceforge.net/projects/pbspro-drmaa/>

Issue of drmaa-python not directing the standard error to my desired filepath

Dear All DRMAA Python Users and Developers,

I have the above-mentioned issue: I have not been able to direct the standard error messages to the desired file path whenever I run a process on my distributed SGE-controlled cluster. In my Python code, I have the following statements:

jt.outputPath = ':' + self.stdout_log_filepath  
jt.errorPath = ':' + self.stderr_log_filepath

In my debugging I have inserted print statements before and after the errorPath line to ensure that it was actually executed. When I run the same process locally, both the standard error and standard output are written to my desired files. With further debugging, I examined the contents of the self.stdout_log_filepath standard output file when running in distributed mode, and realized that what I saw in self.stderr_log_filepath when running the process locally was actually found in the self.stdout_log_filepath standard output file in the distributed case.

Somehow it seems like DRMAA has not separated the standard error from the standard output stream: the standard error gets written to the same file path as the standard output, and thus both streams go into self.stdout_log_filepath. But I really need the standard error to be separated out and to go into its own self.stderr_log_filepath file.

Could this be a bug? Or am I using the API wrongly? I would greatly appreciate any help on this issue to solve it.

Thank you very much!

Segmentation Fault @ jobStatus using Torque

I am having trouble running examples/example5.py. When it gets to s.jobStatus(jobid) I get a segmentation fault. Every other example seems to work: I can see nodes created and I get output. It is just jobStatus that is causing me problems. Any thoughts on where to troubleshoot?

Clarification about Memory Leaks

The docs state:

It is the responsibility of the application to make sure one of these two functions [wait() or synchronize()] is called for every job. Not doing so creates a memory leak.

What exactly does that mean, and where is the leak (in the Python application or in the cluster software)? Does it mean that if I intend my Python program to return immediately (like in example 2), I should call:

retval = s.wait(curjob, drmaa.Session.TIMEOUT_NO_WAIT)

before exiting / closing the session?

Question: session bound to thread that created it?

Do all of the following statements have to happen on the same thread?

session = drmaa.Session()
session.initialize()
job_id = session.runJob(job_template)
session.control(job_id, drmaa.JobControlAction.TERMINATE)
session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)

I notice that calling control from a different thread does not terminate the job, while it works fine when calling everything from the same thread. I'm using a lock on runJob and control to ensure there are no race conditions (I'm not locking anything else, as the documentation only mentions a need for locking when using control; should I lock the rest as well?).
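
Roughly, the locking pattern I mean is something like this (a simplified sketch, not my actual code; names are placeholders):

    import threading
    import drmaa

    submit_lock = threading.Lock()

    session = drmaa.Session()
    session.initialize()

    def submit(job_template):
        with submit_lock:      # serialize runJob calls
            return session.runJob(job_template)

    def terminate(job_id):
        with submit_lock:      # serialize control calls with submissions
            session.control(job_id, drmaa.JobControlAction.TERMINATE)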

Thanks in advance

Release v0.7.9

It would be nice to get a fresh patch release, given the fixes that have landed since v0.7.8.

Running Batch Jobs

Hi, I'm trying to run a batch job using the --array Slurm option, and I'm wondering whether this is possible with drmaa-python. I know there is runBulkJobs(...), but it doesn't seem to run an array of jobs: there doesn't seem to be any $SLURM_ARRAY_TASK_ID (or the like) in the run environment.

with drmaa.Session() as s:
    try:
        # create job template
        jt = s.createJobTemplate()
        jt.nativeSpecification = '--mem-per-cpu=' + self.memory + ' --array=1-3' + ' --time=' + self.time
        jt.remoteCommand = command
        print(jt.nativeSpecification)

        # run job
        joblist = s.runBulkJobs(jt, 1, 3, 1)

        # wait for the return value
        s.synchronize(joblist, self.convertToSeconds(), False)

        for curjob in joblist:
            print('Collecting job ' + curjob)
            retval = s.wait(curjob, drmaa.Session.TIMEOUT_WAIT_FOREVER)
            print('Job: {0} finished with status {1} and was aborted {2}'.format(retval.jobId, retval.exitStatus, retval.wasAborted))
            if retval.wasAborted:
                print("Ran out of memory using: " + self.memory)
                self.increaseMemory(6000)
            # if zero exit code then break and job is over
            elif retval.exitStatus == 0:
                break
    except drmaa.ExitTimeoutException:
        print("Ran out of time using: " + self.time)
        self.increaseTime(6)
    except drmaa.OutOfMemoryException:
        print("Ran out of memory using: " + self.memory)
        self.increaseMemory(6000)

When I try to run this, I get a segmentation fault.

OUTPUT (gdb run and backtrace):

(gdb) run run.py "echo $SLURM_ARRAY_TASK_ID" 01:00:00 100
Starting program: /apps/python/2.7.6/bin/python run.py "echo $SLURM_ARRAY_TASK_ID" 01:00:00 100
[Thread debugging using libthread_db enabled]
--mem-per-cpu=100 -a=1-3 --time=01:00:00

Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297 drmaa_base.c: No such file or directory.
    in drmaa_base.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64
(gdb) backtrace
#0  drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1  0x00002aaab15cecbc in ffi_call_unix64 () at /apps/python/Python-2.7.6/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2  0x00002aaab15ce393 in ffi_call (cif=<value optimized out>, fn=0x2aaab301b290 <drmaa_release_job_ids>, rvalue=<value optimized out>, avalue=0x7fffffffb0d0) at /apps/python/Python-2.7.6/Modules/_ctypes/libffi/src/x86/ffi64.c:522
#3  0x00002aaab15c6006 in _call_function_pointer (pProc=0x2aaab301b290 <drmaa_release_job_ids>, argtuple=0x7fffffffb1a0, flags=4353, argtypes=<value optimized out>, restype=0x2aaaaae5ebd0, checker=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/callproc.c:836
#4  _ctypes_callproc (pProc=0x2aaab301b290 <drmaa_release_job_ids>, argtuple=0x7fffffffb1a0, flags=4353, argtypes=<value optimized out>, restype=0x2aaaaae5ebd0, checker=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/callproc.c:1183
#5  0x00002aaab15bdcf3 in PyCFuncPtr_call (self=<value optimized out>, inargs=<value optimized out>, kwds=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/_ctypes.c:3929
#6  0x00002aaaaaaf79b3 in PyObject_Call (func=0x93b530, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2529
#7  0x00002aaaaaba6ad9 in do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4239
#8  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4044
#9  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#10 0x00002aaaaab1b887 in gen_send_ex (gen=0x971e60, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#11 0x00002aaaaab2f656 in listextend (self=0x981200, b=<value optimized out>) at Objects/listobject.c:872
#12 0x00002aaaaab2fae0 in list_init (self=0x981200, args=<value optimized out>, kw=<value optimized out>) at Objects/listobject.c:2458
#13 0x00002aaaaab5a8a8 in type_call (type=<value optimized out>, args=0x97bf90, kwds=0x0) at Objects/typeobject.c:745
#14 0x00002aaaaaaf79b3 in PyObject_Call (func=0x2aaaaae5b0a0, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2529
#15 0x00002aaaaaba6ad9 in do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4239
#16 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4044
#17 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#18 0x00002aaaaaba917e in PyEval_EvalCodeEx (co=0x78f4b0, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=4, kws=0x9addb8, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#19 0x00002aaaaaba7332 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#20 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#21 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#22 0x00002aaaaaba807e in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00002aaaaaba807e in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00002aaaaaba917e in PyEval_EvalCodeEx (co=0x7743b0, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#29 0x00002aaaaaba9292 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>) at Python/ceval.c:667
#30 0x00002aaaaabc8e40 in run_mod (fp=0x82a030, filename=<value optimized out>, start=<value optimized out>, globals=0x67e510, locals=0x67e510, closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:1370
#31 PyRun_FileExFlags (fp=0x82a030, filename=<value optimized out>, start=<value optimized out>, globals=0x67e510, locals=0x67e510, closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:1356
#32 0x00002aaaaabc901f in PyRun_SimpleFileExFlags (fp=0x82a030, filename=0x7fffffffc4bd "run.py", closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:948
#33 0x00002aaaaabdeb34 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:640
#34 0x00000039ad21ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400669 in _start ()


Aside: I'm also having trouble getting it to throw an OutOfMemoryException, so I'm forced to assume the job was aborted due to memory (not preferable); advice on what's happening there would also be great.

Thanks!

specifying the queue name?

When testing example2.py, I get a wrong message, as follows:
fk mr qg23 _9m0 n8 sa
I'm wondering if the default queue name is not suitable in my SGE setup. If so, how can I change the default queue name? Thank you!

MultiThreading and MUNGE

Hi,

I have happily used drmaa-python for many years with our SGE cluster. Just recently a new cluster was installed, and this time it is configured to use MUNGE security.

If I create and submit a simple job, everything works fine, but if I run the job submission as part of a thread pool I get an error about MUNGE security.

For example:

import drmaa
from multiprocessing.pool import ThreadPool
import tempfile
import os
import stat

pool = ThreadPool(2)

session = drmaa.Session()
session.initialize()

def pTask(n):

    smt = "ls . > test.out"
    script_file = tempfile.NamedTemporaryFile(mode="w", dir=os.getcwd(), delete=False)
    script_file.write(smt)
    script_file.close()
    print "Job is in file %s" % script_file.name
    os.chmod(script_file.name, stat.S_IRWXG | stat.S_IRWXU)
    jt = session.createJobTemplate()     
    print "jt created"
    jt.jobEnvironment = {'BASH_ENV': '~/.bashrc'}
    print "environment set"
    jt.remoteCommand = os.path.join(os.getcwd(),script_file.name)
    print "remote command set" 
    jobid = session.runJob(jt)
    print "Job submitted with id: %s, waiting ..." % jobid
    retval = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

pool.map(pTask, (1,))

produces the following output

Job is in file /home/userid/tmpbRa0IO
jt created
environment set
error: getting configuration: MUNGE authentication failed: Invalid credential format
remote command set
Traceback (most recent call last):
  File "test_threads.py", line 31, in <module>
    pool.map(pTask, (1,))
  File "/home/mb1ims/.conda/envs/sharc/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/home/mb1ims/.conda/envs/sharc/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
drmaa.errors.DeniedByDrmException: code 17: MUNGE authentication failed: Invalid credential format

So the first sign of trouble appears when jt.remoteCommand is set, but the script continues and then raises an unhandled Python exception when session.runJob is executed.

Curious how to solve this undefined symbol item and very grateful for any pointers

Forgive my ignorance; I am just the sysadmin of a cluster and not very Python literate.

We recently upgraded to Torque 5.0 and were told that this Python library, when talking to the new Torque 5.0 libdrmaa.so, now issues the error below. I believe this is because that library is C++-based for some of these functions and this is a name-mangling matter:

>>> import drmaa
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/drmaa/__init__.py", line 63, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "/usr/lib/python2.6/site-packages/drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
  File "/usr/lib/python2.6/site-packages/drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "/usr/lib/python2.6/site-packages/drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/torque/lib/libdrmaa.so.0: undefined symbol: _Z19drmaa_attrib_lookupPKcj

I have the source of the libdrmaa.so.0 noted here, and I know the Python is rather old. I dimly recall wrapping things with extern "C" in some past life. I believe the unmangled name is:

c++filt _Z19drmaa_attrib_lookupPKcj
drmaa_attrib_lookup(char const*, unsigned int)

Again, sorry for posting what I suspect is an issue in the new drmaa library...

DeprecationWarning on Python 3.6

We get the following warnings using the drmaa module:

  /opt/anaconda/lib/python3.6/site-packages/drmaa/session.py:340: DeprecationWarning: generator 'run_bulk_job' raised StopIteration
    return list(run_bulk_job(jobTemplate, beginIndex, endIndex, step))
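
For reference, this looks like the PEP 479 behaviour: from Python 3.7 a StopIteration escaping a generator becomes an error, so the generator has to return explicitly. A minimal illustration of the pattern (not drmaa's actual code):

    # Old style: relies on an escaping StopIteration to end the generator
    # (triggers the DeprecationWarning on 3.6, a RuntimeError on 3.7+).
    def first_n(it, n):
        for _ in range(n):
            yield next(it)

    # PEP 479-safe style: catch StopIteration and return explicitly.
    def first_n_safe(it, n):
        for _ in range(n):
            try:
                yield next(it)
            except StopIteration:
                return

    print(list(first_n_safe(iter([1, 2]), 5)))  # [1, 2]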

rusage.contents could contain multiple equals signs

In helpers.py:

    def adapt_rusage(rusage):

It splits the attributes in rusage.contents with:

   k, v = attr.split('=')

On my system (PBS Pro 12.2 and the pbs-drmaa library) there is the possibility of an attribute value containing an equals sign, so it requires:

    k, v = attr.split('=',1)

Otherwise it does not cleanly end the job and causes problems in software such as Galaxy.
Happy to provide a patch.
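
A quick illustration of the proposed fix (the attribute string below is a made-up example of a value that itself contains an equals sign):

    attr = 'resources_used.place=free=shared'  # hypothetical rusage entry with '=' in the value
    k, v = attr.split('=', 1)                  # maxsplit=1 keeps everything after the first '='
    # k == 'resources_used.place', v == 'free=shared'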

drmaa.errors.InvalidArgumentException

Hi there,

I'm having trouble getting my scripts to execute using drmaa-python. I have a GridEngine installation functioning well (I use it all the time), so I know that is not the problem. The command I would like to send to the GridEngine is a Python script (the same script that is calling this drmaa code, only a different entry point), followed by a list of arguments.

I am trying to send multiple jobs with different command-line arguments, so I put the JobTemplate code in a loop, all using the same drmaa.Session() instance. I would like them all to execute, then wait for them all to finish before the code completes. At least that's what I'm going for with the code below.

Using the code below:

# Loop through each VCF in the input list and send it to the job scheduler
s = drmaa.Session()
s.initialize()

for in_file in in_files:
    # Create the job template
    jt = s.createJobTemplate()
    jt.remoteCommand = 'python'
    jt.args = [os.path.realpath(__file__),
               'test.run',
               '--input',
               in_file,
               '--my flag']
    print 'Executing: {0} {1}'.format(jt.remoteCommand, ' '.join(jt.args))

    jt.joinFiles = True
    jt.outputPath = ':{0}/{1}.out'.format(os.path.dirname(in_file), os.path.basename(in_file))
    jt.errorPath = ':{0}/{1}.err'.format(os.path.dirname(in_file), os.path.basename(in_file))

    if run:
        jobid = s.runJob(jt)
        print '{0}: RUN: {1}'.format(str(jobid), os.path.basename(in_file))
    else:
        print 'DRY RUN:  {0}'.format(os.path.basename(in_file))

    # Cleanup
    s.deleteJobTemplate(jt)

# Wait for this chunk of jobs to finish as to not clutter the queue
s.synchronize(s.JOB_IDS_SESSION_ALL, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

# Exit the job scheduler session
s.exit()

I see the runtime error:
drmaa.errors.InvalidArgumentException: code 4: Job id, "D", is not a valid job id

I have no idea where this could be coming from because it looks - to me - like my JobTemplate arguments are ok. Could it be something with the s.synchronize()?

Does anyone have any idea what could be causing this? Or where I can begin debugging?

Thanks!

DRMAA_ERRNO_NO_RUSAGE ignore on wait

In the wait method, if the underlying call returns DRMAA_ERRNO_NO_RUSAGE, the value is treated as an error.

This is a valid return value; the application should simply consider that no usage data is available.

How to run a job with 'qsub' arguments or script flags

When I run a script on an HPC with a scheduler such as SGE, my script might contain flags like these:

#!/bin/bash
#$ -pe openmpi 32
#$ -A TensorFlow
#$ -N rqsub_tile
#$ -cwd
#$ -S /bin/bash
#$ -q gpu0.q
#$ -l excl=true

run_script.sh 

Or, I might run a qsub command like this:

$ qsub -wd $PWD -o :${qsub_logdir}/ -e :${qsub_logdir}/ -j y -N "$job_name" -pe threaded 6-18 -l mem_free=10G -l mem_token=10G run_script.sh

I've spent a lot of time reading the docs, running through the examples, and Googling, but I can't find anything that actually shows how to use these parameters with this Python DRMAA library. Is it described somewhere? It sounds like something that might be part of the JobTemplate described here, but the docs don't mention it, or tell you much at all really.
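
From what I can tell, the closest thing might be the job template's nativeSpecification attribute. Is something like this sketch the intended way? (The flags are just example SGE qsub options, and I assume which ones a given DRMAA implementation accepts this way can vary.)

    import os
    import drmaa

    with drmaa.Session() as s:
        jt = s.createJobTemplate()
        jt.remoteCommand = os.path.join(os.getcwd(), 'run_script.sh')  # placeholder script
        # Anything you would normally put on the qsub command line:
        jt.nativeSpecification = '-pe threaded 6-18 -l mem_free=10G -l mem_token=10G -j y'
        jt.workingDirectory = os.getcwd()
        jobid = s.runJob(jt)
        s.deleteJobTemplate(jt)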

Import Error

When I run import drmaa on the SGE cluster head node, I get this error:

OSError: dlopen(/global/sge/lib/darwin-x86/libdrmaa.dylib, 10): Symbol not found: _environ
  Referenced from: /global/sge/lib/darwin-x86/libdrmaa.dylib
  Expected in: flat namespace

I've been using the grid engine commands without any issues. Any idea why this happens?

Thanks!

jobEnvironment dropping content after = in value

I am running this script:

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "python"
    jt.args = ['env.py']
    jt.jobEnvironment = {'ALEX': 'alex=2'}
    jt.joinFiles = True
    jobid = s.runJob(jt)
    s.deleteJobTemplate(jt)

with this script env.py:

import os
print os.environ

When I look at my environment, I see:

'ALEX': 'alex', 

Cannot execute binary file

I'm using gridmap but the errors are (I'm pretty sure) related to something here.

/idiap/temp/<user>/programs/anaconda3/bin/python: /idiap/temp/<user>/programs/anaconda3/bin/python: cannot execute binary file
find: '/tmp/.vbox-root-ipc': Permission denied
find: '/tmp/systemd-private-b923f82753fe4a1985eea86cc7be13d6-colord.service-1enXie': Permission denied
find: '/tmp/systemd-private-b923f82753fe4a1985eea86cc7be13d6-nagios-nrpe-server.service-JdVQLm': Permission denied
find: '/tmp/thunderbird': Permission denied
find: '/tmp/MozillaMailnews': Permission denied
find: '/tmp/pulse-PKdhtXMmr18n': Permission denied
find: '/tmp/mozilla_abittar0': Permission denied
find: '/tmp/ssh-yAgIKvbICjPQ': Permission denied
find: '/tmp/firefox-esr': Permission denied
find: '/tmp/.vbox-nantonel-ipc': Permission denied
find: '/tmp/systemd-private-b923f82753fe4a1985eea86cc7be13d6-haveged.service-5pU4qb': Permission denied
find: '/tmp/orbit-abittar': Permission denied
find: '/tmp/ssh-V5WCynGVPpLl': Permission denied

I'm using -b yes and the file has the right permissions. I have no idea what's going on.

Using wurlitzer to capture and log stdout/stderr from libdrmaa

Some DRMAA implementations print information to stdout/stderr. While this information can be useful, having it print to stderr or stdout can be disruptive (e.g. launching jobs with DRMAA in an IPython terminal session) as it will get mingled with other information. It would be ideal to capture this stdout/stderr and instead redirect to somewhere like a Python log, which users can redirect as they see fit.

The main trick is capturing the C-level stdout/stderr pipes and redirecting them to something else that can be used with logging. Note that working with sys.stdout/sys.stderr in Python will not address this. For a good explanation of why, please read this blog post by Eli Bendersky. Fortunately a lot of work has already been done on this, and there now exists a package, wurlitzer, that handles this redirection logic for us. So it would be good if we could use this to redirect stdout/stderr from libdrmaa to a logger (possibly using StringIO as an intermediary).
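
A minimal sketch of what this could look like (assuming wurlitzer's default pipes() context manager and the standard logging module; this is not an implementation that exists in drmaa-python today):

    import logging
    from wurlitzer import pipes

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger('libdrmaa')

    # Capture C-level stdout/stderr while the DRMAA library is loaded and used.
    with pipes() as (out, err):
        import drmaa
        with drmaa.Session() as s:
            pass  # submit jobs, etc.

    for line in out.read().splitlines():
        log.info('%s', line)
    for line in err.read().splitlines():
        log.warning('%s', line)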

drmaa.errors.DeniedByDrmException: code 17: job rejected: no script in your request

When we add the -clear parameter to jt.nativeSpecification, we get the problem described in the title.
When we remove the -clear parameter, everything works well.
Please give me some clue about this issue.
Thanks.

script:
#!/usr/bin/env python

import drmaa
import os

def main():
    """
    Submit a job.
    Note, need file called sleeper.sh in current directory.
    """
    with drmaa.Session() as s:
        print('Creating job template')
        jt = s.createJobTemplate()
        jt.workingDirectory = os.getcwd()
        jt.remoteCommand = os.path.join(os.getcwd(), 'test.sh')
        jt.nativeSpecification = "-clear -binding linear:1 -P MASSspe -q bc.q -cwd -l vf=0.5g -l num_proc=1"
        #jt.nativeSpecification = "-binding linear:1 -P MASSspe -q bc.q -cwd -l vf=0.5g -l num_proc=1"
        jobid = s.runJob(jt)
        print('Your job has been submitted with ID %s' % jobid)

        print('Cleaning up')
        s.deleteJobTemplate(jt)

if __name__ == '__main__':
    main()

binary / script flag

When using qsub with SGE, there is a flag -b that specifies whether to treat the command as a binary or a script. If -b y, qsub assumes the command is available on the compute hosts (and not necessarily the submission host), whereas if -b n, the file will be copied over from the submission host.

Is there a way to do that in drmaa, or should I use either nativeSpecification or something like transferFiles?

Full text from man page:

   -b y[es]|n[o]
          Available for qsub, qrsh only. Qalter does not allow changing this option. This option cannot be embedded in the script file itself.

          Gives  the  user the possibility to indicate explicitly whether command should be treated as binary or script. If the argument of -b is ’y’, then command  may be a binary or script.  The command might
          not be accessible from the submission host.  Nothing except the path of the command will be transferred from the submission host to the execution host. Path aliasing will be applied  to  the  path  of
          command before it is executed.

          If  the  argument  of  -b is ’n’ then command needs to be a script, and it will be handled as such. The script file has to be accessible by the submission host. It will be transferred to the execution
          host. qsub/qrsh will search directive prefixes within scripts.

          qsub will implicitly use -b n, whereas qrsh will apply the -b y option if nothing else is specified.

          The value specified with this option or the corresponding value specified in qmon will only be passed to defined JSV instances if the value is yes.  The name of the parameter will be b. The value will
          be y also when the long form yes was specified during submission.  (See -jsv option below or find more information concerning JSV in jsv(1).)

          Please note that submission of command as a script (-b n) can have a significant performance impact, especially for short running jobs and big job scripts. Script submission adds a number of operations to the submission process: The job script needs to be
          - parsed at client side (for special comments)
          - transferred from submit client to qmaster
          - spooled in qmaster
          - transferred to execd at job execution
          - spooled in execd
          - removed from spooling both in execd and qmaster once the job is done
          If job scripts are available on the execution nodes, e.g. via NFS, binary submission can be the better choice.

Capturing the stdout and stderr?

Is there a supported way to capture the stdout/stderr of each job?
Or would I have to add a > output_filename to the job template arguments list?
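
Or is something like the job template's outputPath and errorPath attributes the supported way? A sketch of what I mean (I assume the leading ':' means no host part, and that exact path handling depends on the DRM):

    import os
    import drmaa

    with drmaa.Session() as s:
        jt = s.createJobTemplate()
        jt.remoteCommand = '/bin/echo'
        jt.args = ['hello']
        jt.outputPath = ':' + os.path.join(os.getcwd(), 'job.out')
        jt.errorPath = ':' + os.path.join(os.getcwd(), 'job.err')
        jobid = s.runJob(jt)
        s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        s.deleteJobTemplate(jt)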

setting walltime

How do I specify the equivalent of "-l walltime=1:00:00"?

Your examples should include this.

I've tried:
jt.hardWallclockTimeLimit = ':2:00'
jt.hardWallclockTimeLimit = '2:00'
jt.hardWallclockTimeLimit = 120

What else can be done?

Segmentation fault when an integer is passed via JobTemplate.args

The subject says it all.

This code segfaults:

jt = s.createJobTemplate()
jt.remoteCommand = "/bin/echo"
jt.args = ["John", "Doe", 12]
jt.joinFiles = True

s.runJob(jt)

s.deleteJobTemplate(jt)

Converting 12 to a string solves it.
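
Until the library guards against this, coercing everything to a string up front works as a workaround (continuing the snippet above):

    jt.args = [str(a) for a in ['John', 'Doe', 12]]  # ensure every element is a string before it reaches ctypes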

System: CentOS 6.5 64bit - Rocks 6.1.1
Queue: SGE - OGS/GE 2011.11p1

set_vector_attribute truncates all arguments greater than 1024 chars

This can easily be seen with the following simple test (and can be stepped through with pdb):

import drmaa

def main():
    session = drmaa.Session()
    session.initialize()
    job_template = session.createJobTemplate()
    args = ["", '5,7,9,14,17,22,31,32,36,37,38,48,50,55,60,63,65,68,70,72,78,79,82,85,86,88,89,90,91,92,95,96,99,100,105,106,111,114,118,119,120,123,126,127,128,131,132,135,136,138,145,148,150,151,154,155,156,164,168,172,174,182,183,184,187,192,193,195,197,198,199,202,203,207,208,212,216,218,220,222,225,226,230,233,239,242,245,251,257,258,259,260,261,265,266,268,279,280,283,284,286,287,288,289,294,298,301,303,304,306,314,318,325,329,330,332,334,337,344,346,352,364,366,369,370,373,379,381,382,384,390,391,392,394,402,403,407,408,414,419,425,431,433,434,439,440,444,445,446,459,460,461,466,468,469,470,473,476,480,486,487,494,495,497,501,515,516,517,522,528,529,533,535,536,538,539,543,546,547,550,551,552,553,562,565,568,569,570,572,575,578,580,582,583,587,588,590,592,598,601,603,606,607,608,613,620,623,626,627,628,629,630,631,634,639,647,650,654,655,658,669,672,673,675,676,681,683,694,697,698,703,706,709,710,712,716,719,725,733,736,741,743,745,748,752,754,756,757,761,765,766,770,771,772,773,774,776,778,779,780,786,788,792,793,796,803,805,808,809,811,812,813,814,817,820,823,829,831,834,840,842,845,850,851,854,856,857,858,864,867,870,871,875,876,878,879,881,885,890,894,895,897,901,902,903,904,905,912,915,921,924,925,927,932,934,936,937,939,941,942,948,951,952,954,955,960,961,965,971,973,977,978,979,980,984,985,995,996,998,1000,1001,1004,1007,1009,1011,1012,1015,1016,1017,1018,1021,1025,1026,1027,1031,1035,1040,1046,1047,1048,1051,1052,1053,1056,1057,1058,1061,1065,1066,1071,1072,1085,1087,1088,1089,1092,1093,1094,1095,1098,1102,1103,1106,1108,1111,1112,1117,1118,1123,1127,1133,1135,1136,1143,1146,1147,1151,1153,1154,1156,1160,1161,1163,1164,1165,1172,1176,1179,1180,1181,1183,1184,1185,1188,1189,1190,1191,1192,1193,1194,1200,1201,1204,1206,1207,1209,1210,1211,1212,1216,1219,1222,1226,1237,1240,1243,1244,1245,1248,1250,1253,1255,1256,1257,1261,1268,1269,1276,1281,1283,1286,1290,1292,1295,1296,1299,1301,1302,1303,1307,1310,1321,1322,1329,1332,1335,1337,1342,1343,1344,1348,1350,1353,1354,1357,1358,1359,1360,1372,1373,1376,1377,1384,1385,1390,1391,1393,1394,1398,1399,1401,1408,1409,1413,1414,1417,1419,1425,1431,1435,1436,1439,1440,1443,1444,1446,1447,1451,1452,1459,1460,1461,1465,1466,1469,1470,1471,1475,1476,1477,1479,1481,1485,1486,1487,1489,1491,1492,1495,1499,1500,1503,1505,1508,1513,1516,1528,1531,1532,1538,1547,1549,1550,1552,1554,1556,1561,1564,1566,1569,1573,1584,1585,1586,1590,1591,1593,1595,1596,1597,1599,1601,1606,1610,1611,1613,1614,1617,1621,1623,1626,1628,1630,1639,1640,1644,1647,1648,1651,1652,1655,1659,1660,1664,1668,1674,1679,1681,1684,1691']
    job_template.args = map(str, args)
    print job_template.args
    session.exit()

if __name__=='__main__':
    main()

which shows clearly that job_template.args is truncated at 1024 chars. Setting job_template.args calls __set__ from drmaa/helpers.py:

181          def __set__(self, instance, value):
182              c(drmaa_set_vector_attribute, instance,
183  ->              self.name, string_vector(value))

string_vector(value) has no truncating effect on the passed arguments. It is drmaa_set_vector_attribute that appears to have a buffer overflow of some sort and caps all arguments at 1024 chars.

Fix Travis Badge

The current badge shows the build failing when it is not.

Perhaps use:

.. image:: https://travis-ci.org/pygridtools/drmaa-python.svg?branch=master
    :target: https://travis-ci.org/pygridtools/drmaa-python

drmaa-python wait not playing nice with Condor

condor_version 
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
$CondorPlatform: x86_64_rhap_6.3 $

Using example3.py (example1.py, example1.1.py, example2.py, and example2.1.py all work fine):

./example3.py
Creating job template
DEBUG: Join_files is set
DEBUG: drmaa_join_files: y
DEBUG: drmaa_v_argv: ?:i
DEBUG: drmaa_remote_command: /home/leipzigj/drmaa-python/examples/sleeper.sh
Your job has been submitted with id variome.chop.edu.108921.0
DEBUG: -> wait_job(variome.chop.edu.108921.0)
DEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Sleeping for a momentDEBUG: Resulting stat value is 200
DEBUG: RUsage data: submission_time=1433953498, start_time=1433953499, end_time=1433953502
DEBUG: Unreferencing job variome.chop.edu.108921.0
DEBUG: Not removing job variome.chop.edu.108921.0 yet (ref_count: 1 -> 0)
DEBUG: Marking job variome.chop.edu.108921.0 as DISPOSED
DEBUG: Removing job info for variome.chop.edu.108921.0 (0x26e58c0, 0x26e58c0, (nil), 1)
DEBUG: <- wait_job(variome.chop.edu.108921.0)
Traceback (most recent call last):
  File "./example3.py", line 29, in <module>
    main()
  File "./example3.py", line 21, in main
    retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/session.py", line 480, in wait
    c(drmaa_wcoredump, byref(coredumped), stat)
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/helpers.py", line 299, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/nas/is1/bin/variome-env/lib/python3.3/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: Invalid argument

Should search for drmaa library in LD_LIBRARY_PATH folders

Title pretty much says it all.

I have the drmaa lib files in my ~/usr/lib folder, which is in my LD_LIBRARY_PATH. I don't see why I should have to add an extra environment variable to tell you where the drmaa lib file is, since I am using a pretty standard setup.

Am I the only one here?
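
For illustration, the behaviour I'm asking for amounts to something like this (a sketch, not how drmaa-python currently locates the library):

    import glob
    import os
    from ctypes import CDLL, RTLD_GLOBAL

    def find_libdrmaa():
        """Return the first libdrmaa.so* found in the directories on LD_LIBRARY_PATH."""
        for d in os.environ.get('LD_LIBRARY_PATH', '').split(os.pathsep):
            if not d:
                continue
            matches = sorted(glob.glob(os.path.join(d, 'libdrmaa.so*')))
            if matches:
                return matches[0]
        return None

    path = find_libdrmaa()
    if path is not None:
        _lib = CDLL(path, mode=RTLD_GLOBAL)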

Job finished, but failed

I tried to investigate this error, but I could not figure it out.

I use drmaa-python 0.7.6 for submitting jobs to SGE 6.2u5, but when I ran this code:

    #!/usr/bin/env python
    import drmaa
    import time
    import os
    def main():
        """Submit a job, and check its progress.
        Note, need file called sleeper.sh in home directory.
        """
        s = drmaa.Session()
        s.initialize()
        print 'Creating job template'
        jt = s.createJobTemplate()
        jt.remoteCommand = os.getcwd() + '/sleeper.sh'
        jt.args = ['42','Simon says:']
        jt.joinFiles=True
        jobid = s.runJob(jt)
        print 'Your job has been submitted with id ' + jobid
        # Who needs a case statement when you have dictionaries?
        decodestatus = {
            drmaa.JobState.UNDETERMINED: 'process status cannot be determined',
            drmaa.JobState.QUEUED_ACTIVE: 'job is queued and active',
            drmaa.JobState.SYSTEM_ON_HOLD: 'job is queued and in system hold',
            drmaa.JobState.USER_ON_HOLD: 'job is queued and in user hold',
            drmaa.JobState.USER_SYSTEM_ON_HOLD: 'job is queued and in user and system hold',
            drmaa.JobState.RUNNING: 'job is running',
            drmaa.JobState.SYSTEM_SUSPENDED: 'job is system suspended',
            drmaa.JobState.USER_SUSPENDED: 'job is user suspended',
            drmaa.JobState.DONE: 'job finished normally',
            drmaa.JobState.FAILED: 'job finished, but failed',
            }
        for ix in range(10):
            print 'Checking ' + str(ix) + ' of 10 times'
            print decodestatus[s.jobStatus(jobid)]
            time.sleep(5)
        print 'Cleaning up'
        s.deleteJobTemplate(jt)
        s.exit()
    if __name__=='__main__':
        main()

I got this output:

Creating job template
Your job has been submitted with id 42
Checking 0 of 10 times
job is queued and active
Checking 1 of 10 times
job is queued and active
Checking 2 of 10 times
job finished, but failed
Checking 3 of 10 times
job finished, but failed
Checking 4 of 10 times
job finished, but failed 
Checking 5 of 10 times
job finished, but failed
Checking 6 of 10 times
job finished, but failed
Checking 7 of 10 times
job finished, but failed
Checking 8 of 10 times
job finished, but failed
Checking 9 of 10 times
job finished, but failed
Cleaning up

Question about other queue systems

Hello,
I didn't know how to ask you a question, so I put it here; I hope that suits you.
I have a question: I have a job that works on my SGE system; it is basic code:

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = os.path.join(os.getcwd(), 'script.sh')
    jobid = s.runJob(jt)
    s.deleteJobTemplate(jt)

But now I have to execute this on PBS and Slurm systems. Will this work, or is it only suitable for SGE?
For example, do we need something like:

export SLURM_ROOT=/path/to/gridengine
export PBS_ROOT=/path/to/gridengine

Thanks

drmaa communication exception/errors while running on SLURM

Hello everyone
I am trying to run some jobs on a Slurm cluster and I am running into this issue:

drmaa.errors.DrmCommunicationException: code 2: unable to send message to qmaster using port 6444 on host xxxx 

The code I am using to test, prior to running any jobs, is:

from __future__ import print_function
import os
import drmaa

LOGS = "logs/"
if not os.path.isdir(LOGS):
    os.mkdir(LOGS)

s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)

jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()

jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")

jt.nativeSpecification = "--ntasks=2 --mem-per-cpu=50 --partition=1day"

print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=10, step=1)
print("Job submitted with ids", ids)

s.deleteJobTemplate(jt)

I found it useful; it was posted by a member of the community here.

Can anyone tell me if we need to set up any environment variables prior to running jobs on Slurm?

Thanks

New maintainers needed

It has been almost 3 years since I worked somewhere that used a DRMAA grid computation system, so I am not the best person to maintain this anymore. @desilinguist, I know ETS is still using this extensively via gridmap, so can I point people your way with issues from now on? I'm planning on removing myself as an owner of pygridtools entirely, since I just don't have the bandwidth (or even the dev setup) to work on these anymore.

Relocating to pygridtools organization

This is more of an announcement than an issue. The developers of pythongrid and I agreed to join forces and create one organization storing all Python grid computing repos. I'm moving DRMAA Python there.
