hecbiosim / longbow

Longbow is a tool for automating simulations on a remote HPC machine. Longbow is designed to mimic the normal way an application is run locally but allows simulations to be sent to powerful machines.

Home Page: http://www.hecbiosim.ac.uk

License: Other

Python 98.41% Shell 1.59%
high-performance-computing job-submission automation scientific-computing pbs-pro torque lsf-jobs slurm sge bioinformatics

longbow's Introduction

Longbow

Longbow is an automated simulation submission and monitoring tool. Longbow is designed to reproduce the look and feel of using software on the user's local computer, with the difference that the heavy lifting is done by a supercomputer.

Longbow will automatically generate the necessary submit files and handle all initial file transfer, monitor jobs, transfer files at configurable intervals and perform final file transfer and cleanup.

Longbow can be used to launch one-off jobs, generate ensembles of similar jobs or even run many different jobs over many different supercomputers.

Out of the box, Longbow supports the PBS/Torque, LSF, SGE, Slurm and SoGE schedulers and ships with application plugins for the commonly used bio-molecular simulation packages AMBER, CHARMM, GROMACS, LAMMPS and NAMD. Longbow is highly configurable and will function normally with generic software without plugins; plugins can also easily be written to extend Longbow to fully support applications and schedulers that are not covered out of the box.

Using Longbow can be as simple as the following example:

local: executable -a arg1 -b arg2 -c arg3

remote: longbow executable -a arg1 -b arg2 -c arg3

Longbow is also available to developers of applications which require support for automating job submission. Longbow is available as a convenient and light-weight Python API that can be integrated in a number of different ways.

Licensing

Longbow is released under the BSD 3-clause license. A copy of this license is provided when Longbow is downloaded and installed.

Citing

If you make use of Longbow in your own code or in production simulations that result in publishable output, then please reference our paper:

Gebbie-Rayet, J, Shannon, G, Loeffler, H H and Laughton, C A 2016 Longbow: A Lightweight Remote Job Submission Tool. Journal of Open Research Software, 4: e1, DOI: http://dx.doi.org/10.5334/jors.95

Installation

Releases can be installed either via pip or manually. To install via pip:

pip install longbow

To install manually (see docs), download Longbow from:

http://www.hecbiosim.ac.uk/longbow

then extract the archive and run the setup.py script to install.

Documentation

Documentation for Longbow users can be found here:

http://www.hecbiosim.ac.uk/longbow-docs

Examples

Example files can be installed either through the Longbow command-line or by downloading from the HECBioSim website manually:

longbow --examples

http://www.hecbiosim.ac.uk/longbow-examples

Support

For support with any issues arising from using Longbow, whether these are questions, bug reports or suggestions for new ideas, please use the Longbow forums here:

https://github.com/HECBioSim/Longbow/issues

Developers

Developers who wish to contribute to Longbow are welcome. We do ask that you contact us first if you wish to contribute to the Longbow base code.

The following resources are available to developers:

Code repository: https://github.com/hecbiosim/longbow

Unit testing: https://travis-ci.org/HECBioSim/Longbow

Code coverage: https://coveralls.io/github/HECBioSim/Longbow

longbow's People

Contributors

anjohan, gshannon1986, jimboid

longbow's Issues

Bug in file recognition

Files referenced on the command line that do not exist are ignored. This should raise an error.
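
A minimal sketch of the stricter behaviour, assuming a hypothetical helper check_files_exist (not part of Longbow) that is given the arguments already identified as file names:

```python
import os


def check_files_exist(file_args, workdir="."):
    """Raise an error if any referenced file is missing.

    `file_args` is assumed to be the list of command-line arguments that
    have already been identified as file names (hypothetical input).
    """
    missing = [name for name in file_args
               if not os.path.isfile(os.path.join(workdir, name))]
    if missing:
        raise FileNotFoundError(
            "Files referenced on the command line do not exist: "
            + ", ".join(missing))
```

Collecting every missing name before raising lets the user fix all of them in one go rather than one per run.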

Bash auto complete

Work this bash auto-complete script into Longbow so that users can use command-line completion:

_longbow()
{
    local cur prev opts
    COMPREPLY=()
    cur="${COMP_WORDS[COMP_CWORD]}"
    prev="${COMP_WORDS[COMP_CWORD-1]}"
    opts="--about --debug --disconnect --examples --help --hosts --job --jobname --log --recover --resource --replicates --verbose --version"

    if [[ ${cur} == -* ]] ; then
        COMPREPLY=( $(compgen -W "${opts}" -- "${cur}") )
        return 0
    fi
}
complete -F _longbow longbow

Investigate alternatives to landscape.io

Take a look at alternatives to landscape.io. The service seems to be broken and looks set to remain unavailable for quite some time (already 5 months at the date of this post). There is little information beyond "our servers are over capacity", "therefore we stopped checks" and "we are working on it", so nothing about the future and no dates.

Codeclimate seems to do what this project needs. However, it picks up many more problems than landscape did, some fair but others more annoying and unavoidable (duplicates, especially because we have a per-plugin code structure). Codeclimate also appears to do coverage, so we could possibly drop coverage.io too and just use cc + travis.

SGE parallel environments issue

The fix issued in issue #5 works, but there is a problem case when a cluster has GPUs. Removing the -pe flag altogether for jobs running on one CPU works for jobs run on serial queues. However, when submitting jobs to GPUs configured in a certain way, jobs won't run without -pe mpi 1. This is a special case, presumably due to GPUs being tied to non-serial queues, so an override flag should be included.

Cleanup remaining imports and function calls

At the moment there are large blocks of imports to import functions from the corelibrary. This should be cleaned up so that one simply does a single import of longbow and then calls the method as longbow.method().

This should be tested to make sure there are no side effects for things such as logging before pulling into the development branch.

In software plugins add ability to be able to differentiate between files and parameters

To be able to support more applications and do better checking of the command line for applications, it is crucial to be able to tell whether an argument should be a file or a parameter.

An example is the new GROMACS, which now has all utilities within a gmx wrapper, so mdrun would be invoked by:

gmx mdrun blah blah blah

We can't currently handle these.
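
One way this could look, as a sketch only: a plugin declares which flags expect a file and which expect a plain value. The names FILE_FLAGS, PARAM_FLAGS and classify_arguments are hypothetical and the flag tables are deliberately incomplete.

```python
# Hypothetical plugin metadata: which mdrun flags expect a file and which
# expect a plain value. A real plugin would need a fuller table.
FILE_FLAGS = {"-s", "-cpi"}
PARAM_FLAGS = {"-nsteps", "-ntomp"}


def classify_arguments(args):
    """Split flag values into files (to be staged) and plain parameters."""
    files, params = [], []
    # Pair each argument with the one that follows it.
    for flag, value in zip(args, args[1:]):
        if flag in FILE_FLAGS:
            files.append(value)
        elif flag in PARAM_FLAGS:
            params.append(value)
    return files, params


files, params = classify_arguments(
    ["gmx", "mdrun", "-s", "topol.tpr", "-nsteps", "1000"])
```

Because classification is driven by the flag rather than by position, the extra mdrun sub-command after gmx does not confuse it.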

Bug in setup script

The Python version check simply issues a warning if the Python version is out of scope. It should issue a warning and then fail.

Reported by Hannes Loeffler
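
A sketch of a check that warns and then aborts. The minimum version shown is an assumption for illustration; the real bounds belong to Longbow's setup script.

```python
import sys

# Hypothetical lower bound; the real supported range lives in setup.py.
MIN_VERSION = (2, 7)


def check_python_version(version_info=None, minimum=MIN_VERSION):
    """Warn and then abort when the interpreter is too old."""
    current = tuple((version_info or sys.version_info)[:2])
    if current < minimum:
        sys.stderr.write("Warning: unsupported Python version\n")
        raise SystemExit(1)  # stop the install instead of carrying on
    return current
```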

Keyboard interrupt during recovery mode execution results in ungraceful exit

There is no exception handling during recovery mode to capture things such as the user interrupt. There are two options:

  1. The same code that handles this behaviour in the longbowmain method could be implemented in the recovery method. This could allow the intended behaviour of the interrupt signal to deviate between normal operating mode and recovery mode.

  2. The code from the longbowmain method could simply be moved up a level to the main method so that it covers both the longbowmain and recovery methods. There will need to be a way of passing the jobs data structure up the chain, as it is not currently in the namespace. This could be done either by declaring an empty dict in the namespace and passing it down to be initialised, or by finding a way to append it to the exception and pull it out where needed.

Support python 3.6

This could be as simple as adding Python 3.6 to the unit testing tools. However, if there are problems then they should be listed here and fixed.

plugin specific params

Parameters that are only available within one plug-in whether it is an application or scheduler should begin with the name of the app or scheduler such as "cluster" => "lsf-cluster" so users unfamiliar with schedulers will understand which parameters are generic and which aren't.

Remove custom install imports

Supporting custom installs is pointless given the power of tools such as pip and easy_install/setuptools. The imports needed to support custom installs create more problems than they solve and look ugly and unpythonic. It is also a pain to write tests that cover them, which leads to an annoyance with coverage.

This should be withdrawn for future versions.

Extend worked examples in documentation

Some users have requested that, in addition to the simple examples that appear in the documentation, we show some examples of how to use the more advanced capabilities. These could either be spun out into a separate document or collected into their own section of the documentation.

Further validation on all inputs

Adding validation to inputs from all sources and raising errors should they be out of range/wrong format etc will provide an extra layer of protection and make use of Longbow more intuitive.

Have longbow auto change the user directory case

Longbow is moving away from having the capital L in its code base and directory structure. Longbow version 1.5.0 should change the existing user directory from .Longbow to .longbow automatically and inform the user that this has taken place. For first-time installers this should simply be skipped.
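
A minimal sketch of such a migration, assuming a hypothetical helper migrate_user_dir that is called once at start-up with the user's home directory:

```python
import os


def migrate_user_dir(home):
    """Rename ~/.Longbow to ~/.longbow; return True if a rename happened."""
    # Inspect the real directory entries so the check also behaves on
    # case-insensitive file systems.
    entries = os.listdir(home)
    if ".Longbow" not in entries or ".longbow" in entries:
        return False  # first-time install, already migrated, or both exist
    os.rename(os.path.join(home, ".Longbow"), os.path.join(home, ".longbow"))
    print("Longbow user directory moved from .Longbow to .longbow")
    return True
```

If both directories exist the sketch leaves them alone, since merging their contents would need a policy that is not described here.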

Friendly naming for recovery files

It was suggested at a recent talk about Longbow that the recovery method could be extended in a way that friendly names could be used for the sessions to recover rather than the dated filenames.

They could perhaps also still use the filenames (because they prevent data loss in conflicting names) but single/parts of multijobs could be extracted from sessions.

module override parameter has gone missing

In older versions of Longbow there used to be a MODULEOVERRIDE parameter that could be optionally provided in application plugin files to alter the default naming of modules from the name of the plugin file to something else. This could be used by users to set the module name if something peculiar had been used.

This should be reinstated along with tests, as somewhere along the line this functionality has disappeared.

This will need adding to dev docs as well.

Refactor the exception code in entrypoints.py

There is code and unit test duplication for the longbow() and recovery() methods. The try except blocks in the longbow and recovery methods that handle the user interrupt are identical. This code should be moved up to the launcher() scope where it will feature just once.

Doing this will require refactoring the code slightly so that the jobs structure is in the scope of launcher, it will need passing in to longbow() as part of parameters.
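
The shape of this refactor can be sketched as follows. The function bodies are stand-ins and the signatures are assumptions; only the names launcher(), longbow() and recovery() come from the description above.

```python
def longbow(jobs, parameters):
    """Stand-in for the normal mode; fills `jobs` in place."""
    jobs["job1"] = {"laststatus": "Running"}
    raise KeyboardInterrupt  # simulate the user pressing Ctrl+C


def recovery(jobs, recoveryfile):
    """Stand-in for recovery mode; fills `jobs` in place."""
    jobs["job1"] = {"laststatus": "Queued"}


def launcher(mode, argument, cleanup):
    """Own the jobs structure and handle the interrupt in one place."""
    jobs = {}  # created here so it is in scope for the except block
    try:
        if mode == "recover":
            recovery(jobs, argument)
        else:
            longbow(jobs, argument)
    except KeyboardInterrupt:
        cleanup(jobs)  # hypothetical callback, e.g. kill jobs and tidy up
    return jobs
```

Because the dictionary is created in launcher() and filled in place, the handler sees whatever state existed at the moment of the interrupt, and the try/except only needs testing once.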

Allow launching arbitrary software

Longbow should allow the launching of arbitrary software, however this should be done by the user having to provide a flag such as --nochecks. This will indicate to Longbow that the user intends to do such things and will then be able to distinguish between mistakes and intentional behaviour.

Longbow should still perform some checks, but the application specific checking should be reduced.
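
A sketch of how the opt-in could gate the application check. The SUPPORTED set and the check_executable helper are hypothetical; --nochecks is the flag proposed above.

```python
# Hypothetical set of executables covered by application plugins.
SUPPORTED = {"pmemd", "charmm", "mdrun", "lmp", "namd2"}


def check_executable(executable, nochecks=False):
    """Return True if the executable has plugin support.

    Unknown executables are rejected unless the user opted in with the
    proposed --nochecks flag, in which case False is returned so callers
    can skip the application-specific checks.
    """
    if executable in SUPPORTED:
        return True
    if nochecks:
        return False
    raise ValueError("Unsupported executable: " + executable)
```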

Use timers for staging and polling

Get rid of the blocking sleep function and separate the staging and polling into two timers so they can be changed independently.

The sleep timer sometimes doesn't behave the way a user would expect.
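
A sketch of two independent, non-blocking timers. The class name IntervalTimer and the interval values are assumptions for illustration.

```python
import time


class IntervalTimer:
    """Report when a fixed interval has elapsed, without blocking."""

    def __init__(self, interval, now=None):
        self.interval = interval
        self.next_due = (time.time() if now is None else now) + interval

    def due(self, now=None):
        """Return True once per interval and schedule the next one."""
        now = time.time() if now is None else now
        if now < self.next_due:
            return False
        self.next_due = now + self.interval
        return True


# Hypothetical intervals in seconds; each can be changed independently.
polling = IntervalTimer(interval=60, now=0)
staging = IntervalTimer(interval=300, now=0)
```

The optional now argument exists only so the behaviour can be checked with fixed clock values; in normal use it is left out and the wall clock is read.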

Remove persistent development branch

This branch is no longer required for testing. It was created back in the days when testing was manual and functionality had to be checked before merging to master. CI tools now do this at commit and PR time, so the branch is no longer needed.

Parallel environments in SGE

The -pe flag in the SGE plugin should not always be set to ib. The parallel environment name is under the control of the system administrator, so there should be a parameter to modify it. The flag should disappear if the number of cores is 1, and mpi is possibly a sensible default to start with.
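
A sketch of the proposed behaviour, assuming a hypothetical helper sge_pe_directive and a hypothetical pe_name parameter:

```python
def sge_pe_directive(cores, pe_name="mpi"):
    """Return the '#$ -pe' line for a job script, or None for serial jobs.

    `pe_name` is a hypothetical parameter name; valid values are decided
    by the system administrator of each cluster.
    """
    if int(cores) == 1:
        return None  # serial job: omit the parallel environment
    return "#$ -pe {0} {1}".format(pe_name, cores)
```

A site that names its parallel environment ib would then call sge_pe_directive(48, pe_name="ib").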

The method configuration.processjobs() should not require hosts file

The processjobs method will raise an exception if a hosts file is not supplied. This check is already performed in the Longbow application layer, so it should not be done again. It is prohibitive to developers using the library, as they might well wish to simply pass a dictionary of data to be Longbowized.

Gromacs -deffnm on replicate jobs

The -deffnm parameter seems to be picking up, or at least pointing to, a global file location when used with a global input file -s.

Error when launching with executable on absolute path

The following traceback is generated when trying to launch a self-compiled GROMACS:

2017-08-04 13:58:40 - ERROR - longbow - '/work/c01/c01/jtg2/gmx/bin/mdrun_mpi'
Traceback (most recent call last):
  File "/home/jimboid/.local/lib/python2.7/site-packages/longbow/corelibs/entrypoints.py", line 224, in launcher
    longbow(parameters)
  File "/home/jimboid/.local/lib/python2.7/site-packages/longbow/corelibs/entrypoints.py", line 266, in longbow
    jobs = configuration.processconfigs(parameters)
  File "/home/jimboid/.local/lib/python2.7/site-packages/longbow/corelibs/configuration.py", line 175, in processconfigs
    _processconfigsfinalinit(jobs)
  File "/home/jimboid/.local/lib/python2.7/site-packages/longbow/corelibs/configuration.py", line 444, in _processconfigsfinalinit
    jobs[job]["modules"] = modules[jobs[job]["executable"]]
KeyError: '/work/c01/c01/jtg2/gmx/bin/mdrun_mpi'
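
The KeyError arises because the full path is used as the dictionary key. One possible fix, sketched here with a hypothetical stand-in for the modules dictionary and a hypothetical helper module_for, is to look up by the executable's file name and fall back to a default:

```python
import os

# Hypothetical mapping of executable name to module name, standing in for
# the dictionary that the traceback shows being indexed.
modules = {"mdrun_mpi": "gromacs", "pmemd": "amber"}


def module_for(executable, default=""):
    """Look up the module by executable name, ignoring any directory part."""
    name = os.path.basename(executable)
    return modules.get(name, default)
```

Using .get() with a default also avoids the exception for executables that no plugin knows about.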

Issue with environments

Some HPC machines seem to create a problem where the environment is not loaded under a non-login shell. This includes environment variables for the scheduler and even the Linux module system. A simple cure is that the user can add "source /etc/profile" into their ~/.bashrc file. Machines where this doesn't happen seem to have a .bashrc setup to reference the various environment files by default.

Simply asking for a login shell does not seem to work either; this appears to be specific to shells requested through a script. I'm not sure whether some sort of safeguard prevents certain sessions from having a login shell (perhaps the tty issue).

There are several solutions to this:

  1. Pass information to the user through documentation, telling them how to add the relevant line to their .bashrc.
  2. Prepend "source /etc/profile" to every command that gets passed through to subprocess (not sure if this is bad).
  3. Find out if there is a proper way to get a valid login shell when using Python subprocess.

It is unclear at this stage which solution is the best one. More work and investigation is required so that we can support as many machines as possible.
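
For illustration, the second option amounts to wrapping each command string before it is sent. The helper name with_profile is hypothetical, and whether this is safe on every machine is exactly the open question above.

```python
def with_profile(command, profile="/etc/profile"):
    """Prefix a remote shell command so the system profile is loaded first."""
    # ';' rather than '&&' so the command still runs if sourcing fails.
    return "source {0}; {1}".format(profile, command)
```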

Add email flag

Add an email flag so that users can request that completed jobs send emails.

Remove capitalisation from download examples unit tests

Once Longbow v1.5.0 has been released along with its examples, the unit test for the downloadexamples method will need to have the remainder of the capitals removed. Currently they cannot be taken out due to the tests performing an actual download.

Implement update mode for disconnected sessions

When using the disconnect mode with --disconnect, currently the only way to reconnect is via the --recover option, but this reconnects permanently without then disconnecting again. A new mode should be implemented to reconnect to update the status of the jobs, transfer files and then disconnect again. This should be able to be done as many times as the user wishes up until completion of all jobs.

Adapt testing suite into unit tests

The current testing suite has lots of functional and partial unit test support. A new set of unit tests should be made available where functional tests are also included, but using mock where possible to simulate external connections, files and so on.

Remove reference to local cluster in docs

Remove the sentence "Longbow has been developed with those that use a local batch system in their lab such as Condor in mind. However Longbow can just as easily be ran directly from a desktop machine terminal."

Disconnectable sessions have made this obsolete, as rightly pointed out by @gshannon1986.

Friendlier imports

Place non-private methods into the longbow package __init__.py. This means that what would have been something like this:

from longbow.corelibs import applications

applications.checkapp()

now becomes

import longbow

longbow.checkapp()

This makes the code more readable in complex software with many imports.

Check user documentation for continuity

Check the user docs to make sure they reflect the more recent code changes.

One point about replicates has been reported as unclear: the creation of the repx directories when global files are used.

Re-licensing

Look at a more permissive license than the GPLv2 and release the next release and each release after under this.

GPLv2 is not friendly for inclusion into other software under some circumstances. At the very least LGPL should be used.

Monitoring stage thrashing cpu core

This is totally unnecessary behaviour and is due to the introduction of the new timing code. The old code did lots of checks, which prevented full utilisation of the core, whereas the new code only does the checks at specific time intervals. The effect is that the core running Longbow is thrashed while the only thing happening is spinning the counter and waiting to do checks.

Perhaps a solution to this is to work out some sleep intervals that will take account of both the status checking time interval and the file staging time interval.
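
A sketch of that idea: sleep until whichever of the two checks is due first. The function and argument names are hypothetical.

```python
def seconds_until_next_event(now, next_poll, next_stage):
    """Return how long the loop can sleep before either check is due."""
    return max(0.0, min(next_poll, next_stage) - now)
```

The monitoring loop would then call time.sleep() with this value instead of spinning, so the core stays idle between checks while neither interval is overshot.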

Implement cleanup for expired recoveryfiles

Recovery files tend to build up rapidly over time, and currently the user has to clean these out periodically by hand. A better solution would be one of:

  1. Only keep recent files (say the last week).
  2. Provide a command-line flag such as --recovery-clean that wipes out anything in the directory.
  3. Rearrange how the file is handled within the code. Use the global jobfile parameter only for initial creation, then store a handle to the file under each job (there is no other place to put it). This way the filename can be used throughout Longbow, there is no need to create a new recovery file on recovery or reconnection, and code can be added to the cleanup function to remove the recovery file once all jobs are done.
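
The first option could be sketched as below, using file modification times. The helper name clean_recovery_files and the one-week default are assumptions.

```python
import os
import time

WEEK = 7 * 24 * 3600  # retention period in seconds (option 1 above)


def clean_recovery_files(directory, max_age=WEEK, now=None):
    """Delete files in `directory` older than `max_age`; return their names."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Calling it with max_age=0 would behave much like the second option, removing every recovery file in the directory.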

Support for new GROMACS

Gromacs can now be called using the form:

$ gmx mdrun blah blah blah

ARCHER now has this available through the module system so we should support it.

Sub-queuing breaks on multiple instances

When running several instances of Longbow, if one of them maxes out the queue and another ends up queuing jobs, the latter never gets to submit its jobs. This also happens to varying degrees when the queue is split over multiple versions: the instance performing queuing never grows its share of the machine.
