
crab2's People

Contributors

belforte, edelmann, ericvaandering, jpata, ktf, perilousapricot


crab2's Issues

"multicrab" support implementation

Original Savannah ticket 6377 reported by slacapra on Sun Feb 10 10:36:05 2008.

The basic idea is to allow the user to define a list of datasets (and other parameters if needed) to be processed by the same code.

Starting from a set of common configuration parameters plus a list of per-dataset configuration keys, multicrab must be able to treat the whole as a single task composed of many sub-tasks (one sub-task per dataset to analyze).

Use case: run the same analysis code on real data, MC and private MC in just one task (not three different tasks). A configuration sketch is given below.
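A minimal sketch of what such a multicrab configuration could look like, assuming an INI-style file in the spirit of crab.cfg; the section and key names here are illustrative assumptions, not the final syntax:

    [MULTICRAB]
    cfg = crab.cfg                 ; common parameters shared by all sub-tasks

    [COMMON]
    CMSSW.pset = analysis_cfg.py   ; same analysis code for every dataset

    [realdata]
    CMSSW.datasetpath = /SomeRealData/Run2008/RECO

    [mc]
    CMSSW.datasetpath = /SomeMC/Summer08/GEN-SIM-RECO

    [privatemc]
    CMSSW.datasetpath = /MyPrivateMC/user-prod/USER

Each section after the common block would become one sub-task of the same task.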

submission to the CAF using the server

Original Savannah ticket 6392 reported by spiga on Sun Feb 10 11:56:27 2008.

Submission to the CAF must be enabled and tested also when going through the server.

CPU/wallclock requirements in JDL

Original Savannah ticket 6525 reported by afanfani on Sun Mar 2 16:38:22 2008.

Since several Aborted jobs are due to users' jobs hitting the batch queue limit (*), one could foresee to:
a) set default CPU/wallclock requirements in the JDL to high values. Make a census of the existing values published by the sites in order to find the default setting (and maybe to spot potentially mispublishing sites); a sketch of such requirements is given below.
b) notify users of the change in behaviour implemented in a), so that if they know their jobs are short they can lower the default setting via crab.cfg parameters.
This should be a pretty fast implementation.
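A minimal sketch of what the default requirements of point a) might look like in the JDL; the Glue attribute names follow the same schema as the Rank expressions used elsewhere, while the numeric values (in minutes) are placeholders to be replaced by the outcome of the census:

    Requirements = other.GlueCEPolicyMaxCPUTime >= 1440 &&
                   other.GlueCEPolicyMaxWallClockTime >= 2880;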

In (*) it is also suggested to have a forked process that could watch the CMS job, kill it and report a clear message.
To be evaluated in the future.

(*)
https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/642.html

extract output file name from pset

Original Savannah ticket 6375 reported by spiga on Sun Feb 10 10:03:13 2008.

Enable CRAB to extract the name of the output file that will be produced by the application directly from the output module.
This file will be retrieved/staged out by default. Any other output files, if present, must be declared by the user as usual.
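A minimal, deliberately naive sketch of the extraction, scanning the pset source for the output module's fileName parameter; a real implementation would go through the CMSSW config API, so the regex approach here is purely an assumption for illustration:

    import re

    def output_file_from_pset(pset_path):
        # naive scan for fileName = cms.untracked.string('...') in the pset source
        text = open(pset_path).read()
        m = re.search(r"fileName\s*=\s*cms\.untracked\.string\(\s*['\"]([^'\"]+)['\"]", text)
        return m.group(1) if m else None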

improvement on bosslite listmatch

Original Savannah ticket 6745 reported by gcodispo on Tue Apr 15 19:04:50 2008.

Since for us it is enough to have just one site that matches the requirements, we can stop the lcg-info query as soon as one site is matched.
More generally, if more than one site hosts the data, we can stop the loop over them as soon as one CE is matched for any of the sites hosting the requested data.
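A minimal sketch of the early-exit logic; list_match() stands in for the real BossLite/lcg-info query and is an assumption, not the actual API:

    def first_matching_ce(sites, requirements, list_match):
        # stop everything as soon as a single CE satisfies the requirements
        for site in sites:
            for ce in list_match(site, requirements):
                return ce  # one match is enough: exit both loops
        return None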

implementation of a "localcopy" feature to take output file from a SE to the user local area

Original Savannah ticket 6388 reported by spiga on Sun Feb 10 11:41:45 2008.

If the job output is copied to an SE, as it must be for largish output, the user typically does not have interactive access to that SE from the node where he/she works. So, in order to use the output, the first step is to copy the produced files locally. CRAB should provide such a functionality, using the proper copy command (srmcp or lcg-cp or cmscp or whatever) for all the files.
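A minimal sketch of such a -localCopy helper; the command names are the ones mentioned above, while the fall-through strategy and argument order are simplifying assumptions:

    import subprocess

    COPY_COMMANDS = ['srmcp', 'lcg-cp', 'cmscp']

    def local_copy(surl, dest):
        for cmd in COPY_COMMANDS:
            try:
                # try each known copy tool in turn until one succeeds
                subprocess.check_call([cmd, surl, dest])
                return True
            except OSError:
                continue  # tool not installed on this UI
            except subprocess.CalledProcessError:
                continue  # tool present but the copy failed
        return False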

enable the "global" crab.cfg support

Original Savannah ticket 6376 reported by slacapra on Sun Feb 10 10:12:58 2008.

As discussed with Stijn, it could be useful to instrument CRAB to perform a hierarchical search of the configuration files.
In many cases, e.g. for a collaborating group of users, most of the configuration parameters are the same while just a few differ from one user to another. Once a basic common configuration is defined in a crab.cfg template, it can be used by all the users, overriding keys where needed; a sketch of the lookup order is given below.
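A minimal sketch of the hierarchical lookup, assuming INI-style files read with Python 2's ConfigParser (as CRAB used at the time); files read later override keys from files read earlier, and the concrete paths are assumptions:

    import os
    from ConfigParser import ConfigParser

    def load_crab_config():
        cfg = ConfigParser()
        # read order defines precedence: site-wide, then group, then user
        cfg.read(['/etc/crab.cfg',
                  os.path.expanduser('~/.crab.cfg'),
                  'crab.cfg'])
        return cfg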

CRAB homepage and twiki reorganization

Original Savannah ticket 6674 reported by fanzago on Mon Mar 31 11:26:57 2008.

Remove some links from the CRAB homepage menu by updating the HTML page contents. Update the CRAB twiki page. Move the CRAB howto into the twiki page, adding information about publication.

GetOutput Migration from Boss to bosslite

Original Savannah ticket 6490 reported by mcinquil on Tue Feb 26 13:00:04 2008.

Need to migrate the GetOutput component from BOSS to BossLite usage.
Methods and names should stay the same, but many definitions and SQL tables differ, so many checks are needed.

implement a new feature which allows an easy report about submitted jobs/tasks

Original Savannah ticket 6391 reported by ewv on Sun Feb 10 11:54:27 2008.

Crab -report
This is an important feature, needed for both aborted and successfully finished jobs, though with a different scope in each case.

* For aborted jobs, and more generally for jobs that did not finish well, detailed information about the reasons for the failure is needed, whether it is a Grid problem or an application-specific one.

* For successfully finished jobs, the information to report to the user is the number of files correctly/incorrectly processed and the number of processed events with respect to the number of requested events, plus provenance, integrated luminosity (if applicable) and any other information needed to produce physically sensible plots. For each job of the submitted task (whether submitted directly or through the server; the functionality must be there in both cases), all this information must be reported back (via the FJR, adding fields if needed) and stored in the proper table. Finally, -report will just collect and present the stored information; a sketch of the aggregation step is given below.
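A minimal sketch of the aggregation behind -report, assuming the per-job FJR information has already been parsed into dictionaries; all field names here are placeholders, not the actual FJR schema:

    def task_report(jobs):
        report = {'events_read': 0, 'files_ok': 0, 'files_failed': 0}
        for job in jobs:
            # sum the per-job counters stored from the FJRs
            report['events_read'] += job.get('EventsRead', 0)
            if job.get('exit_code', 1) == 0:
                report['files_ok'] += 1
            else:
                report['files_failed'] += 1
        return report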

support to additional requirements management

Original Savannah ticket 6493 reported by farinafa on Tue Feb 26 13:13:28 2008.

A given task may need to wait for a requirement to be matched before the server starts to manage it.

The command manager component should be able to treat this field as well.

test of: new support for py pset @ WN level and api to extract hnUserName usage

Original Savannah ticket 6526 reported by None on Sun Mar 2 17:18:56 2008.

CRAB_2_1_1_pre1 is installed on
/afs/cern.ch/cms/ccs/wm/scripts/Crab/Dev/

It supports both python pset management at WN level and extraction of the hnUserName using Simon's API.

Both are working, but they should be tested...

The py pset management is needed to support CMSSW_2_x_x series.

add the merge support for analysis jobs

Original Savannah ticket 6387 reported by None on Sun Feb 10 11:39:45 2008.

We should have a way to merge the output of split jobs. If the output is retrieved locally (directly via the sandbox or via -localCopy), CRAB can do this using EdmFastMerge (for EDM output) or hadd (for generic ROOT files); a sketch of the latter is given below. More generally, once the output is put on a given SE, we might send a merging job to the proper CE, which collects all the files produced in the local SE, merges them, and possibly deletes the unmerged files. This can be done automatically by the server or on demand. Most probably the merging machinery used by the ProdAgent can be reused.
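A minimal sketch of the local merge step for generic ROOT files, assuming the outputs have already been retrieved; hadd is ROOT's standard merger and takes the target file followed by the inputs:

    import subprocess

    def merge_outputs(inputs, merged='merged.root'):
        # hadd writes all histograms/trees from the inputs into one file
        subprocess.check_call(['hadd', merged] + list(inputs))
        return merged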

Disk space and server cleaning management

Original Savannah ticket 6385 reported by mcinquil on Sun Feb 10 11:34:21 2008.

A server cleaning system is crucial (see also the WMS experience...). Even if we assume that jobs submitted to the server copy their output directly to an SE, the server needs a component that takes care of checking the quota available on the server SE, performing actions like the following (see the sketch after this list):

-- notify the user and the admin
-- clean the disk (moving and removing old stuff)
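A minimal sketch of the quota watchdog, using os.statvfs; the threshold and what exactly to do when it is exceeded are assumptions:

    import os

    def over_quota(path, threshold=0.9):
        st = os.statvfs(path)
        # fraction of the filesystem already used
        used = 1.0 - float(st.f_bavail) / st.f_blocks
        return used > threshold  # if True: notify and/or clean old stuff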

test

Original Savannah ticket 6425 reported by spiga on Sun Feb 17 10:28:47 2008.

test

Improve the ErrorHandler, making job resubmission smarter

Original Savannah ticket 6386 reported by fanzago on Sun Feb 10 11:37:37 2008.

The management of job resubmission on the server must be improved; it is a crucial point for maximizing the job success rate. The plug-in structure to implement different logic/actions for different categories of errors is ready. What is needed here is to:

-- define and implement the logic and the policies.
This will be the place to implement a sort of queue system (dynamic black lists...)

new client-server communication system

Original Savannah ticket 6380 reported by farinafa on Sun Feb 10 11:01:44 2008.

Building on the experience gained so far, we are re-designing the architecture of the server front-end components and the back-end submission, in order to simplify and optimize the interactions among the components and the overall throughput. The guideline is based on the assumption that a task is composed of an XML description plus a set of specific files.

Client-server communication

Move from the current dropbox system (lcg-cp based, with a spool server component to pick up new tasks) to an explicit invocation mechanism (a WebService; the general benefit is explicit submission triggering).

implement task flavor management in the command manager component

Original Savannah ticket 6492 reported by farinafa on Tue Feb 26 13:08:25 2008.

Submitted tasks can differ from each other in their flavor.
The main differences are in the workflow the jobs must follow.
The command manager must be able to discriminate between tasks using this information.

change the seed manipulation

Original Savannah ticket 6428 reported by ewv on Sun Feb 17 12:56:52 2008.

With the upcoming changes in the RandomNumberGeneratorService in
CMSSW_2_X_X, DaveE. is developing a standard python API
in CMSSW so that both users and WM Tools (CRAB, PA) can use the same
standard interface to the service in the (python) cfg file.

There is a prototype for this interface at:
http://home.fnal.gov/~evansde/prodagent_tools/RandomService.py

Eric's proposal:

With the new RandomNumberGeneratorService and the switch to python
config files for CMSSW, I think we have a chance to change how CRAB does
this for the better and make the whole thing scalable so that
a) we don't have to change CRAB every time a seed is added
b) users can supply their own code that uses seeds and CRAB can handle
that just as well as the "official" ones for mixing, Digis, etc.

a) By default, all seeds in the job will be randomized by using the
populate() method in Dave's class. The CRAB user will not have to
remember which seeds exist in the job, nor take pains to change them all.

b) The user can supply a list of seeds not to change at all. They will
be held at their values in the config file:
[CMSSW].preserve_seeds = mixSeed,vertexSeed

c) The user can also supply a list of seeds to increment by the job number:
[CMSSW].increment_seeds = trackDigiSeed,ecalDigiSeed
should they want to use CRAB to generate repeatable events.

d) We could even allow a flag that changed the default behavior from a)
to b) or c) for all seeds.

But, I think the usual case that people are interested in would be case
a) if they are just using CRAB to do small-scale private MC generation.
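A minimal sketch of the client-side logic for cases a)-c), assuming a populate()-style randomizer like the prototype linked above; the service object and every accessor name here are assumptions, not the actual CMSSW API:

    def adjust_seeds(svc, job_number, preserve=(), increment=()):
        # svc stands for the RandomNumberGeneratorService handled via the API
        for name in svc.seedNames():                     # assumed accessor
            if name in preserve:
                continue                                 # case b): keep cfg value
            elif name in increment:
                svc.setSeed(name, svc.getSeed(name) + job_number)  # case c)
            else:
                svc.randomizeSeed(name)                  # case a): populate()-style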

development of condor_g scheduler script for BossLite

Original Savannah ticket 6384 reported by ewv on Sun Feb 10 11:31:41 2008.

Develop the specific script to interact with the real scheduler, to be added to the existing ones in the BossLite layer.
The script will be used through the developed API, making the interaction with the real scheduler transparent.

Retrieving jobs in Done status

Original Savannah ticket 6778 reported by None on Tue Apr 22 16:43:40 2008.

The status of a finished but not yet retrieved job is Done, both on the server and on the client side; from the client it is not possible to retrieve the output before the server has done so. A solution is therefore needed to prevent the user from retrieving the job output from the server before it is ready on the server itself (up to the 003 release there was a 'Retrieving by the server' status).

InputSandbox size handling

Original Savannah ticket 6635 reported by None on Mon Mar 24 11:55:40 2008.

A user is reporting that with user-compiled code (default.tgz) of 9.2 MB all his jobs submitted via the WMS get aborted (*) because the download of the input sandbox fails, while after reducing the user-compiled code to 5.4 MB the jobs start and complete successfully.
The maximum input sandbox size allowed by default in CRAB is 9.5 MB.

For the time being:
if ~9 MB is really causing systematic problems everywhere, and it is not a temporary effect of one WMS or one site, then the default in CRAB should be lowered.

For the future, with the CRABSERVER:
handling the input sandbox via an SE should remove this limitation.

(*)
Reason = Cannot download BossArchive_1_g0.tgz_25743 from
gsiftp://wms011.cnaf.infn.it:2811/var/glite/SandboxDir/pN/https_3a_2f_2fwms007.cnaf.infn.it_3a9000_2fpNQ0OL1k-5E4YdzCRvHouw/input/BossArchive_1_g0.tgz_25743

Limitation due to the input sandbox

Original Savannah ticket 1704 reported by None on Tue Feb 15 07:44:12 2005.

The problem related to the input sandbox quota limit (8 MB) should be bypassed, because the executable and libraries are passed to a job via a tar.gz file of variable size.

Possible solution:
implement in CRAB a piece of code to handle the copy of the tar.gz file to a specific SE hardcoded in CRAB.

The SE could eventually be specified by the user.
A possible problem arises if different users use an executable with the same name and store it in the same location on the SE.

The problem can be solved with BOSS, but this is not yet implemented.

Storing the tar.gz file on the SE does not mean that the file has to be registered in the replica catalog, but it is necessary to match the LFN information with the PFN on the SE.

Output files should also be stored on the SE and eventually deleted at the end of a given task.

The retrieval of the output and the deletion of the output files from the SE are still not implemented in BOSS.

JobTracking and GetOutput fine tuning

Original Savannah ticket 6873 reported by mcinquil on Tue May 20 03:32:48 2008.

The current settings may not be the best and need fine tuning.
I would suggest reducing the poll rates (e.g. pollInterval = 1800) in order to reduce the overall load and, in particular, to free resources for the status-query and output-handling threads, while leaving the query interval quite short (up to 120).
This should result in a small delay in the handling of finished jobs and a speed-up of the job status update.
This preliminary work should give an idea of the actual delay of the two components and open the way to new optimizations.

No resubmission allowed if getoutput fails

Original Savannah ticket 6697 reported by None on Sun Apr 6 06:18:41 2008.

Users are reporting (*) that when the getoutput operation fails, the job stays in "Done" status and it is not possible to resubmit it.
I think the higher occurrence now might be related to a bug in the sandbox cleanup by the WMS (**).
However, is there a strategy in CRAB to address/work around such cases?

(*)
https://hypernews.cern.ch/HyperNews/CMS/get/crabFeedback/1034.html
https://hypernews.cern.ch/HyperNews/CMS/get/crabFeedback/1042.html

(**) from [email protected]
Date: Fri, 4 Apr 2008 14:03:14 +0200
From: Maarten Litmaath <[email protected]>
To: Yvan Calas <[email protected]>,
Cristina Aiftimiei <[email protected]>
Cc: [email protected]
Subject: WARNING - bug in cleanup-sandboxes script !!!

Hi all,
http://litmaath.home.cern.ch/litmaath/cleanup-sandboxes/
I just found a bug in the script: it removes files based on their
"mtime" time stamp, but that value can be set to a long time ago
for files that are sent as part of an ISB tar ball !!!

For now, please comment out the cron job (and restart cron, just
to be sure) or better remove the rpm. I will make a new version.

Thanks,
Maarten

[BUG] Problem on task declare after submission failure

Original Savannah ticket 6841 reported by farinafa on Mon May 12 04:43:23 2008.

Sometimes, when it is not possible to submit jobs, the thread of the CrabServerWorker component seems to finish silently (*); then, when the component is restarted, there is a failure because the component tries to declare the task a second time (**).

Mattia

(*)
2008-05-12 11:01:29,435:Creating cursor object for session: mcinquil_crab_0_080512_110112_24d99237-3bdb-480f-b9f8-2b0678f9947a
2008-05-12 11:01:29,437:Transaction committed
2008-05-12 11:01:29,437:Worker worker_3_mcinquil_crab_0_080512_110112_24d99237-3bdb-480f-b9f8-2b0678f9947a submitting a new task
2008-05-12 11:01:29,437:Worker worker_3_mcinquil_crab_0_080512_110112_24d99237-3bdb-480f-b9f8-2b0678f9947a pre-submission checks passed
2008-05-12 11:01:29,438:Worker worker_3_mcinquil_crab_0_080512_110112_24d99237-3bdb-480f-b9f8-2b0678f9947a listmatched jobs, now submitting
2008-05-12 11:01:29,438:Turl
2008-05-12 11:01:30,425:MS: get requested

(**)
2008-05-12 11:11:35,388:Traceback (most recent call last):
File "/data/new_server/code_area/CRAB/CRABSERVER/lib/CrabServerWorker/FatWorker.py", line 187, in run
self.submissionDriver()
File "/data/new_server/code_area/CRAB/CRABSERVER/lib/CrabServerWorker/FatWorker.py", line 208, in submissionDriver
taskObj = self.blDBsession.declare(self.taskXML, self.proxy)
File "/data/new_server/code_area/PRODCOMMON/lib/ProdCommon/BossLite/API/BossLiteAPI.py", line 163, in declare
self.saveTask( task )
File "/data/new_server/code_area/PRODCOMMON/lib/ProdCommon/BossLite/API/BossLiteAPI.py", line 206, in saveTask
self.removeTask( task )
File "/data/new_server/code_area/PRODCOMMON/lib/ProdCommon/BossLite/API/BossLiteAPI.py", line 239, in removeTask
task.remove( self.db )
File "/data/new_server/code_area/PRODCOMMON/lib/ProdCommon/BossLite/DbObjects/Task.py", line 219, in remove
raise TaskError("The following task instance cannot be removed" +
TaskError: 'The following task instance cannot be removed since it is not completely specified: Task instance None\n cfgName : /data/crabSE/maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822/CMSSW.cfg\n globalSandbox : /data/crabSE/maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822/default.tgz,/data/crabSE/maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822/CMSSW.sh\n jobType : None\n name : maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822\n outputDirectory : /data/crabSE/maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822\n scriptName : /data/crabSE/maselli_crab_0_080508_125902_2840141f-5731-436b-98fa-56025472f822/CMSSW.sh\n serverName : None\n startDirectory : /afs/cern.ch/user/m/maselli/scratch0/CMSSW_2_0_6/src/CalibMuon/DTCalibration/test//\n user_proxy : anonymous\n'

ErrorHandler plugin for the submission

Original Savannah ticket 6806 reported by fanzago on Wed Apr 30 05:22:43 2008.

A plugin is needed in order to catch the submission failures, mainly those related to missing input sandbox files.

In particular, the plugin should catch the message from the CrabServerWorker with the reason of the failure, then check that the TaskTracking has marked the whole task as "NotSubmitted" and, finally, publish to the Notification component an order to send a related e-mail.

CRAB wrapper cleaning and reorganization

Original Savannah ticket 6389 reported by fanzago on Sun Feb 10 11:44:41 2008.

The wrapper must be cleaned up and reorganized where needed, basically with the aim of obtaining:

* clearer error-message reporting
* easy-to-parse output (which must also allow an easier implementation of the -report functionality)

One of the most important things is to check the error codes reported to the dashboard.

[BUG] Incorrect padding error in crab -status

Original Savannah ticket 6867 reported by farinafa on Mon May 19 10:15:26 2008.

Executing 'crab -status', it sometimes happens that jobs are frozen in a status and seem not to proceed (even if on the server they are proceeding).
Executing 'crab -status -debug 10', the traceback below is printed (*).
When the error (*) disappears, the status is shown and updated correctly...

(*)
crab. WARNING: Problem while decompressing fresh status from the server.
crab. Traceback (most recent call last):
File "/afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_2_0/python/StatusServer.py", line 48, in resynchClientSide
reportXML = zlib.decompress( base64.urlsafe_b64decode(handledXML) )
File "/build1/162p1/slc4_ia32_gcc345/external/python/2.4.2-CMS3q/lib/python2.4/base64.py", line 112, in urlsafe_b64decode
return b64decode(s, '-_')
File "/build1/162p1/slc4_ia32_gcc345/external/python/2.4.2-CMS3q/lib/python2.4/base64.py", line 76, in b64decode
raise TypeError(msg)
TypeError: Incorrect padding
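'Incorrect padding' is the standard base64 complaint about input whose length is not a multiple of 4, which suggests the compressed status string arrives truncated or with its trailing '=' stripped. A minimal sketch of a defensive client-side fix (this is only the repair idea, not the actual patch to StatusServer.py):

    import base64, zlib

    def decode_status(handledXML):
        # re-pad to a multiple of 4 before decoding, so a stripped
        # trailing '=' no longer raises 'Incorrect padding'
        padded = handledXML + '=' * (-len(handledXML) % 4)
        return zlib.decompress(base64.urlsafe_b64decode(padded))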

change users LFN namespace, add check on user datasetpath

Original Savannah ticket 6374 reported by fanzago on Sun Feb 10 04:54:21 2008.

The user LFN namespace suggested at the DMWM workshop is:
/store/user/hn_username and /store/temp/hn_username
using the HyperNews user name instead of the name extracted from the DN.
There are "APIs" to get the HyperNews user name from SiteDB (provided by Simon).

In addition, as discussed with Lee, CRAB must enable a check of the validity of the datasetpath chosen by the user.

Safe check on retrieved OSB

Original Savannah ticket 6779 reported by farinafa on Tue Apr 22 16:48:31 2008.

A safety check is needed on the getoutput operation executed from the client side: the output sandbox should be deleted from the server SE only once it has actually, and successfully, been retrieved.

PADA Analysis: JDL Ranking

Original Savannah ticket 6653 reported by afanfani on Thu Mar 27 14:14:05 2008.

The default UI rank is:
Rank = -other.GlueCEStateEstimatedResponseTime
however, most sites publish bogus values.

A suggestion is to move to the gLite user guide's example:

Rank = (other.GlueCEStateWaitingJobs == 0 ?
other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs);

More generally, try different rankings with CRAB, check the correctness of the attributes published by the sites and used in the ranking, etc., trying to find a decent ranking, better than the default one, for job distribution across sites.

support for SRMv2

Original Savannah ticket 6379 reported by spiga on Sun Feb 10 10:50:24 2008.

CRAB must be able to support the stage-out with SRMv2.

LSF (CAF) script for BossLite

Original Savannah ticket 6390 reported by spiga on Sun Feb 10 11:51:35 2008.

In order to make CRAB work at the CAF once the current BOSS is switched off, the corresponding BossLite scheduler script needs to be implemented.

CRAB-SA/Client and BossLite integration

Original Savannah ticket 6383 reported by spiga on Sun Feb 10 11:24:33 2008.

Switch off the old BOSS and integrate the python API to:

  • interact (request and update information) with the database
  • interact with the various schedulers

Enable crab to get Histo names from JobReport

Original Savannah ticket 6527 reported by slacapra on Sun Mar 2 17:22:49 2008.

In CMSSW_2_0_x, files written by the TFileService (the only reliable and
supported way for users to write histogram files from EDAnalyzers) will
be listed in the Job report.

They will be listed like this:

<AnalysisFile>
<FileName>/uscms/home/ewv/TFileService/CMSSW_2_0_0_pre2/src/PhysicsTools/UtilAlgos/test/histo.root</FileName>
<Source Value="TFileService" />
</AnalysisFile>

With this modification, we can think about adding a CRAB option to add
all files the user is writing to the output file list automatically
rather than relying on the user to spell their filename correctly twice.
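A minimal sketch of how the job report could be scanned for these entries, assuming a well-formed FJR XML with the structure shown above:

    from xml.dom.minidom import parse

    def analysis_files(fjr_path):
        # collect every <FileName> that sits inside an <AnalysisFile> block
        names = []
        for node in parse(fjr_path).getElementsByTagName('AnalysisFile'):
            for fn in node.getElementsByTagName('FileName'):
                names.append(fn.firstChild.data.strip())
        return names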

Leak on submission threads for process subscription

Original Savannah ticket 6840 reported by farinafa on Mon May 12 03:58:26 2008.

The CrabServerWorker registers each submitting thread as a new component. This fills the ms_process table with an entry for every new thread, and each thread is different from the previous one; as a result, for each (re)submission a new component is registered in the database, degrading performance...

Mattia

re-design of the submission component structure

Original Savannah ticket 6382 reported by farinafa on Sun Feb 10 11:18:34 2008.

Move from the current CRAB-like submission to a safer, more robust and smarter submission component (with focus on the resubmission logic), based directly on the BossLite API and supporting resubmission as well.

implement proxy manager component

Original Savannah ticket 6426 reported by mcinquil on Sun Feb 17 12:13:06 2008.

The current tests on the server show the need for a component that takes care of managing the user proxies cached on the server.
It should check their validity, remove expired ones, and trigger the related job-archive operations.
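A minimal sketch of the expiry sweep; voms-proxy-info is the standard tool for querying a proxy, but the cache layout and the idea of shelling out once per proxy are assumptions:

    import glob, subprocess

    def expired_proxies(cache_dir):
        expired = []
        for proxy in glob.glob(cache_dir + '/*'):
            # ask the proxy itself how many seconds of validity remain
            out = subprocess.Popen(
                ['voms-proxy-info', '--file', proxy, '--timeleft'],
                stdout=subprocess.PIPE).communicate()[0]
            if int(out.strip() or 0) <= 0:
                expired.append(proxy)  # candidate for removal + job archiving
        return expired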
