
crab3-glideinwms-meta's People

Contributors

perilousapricot

crab3-glideinwms-meta's Issues

Fix last-mile of filecache-less job submission

By default, CRAB3 writes the user's sandbox (containing their shared libraries) to a remote REST service, which the WNs wget on startup. I've added an option to forgo shipping the sandbox to that cache, but the additional WN-specific bits are still missing: they need to detect whether the sandbox came along with the job and, if it did, skip the download step. A rough sketch of that check is below.
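
Roughly something like this on the WN side (a sketch only; the sandbox file name and the cache URL argument are placeholders, not the actual CRAB3 values):

    import os
    import subprocess

    def fetch_sandbox(sandbox_name="sandbox.tar.gz", cache_url=None):
        """Use a locally shipped sandbox if present; otherwise fall back to wget."""
        if os.path.exists(sandbox_name):
            # The sandbox was transferred along with the job, so skip the download.
            return sandbox_name
        if cache_url is None:
            raise RuntimeError("No local sandbox and no filecache URL provided")
        # Default behaviour: pull the sandbox from the REST cache as before.
        subprocess.check_call(["wget", "-O", sandbox_name, cache_url])
        return sandbox_name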

Weird spurious error with Schedd.submit()

Hey @bbockelm,

I'm working on adding some functional tests for the interactions with condor, which involves making dummy jobs, then verifying that the status/kill/resubmit functionality works. When I do that, I get the following occasionally:

======================================================================
ERROR: testJustRootJob (__main__.TestStatus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "CRABClient/test/python/CRABClient_t/Commands_t/gwmsStatus_t.py", line 42, in testJustRootJob
    jobInjector.makeDBSObject()
  File "/home/vagrant/sync/CRABServer/src/python/CRABInterface/DagmanTestTools/TestJobInjector.py", line 72, in makeDBSObject
    retval = schedd.submit(dbsClassad)
RuntimeError: Failed to commmit and disconnect from queue.

----------------------------------------------------------------------
Ran 1 test in 0.161s

FAILED (errors=1)
[localhost] ~/sync $ condor_q


-- Submitter: localhost.localdomain : <10.0.2.15:46607> : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[localhost] ~/sync $ ./testing-run CRABClient/test/python/CRABClient_t/Commands_t/gwmsStatus_t.py 
Using installation at /home/vagrant/sync
got initial kwargs {'requestarea': '/tmp/tmpHKZPnN', 'config': <WMCore.Configuration.Configuration object at 0x1edd7d0>}
DEBUG:CRABLogger.DataWorkflow:Got kwargs {'requestarea': '/tmp/tmpHKZPnN', 'config': <WMCore.Configuration.Configuration object at 0x1edd7d0>} 
DEBUG:CRABLogger.DataWorkflow:Setting request area to /tmp/tmpHKZPnN
terminate called after throwing an instance of 'boost::python::error_already_set'
./testing-run: line 29: 26535 Aborted                 python2.6 "$@"
[localhost] ~/sync $ condor_q


-- Submitter: localhost.localdomain : <10.0.2.15:46607> : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

(The boost exception is raised several more times after that.) The test injector code where this is raised is:

    # assumes `import classad, htcondor` at module level; dbsString holds the DBS classad text
    def makeDBSObject(self):
        dbsClassad = classad.parseOld(dbsString)  # parse the old-style classad text
        schedd = htcondor.Schedd()                # connect to the local schedd
        retval = schedd.submit(dbsClassad)        # this call occasionally raises the RuntimeError above
        return dbsClassad

(where dbsString is a string containing the DBS classad).

Is there anything apparent I should be doing to keep this from happening?
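
One stopgap I can think of, sketched below, is to catch the RuntimeError and retry the submit a few times; that only papers over the symptom rather than fixing whatever is wrong in the bindings:

    import time
    import htcondor

    def submit_with_retry(ad, attempts=3, delay=1.0):
        """Retry Schedd.submit() a few times, since the failure appears to be intermittent."""
        schedd = htcondor.Schedd()
        for attempt in range(attempts):
            try:
                return schedd.submit(ad)
            except RuntimeError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)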

Checking on an old (reaped) DAG shouldn't raise an InvalidParameter

Trying to run crab -status on a two-month-old task gives:

Traceback (most recent call last):
  File "./crab", line 169, in <module>
    client()
  File "./crab", line 156, in __call__
    self.cmd()
  File "/home/vagrant/sync/CRABClient/src/python/CRABClient/Commands/status.py", line 53, in __call__
    dictresult, status, monitorUrl = self.doStatus({ 'workflow' : self.cachedinfo['RequestName']})
  File "/home/vagrant/sync/CRABClient/src/python/CRABClient/Commands/status.py", line 32, in doStatus
    dictresult = dag.status(data['workflow'], '', userproxy=self.proxyfilename)
  File "/home/vagrant/sync/CRABServer/src/python/CRABInterface/DagmanDataWorkflow.py", line 542, in status
    results = self.getRootTasks(workflow, schedd)
  File "/home/vagrant/sync/CRABServer/src/python/CRABInterface/DagmanDataWorkflow.py", line 438, in getRootTasks
    raise InvalidParameter("An invalid workflow name was requested: %s" % workflow)
InvalidParameter
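
One possible shape for the fix (a sketch only: getRootTasks and the workflow argument come from the traceback above, everything else here is an assumption): when the schedd query for the workflow comes back empty, report the task as gone from the queue instead of raising.

    def getRootTasks(self, workflow, schedd):
        # The constraint attribute (CRAB_ReqName) is an assumption; adjust to
        # whatever the root DAG ad actually carries.
        constraint = 'CRAB_ReqName =?= "%s"' % workflow
        results = schedd.query(constraint, ['JobStatus', 'ExitCode'])
        if not results:
            # The DAG has already been reaped from the queue; an old task is not
            # an invalid parameter, so return "nothing in the queue" and let
            # status() map that to a completed/expired state.
            return []
        return results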

Sort out weirdness with condor python bindings

So far, there are a few issues with the condor python bindings that we need to check out and possibly push upstream:

  • import htcondor dies if the CONDOR_CONFIG environment variable isn't set. It calls _exit(), so the failure can't be trapped from Python (see the guard sketched after this list)
  • How should we manage different endiannesses and versions of python/condor? If we pull in and override the local python install with one bootstrapped from CMS, how do we maximise the chance that CMS's python will be compatible with the site's htcondor libraries? Could we maybe put a few versions of the classad module in CVMFS?
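
A minimal guard for the first point, assuming nothing beyond the failure mode described above (the import aborts the whole process when CONDOR_CONFIG is missing):

    import os

    # Refuse to import htcondor unless CONDOR_CONFIG is set; otherwise the
    # import calls _exit() and takes the whole process down with it.
    if 'CONDOR_CONFIG' not in os.environ:
        raise RuntimeError("CONDOR_CONFIG is not set; refusing to import htcondor")
    import htcondor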

Try to disambiguate condor's held state from a task's killed state

Right before a job runs, condor will temporarily move it to the 'held' state while it spools input files. This is a problem because we use the held state for tasks that were killed by the user.

We should set the HoldReason to something unique like "Killed by CRAB3 user" and then add a conditional in the status query so that jobs only count as killed if they carry that hold reason. Along with that we probably need a CRAB3 state called "InTransition" or similar for jobs that are held for condor's own reasons rather than CRAB3's. A sketch of the status-side classification is below.
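
A sketch of how the status query could classify held jobs, assuming the kill path stamps the HoldReason exactly as proposed (the marker string and the DAGManJobId constraint are assumptions):

    import htcondor

    KILL_REASON = "Killed by CRAB3 user"  # assumed marker set by the kill path

    def classify_held_jobs(schedd, dag_id):
        """Split held jobs into user-killed vs condor-internal (e.g. spooling) holds."""
        constraint = 'DAGManJobId =?= %d && JobStatus =?= 5' % dag_id  # 5 == Held
        killed, in_transition = [], []
        for ad in schedd.query(constraint, ['ClusterId', 'HoldReason']):
            if str(ad.get('HoldReason', '')).startswith(KILL_REASON):
                killed.append(ad['ClusterId'])
            else:
                in_transition.append(ad['ClusterId'])
        return killed, in_transition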

Need a way to propagate condor_dagman errors upstream

Recently I hit this:

09/08/13 07:48:07 ******************************************************
09/08/13 07:48:07 ** condor_dagman (CONDOR_DAGMAN) STARTING UP
09/08/13 07:48:07 ** /usr/bin/condor_dagman
09/08/13 07:48:07 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
09/08/13 07:48:07 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
09/08/13 07:48:07 ** $CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $
09/08/13 07:48:07 ** $CondorPlatform: x86_64_RedHat5 $
09/08/13 07:48:07 ** PID = 20149
09/08/13 07:48:07 ** Log last touched 9/8 07:48:07
09/08/13 07:48:07 ******************************************************
09/08/13 07:48:07 Using config source: /etc/condor/condor_config
09/08/13 07:48:07 Using local config sources: 
09/08/13 07:48:07    /etc/condor/condor_config.local
09/08/13 07:48:07 DaemonCore: command socket at <10.0.2.15:48826>
09/08/13 07:48:07 DaemonCore: private command socket at <10.0.2.15:48826>
09/08/13 07:48:07 DAGMAN_USE_STRICT setting: 1
09/08/13 07:48:07 DAGMAN_VERBOSITY setting: 3
09/08/13 07:48:07 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
09/08/13 07:48:07 DAGMAN_DEBUG_CACHE_ENABLE setting: False
09/08/13 07:48:07 DAGMAN_SUBMIT_DELAY setting: 0
09/08/13 07:48:07 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
09/08/13 07:48:07 DAGMAN_STARTUP_CYCLE_DETECT setting: False
09/08/13 07:48:07 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 50
09/08/13 07:48:07 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
09/08/13 07:48:07 DAGMAN_DEFAULT_PRIORITY setting: 0
09/08/13 07:48:07 DAGMAN_ALWAYS_USE_NODE_LOG setting: True
09/08/13 07:48:07 DAGMAN_SUPPRESS_NOTIFICATION setting: True
09/08/13 07:48:07 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
09/08/13 07:48:07 DAGMAN_RETRY_SUBMIT_FIRST setting: True
09/08/13 07:48:07 DAGMAN_RETRY_NODE_FIRST setting: False
09/08/13 07:48:07 DAGMAN_MAX_JOBS_IDLE setting: 0
09/08/13 07:48:07 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
09/08/13 07:48:07 DAGMAN_MAX_PRE_SCRIPTS setting: 0
09/08/13 07:48:07 DAGMAN_MAX_POST_SCRIPTS setting: 0
09/08/13 07:48:07 DAGMAN_ALLOW_LOG_ERROR setting: False
09/08/13 07:48:07 DAGMAN_MUNGE_NODE_NAMES setting: True
09/08/13 07:48:07 DAGMAN_PROHIBIT_MULTI_JOBS setting: False
09/08/13 07:48:07 DAGMAN_SUBMIT_DEPTH_FIRST setting: False
09/08/13 07:48:07 DAGMAN_ALWAYS_RUN_POST setting: True
09/08/13 07:48:07 DAGMAN_ABORT_DUPLICATES setting: True
09/08/13 07:48:07 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True
09/08/13 07:48:07 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
09/08/13 07:48:07 DAGMAN_AUTO_RESCUE setting: True
09/08/13 07:48:07 DAGMAN_MAX_RESCUE_NUM setting: 100
09/08/13 07:48:07 DAGMAN_WRITE_PARTIAL_RESCUE setting: True
09/08/13 07:48:07 DAGMAN_DEFAULT_NODE_LOG setting: null
09/08/13 07:48:07 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True
09/08/13 07:48:07 DAGMAN_MAX_JOB_HOLDS setting: 100
09/08/13 07:48:07 DAGMAN_HOLD_CLAIM_TIME setting: 20
09/08/13 07:48:07 ALL_DEBUG setting: 
09/08/13 07:48:07 DAGMAN_DEBUG setting: 
09/08/13 07:48:07 argv[0] == "condor_dagman"
09/08/13 07:48:07 argv[1] == "-Lockfile"
09/08/13 07:48:07 argv[2] == "/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.lock"
09/08/13 07:48:07 argv[3] == "-AutoRescue"
09/08/13 07:48:07 argv[4] == "1"
09/08/13 07:48:07 argv[5] == "-DoRescueFrom"
09/08/13 07:48:07 argv[6] == "0"
09/08/13 07:48:07 argv[7] == "-Dag"
09/08/13 07:48:07 argv[8] == "/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag"
09/08/13 07:48:07 argv[9] == "-Dagman"
09/08/13 07:48:07 argv[10] == "/usr/bin/condor_dagman"
09/08/13 07:48:07 argv[11] == "-CsdVersion"
09/08/13 07:48:07 argv[12] == "$CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $"
09/08/13 07:48:07 Default node log file is: </var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log>
09/08/13 07:48:07 DAG Lockfile will be written to /var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.lock
09/08/13 07:48:07 DAG Input file is /var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag
09/08/13 07:48:07 Ignoring value of DAGMAN_LOG_ON_NFS_IS_ERROR.
09/08/13 07:48:07 Parsing 1 dagfiles
09/08/13 07:48:07 Parsing /var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag ...
09/08/13 07:48:07 Dag contains 3 total jobs
09/08/13 07:48:07 Sleeping for 12 seconds to ensure ProcessId uniqueness
09/08/13 07:48:19 Bootstrapping...
09/08/13 07:48:19 Number of pre-completed nodes: 0
09/08/13 07:48:19 Of 3 nodes total:
09/08/13 07:48:19  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/08/13 07:48:19   ===     ===      ===     ===     ===        ===      ===
09/08/13 07:48:19     0       0        0       0       1          2        0
09/08/13 07:48:19 0 job proc(s) currently held
09/08/13 07:48:19 Registering condor_event_timer...
09/08/13 07:48:20 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:48:20 MultiLogFiles: truncating log file /var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log
09/08/13 07:48:20 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:48:20 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:20 From submit: Submitting job(s)
09/08/13 07:48:20 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:48:20 failed while reading from pipe.
09/08/13 07:48:20 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:48:20 ERROR: submit attempt failed
09/08/13 07:48:20 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:20 Job submit try 1/6 failed, will try again in >= 1 second.
09/08/13 07:48:20 Of 3 nodes total:
09/08/13 07:48:20  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/08/13 07:48:20   ===     ===      ===     ===     ===        ===      ===
09/08/13 07:48:20     0       0        0       0       1          2        0
09/08/13 07:48:20 0 job proc(s) currently held
09/08/13 07:48:25 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:48:25 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:48:25 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:25 From submit: Submitting job(s)
09/08/13 07:48:25 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:48:25 failed while reading from pipe.
09/08/13 07:48:25 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:48:25 ERROR: submit attempt failed
09/08/13 07:48:25 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:25 Job submit try 2/6 failed, will try again in >= 2 seconds.
09/08/13 07:48:30 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:48:30 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:48:30 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:30 From submit: Submitting job(s)
09/08/13 07:48:30 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:48:30 failed while reading from pipe.
09/08/13 07:48:30 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:48:30 ERROR: submit attempt failed
09/08/13 07:48:30 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:30 Job submit try 3/6 failed, will try again in >= 4 seconds.
09/08/13 07:48:35 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:48:35 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:48:35 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:35 From submit: Submitting job(s)
09/08/13 07:48:35 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:48:35 failed while reading from pipe.
09/08/13 07:48:35 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:48:35 ERROR: submit attempt failed
09/08/13 07:48:35 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:35 Job submit try 4/6 failed, will try again in >= 8 seconds.
09/08/13 07:48:46 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:48:46 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:48:46 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:46 From submit: Submitting job(s)
09/08/13 07:48:46 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:48:46 failed while reading from pipe.
09/08/13 07:48:46 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:48:46 ERROR: submit attempt failed
09/08/13 07:48:46 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:48:46 Job submit try 5/6 failed, will try again in >= 16 seconds.
09/08/13 07:49:03 Unable to get log file from submit file DBSDiscovery.submit (node DBSDiscovery); using default (/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log)
09/08/13 07:49:03 Submitting Condor Node DBSDiscovery job(s)...
09/08/13 07:49:03 submitting: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:49:03 From submit: Submitting job(s)
09/08/13 07:49:03 From submit: ERROR: Failed to parse command file (line 2).
09/08/13 07:49:03 failed while reading from pipe.
09/08/13 07:49:03 Read so far: Submitting job(s)ERROR: Failed to parse command file (line 2).
09/08/13 07:49:03 ERROR: submit attempt failed
09/08/13 07:49:03 submit command was: condor_submit -a dag_node_name' '=' 'DBSDiscovery -a +DAGManJobId' '=' '2872 -a DAGManJobId' '=' '2872 -a submit_event_notes' '=' 'DAG' 'Node:' 'DBSDiscovery -a log' '=' '/var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.nodes.log -a log_xml' '=' 'False -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 -a notification' '=' 'never DBSDiscovery.submit
09/08/13 07:49:03 Job submit failed after 6 tries.
09/08/13 07:49:03 Shortcutting node DBSDiscovery retries because of submit failure(s)
09/08/13 07:49:03 Of 3 nodes total:
09/08/13 07:49:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/08/13 07:49:03   ===     ===      ===     ===     ===        ===      ===
09/08/13 07:49:03     0       0        0       0       0          2        1
09/08/13 07:49:03 0 job proc(s) currently held
09/08/13 07:49:03 Aborting DAG...
09/08/13 07:49:03 Writing Rescue DAG to /var/lib/condor/spool/2872/0/cluster2872.proc0.subproc0/master_dag.rescue001...
09/08/13 07:49:03 Note: 0 total job deferrals because of -MaxJobs limit (0)
09/08/13 07:49:03 Note: 0 total job deferrals because of -MaxIdle limit (0)
09/08/13 07:49:03 Note: 0 total job deferrals because of node category throttles
09/08/13 07:49:03 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
09/08/13 07:49:03 Note: 0 total POST script deferrals because of -MaxPost limit (0)
09/08/13 07:49:03 Of 3 nodes total:
09/08/13 07:49:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/08/13 07:49:03   ===     ===      ===     ===     ===        ===      ===
09/08/13 07:49:03     0       0        0       0       0          2        1
09/08/13 07:49:03 0 job proc(s) currently held
09/08/13 07:49:03 **** condor_dagman (condor_DAGMAN) pid 20149 EXITING WITH STATUS 1

But the following code didn't see an error (it got exit code 0):

# There used to be -Suppress_notification here. Why?
condor_dagman -f -l . -Lockfile $PWD/$1.lock -AutoRescue 1 -DoRescueFrom 0 -Dag $PWD/$1 -Dagman `which condor_dagman` -CsdVersion "$CONDOR_VERSION"
EXIT_STATUS=$?

# We do this after the job because dagman will cowardly refuse to overwrite any pre-existing file, even if it's empty
touch $1.rescue.001
echo "dag_bootstrap_startup exited with code $EXIT_STATUS"
exit $EXIT_STATUS

@bbockelm, is there a way to query the result of a dag? I assume that there's some condor_dagman magic that I can use to get the return values of the different nodes in the DAG, right?
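
If it helps, one thing worth checking (a sketch, assuming the DAGMan job publishes the DAG_NodesTotal / DAG_NodesDone / DAG_NodesFailed / DAG_Status attributes in its own job ad, as recent condor_dagman versions appear to do):

    import htcondor

    def dag_progress(schedd, dag_cluster_id):
        """Read DAGMan's self-reported node counts from its job classad."""
        attrs = ['DAG_NodesTotal', 'DAG_NodesDone', 'DAG_NodesFailed', 'DAG_Status', 'JobStatus']
        ads = schedd.query('ClusterId =?= %d' % dag_cluster_id, attrs)
        if not ads:
            return None  # the DAG has left the queue; fall back to history/logs
        ad = ads[0]
        return dict((attr, ad.get(attr)) for attr in attrs)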

Problem with crab kill

When trying to kill a task:

./crab3 kill -t crab_bbockelm_crab3_5/ -i 12

I get

results are Traceback (most recent call last):
  File "/home/vagrant/sync/CRABServer/src/python/CRABInterface/DagmanDataWorkflow.py", line 324, in kill
    schedd.edit(finished_jobConst, "DAGManJobId", "-1")
RuntimeError: Unable to edit jobs matching constraint

With this as finished_jobConst:

trying to match jobconst DAGManJobId =?= 1783 && ExitCode =?= 0

and this is the output from condor_q:

1770.0   vagrant         8/13 12:15   0+00:07:37 C  0   732.4 gWMS-CMSRunAnalysi
1771.0   vagrant         8/13 12:15   0+00:07:36 C  0   732.4 gWMS-CMSRunAnalysi
1772.0   vagrant         8/13 12:15   0+00:07:15 C  0   732.4 gWMS-CMSRunAnalysi
1773.0   vagrant         8/13 12:15   0+00:07:10 C  0   732.4 gWMS-CMSRunAnalysi
1774.0   vagrant         8/13 12:15   0+00:06:29 C  0   170.9 gWMS-CMSRunAnalysi
1775.0   vagrant         8/13 12:15   0+00:06:14 C  0   170.9 gWMS-CMSRunAnalysi
1776.0   vagrant         8/13 12:15   0+00:06:03 C  0   219.7 gWMS-CMSRunAnalysi
1777.0   vagrant         8/13 12:15   0+00:06:03 C  0   170.9 gWMS-CMSRunAnalysi
1778.0   vagrant         8/13 12:15   0+00:06:16 C  0   195.3 gWMS-CMSRunAnalysi
1779.0   vagrant         8/13 12:15   0+00:06:06 C  0   170.9 gWMS-CMSRunAnalysi
1780.0   vagrant         8/13 12:42   0+00:30:10 H  0   122.1 dag_bootstrap_star
1783.0   vagrant         8/13 12:43   0+00:29:22 H  0   0.3  condor_dagman     

28 jobs; 26 completed, 0 removed, 0 idle, 0 running, 2 held, 0 suspended

It looks like the constraints don't end up matching anything, so condor complains. I don't want to simply trap RuntimeError. Is there a better way this can be detected?
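
One possibility, sketched here on the assumption that an extra round-trip to the schedd is acceptable: query for matches first and only call edit() when the constraint actually selects something, so an empty match is distinguishable from a real edit failure.

    def edit_if_matching(schedd, constraint, attr, value):
        """Only call Schedd.edit() when the constraint matches at least one job."""
        matches = schedd.query(constraint, ['ClusterId', 'ProcId'])
        if not matches:
            # Nothing to edit (e.g. all jobs already completed); not an error.
            return 0
        # Note: there is a small race window between the query and the edit.
        schedd.edit(constraint, attr, value)
        return len(matches)

    # e.g. edit_if_matching(schedd, finished_jobConst, "DAGManJobId", "-1")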

Add a test to verify the python-generated classads match the RemoteCondor ones

Right now there are two code paths for submitting to condor: the gsissh version and the direct python module version. Changes in the input classads have to be manually synced between the two. Add some unit tests to verify that the classads generated by both paths match. @bbockelm - do the classad/htcondor modules have some sort of method where I could take the RemoteCondor classad strings, turn them into classad objects, and compare them with the ones generated in submitDirect?
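
A rough sketch of what such a test could do, assuming classad.parseOld() accepts the RemoteCondor string and that ClassAd objects expose their dict-like interface (the helper name and the ignore list are made up for illustration):

    import classad

    def ads_match(remote_condor_string, direct_ad, ignore=('QDate', 'EnteredCurrentStatus')):
        """Parse the gsissh-path classad text and compare it attribute by attribute
        against the ad built by submitDirect, skipping timestamps and similar noise."""
        remote_ad = classad.parseOld(remote_condor_string)
        keys = set(remote_ad.keys()) | set(direct_ad.keys())
        mismatches = {}
        for key in keys:
            if key in ignore:
                continue
            left, right = remote_ad.get(key), direct_ad.get(key)
            if str(left) != str(right):  # compare unevaluated expressions as text
                mismatches[key] = (left, right)
        return mismatches  # an empty dict means the two paths agree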
