
cromwellonazure's Introduction

Welcome to Cromwell on Azure


Cromwell is a workflow management system for scientific workflows, orchestrating the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, Cromwell is also used in the GATK Best Practices genome analysis pipeline. Cromwell supports running scripts at various scales, including your local machine, a local computing cluster, and on the cloud.

Cromwell on Azure configures all Azure resources needed to run workflows with Cromwell on the Microsoft Azure cloud, and uses the GA4GH TES backend for orchestrating the tasks that create a workflow. The installation sets up an Azure Kubernetes cluster to run the Cromwell, TES, and Trigger Service containers, and uses the Azure Batch PaaS service to execute each task in a workflow on its own VM, enabling scale-out to thousands of machines. Cromwell workflows can be written using the WDL scripting language. For examples of WDL scripts, see the 'Learn WDL' repository on GitHub.

Latest release

Documentation

All documentation has been moved to our wiki!

Getting Started?

Got Questions?

Want to Contribute?

Check out our contributing guidelines and Code of Conduct and submit a PR! We'd love to have you.

Related Projects

Genomics Data Analysis with Jupyter Notebooks on Azure

cromwellonazure's People

Contributors

arctechsameer, ashanhol, bemosk, bmurri, cbangur, dependabot[bot], ducatimonster916, giventocode, jacorbello, javierromancsa, jbagga, jlester-msft, joedrowan, jsaun, lynnlangit, mattmcl4475, olesya13, omixer, patmagee, saulobejo, tal66, tonybendis, vsmalladi, yuliadub


cromwellonazure's Issues

Specify master node instance type/sizing

Is it possible to configure/specify instance size when deploying? Seems like a B series would be perfect for the master node which is probably snoozing most of the time and not doing much heavy lifting.

If not, feature request to allow specifying instance size?

Delete intermediate files to save on storage costs

Update: This might already be supported in Cromwell. See method deleteFilesOrGotoFinalState in:
https://github.com/broadinstitute/cromwell/blob/199958375bb67df767e2363372147eafd3ad24ac/engine/src/main/scala/cromwell/engine/workflow/WorkflowActor.scala. Per comment in line 584, it seems like both the workflow option delete_intermediate_output_files and system.delete-workflow-files in Cromwell config need to be set to true.
Test this setup first.
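A minimal sketch of the two settings, based on the config key and workflow option named above (untested; the exact config layout is an assumption):

# In the Cromwell config (HOCON):
system.delete-workflow-files = true

# In the workflow options JSON submitted alongside the workflow:
{
  "delete_intermediate_output_files": true
}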

After workflow completes, delete all files from the workflow's directory in cromwell-executions container that are not part of the final output (as determined by calling Cromwell API after workflow completion).
This behaviour should be configurable, for example:

  1. Enable/disable (default disable)
  2. How long to keep the intermediate files around after workflow completion (days?) (default 0)
  3. Enable/disable for failed workflows (default disable)
  4. Maybe have different duration for the failed workflows.
    This most likely belongs to Trigger Service.
    Also see https://github.com/broadinstitute/cromwell/releases/tag/49, there is now a workflow option delete_intermediate_output_files.

Support storage account key rotation

Currently, if the storage account keys are rotated during workflow execution, the tasks already in progress will fail, most likely during file upload.
Avoid this by supplying two SAS tokens to the task (one for key1 and one for key2). During file download/upload, try the first and fall back to the second if the first fails.
The customer can then rotate key1, wait for tasks that have already started (and tasks that will start within the next hour, due to the internal key cache) to complete, and then rotate key2.
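A rough sketch of the proposed fallback (hypothetical helper; the TES task schema does not currently carry two SAS URLs):

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch: try the SAS URL minted from key1, fall back to the key2 URL.
// 'sasUrlFromKey1'/'sasUrlFromKey2' are illustrative names, not existing TES fields.
public static async Task<Stream> DownloadWithFallbackAsync(
    HttpClient client, string sasUrlFromKey1, string sasUrlFromKey2)
{
    try
    {
        return await client.GetStreamAsync(sasUrlFromKey1);
    }
    catch (HttpRequestException)
    {
        // e.g. 403 Forbidden after key1 was rotated
        return await client.GetStreamAsync(sasUrlFromKey2);
    }
}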

Calling WDL Engine Functions Causes TES to Fail on Initialization

When calling WDL engine functions within a task that write their output to the execution directory (e.g. write_lines, write_tsv), the TES task will fail to initialize and report a SYSTEM_ERROR. When a task contains an engine function like one of these, Cromwell writes the output of that function to a tmp file in the execution directory for the given task. When submitting the task to TES, Cromwell then informs the TES task of the input. This seems to be causing some sort of problem.

You can reliably reproduce this issue with the following WDL. Note the call to write_lines in the FlattenArray task. This will produce a file corresponding to the TES input named: f_a.FlattenArray.lines.0

task FlattenArray {
    Array[String] intervals
    File lines = write_lines(intervals)
    command <<<
        cat ${lines}
    >>>

    output {
        String so = read_string(stdout())
    }

    runtime {
        docker: "python:2.7"
        memory: "2 GB"
    }
}


workflow f_a {
    Array[String] intervals = ["1","2","3"]
    
    call FlattenArray {
        input: intervals = intervals
    }

    output {
        String o = FlattenArray.so
    }
}

The resulting truncated TES object looks like the following. Note how Cromwell turns the engine function's output into a TES input:

{
    "id": "2bf04d92_9e4ad57df7644b329f265b65ea22e9d9",
    "state": "SYSTEM_ERROR",
    "name": "f_a.FlattenArray",
    "description": "2bf04d92-a822-4a91-9be7-07d6b266be0e:BackendJobDescriptorKey_CommandCallNode_f_a.FlattenArray:-1:1",
    "inputs": [
        {
            "name": "f_a.FlattenArray.lines.0",
            "description": "f_a.f_a.FlattenArray.lines.0",
            "url": "/cromwell-executions/f_a/2bf04d92-a822-4a91-9be7-07d6b266be0e/call-FlattenArray/execution/write_lines_c0710d6b4f15dfa88f600b0e6b624077.tmp",
            "path": "/cromwell-executions/f_a/2bf04d92-a822-4a91-9be7-07d6b266be0e/call-FlattenArray/execution/write_lines_c0710d6b4f15dfa88f600b0e6b624077.tmp",
            "type": "FILE",
            "content": null
        },
        {
            "name": "commandScript",
            "description": "f_a.FlattenArray.commandScript",
            "url": "/cromwell-executions/f_a/2bf04d92-a822-4a91-9be7-07d6b266be0e/call-FlattenArray/execution/script",
            "path": "/cromwell-executions/f_a/2bf04d92-a822-4a91-9be7-07d6b266be0e/call-FlattenArray/execution/script",
            "type": "FILE",
            "content": null
        }
    ],
    "outputs": [
        ...
    ],
    ...
    "logs": [
        {
            "logs": null,
            "metadata": null,
            "start_time": null,
            "end_time": null,
            "outputs": null,
            "system_logs": [
                "Object reference not set to an instance of an object.",
                "   at TesApi.Web.BatchScheduler.IsCromwellCommandScript(TesInput inputFile) in D:\\a\\1\\s\\src\\TesApi.Web\\BatchScheduler.cs:line 123\n   at TesApi.Web.BatchScheduler.GetTesInputFileUrl(TesInput inputFile, String taskId, List`1 queryStringsToRemoveFromLocalFilePaths) in D:\\a\\1\\s\\src\\TesApi.Web\\BatchScheduler.cs:line 488\n   at TesApi.Web.BatchScheduler.<>c__DisplayClass25_0.<<ConvertTesTaskToBatchTaskAsync>b__10>d.MoveNext() in D:\\a\\1\\s\\src\\TesApi.Web\\BatchScheduler.cs:line 394\n--- End of stack trace from previous location where exception was thrown ---\n   at TesApi.Web.BatchScheduler.ConvertTesTaskToBatchTaskAsync(TesTask task) in D:\\a\\1\\s\\src\\TesApi.Web\\BatchScheduler.cs:line 394\n   at TesApi.Web.BatchScheduler.AddBatchJobAsync(TesTask tesTask) in D:\\a\\1\\s\\src\\TesApi.Web\\BatchScheduler.cs:line 141"
            ]
        }
    ]
}

The actual problem seems to be an NPE encountered when calling the following code, indicating that either the Name or the inputFile is null. My guess is that since the input file already exists in the execution directory and it is not the command script, some attribute of that input is being changed during processing.

        private static bool IsCromwellCommandScript(TesInput inputFile)
        {
            return inputFile.Name.Equals("commandScript");
        }
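A possible null-safe rewrite (an assumption about the fix, not necessarily what shipped):

        private static bool IsCromwellCommandScript(TesInput inputFile)
        {
            // Inputs generated by WDL engine functions (e.g. write_lines) can
            // arrive without a Name; treat them as non-command-script inputs
            // instead of throwing a NullReferenceException.
            return inputFile?.Name?.Equals("commandScript") == true;
        }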

Could not acquire change log lock

Cromwell log contains:
Failed to instantiate Cromwell System. Shutting down Cromwell.
liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by ...

The following needs to execute on every MySQL startup, if the table DATABASECHANGELOGLOCK exists:
UPDATE DATABASECHANGELOGLOCK SET LOCKED=0, LOCKGRANTED=null, LOCKEDBY=null where ID=1
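A sketch of how that could be wired into startup (shell; the database name and credentials variable are assumptions):

mysql -u root -p"$MYSQL_ROOT_PASSWORD" cromwell_db -e \
  "UPDATE DATABASECHANGELOGLOCK SET LOCKED=0, LOCKGRANTED=NULL, LOCKEDBY=NULL WHERE ID=1;"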

Add Metadata or tags to Batch Pools to Track Costs

It would be nice to be able to attribute the costs of a specific Batch pool to the task that initiated it. Additionally, it would be useful to add the cromwell-id in order to figure out the entire cost of a workflow (see the sketch after the list):

Suggested tags:

  • cromwell-id
  • tes-id
  • pass through tags or labels from cromwell?
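For illustration, the Microsoft.Azure.Batch .NET SDK lets pool metadata carry these IDs; a hedged sketch (not the current CoA code, and the tag names are just the suggestions above):

using System.Collections.Generic;
using Microsoft.Azure.Batch;

// Attach workflow/task identifiers to the pool so its cost can be attributed.
static void TagPool(CloudPool pool, string cromwellId, string tesId)
{
    pool.Metadata = new List<MetadataItem>
    {
        new MetadataItem("cromwell-id", cromwellId),
        new MetadataItem("tes-id", tesId),
    };
    pool.Commit(); // persist the change to the Batch service
}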

TES Returns a SYSTEM_ERROR when a Job is Submitted

I am working on the same system as @jacorbello .

Submission of a workflow to Cromwell produces the following error (shown for the built-in test workflow):

2020-05-28 18:21:16,518 INFO - MaterializeWorkflowDescriptorActor [UUID(33fc9c17)]: Parsing workflow as WDL draft-2
2020-05-28 18:21:17,081 INFO - MaterializeWorkflowDescriptorActor [UUID(33fc9c17)]: Call-to-Backend assignments: test.hello -> TES
2020-05-28 18:21:18,337 INFO - WorkflowExecutionActor-33fc9c17-d1a2-4d66-90a0-e35ec48d11ef [UUID(33fc9c17)]: Starting test.hello
2020-05-28 18:21:20,080 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: echo 'Hello World!'
2020-05-28 18:21:20,110 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: Calculated TES outputs (found 4):
Output(Some(rc),Some(test.hello.rc),Some(/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/rc),/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/rc,Some(FILE))
Output(Some(stdout),Some(test.hello.stdout),Some(/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/stdout),/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/stdout,Some(FILE))
Output(Some(stderr),Some(test.hello.stderr),Some(/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/stderr),/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/stderr,Some(FILE))
Output(Some(commandScript),Some(test.hello.commandScript),Some(/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/script),/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/script,Some(FILE))

2020-05-28 18:21:20,114 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: Calculated TES inputs (found 1):
Input(Some(commandScript),Some(test.hello.commandScript),Some(/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/script),/cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/script,Some(FILE),None)

2020-05-28 18:21:20,115 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: echo 'Hello World!'
2020-05-28 18:21:25,357 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: job id: 33fc9c17_4cb68ea28c954832ba62239d1bcf0681
2020-05-28 18:21:25,461 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: TES reported an error for Job 33fc9c17_4cb68ea28c954832ba62239d1bcf0681: 'SYSTEM_ERROR'
2020-05-28 18:21:25,463 INFO - TesAsyncBackendJobExecutionActor [UUID(33fc9c17)test.hello:NA:1]: Status change from - to FailedOrError

This error is reproduced within TES:

root@__________:/# curl http://tes/v1/tasks/a1b6d359_ea3c7013f4f94e348305e3bd081c12c5
{"id":"a1b6d359_ea3c7013f4f94e348305e3bd081c12c5","state":"SYSTEM_ERROR"}

The TES logs suggest that TES is attempting to access files that it cannot find:

fail: TesApi.Web.BatchScheduler[0]
The specified blob does not exist.
Microsoft.WindowsAzure.Storage.StorageException: The specified blob does not exist.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlob.DownloadRangeToStreamAsync(Stream target, Nullable`1 offset, Nullable`1 length, AccessCondition accessCondition, BlobRequestOptions options, OperationContext operationContext, IProgress`1 progressHandler, CancellationToken cancellationToken)
at Microsoft.WindowsAzure.Storage.Blob.CloudBlockBlob.DownloadTextAsync(Encoding encoding, AccessCondition accessCondition, BlobRequestOptions options, OperationContext operationContext, IProgress`1 progressHandler, CancellationToken cancellationToken)
at TesApi.Web.BatchScheduler.GetTesInputFileUrl(TesInput inputFile, String taskId, List`1 queryStringsToRemoveFromLocalFilePaths) in D:\a\1\s\src\TesApi.Web\BatchScheduler.cs:line 527
at TesApi.Web.BatchScheduler.<>c__DisplayClass25_0.<<ConvertTesTaskToBatchTaskAsync>b__10>d.MoveNext() in D:\a\1\s\src\TesApi.Web\BatchScheduler.cs:line 428
--- End of stack trace from previous location where exception was thrown ---
at TesApi.Web.BatchScheduler.ConvertTesTaskToBatchTaskAsync(TesTask task) in D:\a\1\s\src\TesApi.Web\BatchScheduler.cs:line 428
at TesApi.Web.BatchScheduler.AddBatchJobAsync(TesTask tesTask) in D:\a\1\s\src\TesApi.Web\BatchScheduler.cs:line 144
Request Information
RequestID:f11fe78b-501e-006a-0f27-35942f000000
RequestDate:Thu, 28 May 2020 19:40:28 GMT
StatusMessage:The specified blob does not exist.
ErrorCode:BlobNotFound
ErrorMessage:The specified blob does not exist.
RequestId:f11fe78b-501e-006a-0f27-35942f000000
Time:2020-05-28T19:40:28.6548537Z

The Cromwell execution folder for this workflow shows only the script file.

root@___________:/# ls cromwell-executions/test/33fc9c17-d1a2-4d66-90a0-e35ec48d11ef/call-hello/execution/
script

The Azure Batch Queue shows no Jobs or Pools created.

Any insight that you can offer is appreciated!

Allow usage of Service Principal Credentials for TES

When running in a containerized environment like Kubernetes, none of the current authentication methods for Azure work with the official Docker image. It would be great to be able to directly supply the credentials for a service principal as a private key to the TES API.

I have had to create a new Docker image with the Azure CLI installed in order to pass in the service principal credentials. This works; however, it would be great if this were directly supported by TES itself.
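A hedged sketch of what direct service principal support might look like, using the Azure.Identity library (the environment variable names and integration point are assumptions; this is not how TES currently authenticates):

using System;
using Azure.Identity;

// Construct a credential directly from service principal settings supplied via
// the environment, instead of requiring `az login` inside the container.
var credential = new ClientSecretCredential(
    tenantId: Environment.GetEnvironmentVariable("AZURE_TENANT_ID"),
    clientId: Environment.GetEnvironmentVariable("AZURE_CLIENT_ID"),
    clientSecret: Environment.GetEnvironmentVariable("AZURE_CLIENT_SECRET"));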

Handle unusable batch nodes due to full disk

When a Batch node's VM disk is full, it is held in an 'unusable' state and stalls the workflow. Since this is essentially a failure, it should error out and stop the Batch node.
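A hedged sketch of the detection half, using the Batch .NET SDK (error handling omitted; not the current CoA logic):

using System.Linq;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

// True if any node in the pool has been marked unusable (e.g. full disk),
// so the scheduler can fail the task instead of waiting forever.
static bool PoolHasUnusableNode(BatchClient batchClient, string poolId)
{
    return batchClient.PoolOperations
        .ListComputeNodes(poolId)
        .Any(node => node.State == ComputeNodeState.Unusable);
}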

Exit when encountering invalid commandline arguments

Hello,

A minor improvement idea: I think it would be better to exit when deploy-cromwell-on-azure encounters an unknown argument. It happily takes anything (try --wossname) as an argument. In my case I wondered why

deploy-cromwell-on-azure-linux --subscriptionid $sid --regionname $region --resourcegroup $rg

created a new resource group, instead of using $rg, only to then notice that the argument name should in fact be --resourcegroupname.

Andreas

Document best practices for Optimizing your WDL

This will provide guidance on best practices for optimizing Cromwell on Azure for your workflow, relating techniques such as scattering, gathering, and multi-threading to the types of tasks common to genomics workflows.

Multiple Users configuration

We've received multiple requests for guidance around what the architecture and setup should be for a subscription that wants to have multiple users with a single SubscriptionID. I will write up guidance to be included in the CoA readme for addressing this situation.

Add caching to CosmosDB API calls

From CosmosDB perf tips

Cache document URIs whenever possible for the best read performance. You need to define logic to cache the resource ID when you create a resource. Lookups based on resource IDs are faster than name-based lookups, so caching these values improves performance.
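A minimal sketch of what such a cache could look like (illustrative names; not the current TES data layer):

using System;
using System.Collections.Concurrent;

// Cache the document URI per task ID so reads can use fast ID-based lookups
// instead of slower name-based lookups.
static readonly ConcurrentDictionary<string, Uri> documentUriCache =
    new ConcurrentDictionary<string, Uri>();

static Uri GetDocumentUri(string taskId, Func<string, Uri> buildUri)
    => documentUriCache.GetOrAdd(taskId, buildUri);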

Documentation - Linux Compilation Instructions

It would be useful to have a section in the readme, or a separate building file, with instructions.

For Linux, good to know (from #87):

wget https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb

sudo add-apt-repository universe
sudo apt-get update
sudo apt-get install apt-transport-https
sudo apt-get update
sudo apt-get install dotnet-sdk-3.1
sudo apt-get install dotnet-sdk-2.2

dotnet build
dotnet publish -r linux-x64

Job preparation task sometimes hangs

Job preparation task (downloads the input files before running the main task) sometimes does not exit after all the files are downloaded. This causes the Batch job & node to run indefinitely.
Solution: Limit the execution duration of the preparation task to 4 hours.
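A sketch of that limit with the Batch .NET SDK (the 4-hour value is from the proposed fix; this is not necessarily the shipped code):

using System;
using Microsoft.Azure.Batch;

// Cap the preparation task's wall-clock time so a hung download cannot keep
// the Batch job and node alive indefinitely.
static void LimitPrepTaskDuration(JobPreparationTask prepTask)
{
    prepTask.Constraints = new TaskConstraints(
        maxWallClockTime: TimeSpan.FromHours(4),
        retentionTime: null,
        maxTaskRetryCount: null);
}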

Batch pools with no jobs

When a batch job is rapidly created/deleted, the corresponding auto pool might survive and continue consuming resources indefinitely. Discover such "orphan" pools and delete them in a background task.
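A background-task sketch with the Batch .NET SDK (hedged; real logic needs care around pools whose job is still being created):

using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure.Batch;

// Delete pools that no Batch job references.
static void DeleteOrphanPools(BatchClient batchClient)
{
    var poolIdsInUse = new HashSet<string>(
        batchClient.JobOperations.ListJobs()
            .Select(job => job.ExecutionInformation?.PoolId)
            .Where(poolId => poolId != null));

    foreach (var pool in batchClient.PoolOperations.ListPools().ToList())
    {
        if (!poolIdsInUse.Contains(pool.Id))
        {
            batchClient.PoolOperations.DeletePool(pool.Id);
        }
    }
}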

Insufficient Access Error even as Owner on subscription

Hello,

In reference to the deploy-cromwell-on-azure- executable, I'm getting an error that doesn't make sense as I have the proper permissions.

I have two subscriptions on which I am owner and I get the same error when trying to deploy. One of the two subs I am an owner on tenant, too. I've tried using the binary mentioned in the quickstart on both Windows and Linux.

Here's the error:

Running...
Insufficient access to deploy. You must be: 1) Owner of the subscription, or 2) Contributor and User Access Administrator of the subscription, or 3) Owner of the resource group

WDL glob non-functional

This is on the same system as recently discussed. Other WDL workflows have worked successfully. The glob functionality does not successfully gather files and fails the Cromwell workflow.

Sample WDL:

workflow test {

	call output_printer {
		input:
			DOCKER="ubuntu:16.04"

	}

	call gather_outputs_array {
		input:
			DOCKER="ubuntu:16.04",
			OUTPUTS_ARRAY=output_printer.OUTPUTS_ARRAY
	}

}

task output_printer {

	String DOCKER

	command {
		for i in $(seq 1 10); do echo "$i" > "output_$i.txt"; done
	}

	output {
		Array[File] OUTPUTS_ARRAY = glob("output_*.txt")
	}

	runtime {
		docker: "${DOCKER}"
	}
}

task gather_outputs_array {

	Array[File] OUTPUTS_ARRAY
	String DOCKER

	command {

		cat ${sep = " " OUTPUTS_ARRAY} > output.txt

	}

	output {
		File OUTPUT_FILE = "output.txt"
	}

	runtime {
		docker: "${DOCKER}"
	}
}

Cromwell Logs:

root@___________:/# cat cromwell-workflow-logs/workflow.829441c6-06df-43a1-9657-e4f23fb941e9.log
2020-06-01 16:19:27,509 INFO  - MaterializeWorkflowDescriptorActor [UUID(829441c6)]: Parsing workflow as WDL draft-2
2020-06-01 16:19:27,515 INFO  - MaterializeWorkflowDescriptorActor [UUID(829441c6)]: Call-to-Backend assignments: test.gather_outputs_array -> TES, test.output_printer -> TES
2020-06-01 16:19:30,573 INFO  - WorkflowExecutionActor-829441c6-06df-43a1-9657-e4f23fb941e9 [UUID(829441c6)]: Starting test.output_printer
2020-06-01 16:19:33,324 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: `for i in $(seq 1 10); do echo "$i" > "output_$i.txt"; done`
2020-06-01 16:19:33,326 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: Calculated TES outputs (found 6): 
Output(Some(rc),Some(test.output_printer.rc),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/rc),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/rc,Some(FILE))
Output(Some(stdout),Some(test.output_printer.stdout),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout,Some(FILE))
Output(Some(stderr),Some(test.output_printer.stderr),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr,Some(FILE))
Output(Some(commandScript),Some(test.output_printer.commandScript),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script,Some(FILE))
Output(Some(globDir.0),Some(test.output_printer.globDir.0),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161,Some(DIRECTORY))
Output(Some(globList.0),Some(test.output_printer.globList.0),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161.list),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161.list,Some(FILE))

2020-06-01 16:19:33,326 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: Calculated TES inputs (found 1): 
Input(Some(commandScript),Some(test.output_printer.commandScript),Some(/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script),/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script,Some(FILE),None)

2020-06-01 16:19:33,326 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: `for i in $(seq 1 10); do echo "$i" > "output_$i.txt"; done`
2020-06-01 16:19:34,279 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: job id: 829441c6_639c5c978fe34c999e5c3bcc09f1fd50
2020-06-01 16:19:34,285 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: Status change from - to Running
2020-06-01 16:24:54,104 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: Job 829441c6_639c5c978fe34c999e5c3bcc09f1fd50 is complete
2020-06-01 16:24:54,105 INFO  - TesAsyncBackendJobExecutionActor [UUID(829441c6)test.output_printer:NA:1]: Status change from Running to Complete

Cromwell Metadata:

root@___________:/# curl -X GET "http://cromwell:8000/api/workflows/v1/829441c6-06df-43a1-9657-e4f23fb941e9/metadata?expandSubWorkflows=true" -H "accept: application/json" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
								 Dload  Upload   Total   Spent    Left  Speed
100  7699  100  7699    0     0   626k      0 --:--:-- --:--:-- --:--:--  626k
{
  "workflowName": "test",
  "workflowProcessingEvents": [
	{
	  "cromwellId": "cromid-a7f8cb4",
	  "description": "PickedUp",
	  "timestamp": "2020-06-01T16:19:27.347Z",
	  "cromwellVersion": "50-5df1e07"
	},
	{
	  "cromwellId": "cromid-a7f8cb4",
	  "description": "Finished",
	  "timestamp": "2020-06-01T16:24:55.557Z",
	  "cromwellVersion": "50-5df1e07"
	}
  ],
  "metadataSource": "Unarchived",
  "actualWorkflowLanguageVersion": "draft-2",
  "submittedFiles": {
	"workflow": "\r\n\r\nworkflow test {\r\n\r\ncall output_printer {\r\ninput:\r\nDOCKER=\"ubuntu:16.04\"\r\n\r\n}\r\n\r\ncall gather_outputs_array {\r\ninput:\r\nDOCKER=\"ubuntu:16.04\",\r\nOUTPUTS_ARRAY=output_printer.OUTPUTS_ARRAY\r\n}\r\n\r\n}\r\n\r\ntask output_printer {\r\n\r\nString DOCKER\r\n\r\ncommand {\r\nfor i in $(seq 1 10); do echo \"$i\" > \"output_$i.txt\"; done\r\n}\r\n\r\noutput {\r\nArray[File] OUTPUTS_ARRAY = glob(\"output_*.txt\")\r\n}\r\n\r\nruntime {\r\ndocker: \"${DOCKER}\"\r\n}\r\n}\r\n\r\ntask gather_outputs_array {\r\n\r\nArray[File] OUTPUTS_ARRAY\r\nString DOCKER\r\n\r\ncommand {\r\n\r\ncat ${sep = \" \" OUTPUTS_ARRAY} > output.txt\r\n\r\n}\r\n\r\noutput {\r\nFile OUTPUT_FILE = \"output.txt\"\r\n}\r\n\r\nruntime {\r\ndocker: \"${DOCKER}\"\r\n}\r\n\r\n\r\n}",
	"root": "",
	"options": "{\n\n}",
	"inputs": "{}",
	"workflowUrl": "",
	"labels": "{}"
  },
  "calls": {
	"test.output_printer": [
	  {
		"retryableFailure": false,
		"executionStatus": "Failed",
		"stdout": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout",
		"backendStatus": "Complete",
		"compressedDockerSize": 44248681,
		"commandLine": "for i in $(seq 1 10); do echo \"$i\" > \"output_$i.txt\"; done",
		"shardIndex": -1,
		"runtimeAttributes": {
		  "preemptible": "true",
		  "failOnStderr": "false",
		  "continueOnReturnCode": "0",
		  "docker": "ubuntu:16.04",
		  "maxRetries": "0"
		},
		"callCaching": {
		  "allowResultReuse": false,
		  "effectiveCallCachingMode": "CallCachingOff"
		},
		"inputs": {
		  "DOCKER": "ubuntu:16.04"
		},
		"failures": [
		  {
			"causedBy": [
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_1.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_10.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_2.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_3.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_4.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_5.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_6.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_7.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_8.txt"
			  },
			  {
				"causedBy": [],
				"message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_9.txt"
			  }
			],
			"message": ""
		  }
		],
		"jobId": "829441c6_639c5c978fe34c999e5c3bcc09f1fd50",
		"backend": "TES",
		"end": "2020-06-01T16:24:55.305Z",
		"stderr": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr",
		"callRoot": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer",
		"attempt": 1,
		"executionEvents": [
		  {
			"startTime": "2020-06-01T16:19:31.444Z",
			"description": "PreparingJob",
			"endTime": "2020-06-01T16:19:31.771Z"
		  },
		  {
			"startTime": "2020-06-01T16:19:30.575Z",
			"description": "Pending",
			"endTime": "2020-06-01T16:19:30.575Z"
		  },
		  {
			"startTime": "2020-06-01T16:24:55.105Z",
			"description": "UpdatingJobStore",
			"endTime": "2020-06-01T16:24:55.305Z"
		  },
		  {
			"startTime": "2020-06-01T16:19:30.575Z",
			"description": "RequestingExecutionToken",
			"endTime": "2020-06-01T16:19:31.444Z"
		  },
		  {
			"startTime": "2020-06-01T16:19:31.444Z",
			"description": "WaitingForValueStore",
			"endTime": "2020-06-01T16:19:31.444Z"
		  },
		  {
			"startTime": "2020-06-01T16:19:31.771Z",
			"description": "RunningJob",
			"endTime": "2020-06-01T16:24:55.105Z"
		  }
		],
		"start": "2020-06-01T16:19:30.575Z"
	  }
	]
  },
  "outputs": {},
  "workflowRoot": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9",
  "actualWorkflowLanguage": "WDL",
  "id": "829441c6-06df-43a1-9657-e4f23fb941e9",
  "inputs": {},
  "labels": {
	"cromwell-workflow-id": "cromwell-829441c6-06df-43a1-9657-e4f23fb941e9"
  },
  "submission": "2020-06-01T16:19:17.719Z",
  "status": "Failed",
  "failures": [
	{
	  "causedBy": [
		{
		  "causedBy": [
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_1.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_10.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_2.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_3.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_4.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_5.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_6.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_7.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_8.txt"
			},
			{
			  "causedBy": [],
			  "message": "Could not process output, file not found: /cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161/output_9.txt"
			}
		  ],
		  "message": ""
		}
	  ],
	  "message": "Workflow failed"
	}
  ],
  "end": "2020-06-01T16:24:55.557Z",
  "start": "2020-06-01T16:19:27.347Z"
}

TES Job Logs (only one job found for this workflow):

{
  "id": "829441c6_639c5c978fe34c999e5c3bcc09f1fd50",
  "state": "COMPLETE",
  "name": "test.output_printer",
  "description": "829441c6-06df-43a1-9657-e4f23fb941e9:BackendJobDescriptorKey_CommandCallNode_test.output_printer:-1:1",
  "inputs": [
	{
	  "name": "commandScript",
	  "description": "test.output_printer.commandScript",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script",
	  "type": "FILE",
	  "content": null
	}
  ],
  "outputs": [
	{
	  "name": "rc",
	  "description": "test.output_printer.rc",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/rc",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/rc",
	  "type": "FILE"
	},
	{
	  "name": "stdout",
	  "description": "test.output_printer.stdout",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout",
	  "type": "FILE"
	},
	{
	  "name": "stderr",
	  "description": "test.output_printer.stderr",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr",
	  "type": "FILE"
	},
	{
	  "name": "commandScript",
	  "description": "test.output_printer.commandScript",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script",
	  "type": "FILE"
	},
	{
	  "name": "globDir.0",
	  "description": "test.output_printer.globDir.0",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161",
	  "type": "DIRECTORY"
	},
	{
	  "name": "globList.0",
	  "description": "test.output_printer.globList.0",
	  "url": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161.list",
	  "path": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161.list",
	  "type": "FILE"
	}
  ],
  "resources": {
	"cpu_cores": null,
	"preemptible": true,
	"ram_gb": null,
	"disk_gb": null,
	"zones": null
  },
  "executors": [
	{
	  "image": "ubuntu@sha256:a4fc0c40360ff2224db3a483e5d80e9164fe3fdce2a8439d2686270643974632",
	  "command": [
		"/bin/bash",
		"/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/script"
	  ],
	  "workdir": null,
	  "stdin": null,
	  "stdout": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stdout",
	  "stderr": "/cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/stderr",
	  "env": null
	}
  ],
  "volumes": null,
  "tags": null,
  "logs": [
	{
	  "logs": null,
	  "metadata": null,
	  "start_time": null,
	  "end_time": null,
	  "outputs": null,
	  "system_logs": [
		"{\"MoreThanOneActiveJobFound\":false,\"ActiveJobWithMissingAutoPool\":false,\"AttemptNumber\":1,\"NodeAllocationFailed\":false,\"NodeDiskFull\":false,\"JobState\":0,\"JobStartTime\":\"2020-06-01T16:19:37.1255378Z\",\"JobEndTime\":null,\"JobSchedulingError\":null,\"JobPreparationTaskState\":1,\"JobPreparationTaskExitCode\":0,\"JobPreparationTaskExecutionResult\":0,\"JobPreparationTaskStartTime\":\"2020-06-01T16:23:58.447322Z\",\"JobPreparationTaskEndTime\":\"2020-06-01T16:24:13.022002Z\",\"JobPreparationTaskFailureInformation\":null,\"JobPreparationTaskContainerState\":\"created\",\"JobPreparationTaskContainerError\":null,\"TaskState\":3,\"TaskExitCode\":0,\"TaskExecutionResult\":0,\"TaskStartTime\":\"2020-06-01T16:24:13.963847Z\",\"TaskEndTime\":\"2020-06-01T16:24:19.497711Z\",\"TaskFailureInformation\":null,\"TaskContainerState\":\"created\",\"TaskContainerError\":null}"
	  ]
	}
  ],
  "creation_time": "06/01/2020 16:19:33"
}

Cromwell Execution Folder

root@___________:/# ls -l cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/
ls: cannot access 'cromwell-executions/test/829441c6-06df-43a1-9657-e4f23fb941e9/call-output_printer/execution/glob-f085e32532fc20b7c6300ae574bdb161': No such file or directory
total 0
-rwxrwxrwx 1 root root  476 Jun  1 16:19 download_files_script
d????????? ? ?    ?       ?            ? glob-f085e32532fc20b7c6300ae574bdb161
-rwxrwxrwx 1 root root  131 Jun  1 16:24 glob-f085e32532fc20b7c6300ae574bdb161.list
-rwxrwxrwx 1 root root    2 Jun  1 16:24 rc
-rwxrwxrwx 1 root root 3567 Jun  1 16:24 script
-rwxrwxrwx 1 root root    0 Jun  1 16:24 stderr
-rwxrwxrwx 1 root root    0 Jun  1 16:24 stdout

Add Cromwell script's result code to TesTask object

The Cromwell script's result code is captured in a file named "rc" and uploaded to the task's execution directory in the default storage account, assuming the execution succeeded from the Batch perspective. The state of such tasks will be "COMPLETE". When there is a problem with the customer's code, this file will contain a non-zero value, indicating to the Cromwell engine that the task failed.
Copy this value to the TES task database for easier debugging. This will make it easier to find tasks that failed, without resorting to opening each "rc" file in storage.
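A hedged sketch using the storage SDK the TES backend already uses (Microsoft.WindowsAzure.Storage, per the stack traces above); where the value lands on the task record is an open design question:

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

// Read the "rc" blob from a completed task's execution directory and return
// the Cromwell result code, or null if the file is absent or unparsable.
static async Task<int?> ReadCromwellResultCodeAsync(
    CloudBlobContainer container, string executionDirectoryPath)
{
    var rcBlob = container.GetBlockBlobReference($"{executionDirectoryPath}/rc");
    if (!await rcBlob.ExistsAsync())
    {
        return null;
    }
    var text = await rcBlob.DownloadTextAsync();
    return int.TryParse(text.Trim(), out var rc) ? rc : (int?)null;
}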

CWL support

This is more a question than an issue: is there any plan to add CWL support to Cromwell on Azure?

CWL issues with Cromwell on Azure

Currently, when a CWL workflow is run with Cromwell on Azure, Cromwell localizes the file paths in a different manner than for WDLs, which results in TES being unable to find the input files in some scenarios.

Insufficient access to deploy

Dear team,

I gave CromwellOnAzure a try, but both deployment attempts (see below) failed immediately with "Insufficient access to deploy".

  1. Deployment without RG
    deploy-cromwell-on-azure-linux --subscriptionid XXXXXX --regionname southeastasia

I'm account admin of the used subscription.

  2. Deployment with newly created RG

    deploy-cromwell-on-azure-linux --subscriptionid XXXXXX --regionname southeastasia --resourcegroup cromwelltest

In both cases I get

Copyright (c) Microsoft Corporation.
Licensed under the MIT License.
Privacy & Cookies: https://go.microsoft.com/fwlink/?LinkId=521839

Cromwell on Azure
https://github.com/microsoft/CromwellOnAzure

Running...
Insufficient access to deploy. You must be: 1) Owner of the subscription, or 2) Contributor and User Access Administrator of the subscription, or 3) Owner of the resource group

What's the best way to debug the problem?

Many thanks,
Andreas

  • COA version: 1.0.8
  • az CLI version: 2.0.81

Support private GitHub and Azure DevOps links in Trigger Service

To keep workflows under version control and avoid copying to storage account.

You can get a link to a private GitHub file with a temporary token (click on raw view), e.g. https://raw.githubusercontent.com/olesya13/MyRepo/master/MyWorkflow.wdl?token=MyToken , but the trigger service doesn't support it. Also, the token's life expectancy is really short; you can't access the file in the browser after ~10 min.

A couple of discussions (might be outdated) about token creation and auth:
https://stackoverflow.com/questions/15408053/c-sharp-example-of-downloading-github-private-repo-programmatically
https://gist.github.com/Integralist/9482061
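For illustration, a hedged sketch of fetching a private raw file with a personal access token instead of the short-lived browser token (GitHub accepts a PAT in the Authorization header; how the PAT reaches the trigger service is left open):

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Download a workflow file from a private GitHub repo using a PAT.
static async Task<string> DownloadPrivateWorkflowAsync(string rawUrl, string personalAccessToken)
{
    using var client = new HttpClient();
    client.DefaultRequestHeaders.Authorization =
        new AuthenticationHeaderValue("token", personalAccessToken);
    return await client.GetStringAsync(rawUrl);
}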

Installation hangs during Docker tasks

I'm getting hung on this step now: Waiting for docker containers to download and start...

Found 2 issues:

  • docker commands are being run as vmadmin instead of using sudo.
  • fstab commands exit the startup.sh process prematurely

Originally posted by @jacorbello in #87 (comment)

Running checklist for documentation tasks

  • Add instructions to use a specific Cromwell docker image
  • Add a note for users to check quotas for their batch account - number of accounts per subscription may be limited #62 (comment)
  • Add links to docs that are coming soon to the FAQs #60 #67
  • Add link to Storage Explorer to create SAS tokens/URLs and specify both "Read" and "Write" access for copying files from public storage container (specific directory) to user's account
  • Add instructions on how to configure CoA deployment e.g.: VmSize #101 (comment)
    #125 (comment)
  • Add compilation instructions for linux #91
  • Add Cromwell db lock known issue #57
  • Update instructions and all docs based on v2.0
  • Add links to bugs for known issues https://github.com/microsoft/CromwellOnAzure/blob/master/docs/troubleshooting-guide.md#known-issues-and-mitigation
  • Add links to docs that are coming soon to the FAQs #66
  • Add instructions on how to configure CoA deployment for development scenarios e.g.: using your local docker images
  • Add link to documentation about WDL file input localization to FAQs #208 (comment)
  • Add instructions to use womtool.jar to validate WDL before running workflows in FAQs #108 (comment)

Invalid Container config when using ACR

When a job is submitted to the Batch API, it appears to be passed an invalid ContainerConfiguration by the TES backend when the image is in an ACR repository. This causes the job to stall indefinitely, and the node to enter an unusable state.

The actual issue seems to be that the registryServer parameter is being set to the ACR name and not the web-accessible ACR URL. You can see where this is getting set here. The Batch API tries to contact mycontainerregistry instead of mycontainerregistry.azurecr.io, resulting in an error:

{
  "containerConfiguration": {
    "type": "dockerCompatible",
    "containerImageNames": [
      "mycontainerregistry.azurecr.io/tes:1.0.0"
    ],
    "containerRegistries": [
      {
        "registryServer": "mycontainerregistry",
        "username": "mycontainerregistry"
      }
    ]
  }
}
{
  "errors": [
    {
      "code": "ContainerInvalidRegistry",
      "message": "One or more container registries specified are invalid",
      "errorDetails": [
        {
          "name": "ContainerRegistry",
          "value": "mycontainerregistry"
        },
        {
          "name": "Message",
          "value": "Get https://mycontainerregistry/v2/: dial tcp: lookup superclustercontainerregistry on 168.63.129.16:53: no such host"
        }
      ]
    }
  ]
}

This is a pretty serious error, since all of our Docker images are private and we need to store them in ACR.
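A hedged sketch of the obvious fix: derive the login server from the registry name before building the Batch container configuration (the .azurecr.io suffix assumption applies to public Azure only):

// Map an ACR resource name to its web-accessible login server.
static string ToRegistryServer(string acrName)
    => acrName.Contains(".") ? acrName : $"{acrName}.azurecr.io";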

How to write a WDL from scratch

Will use this issue to track documentation changes for documenting how to write a WDL from scratch, including references for locations for Docker container repos, WDL syntax, and scatter/gather with Azure as the computational engine.

deploy-cromwell-on-azure-linux cannot get an access token from azure cli

I am trying to deploy cromwell on our azure environment.

The precompiled version fails with "No access token found. Please install the Azure CLI and login with 'az login'", even though the Azure CLI is on the PATH and logged in.
Is it possible to add a token manually or get some more debugging information?

Large intermediate files will not transfer

Behavior: If you run any analysis that requires uploading a file larger than ~110 GB back to blob storage, the Azure Batch API for doing so throws a MiscFileUploarError exception, which is uncaught by CoA. Cromwell keeps attempting to complete the task, but it consistently fails until the retry condition is met.

Cause: We are using standard storage, which has limits on upload bandwidth and blocks. The Azure Blob service does not stop the upload but sends back a throttling message, and Azure Batch is not equipped to catch that message gracefully. Since the file does not upload, Cromwell fails and re-attempts the task, creating a loop until workflow failure. Technical support has advised that the code handling uploads should handle the 503 throttling response gracefully, but the current Batch API we are using does not.

Scope: This will affect anyone running any workflow, with any WDL, that requires an intermediate file upload larger than ~110 GB. (An aligned BAM for 30x coverage of a human genome without compression is ~175 GB; 2.5B*3.2Mbases.) Most notably, anyone running the GATK germline best practices with > 35x coverage or so will hit this error.

Mitigation: Tony has been investigating switching to blobxfer for uploads instead of using the Azure Batch API. Results suggest that blobxfer can handle the transfer without issue.
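For reference, a blobxfer invocation might look like the following (blobxfer 1.x flags; the account, SAS token, and paths are placeholders):

blobxfer upload --storage-account mystorageaccount --sas "$SAS_TOKEN" \
    --remote-path cromwell-executions/some/execution/dir --local-path ./large-intermediate.bam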

Installation hangs - win-x64

Cromwell Version

1.3.0

Command run

.\deploy-cromwell-on-azure-win.exe --SubscriptionId xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx --RegionName eastus2 --MainIdentifierPrefix xxxxxx

Account permissions for user running command:

Subscription Owner

Number of attempts to deploy

8

Issue

The task Running installation script on the VM... hangs for hours.

Creating Resource Group: xxxxxx-03889652... Completed in 1s
Creating Application Insights: xxxxxx-67268108... Completed in 3s
Creating Batch Account: xxxxxx05762345... Completed in 17s
Creating Storage Account: xxxxxx36a8382430e85... Completed in 20s
Creating Cosmos DB: xxxxxx-a0f83393... Completed in 553s
Creating Linux VM: xxxxxx-62e87685dde53... Completed in 89s
Creating Network Security Group: xxxxxx8d269345... Completed in 5s
Associating NIC with Network Security Group: xxxxxx8d269345... Completed in 11s
Waiting for VM to accept SSH connections... Completed in 1s
Copying installation files to the VM... Completed in 16s
Running installation script on the VM...

The VM is successfully provisioned, files are copied over to the /cromwell directory, permissions are applied for the vmadmin user account to take ownership of the directory.

When logging into the VM, the cromwellazure service is successfully created, though the service is dead:

● cromwellazure.service - CromwellOnAzure - Start blobfuse and Docker Compose
   Loaded: loaded (/lib/systemd/system/cromwellazure.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Attempting to start the service manually:

vmadmin@vmxxxxxxxxxx:~$ systemctl start cromwellazure
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start 'cromwellazure.service'.
Authenticating as: Ubuntu (vmadmin)
Password:
==== AUTHENTICATION COMPLETE ===
vmadmin@vmxxxxxxxxxx:~$ systemctl status cromwellazure
● cromwellazure.service - CromwellOnAzure - Start blobfuse and Docker Compose
   Loaded: loaded (/lib/systemd/system/cromwellazure.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2020-05-27 20:53:09 UTC; 8s ago
  Process: 2322 ExecStart=/bin/bash /cromwellazure/startup.sh (code=exited, status=127)
 Main PID: 2322 (code=exited, status=127)

May 27 20:53:08 vmxxxxxxxxxx systemd[1]: Started CromwellOnAzure - Start blobfuse and Docker Compose.
May 27 20:53:09 vmxxxxxxxxxx sudo[2373]:     root : TTY=unknown ; PWD=/cromwellazure ; USER=root ; COMMAND=/bin/sed -i /mount.blobfuse/d /etc/fstab
May 27 20:53:09 vm0c00152552 sudo[2373]: pam_unix(sudo:session): session opened for user root by (uid=0)
May 27 20:53:09 vmxxxxxxxxxx sudo[2373]: pam_unix(sudo:session): session closed for user root
May 27 20:53:09 vmxxxxxxxxxx sudo[2379]:     root : TTY=unknown ; PWD=/cromwellazure ; USER=root ; COMMAND=/bin/mount -av -t fuse
May 27 20:53:09 vmxxxxxxxxxx sudo[2379]: pam_unix(sudo:session): session opened for user root by (uid=0)
May 27 20:53:09 vmxxxxxxxxxx sudo[2379]: pam_unix(sudo:session): session closed for user root
May 27 20:53:09 vmxxxxxxxxxx systemd[1]: cromwellazure.service: Main process exited, code=exited, status=127/n/a
May 27 20:53:09 vmxxxxxxxxxx systemd[1]: cromwellazure.service: Unit entered failed state.
May 27 20:53:09 vmxxxxxxxxxx systemd[1]: cromwellazure.service: Failed with result 'exit-code'.

Custom storage accounts break CromwellOnAzure

Well this was fun but I think I have the repro steps.

  1. Install cromwell, and use FastqToUbamSingleSample.chr21.json. Works.

  2. Reboot VM, nothing changed. Run FastqToUbamSingleSample, works.

  3. Edit containers-to-mount, and add a storage account that the Managed identity does not have access to yet. Reboot. Run FastqToUbamSingleSample, works.

  4. Grant the identity Contributor access to the storage account. Wait five minutes. Reboot. FastqToUbamSingleSample no longer works - it just stays inside the "new" folder and is not picked up.

I can see the folder is mounted.

2020-05-29T18:38:37+00:00 Successfully mounted /mnt/msgenpublicdata/inputs
2020-05-29T18:38:39+00:00 Mounting /mnt/kesselrunstorage/outputs using the account key
2020-05-29T18:38:39+00:00 Successfully mounted /mnt/kesselrunstorage/outputs
2020-05-29T18:38:39+00:00 Mounting /mnt/kesselrunstorage/inputs using the account key
2020-05-29T18:38:39+00:00 Successfully mounted /mnt/kesselrunstorage/inputs
2020-05-29T18:38:39+00:00 Mounting /mnt/kesselrunstorage/salmon using the account key
2020-05-29T18:38:40+00:00 Successfully mounted /mnt/kesselrunstorage/salmon

All that was added was a single line at the end of the file containers-to-mount

/kesselrunstorage/*

Next Page token should be URLSafe

At the moment the next_page_token cannot be copy-pasted directly into the browser, and a client needs to know to properly URL-escape the string.

An example of a token looks like: +RID:~j389AJotrAYKAAAAAAAAAA==#RT:1#TRC:10#ISV:2#IEO:65551#QCF:1#FPC:AQoAAAAAAAAA6AEAAAAAAAA=

I would suggest URL-safe Base64 encoding the token before passing it back in the list view.
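A small sketch of that encoding (Microsoft.AspNetCore.WebUtilities provides the helpers; wiring it into the TES list view is left as an assumption):

using System.Text;
using Microsoft.AspNetCore.WebUtilities;

// Wrap the raw continuation token in URL-safe Base64 on the way out,
// and decode it on the way back in.
static string EncodePageToken(string rawToken)
    => WebEncoders.Base64UrlEncode(Encoding.UTF8.GetBytes(rawToken));

static string DecodePageToken(string encodedToken)
    => Encoding.UTF8.GetString(WebEncoders.Base64UrlDecode(encodedToken));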

Test workflow failed

Hi, I am using the deploy-cromwell-on-azure-linux version 1.3.0 executable and I tried to run the Quickstart guide, but it failed. I have the Owner, Contributor, and User Access Administrator roles on my subscription.

I did log in with the az login command and I can see my subscription. I ran the following command, but it failed:

deploy-cromwell-on-azure-linux --subscriptionId $my_subscriptionid --RegionName southcentralus --MainIdentifierPrefix coafelipe

the output was:

Creating Resource Group: coafelipe... Completed in 0s
Creating Batch Account: coafelipe... Completed in 17s
Creating Application Insights: coafelipe... Completed in 2s
Creating Cosmos DB: coafelipe... Completed in 219s
Creating Storage Account: coafelipe... Completed in 20s
Creating Linux VM: coafelipe... Completed in 87s
Creating Network Security Group: coafelipe... Completed in 5s
Associating NIC with Network Security Group: coafelipe... Completed in 12s
Waiting for VM to accept SSH connections... Completed in 2s
Copying installation files to the VM... Completed in 40s
Running installation script on the VM... Completed in 73s
Assigning Billing Reader role for VM to Subscription scope... Completed in 2s
Assigning Contributor role for VM to App Insights resource scope... Completed in 1s
Assigning Contributor role for VM to Cosmos DB resource scope... Completed in 1s
Assigning Contributor role for VM to Batch Account resource scope... Completed in 1s
Assigning Contributor role for VM to Storage Account resource scope... Completed in 1s
Assigning Storage Blob Data Reader role for VM to Storage Account resource scope... Completed in 1s
Waiting (5) minutes for Azure to fully propagate role assignments... Completed in 300s
Restarting VM... Completed in 32s
Waiting for VM to accept SSH connections... Completed in 3s
Waiting for docker containers to download and start... Completed in 94s
Waiting for Cromwell to perform one-time database preparation... Completed in 212s
Running a test workflow... Completed in 85s

Test workflow failed.

Please try deployment again, and create an issue if this continues to fail: https://github.com/microsoft/CromwellOnAzure/issues

Delete the resource group? Type 'yes' and press enter, or, press any key to exit: Completed in 16.4 minutes.

I deployed again and got the same issue. This time I didn't delete the resource group, so I uploaded the JSON file to trigger the workflow, but it failed:

2020-03-31 23:59:54,610 INFO - MaterializeWorkflowDescriptorActor [UUID(595c0c7a)]: Parsing workflow as WDL draft-2
2020-03-31 23:59:54,727 INFO - MaterializeWorkflowDescriptorActor [UUID(595c0c7a)]: Call-to-Backend assignments: FastqToUbamSingleSample.FastqToUbam -> TES
2020-04-01 00:00:00,567 INFO - WorkflowExecutionActor-595c0c7a-6650-4abc-ac62-4f06fbe2048f [UUID(595c0c7a)]: Starting FastqToUbamSingleSample.FastqToUbam
2020-04-01 00:00:03,790 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: java -jar /usr/gitc/picard.jar FastqToSam \ FASTQ=/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read1.fq.gz \ FASTQ2=/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read2.fq.gz \ OUTPUT=chr21.unmapped.bam \ READ_GROUP_NAME=GrA \ SAMPLE_NAME=chr21 \ LIBRARY_NAME=Pond001 \ PLATFORM_UNIT=GrA.chr21 \ PLATFORM=illumina
2020-04-01 00:00:03,806 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: Calculated TES outputs (found 5):
Output(Some(rc),Some(FastqToUbamSingleSample.FastqToUbam.rc),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/rc),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/rc,Some(FILE))
Output(Some(stdout),Some(FastqToUbamSingleSample.FastqToUbam.stdout),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/stdout),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/stdout,Some(FILE))
Output(Some(stderr),Some(FastqToUbamSingleSample.FastqToUbam.stderr),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/stderr),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/stderr,Some(FILE))
Output(Some(commandScript),Some(FastqToUbamSingleSample.FastqToUbam.commandScript),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/script),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/script,Some(FILE))
Output(Some(FastqToUbamSingleSample.FastqToUbam.output.0),Some(FastqToUbamSingleSample.FastqToUbam.output.0),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/chr21.unmapped.bam),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/chr21.unmapped.bam,Some(FILE))

2020-04-01 00:00:03,807 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: Calculated TES inputs (found 3):
Input(Some(FastqToUbamSingleSample.FastqToUbam.fastqs.0),Some(FastqToUbamSingleSample.FastqToUbamSingleSample.FastqToUbam.fastqs.0),Some(/msgenpublicdata/inputs/chr21/chr21.read1.fq.gz),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read1.fq.gz,Some(FILE),None)
Input(Some(FastqToUbamSingleSample.FastqToUbam.fastqs.1),Some(FastqToUbamSingleSample.FastqToUbamSingleSample.FastqToUbam.fastqs.1),Some(/msgenpublicdata/inputs/chr21/chr21.read2.fq.gz),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read2.fq.gz,Some(FILE),None)
Input(Some(commandScript),Some(FastqToUbamSingleSample.FastqToUbam.commandScript),Some(/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/script),/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/execution/script,Some(FILE),None)

2020-04-01 00:00:03,808 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: java -jar /usr/gitc/picard.jar FastqToSam \ FASTQ=/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read1.fq.gz \ FASTQ2=/cromwell-executions/FastqToUbamSingleSample/595c0c7a-6650-4abc-ac62-4f06fbe2048f/call-FastqToUbam/inputs/msgenpublicdata/inputs/chr21/chr21.read2.fq.gz \ OUTPUT=chr21.unmapped.bam \ READ_GROUP_NAME=GrA \ SAMPLE_NAME=chr21 \ LIBRARY_NAME=Pond001 \ PLATFORM_UNIT=GrA.chr21 \ PLATFORM=illumina
2020-04-01 00:00:05,865 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: job id: 595c0c7a_bc23092482d04e04b65fbbe6b992694d
2020-04-01 00:00:05,878 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: TES reported an error for Job 595c0c7a_bc23092482d04e04b65fbbe6b992694d: 'SYSTEM_ERROR'
2020-04-01 00:00:05,878 INFO - TesAsyncBackendJobExecutionActor [UUID(595c0c7a)FastqToUbamSingleSample.FastqToUbam:NA:1]: Status change from - to FailedOrError

Deploy Cromwell issue

When I try to deploy Cromwell via the quickstart, I get the same error on both Windows and Linux, and I can't find detailed information about the error. Would you please help me check the issue?

.\deploy-cromwell-on-azure-win.exe --subscriptionid 5526299f-5847-4171-85ce-9c32fd3608ca --regionname southeastasia
Copyright (c) Microsoft Corporation.
Licensed under the MIT License.
Privacy & Cookies: https://go.microsoft.com/fwlink/?LinkId=521839

Cromwell on Azure
https://github.com/microsoft/CromwellOnAzure

Running...
Unhandled exception. CromwellOnAzureDeployer.Deployer+ValidationException: Exception of type 'CromwellOnAzureDeployer.Deployer+ValidationException' was thrown.
at CromwellOnAzureDeployer.Deployer.ValidateSubscriptionAndResourceGroupAsync(String subscriptionId, String resourceGroupName)
at CromwellOnAzureDeployer.Deployer.DeployAsync()
at CromwellOnAzureDeployer.Program.InitializeAndDeployAsync(String[] args)
at CromwellOnAzureDeployer.Program.Main(String[] args)
at CromwellOnAzureDeployer.Program.<Main>(String[] args)
