azavea / raster-vision-aws
A CloudFormation template for deploying Raster Vision Batch jobs to AWS.
License: Other
The CloudFormation template validates that the Prefix is at most 12 characters long, but its description mentions only the lowercase letters and numbers requirement.
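To make the full rule concrete, here is a minimal Python sketch of the check described above. The pattern is an assumption pieced together from this issue (lowercase letters and digits, at most 12 characters); the minimum length of 1 and the name `is_valid_prefix` are hypothetical and not taken from the template.

```python
import re

# Assumed rule from the issue text: lowercase letters and digits only,
# at most 12 characters. The minimum length of 1 is an assumption.
PREFIX_PATTERN = re.compile(r"[a-z0-9]{1,12}")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the whole string satisfies the assumed Prefix rule."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None
```

A parameter description that states both constraints (allowed characters and maximum length) would match what a check like this enforces.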
We should see if it's possible to have the CloudFormation template disassociate the Batch job queues before destroying the compute environment.
Terraform has attributes that define resource lifecycle, like create_before_destroy. Are there potential CloudFormation equivalents or prior art?
An error occurred (ClientException) when calling the DeleteComputeEnvironment operation: Cannot delete, found existing JobQueue relationship
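Outside of CloudFormation, one way to see which queues block the deletion is to inspect the Batch `describe_job_queues` response. The sketch below is only an illustration: the function name is hypothetical, and it assumes the response has the usual `jobQueues` / `computeEnvironmentOrder` shape returned by the Batch API.

```python
from typing import Any, Dict, List


def queues_using_compute_environment(
    describe_job_queues_response: Dict[str, Any],
    compute_environment_arn: str,
) -> List[str]:
    """List the job queues whose compute environment order references the ARN."""
    names = []
    for queue in describe_job_queues_response.get("jobQueues", []):
        order = queue.get("computeEnvironmentOrder", [])
        # A queue still holds a relationship if any entry points at the environment.
        if any(entry.get("computeEnvironment") == compute_environment_arn
               for entry in order):
            names.append(queue["jobQueueName"])
    return names
```

With boto3, the response would come from `boto3.client("batch").describe_job_queues()`. Each queue returned here would presumably need to be disabled and deleted (or pointed at a different compute environment) before `delete_compute_environment` can succeed.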
Hi,
After creating the AWS Batch resources, I am not able to find the "default" file containing the configuration, which needs to be updated with:
[AWS_BATCH]
job_queue=RasterVisionGpuJobQueue
cpu_job_queue=RasterVisionCpuJobQueue
job_definition=RasterVisionHostedGpuJobDefinition
cpu_job_definition=RasterVisionHostedCpuJobDefinition
attempts=1
I can't find any default configuration file in ~/.rastervision
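As a rough aid, the sketch below checks whether a profile's text contains the `[AWS_BATCH]` keys listed above. It uses only the standard-library `configparser`; the function name and the idea that all five keys are expected are assumptions drawn from this issue, not from Raster Vision itself. If the file is absent, it may simply need to be created by hand with this content.

```python
import configparser
from typing import List

# Keys listed in the issue above; treating all of them as expected is an assumption.
EXPECTED_KEYS = (
    "job_queue",
    "cpu_job_queue",
    "job_definition",
    "cpu_job_definition",
    "attempts",
)


def missing_aws_batch_keys(config_text: str) -> List[str]:
    """Return the expected [AWS_BATCH] keys that are absent from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("AWS_BATCH"):
        return list(EXPECTED_KEYS)
    return [key for key in EXPECTED_KEYS
            if not parser.has_option("AWS_BATCH", key)]
```

The text could be read from `~/.rastervision/default` with `pathlib.Path.home() / ".rastervision" / "default"` once that file exists.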
The current workflow is for each developer to run the template to create an entirely new stack. This seemed like a good idea, except that there is a limit on the number of queues and compute environments https://docs.aws.amazon.com/batch/latest/userguide/service_limits.html, and the Batch console loads very slowly once you have more than a few of these (it seems to make a fetch for each queue and state combination).
It seems like a better workflow is to create the full stack once per account, and then have each developer create only their own job definition, which references the global job queue and compute environments.
Hello,
I am trying to run rastervision with AWS Batch after finishing setup with raster-vision-aws. When I run "cat .rastervision/default", I see the following configuration:
[AWS_BATCH]
job_queue=RasterVisionGpuJobQueue
cpu_job_queue=RasterVisionCpuJobQueue
job_definition=RasterVisionHostedGpuJobDefinition
cpu_job_definition=RasterVisionHostedCpuJobDefinition
attempts=1
After that, I ran "rastervision run aws_batch -p some_script.py" in a docker container and I received the following error:
An error occurred (ClientException) when calling the SubmitJob operation: JobDefinition arn:aws:batch:us-east-2:193628433786:job-definition/RasterVisionHostedGpuJobDefinition not found.
Do you know why it's missing the job definition here? FYI, I created CloudFormation Stack using your template.yaml.
Thank you for any help.
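One possible cause, offered only as an assumption, is that the stack created its resources under a prefixed name (a queue named lewfishRasterVisionGpuJobQueue is mentioned elsewhere on this page), so the unprefixed name in the config does not exist. The hypothetical helper below compares a configured name against the names that actually exist and suggests likely matches.

```python
from typing import Iterable, List


def suggest_job_definitions(configured_name: str,
                            existing_names: Iterable[str]) -> List[str]:
    """Return the exact match if present, else names ending with the configured name."""
    existing = list(existing_names)
    if configured_name in existing:
        return [configured_name]
    # Assumption: a stack prefix, if any, is prepended to the base name.
    return sorted(name for name in existing if name.endswith(configured_name))
```

The existing names could be collected from the `jobDefinitionName` fields of a Batch `describe_job_definitions` response for the same region the job is submitted to.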
This issue was copied over from the old version of this repo which has been deleted.
Hector: Blocked by hashicorp/terraform-provider-aws#6454
Jean: Launch templates are supported in CloudFormation, so we could use them in https://github.com/azavea/raster-vision-cloudformation. Would this allow us to remove the AMI creation steps from the project setup?
Hector: I think we determined that the answer is no, because we still need to apply: https://github.com/azavea/raster-vision-aws/blob/master/packer/scripts/configure-gpu.sh
I think https://github.com/azavea/raster-vision-aws/blob/master/packer/template-gpu.json#L9 should inherit from the user profile
@jamesmcclain reported that:
"The packer build fails with the latest Deep Learning AMI (Version 21):
That seems to be a well-known issue (see here and here) in various contexts."
I tried this and was able to run the job. So this makes me think it's an intermittent problem.
Around half the time, after a job has been running in one of the RV GPU queues for a while, the job fails with Status Reason: Host EC2 Terminated. This happened to a job I ran in the lewfishRasterVisionGpuJobQueue. After looking in the Spot History Request in the AWS console, it seems to happen because there is "no more unused capacity available in this pool." It then proceeds to try running the job in each of the 4 availability zones that the Compute Environment is configured to use. These 4 AZs/subnets were chosen because they are the ones available for the p3.2xlarge instance type as seen in the Spot Price Graph. It's possible that there is no available capacity, but that seems strange because the max spot price we set was $1.83, which is above the spot price for all AZs for the past week. Is it possible to get booted off due to unavailable capacity without the spot price exceeding the max price we set? (Edit: it seems so. See below.) BTW, I think this problem started happening around the time that we switched over to the new CloudFormation setup, but I'm not sure.
In case it is useful, the entire spot request history was requested using
aws ec2 describe-spot-fleet-request-history --spot-fleet-request-id sfr-818dd790-ca2d-4f3f-a59d-b23073030499 --start-time 2019-02-27 > spot-request-history.json
and is in spot-request-history.json.txt
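To get a quick overview of a dump like this, the events can be tallied by type. The sketch below assumes the JSON has the `HistoryRecords` / `EventType` / `EventInformation.EventSubType` layout that `describe-spot-fleet-request-history` normally emits; the function name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def summarize_spot_history(history: Dict[str, Any]) -> Counter:
    """Count spot fleet history records by (EventType, EventSubType)."""
    counts: Counter = Counter()
    for record in history.get("HistoryRecords", []):
        event_type = record.get("EventType", "unknown")
        sub_type = record.get("EventInformation", {}).get("EventSubType", "unknown")
        counts[(event_type, sub_type)] += 1
    return counts
```

Loading the file with `json.load(open("spot-request-history.json"))` and passing the result in would show how often instances were terminated compared with other event kinds.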
Currently in the R&D account, only admins can create new RV Batch stacks due to missing permissions in the developer group. I tried to fix this before, but found it was non-trivial so had to pause as it wasn't high-priority.