raster-vision-aws's People

Contributors

drewbo, hectcastro, jamesmcclain, jeancochrane, jisantuc, lewfish, lossyrob

raster-vision-aws's Issues

Disassociate job queues before destroying compute environment

We should see if it's possible to have the CloudFormation template disassociate the Batch job queues before destroying the compute environment.

Terraform has attributes that define resource lifecycle, like create_before_destroy. Are there potential CloudFormation equivalents or prior art?

An error occurred (ClientException) when calling the DeleteComputeEnvironment operation: Cannot delete, found existing JobQueue relationship
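
One way to see what is blocking the delete is to list the job queues that still reference the compute environment. The following is a minimal sketch, assuming its input is the "jobQueues" list returned by the Batch DescribeJobQueues call (for example via boto3's describe_job_queues). The queue and environment names and the account ID in the sample data are hypothetical, and the sketch does not answer the CloudFormation question itself.

```python
def queues_referencing(compute_env, job_queues):
    """Return the names of job queues that still point at compute_env.

    compute_env may be a bare name or a full ARN; job_queues is the
    "jobQueues" list from a DescribeJobQueues response.
    """
    blocking = []
    for queue in job_queues:
        for entry in queue.get("computeEnvironmentOrder", []):
            ce = entry["computeEnvironment"]
            # Responses carry ARNs, so also accept a bare name by
            # matching the final path segment of the ARN.
            if ce == compute_env or ce.endswith("/" + compute_env):
                blocking.append(queue["jobQueueName"])
                break
    return blocking


# Hypothetical response data, shaped like DescribeJobQueues output.
sample_queues = [
    {
        "jobQueueName": "RasterVisionGpuJobQueue",
        "computeEnvironmentOrder": [
            {
                "order": 1,
                "computeEnvironment": "arn:aws:batch:us-east-1:123456789012:"
                "compute-environment/RasterVisionGpuComputeEnvironment",
            }
        ],
    },
    {
        "jobQueueName": "RasterVisionCpuJobQueue",
        "computeEnvironmentOrder": [
            {
                "order": 1,
                "computeEnvironment": "arn:aws:batch:us-east-1:123456789012:"
                "compute-environment/RasterVisionCpuComputeEnvironment",
            }
        ],
    },
]

print(queues_referencing("RasterVisionGpuComputeEnvironment", sample_queues))
# prints ['RasterVisionGpuJobQueue']
```

Any queue this returns would need to be disabled and deleted, or pointed at a different compute environment, before DeleteComputeEnvironment can succeed.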

not able to create profile after creating AWS batch

Hi,

After creating the AWS Batch resources, I cannot find the "default" file containing the configuration, which needs to be updated with:

[AWS_BATCH]
job_queue=RasterVisionGpuJobQueue
cpu_job_queue=RasterVisionCpuJobQueue
job_definition=RasterVisionHostedGpuJobDefinition
cpu_job_definition=RasterVisionHostedCpuJobDefinition
attempts=1

I can't find any default configuration file in ~/.rastervision.
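
If the profile does have to be created by hand, something like the following could write it. This is only a sketch: it assumes Raster Vision reads the profile from ~/.rastervision/default, it uses the values quoted above, and the helper name write_profile is made up for illustration.

```python
import configparser
from pathlib import Path

# Values copied from the [AWS_BATCH] section quoted in this issue.
AWS_BATCH_SETTINGS = {
    "job_queue": "RasterVisionGpuJobQueue",
    "cpu_job_queue": "RasterVisionCpuJobQueue",
    "job_definition": "RasterVisionHostedGpuJobDefinition",
    "cpu_job_definition": "RasterVisionHostedCpuJobDefinition",
    "attempts": "1",
}


def write_profile(profile_path):
    """Write an [AWS_BATCH] profile file, creating parent directories."""
    config = configparser.ConfigParser()
    config["AWS_BATCH"] = AWS_BATCH_SETTINGS
    profile_path = Path(profile_path)
    profile_path.parent.mkdir(parents=True, exist_ok=True)
    with profile_path.open("w") as f:
        # key=value without spaces, matching the snippet in the issue.
        config.write(f, space_around_delimiters=False)
    return profile_path


# Example call (assumed location of the default profile):
# write_profile(Path.home() / ".rastervision" / "default")
```

The queue and job definition names must match what the CloudFormation stack actually created in your account.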

Add CloudFormation template for adding a job def

The current workflow is for each developer to run the template to create an entirely new stack. This seemed like a good idea, except that there is a limit on the number of queues and compute environments (https://docs.aws.amazon.com/batch/latest/userguide/service_limits.html), and the Batch console loads very slowly once you have more than a few of these (it seems to make a fetch for each queue and state combination).

A better workflow seems to be to create the full stack once per account, and then have each developer create only their own job definition, which references the global job queues and compute environments.

job-definition/RasterVisionHostedGpuJobDefinition not found

Hello,

I am trying to run Raster Vision with AWS Batch after finishing the setup with raster-vision-aws. When I run "cat .rastervision/default", I see the following configuration:
[AWS_BATCH]
job_queue=RasterVisionGpuJobQueue
cpu_job_queue=RasterVisionCpuJobQueue
job_definition=RasterVisionHostedGpuJobDefinition
cpu_job_definition=RasterVisionHostedCpuJobDefinition
attempts=1
After that, I ran "rastervision run aws_batch -p some_script.py" in a Docker container and received the following error:
An error occurred (ClientException) when calling the SubmitJob operation: JobDefinition arn:aws:batch:us-east-2:193628433786:job-definition/RasterVisionHostedGpuJobDefinition not found.

Do you know why the job definition is missing here? FYI, I created the CloudFormation stack using your template.yaml.

Thank you for any help.
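
A quick way to narrow this down is to compare the names in the profile against the job definitions that actually exist in the region named in the error (us-east-2 here). Below is a minimal sketch, assuming the set of active names has already been fetched, for example from the "jobDefinitionName" fields of a Batch DescribeJobDefinitions response. The helper name and the sample set of active names are hypothetical.

```python
import configparser


def missing_job_definitions(profile_text, active_names):
    """Return {config_key: name} for job definitions not in active_names."""
    config = configparser.ConfigParser()
    config.read_string(profile_text)
    section = config["AWS_BATCH"]
    wanted = {
        key: section[key]
        for key in ("job_definition", "cpu_job_definition")
        if key in section
    }
    return {key: name for key, name in wanted.items() if name not in active_names}


profile_text = """\
[AWS_BATCH]
job_queue=RasterVisionGpuJobQueue
cpu_job_queue=RasterVisionCpuJobQueue
job_definition=RasterVisionHostedGpuJobDefinition
cpu_job_definition=RasterVisionHostedCpuJobDefinition
attempts=1
"""

# Hypothetical: pretend only the CPU job definition exists in this region.
active_names = {"RasterVisionHostedCpuJobDefinition"}

print(missing_job_definitions(profile_text, active_names))
# prints {'job_definition': 'RasterVisionHostedGpuJobDefinition'}
```

If a name is reported as missing, it is worth checking whether the stack was created in a different region or whether its resource names carry a prefix.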

Add support for AWS Batch launch templates

This issue was copied over from the old version of this repo which has been deleted.

Hector: Blocked by hashicorp/terraform-provider-aws#6454

Jean: Launch templates are supported in CloudFormation, so we could use them in https://github.com/azavea/raster-vision-cloudformation. Would this allow us to remove the AMI creation steps from the project setup?

Hector: I think we determined that the answer is no, because we still need to apply: https://github.com/azavea/raster-vision-aws/blob/master/packer/scripts/configure-gpu.sh

Packer SSH handshake bug

@jamesmcclain reported that:

"The packer build fails with the latest Deep Learning AMI (Version 21):

screenshot_2019-01-31_19-16-39

That seems to be a well-known issue (see here and here) in various contexts."

I tried this and was able to run the job. So this makes me think it's an intermittent problem.

Batch jobs mysteriously terminated

Around half the time, after a job has been running in one of the RV GPU queues for a while, the job fails with Status Reason: Host EC2 Terminated. This happened to a job I ran in the lewfishRasterVisionGpuJobQueue.

After looking in the Spot History Request in the AWS console, it seems to happen because there is "no more unused capacity available in this pool." It then proceeds to try running the job in each of the 4 availability zones that the Compute Environment is configured to use. These 4 AZs/subnets were chosen because they are the ones available for the p3.2xlarge instance type, as seen in the Spot Price Graph.

It's possible that there is no available capacity, but that seems strange because the max spot price we set was $1.83, which is above the spot price for all AZs for the past week. Is it possible to get booted off due to unavailable capacity without the spot price exceeding the max price we set? (Edit: it seems so. See below.)

BTW, I think this problem started happening around the time that we switched over to the new CloudFormation setup, but I'm not sure.

screen shot 2019-02-28 at 11 46 14 am

screen shot 2019-02-28 at 11 54 37 am

screen shot 2019-02-28 at 11 54 57 am

In case it is useful, the entire spot request history was requested using

aws ec2 describe-spot-fleet-request-history --spot-fleet-request-id sfr-818dd790-ca2d-4f3f-a59d-b23073030499 --start-time 2019-02-27 > spot-request-history.json

and is in spot-request-history.json.txt
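
To get an overview of a history file like this, the events can be tallied by type and subtype. This is a sketch only, assuming the JSON has the shape returned by describe-spot-fleet-request-history (a "HistoryRecords" list whose entries carry "EventType" and "EventInformation"). The sample records below are invented for illustration and are not taken from the attached file.

```python
import json
from collections import Counter


def summarize_history(history):
    """Count history records by (EventType, EventSubType)."""
    counts = Counter()
    for record in history.get("HistoryRecords", []):
        info = record.get("EventInformation", {})
        key = (record.get("EventType", "unknown"), info.get("EventSubType", "unknown"))
        counts[key] += 1
    return counts


# Invented sample data in the assumed response shape.
sample = json.loads("""
{
  "HistoryRecords": [
    {"EventType": "instanceChange",
     "EventInformation": {"EventSubType": "launched", "InstanceId": "i-0001"}},
    {"EventType": "instanceChange",
     "EventInformation": {"EventSubType": "terminated", "InstanceId": "i-0001"}},
    {"EventType": "instanceChange",
     "EventInformation": {"EventSubType": "terminated", "InstanceId": "i-0002"}}
  ]
}
""")

for (event_type, sub_type), n in sorted(summarize_history(sample).items()):
    print(event_type, sub_type, n)

# For the real file:
# with open("spot-request-history.json") as f:
#     print(summarize_history(json.load(f)))
```

A high count of terminations relative to launches would be consistent with the instances being reclaimed while jobs were still running.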

Add permissions to developers to create stacks

Currently in the R&D account, only admins can create new RV Batch stacks because the developer group is missing the required permissions. I tried to fix this before, but it turned out to be non-trivial, so I paused the work since it wasn't high priority.
