
aws-parallelcluster's People

Contributors

awslitvit, bwbarrett, cfncluster-ami-bot, charlesg3, chenwany, davprat, ddeidda, delongmeng, demartinofra, dougalb, dreambeyondorange, eantonin, eddymm, enrico-usai, ermanno, fnubalaj, francesco-giordano, gmarciani, hanwen-pcluste, hehe7318, hgreebe, himani2411, jdeamicis, lukeseawalker, mauri-melato, mohanasudhan, nssirena, rexcsn, sean-smith, yuleiwan

aws-parallelcluster's Issues

cfncluster create error message

Hi,
if I have run cfncluster as root first, and then as a normal user, I get the following error message:
[ec2-user@ip-172-31-32-159 ~]$ cfncluster create isc
Traceback (most recent call last):
  File "/usr/bin/cfncluster", line 9, in <module>
    load_entry_point('cfncluster==0.0.12', 'console_scripts', 'cfncluster')()
  File "/usr/lib/python2.6/site-packages/cfncluster/cli.py", line 50, in main
    filemode='w')
  File "/usr/lib64/python2.6/logging/__init__.py", line 1402, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "/usr/lib64/python2.6/logging/__init__.py", line 827, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.6/logging/__init__.py", line 846, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/tmp/cfncluster-cli.log'

Apparently the permissions on the log file are set on the first call, so a file created by root cannot be opened for writing by another user.

br, Jan
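
One possible workaround, sketched here and not taken from the cfncluster source, is to put the user name into the log file name so that a file created by root never collides with the one a normal user tries to open. The helper names `per_user_log_path` and `configure_logging` and the file name pattern are assumptions made for illustration.

```python
import getpass
import logging
import os
import tempfile


def per_user_log_path(user=None, base_dir=None):
    """Return a log path that embeds the user name (hypothetical helper)."""
    user = user or getpass.getuser()
    base_dir = base_dir or tempfile.gettempdir()
    return os.path.join(base_dir, "cfncluster-cli-%s.log" % user)


def configure_logging(path):
    # Same basicConfig style as in the traceback above, but each user
    # writes to a separate file, so root's file is never reopened.
    logging.basicConfig(filename=path, filemode="w", level=logging.DEBUG)
```

Calling `configure_logging(per_user_log_path())` at startup would then give root and ec2-user separate files in the temporary directory.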

cannot stop cluster

When I ran the command "cfncluster stop mycluster", the cluster did not stop and the command gave no response at all. Any thoughts?

Thanks,
Michael

cluster creation requires IAM permissions

This is specific to my AWS IAM privileges, but I get the following when I try to create a cluster:

Davids-MacBook-Pro:~ dkoppstein$ cfncluster status dkoppstein-cfncluster
Status: ROLLBACK_COMPLETE
2015-03-25 15:41:44.703000 CREATE_FAILED AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
2015-03-25 15:41:30.679000 CREATE_FAILED AWS::DynamoDB::Table DynamoDBTable Resource creation cancelled
2015-03-25 15:41:30.596000 CREATE_FAILED AWS::EC2::EIP MasterEIP Resource creation cancelled
2015-03-25 15:41:30.593000 CREATE_FAILED AWS::EC2::Volume SharedVolume Resource creation cancelled
2015-03-25 15:41:29.187000 CREATE_FAILED AWS::IAM::Role RootRole API: iam:CreateRole User: arn:aws:iam::643571691154:user/davidk is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::643571691154:role/cfncluster-dkoppstein-cfncluster-RootRole-TTPCGJFH359D

Neither Elasticluster nor StarCluster requires CreateRole on a root resource; it'd be nice if cfncluster didn't require it either.

Export DRMAA_LIBRARY_PATH

When using SGE, export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx-amd64/libdrmaa.so in one of the profile files.

Configure failing in python 3.5.1

Hi,

Are there plans to make cfncluster python 3 compatible? I'm receiving the following error when I try to run cfncluster configure under 3.5.1.

AttributeError: module 'easyconfig' has no attribute 'configure'

Thanks!

Changing the size of the predefined AMI's EBS volume

Dear All,

I am trying to modify the size of the EBS volume of the prebuilt AMI (ami-f1dd7e9a).
After I change its size from 10GB to 30GB, cfncluster cannot create the master node and the stack rolls back.
My aim is to set up an AMI with all the source code and the necessary input files, based on the prebuilt AMI (ami-f1dd7e9a).
Is there any other possible solution or workaround?

Thanks in advance,
Bill

build cluster from linux AMIs?

It'd be nice to be able to build cfncluster from standard AMIs. Currently, the StarCluster AMIs are very out of date (stuck on Ubuntu 12 series) which causes problems with software installation. An AWS customer representative suggested cfncluster as an alternative, but I'm worried that the cfncluster AMIs will go out of date too in the future.

SGE 'pending jobs' count - ignore held jobs

Hi,

Firstly, thanks Dougal for your efforts with cfn-cluster, it's great.
I think I've spotted a small issue with the pending jobs count for the sge scheduler. Currently this is calculated as follows:

pending=$(qstat -g d -s p -u '*' | tail -n+3 | awk '{total = total+ $8}END{print total}')

The problem is, this doesn't take into account whether jobs are held pending the completion of other job dependencies (their state is 'hqw' rather than 'qw'). Obviously held jobs don't want to be included in the pending count for scaling purposes since they won't run even if there are slots available.

The following seems to be a quick fix (caveat I know as much about awk as a fish knows about cycling).

pending=$(qstat -g d -s p -u '*' | tail -n+3 | awk '$5 == "qw" {total = total+ $8} END {print total}')

Hope that's helpful.

Cheers,

Mike
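
For readers who prefer Python to awk, the same filter can be sketched as below. It assumes the column layout implied by the awk commands above (state in field 5, slots in field 8, two header lines); the sample rows are made up for illustration and `count_pending_slots` is a hypothetical name.

```python
def count_pending_slots(qstat_output):
    """Sum the slots of jobs in state 'qw', ignoring held ('hqw') jobs."""
    total = 0
    # Skip the two header lines, like `tail -n+3` in the shell version.
    for line in qstat_output.splitlines()[2:]:
        fields = line.split()
        # Field 5 is the job state and field 8 the slot count (1-based).
        if len(fields) >= 8 and fields[4] == "qw":
            total += int(fields[7])
    return total


# Illustrative qstat-style output, not captured from a real cluster.
SAMPLE = "\n".join([
    "job-ID  prior   name   user      state submit/start at     queue  slots ja-task-ID",
    "----------------------------------------------------------------------------------",
    "      1 0.55500 job_a  ec2-user  qw    03/25/2015 15:41:44            2",
    "      2 0.55500 job_b  ec2-user  hqw   03/25/2015 15:41:45            4",
    "      3 0.55500 job_c  ec2-user  qw    03/25/2015 15:41:46            1",
])

print(count_pending_slots(SAMPLE))
```

Unlike the awk one-liner, which prints an empty string when nothing matches, this version returns 0 when there are no runnable pending jobs.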

cfncluster create error

After I updated to the new version and tried to create a new cluster, I found that the template URL, which refers to S3, is not valid. Any thoughts? Thanks!

no signal that MasterServer is provisioned for certain types of custom VPCs

During "cfncluster create examplename", when using a custom VPC setting defined in the AWS Console, the system gets stuck at "Status: MasterServer - CREATE_IN_PROGRESS", even though the MasterServer does get successfully created (upon manually inspecting it in EC2 Dashboard).

It's possible my inexperience with subnets caused some sort of communication feedback problem. I used a VPC CIDR with 10.0.0.0/16 (DNS resolution: Yes, DNS hostname: Yes) and set up a subnet 10.0.0.0/26 (Default Subnet: No, Auto-assign Public IP: No) with an attached Internet gateway and a default DHCP option (ec2.internal). The VPC was launched in us-east-1.

This led me to try digging around all of the underlying code... Finally I considered that maybe my VPC attempt was naive somehow, and so went with one of the Default VPCs. At first, still a delay in the response of MasterServer...

But, then, the magical words: cfncluster-examplename CREATE_COMPLETE ... with all the outputs that you want to see, including the elegant neuroscience-inspired name "Ganglia". :)

Anyway, I wanted to post this experience (a bug, or at least a failure of the system to help me diagnose a potentially problematic VPC configuration) so that others might learn from it, and so that this case can be considered as development continues.

Race condition when nodes terminate (SGE)

There is a race condition in the nodewatcher-sqswatcher system for scaling down. Normal termination flow is:

  1. Nodewatcher determines node is eligible for termination
  2. Nodewatcher sends termination request to Auto Scaling
  3. Auto Scaling terminates the node and sends a notification to SNS
  4. SNS puts a message on the SQS queue
  5. Sqswatcher on the master receives the message
  6. Sqswatcher removes the node from SGE queues and execute host list

The problem happens when a job is submitted and scheduled onto the node any time after step 1 and before step 6. Grid Engine will fail to remove the node from the queue if jobs are scheduled onto that node, even if the node is no longer responding to requests (and the jobs are still in 'transferring' state). If the node fails to remove, Grid Engine will continue to try to send jobs to it.

My suggestion to help resolve this is that, immediately after step 1 above, the node disables its own queue with the command qmod -d all.q@<node>. This shortens the race condition window by preventing jobs from being scheduled during the remaining steps. It does not eliminate the race condition entirely.
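
A minimal sketch of that suggestion follows. It is not the nodewatcher code; `disable_queue_command` and `disable_own_queue` are hypothetical helpers, and the only real command involved is the `qmod -d all.q@<node>` call quoted above.

```python
import subprocess


def disable_queue_command(node, queue="all.q"):
    """Build the qmod call that disables this node's queue instance."""
    return ["qmod", "-d", "%s@%s" % (queue, node)]


def disable_own_queue(node, run=subprocess.check_call):
    # Intended to run right after step 1, before the termination request
    # goes to Auto Scaling, so no new jobs land on the node in between.
    # `run` is injectable so the sketch can be exercised without SGE.
    run(disable_queue_command(node))
```

A job that was dispatched just before the queue was disabled can still end up on the node, which is why this narrows the window without closing it.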

post_install using non-public-read script

Hello,

I'm trying to create a post_install script to handle all the fun environmental customizations. When I go by the docs and set the file to be 'public-read', everything works fine. However, if I attempt to use a script that is accessible only by the IAM user creating the cluster, with a config line such as

post_install = s3://mybucket/testme.sh

the cfncluster script fails to create the cluster. I guess I had assumed that the post_install would have access to the aws_access_key_id and aws_secret_access_key environment variables. Is there any reasonable way to avoid making the install script world-readable?

Thanks!

Cloudformation fails and cfncluster rollbacks during cluster creation

Dear all,

I have been unable to create a stable cluster with cfncluster. The cluster is created and then, immediately after all nodes come up, the cluster is destroyed. After looking at the log file and wandering around in AWS, I found the following events in the CloudFormation event log:

14:26:13 UTC-0450 ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack cfncluster-slam The following resource(s) failed to create: [ComputeFleet]. . Rollback requested by user.

14:26:11 UTC-0450 CREATE_FAILED AWS::AutoScaling::AutoScalingGroup ComputeFleet Received FAILURE signal with UniqueId i-624e0bb6

14:18:47 UTC-0450 CREATE_IN_PROGRESS AWS::AutoScaling::Auto Resource creation Initiated

It seems that I am missing some kind of permission but I have no idea how to solve this problem. My config file (edited) is attached below.

Any help is appreciated!

Thanks

Blai

screen shot 2015-10-21 at 3 03 35 pm

Not able to create CFNcluster

ssh2

I can’t manage to install CFNcluster, the error I get is in an attachment.

My workflow is:
  • Use AMI cfncluster - CentOS6.5 x86_64 HVM - 20140806-0 (ami-208a5b48) (us-east)
  • Set up VPC and enable DNS resolution & hostnames
  • Install CFNcluster
    - Copy config file into new directory “/.cfncluster”
  • Edit config file with openlava and AWS credentials & key location
  • Run “cfncluster start mycluster”

any ideas? Thanks

Support Adding more Values to /opt/cfncluster/cfnconfig

It appears the only way to pass in variables right now is pre_install_args and post_install_args. I suggest allowing the user to add more variables in /opt/cfncluster/cfnconfig, which would be more akin to 12 factor's use of environment variables.

In addition, consider sourcing /opt/cfncluster/cfnconfig in either /etc/bashrc, or /etc/profile.

Is it possible to customize AMI?

I have a prebuilt AMI and I want to launch a cluster with all worker nodes created from the customized AMI. I am not sure if it is possible to support this mode.
Besides, how can I use the public AMI? For example, the AMI name is amzn-ami-hvm-2014.09.1.x86_64-ebs from the source amazon/amzn-ami-hvm-2014.09.1.x86_64-ebs.

I cannot access the template URL given in the config file: https://s3.amazonaws.com/cfncluster-us-east-1/templates/cfncluster.cfn.json. So how can I find the other available templates?

Thanks!

SGE cluster crash issue

Hi Dougal,

Recently, I created an SGE cluster with 20 nodes (cc2.8xlarge) to process our jobs for 30+ datasets. For each dataset, I need to submit 5,000+ jobs to the SGE cluster. When SGE finishes one dataset, I submit another dataset (5,000+ jobs) to the queue. However, the cluster crashed after roughly every 5 datasets. Below is the error message:

Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

When I log in to the master node, I get no response when I type the commands "qhost" and "qstat". I also cannot submit any jobs to the queue. Do you have any suggestions?

Thanks,
Michael

Setting AWS options for cfncluster

It would be nice if the following two changes could be made to cfncluster.

  1. Allow cfncluster to use the same sorts of environment variables as the AWS cli, such as AWS_DEFAULT_REGION.
  2. Follow the same convention for using these options (commandline, environment variables then config file)
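
The requested precedence (command line first, then environment variables, then the config file) can be sketched as below. `resolve_option` is a hypothetical helper, not part of cfncluster; `AWS_DEFAULT_REGION` is the environment variable named in the request.

```python
import os


def resolve_option(cli_value, env_name, config_value, environ=None):
    """Pick a setting: command line, then environment, then config file."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if environ.get(env_name):
        return environ[env_name]
    return config_value


# Example: no --region flag given, the environment variable wins over the config.
region = resolve_option(
    None, "AWS_DEFAULT_REGION", "us-east-1",
    environ={"AWS_DEFAULT_REGION": "eu-west-1"},
)
print(region)
```

Passing `environ` explicitly is only a convenience for testing; by default the sketch reads the real process environment.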

TemplateURL must reference a valid S3 object to which you have access

Hello,

I seem to be getting the same error as #33 with version 0.0.22

(cfncluster)HLI-MACPRO48:envs jpiper$ cfncluster create mycluster
Starting: mycluster
TemplateURL must reference a valid S3 object to which you have access.
(cfncluster)HLI-MACPRO48:envs jpiper$ cfncluster version
0.0.22
(cfncluster)HLI-MACPRO48:envs jpiper$ pip freeze
boto==2.38.0
cfncluster==0.0.22
wheel==0.24.0

Permit multiple s3 resources

How would we specify an array value for s3_read_resource and s3_read_write_resource if we needed access to multiple ARNs?

Alternatively, some mechanism to specify a custom role to attach to the EC2 instances would solve the problem too.

mounting preexisting EBS volume part II

I realize that you've recommended making a snapshot of the volume to mount (which works fine), but can you implement a way to mount directly from a saved volume?

The advantage for me at least is that any data generated and output to the persistent volume during program execution is saved without losing it when the cluster is terminated. I can then remount as needed to another instance of cfncluster after going through the generated dataset without the need to find a way to save it elsewhere.

Thoughts?

Scaling Seems to Double Apply

I think there is an issue where the nodes take longer to start up than the threshold time leading to the start trigger firing twice.

Right now I have:

Execute policy when:
cfncluster-mycluster7-AddCapacityAlarm-XXXXX    
breaches the alarm threshold: pending >= 4 for 2 consecutive periods of 60 seconds
for the metric dimensions
Stack = cfncluster-mycluster7
Take the action:
Add 2 instances 
And then wait: 120 seconds before allowing another scaling activity

but I am seeing 4 nodes started.
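
The hypothesis can be illustrated with a deliberately simplified model of the policy above. The boot time of the nodes, the assumption that the pending count stays flat until the first new nodes join, and the assumption that the action fires again whenever the alarm is still breached once the wait has elapsed are all assumptions of this sketch, not measured or documented behaviour.

```python
def simulate_scale_up(boot_time, threshold=4, periods=2, period=60,
                      add=2, cooldown=120, pending=4, horizon=900):
    """Count instances launched by a threshold alarm with a cooldown."""
    launched = 0
    breaches = 0          # consecutive periods at or above the threshold
    cooldown_until = 0    # earliest time another scaling activity may run
    first_ready = None    # time at which the first new nodes accept jobs
    for t in range(period, horizon + 1, period):
        if first_ready is not None and t >= first_ready:
            pending = 0   # assumption: the first new nodes absorb the backlog
        breaches = breaches + 1 if pending >= threshold else 0
        if breaches >= periods and t >= cooldown_until:
            launched += add
            cooldown_until = t + cooldown
            if first_ready is None:
                first_ready = t + boot_time
    return launched


print(simulate_scale_up(boot_time=60))   # nodes join before the wait expires
print(simulate_scale_up(boot_time=180))  # nodes join after the wait expires
```

In this model a 60-second boot yields 2 instances, while a 180-second boot yields 4, because the pending count is still at the threshold when the 120-second wait ends. That matches the symptom described, but it does not prove this is the cause.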

Can cfncluster mount the existing ebs volumes?

Can I automatically mount existing EBS volumes on a cluster if I have already created them? For example, can I specify the volume IDs in the configuration file before creating/starting a cluster?

Thanks,
Michael

Provide IP output for failed cluster creation with --norollback specified

When building new clusters I often find it necessary to use the --norollback option so I can get into the Master instance and check logs. However, running cfncluster status on a failed cluster with a running Master does not produce the instance's public or private IP address as it does with a successful cluster launch.

Is there a design reason why the status command does not provide IP information on a failed cluster launch that has a master running (--norollback)? Running something like aws(cli) ec2 describe-instances --instance-ids i-a440567f | grep PublicIpAddress works but it's not as convenient.

John
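
Until the status command reports this, the grep workaround can be made slightly more convenient by parsing the JSON that `aws ec2 describe-instances` prints. This is a sketch only: `instance_ips` is a hypothetical helper, and the sample document uses placeholder values for the IDs and addresses.

```python
import json


def instance_ips(describe_instances_json):
    """Map instance IDs to (public IP, private IP) from describe-instances JSON."""
    data = json.loads(describe_instances_json)
    ips = {}
    for reservation in data.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # PublicIpAddress is absent when the instance has no public IP.
            ips[instance["InstanceId"]] = (
                instance.get("PublicIpAddress"),
                instance.get("PrivateIpAddress"),
            )
    return ips


# Placeholder output shaped like the CLI's JSON response.
SAMPLE = json.dumps({
    "Reservations": [{
        "Instances": [{
            "InstanceId": "i-a440567f",
            "PublicIpAddress": "203.0.113.10",
            "PrivateIpAddress": "10.0.0.12",
        }]
    }]
})

print(instance_ips(SAMPLE))
```

The function accepts the raw text of the CLI output, so it can be fed from a file or from a subprocess call.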

Indefinite Hang on MasterServer - CREATE_IN_PROGRESS

I installed cfncluster on Ubuntu 14.04 with sudo pip install cfncluster. My config:

[cluster default]
vpc_settings = test
key_name = mykey
cluster_type = spot
spot_price = 0.030
use_public_ips = false
custom_ami = ami-######
compute_instance_type = c4.large
master_root_volume_size = 50

[vpc test]
master_subnet_id = subnet-######
vpc_id = vpc-######

[global]
update_check = true
sanity_check = true
cluster_template = default

When I run cfncluster create default, it prints the status of a few resources being created, and then it hangs indefinitely on:

$ cfncluster create default
Starting: default
Status: MasterServer - CREATE_IN_PROGRESS

In the AWS online console I can see the node named "Master" has successfully started and status checks have passed. It has an elastic IP and I can SSH into it.
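
To make the relationship between the sections of the config above explicit, here is a small sketch using Python's standard `configparser`. It only shows how `cluster_template` and `vpc_settings` point at the `[cluster default]` and `[vpc test]` sections; `resolve_sections` is a hypothetical helper and says nothing about how cfncluster itself reads the file.

```python
import configparser

# The config from the report above, with its redacted IDs kept as posted.
CONFIG_TEXT = """
[cluster default]
vpc_settings = test
key_name = mykey
cluster_type = spot
spot_price = 0.030
use_public_ips = false
custom_ami = ami-######
compute_instance_type = c4.large
master_root_volume_size = 50

[vpc test]
master_subnet_id = subnet-######
vpc_id = vpc-######

[global]
update_check = true
sanity_check = true
cluster_template = default
"""


def resolve_sections(text):
    """Follow cluster_template and vpc_settings to the sections they name."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    template = parser.get("global", "cluster_template")
    cluster = parser["cluster " + template]
    vpc = parser["vpc " + cluster["vpc_settings"]]
    return dict(cluster), dict(vpc)


cluster, vpc = resolve_sections(CONFIG_TEXT)
print(cluster["use_public_ips"], vpc["master_subnet_id"])
```

Printing the resolved values this way is a quick check that the settings in effect, such as use_public_ips and the subnet, are the ones intended.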

Setting Slots per Host

There does not appear to be any setting for controlling the number of slots available on a host (aka compute node).
