aws / aws-parallelcluster Goto Github PK

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

License: Apache License 2.0

Shell 2.54% Python 96.59% Dockerfile 0.05% C 0.05% Jinja 0.06% Smithy 0.72%

aws-parallelcluster's Issues

cfncluster create error message

Hi,
if I have run cfncluster as root first, and then as a normal user, I get the following error message:
[ec2-user@ip-172-31-32-159 ~]$ cfncluster create isc
Traceback (most recent call last):
  File "/usr/bin/cfncluster", line 9, in <module>
    load_entry_point('cfncluster==0.0.12', 'console_scripts', 'cfncluster')()
  File "/usr/lib/python2.6/site-packages/cfncluster/cli.py", line 50, in main
    filemode='w')
  File "/usr/lib64/python2.6/logging/__init__.py", line 1402, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "/usr/lib64/python2.6/logging/__init__.py", line 827, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.6/logging/__init__.py", line 846, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/tmp/cfncluster-cli.log'

Apparently the permissions are set on the first call.

br, Jan
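
One way to avoid this kind of collision is to give each user a separate log file. The sketch below is only an illustration, not the fix adopted by cfncluster: the helper name `per_user_log_path` and the idea of appending the numeric user ID are assumptions; only the `cfncluster-cli.log` base name and the `basicConfig` call come from the traceback above.

```python
import logging
import os
import tempfile


def per_user_log_path(prefix="cfncluster-cli"):
    """Return a log path in the temp directory that includes the caller's UID.

    Hypothetical helper: a file created by root is never reused by a
    normal user, so each user opens a file they own.
    """
    return os.path.join(tempfile.gettempdir(), "%s-%d.log" % (prefix, os.getuid()))


# Configure logging as in the traceback (filemode='w'), but per user.
logging.basicConfig(filename=per_user_log_path(), filemode="w", level=logging.DEBUG)
```

Writing the log under the user's home directory would be another option; the per-UID name simply keeps the existing temp-directory location.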

Allow additional_sg to be a list rather than just one item

It looks like additional_sg only allows one additional sg. I suggest allowing this to be a comma separated list.
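
A minimal sketch of how a comma separated value could be split into a list. The function name `parse_additional_sg` and the example group IDs are hypothetical; this is not cfncluster's actual parsing code.

```python
def parse_additional_sg(value):
    """Split a comma separated setting into a list of security group IDs.

    Surrounding whitespace is dropped and empty items are ignored, so a
    single ID (today's behaviour) still yields a one-element list.
    """
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_additional_sg("sg-11111111, sg-22222222"))
# ['sg-11111111', 'sg-22222222']
```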

cannot stop cluster

When I ran the command "cfncluster stop mycluster", the cluster did not stop and I got no response at all. Any thoughts?

Thanks,
Michael

Add GPU as a (Complex) Resource in SGE

SGE has support for defining custom resources. GPUs are a natural missing resource. There is source code already for such a resource: https://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/loadsensors/gpu-loadsensor.c
and various SO discussions: http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices

Ganglia might have some support too: http://gridengine.org/pipermail/users/2012-May/003567.html

Mount /opt/sge with sync flag

Consider mounting the /opt/sge directory with the sync flag so that output will stream back as it is created. For more discussion see: https://arc.liv.ac.uk/trac/SGE/ticket/1545 and the referenced SO tickets.

cluster creation requires IAM permissions

This is specific to my AWS IAM privileges, but I get the following when I try to create a cluster:

Davids-MacBook-Pro:~ dkoppstein$ cfncluster status dkoppstein-cfncluster
Status: ROLLBACK_COMPLETE
2015-03-25 15:41:44.703000 CREATE_FAILED AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
2015-03-25 15:41:30.679000 CREATE_FAILED AWS::DynamoDB::Table DynamoDBTable Resource creation cancelled
2015-03-25 15:41:30.596000 CREATE_FAILED AWS::EC2::EIP MasterEIP Resource creation cancelled
2015-03-25 15:41:30.593000 CREATE_FAILED AWS::EC2::Volume SharedVolume Resource creation cancelled
2015-03-25 15:41:29.187000 CREATE_FAILED AWS::IAM::Role RootRole API: iam:CreateRole User: arn:aws:iam::643571691154:user/davidk is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::643571691154:role/cfncluster-dkoppstein-cfncluster-RootRole-TTPCGJFH359D

Neither Elasticluster nor StarCluster requires CreateRole on a root resource; it'd be nice if cfncluster didn't require it either.

Export DRMAA_LIBRARY_PATH

When using SGE, export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx-amd64/libdrmaa.so in one of the profile files.

Upgrade AMI to Centos 6.7

http://wiki.centos.org/Manuals/ReleaseNotes/CentOS6.7

TemplateURL must reference a valid S3 object to which you have access.

I get this error when trying to start a cluster:

$ cfncluster create mycluster
Starting: mycluster
TemplateURL must reference a valid S3 object to which you have access.

$ cfncluster version
0.0.21

Configure failing in python 3.5.1

Hi,

Are there plans to make cfncluster python 3 compatible? I'm receiving the following error when I try to run cfncluster configure under 3.5.1.

AttributeError: module 'easyconfig' has no attribute 'configure'

Thanks!

Changing the size of the predefined AMI's EBS volume

Dear All,

I am trying to modify the size of the EBS volume of the prebuilt AMI (ami-f1dd7e9a).
After I change its size from 10GB to 30GB, cfncluster cannot create the master node and rolls back.
My aim is to set up an AMI with all the source code and the necessary input files, based on the prebuilt AMI (ami-f1dd7e9a).
Is there any other possible solution or workaround?

Thanks in advance,
Bill

build cluster from linux AMIs?

It'd be nice to be able to build cfncluster from standard AMIs. Currently, the StarCluster AMIs are very out of date (stuck on Ubuntu 12 series) which causes problems with software installation. An AWS customer representative suggested cfncluster as an alternative, but I'm worried that the cfncluster AMIs will go out of date too in the future.

SGE 'pending jobs' count - ignore held jobs

Hi,

Firstly, thanks Dougal for your efforts with cfn-cluster, it's great.
I think I've spotted a small issue with the pending jobs count for the sge scheduler. Currently this is calculated as follows:

pending=$(qstat -g d -s p -u '*' | tail -n+3 | awk '{total = total+ $8}END{print total}')

The problem is, this doesn't take into account whether jobs are held pending the completion of other job dependencies (their state is 'hqw' rather than 'qw'). Obviously held jobs don't want to be included in the pending count for scaling purposes since they won't run even if there are slots available.

The following seems to be a quick fix (caveat I know as much about awk as a fish knows about cycling).

pending=$(qstat -g d -s p -u '*' | tail -n+3 | awk '$5 == "qw" {total = total+ $8} END {print total}')

Hope that's helpful.

Cheers,

Mike
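
The same filter can be sketched in Python for readers less comfortable with awk. The sample output below is hypothetical and only mimics the column layout the awk command relies on (state in field 5, slots in field 8 for pending jobs); the function name `count_pending_slots` is an assumption as well.

```python
# Hypothetical qstat output: two header lines, then one job per line.
SAMPLE_QSTAT = """\
job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
------------------------------------------------------------------------------------------
    101 0.55500 step1      ec2-user     qw    07/22/2015 00:56:13            2
    102 0.55500 step2      ec2-user     hqw   07/22/2015 00:56:14            4
    103 0.55500 step3      ec2-user     qw    07/22/2015 00:56:15            1
"""


def count_pending_slots(qstat_output, states=("qw",)):
    """Sum the slots of jobs whose state is in `states`, skipping held jobs."""
    total = 0
    for line in qstat_output.splitlines()[2:]:  # same as tail -n+3
        fields = line.split()
        # fields[4] is awk's $5 (state), fields[7] is awk's $8 (slots).
        if len(fields) >= 8 and fields[4] in states:
            total += int(fields[7])
    return total


print(count_pending_slots(SAMPLE_QSTAT))  # 3: the hqw job is ignored
```

Unlike the awk one-liner, this returns 0 rather than an empty string when no job matches, which may be easier to feed into a metric.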

cfncluster create error

After I updated to the new version and tried to create a new cluster, I found that the template URL, which refers to S3, is not valid. Any thoughts? Thanks!

Add Docs about what IAM Permissions are Needed

Add Docs about what IAM Permissions are Needed in order to start the cluster. For example, what permissions should I grant the user that I specify in the config?

Determine if master node in pre_install / post_install scripts

Set an environment variable indicating whether the script is running on the master or on a non-master node.

Best option now is to see if $cfn_master is set

Option to Auto-Create the Placement Group

Right now that does not seem to be the case: http://cfncluster.readthedocs.org/en/latest/configuration.html#placement-group whereas for StarCluster, the placement group is created automatically for the cluster.

no signal that MasterServer is provisioned for certain types of custom VPCs

During "cfncluster create examplename", when using a custom VPC setting defined in the AWS Console, the system gets stuck at "Status: MasterServer - CREATE_IN_PROGRESS", even though the MasterServer does get successfully created (upon manually inspecting it in EC2 Dashboard).

It's possible my inexperience with subnets caused some sort of communication feedback problem. I used a VPC CIDR with 10.0.0.0/16 (DNS resolution: Yes, DNS hostname: Yes) and set up a subnet 10.0.0.0/26 (Default Subnet: No, Auto-assign Public IP: No) with an attached Internet gateway and a default DHCP option (ec2.internal). The VPC was launched in us-east-1.

This led me to try digging around all of the underlying code... Finally I considered that maybe my VPC attempt was naive somehow, and so went with one of the Default VPCs. At first, still a delay in the response of MasterServer...

But, then, the magical words: cfncluster-examplename CREATE_COMPLETE ... with all the outputs that you want to see, including the elegant neuroscience-inspired name "Ganglia". :)

Anyways, I wanted to post this experience - a bug, or at least a failure of the system to help me address a potentially problematic VPC configuration - so that others might be able to tear this, and so that this case can be considered in developing this out.

Add Docker to Image

For Centos: https://docs.docker.com/installation/centos/
Including:

sudo usermod -aG docker ec2-user

openlava-web

Web interface for OpenLava:
https://www.clusterfsck.io/static/openlava-web/index.html

Race condition when nodes terminate (SGE)

There is a race condition in the nodewatcher-sqswatcher system for scaling down. Normal termination flow is:

1. Nodewatcher determines node is eligible for termination
2. Nodewatcher sends termination request to Auto Scaling
3. Auto Scaling terminates the node and sends a notification to SNS
4. SNS puts a message on the SQS queue
5. Sqswatcher on the master receives the message
6. Sqswatcher removes the node from SGE queues and execute host list

The problem happens when a job is submitted and scheduled onto the node any time after step 1 and before step 6. Grid Engine will fail to remove the node from the queue if jobs are scheduled onto that node, even if the node is no longer responding to requests (and the jobs are still in 'transferring' state). If the node fails to remove, Grid Engine will continue to try to send jobs to it.

My suggestion to help resolve this is for the node to disable its own queue with the command qmod -d all.q@<node> immediately after step 1 above. This shortens the race condition window by preventing jobs from being scheduled during the remaining steps. It does not eliminate the race condition entirely.
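
The suggested step could be sketched as follows. The function names are hypothetical and this is not nodewatcher's actual code; only the `qmod -d all.q@<node>` command comes from the text above.

```python
import subprocess


def disable_queue_command(hostname, queue="all.q"):
    """Build the qmod invocation that disables one queue instance."""
    return ["qmod", "-d", "%s@%s" % (queue, hostname)]


def disable_own_queue(hostname):
    """Run qmod so no new jobs are scheduled onto this node.

    Intended to be called right after the node is found eligible for
    termination and before the termination request is sent.
    Raises subprocess.CalledProcessError if qmod exits non-zero.
    """
    subprocess.check_call(disable_queue_command(hostname))
```

Keeping the command construction separate from the call makes it easy to log or test the exact command without a Grid Engine installation.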

Cloud Watch Metrics not Cleaned up When Stack Deleted

Accounting and Reporting Console (ARCo) on SGE

http://arc.liv.ac.uk/repos/darcs/arco/www/index.html or at the very least the dbwriter.

Upgrade Python to 2.7

Currently default AMI has 2.6.6

post_install using non-public-read script

Hello,

I'm trying to create a post_install script to handle all the fun environmental customizations. When I go by the docs and set the file to be 'public-read' everything works fine. However, if I attempt to use a script set to be accessible only by the IAM user creating the account and a config line such as

post_install = s3://mybucket/testme.sh

the cfncluster script fails to create the cluster. I guess I had assumed that the post_install would have access to the aws_access_key_id and aws_secret_access_key environment variables. Is there any reasonable way to not have the install script world readable?

Thanks!

Upgrade SGE to 8.1.1

http://arc.liv.ac.uk/repos/darcs/sge/NEWS

Cloudformation fails and cfncluster rollbacks during cluster creation

Dear all,

I have been unable to create a stable cluster with cfncluster. The cluster is created and then, immediately after all nodes come up, the cluster is destroyed. After looking at the log file and wandering around in AWS, I found the following events in the event log of CloudFormation:

14:26:13 UTC-0450 ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack cfncluster-slam The following resource(s) failed to create: [ComputeFleet]. . Rollback requested by user.

14:26:11 UTC-0450 CREATE_FAILED AWS::AutoScaling::AutoScalingGroup ComputeFleet Received FAILURE signal with UniqueId i-624e0bb6

14:18:47 UTC-0450 CREATE_IN_PROGRESS AWS::AutoScaling::Auto Resource creation Initiated

It seems that I am missing some kind of permission but I have no idea how to solve this problem. My config file (edited) is attached below.

Any help is appreciated!

Thanks

Blai

Not able to create CFNcluster

I can’t manage to install CFNcluster, the error I get is in an attachment.

My workflow is:
- Use AMI cfncluster - CentOS6.5 x86_64 HVM - 20140806-0 (ami-208a5b48) (Us-east)
- Setup VPC and enable DNS resolution & hostnames
- install CFNcluster
- copy config file into new directory “/.cfncluster”
- edit config file with openlava and AWS credentials & key location
- run “cfncluster start mycluster”

any ideas? Thanks

Support Adding more Values to /opt/cfncluster/cfnconfig

It appears the only way to pass in variables right now is pre_install_args and post_install_args. I suggest allowing the user to add more variables in /opt/cfncluster/cfnconfig, which would be more akin to 12 factor's use of environment variables.

In addition, consider sourcing /opt/cfncluster/cfnconfig in either /etc/bashrc, or /etc/profile.
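
For scripts that cannot simply source the file, a small reader is one alternative. This is a sketch under a strong assumption: that the file holds plain `KEY=value` lines, as the suggestion to source it from a shell profile implies. The sample keys and the function name are hypothetical and not taken from a real cluster.

```python
# Hypothetical contents, not copied from a real /opt/cfncluster/cfnconfig.
SAMPLE_CFNCONFIG = """\
# user supplied values
example_setting=value1
another_setting="value two"
"""


def parse_shell_assignments(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments.

    Only plain assignments are handled; shell expansion, `export`
    prefixes, and multi-line values are out of scope for this sketch.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


print(parse_shell_assignments(SAMPLE_CFNCONFIG))
```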

Is it possible to customize AMI?

I have a prebuilt AMI and I want to launch a cluster with all worker nodes created from the customized AMI. I am not sure if it is possible to support this mode.
Besides, how can I use the public AMI? For example, the AMI name is amzn-ami-hvm-2014.09.1.x86_64-ebs from the source amazon/amzn-ami-hvm-2014.09.1.x86_64-ebs.

In the config file, I cannot access the template URL: https://s3.amazonaws.com/cfncluster-us-east-1/templates/cfncluster.cfn.json. So how can I find other template available?

Thanks!

Document why enableDnsHostnames is required

StarCluster does not require this setting.

What are the issues with turning this on?

SGE cluster crash issue

Hi Dougal,

Recently, I created an SGE cluster with 20 nodes (cc2.8xlarge) to process our jobs with 30+ datasets. For each dataset, I need to submit 5,000+ jobs to the SGE cluster. When SGE finishes one dataset, I submit another dataset (5,000+ jobs) to the queue. However, the cluster crashed after about every 5 datasets it processed. Below is the error message:

Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

When I log in to the master node, I get no response when I type the commands "qhost" and "qstat", and I cannot submit any jobs to the queue either. Do you have any suggestions?

Thanks,
Michael

Setting AWS options for cfncluster

It would be nice if the following two changes could be made to cfncluster.

Allow cfncluster to use the same sorts of environment variables as the AWS cli, such as AWS_DEFAULT_REGION.
Follow the same convention for using these options (commandline, environment variables then config file)
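
The requested precedence (command line, then environment variable, then config file) can be sketched in a few lines. The function name `resolve_option` and the example region values are hypothetical; `AWS_DEFAULT_REGION` is the variable named above.

```python
import os


def resolve_option(cli_value, env_name, config_value, environ=None):
    """Pick a setting using command line > environment > config file.

    `environ` defaults to os.environ and is a parameter only so the
    lookup can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if environ.get(env_name):
        return environ[env_name]
    return config_value


# No command line value given, so the environment wins over the config file.
region = resolve_option(None, "AWS_DEFAULT_REGION", "us-east-1",
                        environ={"AWS_DEFAULT_REGION": "eu-west-1"})
print(region)  # eu-west-1
```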

Killed Instances not Cleaned up

$ qstat -f
[email protected] BIP   0/1/1          -NA-     lx-amd64      auo
    103 0.55500 bip_runner ec2-user     dt    07/22/2015 00:56:13     1        
---------------------------------------------------------------------------------
[email protected] BIP   0/1/1          -NA-     lx-amd64      auo
    105 0.55500 bip_runner ec2-user     dt    07/22/2015 00:58:28     1

TemplateURL must reference a valid S3 object to which you have access

Hello,

I seem to be getting the same error as #33 with version 0.0.22

(cfncluster)HLI-MACPRO48:envs jpiper$ cfncluster create mycluster
Starting: mycluster
TemplateURL must reference a valid S3 object to which you have access.

(cfncluster)HLI-MACPRO48:envs jpiper$ cfncluster version
0.0.22

(cfncluster)HLI-MACPRO48:envs jpiper$ pip freeze
boto==2.38.0
cfncluster==0.0.22
wheel==0.24.0

master_subnet_id shows up twice in the docs

Permit multiple s3 resources

How would we specify an array value for s3_read_resource and s3_read_write_resource, if we needed access to multiple ARNs?

Alternatively, some mechanism to specify a custom role to attach to the EC2 instances could solve the problem too.
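
On the IAM side, a policy statement's Resource element may be a list, so one conceivable approach is a comma separated setting expanded into that list. The sketch below assumes such a syntax; the function name, the chosen actions, and the bucket ARNs are all hypothetical, and this is not how cfncluster currently builds its role.

```python
import json


def s3_read_statement(resources):
    """Build a read-only IAM statement from a comma separated ARN list."""
    arns = [arn.strip() for arn in resources.split(",") if arn.strip()]
    return {
        "Effect": "Allow",
        "Action": ["s3:Get*", "s3:List*"],
        "Resource": arns,  # IAM accepts a list of ARNs here
    }


statement = s3_read_statement(
    "arn:aws:s3:::example-bucket-a/*, arn:aws:s3:::example-bucket-b/*"
)
print(json.dumps(statement, indent=2))
```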

Source / Destination IP Checks Disabled

Why is the "Source / Destination IP Checks" disabled on the master node?

mounting preexisting EBS volume part II

I realize that you've recommended making a snapshot of the volume to mount (which works fine), but can you implement a way to mount directly from a saved volume?

The advantage for me at least is that any data generated and output to the persistent volume during program execution is saved without losing it when the cluster is terminated. I can then remount as needed to another instance of cfncluster after going through the generated dataset without the need to find a way to save it elsewhere.

Thoughts?

Install SGE GUI

http://arc.liv.ac.uk/downloads/SGE/releases/8.1.8/gridengine-guiinst-8.1.8-1.el6.noarch.rpm

Scaling Seems to Double Apply

I think there is an issue where the nodes take longer to start up than the threshold time leading to the start trigger firing twice.

Right now I have:

Execute policy when:
cfncluster-mycluster7-AddCapacityAlarm-XXXXX    
breaches the alarm threshold: pending >= 4 for 2 consecutive periods of 60 seconds
for the metric dimensions
Stack = cfncluster-mycluster7
Take the action:
Add 2 instances 
And then wait: 120 seconds before allowing another scaling activity

but I am seeing 4 nodes started.
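
The suspected mechanism can be illustrated with a toy model. This is a deliberately simplified simulation, not a description of how CloudWatch or Auto Scaling actually evaluate alarms: it assumes the queue stays above the threshold until the first launched nodes finish booting, and that the policy fires whenever the alarm is active and the cooldown has elapsed. The 180 second boot time is a made-up value; the other numbers come from the policy above.

```python
def simulate_scale_up(boot_seconds, cooldown_seconds, period=60,
                      eval_periods=2, step=2, horizon=900):
    """Return how many instances the toy policy launches for one backlog."""
    launched = 0
    first_launch = None   # time of the first scaling action
    last_action = None    # time of the most recent scaling action
    breaches = 0          # consecutive periods above the threshold
    for t in range(period, horizon + 1, period):
        # The backlog clears once the first batch of nodes has booted.
        if first_launch is not None and t >= first_launch + boot_seconds:
            break
        breaches += 1
        in_alarm = breaches >= eval_periods
        cooled_down = last_action is None or t - last_action >= cooldown_seconds
        if in_alarm and cooled_down:
            launched += step
            last_action = t
            if first_launch is None:
                first_launch = t
    return launched


print(simulate_scale_up(boot_seconds=180, cooldown_seconds=120))  # 4
print(simulate_scale_up(boot_seconds=90, cooldown_seconds=120))   # 2
```

In this model the policy fires a second time whenever booting takes longer than the cooldown, which matches the 4 nodes observed; a cooldown longer than the boot time brings it back to 2.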

Can cfncluster mount the existing ebs volumes?

Can I automatically mount existing EBS volumes to a cluster if I have already created them? For example, can I specify the volume IDs in the configuration file before creating/starting a cluster?

Thanks,
Michael

Provide IP output for failed cluster creation with --norollback specified

When building new clusters I often find it necessary to use the --norollback option so I can get into the Master instance and check logs. However, running cfncluster status on a failed cluster with a running Master does not produce the instance's public or private IP address as it does with a successful cluster launch.

Is there a design reason why the status command does not provide IP information on a failed cluster launch that has a master running (--norollback)? Running something like aws(cli) ec2 describe-instances --instance-ids i-a440567f | grep PublicIpAddress works but it's not as convenient.

John
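
Until the status command reports this, the addresses can be pulled out of a describe-instances response instead of grepping for them. The sketch below works on the response structure returned by the EC2 DescribeInstances API (Reservations, then Instances); the function name and the sample IP addresses are hypothetical, and the instance ID is the one from the command above.

```python
def instance_addresses(response):
    """Return (instance_id, public_ip, private_ip) tuples from a response.

    Missing addresses come back as None, for example when an instance
    has no public IP.
    """
    addresses = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            addresses.append((
                instance.get("InstanceId"),
                instance.get("PublicIpAddress"),
                instance.get("PrivateIpAddress"),
            ))
    return addresses


# Hypothetical, trimmed-down response for illustration only.
sample_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-a440567f",
                        "PublicIpAddress": "203.0.113.10",
                        "PrivateIpAddress": "10.0.0.5"}]}
    ]
}
print(instance_addresses(sample_response))
```

The same function can be fed the parsed JSON output of the AWS CLI command or the dictionary returned by an SDK call.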

Rq not seen as queue to increase capacity

Hello. Thank you very much for a great tool!

Rescheduled jobs in the 'Rq' state don't seem to trigger load alarms. 'Rq' should probably be treated just like 'qw'.

Indefinite Hang on MasterServer - CREATE_IN_PROGRESS

I installed cfncluster on Ubuntu 14.04 with sudo pip install cfncluster. My config:

[cluster default]
vpc_settings = test
key_name = mykey
cluster_type = spot
spot_price = 0.030
use_public_ips = false
custom_ami = ami-######
compute_instance_type = c4.large
master_root_volume_size = 50

[vpc test]
master_subnet_id = subnet-######
vpc_id = vpc-######

[global]
update_check = true
sanity_check = true
cluster_template = default

When I cfncluster create default, I see it present the status of a few things being created, and then it hangs indefinitely on:

$ cfncluster create default
Starting: default
Status: MasterServer - CREATE_IN_PROGRESS

In the AWS online console I can see the node named "Master" has successfully started and status checks have passed. It has an elastic IP and I can SSH into it.
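
For readers unfamiliar with the config layout, the sketch below shows how the sections above reference each other: `[global] cluster_template` names a `[cluster ...]` section, whose `vpc_settings` names a `[vpc ...]` section. It does not explain the hang. It uses Python's standard configparser on a shortened copy of the config with placeholder IDs; the function name is hypothetical and this is not cfncluster's own loader.

```python
import configparser

# Shortened copy of the config above, with placeholder IDs.
SAMPLE_CONFIG = """\
[global]
cluster_template = default

[cluster default]
vpc_settings = test
key_name = mykey
use_public_ips = false

[vpc test]
master_subnet_id = subnet-EXAMPLE
vpc_id = vpc-EXAMPLE
"""


def resolve_sections(text):
    """Follow cluster_template and vpc_settings to their sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    template = parser.get("global", "cluster_template")
    cluster = dict(parser.items("cluster " + template))
    vpc = dict(parser.items("vpc " + cluster["vpc_settings"]))
    return cluster, vpc


cluster, vpc = resolve_sections(SAMPLE_CONFIG)
print(cluster["use_public_ips"], vpc["master_subnet_id"])
```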
