
starcluster's Introduction

StarCluster v0.95.6

StarCluster: Cluster Computing Toolkit for the Cloud
Version: 0.95.6
Author: Justin Riley <[email protected]>
Team: Software Tools for Academics and Researchers (http://star.mit.edu)
Homepage: http://star.mit.edu/cluster
License: LGPL

Description:

StarCluster is a utility for creating and managing computing clusters hosted on Amazon's Elastic Compute Cloud (EC2). StarCluster utilizes Amazon's EC2 web service to create and destroy clusters of Linux virtual machines on demand.

All that's needed to create your own cluster(s) on Amazon EC2 is an AWS account and StarCluster. StarCluster features:

  • Simple configuration - with examples ready to go out-of-the-box
  • Create/Manage Clusters - simple start command to automatically launch and configure one or more clusters on EC2
  • Automated Cluster Setup - includes NFS-sharing, Open Grid Scheduler queuing system, Condor, password-less ssh between machines, and more
  • Scientific Computing AMI - comes with an Ubuntu 11.10-based EBS-backed AMI that contains Hadoop, OpenMPI, ATLAS, LAPACK, NumPy, SciPy, IPython, and other useful libraries
  • EBS Volume Sharing - easily NFS-share Amazon Elastic Block Storage (EBS) volumes across a cluster for persistent storage
  • EBS-Backed Clusters - start and stop EBS-backed clusters on EC2
  • Cluster Compute Instances - support for "cluster compute" instance types
  • Expand/Shrink Clusters - scale a cluster by adding or removing nodes
  • Elastic Load Balancing - automatically shrink or expand a cluster based on Open Grid Scheduler queue statistics
  • Plugin Support - allows users to run additional setup routines on the cluster after StarCluster's defaults. Comes with plugins for IPython parallel+notebook, Condor, Hadoop, MPICH2, MySQL cluster, installing Ubuntu packages, and more.

Interested? See the getting started section for more details.

Getting Started:

Install StarCluster using easy_install:

$ easy_install StarCluster

or using pip:

$ pip install StarCluster

or manually:

$ (Download StarCluster from http://star.mit.edu/cluster/downloads.html)
$ tar xvzf starcluster-X.X.X.tar.gz  (where X.X.X is the version number)
$ cd starcluster-X.X.X
$ sudo python setup.py install

After the software has been installed, the next step is to set up the configuration file:

$ starcluster help
StarCluster - (http://star.mit.edu/cluster)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]

!!! ERROR - config file /home/user/.starcluster/config does not exist

Options:
--------
[1] Show the StarCluster config template
[2] Write config template to /home/user/.starcluster/config
[q] Quit

Please enter your selection:

Select the second option by typing 2 and pressing enter. This will give you a template to use to create a configuration file containing your AWS credentials, cluster settings, etc. The next step is to customize this file using your favorite text editor:

$ vi ~/.starcluster/config

This file is commented with example "cluster templates". A cluster template defines a set of configuration settings used to start a new cluster. The example config provides a smallcluster template that is ready to go out-of-the-box. However, first, you must fill in your AWS credentials and keypair info:

[aws info]
aws_access_key_id = #your aws access key id here
aws_secret_access_key = #your secret aws access key here
aws_user_id = #your 12-digit aws user id here

The next step is to fill in your keypair information. If you don't already have a keypair you can create one from StarCluster using:

$ starcluster createkey mykey -o ~/.ssh/mykey.rsa

This will create a keypair called mykey on Amazon EC2 and save the private key to ~/.ssh/mykey.rsa. Once you have a key the next step is to fill-in your keypair info in the StarCluster config file:

[key key-name-here]
key_location = /path/to/your/keypair.rsa

For example, the section for the keypair created above using the createkey command would look like:

[key mykey]
key_location = ~/.ssh/mykey.rsa

After defining your keypair in the config, the next step is to update the default cluster template smallcluster with the name of your keypair on EC2:

[cluster smallcluster]
keyname = key-name-here

For example, the smallcluster template would be updated to look like:

[cluster smallcluster]
keyname = mykey

Now that the config file has been set up, we're ready to start using StarCluster. Next we start a cluster named "mycluster" using the default cluster template smallcluster from the example config:

$ starcluster start mycluster

The default_template setting in the [global] section of the config specifies the default cluster template and is automatically set to smallcluster in the example config.
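In the example config this corresponds to the following (only the relevant line of the [global] section is shown):

[global]
default_template = smallcluster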

After the start command completes you should now have a working cluster. You can login to the master node as root by running:

$ starcluster sshmaster mycluster

You can also copy files to/from the cluster using the put and get commands. To copy a file or entire directory from your local computer to the cluster:

$ starcluster put mycluster /path/to/local/file/or/dir /remote/path/

To copy a file or an entire directory from the cluster to your local computer:

$ starcluster get mycluster /path/to/remote/file/or/dir /local/path/

Once you've finished using the cluster and wish to stop paying for it:

$ starcluster terminate mycluster

Have a look at the rest of StarCluster's available commands:

$ starcluster --help

Dependencies:

  • Amazon AWS Account
  • Python 2.6+
  • Boto 2.23.0+
  • Paramiko 1.12.1+
  • WorkerPool 0.9.2
  • Jinja2 2.7
  • decorator 3.4.0+
  • iptools 0.6.1+
  • optcomplete 1.2-devel+
  • PyCrypto 2.5+
  • scp 0.7.1+
  • iso8601 0.1.8+

Learn more...

Watch an ~8 minute screencast @ http://star.mit.edu/cluster

To learn more have a look at the documentation: http://star.mit.edu/cluster/docs/latest

Community

StarCluster has a mailing list for users and developers:

http://star.mit.edu/cluster/mailinglist.html

Join our IRC channel #starcluster on freenode. If you do not have an IRC client you can join the #starcluster channel using your web browser:

http://webchat.freenode.net/?channels=starcluster

Licensing

StarCluster is licensed under the LGPLv3. See COPYING.LESSER (LGPL) and COPYING (GPL) for license details.


starcluster's Issues

add --ssh-status option to listclusters command

From bmabey in another bug:

"I would like to be able to list the clusters and know which nodes are running and which ones are not so I can kill/restart the running nodes that are not responding to SSH, and/or resize my cluster. Right now my only way of finding the strangler node is to manually check each instance by trying to ssh into them."

Idea is to add --ssh-status option that would look something like:

$ starcluster listclusters --ssh-status

.... master i-asdf0232 ec2-asdf-asf.aws.com (SSH: Up)
node001 i-basd222 ec2-fdsa-asf.aws.com (SSH: Down)
....

add new 'config' command for quickly viewing cfg settings

Here's the proposed interface:

List all config sections of a given type:

$ starcluster config list cluster
smallcluster
mediumcluster
largecluster
$ starcluster config list plugin
ipcluster
tmux

Show a given section or multiple sections:

$ starcluster config show cluster smallcluster mediumcluster largecluster
[cluster smallcluster]
cluster_size = 3
node_instance_type = m1.small
node_image_id = ami-small
plugins = ipcluster, tmux

[cluster mediumcluster]
extends = smallcluster
cluster_size = 5
node_instance_type = m1.large
node_image_id = ami-medium

[cluster largecluster]
extends = smallcluster
cluster_size = 10
node_instance_type = cg1.4xlarge
node_image_id = ami-large
$ starcluster config show plugin ipcluster
[plugin ipcluster]
setup_class = starcluster.plugins.ipcluster.IPCluster

optional kwargs in a plugin's __init__ not working

Reported by @minrk (thanks, moving this to a new issue so it's easier to follow)

Can plugins have optional init arguments? One of my plugins builds and installs pyzmq, but I would like users to be able to specify a URL for an egg if they have one. The logical model would be:

class IPythonDev(ClusterSetup):

    def __init__(self, egg_url=None):
        self.egg_url = egg_url

But that doesn't work. Either it has to be a mandatory option in the config file (not a kwarg), or specifying it in the config file is completely ignored (kwarg).

I do have an egg for pyzmq that works, but supporting an egg on Linux is not something I want to be in the business of doing.
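For reference, this is roughly the config usage one would expect to work with an optional kwarg (the plugin name and egg_url setting below are hypothetical; leaving egg_url out should fall back to the None default):

[plugin ipythondev]
setup_class = mypackage.IPythonDev
# hypothetical optional setting mapped to the egg_url kwarg above
egg_url = http://example.com/pyzmq.egg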

cluster_user's home

Feature Request: configure the cluster_user's home on the ephemeral storage drive or an EBS volume

Update orte parallel environment on addnode command

When starting a cluster, a parallel environment called orte is set up with the number of slots equal to the total number of CPU cores. However, if a node is added, SGE does not update the orte slots to reflect the added cores. Is this a bug?
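As a manual workaround in the meantime (assuming the standard SGE qconf tool is available on the master), the orte slot count can be inspected and edited by hand after adding a node:

$ qconf -sp orte    # show the orte parallel environment, including its slots value
$ qconf -mp orte    # open the PE definition in an editor and bump slots to the new core count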

amazon attaches ebs vols to alternate device on HVM instances

StarCluster attaches EBS volumes to /dev/sd* and expects the device to be mountable via the /dev/sd* address. With HVM instances the EBS volume devices show up as /dev/xvd* instead. Need to also check for /dev/xvd* whenever /dev/sd* doesn't exist.
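A minimal sketch of the suggested fallback (a hypothetical helper, not StarCluster's actual code):

import os

def find_block_device(device):
    # Return the device path that actually exists on this instance.
    # On HVM instances a volume attached as e.g. /dev/sdz typically shows
    # up as /dev/xvdz, so fall back to the /dev/xvd* name if needed.
    if os.path.exists(device):
        return device
    alt = device.replace('/dev/sd', '/dev/xvd')
    if os.path.exists(alt):
        return alt
    raise IOError("no such block device: %s (also tried %s)" % (device, alt))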

Failed easy_install on Mac OS X Snow Leopard

I am trying to install StarCluster on Mac OS X Snow Leopard but I get this failure:

$ sudo easy_install StarCluster
Password:
Searching for StarCluster
Best match: StarCluster 0.92rc2
Processing StarCluster-0.92rc2-py2.6.egg
StarCluster 0.92rc2 is already the active version in easy-install.pth
Installing starcluster script to /usr/local/bin

Using /Library/Python/2.6/site-packages/StarCluster-0.92rc2-py2.6.egg
Processing dependencies for StarCluster
Searching for pycrypto>=2.1
Reading http://pypi.python.org/simple/pycrypto/
Reading http://www.pycrypto.org/
Reading http://pycrypto.sourceforge.net
Reading http://www.amk.ca/python/code/crypto
Best match: pycrypto 2.3
Downloading http://ftp.dlitz.net/pub/dlitz/crypto/pycrypto/pycrypto-2.3.tar.gz
Processing pycrypto-2.3.tar.gz
Running pycrypto-2.3/setup.py -q bdist_egg --dist-dir /tmp/easy_install-hon51f/pycrypto-2.3/egg-dist-tmp-E0S9Tl
warning: GMP library not found; Not building Crypto.PublicKey._fastmath.
/usr/libexec/gcc/powerpc-apple-darwin10/4.2.1/as: assembler (/usr/bin/../libexec/gcc/darwin/ppc/as or /usr/bin/../local/libexec/gcc/darwin/ppc/as) for architecture ppc not installed
Installed assemblers are:
/usr/bin/../libexec/gcc/darwin/x86_64/as for architecture x86_64
/usr/bin/../libexec/gcc/darwin/i386/as for architecture i386
src/MD2.c:134: fatal error: error writing to -: Broken pipe
compilation terminated.
lipo: can't open input file: /var/tmp//ccIfnB2z.out (No such file or directory)
error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1

AttributeError when running createvolume

When running starcluster createvolume, this traceback appears at the end, after the volume appears to have been created successfully:

*** WARNING - There are still volume hosts running: i-0de6f06e
*** WARNING - Run 'starcluster terminate volumecreator' to terminate all volume host instances once they're no longer needed
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/StarCluster-0.92.1-py2.7.egg/starcluster/cli.py", line 252, in main
sc.execute(args)
File "/Library/Python/2.7/site-packages/StarCluster-0.92.1-py2.7.egg/starcluster/commands/createvolume.py", line 135, in execute
vc.create(size, zone, name=self.opts.name, tags=self.opts.tags)
File "", line 2, in create
File "/Library/Python/2.7/site-packages/StarCluster-0.92.1-py2.7.egg/starcluster/utils.py", line 86, in wrap_f
res = func(*arg, **kargs)
File "/Library/Python/2.7/site-packages/StarCluster-0.92.1-py2.7.egg/starcluster/volume.py", line 261, in create
self.log.error("failed to create new volume")
AttributeError: 'VolumeCreator' object has no attribute 'log'

do not launch the master instance using spot instances by default

When using -b option to launch a cluster with size >1 do not launch the master using a spot instance request; use a flat-rate request instead. If cluster size == 1 then this is OK given that it's only a single machine and no other machines have NFS shares mounted, etc. Probably need to also add an option to force master to be spot... (--spot-master?)

ebsimage: unmount every non root volume before creating new AMI ?

When using starcluster ebsimage on an EBS AMI (an Ubuntu 11.04 HVM AMI) with a mounted EBS-backed volume, the mounted volume is also snapshotted during the process. Furthermore, when using the newly created AMI, the EBS-backed volume cannot be mounted because the mount point is already in use.
(Sorry for the lack of details, I am writing from memory.)

workaround: temporarily remove the line beginning with 'VOLUMES= ' in the cluster definition in the ~/.starcluster/config file before launching an instance to be used to create a new AMI.

suggestion: ebsimage should unmount every mounted non root volume before creating the new AMI image.
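Another manual workaround is to unmount the non-root volume on the image host before running ebsimage (the mount path below is hypothetical; use whatever MOUNT_PATH your config specifies):

$ starcluster sshmaster imagehost
root@master# umount /mydata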

Progress Bar divide by zero error

Hi,
I've been using the latest version of StarCluster from GitHub (for the add/remove node commands). I've run into the following error several times when starting a cluster. Based on the stack trace, the error appears to be in the new progress bar code, which does not handle the case where the denominator may be 0 initially.

PID: 39151 config.py:490 - DEBUG - Loading config
PID: 39151 config.py:107 - DEBUG - Loading file: /Users/bmabey/.starcluster/config
PID: 39151 config.py:490 - DEBUG - Loading config
PID: 39151 config.py:107 - DEBUG - Loading file: /Users/bmabey/.starcluster/config
PID: 39151 awsutils.py:54 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'path': '/', 'region': None, 'port': None, 'is_secure': True}
PID: 39151 start.py:167 - INFO - Using default cluster template: testcluster
PID: 39151 cluster.py:420 - DEBUG - plugin args = ['self', 'pkg_to_install']
PID: 39151 cluster.py:421 - DEBUG - plugin varargs = None
PID: 39151 cluster.py:422 - DEBUG - plugin keywords = None
PID: 39151 cluster.py:423 - DEBUG - plugin defaults = None
PID: 39151 cluster.py:440 - DEBUG - config_args = ['htop']
PID: 39151 cluster.py:441 - DEBUG - config_kwargs = {}
PID: 39151 ubuntu.py:7 - DEBUG - pkg_to_install = htop
PID: 39151 cluster.py:420 - DEBUG - plugin args = ['self']
PID: 39151 cluster.py:421 - DEBUG - plugin varargs = None
PID: 39151 cluster.py:422 - DEBUG - plugin keywords = None
PID: 39151 cluster.py:423 - DEBUG - plugin defaults = None
PID: 39151 cluster.py:440 - DEBUG - config_args = []
PID: 39151 cluster.py:441 - DEBUG - config_kwargs = {}
PID: 39151 cluster.py:1302 - INFO - Validating cluster template settings...
PID: 39151 cluster.py:1324 - INFO - Cluster template settings are valid
PID: 39151 cluster.py:1200 - INFO - Starting cluster...
PID: 39151 cluster.py:949 - INFO - Launching a 1-node cluster...
PID: 39151 cluster.py:892 - INFO - Launching master (ami: ami-a5c42dcc, type: m1.large)
PID: 39151 awsutils.py:175 - INFO - Creating security group @sc-testcluster...
PID: 39151 cluster.py:1023 - INFO - Waiting for cluster to come up... (updating every 30s)
PID: 39151 cluster.py:632 - DEBUG - existing nodes: {}
PID: 39151 cluster.py:649 - DEBUG - returning self._nodes = []
PID: 39151 cluster.py:1039 - INFO - Waiting for all nodes to be in a 'running' state...
PID: 39151 cli.py:171 - DEBUG - Traceback (most recent call last):
  File "build/bdist.macosx-10.6-universal/egg/starcluster/cli.py", line 152, in main
    sc.execute(args)
  File "build/bdist.macosx-10.6-universal/egg/starcluster/commands/start.py", line 195, in execute
    scluster.start(create=create, create_only=create_only, validate=False)
  File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 1191, in start
    return self._start(create, create_only)
  File "build/bdist.macosx-10.6-universal/egg/starcluster/utils.py", line 69, in wrap_f
    res = func(*arg, **kargs)
  File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 1209, in _start
    self._setup_cluster()
  File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 1223, in _setup_cluster
    self.wait_for_cluster()
  File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 1041, in wait_for_cluster
    pbar.update(0)
  File "build/bdist.macosx-10.6-universal/egg/starcluster/progressbar.py", line 312, in update
    self.prev_percentage = self.percentage()
  File "build/bdist.macosx-10.6-universal/egg/starcluster/progressbar.py", line 261, in percentage
    return self.currval * 100.0 / self.maxval
ZeroDivisionError: float division

PID: 39151 cli.py:173 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 39151 cli.py:174 - ERROR - Debug file written to: /tmp/starcluster-debug-bmabey.log
PID: 39151 cli.py:175 - ERROR - Look for lines starting with PID: 39151
PID: 39151 cli.py:177 - ERROR - Please submit this file, minus any private information,
PID: 39151 cli.py:178 - ERROR - to [email protected]
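A sketch of the kind of guard that would avoid the crash in percentage() shown above (not necessarily the exact fix that was applied):

def percentage(self):
    # treat an unset or zero maxval as 0% instead of dividing by zero
    if not self.maxval:
        return 0.0
    return self.currval * 100.0 / self.maxval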

Mounting two partitions from the same volume

Hi jtriley,

I have an EBS volume with two partitions (both formatted). If I mount either one alone, StarCluster starts normally. However, when I try to mount both partitions together, I get the following error:

!!! ERROR - Multiple configurations for volume vol-####### specified.
!!! ERROR - Please choose one

In the starcluster config file the entries under EBS volumes for these two partitions are completely different, except for the volume ID. For example:

[volume partition-1]
VOLUME_ID = vol-abc123
MOUNT_PATH = /opt/part1/
PARTITION = 1

[volume partition-2]
VOLUME_ID = vol-abc123
MOUNT_PATH = /opt/part2/
PARTITION = 2

Is there a flag I'm missing or some error in the way I'm defining these volumes? Thanks a lot for the help!

-brainfood

couchdb

Would it be possible to get couchdb-0.10.X installed on the server node and a ssh tunnel on each node (ssh -f -N -L 5984:localhost:5984 master) ?

Thanks in advance.

Nicolas

Verify existence of keyfiles prior to AMI launch

Apologies if this is fixed in git, but the paths to the keyfiles should be tested prior to the launch of an AMI. I don't know if it's tested prior to launching a cluster, but I know that with createvolume on 0.91 it is not.

Austin
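A sketch of the kind of pre-launch check being requested (hypothetical helper name):

import os

def validate_key_location(key_location):
    # fail fast if the configured private key file does not exist
    path = os.path.expanduser(key_location)
    if not os.path.isfile(path):
        raise ValueError("key_location does not exist: %s" % path)
    return path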

bug with listvolumes

Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.9999-py2.6.egg/starcluster/cli.py", line 160, in main
sc.execute(args)
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.9999-py2.6.egg/starcluster/commands/listvolumes.py", line 41, in execute
self.ec2.list_volumes(**self.options_dict)
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.9999-py2.6.egg/starcluster/awsutils.py", line 1012, in list_volumes
vols.sort(key=lambda x: x.create_time)
AttributeError: 'NoneType' object has no attribute 'sort'
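From the traceback, the volume list can apparently be None when no volumes exist; a sketch of the obvious guard around the failing line (not necessarily the exact fix):

# normalize a missing volume list to an empty list before sorting
vols = vols or []
vols.sort(key=lambda x: x.create_time)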

Bug: qlogin acts weird on master node

executing 'qlogin' on the master node makes it "flicker" on my machine with the following error message:
"id: cannot find name for group ID 20000"

bug in 0.92rc1 setup.py breaks easy_install

From Adam Marsh on the mailing list:

When I run the install routine for the 0.92rc1 version of SC, it aborts with the following error message:
ImportError: No module named starcluster

Full output from the command "sudo easy_install StarCluster":

$ sudo easy_install StarCluster
Searching for StarCluster
Reading http://pypi.python.org/simple/StarCluster/
Reading http://web.mit.edu/starcluster
Best match: StarCluster 0.92rc1
Downloading http://pypi.python.org/packages/source/S/StarCluster/StarCluster-0.92rc1.tar.gz#md5=0e2fadfa011de41bf6a6868bc37c61ed
Processing StarCluster-0.92rc1.tar.gz
Running StarCluster-0.92rc1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-_L48LL/StarCluster-0.92rc1/egg-dist-tmp-ZZxqNG
Traceback (most recent call last):
File "/usr/bin/easy_install", line 9, in
load_entry_point('distribute==0.6.10', 'console_scripts', 'easy_install')()
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 1760, in main
with_ei_usage(lambda:
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 1741, in with_ei_usage
return f()
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 1764, in
distclass=DistributionWithoutHelpCommands, **kw
File "/usr/lib/python2.6/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib/python2.6/distutils/dist.py", line 975, in run_commands
self.run_command(cmd)
File "/usr/lib/python2.6/distutils/dist.py", line 995, in run_command
cmd_obj.run()
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 254, in run
self.easy_install(spec, not self.no_deps)
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 489, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 519, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 698, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 975, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.6/dist-packages/setuptools/command/easy_install.py", line 964, in run_setup
run_setup(setup_script, args)
File "/usr/lib/python2.6/dist-packages/setuptools/sandbox.py", line 29, in run_setup
lambda: execfile(
File "/usr/lib/python2.6/dist-packages/setuptools/sandbox.py", line 70, in run
return func()
File "/usr/lib/python2.6/dist-packages/setuptools/sandbox.py", line 31, in
{'file':setup_script, 'name':'main'}
File "setup.py", line 5, in
ImportError: No module named starcluster

use pygooglechart in place of matplotlib

matplotlib is a fairly heavy dependency to have for producing simple line graphs. Since you must be online anyway when using the spothistory/loadbalance -p commands, it seems logical to use Google Charts instead, which only requires the standard Python library and an internet connection.

an interesting idea for the spothistory/loadbalance -p commands is to launch a simple local static web server that serves dashboard-like pages containing the plots. Could also use the webbrowser module to open the user's web browser to view the dashboard pages served on localhost...
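A rough sketch of what a pygooglechart-based plot could look like (assuming a plain list of spot prices; the chart size here is arbitrary):

from pygooglechart import SimpleLineChart

def spot_history_chart_url(prices, width=600, height=300):
    # build a Google Chart URL plotting the spot prices as a line graph
    chart = SimpleLineChart(width, height)
    chart.add_data(prices)
    return chart.get_url()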

clarify EBS spot instances cannot be stopped

EBS spot instances can only be terminated, not stopped, but the message when using the stop command is confusing for a newbie like me.
Could that be changed to a WARNING that the cluster will be terminated (when using spot instances)?
It would also be nice to add something about it in the documentation.

release 0.92 version!

Need to release 0.92 version ASAP. Stop adding features, update docs, implement error handling for clusters started from 0.91.2 and release!

Specified configuration files are not used for: starcluster start

Hello,

I've spotted a problem when starting a new cluster from the command line while specifying a configuration file. StarCluster doesn't use the specified configuration file. For example:

$ starcluster -c /user/other-location/config start clustername

or

$ starcluster --config=/user/other-location/config start clustername

Results in:

starcluster.exception.ConfigNotFound: config file /user/.starcluster/config does not exist

The same approach works with other commands, such as listclusters.

I spotted in the source that the start command adds another -c mapping to the action options (-c or --cluster-template) for specifying a cluster template. Could this be overwriting the global options somehow?

Many thanks!

~ Matt @ AWS

try/catch until successful when fetching user_data

From the mailing list:

There are still a couple of issues that I'd like your thoughts on. First is
that we are still seeing occasional failures due to timing / eventual
consistency
of adding a node. Here are the relevant lines from the log file:

PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes
list
PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance
ID 'i-eb030185' does not exist

Does StarCluster return an error code when this happens? I have looked at
the code, but not studied it enough to know for sure. When we see
starcluster
return a non zero, we terminate and then restart the cluster. Is this
what you
would recommend?

This is a tricky bug that's hard to debug and fix. The root cause is that StarCluster creates the instances and then immediately attempts to perform an operation that retrieves the instance's user_data (the ec2-get-instance-attribute API call). In some cases EC2's infrastructure isn't quite 'ready' yet for the call, given that it's performed almost instantaneously by StarCluster in order to fetch metadata assigned to the instance at launch time (such as the node's alias). This metadata is then used to tag the instances using the new tags & filters API[1], which allows faster and more advanced querying of the nodes. The majority of the time the call works without issue, especially on smaller clusters; however, every now and again, as you've seen, the call is made too soon and this error is produced.

A simple fix would be to try/catch the call, sleep if unsuccessful, and keep trying until it succeeds. That should reliably fix the issue once and for all (fingers crossed).
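A sketch of that retry approach using boto's get_instance_attribute call (the interval and retry count here are arbitrary choices):

import time

def get_user_data(conn, instance_id, interval=5.0, max_tries=12):
    # retry while EC2's eventual consistency still reports the instance as unknown
    for attempt in range(max_tries):
        try:
            return conn.get_instance_attribute(instance_id, 'userData')
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(interval)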

add ability to 'include' files in config

Need the ability in the config to perform includes, e.g.:

~/.starcluster/config:

#include credentials.txt

[cluster smallcluster]
cluster_size = 3
....

~/.starcluster/credentials.txt:

[aws info]
aws_access_key_id = #access key id
aws_secret_access_key = #access key
aws_user_id = #user id

[key mykey]
key_location=~/.ssh/ec2.rsa

This would allow users to share common portions of the config and modularize credentials, templates, etc into separate files as necessary. Includes should also support including files from a web address.

s3image hangs at "Bundling image file..."

Started up a cluster with command:
$ starcluster start -o -s 1 -i m1.small -n ami-8cf913e5 imagehost

Logged in with command:
$ starcluster sshmaster imagehost

Generate an AMI with s3image with command:
$ starcluster -d s3image i-XXXXXXX my-new-image

stdout will get to "Bundling image file..." and then hang. When inspecting the master node, the ec2-bundle-vol process has ended and all the my-new-image.part files are created. The next command is not executed. Any help would be appreciated.

notes:

  • the ssh client has NOT timed out as ssh process is still visible on the master.
  • as far as I can tell, the next commands should be ec2-upload-bundle and ec2-register. these commands are not executed.
  • running 0.92rc2
  • creating an ebsimage works without a problem.

listclusters command could show configured clusters

The listclusters command currently only shows running clusters; it could show both running and configured clusters so that there is an easy way to remember which cluster names you have defined when you go to launch a cluster.

In fact, the start command printout could show the user their list of configured clusters when explaining that a cluster tag is a required argument for starting a cluster.

This is very minor and a feature request rather than an issue.

ensure ephemeral storage is available by overriding block device mapping when running instances

From the StarCluster mailing list:

I would like to use the instance storage (2x 840G disks) that comes
with the cluster compute instance type. This is again for an Ubuntu
11.04 hvm AMI.

The scratch space is not mounted on /mnt:
root@master:/mnt/sgeadmin# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 7.9G 4.0G 3.5G 54% /
none 12G 144K 12G 1% /dev
none 12G 0 12G 0% /dev/shm
none 12G 60K 12G 1% /var/run
none 12G 0 12G 0% /var/lock
/dev/xvdz1 20G 618M 19G 4% /home

The block devices are indeed not available on my AMI:

ec2-describe-images ami-ab5197c2 -K ~/.starcluster/pk-....pem -C
~/.starcluster/cert-....pem
IMAGE ami-ab5197c2 11-04-hvm available private x86_64 machine ebs hvm
BLOCKDEVICEMAPPING /dev/sda1 snap-8f055aee 8

Nor are they on Alestic's public one:

ec2-describe-images ami-1cad5275 -K ~/.starcluster/pk-....pem -C
~/.starcluster/cert-....pem
IMAGE ami-1cad5275 099720109477/hvm/ubuntu-images/ubuntu-natty-11.04-amd64-server-20110426 099720109477 available public x86_64 machine
ebs hvm
BLOCKDEVICEMAPPING /dev/sda1 snap-b1ad2dde 8

Would it be possible to add the local instance storage during the AMI
creation with ebsimage?
http://docs.amazonwebservices.com/AWSEC2/2011-01-01/UserGuide/index.html?Using_AddingDefaultLocalInstanceStorageToAMI.html
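For reference, a sketch of forcing the ephemeral drives into the mapping at launch time with boto (the device names and instance type below are just examples):

from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

def ephemeral_block_device_map():
    # expose the first two instance-store (ephemeral) drives even if the
    # AMI's own block device mapping omits them
    bdm = BlockDeviceMapping()
    bdm['/dev/sdb'] = BlockDeviceType(ephemeral_name='ephemeral0')
    bdm['/dev/sdc'] = BlockDeviceType(ephemeral_name='ephemeral1')
    return bdm

# e.g.: conn.run_instances(image_id, instance_type='cc1.4xlarge',
#                          block_device_map=ephemeral_block_device_map())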

fix InvalidGroup.InUse error when deleting a security group on terminate

% starcluster terminate flatmastertest2
StarCluster - (http://web.mit.edu/starcluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]

Terminate cluster flatmastertest2 (y/n)? y

Running plugin ipcluster
Terminating node: master (i-9110aef0)
Removing @sc-flatmastertest2 security group
!!! ERROR - InvalidGroup.InUse: There are active instances using security group '@sc-flatmastertest2'
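One plausible fix is to wait for the terminated instances to drop out of the group and retry the delete; a sketch using boto (retry parameters are arbitrary):

import time

def delete_security_group(conn, name, interval=5.0, max_tries=24):
    # retry while terminated instances are still registered against the group
    for attempt in range(max_tries):
        try:
            return conn.delete_security_group(name)
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(interval)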

Hadoop

Would it be possible to have Hadoop installed on a StarCluster? Something like what Cloudera has would be wonderful ;-)

Thanks in advance.

Nicolas
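For what it's worth, the plugin mechanism mentioned in the feature list is the natural home for this; enabling a Hadoop plugin in the config would look roughly like the following (the setup_class path shown here is an assumption):

[plugin hadoop]
setup_class = starcluster.plugins.hadoop.Hadoop

[cluster smallcluster]
plugins = hadoop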

listclusters command blows up when a cluster is being created

When trying to bring up a large cluster I almost always have certain nodes that never come up, or they do come up but never respond to SSH. What that means is that StarCluster will wait indefinitely for the straggler while the other nodes sit idle:

>>> Waiting for SSH to come up on all nodes...
39/40 |\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\   |  97%   

I know you are aware of this, and so that's not what I'm reporting. I'm reporting the fact that when I try to list my clusters while in this state, it blows up. This is the error I get:

$ starcluster listclusters
StarCluster - (http://web.mit.edu/starcluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]

cli.py:156 - ERROR - failed to load cluster receipt: Incorrect padding

I would like to be able to list the clusters and know which nodes are running and which ones are not so I can kill/restart the running nodes that are not responding to SSH, and/or resize my cluster. Right now my only way of finding the straggler node is to manually check each instance by trying to ssh into them.
