To set up TensorFlow for the given network configuration (a cluster of 4 machines), do the following:
- Clone this repository on your local machine.
- On your local machine, add an alias for each cluster machine to the ~/.ssh/config file:
  Host nodei
      HostName <nodei_IP>
  where nodei is node0, node1, ...
- Run init.sh username, where username is your CloudLab username. The script updates the system and installs the required packages.
In our case the ~/.ssh/config file will be something like:
Host node0
HostName node0_id_code.cloudlab.us
Host node1
HostName node1_id_code.cloudlab.us
Host node2
HostName node2_id_code.wisc.cloudlab.us
Host node3
HostName node3_id_code.wisc.cloudlab.us
Note: Do not use the Utah cluster; it seems to have problems with TensorFlow and Python 3.
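The alias entries can also be generated instead of typed by hand. The following is a minimal sketch, not part of this repository: `print_ssh_config` is a hypothetical helper that takes the node hostnames (or IPs) in order and prints one Host block per node, numbering the aliases from node0.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print one Host block
# per hostname passed as an argument, numbering the aliases from node0.
print_ssh_config() {
    local i=0
    local host
    for host in "$@"; do
        printf 'Host node%d\n    HostName %s\n' "$i" "$host"
        i=$((i + 1))
    done
}

# Example usage: append the generated entries to the local SSH configuration.
# Replace the placeholders with the hostnames of your own cluster machines.
# print_ssh_config <node0_IP> <node1_IP> <node2_IP> <node3_IP> >> ~/.ssh/config
```

Review the printed entries before appending them, since the helper does not check whether an alias already exists in ~/.ssh/config.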
To run a given experiment, some useful scripts are provided:
- To run the logistic regression model in asynchronous mode do the following:
run-scripts/run-task1-cluster.sh username
- To run the logistic regression model in synchronous mode do the following:
run-scripts/run-task2.sh username
- To run AlexNet in distributed mode do the following:
cd alexnet/alexnet && ./startservers.sh username mode
(where mode is one of single, cluster or cluster2)
The output of each run is logged locally. To profile an experiment with dstat, append the '-profile' flag to any of the commands above. The profiling output is stored in an appropriate directory.
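As an illustration of appending the flag, here is a small sketch. `run_profiled` and the `DRY_RUN` variable are hypothetical and not provided by this repository. The wrapper assumes '-profile' goes at the end of the argument list, as described above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repository): run a command with
# the '-profile' flag appended as its last argument.
# Set DRY_RUN=1 to print the resulting command instead of executing it.
run_profiled() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@" -profile
    else
        "$@" -profile
    fi
}

# Example usage (username is your CloudLab username):
# run_profiled run-scripts/run-task1-cluster.sh username
# run_profiled run-scripts/run-task2.sh username
# cd alexnet/alexnet && run_profiled ./startservers.sh username cluster
```

Running with DRY_RUN=1 first is a cheap way to confirm the exact command line before starting a run on the cluster.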