
mpich2-yarn's Introduction

mpich-yarn

#Introduction

MPICH-yarn is an application that runs on Hadoop YARN and enables MPI programs to run on Hadoop YARN clusters.

##Prerequisite

As prerequisites, you need to ensure that:

  1. Hadoop YARN and HDFS have been deployed on the cluster.
  2. mpich-3.1.2 has been installed on each node in the cluster and its ./bin folder has been added to PATH.

This version of mpich-yarn uses MPICH-3.1.2 as its MPI implementation and ssh as the communication daemon.

##Recommended Configuration

  1. Ubuntu 12.04 LTS
  2. hadoop 2.4.1
  3. gcc 4.6.3
  4. jdk 1.7.0_25
  5. Apache Maven 3.2.3

#Compile

To compile MPICH-yarn, you first need to have Maven installed. Then run the following command in the source folder:

mvn clean package -Dmaven.test.skip=true

Make sure you have an Internet connection, as Maven needs to download plugins from the Maven repository; this may take a few minutes.

After this command finishes, you will find mpich2-yarn-1.0-SNAPSHOT.jar in the ./target folder. This is the application that runs on YARN to execute MPI programs.

#Configuration

There are many tutorials on the Internet about configuring Hadoop. However, getting YARN to work well with mpich2-yarn can be troublesome. To save you time, here is a sample configuration that has run successfully on our cluster, for your reference.

yarn-site.xml

<configuration>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${YOUR_HOST_IP_OR_NAME}:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${YOUR_HOST_IP_OR_NAME}:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>${YOUR_HOST_IP_OR_NAME}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${YOUR_HOST_IP_OR_NAME}:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${YOUR_HOST_IP_OR_NAME}:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${YOUR_HOST_IP_OR_NAME}:8088</value>
  </property>
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>16</value>
</property>
<property>
	<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
	<name>yarn.application.classpath</name>
	<value>
		/home/hadoop/hadoop-2.4.1/etc/hadoop,
		/home/hadoop/hadoop-2.4.1/share/hadoop/common/*,
		/home/hadoop/hadoop-2.4.1/share/hadoop/hdfs/*,
		/home/hadoop/hadoop-2.4.1/share/hadoop/yarn/*,
		/home/hadoop/hadoop-2.4.1/share/hadoop/common/lib/*,
		/home/hadoop/hadoop-2.4.1/share/hadoop/hdfs/lib/*,
		/home/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib/*
	</value>
</property>
</configuration>

mpi-site.conf

<configuration>
	<property>
		<name>yarn.mpi.scratch.dir</name>
		<value></value>
		<description>
			The HDFS address that stores temporary file like:
			hdfs://sandking04:9000/home/hadoop/mpi-tmp
		</description>
	</property>
	<property>
		<name>yarn.mpi.ssh.authorizedkeys.path</name>
		<value>/home/hadoop/.ssh/authorized_keys</value>
		<description>
			MPICH-YARN will create a temporary RSA key pair for
			password-less login and automatically configure it. 
			All of your hosts should enable public_key login.
		</description>
	</property>
</configuration> 

#Submit Jobs

CPI

On the client nodes:

mpicc -o cpi cpi.c
hadoop jar mpich2-yarn-1.0-SNAPSHOT.jar -a cpi -M 1024 -m 1024 -n 2
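cpi.c is not included in mpich2-yarn; it is the pi-estimation example that ships with MPICH (examples/cpi.c in the MPICH source tree). If you do not have it at hand, a minimal stand-in that compiles with the same mpicc command might look like the sketch below (an approximation for illustration, not the original MPICH source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const int n = 100000;              /* number of integration intervals */
    int rank, size, i;
    double h, sum, x, local, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* midpoint-rule integration of 4/(1+x^2) over [0,1], split across ranks */
    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    local = h * sum;

    /* combine the partial sums on rank 0 and print the estimate of pi */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}

Compile it with the same mpicc command shown above and submit the resulting binary with -a cpi.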

Hello world

hadoop jar mpich2-yarn-1.0-SNAPSHOT.jar -a hellow -M 1024 -m 1024 -n 2

PLDA

svn checkout http://plda.googlecode.com/svn/trunk/ plda  # Prepare source code
cd plda
make  # call mpicc to compile
cd ..

Put the input data into HDFS (note: there is test data in the PLDA source code directory):

hadoop fs -mkdir /group/dc/zhuoluo.yzl/plda_input
hadoop fs -put plda/testdata/test_data.txt /group/dc/zhuoluo.yzl/plda_input/
hadoop jar mpich2-yarn-1.0-SNAPSHOT.jar -a plda/mpi_lda -M 1024 -m 1024 -n 2\
 -o "--num_topics 2 --alpha 0.1 --beta 0.01 --training_data_file MPIFILE1 --model_file MPIOUTFILE1 --total_iterations 150"\
 -DMPIFILE1=/group/dc/zhuoluo.yzl/plda_input -SMPIFILE1=true -OMPIOUTFILE1=/group/dc/zhuoluo.yzl/lda_model_output.txt -ppc 2

mpich2-yarn's People

Contributors

fredfsh, liangry, stevenybw, zhuoluoy


mpich2-yarn's Issues

how to pass environment variable

#include <mpi.h>
#include <stdio.h>

#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // get env
    const char* DONGZE = std::getenv("DONGZE");
    if (DONGZE != nullptr) {
        printf("get env DONGZE = %s\n", DONGZE);
    } else {
        printf("env not found\n");
    }

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);
    MPI_Finalize();
}

here is my command:
hadoop jar ./mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a mpi -M 2048 -m 1024 -n 5

but the variable DONGZE is null.
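One way to see which variables actually reach the MPI processes is to have every rank dump its whole environment through the POSIX environ pointer. This is only a generic MPI debugging aid, not mpich2-yarn code:

#include <mpi.h>
#include <stdio.h>

extern char **environ;   /* POSIX: the environment of the current process */

int main(int argc, char **argv) {
    int rank;
    char **e;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* print every environment variable visible to this rank */
    for (e = environ; *e != NULL; ++e)
        printf("rank %d: %s\n", rank, *e);

    MPI_Finalize();
    return 0;
}

Comparing this output with the environment on the submitting machine shows whether DONGZE is dropped during container launch or was never forwarded in the first place.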

Application can not run on Hadoop 0.23.3

  1. The YARN API Container class does not support the equals method, so it should not be used as a HashMap key; use a String instead.
  2. When the client is given no arguments, print the usage message instead of running directly.
  3. The application name should contain the file name only, not the full path.
  4. mpiexec should specify the passphrase.

Add a configuration file

Like Hadoop and Hive...
We can add configuration files like "mpich2-default.xml" and "mpich2-site.xml" to make it configurable.
A configuration file will eliminate all the hard coding.

Changing the YARN version in pom.xml to 2.3.0 causes a compilation error

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /search/mpich2-yarn/src/main/java/org/apache/hadoop/yarn/mpi/util/Utilities.java:[513,32] cannot find symbol
symbol:   variable RM_HOSTNAME
location: class org.apache.hadoop.yarn.mpi.MPIConfiguration
[ERROR] /search/mpich2-yarn/src/main/java/org/apache/hadoop/yarn/mpi/util/Utilities.java:[513,74] cannot find symbol
symbol:   variable RM_HOSTNAME
location: class org.apache.hadoop.yarn.mpi.MPIConfiguration
[INFO] 2 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.575 s
[INFO] Finished at: 2014-10-11T11:30:37+08:00
[INFO] Final Memory: 19M/301M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mpich2-yarn: Compilation failure: Compilation failure:
[ERROR] /search/mpich2-yarn/src/main/java/org/apache/hadoop/yarn/mpi/util/Utilities.java:[513,32] cannot find symbol
[ERROR] symbol:   variable RM_HOSTNAME
[ERROR] location: class org.apache.hadoop.yarn.mpi.MPIConfiguration
[ERROR] /search/mpich2-yarn/src/main/java/org/apache/hadoop/yarn/mpi/util/Utilities.java:[513,74] cannot find symbol
[ERROR] symbol:   variable RM_HOSTNAME
[ERROR] location: class org.apache.hadoop.yarn.mpi.MPIConfiguration
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

testMessageEfficiency(org.apache.hadoop.yarn.mpi.util.TestStringEfficiency) error

Can anyone help me with the following error?

[hadoop@linux01 mpich2-yarn-master]$ mvn clean package
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.hadoop.yarn.mpi:mpich2-yarn:jar:1.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 54, column 12
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building mpich2-yarn 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ mpich2-yarn ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ mpich2-yarn ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ mpich2-yarn ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 36 source files to /ifs/home/hadoop/tmp/mpich2-yarn-master/target/classes
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPINMAsyncHandler.java:[17,12] unmappable character for encoding UTF-8
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPINMAsyncHandler.java:[17,14] unmappable character for encoding UTF-8
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPINMAsyncHandler.java:[17,15] unmappable character for encoding UTF-8
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPIAMRMAsyncHandler.java:[20,12] unmappable character for encoding UTF-8
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPIAMRMAsyncHandler.java:[20,14] unmappable character for encoding UTF-8
[WARNING] /ifs/home/hadoop/tmp/mpich2-yarn-master/src/main/java/org/apache/hadoop/yarn/mpi/server/handler/MPIAMRMAsyncHandler.java:[20,15] unmappable character for encoding UTF-8
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ mpich2-yarn ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /ifs/home/hadoop/tmp/mpich2-yarn-master/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mpich2-yarn ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 5 source files to /ifs/home/hadoop/tmp/mpich2-yarn-master/target/test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ mpich2-yarn ---
[INFO] Surefire report directory: /ifs/home/hadoop/tmp/mpich2-yarn-master/target/surefire-reports


T E S T S

Running org.apache.hadoop.yarn.mpi.util.TestUtilities
log4j:WARN No appenders could be found for logger (org.apache.hadoop.yarn.mpi.util.TestUtilities).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Number0 Phrase is: XwemlFQjv1iIAjaC
Number5 Phrase is: dLR6Ym7usaKoNNDD
Number25 Phrase is: QDereGDnl3YgNOaL
Number125 Phrase is: 7I6XvRhsJR0XGKrs
Number625 Phrase is: k3W6Jzg0irarfwPm
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.473 sec
Running org.apache.hadoop.yarn.mpi.util.TestClient
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.401 sec
Running org.apache.hadoop.yarn.mpi.util.TestStringEfficiency
This test case will prove String.format() is memory friendly.
String.format() time: 1307, memory:4386400
Operator '+' time: 114, memory:269696
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.514 sec <<< FAILURE!
testMessageEfficiency(org.apache.hadoop.yarn.mpi.util.TestStringEfficiency) Time elapsed: 1.513 sec <<< FAILURE!
java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at org.apache.hadoop.yarn.mpi.util.TestStringEfficiency.testMessageEfficiency(TestStringEfficiency.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running org.apache.hadoop.yarn.mpi.server.TestMPDListener
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.185 sec
Running org.apache.hadoop.yarn.mpi.server.TestMPIClientService
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec

Results :

Failed tests: testMessageEfficiency(org.apache.hadoop.yarn.mpi.util.TestStringEfficiency)

Tests run: 9, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22.096 s
[INFO] Finished at: 2014-11-07T07:43:39+08:00
[INFO] Final Memory: 18M/51M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project mpich2-yarn: There are test failures.
[ERROR]
[ERROR] Please refer to /ifs/home/hadoop/tmp/mpich2-yarn-master/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[hadoop@linux01 mpich2-yarn-master]$

Too many logs

Log messages like the following occur many times; we should reduce the logging.

12/10/10 09:00:34 INFO server.Utilities: Sending request to RM for containers, requestedSet=0, releasedSet=0, progress=0.0
12/10/10 09:00:34 INFO server.ApplicationMaster: Current available resources in the cluster memory: 0
12/10/10 09:00:34 INFO server.ApplicationMaster: Got response from RM for container ask, completedCnt=0
12/10/10 09:00:34 INFO server.ApplicationMaster: Current application state: loop=0, appDone=false, total=5, completed=0, failed=0
12/10/10 09:00:35 INFO server.ApplicationMaster: Sending empty request to RM, to let RM know we are alive
12/10/10 09:00:35 INFO server.Utilities: Sending request to RM for containers, requestedSet=0, releasedSet=0, progress=0.0
12/10/10 09:00:35 INFO server.ApplicationMaster: Current available resources in the cluster memory: 0
12/10/10 09:00:35 INFO server.ApplicationMaster: Got response from RM for container ask, completedCnt=0
12/10/10 09:00:35 INFO server.ApplicationMaster: Current application state: loop=1, appDone=false, total=5, completed=0, failed=0
12/10/10 09:00:36 INFO server.ApplicationMaster: Sending empty request to RM, to let RM know we are alive
12/10/10 09:00:36 INFO server.Utilities: Sending request to RM for containers, requestedSet=0, releasedSet=0, progress=0.0
12/10/10 09:00:36 INFO server.ApplicationMaster: Current available resources in the cluster memory: 0
12/10/10 09:00:36 INFO server.ApplicationMaster: Got response from RM for container ask, completedCnt=0
12/10/10 09:00:36 INFO server.ApplicationMaster: Current application state: loop=2, appDone=false, total=5, completed=0, failed=0
12/10/10 09:00:37 INFO server.ApplicationMaster: Sending empty request to RM, to let RM know we are alive
12/10/10 09:00:37 INFO server.Utilities: Sending request to RM for containers, requestedSet=0, releasedSet=0, progress=0.0
12/10/10 09:00:37 INFO server.ApplicationMaster: Current available resources in the cluster memory: 0
12/10/10 09:00:37 INFO server.ApplicationMaster: Got response from RM for container ask, completedCnt=0
12/10/10 09:00:37 INFO server.ApplicationMaster: Current application state: loop=3, appDone=false, total=5, completed=0, failed=0
12/10/10 09:00:38 INFO server.ApplicationMaster: Sending empty request to RM, to let RM know we are alive
12/10/10 09:00:38 INFO server.Utilities: Sending request to RM for containers, requestedSet=0, releasedSet=0, progress=0.0
12/10/10 09:00:38 INFO server.ApplicationMaster: Current available resources in the cluster memory: 0
12/10/10 09:00:38 INFO server.ApplicationMaster: Got response from RM for container ask, completedCnt=0
12/10/10 09:00:38 INFO server.ApplicationMaster: Current application state: loop=4, appDone=false, total=5, completed=0, failed=0

License request

@fredfsh @stevenybw @liangry
Please put a license on this project. There are about 35 forks of this project so far. An explicit license will let users know whether or not they may use the code from this project.

Fix hard-coded values

  1. Create a configuration file...
  2. Fix the hard-coded values and make them configurable

Host key verification failed

Hi,
I am currently trying to get mpich2-yarn running on AWS EMR. I installed mpich-3.1.2 on all machines and compiled it on the master. Executing this:
hadoop jar mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a cpi -M 1024 -m 1024 -n 2
I get the following error:
14/09/18 16:20:40 INFO util.Utilities: Connecting to ApplicationMaster at ip-172-31-43-192.ec2.internal/172.31.43.192:54288
14/09/18 16:20:40 INFO client.Client: Initializing ApplicationMaster
14/09/18 16:20:48 INFO client.Client: all containers are launched successfully, executing mpiexec...
14/09/18 16:20:48 INFO client.Client: [stderr] Host key verification failed.
Any idea how to fix that? Do I have to enable key-less ssh login to all nodes? I thought this would be handled via YARN.

Thanks a lot Markus

InterruptedException occurs even when the job succeeds

The following exception is observed with cgroups enabled, even when the job finishes successfully.

java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:287)

RM is not recognized

14/10/12 20:31:57 INFO util.Utilities: Checking some environment variable is properly set.
14/10/12 20:31:57 INFO util.Utilities: HADOOP_CONF_DIR=/etc/hadoop/conf
14/10/12 20:31:57 INFO util.Utilities: YARN_CONF_DIR=/etc/hadoop/conf/
14/10/12 20:31:57 INFO util.Utilities: PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/sbin:/root/bin:/search/apps/mpich2/bin:/search/apps/mpich2/bin
14/10/12 20:31:57 INFO util.Utilities: Checking conf is correct
14/10/12 20:31:57 INFO util.Utilities: yarn.resourcemanager.address=0.0.0.0:8032
14/10/12 20:31:57 INFO util.Utilities: yarn.resourcemanager.scheduler.address=0.0.0.0:8030
14/10/12 20:31:57 INFO util.Utilities: 0.0.0.0:8032=null
14/10/12 20:31:57 INFO util.Utilities: 0.0.0.0:8030=null
14/10/12 20:31:57 INFO util.Utilities: yarn.mpi.container.allocator=null
14/10/12 20:31:57 INFO util.Utilities: *********************************************************

My environment is Cloudera CDH5 (Hadoop 2.3.0), with:

  1. RM HA
  2. viewFS
  3. Namenode HA

Resource restriction of MPI process

I have read most of the code in this project. First of all, this is a great implementation for enabling MPI programs to run on YARN clusters, but I still have concerns about how the AppMaster monitors the resource usage of the containers, since the MPI processes have no connection to the container processes.

To be specific, the MPI processes are launched in the AppMaster as a Java Process, while the containers only do some dependency-syncing work rather than running the real user logic.

Could you please explain this, or do you have any plans to improve this project?

error in AppMaster.stderr

Hi,

Now I have time again to test mpich2-yarn on Amazon EMR. I have the following problem:

AppMaster.stderr : Total file length is 11170 bytes.

...
14/10/16 11:43:55 INFO server.ApplicationMaster: Setting up container command
14/10/16 11:43:55 INFO server.ApplicationMaster: Executing command: [${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.Container 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr ]
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:55 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1413458972800_0006_01_000003
14/10/16 11:43:55 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:55 INFO handler.MPINMAsyncHandler: onContainerStarted invoked.
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000002 report status INITIALIZED
14/10/16 11:43:57 INFO server.MPDListenerImpl: Try to report status.
14/10/16 11:43:57 INFO server.MPDListenerImpl: container_1413458972800_0006_01_000003 report status ERROR_FINISHED
14/10/16 11:43:58 ERROR server.ApplicationMaster: error occurs while starting MPD
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
org.apache.hadoop.yarn.mpi.util.MPDException: Container container_1413458972800_0006_01_000003 error
at org.apache.hadoop.yarn.mpi.server.MPDListenerImpl.isAllMPDStarted(MPDListenerImpl.java:121)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.run(ApplicationMaster.java:733)
at org.apache.hadoop.yarn.mpi.server.ApplicationMaster.main(ApplicationMaster.java:170)
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Stopping running containers
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-163.ec2.internal:9103
14/10/16 11:43:58 INFO impl.ContainerManagementProtocolProxy: Closing proxy : ip-172-31-42-165.ec2.internal:9103
14/10/16 11:43:58 INFO server.ApplicationMaster: Application completed. Signalling finish to RM
14/10/16 11:43:58 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
14/10/16 11:43:58 INFO server.ApplicationMaster: AMRM, NM two services stopped
14/10/16 11:43:58 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:275)
14/10/16 11:43:58 INFO server.ApplicationMaster: Finalizing.
14/10/16 11:43:58 INFO server.ApplicationMaster: Application Master failed. exiting
...

Any idea how to fix that? Key-less ssh to nodes now works.

Thanks
Markus

All attempts fail with ClassNotFoundException

I installed mpich2-yarn on a small Hadoop cluster with one master and two worker nodes.
All MPI tasks fail with the following exception (log from the attempts):


Error: A JNI error has occured, please check your installation and try again.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/client/api/async/AMRMClientAsync$CallbackHandler
.
.
.
caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.client.api.async.AMRMClientAsync$CallbackHandler
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
.
.

.

I have added the paths of all the necessary files in Hadoop's share directory to the CLASSPATH environment variable on all nodes. Passphrase-less ssh is also set up between the nodes.

Can you help me resolve this problem?
Thanks.

Containers should be assigned to distinct nodes

Each SMPD daemon in a container listens on a port. If multiple containers on the same node listen on the same port, the ports conflict and those containers fail. Currently, mpich2-yarn drops the conflicting containers, but that reduces the number of processors available to the MPI program, so it should be considered a bug.
We will fix this bug by sending the resource manager a list of requests that name specific hosts. The host names can be obtained from the scheduler or from the statistics.
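The conflict itself is ordinary TCP behaviour: only one process on a host can bind a given listening port. The following standalone sketch (illustration only, not mpich2-yarn code; the port number is arbitrary) reproduces the failure mode two co-located daemons would hit:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Try to open a listening socket on the given port, as a per-node daemon would. */
static int bind_daemon_port(unsigned short port) {
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        fprintf(stderr, "bind(%u) failed: %s\n", (unsigned)port, strerror(errno));
        close(fd);
        return -1;   /* a second daemon on the same node fails here with EADDRINUSE */
    }
    listen(fd, 8);
    return fd;
}

int main(void) {
    int first  = bind_daemon_port(12345);  /* first "daemon" owns the port */
    int second = bind_daemon_port(12345);  /* second one fails: address already in use */
    printf("first=%d second=%d\n", first, second);
    if (first >= 0)
        close(first);
    return 0;
}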

Allocate containers of different memory sizes

Currently, YARN supports memory-based resource allocation only, and we launch a daemon for each container. Because each daemon listens on a port, the daemons must be placed on different machines. We need to make the daemon fork multiple MPI processes so that it can use all the computing resources of each node.
We also need to allocate containers of different memory sizes, because the number of MPI processes differs from node to node. If a node runs multiple MPI processes, the container for those processes should occupy a corresponding multiple of the memory.
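A rough sketch of the "daemon forks multiple MPI processes" idea (illustration only, not project code): a per-node parent process forks N workers and waits for all of them, instead of requiring one container, and hence one listening daemon, per MPI process.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    /* number of MPI worker processes this node's daemon should run */
    int nprocs = (argc > 1) ? atoi(argv[1]) : 4;
    int i;

    for (i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: a real daemon would exec the MPI binary here */
            printf("worker %d (pid %d) started\n", i, (int)getpid());
            _exit(0);
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }

    /* parent: wait for every worker so the container does not exit early */
    while (wait(NULL) > 0)
        ;
    return 0;
}

A container whose daemon runs k such workers would then need roughly k times the memory of a single-process container, which is the per-node sizing this issue asks for.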

Test cpi/hellow error

Hello,
When I run the command hadoop jar mpich2-yarn-1.0-SNAPSHOT.jar -a hellow -M 1024 -m 1024 -n 2, I get a java.lang.IllegalArgumentException: Wrong FS: hdfs://hdpnn/group/dc/mpi-tmp/MPICH2-hellow/5/AppMaster.jar, expected: hdfs://... exception.
Looking at the project code, I found that line 471 of the run() method copies an appMasterJar from the local filesystem to HDFS. What is the relationship between this file and the MPI executable?

tips for debugging?

Hi,

Everything stops at "INFO client.Client: Initializing ApplicationMaster".
(cpi works when run with mpiexec on the master.)

[hadoop@ip-172-31-36-126 ~]$ hadoop jar mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a ./cpi -M 1024 -m 1024 -n 5
14/09/24 18:54:39 INFO client.Client: Initializing Client
14/09/24 18:54:39 INFO client.Client: Container number is 5
14/09/24 18:54:39 INFO client.Client: Application Master's jar is /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar
14/09/24 18:54:39 INFO client.Client: Starting Client
14/09/24 18:54:39 INFO util.Utilities: BELOW IS CONFIGUATIONS FROM Client ***
key=TERM; value=xterm-256color
key=HADOOP_PREFIX; value=/home/hadoop
key=PIG_CONF_DIR; value=/home/hadoop/pig/conf
key=JAVA_HOME; value=/usr/java/latest
key=HBASE_HOME; value=/home/hadoop/hbase
key=HIVE_HOME; value=/home/hadoop/hive
key=HADOOP_YARN_HOME; value=/home/hadoop
key=HADOOP_DATANODE_HEAPSIZE; value=384
key=SSH_CLIENT; value=54.240.217.9 19295 22
key=HADOOP_NAMENODE_HEAPSIZE; value=768
key=YARN_HOME; value=/home/hadoop
key=MAIL; value=/var/spool/mail/hadoop
key=HOSTNAME; value=ip-172-31-36-126.ec2.internal
key=PWD; value=/home/hadoop
key=IMPALA_CONF_DIR; value=/home/hadoop/impala/conf
key=LESS_TERMCAP_mb; value=
key=LESS_TERMCAP_me; value=
key=LESS_TERMCAP_md; value=
key=NLSPATH; value=/usr/dt/lib/nls/msg/%L/%N.cat
key=AWS_AUTO_SCALING_HOME; value=/opt/aws/apitools/as
key=HISTSIZE; value=1000
key=HADOOP_COMMON_HOME; value=/home/hadoop
key=PATH; value=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
key=HIVE_CONF_DIR; value=/home/hadoop/hive/conf
key=HADOOP_CLASSPATH; value=:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/
key=HADOOP_CONF_DIR; value=/home/hadoop/conf
key=IMPALA_HOME; value=/home/hadoop/impala
key=AWS_IAM_HOME; value=/opt/aws/apitools/iam
key=SHLVL; value=1
key=XFILESEARCHPATH; value=/usr/dt/app-defaults/%L/Dt
key=AWS_CLOUDWATCH_HOME; value=/opt/aws/apitools/mon
key=EC2_AMITOOL_HOME; value=/opt/aws/amitools/ec2
key=HADOOP_HOME_WARN_SUPPRESS; value=true
key=PIG_CLASSPATH; value=/home/hadoop/pig/lib
key=AWS_RDS_HOME; value=/opt/aws/apitools/rds
key=LESS_TERMCAP_se; value=
key=SSH_TTY; value=/dev/pts/0
key=MAHOUT_CONF_DIR; value=/home/hadoop/mahout/conf
key=HBASE_CONF_DIR; value=/home/hadoop/hbase/conf
key=LOGNAME; value=hadoop
key=YARN_CONF_DIR; value=/home/hadoop/conf
key=AWS_PATH; value=/opt/aws
key=HADOOP_HOME; value=/home/hadoop
key=LD_LIBRARY_PATH; value=/home/hadoop/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib::/home/hadoop/lib/native
key=MALLOC_ARENA_MAX; value=4
key=SSH_CONNECTION; value=54.240.217.9 19295 172.31.36.126 22
key=HADOOP_OPTS; value= -server -Dhadoop.log.dir=/home/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=128m -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
key=MAHOUT_LOG_DIR; value=/mnt/var/log/apps
key=SHELL; value=/bin/bash
key=LC_CTYPE; value=UTF-8
key=CLASSPATH; value=/home/hadoop/conf:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/:/home/hadoop/share/hadoop/hdfs:/home/hadoop/share/hadoop/hdfs/lib/:/home/hadoop/share/hadoop/hdfs/:/home/hadoop/share/hadoop/yarn/lib/:/home/hadoop/share/hadoop/yarn/:/home/hadoop/share/hadoop/mapreduce/lib/:/home/hadoop/share/hadoop/mapreduce/::/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/
key=PIG_HOME; value=/home/hadoop/pig
key=EC2_HOME; value=/opt/aws/apitools/ec2
key=LESS_TERMCAP_ue; value=
key=LC_ALL; value=en_US.UTF-8
key=AWS_ELB_HOME; value=/opt/aws/apitools/elb
key=USER; value=hadoop
key=HADOOP_HDFS_HOME; value=/home/hadoop
key=HADOOP_CLIENT_OPTS; value= -XX:MaxPermSize=128m
key=RUBYOPT; value=rubygems
key=HISTCONTROL; value=ignoredups
key=HOME; value=/home/hadoop
key=MAHOUT_HOME; value=/home/hadoop/mahout
key=LESSOPEN; value=|/usr/bin/lesspipe.sh %s
key=LS_COLORS; value=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:.xspf=38;5;45:
key=LESS_TERMCAP_us; value=
key=LANG; value=en_US.UTF-8
key=HADOOP_MAPRED_HOME; value=/home/hadoop
14/09/24 18:54:39 INFO util.Utilities: Checking some environment variable is properly set.
14/09/24 18:54:39 INFO util.Utilities: HADOOP_CONF_DIR=/home/hadoop/conf
14/09/24 18:54:39 INFO util.Utilities: YARN_CONF_DIR=/home/hadoop/conf
14/09/24 18:54:39 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
14/09/24 18:54:39 INFO util.Utilities: Checking conf is correct
14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.hostname=0.0.0.0
14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.address=172.31.36.126:9022
14/09/24 18:54:39 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.36.126:9024
14/09/24 18:54:39 INFO util.Utilities: 0.0.0.0:8032=null
14/09/24 18:54:39 INFO util.Utilities: 0.0.0.0:8030=null
14/09/24 18:54:39 INFO util.Utilities: yarn.mpi.container.allocator=null
14/09/24 18:54:39 INFO util.Utilities: *
****************************************************
14/09/24 18:54:39 INFO util.Utilities: Connecting to ResourceManager at /172.31.36.126:9022
14/09/24 18:54:39 INFO client.Client: Got new application id=application_1411583461927_0008
14/09/24 18:54:39 INFO client.Client: Got Applicatioin: application_1411583461927_0008
14/09/24 18:54:39 INFO client.Client: Max mem capabililty of resources in this cluster 3072
14/09/24 18:54:39 INFO client.Client: Setting up application submission context for ASM
14/09/24 18:54:39 INFO client.Client: Set Application Id: application_1411583461927_0008
14/09/24 18:54:39 INFO client.Client: Set Application Name: MPICH2-cpi
14/09/24 18:54:39 INFO client.Client: Copy App Master jar from local filesystem and add to local environment
14/09/24 18:54:39 INFO client.Client: Source path: /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar
14/09/24 18:54:39 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/8/AppMaster.jar
14/09/24 18:54:40 INFO client.Client: Copy MPI application from local filesystem to remote.
14/09/24 18:54:40 INFO client.Client: Source path: cpi
14/09/24 18:54:40 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/8/MPIExec
14/09/24 18:54:40 INFO client.Client: Set the environment for the application master and mpi application
14/09/24 18:54:40 INFO client.Client: Trying to generate classpath for app master from current thread's classpath
14/09/24 18:54:40 INFO client.Client: Could not classpath resource from class loader
14/09/24 18:54:40 INFO client.Client: Setting up app master command
14/09/24 18:54:40 INFO client.Client: Completed setting up app master command ${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.ApplicationMaster --container_memory 1024 --num_containers 5 --priority 0 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr
14/09/24 18:54:40 INFO client.Client: Submitting application to ASM
14/09/24 18:54:40 INFO client.Client: Submisstion result: true
14/09/24 18:54:40 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:41 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:42 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:43 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:44 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:45 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:46 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:47 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:48 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411584880804, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:49 INFO client.Client: Got application report from ASM for, appId=8, clientToken=null, appDiagnostics=, appMasterHost=ip-172-31-38-17.ec2.internal, rpcPort:42455, appQueue=default, appMasterRpcPort=42455, appStartTime=1411584880804, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0008/, appUser=hadoop
14/09/24 18:54:49 INFO util.Utilities: Connecting to ApplicationMaster at ip-172-31-38-17.ec2.internal/172.31.38.17:42455
14/09/24 18:54:49 INFO client.Client: Initializing ApplicationMaster
^C14/09/24 18:56:50 INFO util.Utilities: Killing appliation with id: application_1411583461927_0008
[hadoop@ip-172-31-36-126 ~]$ hadoop jar mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar -a cpi -M 1024 -m 1024 -n 5
14/09/24 18:56:57 INFO client.Client: Initializing Client
14/09/24 18:56:57 INFO client.Client: Container number is 5
14/09/24 18:56:57 INFO client.Client: Application Master's jar is /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar
14/09/24 18:56:57 INFO client.Client: Starting Client
14/09/24 18:56:57 INFO util.Utilities: BELOW IS CONFIGUATIONS FROM Client ***
key=TERM; value=xterm-256color
key=HADOOP_PREFIX; value=/home/hadoop
key=PIG_CONF_DIR; value=/home/hadoop/pig/conf
key=JAVA_HOME; value=/usr/java/latest
key=HBASE_HOME; value=/home/hadoop/hbase
key=HIVE_HOME; value=/home/hadoop/hive
key=HADOOP_YARN_HOME; value=/home/hadoop
key=HADOOP_DATANODE_HEAPSIZE; value=384
key=SSH_CLIENT; value=54.240.217.9 19295 22
key=HADOOP_NAMENODE_HEAPSIZE; value=768
key=YARN_HOME; value=/home/hadoop
key=MAIL; value=/var/spool/mail/hadoop
key=HOSTNAME; value=ip-172-31-36-126.ec2.internal
key=PWD; value=/home/hadoop
key=IMPALA_CONF_DIR; value=/home/hadoop/impala/conf
key=LESS_TERMCAP_mb; value=
key=LESS_TERMCAP_me; value=
key=LESS_TERMCAP_md; value=
key=NLSPATH; value=/usr/dt/lib/nls/msg/%L/%N.cat
key=AWS_AUTO_SCALING_HOME; value=/opt/aws/apitools/as
key=HISTSIZE; value=1000
key=HADOOP_COMMON_HOME; value=/home/hadoop
key=PATH; value=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
key=HIVE_CONF_DIR; value=/home/hadoop/hive/conf
key=HADOOP_CLASSPATH; value=:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/
key=HADOOP_CONF_DIR; value=/home/hadoop/conf
key=IMPALA_HOME; value=/home/hadoop/impala
key=AWS_IAM_HOME; value=/opt/aws/apitools/iam
key=SHLVL; value=1
key=XFILESEARCHPATH; value=/usr/dt/app-defaults/%L/Dt
key=AWS_CLOUDWATCH_HOME; value=/opt/aws/apitools/mon
key=EC2_AMITOOL_HOME; value=/opt/aws/amitools/ec2
key=HADOOP_HOME_WARN_SUPPRESS; value=true
key=PIG_CLASSPATH; value=/home/hadoop/pig/lib
key=AWS_RDS_HOME; value=/opt/aws/apitools/rds
key=LESS_TERMCAP_se; value=
key=SSH_TTY; value=/dev/pts/0
key=MAHOUT_CONF_DIR; value=/home/hadoop/mahout/conf
key=HBASE_CONF_DIR; value=/home/hadoop/hbase/conf
key=LOGNAME; value=hadoop
key=YARN_CONF_DIR; value=/home/hadoop/conf
key=AWS_PATH; value=/opt/aws
key=HADOOP_HOME; value=/home/hadoop
key=LD_LIBRARY_PATH; value=/home/hadoop/lib/native:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib::/home/hadoop/lib/native
key=MALLOC_ARENA_MAX; value=4
key=SSH_CONNECTION; value=54.240.217.9 19295 172.31.36.126 22
key=HADOOP_OPTS; value= -server -Dhadoop.log.dir=/home/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=128m -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
key=MAHOUT_LOG_DIR; value=/mnt/var/log/apps
key=SHELL; value=/bin/bash
key=LC_CTYPE; value=UTF-8
key=CLASSPATH; value=/home/hadoop/conf:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/:/home/hadoop/share/hadoop/hdfs:/home/hadoop/share/hadoop/hdfs/lib/:/home/hadoop/share/hadoop/hdfs/:/home/hadoop/share/hadoop/yarn/lib/:/home/hadoop/share/hadoop/yarn/:/home/hadoop/share/hadoop/mapreduce/lib/:/home/hadoop/share/hadoop/mapreduce/::/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/lib/
key=PIG_HOME; value=/home/hadoop/pig
key=EC2_HOME; value=/opt/aws/apitools/ec2
key=LESS_TERMCAP_ue; value=
key=LC_ALL; value=en_US.UTF-8
key=AWS_ELB_HOME; value=/opt/aws/apitools/elb
key=USER; value=hadoop
key=HADOOP_HDFS_HOME; value=/home/hadoop
key=HADOOP_CLIENT_OPTS; value= -XX:MaxPermSize=128m
key=RUBYOPT; value=rubygems
key=HISTCONTROL; value=ignoredups
key=HOME; value=/home/hadoop
key=MAHOUT_HOME; value=/home/hadoop/mahout
key=LESSOPEN; value=|/usr/bin/lesspipe.sh %s
key=LS_COLORS; value=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.tzo=38;5;9:.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=38;5;9:.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:.xspf=38;5;45:
key=LESS_TERMCAP_us; value=
key=LANG; value=en_US.UTF-8
key=HADOOP_MAPRED_HOME; value=/home/hadoop
14/09/24 18:56:58 INFO util.Utilities: Checking some environment variable is properly set.
14/09/24 18:56:58 INFO util.Utilities: HADOOP_CONF_DIR=/home/hadoop/conf
14/09/24 18:56:58 INFO util.Utilities: YARN_CONF_DIR=/home/hadoop/conf
14/09/24 18:56:58 INFO util.Utilities: PATH=/home/hadoop/pig/bin:/usr/local/cuda/bin:/usr/java/latest/bin:/home/hadoop/bin:/home/hadoop/mahout/bin:/home/hadoop/hive/bin:/home/hadoop/hbase/bin:/home/hadoop/impala/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/home/hadoop/cascading/tools/multitool-20140224/bin:/home/hadoop/cascading/tools/load-20140223/bin:/home/hadoop/cascading/tools/lingual-client/bin:/home/hadoop/cascading/driven/bin
14/09/24 18:56:58 INFO util.Utilities: Checking conf is correct
14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.hostname=0.0.0.0
14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.address=172.31.36.126:9022
14/09/24 18:56:58 INFO util.Utilities: yarn.resourcemanager.scheduler.address=172.31.36.126:9024
14/09/24 18:56:58 INFO util.Utilities: 0.0.0.0:8032=null
14/09/24 18:56:58 INFO util.Utilities: 0.0.0.0:8030=null
14/09/24 18:56:58 INFO util.Utilities: yarn.mpi.container.allocator=null
14/09/24 18:56:58 INFO util.Utilities: *
****************************************************
14/09/24 18:56:58 INFO util.Utilities: Connecting to ResourceManager at /172.31.36.126:9022
14/09/24 18:56:58 INFO client.Client: Got new application id=application_1411583461927_0009
14/09/24 18:56:58 INFO client.Client: Got Applicatioin: application_1411583461927_0009
14/09/24 18:56:58 INFO client.Client: Max mem capabililty of resources in this cluster 3072
14/09/24 18:56:58 INFO client.Client: Setting up application submission context for ASM
14/09/24 18:56:58 INFO client.Client: Set Application Id: application_1411583461927_0009
14/09/24 18:56:58 INFO client.Client: Set Application Name: MPICH2-cpi
14/09/24 18:56:58 INFO client.Client: Copy App Master jar from local filesystem and add to local environment
14/09/24 18:56:58 INFO client.Client: Source path: /home/hadoop/mpich2-yarn/target/mpich2-yarn-1.0-SNAPSHOT.jar
14/09/24 18:56:58 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/9/AppMaster.jar
14/09/24 18:56:58 INFO client.Client: Copy MPI application from local filesystem to remote.
14/09/24 18:56:58 INFO client.Client: Source path: cpi
14/09/24 18:56:58 INFO client.Client: Destination path: hdfs://172.31.36.126:9000/tmp/MPICH2-cpi/9/MPIExec
14/09/24 18:56:58 INFO client.Client: Set the environment for the application master and mpi application
14/09/24 18:56:58 INFO client.Client: Trying to generate classpath for app master from current thread's classpath
14/09/24 18:56:58 INFO client.Client: Could not classpath resource from class loader
14/09/24 18:56:58 INFO client.Client: Setting up app master command
14/09/24 18:56:58 INFO client.Client: Completed setting up app master command ${JAVA_HOME}/bin/java -Xmx1024m org.apache.hadoop.yarn.mpi.server.ApplicationMaster --container_memory 1024 --num_containers 5 --priority 0 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr
14/09/24 18:56:58 INFO client.Client: Submitting application to ASM
14/09/24 18:56:58 INFO client.Client: Submisstion result: true
14/09/24 18:56:58 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:56:59 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:00 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:01 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:02 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:03 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=N/A, rpcPort:-1, appQueue=default, appMasterRpcPort=-1, appStartTime=1411585018942, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:04 INFO client.Client: Got application report from ASM for, appId=9, clientToken=null, appDiagnostics=, appMasterHost=ip-172-31-38-17.ec2.internal, rpcPort:56056, appQueue=default, appMasterRpcPort=56056, appStartTime=1411585018942, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://172.31.36.126:9046/proxy/application_1411583461927_0009/, appUser=hadoop
14/09/24 18:57:05 INFO util.Utilities: Connecting to ApplicationMaster at ip-172-31-38-17.ec2.internal/172.31.38.17:56056
14/09/24 18:57:05 INFO client.Client: Initializing ApplicationMaster

The data and the application seem to be available in HDFS:

[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /
Found 1 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:50 /tmp
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp
Found 2 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:56 /tmp/MPICH2-cpi
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:31 /tmp/hadoop-yarn
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn
Found 1 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:31 /tmp/hadoop-yarn/staging
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging
Found 1 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging/history
Found 2 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history/done
drwxrwxrwt - hadoop supergroup 0 2014-09-24 18:31 /tmp/hadoop-yarn/staging/history/done_intermediate
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/hadoop-yarn/staging/history/done
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/MPICH2-cpi
Found 3 items
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:50 /tmp/MPICH2-cpi/7
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:54 /tmp/MPICH2-cpi/8
drwxrwx--- - hadoop supergroup 0 2014-09-24 18:56 /tmp/MPICH2-cpi/9
[hadoop@ip-172-31-36-126 ~]$ hadoop fs -ls /tmp/MPICH2-cpi/9
Found 2 items
-rw-r--r-- 2 hadoop supergroup 96333 2014-09-24 18:56 /tmp/MPICH2-cpi/9/AppMaster.jar
-rw-r--r-- 2 hadoop supergroup 9598 2014-09-24 18:56 /tmp/MPICH2-cpi/9/MPIExec
[hadoop@ip-172-31-36-126 ~]$

Any idea how to debug this?

Thanks a lot
Markus
