In cluster mode (on EMR at least), any query that uses with-job-conf to set the mapred.child.java.opts property never actually starts. The job tries to start several times but ultimately fails. Changes made to the setting directly in mapred-site.xml don't get picked up for some reason, so the setting has to be modified in hadoop-site.xml:
<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>
Sample query to reproduce error
This only appears to happen in cluster mode, but it reproduces even on a single-instance EMR cluster. After uberjaring, run (use 'cascalog.api) from the REPL, then run this:
(with-job-conf {"mapred.child.java.opts" "-Xmx512"}
  (let [src [[1 2]]
        out-loc (hfs-seqfile "s3n://formaexperiments/test-with-job-conf" :sinkmode :replace)]
    (?<- out-loc [?a]
         (src ?a ?b))))
Things I've tried
- The query I really want to run (forma/beta-gen) starts if you don't use with-job-conf, but eventually fails because the reducers run out of memory for big ecoregions in Brazil and Indonesia. For a smaller country like Malaysia we don't need to modify the memory configuration, but we must be able to control it in order to calculate the beta vectors for the larger ones.
- The simple sample query above works without using with-job-conf
- It works with
(with-job-conf {"mapred.map.tasks" 10} ...
- It fails using both Cascalog 1.9 and Cascalog 1.9-wip with
(with-job-conf {"mapred.child.java.opts" "-Xmx512"} ...
and with several other memory configurations
- It works if conf/hadoop-site.xml is modified, in this case so that the max child process memory allocation is 1025m:
<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>
As far as workarounds go it's not too bad, but it's definitely a pain.
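One variant that may be worth testing, as a sketch only and not verified on a cluster: pass with-job-conf the same complete option string that works in hadoop-site.xml. It assumes that the job conf value replaces the cluster's mapred.child.java.opts wholesale, so a bare heap flag would drop -Djava.library.path=/home/hadoop/native. It also relies on the JVM reading a heap size without a k, m, or g suffix as bytes, so a value such as -Xmx512 requests a 512-byte heap rather than 512 MB.

```clojure
(use 'cascalog.api)

;; Sketch, untested on EMR: the same sample query, but with the full option
;; string from the hadoop-site.xml workaround instead of a bare -Xmx flag.
;; Assumption: the job conf value replaces the cluster default entirely,
;; so the library path is repeated here; heap sizes carry an explicit "m".
(with-job-conf {"mapred.child.java.opts"
                "-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m"}
  (let [src     [[1 2]]
        out-loc (hfs-seqfile "s3n://formaexperiments/test-with-job-conf"
                             :sinkmode :replace)]
    (?<- out-loc [?a]
         (src ?a ?b))))
```

If this variant starts while the "-Xmx512" version does not, that would point to the option string rather than to with-job-conf itself.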
Sample error messages from the logs
(JOB_SETUP) 'attempt_201207101735_0006_m_000013_8' to tip task_201207101735_0006_m_000013, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 18:04:44,941 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 23 on 9001): Removing task 'attempt_201207101735_0006_m_000013_7'
2012-07-10 18:04:47,946 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 43 on 9001): Error from attempt_201207101735_0006_m_000013_8: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

(JOB_CLEANUP) 'attempt_201207101735_0003_m_000010_17' to tip task_201207101735_0003_m_000010, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 17:53:00,970 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 61 on 9001): Removing task 'attempt_201207101735_0003_m_000010_16'
2012-07-10 17:53:03,973 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 42 on 9001): Error from attempt_201207101735_0003_m_000010_17: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
2012-07-10 17:53:06,977 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 2 on 9001): Adding task