Comments (64)
I have this problem trying to follow the tutorial with Spark 2.2; how can I fix it?
from oryx.
@stiv-yakovenko are you able to test a build from master? I have committed a potential fix there, but have lacked an easy way to test it.
Batch:
https://drive.google.com/open?id=1q-i5RbnevXrtLwSvhLgMH5bDN2GaAvOA
Serving:
https://drive.google.com/open?id=1GgOeelB-NCDZiIIk7vzDicoIXZRlYXbY
Speed:
https://drive.google.com/open?id=1L3JXMcMADoB-1W8AC98DqEDZDFIeRcY2
from oryx.
Thank you so much, dear @srowen. This one error is gone, but I haven't finished the entire process yet; I'll let you know if something else is broken.
from oryx.
This version crashes with java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
from oryx.
Hm, that class has been in Spark since Spark 2.1, and is still there. Can you double-check how you're running Spark, and which version? Are there other, earlier errors?
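One quick way to check which Spark the launcher actually picks up (this just prints the version banner of whatever spark-submit is first on the PATH):
spark-submit --version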
from oryx.
I use spark-2.2.0-bin-hadoop2.7; these are my environment settings:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export KAFKA_HOME=/home/osboxes/kafka_2.12-2.0.0
export HADOOP_HOME=/home/osboxes/hadoop-3.1.0
export HADOOP_CONF_DIR=/home/osboxes/hadoop-3.1.0/etc/hadoop
export PATH=$KAFKA_HOME/bin:$HADOOP_HOME/bin:/home/osboxes/spark-2.2.0-bin-hadoop2.7/bin:$PATH
These are my starting commands:
cd ./kafka_2.12-2.0.0
./bin/zookeeper-server-start.sh config/zookeeper.properties
./bin/kafka-server-start.sh config/server.properties
cd ./hadoop-3.1.0
./sbin/start-dfs.sh
./sbin/start-yarn.sh
cd ~/oryx
./oryx-run.sh batch
from oryx.
I've packed all these folders into a single archive, so you may check any of my configs quickly if you wish:
https://drive.google.com/file/d/1R1vYcoUAfN_MbB1BGtiFfo7a2tou10Az/view?usp=sharing
(I've removed all the jars, otherwise it would be super huge)
from oryx.
Hm, I suspect Kafka 2 is the difference. I am not sure Spark even supports that (ironically, I'm working on that change a bit right now in Spark). I think the important thing is the logs. Is there an earlier error about a class not being found from Kafka?
Also, this build is for Spark 2.3 rather than 2.2.
It's all a nightmare because each of these versions changes in slightly subtle ways.
This may still work, or be possible to patch over, but I think there's a missing piece of info ... something else is at work.
Can you try changing the Kafka version to 2.0.0 in the project and rebuilding it? Or I could post those artifacts to try.
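If you try the rebuild route, it would look roughly like this, assuming the Kafka dependency is driven by a kafka.version Maven property in the parent pom (that property name is a guess; check the pom for the real one):
# from the root of the oryx source checkout; property name is hypothetical
mvn -DskipTests -Dkafka.version=2.0.0 clean package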
from oryx.
Well, I already have Kafka 2.0.0: kafka_2.12-2.0.0. Or do you mean changing the Spark version?
from oryx.
Maybe the most important thing is getting a bit more log detail. Do you see earlier errors before NoClassDefFoundError? Updating to Spark 2.3 may also help, but doesn't seem directly related to the error.
I think I intend for this build to work with Spark 2.3 + Kafka 0.11. I really hope 0.11 isn't so different from 2.0.0, but we'll see if it can be patched over or whether it needs yet another release branch.
from oryx.
Well, after upgrading to Spark 2.3 I have this new crash; no class exceptions any more:
2018-08-02 15:52:08 INFO Client:54 -
client token: N/A
diagnostics: Application application_1533224154849_0007 failed 2 times due to AM Container for appattempt_1533224154849_0007_000002 exited with exitCode: -103
Failing this attempt. Diagnostics: [2018-08-02 15:51:45.461] Container [pid=18592,containerID=container_1533224154849_0007_02_000001] is running 208308736B beyond the 'VIRTUAL' memory limit. Current usage: 151.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1533224154849_0007_02_000001:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 18592 18591 18592 18592 (bash) 0 0 118063104 372 /bin/bash -c /usr/lib/jvm/java-1.8.0-openjdk/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'master:37555' --properties-file /tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/__spark_conf__/__spark_conf__.properties 1> /home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001/stdout 2> /home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001/stderr
|- 18603 18592 18592 18592 (java) 457 19 2345103360 38336 /usr/lib/jvm/java-1.8.0-openjdk/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg master:37555 --properties-file /tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/__spark_conf__/__spark_conf__.properties
Container [pid=18592,containerID=container_1533224154849_0007_02_000001] is running 208308736B beyond the 'VIRTUAL' memory limit. Current usage: 151.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Which JVM params should I tune? How much memory do I need?
from oryx.
Tried setting 6 GB of memory for my VM; it doesn't help.
from oryx.
Good, that's progress. That means it basically works. Can you post your config file here? It'll be easier to see the settings and say what might need increasing. What layer are you running?
from oryx.
I think the latest commit resolved the problem this ticket was about, but we can continue discussion. Resolving ...
from oryx.
Some testing on this end also suggests Spark 2.3 works, and I made one more change that should make it work with Kafka 2. It's enough that I think a 2.7.0 release makes sense: https://github.com/OryxProject/oryx/releases/tag/oryx-2.7.0
from oryx.
I'll test today in a few hours.
from oryx.
Hi! With Oryx 2.7 I still get this error after fine-tuning the JVM memory settings:
2018-08-02 18:16:06 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
at com.cloudera.oryx.lambda.AbstractSparkLayer.buildInputDStream(AbstractSparkLayer.java:195)
at com.cloudera.oryx.lambda.batch.BatchLayer.start(BatchLayer.java:105)
at com.cloudera.oryx.batch.Main.main(Main.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka010.LocationStrategies
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
PATH indicates the components I have:
/home/osboxes/kafka_2.12-2.0.0/bin:/home/osboxes/hadoop-3.1.0/bin:/home/osboxes/spark-2.3.0-bin-hadoop2.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/osboxes/.local/bin:/home/osboxes/bin
from oryx.
That's really weird. That class has been in Spark itself since 2.1. How are you running this?
from oryx.
Well, the same way:
cd ~/kafka_2.12-2.0.0
./bin/zookeeper-server-start.sh config/zookeeper.properties&
./bin/kafka-server-start.sh config/server.properties&
cd ~/hadoop-3.1.0
./sbin/start-dfs.sh
./sbin/start-yarn.sh
cd ~/oryx
./oryx-run.sh batch
env settings:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export KAFKA_HOME=/home/osboxes/kafka_2.12-2.0.0
export HADOOP_HOME=/home/osboxes/hadoop-3.1.0
export HADOOP_CONF_DIR=/home/osboxes/hadoop-3.1.0/etc/hadoop
export PATH=$KAFKA_HOME/bin:$HADOOP_HOME/bin:/home/osboxes/spark-2.3.0-bin-hadoop2.7/bin:$PATH
Maybe this is from an older version of the app being deployed to Spark?
How else can I diagnose the problem?
from oryx.
OK, and what's your Spark cluster like? What does spark-submit connect you to? I'm trying to figure out how an old version of Spark might be in play here -- is that possible?
Is it possible there's any earlier error?
from oryx.
I see a lot of this right now, after removing the old version of Spark and the old applications uploaded there:
2018-08-04 03:33:42 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:33:48 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:33:54 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:00 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:06 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:12 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:18 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
What is meant to be on 8032?
from oryx.
That's going to be the YARN resource manager. Is that running?
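A quick sanity check, assuming the Hadoop binaries are on the PATH:
jps                # a healthy single-node setup should list ResourceManager and NodeManager
yarn node -list    # asks the resource manager for its registered node managers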
from oryx.
Yes, it had an incorrect port; I fixed it. The default port from the distribution is different and doesn't work. Also, the Hadoop cluster seems to have some internal conflicts; I'm fixing that.
from oryx.
After fixing this YARN error, reformatting, and restarting Hadoop, I still get this ClassNotFound exception:
2018-08-04 05:58:46 INFO ZooKeeper:684 - Session: 0x1000005b82f0009 closed
2018-08-04 05:58:46 INFO BatchLayer:166 - Shutting down Spark Streaming; this may take some time
2018-08-04 05:58:46 WARN StreamingContext:66 - StreamingContext has not been started yet
2018-08-04 05:58:46 INFO ClientCnxn:512 - EventThread shut down
2018-08-04 05:58:46 INFO AbstractConnector:318 - Stopped Spark@1e6308a9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-08-04 05:58:46 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-08-04 05:58:46 ERROR Client:91 - Failed to contact YARN for application application_1533376242674_0002.
java.io.InterruptedIOException: Call interrupted
at org.apache.hadoop.ipc.Client.call(Client.java:1469)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:430)
at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:300)
at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1053)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:109)
2018-08-04 05:58:46 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FAILED!
2018-08-04 05:58:46 INFO SparkContext:54 - SparkContext already stopped.
2018-08-04 05:58:46 ERROR TransportClient:233 - Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 INFO SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
2018-08-04 05:58:46 ERROR Utils:91 - Uncaught exception in thread main
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:707)
at org.apache.spark.streaming.api.java.JavaStreamingContext.stop(JavaStreamingContext.scala:601)
at com.cloudera.oryx.lambda.batch.BatchLayer.close(BatchLayer.java:167)
at com.cloudera.oryx.batch.Main.main(Main.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-08-04 05:58:46 INFO MemoryStore:54 - MemoryStore cleared
2018-08-04 05:58:46 INFO BlockManager:54 - BlockManager stopped
2018-08-04 05:58:46 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-08-04 05:58:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-08-04 05:58:46 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
at com.cloudera.oryx.lambda.AbstractSparkLayer.buildInputDStream(AbstractSparkLayer.java:195)
at com.cloudera.oryx.lambda.batch.BatchLayer.start(BatchLayer.java:105)
at com.cloudera.oryx.batch.Main.main(Main.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka010.LocationStrategies
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Shutdown hook called
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-275a777e-8864-448c-8543-09109606876d
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e5ffdd9a-e36d-4525-b105-e61e3def4de0
from oryx.
What version of Spark does spark-submit run? Are you sure it is running the version you think?
What is your cluster like - a distro or run from vanilla binary releases?
from oryx.
Well, ps fax indicates that the version (/home/osboxes/spark-2.3.0-bin-hadoop2.7) is the expected one, because I've deleted the older version of Spark:
6348 pts/0 S 0:00 | \_ bash ./oryx-run.sh batch
6403 pts/0 Sl 0:00 | | \_ /usr/lib/jvm/java-1.8.0-openjdk/bin/java -cp /home/osboxes/spark-2.3.0-bin-hadoop2.7/conf/:/home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/*:/home/osboxes/hadoop-3.1.0/etc/hadoop/ -Xmx1g -Dconfig.file=oryx.conf org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --conf spark.logConf=true --conf spark.driver.memory=1g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.ui.port=4040 --conf spark.io.compression.codec=lzf --conf spark.executor.extraJavaOptions=-Dconfig.file=oryx.conf --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --conf spark.driver.extraJavaOptions=-Dconfig.file=oryx.conf --class com.cloudera.oryx.batch.Main --name OryxBatchLayer-ALSExample --files oryx.conf --executor-memory 4g --executor-cores 8 --num-executors 4 oryx-batch-2.7.0.jar
I have virtualbox with Centos 7 from this image: https://www.osboxes.org/centos/
I've used this rpm to install scala: https://downloads.lightbend.com/scala/2.11.2/scala-2.11.2.rpm
I have downloaded hadoop-3.1.0, kafka_2.12-2.0.0, and spark-2.3.0-bin-hadoop2.7 there from their Apache websites and followed these instructions to install them:
https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/SingleCluster.html
https://kafka.apache.org/quickstart
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
I'm using Oryx 2.7 from the releases section of this GitHub repo.
from oryx.
Hm, I'm confused now. I downloaded the Spark distro and I don't see the Kafka integration classes in the jars/ dir. There should be an assembly for it. I'm looking into that on the Spark side; I hope I'm just missing something obvious. But it would explain why the class isn't found!
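To double-check on your side, something along these lines (using the install path from your message) should show whether the class is on Spark's default classpath at all:
ls /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/ | grep -i kafka
# or scan each jar for the missing class
for j in /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/*.jar; do
  unzip -l "$j" | grep -q LocationStrategies && echo "$j"
done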
from oryx.
Tracking the possible Spark distro problem at https://issues.apache.org/jira/browse/SPARK-25026 . In the meantime, I manually built the spark-streaming-kafka JAR for 2.3.0 and posted it here: https://drive.google.com/open?id=1WJhdkjrwdGF02eE1ePUhafoAnc-VFkk2 You could try copying that into your jars/ directory and running again. If it works, it kind of confirms the problem is what I think in Spark.
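Roughly, assuming the downloaded file is named something like spark-streaming-kafka-0-10_2.11-2.3.0.jar (the actual name may differ):
cp spark-streaming-kafka-0-10_2.11-2.3.0.jar /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/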
from oryx.
Update @stiv-yakovenko -- I think the answer is, for better or worse it hasn't shipped with Spark and won't. The Oryx app needs to include this code. I think this was never obvious before because other JARs on the classpath pulled in the Kafka integration, and/or simply this was almost never tested in standalone mode, instead tested on a cluster from a distro.
Anyway, I have new binaries you can try (ignore the 2.8.0 version):
Batch
https://drive.google.com/open?id=1JmuERkZULvO6c7iC0zuMRzNOUfZA7sX9
Speed
https://drive.google.com/open?id=1usSYp8wvgPZeJKycad_hWZiZhyUoTBph
Serving
https://drive.google.com/open?id=1ioDzKs5eMmlx-5iRfs8pti02oTtZH-U-
from oryx.
Will this be in 2.9.0?
from oryx.
If this is the problem, I will make a 2.7.1.
from oryx.
Well, this ClassNotFound exception is gone. But I still get another exception, which happens after the application has been uploaded:
5115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 5869987425115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 INFO SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
2018-08-04 18:51:08 ERROR Utils:91 - Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 5869987425115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-08-04 18:51:08 INFO MemoryStore:54 - MemoryStore cleared
2018-08-04 18:51:08 INFO BlockManager:54 - BlockManager stopped
2018-08-04 18:51:08 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-08-04 18:51:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-08-04 18:51:08 INFO SparkContext:54 - Successfully stopped SparkContext
[2018-08-04 18:51:23,534] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 8 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
from oryx.
Hm, that's good, but this error is hard to figure out. Sounds like Spark can't talk to the workers run by YARN. Is YARN healthy and running? I wonder because I know you're running on a small VM, and I'm not sure how effectively all these services fit into those resources.
from oryx.
Hi!
I've added these options to the YARN config:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Now the first crash is gone, but I still have this error:
2018-08-05 06:03:20 ERROR Utils:91 - Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 5781682026554577140 to /192.168.1.106:60798: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
I have searched through all the configs in my home folder, but none of them contained port 60798. Probably this port is hardcoded somewhere; I wish I knew what it is. No one is listening on this port.
from oryx.
Yeah, I see why you did that: to make YARN less likely to kill things for running out of memory. It doesn't really fix the fact that you're running many big services in a small VM. I think the errors here still indicate the YARN daemons aren't running. What you'd really want to do is turn down the memory that the services use, and turn down the memory that the Oryx apps use. That would help it fit.
Really, this is meant for at least a small cluster, and it is indeed hard to set up a whole small cluster just to run it. Usually the deployment scenario is an existing cluster from a distro or something.
from oryx.
Memory is not an issue at this step; I have enough memory in my VM. It was before, but now it's this last error. Oryx wants to access port 60798, and I have no idea what service should listen there. I haven't found it in any config file, so it looks like it's hardcoded somewhere in the source code. How can I fix this error?
from oryx.
That's almost certainly the app's application master run by YARN -- random port chosen by YARN. I think the next question would be why it didn't run or failed to run. That would be more in the YARN logs, which can be tricky to chase down, but should be available from the resource manager UI in YARN.
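If the UI is hard to reach, the same container logs can usually be pulled from the command line once the application has exited (this needs YARN log aggregation enabled), using the application ID from your output, e.g.:
yarn logs -applicationId application_1533376242674_0002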
from oryx.
I've managed to start the batch and speed layers without exceptions, finally.
But the serving layer crashes with an exception:
Aug 05, 2018 12:46:17 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class [com.cloudera.oryx.lambda.serving.ModelManagerListener]
java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
from oryx.
Can you please give a hint: has the speed layer started?
I also have this in my logs:
2018-08-05 12:45:22 INFO Client:54 -
client token: N/A
diagnostics: [Sun Aug 05 12:45:21 -0400 2018] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1533487521782
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1533487265246_0002/
user: osboxes
Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>
How can I fix this?
from oryx.
That means it hasn't started. There isn't enough memory available in YARN to meet the request. I think you'll want to turn down the amount of memory the config file requests. Looks like you are asking for 6GB of memory?
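For reference, those requests come from the *.streaming sections of the config; trimming them in oryx.conf would look roughly like this (values illustrative only, mirroring the structure of the dump earlier in the thread):
oryx.batch.streaming {
  num-executors = 1
  executor-cores = 1
  executor-memory = "1g"
  driver-memory = "1g"
}
oryx.speed.streaming {
  num-executors = 1
  executor-memory = "1g"
}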
I'll look at the other error; looks like it thinks it needs ZK classes now too. I didn't see that when I started it but didn't test it too much.
from oryx.
Actually, regarding the ClassNotFoundError -- I correct myself. Zookeeper and Hadoop dependencies are still meant to be provided by the runtime environment (cluster), not packaged. This makes it easier to avoid conflicts with what's on the cluster.
The compute-classpath.sh script is set up to find the JAR files in the usual locations in CDH. However, that won't match your setup, which is just unpacked binary releases. You will just need to modify that script to search the jars directories of your Hadoop install and Kafka install. Have a look, it's pretty straightforward; you're just going to list different path glob expressions than the ones you see there.
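For example, for plain unpacked releases the additions end up being absolute jar paths, along these lines (exact jar names and versions will differ per install):
echo "/home/osboxes/kafka_2.12-2.0.0/libs/zookeeper-3.4.13.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/hadoop-common-3.1.0.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/commons-collections-3.2.2.jar"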
from oryx.
Regarding memory: I've updated my VirtualBox VM to provide 12 GB of memory. I see the following in YARN:
So I don't understand why it can't allocate more memory. I'll try to pass the missing jar file to the path.
from oryx.
What's your config? It will dump the whole effective Oryx config when it starts up, or should. That will show how much memory it's requesting. It's asking for a driver and executors; it may be hard to fit that into 12GB with everything else. I would have thought you needed a VM of 32GB RAM or so to make it pretty easy to run all these services.
from oryx.
This is what oryx-run.sh outputs when it starts:
oryx {
als {
decay {
factor=1
zero-threshold=0
}
hyperparams {
alpha=1
epsilon=1.0E-5
features=10
lambda=0.001
}
implicit=true
iterations=10
logStrength=false
no-known-items=false
rescorer-provider-class=null
sample-rate=1
}
batch {
storage {
data-dir="hdfs:///user/example/Oryx/data/"
key-writable-class="org.apache.hadoop.io.Text"
max-age-data-hours=-1
max-age-model-hours=-1
message-writable-class="org.apache.hadoop.io.Text"
model-dir="hdfs:///user/example/Oryx/model/"
}
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="1g"
dynamic-allocation=false
executor-cores=1
executor-memory="1g"
generation-interval-sec=300
master=yarn
num-executors=1
}
ui {
port=4040
}
update-class="com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
}
default-streaming-config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
id=ALSExample
input-schema {
categorical-features=null
feature-names=[]
id-features=[]
ignored-features=[]
num-features=0
numeric-features=null
target-feature=null
}
input-topic {
broker="127.0.0.1:9092"
lock {
master="127.0.0.1:2181"
}
message {
key-class="java.lang.String"
key-decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
message-class="java.lang.String"
message-decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
topic=OryxInput
}
}
kmeans {
evaluation-strategy=SILHOUETTE
hyperparams {
k=10
}
initialization-strategy="k-means||"
iterations=30
runs=3
}
ml {
eval {
candidates=1
hyperparam-search=random
parallelism=1
test-fraction=0.1
threshold=null
}
}
rdf {
hyperparams {
impurity=entropy
max-depth=8
max-split-candidates=100
min-info-gain-nats=0.001
min-node-size=16
}
num-trees=20
}
serving {
api {
context-path="/"
key-alias=null
keystore-file=null
keystore-password=*****
password=*****
port=8080
read-only=false
secure-port=443
user-name=null
}
application-resources="com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
memory="4000m"
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
no-init-topics=false
yarn {
cores="4"
instances=1
}
}
speed {
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="512m"
dynamic-allocation=false
executor-cores=4
executor-memory="1g"
generation-interval-sec=10
master=yarn
num-executors=2
}
ui {
port=4041
}
}
update-topic {
broker="127.0.0.1:9092"
lock {
master="127.0.0.1:2181"
}
message {
decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
encoder-class="org.apache.kafka.common.serialization.StringDeserializer"
max-size=16777216
topic=OryxUpdate
}
}
}
This is my configuration file:
kafka-brokers = "127.0.0.1:9092"
zk-servers = "127.0.0.1:2181"
hdfs-base = "hdfs:///user/example/Oryx"
oryx {
id = "ALSExample"
als {
rescorer-provider-class = null
}
input-topic {
broker = ${kafka-brokers}
lock = {
master = ${zk-servers}
}
}
update-topic {
broker = ${kafka-brokers}
lock = {
master = ${zk-servers}
}
}
batch {
streaming {
generation-interval-sec = 300
num-executors = 1
executor-cores = 1
executor-memory = "1g"
}
update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
storage {
data-dir = ${hdfs-base}"/data/"
model-dir = ${hdfs-base}"/model/"
}
ui {
port = 4040
}
}
speed {
model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
ui {
port = 4041
}
}
serving {
model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
application-resources = "com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
api {
port = 8080
}
}
}
This is the YARN config, just in case:
<configuration>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>0.0.0.0:8032</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>130</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>1</value>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers</description>
</property>
</configuration>
I just need one executor for each service, so 3 GB would be enough. I start the batch layer first. But the serving layer, which is second, doesn't start and doesn't allocate any resources, even though 5 GB is still left.
from oryx.
The actual config logged by the app says: Batch starts a 1GB driver and one 1GB executor. It will request more than 1GB in each case, to add some overhead for off-heap memory. But your config shows asking for 4 x 4GB executors.
I also see something is running in both cases; there's a batch app with several containers that says it is "RUNNING". The second one isn't allocating.
I wonder if some wires are crossed with the config?
I think the issue here is getting YARN set up. There are a number of configs to look at. One thing you might look at is its minimum container memory allocation size, the maximum, and the increment. It may be forcing YARN to allocate more memory.
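For reference, the allocation granularity lives in yarn-site.xml; the standard property names look like this (values illustrative only):
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>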
from oryx.
How can I reduce this number? Where does it come from?
from oryx.
You can just set it in your config file in the same location. This won't matter unless you are trying to run the serving layer via YARN. You don't have to; you can just run the binary.
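Concretely, mirroring the structure in the dump above, something like this in oryx.conf should override the serving layer's request (values illustrative, not a recommendation):
oryx.serving {
  memory = "500m"
  yarn {
    instances = 1
    cores = "1"
  }
}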
from oryx.
I tried to reduce the number of executors to 1 to reduce memory consumption, but I still see 2 containers being allocated.
This is what oryx-run.sh outputs:
serving {
api {
context-path="/"
key-alias=null
keystore-file=null
keystore-password=*****
password=*****
port=8080
read-only=false
secure-port=443
user-name=null
}
application-resources="com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
memory="4000m"
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
no-init-topics=false
streaming {
executor-cores=1
executor-memory="300m"
generation-interval-sec=300
num-executors=1
}
yarn {
cores="1"
instances=1
}
}
./oryx-run.sh speed
spark-submit --master yarn --deploy-mode client --name OryxSpeedLayer-ALSExample --class com.cloudera.oryx.speed.Main --files oryx.conf --driver-memory 512m --driver-java-options "-Dconfig.file=oryx.conf" --executor-memory 500m --executor-cores 1 --conf spark.executor.extraJavaOptions="-Dconfig.file=oryx.conf" --conf spark.ui.port=4041 --conf spark.io.compression.codec=lzf --conf spark.logConf=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --num-executors=1 oryx-speed-2.8.0-SNAPSHOT.jar
And after that there is not enough resource for the batch layer, even with 6 GB of memory and 6 vcores free. Why is that? Why is the app deployed on 2 containers?
from oryx.
There will always be at least 2 containers in a Spark app -- a driver and an executor. One job is running fine here; is it the speed layer? I'm still not clear whether your batch config is correct, because previously it showed requesting 4 x 4GB executors.
from oryx.
Yes, it was the speed layer. So now I start the batch layer like this:
[osboxes@master oryx2.7]$ ./oryx-run.sh batch
spark-submit --master yarn --deploy-mode client --name OryxBatchLayer-ALSExample --class com.cloudera.oryx.batch.Main --files oryx.conf --driver-memory 1g --driver-java-options "-Dconfig.file=oryx.conf" --executor-memory 500m --executor-cores 1 --conf spark.executor.extraJavaOptions="-Dconfig.file=oryx.conf" --conf spark.ui.port=4040 --conf spark.io.compression.codec=lzf --conf spark.logConf=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --num-executors=1 oryx-batch-2.8.0-SNAPSHOT.jar
This is the relevant part of the config:
batch {
storage {
data-dir="hdfs:///user/example/Oryx/data/"
key-writable-class="org.apache.hadoop.io.Text"
max-age-data-hours=-1
max-age-model-hours=-1
message-writable-class="org.apache.hadoop.io.Text"
model-dir="hdfs:///user/example/Oryx/model/"
}
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="1g"
dynamic-allocation=false
executor-cores=1
executor-memory="500m"
generation-interval-sec=300
master=yarn
num-executors=1
}
ui {
port=4040
}
update-class="com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
yarn {
cores="1"
instances=1
}
}
So if I start the speed layer first and the batch layer second, then I see this message during batch layer start:
client token: N/A
diagnostics: [Sun Aug 05 17:06:06 -0400 2018] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1533503166625
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1533502694403_0002/
user: osboxes
from oryx.
This confuses me: it turns out I have 6 GB free, I need only a little of it, and I can't use it:
from oryx.
It could be that the request is actually for more than 6GB. Or that you're hitting a resource limit for your queue, not the whole YARN cluster. In fact, that seems to be the issue in the error. Above, is "Maximum Allocation" the limit for your job submission queue? It says 2GB.
from oryx.
I tried to set this limit to 4GB, but I still have the problem:
from oryx.
Hm, here we're really into YARN configuration and troubleshooting. I'm not a YARN expert; I might be able to spot something if I were looking over your shoulder. I think the last error message is the best pointer; something to do with limits on how much resource you're allowed to use? The output here says it thinks it's waiting on more resources for some reason, for sure. It's a question of why.
from oryx.
I've managed to launch it finally, by changing this option: the fix was to edit ~/hadoop-3.1.0/etc/hadoop/capacity-scheduler.xml and update 0.1 to 1:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
Also, I had to patch the end of compute-classpath.sh, otherwise it doesn't work:
echo "/home/osboxes/kafka_2.12-2.0.0/libs/zookeeper-3.4.13.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/hadoop-common-3.1.0.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/woodstox-core-5.0.3.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/stax2-api-3.1.4.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/commons-collections-3.2.2.jar"
Now I have a new problem. Even though I observed the service working once, I now get error 503 or HTTP code 0, and I don't know how to diagnose it. Where is the log file for the web layer?
from oryx.
OK, I figured it was some YARN config. I am not so familiar with YARN 3.x settings myself.
Good to know about the change to compute-classpath.sh; this is only for the serving layer, thankfully, but is unfortunately highly distribution-specific.
503 just means there's no model yet. Once you feed it data and see the batch layer publish a model, it should load and serve results. The logs just go to stdout.
Example docs here, which should help get some data fed in for testing: http://oryx.io/docs/endusers.html
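For example, a minimal ingest call along the lines of those docs, assuming the serving layer is local on port 8080 as in your config:
wget --quiet --post-file data.csv --output-document - \
  --header "Content-Type: text/csv" \
  http://localhost:8080/ingest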
from oryx.
Hi! Even though I was able to feed data.csv into the service, I can't add extra data to the existing model.
I've created CSV file of two lines:
19611,24211,3
18611,30211,3
These userIds and itemIds are new; they never appeared in data.csv. I use the same wget command to send this new data to /ingest, but I can't see these new IDs here:
Searched with Ctrl-F in the browser.
Should IDs be long numbers, or can they be any string UIDs?
Does /because/uId/itId update the recommendation database, or does it just recommend?
from oryx.
They can be any ID. Are you running at least the speed layer? One key point is that the serving layer doesn't update anything itself, but waits for the speed layer (or batch) to send it updates to the model. Until that info makes it to the speed layer, it wouldn't reflect any updates.
Also, I'd have to check the code on this, but are both the user and item unknown in each of the new lines you're sending? That may cause them to do nothing. There's no update it can fold in to the model if a new user likes a new item. (Eventually, of course, the batch layer would be able to do something with it.) You might try a new user with a known item if you're just observing short-term updates via the speed layer.
from oryx.
I have all 3 layers running. Are you saying that I need to wait 5 minutes for /ingest to update the model? Or is there some other way I should supply new users and items?
from oryx.
You'd need to wait for updates to be computed and reach the serving layer. Yeah, if your speed layer batch interval is about 5 minutes, then about 5 minutes. You can reduce that interval down to 10 seconds or so without any issue. It only gets tough below a couple of seconds.
You might watch the traffic on the two Kafka topics (see the 'tail' option in the oryx-run.sh script for help) and just make sure the data is going back and forth as expected too.
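For example, dropping the speed layer interval in oryx.conf (value illustrative; the key mirrors the config dump earlier in the thread):
oryx.speed.streaming.generation-interval-sec = 10
and then watching the topics with the script's tail mode (check ./oryx-run.sh's usage text for the exact option name; kafka-tail is a guess here):
./oryx-run.sh kafka-tail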
from oryx.
What if a new userId accesses an itemId? Does this mean we can't recommend something to them immediately?
from oryx.
Finally the new IDs have arrived in the model. Thank you so much for your support.
from oryx.
One more question, @srowen. I've seen that each jar has its own Java main function as an entry point. Can they work without Spark, just connecting to Kafka and HDFS?
from oryx.
Batch and speed require Spark. These apps are submitted with spark-submit. main() is the entry point that Spark uses.
from oryx.
Related Issues (20)
- Kafka 0.11.x support HOT 27
- Customer ID update HOT 2
- Multiple jobs on oryx HOT 3
- Update to Tomcat 9 / Jersey 2.26 HOT 1
- Error while runnig Oryx batch HOT 1
- java.lang.NoClassDefFoundError: scala/collection/Seq HOT 3
- 2.7.0 - Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies HOT 1
- mvn install error HOT 1
- SpeedLayer read kafka offset HOT 1
- KryoException: Buffer overflow. Available: 1, required: 4 HOT 19
- Trouble handling huge data HOT 3
- Reseting model HOT 3
- java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String HOT 22
- /knownItems/uid doesn't return all items HOT 50
- Can't build oryx 8.0 HOT 8
- Recommend products FROM <IDs> HOT 1
- Oryx tutorial HOT 2
- Application is added to the scheduler and is not yet activated. HOT 2
- question about "fold-in" implementation HOT 2