Comments (64)
I have this problem trying to follow the tutorial with Spark 2.2; how can I fix it?
from oryx.
@stiv-yakovenko are you able to test a build from master? I have committed a potential fix there, but have lacked an easy way to test it.
Batch:
https://drive.google.com/open?id=1q-i5RbnevXrtLwSvhLgMH5bDN2GaAvOA
Serving:
https://drive.google.com/open?id=1GgOeelB-NCDZiIIk7vzDicoIXZRlYXbY
Speed:
https://drive.google.com/open?id=1L3JXMcMADoB-1W8AC98DqEDZDFIeRcY2
from oryx.
Thank you so much, dear @srowen. This one error is gone, but I haven't finished the entire process yet; I'll let you know if something else is broken.
from oryx.
This version crashes with java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
from oryx.
Hm, that class has been in Spark since Spark 2.1, and is still there. Can you double-check how you're running Spark, and which version? Are there other, earlier errors?
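One quick way to check which Spark the launcher actually picks up (this just prints the version banner of whatever spark-submit is first on the PATH):
spark-submit --version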
from oryx.
I use spark-2.2.0-bin-hadoop2.7; these are my environment settings:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export KAFKA_HOME=/home/osboxes/kafka_2.12-2.0.0
export HADOOP_HOME=/home/osboxes/hadoop-3.1.0
export HADOOP_CONF_DIR=/home/osboxes/hadoop-3.1.0/etc/hadoop
export PATH=$KAFKA_HOME/bin:$HADOOP_HOME/bin:/home/osboxes/spark-2.2.0-bin-hadoop2.7/bin:$PATH
These are my starting commands:
cd ./kafka_2.12-2.0.0
./bin/zookeeper-server-start.sh config/zookeeper.properties
./bin/kafka-server-start.sh config/server.properties
cd ./hadoop-3.1.0
./sbin/start-dfs.sh
./sbin/start-yarn.sh
cd ~/oryx
./oryx-run.sh batch
from oryx.
I've packed all these folders into a single archive, so you may check any of my configs quickly if you wish:
https://drive.google.com/file/d/1R1vYcoUAfN_MbB1BGtiFfo7a2tou10Az/view?usp=sharing
(I've removed all the jars, otherwise it would be super huge)
from oryx.
Hm, I suspect Kafka 2 is the difference. I am not sure Spark even supports that (ironically, I'm working on that change a bit right now in Spark). I think the important thing is the logs. Is there an earlier error about a class not being found from Kafka?
Also, this build is for Spark 2.3 rather than 2.2.
It's all a nightmare because each of these versions changes in slightly subtle ways.
This may still work, or be possible to patch over, but I think there's a missing piece of info ... something else is at work.
Can you try changing the Kafka version to 2.0.0 in the project and rebuilding it? Or I could post those artifacts to try.
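If you try the rebuild route, it would look roughly like this, assuming the Kafka dependency is driven by a kafka.version Maven property in the parent pom (that property name is a guess; check the pom for the real one):
# from the root of the oryx source checkout; property name is hypothetical
mvn -DskipTests -Dkafka.version=2.0.0 clean package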
from oryx.
Well, I already have Kafka 2.0.0: kafka_2.12-2.0.0. Or do you mean changing the Spark version?
from oryx.
Maybe the most important thing is getting a bit more log detail. Do you see earlier errors before NoClassDefFoundError? Updating to Spark 2.3 may also help, but doesn't seem directly related to the error.
I think I intend for this build to work with Spark 2.3 + Kafka 0.11. I really hope 0.11 isn't so different from 2.0.0, but we'll see if it can be patched over or whether it needs yet another release branch.
from oryx.
Well, after upgrading to Spark 2.3 I have this new crash; no class exceptions any more:
2018-08-02 15:52:08 INFO Client:54 -
client token: N/A
diagnostics: Application application_1533224154849_0007 failed 2 times due to AM Container for appattempt_1533224154849_0007_000002 exited with exitCode: -103
Failing this attempt. Diagnostics: [2018-08-02 15:51:45.461] Container [pid=18592,containerID=container_1533224154849_0007_02_000001] is running 208308736B beyond the 'VIRTUAL' memory limit. Current usage: 151.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1533224154849_0007_02_000001:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 18592 18591 18592 18592 (bash) 0 0 118063104 372 /bin/bash -c /usr/lib/jvm/java-1.8.0-openjdk/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'master:37555' --properties-file /tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/__spark_conf__/__spark_conf__.properties 1> /home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001/stdout 2> /home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001/stderr
|- 18603 18592 18592 18592 (java) 457 19 2345103360 38336 /usr/lib/jvm/java-1.8.0-openjdk/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/osboxes/hadoop-3.1.0/logs/userlogs/application_1533224154849_0007/container_1533224154849_0007_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg master:37555 --properties-file /tmp/hadoop-osboxes/nm-local-dir/usercache/osboxes/appcache/application_1533224154849_0007/container_1533224154849_0007_02_000001/__spark_conf__/__spark_conf__.properties
Container [pid=18592,containerID=container_1533224154849_0007_02_000001] is running 208308736B beyond the 'VIRTUAL' memory limit. Current usage: 151.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Which JVM params should I tune? How much memory do I need?
from oryx.
Tried setting 6 GB of memory for my VM; it doesn't help.
from oryx.
Good, that's progress. That means it basically works. Can you post your config file here? It'll be easier to see the settings and say what might need increasing. What layer are you running?
from oryx.
I think the latest commit resolved the problem this ticket was about, but we can continue discussion. Resolving ...
from oryx.
Some testing on this end also suggests Spark 2.3 works, and I made one more change that should make it work with Kafka 2. It's enough that I think a 2.7.0 release makes sense: https://github.com/OryxProject/oryx/releases/tag/oryx-2.7.0
from oryx.
I'll test today in a few hours.
from oryx.
Hi! With Oryx 2.7 I still get this error after fine-tuning the JVM memory settings:
2018-08-02 18:16:06 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
at com.cloudera.oryx.lambda.AbstractSparkLayer.buildInputDStream(AbstractSparkLayer.java:195)
at com.cloudera.oryx.lambda.batch.BatchLayer.start(BatchLayer.java:105)
at com.cloudera.oryx.batch.Main.main(Main.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka010.LocationStrategies
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
PATH indicates the components I have:
/home/osboxes/kafka_2.12-2.0.0/bin:/home/osboxes/hadoop-3.1.0/bin:/home/osboxes/spark-2.3.0-bin-hadoop2.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/osboxes/.local/bin:/home/osboxes/bin
from oryx.
That's really weird. That class has been in Spark itself since 2.1. How are you running this?
from oryx.
Well, the same way:
cd ~/kafka_2.12-2.0.0
./bin/zookeeper-server-start.sh config/zookeeper.properties&
./bin/kafka-server-start.sh config/server.properties&
cd ~/hadoop-3.1.0
./sbin/start-dfs.sh
./sbin/start-yarn.sh
cd ~/oryx
./oryx-run.sh batch
env settings:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export KAFKA_HOME=/home/osboxes/kafka_2.12-2.0.0
export HADOOP_HOME=/home/osboxes/hadoop-3.1.0
export HADOOP_CONF_DIR=/home/osboxes/hadoop-3.1.0/etc/hadoop
export PATH=$KAFKA_HOME/bin:$HADOOP_HOME/bin:/home/osboxes/spark-2.3.0-bin-hadoop2.7/bin:$PATH
Maybe this is from an older version of the app being deployed to Spark?
How else can I diagnose the problem?
from oryx.
OK, and what's your Spark cluster like? What does spark-submit connect you to? I'm trying to figure out how an old version of Spark might be in play here -- is that possible?
Is it possible there's any earlier error?
from oryx.
I see a lot of this right now, after removing the old version of Spark and the old applications uploaded there:
2018-08-04 03:33:42 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:33:48 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:33:54 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:00 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:06 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:12 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-08-04 03:34:18 INFO Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
What is meant to be on 8032?
from oryx.
That's going to be the YARN resource manager. Is that running?
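A quick sanity check, assuming the Hadoop binaries are on the PATH:
jps                # a healthy single-node setup should list ResourceManager and NodeManager
yarn node -list    # asks the resource manager for its registered node managers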
from oryx.
Yes, it had an incorrect port; I fixed it. The default port from the distribution is different and doesn't work. Also, the Hadoop cluster seems to have some internal conflicts; I'm fixing that.
from oryx.
After fixing this YARN error, reformatting, and restarting Hadoop, I still get this ClassNotFound exception:
2018-08-04 05:58:46 INFO ZooKeeper:684 - Session: 0x1000005b82f0009 closed
2018-08-04 05:58:46 INFO BatchLayer:166 - Shutting down Spark Streaming; this may take some time
2018-08-04 05:58:46 WARN StreamingContext:66 - StreamingContext has not been started yet
2018-08-04 05:58:46 INFO ClientCnxn:512 - EventThread shut down
2018-08-04 05:58:46 INFO AbstractConnector:318 - Stopped Spark@1e6308a9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-08-04 05:58:46 INFO SparkUI:54 - Stopped Spark web UI at http://master:4040
2018-08-04 05:58:46 ERROR Client:91 - Failed to contact YARN for application application_1533376242674_0002.
java.io.InterruptedIOException: Call interrupted
at org.apache.hadoop.ipc.Client.call(Client.java:1469)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:430)
at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:300)
at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1053)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:109)
2018-08-04 05:58:46 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FAILED!
2018-08-04 05:58:46 INFO SparkContext:54 - SparkContext already stopped.
2018-08-04 05:58:46 ERROR TransportClient:233 - Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 INFO SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
2018-08-04 05:58:46 ERROR Utils:91 - Uncaught exception in thread main
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:707)
at org.apache.spark.streaming.api.java.JavaStreamingContext.stop(JavaStreamingContext.scala:601)
at com.cloudera.oryx.lambda.batch.BatchLayer.close(BatchLayer.java:167)
at com.cloudera.oryx.batch.Main.main(Main.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to send RPC 7863554702237191545 to /192.168.1.106:51244: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 05:58:46 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-08-04 05:58:46 INFO MemoryStore:54 - MemoryStore cleared
2018-08-04 05:58:46 INFO BlockManager:54 - BlockManager stopped
2018-08-04 05:58:46 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-08-04 05:58:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-08-04 05:58:46 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
at com.cloudera.oryx.lambda.AbstractSparkLayer.buildInputDStream(AbstractSparkLayer.java:195)
at com.cloudera.oryx.lambda.batch.BatchLayer.start(BatchLayer.java:105)
at com.cloudera.oryx.batch.Main.main(Main.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka010.LocationStrategies
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Shutdown hook called
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-275a777e-8864-448c-8543-09109606876d
2018-08-04 05:58:46 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e5ffdd9a-e36d-4525-b105-e61e3def4de0
from oryx.
What version of Spark does spark-submit run? Are you sure it is running the version you think?
What is your cluster like - a distro or run from vanilla binary releases?
from oryx.
Well, ps fax indicates that the version (/home/osboxes/spark-2.3.0-bin-hadoop2.7) is the expected one, because I've deleted the older version of Spark:
6348 pts/0 S 0:00 | \_ bash ./oryx-run.sh batch
6403 pts/0 Sl 0:00 | | \_ /usr/lib/jvm/java-1.8.0-openjdk/bin/java -cp /home/osboxes/spark-2.3.0-bin-hadoop2.7/conf/:/home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/*:/home/osboxes/hadoop-3.1.0/etc/hadoop/ -Xmx1g -Dconfig.file=oryx.conf org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --conf spark.logConf=true --conf spark.driver.memory=1g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.ui.port=4040 --conf spark.io.compression.codec=lzf --conf spark.executor.extraJavaOptions=-Dconfig.file=oryx.conf --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --conf spark.driver.extraJavaOptions=-Dconfig.file=oryx.conf --class com.cloudera.oryx.batch.Main --name OryxBatchLayer-ALSExample --files oryx.conf --executor-memory 4g --executor-cores 8 --num-executors 4 oryx-batch-2.7.0.jar
I have virtualbox with Centos 7 from this image: https://www.osboxes.org/centos/
I've used this rpm to install scala: https://downloads.lightbend.com/scala/2.11.2/scala-2.11.2.rpm
I have downloaded hadoop-3.1.0, kafka_2.12-2.0.0, and spark-2.3.0-bin-hadoop2.7 there from their Apache websites and followed these instructions to install them:
https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/SingleCluster.html
https://kafka.apache.org/quickstart
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
I'm using Oryx 2.7 from the releases section of this GitHub repo.
from oryx.
Hm, I'm confused now. I downloaded the Spark distro and I don't see the Kafka integration classes in the jars/ dir. There should be an assembly for it. I'm looking into that on the Spark side; I hope I'm just missing something obvious. But it would explain why the class isn't found!
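To double-check on your side, something along these lines (using the install path from your message) should show whether the class is on Spark's default classpath at all:
ls /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/ | grep -i kafka
# or scan each jar for the missing class
for j in /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/*.jar; do
  unzip -l "$j" | grep -q LocationStrategies && echo "$j"
done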
from oryx.
Tracking the possible Spark distro problem at https://issues.apache.org/jira/browse/SPARK-25026 . In the meantime, I manually built the spark-streaming-kafka JAR for 2.3.0 and posted it here: https://drive.google.com/open?id=1WJhdkjrwdGF02eE1ePUhafoAnc-VFkk2 You could try copying that into your jars/ directory and running again. If it works, it kind of confirms the problem is what I think in Spark.
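Roughly, assuming the downloaded file is named something like spark-streaming-kafka-0-10_2.11-2.3.0.jar (the actual name may differ):
cp spark-streaming-kafka-0-10_2.11-2.3.0.jar /home/osboxes/spark-2.3.0-bin-hadoop2.7/jars/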
from oryx.
Update @stiv-yakovenko -- I think the answer is, for better or worse it hasn't shipped with Spark and won't. The Oryx app needs to include this code. I think this was never obvious before because other JARs on the classpath pulled in the Kafka integration, and/or simply this was almost never tested in standalone mode, instead tested on a cluster from a distro.
Anyway, I have new binaries you can try (ignore the 2.8.0 version):
Batch
https://drive.google.com/open?id=1JmuERkZULvO6c7iC0zuMRzNOUfZA7sX9
Speed
https://drive.google.com/open?id=1usSYp8wvgPZeJKycad_hWZiZhyUoTBph
Serving
https://drive.google.com/open?id=1ioDzKs5eMmlx-5iRfs8pti02oTtZH-U-
from oryx.
Will this be in 2.9.0?
from oryx.
If this is the problem, I will make a 2.7.1.
from oryx.
Well, this ClassNotFound exception is gone. But I still get another exception, which happens after the application has been uploaded:
5115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 5869987425115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 INFO SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
2018-08-04 18:51:08 ERROR Utils:91 - Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 5869987425115777883 to /192.168.1.106:60192: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-08-04 18:51:08 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-08-04 18:51:08 INFO MemoryStore:54 - MemoryStore cleared
2018-08-04 18:51:08 INFO BlockManager:54 - BlockManager stopped
2018-08-04 18:51:08 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-08-04 18:51:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-08-04 18:51:08 INFO SparkContext:54 - Successfully stopped SparkContext
[2018-08-04 18:51:23,534] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 8 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
from oryx.
Hm, that's good, but this error is hard to figure out. Sounds like Spark can't talk to the workers run by YARN. Is YARN healthy and running? I wonder because I know you're running on a small VM, and I'm not sure how effectively all these services fit into those resources.
from oryx.
Hi!
I've added these options to the YARN config:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Now the first crash is gone, but I still have this error:
2018-08-05 06:03:20 ERROR Utils:91 - Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:566)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1752)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 5781682026554577140 to /192.168.1.106:60798: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
I have searched through all the configs in my home folder, but none of them contained port 60798. Probably this port is hardcoded somewhere; I wish I knew what it is. No one is listening on this port.
from oryx.
Yeah, I see why you did that: to make YARN less likely to kill things for running out of memory. It doesn't really fix the fact that you're running many big services in a small VM. I think the errors here still indicate the YARN daemons aren't running. What you'd really want to do is turn down the memory that the services use, and turn down the memory that the Oryx apps use. That would help it fit.
Really, this is meant for at least a small cluster, and it is indeed hard to set up a whole small cluster just to run it. Usually the deployment scenario is an existing cluster from a distro or something.
from oryx.
Memory is not an issue at this step; I have enough memory in my VM. It was before, but now it's this last error. Oryx wants to access port 60798, and I have no idea what service should listen there. I haven't found it in any config file, so it looks like it's hardcoded somewhere in the source code. How can I fix this error?
from oryx.
That's almost certainly the app's application master run by YARN -- random port chosen by YARN. I think the next question would be why it didn't run or failed to run. That would be more in the YARN logs, which can be tricky to chase down, but should be available from the resource manager UI in YARN.
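If the UI is hard to reach, the same container logs can usually be pulled from the command line once the application has exited (this needs YARN log aggregation enabled), using the application ID from your output, e.g.:
yarn logs -applicationId application_1533376242674_0002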
from oryx.
I've managed to start the batch and speed layers without exceptions, finally.
But the serving layer crashes with an exception:
Aug 05, 2018 12:46:17 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class [com.cloudera.oryx.lambda.serving.ModelManagerListener]
java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
from oryx.
Can you please give a hint: has the speed layer started?
I also have this in my logs:
2018-08-05 12:45:22 INFO Client:54 -
client token: N/A
diagnostics: [Sun Aug 05 12:45:21 -0400 2018] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1533487521782
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1533487265246_0002/
user: osboxes
Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>
How can I fix this?
from oryx.
That means it hasn't started. There isn't enough memory available in YARN to meet the request. I think you'll want to turn down the amount of memory the config file requests. Looks like you are asking for 6GB of memory?
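For reference, those requests come from the *.streaming sections of the config; trimming them in oryx.conf would look roughly like this (values illustrative only, mirroring the structure of the dump earlier in the thread):
oryx.batch.streaming {
  num-executors = 1
  executor-cores = 1
  executor-memory = "1g"
  driver-memory = "1g"
}
oryx.speed.streaming {
  num-executors = 1
  executor-memory = "1g"
}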
I'll look at the other error; looks like it thinks it needs ZK classes now too. I didn't see that when I started it but didn't test it too much.
from oryx.
Actually, regarding the ClassNotFoundError -- I correct myself. Zookeeper and Hadoop dependencies are still meant to be provided by the runtime environment (cluster), not packaged. This makes it easier to avoid conflicts with what's on the cluster.
The compute-classpath.sh script is set up to find the JAR files in the usual locations in CDH. However, that won't match your setup, which is just unpacked binary releases. You will just need to modify that script to search the jars directories of your Hadoop install and Kafka install. Have a look, it's pretty straightforward; you're just going to list different path glob expressions than the ones you see there.
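For example, for plain unpacked releases the additions end up being absolute jar paths, along these lines (exact jar names and versions will differ per install):
echo "/home/osboxes/kafka_2.12-2.0.0/libs/zookeeper-3.4.13.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/hadoop-common-3.1.0.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/commons-collections-3.2.2.jar"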
from oryx.
Regarding memory: I've updated my VirtualBox VM to provide 12 GB of memory. I see the following in YARN:
So I don't understand why it can't allocate more memory. I'll try to pass the missing jar file to the path.
from oryx.
What's your config? It will dump the whole effective Oryx config when it starts up, or should. That will show how much memory it's requesting. It's asking for a driver and executors; it may be hard to fit that into 12GB with everything else. I would have thought you needed a VM of 32GB RAM or so to make it pretty easy to run all these services.
from oryx.
This is what oryx-run.sh outputs when it starts:
oryx {
als {
decay {
factor=1
zero-threshold=0
}
hyperparams {
alpha=1
epsilon=1.0E-5
features=10
lambda=0.001
}
implicit=true
iterations=10
logStrength=false
no-known-items=false
rescorer-provider-class=null
sample-rate=1
}
batch {
storage {
data-dir="hdfs:///user/example/Oryx/data/"
key-writable-class="org.apache.hadoop.io.Text"
max-age-data-hours=-1
max-age-model-hours=-1
message-writable-class="org.apache.hadoop.io.Text"
model-dir="hdfs:///user/example/Oryx/model/"
}
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="1g"
dynamic-allocation=false
executor-cores=1
executor-memory="1g"
generation-interval-sec=300
master=yarn
num-executors=1
}
ui {
port=4040
}
update-class="com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
}
default-streaming-config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
id=ALSExample
input-schema {
categorical-features=null
feature-names=[]
id-features=[]
ignored-features=[]
num-features=0
numeric-features=null
target-feature=null
}
input-topic {
broker="127.0.0.1:9092"
lock {
master="127.0.0.1:2181"
}
message {
key-class="java.lang.String"
key-decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
message-class="java.lang.String"
message-decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
topic=OryxInput
}
}
kmeans {
evaluation-strategy=SILHOUETTE
hyperparams {
k=10
}
initialization-strategy="k-means||"
iterations=30
runs=3
}
ml {
eval {
candidates=1
hyperparam-search=random
parallelism=1
test-fraction=0.1
threshold=null
}
}
rdf {
hyperparams {
impurity=entropy
max-depth=8
max-split-candidates=100
min-info-gain-nats=0.001
min-node-size=16
}
num-trees=20
}
serving {
api {
context-path="/"
key-alias=null
keystore-file=null
keystore-password=*****
password=*****
port=8080
read-only=false
secure-port=443
user-name=null
}
application-resources="com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
memory="4000m"
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
no-init-topics=false
yarn {
cores="4"
instances=1
}
}
speed {
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="512m"
dynamic-allocation=false
executor-cores=4
executor-memory="1g"
generation-interval-sec=10
master=yarn
num-executors=2
}
ui {
port=4041
}
}
update-topic {
broker="127.0.0.1:9092"
lock {
master="127.0.0.1:2181"
}
message {
decoder-class="org.apache.kafka.common.serialization.StringDeserializer"
encoder-class="org.apache.kafka.common.serialization.StringDeserializer"
max-size=16777216
topic=OryxUpdate
}
}
}
This is my configuration file:
kafka-brokers = "127.0.0.1:9092"
zk-servers = "127.0.0.1:2181"
hdfs-base = "hdfs:///user/example/Oryx"
oryx {
id = "ALSExample"
als {
rescorer-provider-class = null
}
input-topic {
broker = ${kafka-brokers}
lock = {
master = ${zk-servers}
}
}
update-topic {
broker = ${kafka-brokers}
lock = {
master = ${zk-servers}
}
}
batch {
streaming {
generation-interval-sec = 300
num-executors = 1
executor-cores = 1
executor-memory = "1g"
}
update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
storage {
data-dir = ${hdfs-base}"/data/"
model-dir = ${hdfs-base}"/model/"
}
ui {
port = 4040
}
}
speed {
model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
ui {
port = 4041
}
}
serving {
model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
application-resources = "com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
api {
port = 8080
}
}
}
This is the YARN config, just in case:
<configuration>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>0.0.0.0:8032</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>130</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>1</value>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers</description>
</property>
</configuration>
I just need one executor for each service, so 3 GB would be enough. I start the batch layer first. But the serving layer, which is second, doesn't start and doesn't allocate any resources, even though 5 GB is still left.
from oryx.
The actual config logged by the app says: Batch starts a 1GB driver and one 1GB executor. It will request more than 1GB in each case, to add some overhead for off-heap memory. But your config shows asking for 4 x 4GB executors.
I also see something is running in both cases; there's a batch app with several containers that says it is "RUNNING". The second one isn't allocating.
I wonder if some wires are crossed with the config?
I think the issue here is getting YARN set up. There are a number of configs to look at. One thing you might look at is its minimum container memory allocation size, the maximum, and the increment. It may be forcing YARN to allocate more memory.
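For reference, the allocation granularity lives in yarn-site.xml; the standard property names look like this (values illustrative only):
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>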
from oryx.
How can I reduce this number? Where does it come from?
from oryx.
You can just set it in your config file in the same location. This won't matter unless you are trying to run the serving layer via YARN. You don't have to; you can just run the binary.
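Concretely, mirroring the structure in the dump above, something like this in oryx.conf should override the serving layer's request (values illustrative, not a recommendation):
oryx.serving {
  memory = "500m"
  yarn {
    instances = 1
    cores = "1"
  }
}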
from oryx.
I tried to reduce the number of executors to 1 to reduce memory consumption, but I still see 2 containers being allocated.
This is what oryx-run.sh outputs:
serving {
api {
context-path="/"
key-alias=null
keystore-file=null
keystore-password=*****
password=*****
port=8080
read-only=false
secure-port=443
user-name=null
}
application-resources="com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
memory="4000m"
min-model-load-fraction=0.8
model-manager-class="com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
no-init-topics=false
streaming {
executor-cores=1
executor-memory="300m"
generation-interval-sec=300
num-executors=1
}
yarn {
cores="1"
instances=1
}
}
./oryx-run.sh speed
spark-submit --master yarn --deploy-mode client --name OryxSpeedLayer-ALSExample --class com.cloudera.oryx.speed.Main --files oryx.conf --driver-memory 512m --driver-java-options "-Dconfig.file=oryx.conf" --executor-memory 500m --executor-cores 1 --conf spark.executor.extraJavaOptions="-Dconfig.file=oryx.conf" --conf spark.ui.port=4041 --conf spark.io.compression.codec=lzf --conf spark.logConf=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --num-executors=1 oryx-speed-2.8.0-SNAPSHOT.jar
And after that there is not enough resource for the batch layer, even with 6 GB of memory and 6 vcores free. Why is that? Why is the app deployed on 2 containers?
from oryx.
There will always be at least 2 containers in a Spark app -- a driver and an executor. One job is running fine here; is it the speed layer? I'm still not clear whether your batch config is correct, because previously it showed requesting 4 x 4GB executors.
from oryx.
Yes, it was the speed layer. So now I start the batch layer like this:
[osboxes@master oryx2.7]$ ./oryx-run.sh batch
spark-submit --master yarn --deploy-mode client --name OryxBatchLayer-ALSExample --class com.cloudera.oryx.batch.Main --files oryx.conf --driver-memory 1g --driver-java-options "-Dconfig.file=oryx.conf" --executor-memory 500m --executor-cores 1 --conf spark.executor.extraJavaOptions="-Dconfig.file=oryx.conf" --conf spark.ui.port=4040 --conf spark.io.compression.codec=lzf --conf spark.logConf=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.speculation=true --conf spark.ui.showConsoleProgress=false --conf spark.dynamicAllocation.enabled=false --num-executors=1 oryx-batch-2.8.0-SNAPSHOT.jar
This is the relevant part of the config:
batch {
storage {
data-dir="hdfs:///user/example/Oryx/data/"
key-writable-class="org.apache.hadoop.io.Text"
max-age-data-hours=-1
max-age-model-hours=-1
message-writable-class="org.apache.hadoop.io.Text"
model-dir="hdfs:///user/example/Oryx/model/"
}
streaming {
config {
spark {
io {
compression {
codec=lzf
}
}
logConf=true
serializer="org.apache.spark.serializer.KryoSerializer"
speculation=true
ui {
showConsoleProgress=false
}
}
}
deploy-mode=client
driver-memory="1g"
dynamic-allocation=false
executor-cores=1
executor-memory="500m"
generation-interval-sec=300
master=yarn
num-executors=1
}
ui {
port=4040
}
update-class="com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
yarn {
cores="1"
instances=1
}
}
So if I start the speed layer first and the batch layer second, then I see this message during batch layer start:
client token: N/A
diagnostics: [Sun Aug 05 17:06:06 -0400 2018] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1533503166625
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1533502694403_0002/
user: osboxes
from oryx.
This confuses me: it turns out I have 6 GB free, I need only a little of it, and I can't use it:
from oryx.
It could be that the request is actually for more than 6GB. Or that you're hitting a resource limit for your queue, not the whole YARN cluster. In fact, that seems to be the issue in the error. Above, is "Maximum Allocation" the limit for your job submission queue? It says 2GB.
from oryx.
I tried to set this limit to 4GB, but I still have the problem:
from oryx.
Hm, here we're really into YARN configuration and troubleshooting. I'm not a YARN expert; I might be able to spot something if I were looking over your shoulder. I think the last error message is the best pointer; something to do with limits on how much resource you're allowed to use? The output here says it thinks it's waiting on more resources for some reason, for sure. It's a question of why.
from oryx.
I've managed to launch it finally, by changing this option: the fix was to edit ~/hadoop-3.1.0/etc/hadoop/capacity-scheduler.xml and update 0.1 to 1:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
Also, I had to patch the end of compute-classpath.sh, otherwise it doesn't work:
echo "/home/osboxes/kafka_2.12-2.0.0/libs/zookeeper-3.4.13.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/hadoop-common-3.1.0.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/woodstox-core-5.0.3.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/stax2-api-3.1.4.jar"
echo "/home/osboxes/hadoop-3.1.0/share/hadoop/common/lib/commons-collections-3.2.2.jar"
Now I have a new problem. Even though I observed the service working once, I now get error 503 or HTTP code 0, and I don't know how to diagnose it. Where is the log file for the web layer?
from oryx.
OK, I figured it was some YARN config. I am not so familiar with YARN 3.x settings myself.
Good to know about the change to compute-classpath.sh; this is only for the serving layer, thankfully, but is unfortunately highly distribution-specific.
503 just means there's no model yet. Once you feed it data and see the batch layer publish a model, it should load and serve results. The logs just go to stdout.
Example docs here, which should help get some data fed in for testing: http://oryx.io/docs/endusers.html
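For example, a minimal ingest call along the lines of those docs, assuming the serving layer is local on port 8080 as in your config:
wget --quiet --post-file data.csv --output-document - \
  --header "Content-Type: text/csv" \
  http://localhost:8080/ingest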
from oryx.
Hi! Even though I was able to feed data.csv into the service, I can't add extra data to the existing model.
I've created CSV file of two lines:
19611,24211,3
18611,30211,3
These userIds and itemIds are new; they never appeared in data.csv. I use the same wget command to send this new data to /ingest, but I can't see these new IDs here:
Searched with Ctrl-F in the browser.
Should IDs be long numbers, or can they be any string UIDs?
Does /because/uId/itId update the recommendation database, or does it just recommend?
from oryx.
They can be any ID. Are you running at least the speed layer? One key point is that the serving layer doesn't update anything itself, but waits for the speed layer (or batch) to send it updates to the model. Until that info makes it to the speed layer, it wouldn't reflect any updates.
Also, I'd have to check the code on this, but are both the user and item unknown in each of the new lines you're sending? That may cause them to do nothing. There's no update it can fold in to the model if a new user likes a new item. (Eventually, of course, the batch layer would be able to do something with it.) You might try a new user with a known item if you're just observing short-term updates via the speed layer.
from oryx.
I have all 3 layers running. Are you saying that I need to wait 5 minutes for /ingest to update the model? Or is there some other way I should supply new users and items?
from oryx.
You'd need to wait for updates to be computed and reach the serving layer. Yeah, if your speed layer batch interval is about 5 minutes, then about 5 minutes. You can reduce that interval down to 10 seconds or so without any issue. It only gets tough below a couple of seconds.
You might watch the traffic on the two Kafka topics (see the 'tail' option in the oryx-run.sh script for help) and just make sure the data is going back and forth as expected too.
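For example, dropping the speed layer interval in oryx.conf (value illustrative; the key mirrors the config dump earlier in the thread):
oryx.speed.streaming.generation-interval-sec = 10
and then watching the topics with the script's tail mode (check ./oryx-run.sh's usage text for the exact option name; kafka-tail is a guess here):
./oryx-run.sh kafka-tail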
from oryx.
What if a new userId accesses an itemId? Does this mean we can't recommend something to them immediately?
from oryx.
Finally the new IDs have arrived in the model. Thank you so much for your support.
from oryx.
One more question, @srowen. I've seen that each jar has its own Java main function as an entry point. Can they work without Spark, just connecting to Kafka and HDFS?
from oryx.
Batch and speed require Spark. These apps are submitted with spark-submit. main() is the entry point that Spark uses.
from oryx.
Related Issues (20)
- Kafka 0.11.x support HOT 27
- Customer ID update HOT 2
- Multiple jobs on oryx HOT 3
- Update to Tomcat 9 / Jersey 2.26 HOT 1
- Error while runnig Oryx batch HOT 1
- java.lang.NoClassDefFoundError: scala/collection/Seq HOT 3
- 2.7.0 - Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies HOT 1
- mvn install error HOT 1
- SpeedLayer read kafka offset HOT 1
- KryoException: Buffer overflow. Available: 1, required: 4 HOT 19
- Trouble handling huge data HOT 3
- Reseting model HOT 3
- java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String HOT 22
- /knownItems/uid doesn't return all items HOT 50
- Can't build oryx 8.0 HOT 8
- Recommend products FROM <IDs> HOT 1
- Oryx tutorial HOT 2
- Application is added to the scheduler and is not yet activated. HOT 2
- question about "fold-in" implementation HOT 2