angel's Introduction

(ZH-CN Version)

Angel is a high-performance distributed machine learning and graph computing platform based on the Parameter Server philosophy. It has been tuned for performance on Tencent's big data, is stable and widely applicable, and shows a growing advantage as model dimensionality increases. Angel is jointly developed by Tencent and Peking University, balancing the high availability required in industry with innovation from academia.

Following a model-centered design, Angel partitions the parameters of complex models across multiple parameter-server nodes. It implements a variety of machine learning and graph algorithms using efficient model-updating interfaces and functions, together with a flexible consistency model for synchronization.

Angel is developed in Java and Scala and runs on Yarn. Through the PS Service abstraction, it supports Spark on Angel. Support for graph computing and deep learning frameworks is under development and will be released in the future.

We welcome everyone interested in machine learning or graph computing to contribute code and to create issues or pull requests. Please refer to the Angel Contribution Guide for more details.

Introduction to Angel

Design

Quick Start

Deployment

Programming Guide

Algorithm

Community

FAQ

Papers

  1. PaSca: A Graph Neural Architecture Search System under the Scalable Paradigm. WWW, 2022.
  2. Graph Attention Multi-Layer Perceptron. KDD, 2022.
  3. Node Dependent Local Smoothing for Scalable Graph Learning. NeurIPS, 2021.
  4. PSGraph: How Tencent trains extremely large-scale graphs with Spark? ICDE, 2020.
  5. DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions. SIGMOD, 2018.
  6. LDA*: A Robust and Large-scale Topic Modeling System. VLDB, 2017.
  7. Heterogeneity-aware Distributed Parameter Servers. SIGMOD, 2017.
  8. Angel: a new large-scale machine learning system. National Science Review (NSR), 2017.
  9. TencentBoost: A Gradient Boosting Tree System with Parameter Server. ICDE, 2017.

angel's People

Contributors

andyyehoo, biaoma-ty, bluesjjw, dependabot[bot], duoyu119, ericzhang-cn, facaiy, flyingqq, hbghhy, howiehywang, jyswpp, leleyu, lengfeng343, liuzhanhao, ljch2018, luxin0123, lxs121, ouyangwen-it, paynie, rachelsunrh, ryantaocer, sagewe, shunanzhang, taaan, wangcaihua, xs-li, xuehuanran, yundai424, zunwenyou, zwt233

angel's Issues

Runtime error, and I cannot tell what went wrong

Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
at com.tencent.angel.client.yarn.AngelYarnClient.updateMaster(AngelYarnClient.java:493)
at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:165)
at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)
com.tencent.angel.exception.AngelException: java.io.IOException: Failed to run job : Application application_1496819712324_39810 failed 2 times due to AM Container for appattempt_1496819712324_39810_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://103.26.158.31:54315/proxy/application_1496819712324_39810/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1496819712324_39810_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:171)
at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)

BUILD FAILURE

[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-on-angel: There are test failures -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project spark-on-angel: There are test failures
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures
at org.scalatest.tools.maven.TestMojo.execute(TestMojo.java:107)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Connection refused exceptions occur in local mode

ERROR transport.MatrixTransportClient : get channel failed
java.net.ConnectException: Connection refused: no further information: /10.13.64.176:26782
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:338)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
at java.lang.Thread.run(Thread.java:748)

Does angel support the outputPath without hdfs prefix?

When the paths are specified without an hdfs prefix, for example

./bin/angel-submit \
    --action.type train \
    --angel.app.submit.class com.tencent.angel.ml.classification.lr.LRRunner  \
    --angel.train.data.path /usr/foo/data \
    --angel.save.model.path /usr/foo/model

Angel always fails with the following exception, which gives no hint about the cause:

Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
        at org.apache.hadoop.util.Shell.run(Shell.java:456)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

So I checked all the logs on Yarn and noticed that the names of the output directories start with null.

17/07/03 15:02:02 INFO utils.HdfsUtil: tmp output dir is null:///tmp/application_1498535347111_0403_6bb79e70-16df-40a5-8207-1181b7bea2e9
17/07/03 15:02:02 INFO utils.HdfsUtil: tmp output dir is null:///tmp/application_1498535347111_0403_f2b40825-7a27-42bd-b983-164ff3342026
17/07/03 15:02:02 INFO client.AngelClient: angel.tmp.output.path=null:/tmp/f/application_1498535347111_0403_6bb79e70-16df-40a5-8207-1181b7bea2e9
17/07/03 15:02:02 INFO client.AngelClient: internal state file is null:/tmp/application_1498535347111_0403_f2b40825-7a27-42bd-b983-164ff3342026/state
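
One way a `null:` prefix like the one in these log lines can arise is that a path given without a prefix has no URI scheme, and the missing scheme is then concatenated into a string. This is an assumption for illustration only; the HdfsUtil code is not shown here. The sketch below uses only `java.net.URI`. The `qualify` helper and the default scheme `hdfs` are hypothetical names, not part of Angel.

```java
import java.net.URI;

final class PathSchemeSketch {
    // Hypothetical helper: keep a path that already has a scheme,
    // otherwise prepend the given default scheme.
    static String qualify(String path, String defaultScheme) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return path;
        }
        return defaultScheme + "://" + path;
    }

    public static void main(String[] args) {
        URI bare = URI.create("/usr/foo/model");
        // getScheme() is null here, and string concatenation renders it as "null".
        System.out.println(bare.getScheme() + "://" + "/tmp/app"); // null:///tmp/app
        System.out.println(qualify("/usr/foo/model", "hdfs"));      // hdfs:///usr/foo/model
    }
}
```

Checking for a missing scheme before building derived paths would at least allow a clear error message instead of a container launch failure.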

Can not build from source code.

I tried to build the source code in IDEA by cloning it as a Maven project, and I get the following errors when running the example code com.tencent.angel.example.GBDTLocalExample:

F:\github_projects\angel\angel-ps\core\src\main\java\com\tencent\angel\ml\model\PSModel.scala
Error:(14, 35) object generated is not a member of package com.tencent.angel.protobuf
import com.tencent.angel.protobuf.generated.MLProtos
Error:(327, 27) not found: value MLProtos
  def setRowType(rowType: MLProtos.RowType) {
F:\github_projects\angel\angel-ps\core\src\main\java\com\tencent\angel\ml\matrix\MatrixContext.java
Error:(22, 8) object generated is not a member of package com.tencent.angel.protobuf
import com.tencent.angel.protobuf.generated.MLProtos;
Error:(230, 26) not found: value MLProtos
  public void setRowType(MLProtos.RowType rowType) {

However, the package com.tencent.angel.protobuf.generated really does not exist. How can I fix this?

Worker keeps trying to connect to the standby RM (resource manager) and hangs

We run the LRRunner demo with the a9a.train data. The Bash script is as follows.

#!/bin/bash
./angel-submit \
    --cluster c3prc-hadoop \
    --action.type train \
    --angel.app.submit.class "com.tencent.angel.ml.classification.lr.LRRunner" \
    --angel.train.data.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim_train_data/a9a.train" \
    --angel.log.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim/log" \
    --angel.save.model.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim/model" \
    --queue root.production.miui_group.miui_ad.queue_algo \
    --ml.epoch.num 10 \
    --ml.batch.num 10 \
    --ml.feature.num 1024 \
    --ml.validate.ratio 0.1 \
    --ml.data.type libsvm \
    --ml.learn.rate 1 \
    --ml.learn.decay 0.1 \
    --ml.reg.l2 0 \
    --angel.workergroup.number 50 \
    --angel.worker.memory.gb 10 \
    --angel.worker.task.number 4 \
    --angel.ps.number 20 \
    --angel.ps.memory.gb 5 \
    --angel.job.name LR_SAMPLE

At the beginning, the iterations go well. But after a few iterations (more or less), the workers hang and keep trying to connect to the standby RM, and the whole computation is stuck.

The head of the worker thread stack log looks like this:

threadid: 121196 threadname: IPC Client (868889379) connection to c3-hadoop-prc-ct05.bj/10.108.84.32:11200 from work threadstate: TIMED_WAITING
java.lang.Object.wait(Native Method)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:928)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:973)

PS: our Yarn RM configuration is

yarn.resourcemanager.ha.enabled true
yarn.resourcemanager.ha.rm-ids rm0,rm1
yarn.resourcemanager.resource-tracker.address c3-hadoop-prc-ct04.bj:22303
yarn.resourcemanager.resource-tracker.address.rm0 c3-hadoop-prc-ct04.bj:22303
yarn.resourcemanager.resource-tracker.address.rm1 c3-hadoop-prc-ct05.bj:22303
yarn.resourcemanager.resource-tracker.client.thread-count 50
yarn.resourcemanager.scheduler.address c3-hadoop-prc-ct04.bj:22302
yarn.resourcemanager.scheduler.address.rm0 c3-hadoop-prc-ct04.bj:22302
yarn.resourcemanager.scheduler.address.rm1 c3-hadoop-prc-ct05.bj:22302
yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.client.thread-count 50

[question] stop method of MatricesCache

The stop method of MatricesCache interrupts the syncer thread if it exists. However, Thread.sleep clears the interrupt flag of the syncer thread when it throws InterruptedException, and the loop swallows that exception, so the check !Thread.interrupted() is always true. (The stop flag is set, though, so the syncer thread will still stop.)
Is the thread interrupt status check meant for syncPolicy.sync, which can be interrupted?

      while (!stopped.get() && !Thread.interrupted()) {
        syncPolicy.sync(PSAgentContext.get().getMatricesCache());
        try {
          Thread.sleep(syncTimeIntervalMS);
        } catch (InterruptedException e) {
          // Swallowed: the interrupt flag has already been cleared at this point.
        }
      }
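
The behaviour in question can be reproduced outside Angel. The following is a minimal sketch, not Angel code: the class and method names are made up for illustration. It shows that the interrupt status is `false` inside the `catch` block after `Thread.sleep` throws, and that calling `Thread.currentThread().interrupt()` restores it so that a later status check can see it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptFlagSketch {
    // Interrupts a sleeping thread and returns the interrupt status that the
    // thread observes after its catch block has run.
    static boolean flagAfterCatch(boolean restore) {
        AtomicBoolean observed = new AtomicBoolean();
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // sleep() cleared the interrupt status before throwing.
                if (restore) {
                    Thread.currentThread().interrupt(); // set the status again
                }
            }
            observed.set(Thread.currentThread().isInterrupted());
        });
        t.start();
        t.interrupt();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed.get();
    }

    public static void main(String[] args) {
        System.out.println(flagAfterCatch(false)); // false
        System.out.println(flagAfterCatch(true));  // true
    }
}
```

With the flag restored in the `catch` block, a loop condition that tests the interrupt status would exit even without a separate stop flag.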

Does logistic regression of angel use sparsity in data?

When I read the logistic regression code, I found that Angel seems to fetch the entire weight vector on the worker and also creates a dense gradient vector of the same length.

LRLearner.scala: 72L

    val batchGD = GradientDescent.miniBatchGD(trainData, lrModel.weight, lr, logLoss,
      batchSize, batchNum)

GradientDescent.scala: 35L

    //Pull model from PS Server
    val w = model.getRow(0)
    ............
    for (batch: Int <- 1 to batchNum) {
      ......
      val grad = new DenseDoubleVector(w.getDimension)

DenseDoubleVector.java: 68L

  public DenseDoubleVector(int dim) {
    ....
    this.values = new double[dim];
    .....
  }

As shown above, Angel's logistic regression does not exploit sparsity to fetch only the non-zero entries, if I understand correctly.
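
For comparison, here is a sketch of what a sparsity-aware gradient for a single sample could look like. It is not Angel's implementation: the class, the method names, and the use of a `Map` as the sparse container are assumptions made for illustration. The gradient of the logistic loss for a sample with label in {0, 1} is (sigmoid(w·x) − y)·x, so only the indices that are non-zero in x need a weight lookup or a gradient entry.

```java
import java.util.HashMap;
import java.util.Map;

final class SparseGradientSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // idx/val hold the non-zero features of one sample; w holds only the
    // weights that were fetched (missing entries are treated as 0).
    static Map<Integer, Double> gradient(int[] idx, double[] val,
                                         Map<Integer, Double> w, double label) {
        double dot = 0.0;
        for (int i = 0; i < idx.length; i++) {
            dot += w.getOrDefault(idx[i], 0.0) * val[i];
        }
        double err = sigmoid(dot) - label;
        Map<Integer, Double> grad = new HashMap<>();
        for (int i = 0; i < idx.length; i++) {
            grad.merge(idx[i], err * val[i], Double::sum);
        }
        return grad;
    }

    public static void main(String[] args) {
        // Two non-zero features in a space of over a billion dimensions.
        int[] idx = {3, 1_000_000_000};
        double[] val = {1.0, 2.0};
        Map<Integer, Double> grad = gradient(idx, val, new HashMap<>(), 1.0);
        System.out.println(grad.size());             // 2
        System.out.println(grad.get(3));             // -0.5
        System.out.println(grad.get(1_000_000_000)); // -1.0
    }
}
```

The memory cost here grows with the number of non-zero features per batch rather than with the model dimension, which is the difference the question is about.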

So I have two questions:

  1. It seems that LR in Angel currently cannot support data with super-large dimensions, and its optimizer is gradient descent. As far as I know, logistic regression in spark-ml uses L-BFGS/OWL-QN, which performs better in theory. Am I wrong?

  2. There is another package in Angel: sparselr. Is sparselr designed to handle super-large dimension data? I haven't checked the code there.

Thanks.

BUG: connect to ResourceManager, retry failed.

Hi,
I compiled a package from master branch:
commit dcaa496bef3b8355258a19ac3ec09010299efd40

The program does not work, and its log contains the line "Connecting to ResourceManager at /0.0.0.0:8032".

logs:

17/07/21 12:42:50 INFO utils.AngelRunJar: angelHomePath conf path=/data0/user/yipeng5/target_new/bin/..//conf/angel-site.xml
17/07/21 12:42:50 INFO utils.AngelRunJar: load system config file success
17/07/21 12:42:51 INFO utils.UGITools: UGI_PROPERTY_NAME is null
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data0/user/yipeng5/target_new/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
17/07/21 12:42:51 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
17/07/21 12:42:51 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
17/07/21 12:42:51 WARN conf.Configuration: /usr/local/hadoop/etc/hadoop/yarn-site.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
17/07/21 12:42:51 INFO model.PSModel: After training matrix lr_weight will be saved to hdfs://emr-cluster/user/feed_weibo/yipeng5/model/model_train_data_learn_rate=4.5_reg_l2=1.0_new
17/07/21 12:42:51 INFO utils.UGITools: UGI_PROPERTY_NAME is null
17/07/21 12:42:51 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
17/07/21 12:42:51 WARN conf.Configuration: /usr/local/hadoop/etc/hadoop/yarn-site.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
17/07/21 12:42:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/07/21 12:42:52 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
17/07/21 12:42:53 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/21 12:42:54 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/21 12:42:55 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

By the way, an older package (compiled from commit 3ff4329825fd355118384d5be438dd132f1015c3) did work.

Hence, I guess the bug was introduced by the code changes between the two commits.

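Protobuf version problem

I installed Protobuf as the official documentation requires; the installed version is 3.3 (3.3 > 2.5.0), but compilation fails with the following error: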
Failed to execute goal com.github.igor-petruk.protobuf:protobuf-maven-plugin:0.6.3:run (default) on project angel-ps-core: Protobuf installation version does not match Protobuf library version

What should I do about this?

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-on-angel-core: wrap: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found. -> [Help 1] [ERROR]

This problem occurred during compilation. What is going on?
[INFO] angel .............................................. SUCCESS [ 0.117 s]
[INFO] angel-ps ........................................... SUCCESS [ 1.814 s]
[INFO] angel-ps-core ...................................... SUCCESS [ 20.579 s]
[INFO] angel-ps-psf ....................................... SUCCESS [ 1.961 s]
[INFO] angel-ps-mllib ..................................... SUCCESS [ 10.837 s]
[INFO] angel-ps-examples .................................. SUCCESS [ 4.214 s]
[INFO] spark-on-angel ..................................... SUCCESS [ 2.845 s]
[INFO] spark-on-angel-core ................................ FAILURE [ 0.752 s]
[INFO] spark-on-angel-mllib ............................... SKIPPED
[INFO] spark-on-angel-examples ............................ SKIPPED
[INFO] angel-dist ......................................... SKIPPED

Some unit tests fail because the output directory already exists

Running com.tencent.angel.ml.LDA.LDATest
17/07/07 09:58:58 INFO LDA.LDATest : E:\git\github-branch-1.0.0\angel\angel-ps\mllib
17/07/07 09:58:58 INFO utils.UGITools : UGI_PROPERTY_NAME is null
17/07/07 09:58:58 INFO client.AngelClient : running mode = ANGEL_PS_WORKER
17/07/07 09:58:58 ERROR local.AngelLocalClient : start application failed.
java.io.IOException: output path file:/C:/Users/PAYNIE~1/AppData/Local/Temp/out already exist, please check
at com.tencent.angel.client.AngelClient.setOutputDirectory(AngelClient.java:558)
at com.tencent.angel.client.local.AngelLocalClient.startPSServer(AngelLocalClient.java:126)
at com.tencent.angel.ml.LDA.LDATest.run(LDATest.scala:83)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:59)
at org.junit.internal.runners.MethodRoadie.runTestMethod(MethodRoadie.java:98)
at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:79)
at org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:87)
at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:77)
at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:42)
at org.junit.internal.runners.JUnit4ClassRunner.invokeTestMethod(JUnit4ClassRunner.java:88)
at org.junit.internal.runners.JUnit4ClassRunner.runMethods(JUnit4ClassRunner.java:51)
at org.junit.internal.runners.JUnit4ClassRunner$1.run(JUnit4ClassRunner.java:44)
at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:27)
at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:37)
at org.junit.internal.runners.JUnit4ClassRunner.run(JUnit4ClassRunner.java:42)
at org.mockito.internal.runners.JUnit44RunnerImpl.run(JUnit44RunnerImpl.java:37)
at org.mockito.runners.MockitoJUnitRunner.run(MockitoJUnitRunner.java:62)
at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:138)
at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:163)
at org.apache.maven.surefire.Surefire.run(Surefire.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:244)
at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:814)

workerservice reports that the port is not available when running the local example

17/06/26 20:17:58 INFO master.AngelApplicationMaster : build WorkerManager success
17/06/26 20:17:58 INFO event.AsyncDispatcher : Registering class com.tencent.angel.master.app.AppEventType for class com.tencent.angel.master.app.App
17/06/26 20:17:58 INFO event.AsyncDispatcher : Registering class com.tencent.angel.master.app.AppFinishEventType for class com.tencent.angel.master.AngelApplicationMaster$AppFinishEventHandler
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 25063 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 29086 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 24028 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 29215 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 24873 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 26633 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 21016 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 20667 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 20189 is not available, try agine
17/06/26 20:17:58 ERROR utils.NetUtils : workerservice:port 26848 is not available, try agine
17/06/26 20:17:58 INFO service.AbstractService : Service com.tencent.angel.master.MasterService failed in state INITED; cause: java.io.IOException: can not find a avaliable port for workerservice
java.io.IOException: can not find a avaliable port for workerservice
at com.tencent.angel.utils.NetUtils.chooseAListenPort(NetUtils.java:151)
at com.tencent.angel.master.MasterService.serviceInit(MasterService.java:438)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at com.tencent.angel.master.AngelApplicationMaster.initAndStart(AngelApplicationMaster.java:720)
at com.tencent.angel.localcluster.LocalMaster.run(LocalMaster.java:64)
17/06/26 20:17:58 INFO master.MasterService : WorkerPSService is stoped!
17/06/26 20:17:58 ERROR master.AngelApplicationMaster : Failed startup APPMaster.
org.apache.hadoop.service.ServiceStateException: java.io.IOException: can not find a avaliable port for workerservice
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at com.tencent.angel.master.AngelApplicationMaster.initAndStart(AngelApplicationMaster.java:720)
at com.tencent.angel.localcluster.LocalMaster.run(LocalMaster.java:64)
Caused by: java.io.IOException: can not find a avaliable port for workerservice
at com.tencent.angel.utils.NetUtils.chooseAListenPort(NetUtils.java:151)
at com.tencent.angel.master.MasterService.serviceInit(MasterService.java:438)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 2 more
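
For reference, a common way to obtain a free listen port on the JVM is to bind a `ServerSocket` to port 0 and let the operating system choose. The sketch below is not the code of Angel's `NetUtils`; the class and method names are illustrative assumptions. It also shows a simple availability probe for a specific port, which can help check whether the ports in the log above were really in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

final class FreePortSketch {
    // Asks the OS for any free port by binding to port 0.
    static int findFreePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if a listening socket can currently be bound to the port.
    static boolean isAvailable(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return socket.isBound();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int port = findFreePort();
        System.out.println("free port: " + port);
        System.out.println("available: " + isAvailable(port));
    }
}
```

Note that a port found this way can be taken by another process between the probe and the later bind, so callers usually still need a retry.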

PSAttempt_0_0 failed due to: heartbeat timeout

  • ENV: hadoop2.7.2 jdk1.8
  • Error: PSAttempt_0_0 failed due to: heartbeat timeout
  • command:
./angel-submit \
    --angel.app.submit.class "com.tencent.angel.ml.classification.lr.LRRunner" \
    --angel.train.data.path "hdfs://hd-23:6000/angel/data" \
    --angel.log.path "hdfs://hd-23:6000/angel/log1" \
    --angel.save.model.path "hdfs://hd-23:6000/angel/model1" \
    --action.type train \
    --ml.data.type libsvm \
    --ml.feature.num 1024 \
    --angel.am.env JAVA_HOME="/usr/local/lib/jdk1.8.0_77",PATH="/usr/local/lib/jdk1.8.0_77/bin:$PATH" \
    --angel.worker.env JAVA_HOME="/usr/local/lib/jdk1.8.0_77",PATH="/usr/local/lib/jdk1.8.0_77/bin:$PATH" \
    --angel.ps.env JAVA_HOME="/usr/local/lib/jdk1.8.0_77",PATH="/usr/local/lib/jdk1.8.0_77/bin:$PATH" \
    --angel.am.heartbeat.interval.ms 10000 \
    --angel.worker.heartbeat.interval.ms 10000 \
    --angel.ps.heartbeat.interval.ms 10000 \
    --angel.worker.max-attempts 2 \
    --angel.ps.max-attempts 2 \
    --angel.job.name test

Can you tell me how to solve it, please?

Global Metrics Log

Collect and summarize the algorithm metrics of all Tasks, then write them to the log file. These metrics also need to be displayed on the console in real time.
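
As a rough sketch of the summarizing step only, per-task metrics could be merged into one global value per metric name before being logged. Everything here is a hypothetical illustration rather than an Angel API: the class name, the choice of a plain average, and the metric name `loss` are all assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GlobalMetricsSketch {
    // Averages each named metric over all tasks that reported it.
    static Map<String, Double> summarize(List<Map<String, Double>> perTask) {
        Map<String, Double> sum = new TreeMap<>();
        Map<String, Integer> count = new TreeMap<>();
        for (Map<String, Double> task : perTask) {
            for (Map.Entry<String, Double> e : task.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
                count.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Map<String, Double> avg = new TreeMap<>();
        for (Map.Entry<String, Double> e : sum.entrySet()) {
            avg.put(e.getKey(), e.getValue() / count.get(e.getKey()));
        }
        return avg;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> perTask = Arrays.asList(
            Collections.singletonMap("loss", 0.25),
            Collections.singletonMap("loss", 0.75));
        // One line per metric; a real system would write this to the log
        // file and the console after every epoch.
        summarize(perTask).forEach((name, value) ->
            System.out.println(name + "=" + value)); // loss=0.5
    }
}
```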

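Problems when running after compiling and installing

I compiled and installed Angel following the steps on the official site, but ran into some problems during test runs:
1. ANGEL_VERSION in spark-on-angel-env.sh is set to 1.1.8, but the compiled jar packages are all version 1.0.0.

2. When running ./angel-example com.tencent.angel.example.SgdLRLocalExample in LOCAL mode:
(1) Choosing option 1 reports the following error: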
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
at com.tencent.angel.protobuf.generated.ClientMasterServiceProtos$CheckMatricesCreatedRequest$Builder.buildPartial(ClientMasterServiceProtos.java:9294)
at com.tencent.angel.protobuf.generated.ClientMasterServiceProtos$CheckMatricesCreatedRequest$Builder.build(ClientMasterServiceProtos.java:9283)
at com.tencent.angel.client.AngelClient.waitForMatricesCreated(AngelClient.java:502)
at com.tencent.angel.client.AngelClient.createMatrices(AngelClient.java:493)
at com.tencent.angel.client.AngelClient.loadModel(AngelClient.java:164)
at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:47)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
at com.tencent.angel.example.SgdLRLocalExample.trainOnLocalCluster(SgdLRLocalExample.java:103)
at com.tencent.angel.example.SgdLRLocalExample.main(SgdLRLocalExample.java:168)
I noticed that under core in angel-ps, the com.tencent.angel.protobuf directory indeed has no generated directory, yet angel-1.0.0-jar-with-dependencies.jar does contain generated. I do not understand why.
(2) Choosing option 2 fails as follows:
Exception in thread "main" com.tencent.angel.exception.AngelException: java.io.IOException: matrix path file:/tmpmodel/lr_weight does not exist
A "/" is missing in the middle of the path, so the path cannot be found.

3. After running ./SONA-example-local, the following problem appears:
17/06/19 15:59:00 INFO AngelYarnClient: default FileSystem: hdfs://mycluster
17/06/19 15:59:00 INFO AngelYarnClient: libjarsDir=/tmp/hadoop-yarn/hadp/.staging/application_1497853469786_0002/libjars
17/06/19 15:59:00 INFO AngelYarnClient: libjars=
17/06/19 15:59:00 ERROR AngelYarnClient: submit application to yarn failed.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
These are the problems I ran into. I would appreciate any help in resolving them.
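
A NoSuchMethodError like the one in item 2 (1) generally means the class found at runtime comes from a different library version than the one the calling code was compiled against. One way to narrow this down is to print where the JVM actually loads the class from. The sketch below is a generic diagnostic, not an Angel utility; the class name `ClassOriginSketch` is hypothetical, and it must be run with the same classpath as the failing example to be informative.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper; run it with the same classpath as the failing job. */
final class ClassOriginSketch {
    /** Returns the jar or directory a class is loaded from, or a short note if unknown. */
    static String originOf(String className) {
        try {
            // Load without initializing the class; only its location is of interest.
            Class<?> c = Class.forName(className, false, ClassOriginSketch.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes loaded by the JDK's bootstrap loader have no code source.
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace above.
        String name = args.length > 0 ? args[0] : "com.google.protobuf.LazyStringList";
        System.out.println(name + " -> " + originOf(name));
    }
}
```

If the printed location is not the jar that was expected, the classpath order or a bundled dependency is a likely place to look.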

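Typo in the "Angel Quick Start Guide"

In docs/tutorials/angel_ps_quick_start.md (the "Angel Quick Start Guide"), there is a typo on line 7.
Original text (translated):

  • Most machine learning algorithms can be abstracted as operations among vectors (Vector), matrices (Martix), and tensors (Tensor); vectors, matrices, and tensors are used to represent the training data and the algorithm model.

The word for matrix should be Matrix.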

submit application to yarn failed

I built everything according to the configuration requirements, but the following error occurred after submitting the job.

17/06/20 07:25:11 INFO utils.AngelRunJar: angelHomePath conf path=/host/root/angel/dist/target/angel-1.0.0-bin/bin/..//conf/angel-site.xml
17/06/20 07:25:11 INFO utils.AngelRunJar: load system config file success
17/06/20 07:25:11 INFO utils.UGITools: UGI_PROPERTY_NAME is null 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/host/root/angel/dist/target/angel-1.0.0-bin/lib/slf4j-log4j12-1.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/06/20 07:25:12 INFO model.PSModel: After training matrix lr_weight will be saved to hdfs://model-hdp00:8020/user/wuhaonan/test/model
17/06/20 07:25:12 INFO utils.UGITools: UGI_PROPERTY_NAME is null 
17/06/20 07:25:12 INFO impl.TimelineClientImpl: Timeline service address: http://model-hdp00:8188/ws/v1/timeline/
17/06/20 07:25:13 INFO client.RMProxy: Connecting to ResourceManager at model-hdp00/172.27.2.60:8032
17/06/20 07:25:13 INFO client.AngelClient: running mode = ANGEL_PS_WORKER
17/06/20 07:25:13 INFO utils.HdfsUtil: tmp output dir is hdfs://model-hdp00:8020/tmp/root/application_1497010209719_0829_4960f24c-61b4-4cfb-8c8b-c70bfc143dc1
17/06/20 07:25:13 INFO utils.HdfsUtil: tmp output dir is hdfs://model-hdp00:8020/tmp/root/application_1497010209719_0829_e519601b-e9fb-4db4-bdf9-336d6f42feea
17/06/20 07:25:13 INFO client.AngelClient: angel.tmp.output.path=hdfs://model-hdp00:8020/tmp/root/application_1497010209719_0829_4960f24c-61b4-4cfb-8c8b-c70bfc143dc1
17/06/20 07:25:13 INFO client.AngelClient: internal state file is hdfs://model-hdp00:8020/tmp/root/application_1497010209719_0829_e519601b-e9fb-4db4-bdf9-336d6f42feea/state
17/06/20 07:25:13 INFO yarn.AngelYarnClient: default FileSystem: hdfs://model-hdp00:8020
17/06/20 07:25:13 INFO yarn.AngelYarnClient: libjarsDir=/tmp/hadoop-yarn/root/.staging/application_1497010209719_0829/libjars
17/06/20 07:25:13 INFO yarn.AngelYarnClient: libjars=file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/memory-0.8.1.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/sketches-core-0.8.1.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/commons-pool-1.6.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/kryo-shaded-3.0.3.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/scala-library-2.11.8.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/angel-ps-core-1.0.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/angel-ps-mllib-1.0.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/angel-ps-examples-1.0.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/angel-ps-psf-1.0.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/fastutil-7.1.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/sizeof-0.3.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/kryo-shaded-3.0.3.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/minlog-1.3.0.jar,file:///host/root/angel/dist/target/angel-1.0.0-bin/bin/..//lib/breeze_2.11-0.12.jar
AppMaster capability = <memory:2048, vCores:1>
17/06/20 07:25:15 INFO yarn.AngelYarnClient: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dlog4j.configuration=log/angel.properties -Dlog4j.logger.com.tencent.ml=DEBUG -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m com.tencent.angel.master.AngelApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr 
17/06/20 07:25:15 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename :  default
17/06/20 07:25:15 INFO impl.YarnClientImpl: Submitted application application_1497010209719_0829
17/06/20 07:25:26 INFO yarn.AngelYarnClient: appMaster getTrackingUrl = http://model-hdp00:8088/cluster/app/application_1497010209719_0829/
17/06/20 07:25:26 INFO yarn.AngelYarnClient: master host=172.27.2.62, port=21030
17/06/20 07:25:26 INFO yarn.AngelYarnClient: start to create rpc client to am
17/06/20 07:35:37 ERROR yarn.AngelYarnClient: submit application to yarn failed.
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:312)
	at com.sun.proxy.$Proxy24.getAllPSLocation(Unknown Source)
	at com.tencent.angel.client.AngelClient.waitForAllPS(AngelClient.java:710)
	at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:167)
	at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
	at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
	at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
	at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
	at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
	at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
	at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:126)
	at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:462)
	at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:290)
	... 14 more
Caused by: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:283)
	at com.tencent.angel.ipc.NettyTransceiver.writeDataPack(NettyTransceiver.java:535)
	at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:505)
	at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:458)
	... 15 more
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:148)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:104)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
com.tencent.angel.exception.AngelException: com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:171)
	at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
	at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
	at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
	at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
	at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
	at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
	at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)
Caused by: com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:312)
	at com.sun.proxy.$Proxy24.getAllPSLocation(Unknown Source)
	at com.tencent.angel.client.AngelClient.waitForAllPS(AngelClient.java:710)
	at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:167)
	... 11 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:126)
	at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:462)
	at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:290)
	... 14 more
Caused by: java.io.IOException: Error connecting to /172.27.2.62:21030
	at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:283)
	at com.tencent.angel.ipc.NettyTransceiver.writeDataPack(NettyTransceiver.java:535)
	at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:505)
	at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:458)
	... 15 more
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:148)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:104)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

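It looks as if I cannot connect to 172.27.2.62, yet I can ping it without any problem.
I can also see the submitted job on the cluster management page, along with part of the log: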

2017-06-20 15:26:40,812 INFO [1379994152@qtp-422522663-15] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:26:41,774 INFO [1865938769@qtp-422522663-8] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:26:42,289 INFO [1690374204@qtp-422522663-7] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:26:42,812 INFO [628989099@qtp-422522663-14] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:26:43,251 INFO [628989099@qtp-422522663-14] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:26:43,611 INFO [1883304309@qtp-422522663-11] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:30:09,534 INFO [707941836@qtp-422522663-4] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:30:18,249 INFO [1082928458@qtp-422522663-5] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:34:52,296 INFO [846664587@qtp-422522663-9] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:00,537 INFO [707941836@qtp-422522663-4] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:02,072 INFO [2093319848@qtp-422522663-0] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:03,185 INFO [971746334@qtp-422522663-17] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:03,912 INFO [1865938769@qtp-422522663-8] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:05,704 INFO [1082928458@qtp-422522663-5] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:07,095 INFO [1082928458@qtp-422522663-5] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:08,273 INFO [1865938769@qtp-422522663-8] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:10,411 INFO [1865938769@qtp-422522663-8] org.apache.hadoop.yarn.webapp.View: before compute worker state items
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,463 FATAL [app-state-monitor] com.tencent.angel.master.app.InternalErrorEvent: app in state NEW over 600000 milliseconds!
2017-06-20 15:35:10,464 INFO [AsyncDispatcher event handler] com.tencent.angel.master.app.App: some error happened, InternalErrorEvent [errorMsg=app in state NEW over 600000 milliseconds!, getType()=INTERNAL_ERROR]
2017-06-20 15:35:10,464 INFO [AsyncDispatcher event handler] com.tencent.angel.master.app.App: application_1497010209719_0829Job Transitioned from NEW to FAILED
2017-06-20 15:35:10,465 INFO [Thread-281] com.tencent.angel.master.AngelApplicationMaster: Calling stop for all the services
2017-06-20 15:35:10,485 INFO [Thread-281] com.tencent.angel.master.deploy.ContainerAllocator: to unregister from Yarn RM
2017-06-20 15:35:10,486 INFO [Thread-281] com.tencent.angel.master.deploy.ContainerAllocator: Setting job diagnostics to app in state NEW over 600000 milliseconds!

2017-06-20 15:35:10,494 INFO [Thread-281] com.tencent.angel.master.deploy.ContainerAllocator: Waiting for application to be successfully unregistered.
2017-06-20 15:35:15,496 INFO [Thread-281] com.tencent.angel.master.deploy.ContainerAllocator: ContainerAllocator service stop!
2017-06-20 15:35:15,497 INFO [Thread-281] com.tencent.angel.master.oplog.AppStateStorage: app-state-writter service stop!
2017-06-20 15:35:15,497 INFO [Thread-281] com.tencent.angel.master.AngelApplicationMaster: start to write app state to file and clear tmp directory
2017-06-20 15:35:15,497 INFO [Thread-281] com.tencent.angel.master.AngelApplicationMaster: start to write app state to file hdfs://model-hdp00:8020/tmp/root/application_1497010209719_0829_e519601b-e9fb-4db4-bdf9-336d6f42feea/state
2017-06-20 15:35:15,680 INFO [Thread-281] com.tencent.angel.master.app.App: write app report to file successfully jobReport {
  jobState: J_FAILED
  curIteration: 0
  totalIteration: 100
  diagnostics: "app in state NEW over 600000 milliseconds!"
}
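
A successful ping only shows that the host answers ICMP; "Connection refused" refers to the specific TCP port, and usually means nothing was listening there at that moment or that a firewall rejected the connection. A small probe like the one below can confirm this from the client machine. It is a generic sketch (the class `TcpProbeSketch` is hypothetical), and the default host and port in `main` are placeholders to be replaced with the values printed in the "master host=..., port=..." log line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
final class TcpProbeSketch {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", connect timeouts and unresolved hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 21030;
        System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 3000));
    }
}
```

Running it while the job is still in the submitted state shows whether the port is reachable from the client at all.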

Question about the method for destroying a PSModelPool

spark_on_angel_programing_guide.md says:

To destroy a PSModelPool, use: context.destroyModelPool(pool)

However, in actual testing this method does not exist; PSContext only has destroyVectorPool().


LRRunner train hung at model.clock.get in GradientDescent

Running LRRunner on Yarn with the config --angel.worker.task.number 4, training hung at model.clock.get in GradientDescent.

See the jstack output:

"pool-8-thread-4" #579 prio=5 os_prio=0 tid=0x00007f28f1a88000 nid=0x28c31 waiting on condition [0x00007f28b06f2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000054dac7cc8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at com.tencent.angel.psagent.matrix.transport.FutureResult.get(FutureResult.java:54)
        at com.tencent.angel.ml.optimizer.sgd.GradientDescent$.miniBatchGD(GradientDescent.scala:80)
        at com.tencent.angel.ml.classification.lr.LRLearner.trainOneEpoch(LRLearner.scala:72)
        at com.tencent.angel.ml.classification.lr.LRLearner.train(LRLearner.scala:103)
        at com.tencent.angel.ml.classification.lr.LRTrainTask.train(LRTrainTask.scala:55)
        at com.tencent.angel.worker.task.TrainTask.run(TrainTask.scala:28)
        at com.tencent.angel.worker.task.Task.runUser(Task.java:95)
        at com.tencent.angel.worker.task.Task.run(Task.java:70)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"pool-8-thread-3" #577 prio=5 os_prio=0 tid=0x00007f28f1a86000 nid=0x28c30 waiting on condition [0x00007f28b07f3000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000054dade4c8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at com.tencent.angel.psagent.matrix.transport.FutureResult.get(FutureResult.java:54)
        at com.tencent.angel.ml.optimizer.sgd.GradientDescent$.miniBatchGD(GradientDescent.scala:80)
        at com.tencent.angel.ml.classification.lr.LRLearner.trainOneEpoch(LRLearner.scala:72)
        at com.tencent.angel.ml.classification.lr.LRLearner.train(LRLearner.scala:103)
        at com.tencent.angel.ml.classification.lr.LRTrainTask.train(LRTrainTask.scala:55)
        at com.tencent.angel.worker.task.TrainTask.run(TrainTask.scala:28)
        at com.tencent.angel.worker.task.Task.runUser(Task.java:95)
        at com.tencent.angel.worker.task.Task.run(Task.java:70)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
"pool-8-thread-2" #575 prio=5 os_prio=0 tid=0x00007f28f1a84800 nid=0x28c2f waiting on condition [0x00007f28b08f4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.tencent.angel.psagent.matrix.transport.adapter.MatrixClientAdapter.waitForClock(MatrixClientAdapter.java:406)
        at com.tencent.angel.psagent.matrix.transport.adapter.MatrixClientAdapter.getRow(MatrixClientAdapter.java:154)
        at com.tencent.angel.psagent.consistency.ConsistencyController.getRow(ConsistencyController.java:100)
        at com.tencent.angel.psagent.matrix.MatrixClientImpl.getRow(MatrixClientImpl.java:53)
        at com.tencent.angel.ml.model.PSModel.getRow(PSModel.scala:203)
        at com.tencent.angel.ml.optimizer.sgd.GradientDescent$.miniBatchGD(GradientDescent.scala:36)
        at com.tencent.angel.ml.classification.lr.LRLearner.trainOneEpoch(LRLearner.scala:72)
        at com.tencent.angel.ml.classification.lr.LRLearner.train(LRLearner.scala:103)
        at com.tencent.angel.ml.classification.lr.LRTrainTask.train(LRTrainTask.scala:55)
        at com.tencent.angel.worker.task.TrainTask.run(TrainTask.scala:28)
        at com.tencent.angel.worker.task.Task.runUser(Task.java:95)
        at com.tencent.angel.worker.task.Task.run(Task.java:70)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

bug in SgdLRLocalExample.java

--- a/angel-ps/examples/src/main/java/com/tencent/angel/example/SgdLRLocalExample.java
+++ b/angel-ps/examples/src/main/java/com/tencent/angel/example/SgdLRLocalExample.java
@@ -110,7 +110,7 @@ public class SgdLRLocalExample {
     String inputPath = "../data/exampledata/LRLocalExampleData/a9a.train";
     String LOCAL_FS = LocalFileSystem.DEFAULT_FS;
     String TMP_PATH = System.getProperty("java.io.tmpdir", "/tmp");
-    String loadPath = LOCAL_FS + TMP_PATH + "model";
+    String loadPath = LOCAL_FS + TMP_PATH + "/model";
     String savePath = LOCAL_FS + TMP_PATH + "/newmodel";
     String logPath = LOCAL_FS + TMP_PATH + "/log";
 
@@ -135,7 +135,7 @@ public class SgdLRLocalExample {
     String inputPath = "../data/exampledata/LRLocalExampleData/a9a.test";
     String LOCAL_FS = LocalFileSystem.DEFAULT_FS;
     String TMP_PATH = System.getProperty("java.io.tmpdir", "/tmp");
-    String loadPath = LOCAL_FS + TMP_PATH + "model";
+    String loadPath = LOCAL_FS + TMP_PATH + "/model";
     String savePath = LOCAL_FS + TMP_PATH + "/model";
     String logPath = LOCAL_FS + TMP_PATH + "/log";
     String predictPath = LOCAL_FS + TMP_PATH + "/predict";
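
The patch above fixes the missing separator by hard-coding a leading "/". As a sketch of an alternative, a small join helper guarantees exactly one separator whether or not the temp directory already ends with one (on some systems java.io.tmpdir has a trailing slash). The class `PathJoinSketch` is hypothetical and only illustrates the string handling; it is not taken from the Angel code base.

```java
/** Hypothetical helper showing why plain concatenation loses the separator. */
final class PathJoinSketch {
    /** Joins two path segments with exactly one "/" between them. */
    static String join(String base, String child) {
        String left = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        String right = child.startsWith("/") ? child.substring(1) : child;
        return left + "/" + right;
    }

    public static void main(String[] args) {
        String tmpPath = "/tmp"; // stand-in for System.getProperty("java.io.tmpdir", "/tmp")
        System.out.println(tmpPath + "model");        // plain concatenation, separator lost
        System.out.println(join(tmpPath, "model"));   // separator inserted
        System.out.println(join("/tmp/", "/model"));  // duplicate separators collapsed
    }
}
```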

mvn clean package compile error: /angel/angel-ps/core/target/classes/com/tencent/angel/ml is not a directory

Jdk = 1.8
Maven = 3.0.5
Protobuf = 2.5.0

[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ angel-ps-core ---
[INFO] Using zinc server for incremental compilation
[warn] Default cache file location not accessible. Using /styx/home/hzfangzh/.zinc/0.3.5.3/analysis-cache/6cb44251464fcd23600d66276b8af136e8d816df
[info] Compiling 6 Scala sources and 485 Java sources to /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes...
[error] /home/recsys/chenmingming/opensource/angel/angel-ps/core/src/main/java/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc.scala:20: error writing object GetRowsFunc: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc$.class: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml is not a directory
[error] object GetRowsFunc {
[error] ^
[error] /home/recsys/chenmingming/opensource/angel/angel-ps/core/src/main/java/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc.scala:24: error writing class GetRowsFunc: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc.class: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml is not a directory
[error] class GetRowsFunc(param: GetParam) extends GetFunc(param) {
[error] ^
[error] /home/recsys/chenmingming/opensource/angel/angel-ps/core/src/main/java/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc.scala:31: error writing <$anon: Function1>: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml/matrix/psf/get/multi/GetRowsFunc$$anonfun$1.class: /home/recsys/chenmingming/opensource/angel/angel-ps/core/target/classes/com/tencent/angel/ml is not a directory

English translation of documents

This is an interesting project. Some friends from other countries also want to use it, and an English version of the documentation would be helpful for them. I will start translating the documents one by one. I hope more people can join me.

The documents will be translated in the following order:

  1. Readme.md
  2. QuickStart
  3. Design
  4. Interface
  5. Algo

Spark On Angel unit test (UT) run failed

  • mergeMinAndFlush
    *** RUN ABORTED ***
    java.lang.NoSuchMethodError: io.netty.bootstrap.ServerBootstrap.config()Lio/netty/bootstrap/ServerBootstrapConfig;
    at com.tencent.angel.ipc.NettyServer.stop(NettyServer.java:123)
    at com.tencent.angel.ps.impl.ParameterServerService.stop(ParameterServerService.java:84)
    at com.tencent.angel.ps.impl.ParameterServer.stop(ParameterServer.java:153)
    at com.tencent.angel.localcluster.LocalPS.exit(LocalPS.java:70)
    at com.tencent.angel.localcluster.LocalCluster.stop(LocalCluster.java:80)
    at com.tencent.angel.client.local.AngelLocalClient.stop(AngelLocalClient.java:102)
    at com.tencent.angel.client.AngelClient.stop(AngelClient.java:238)
    at com.tencent.angel.client.AngelPSClient.stopPS(AngelPSClient.java:126)
    at com.tencent.angel.spark.context.AngelPSContext$.com$tencent$angel$spark$context$AngelPSContext$$doStop(AngelPSContext.scala:271)
    at com.tencent.angel.spark.context.AngelPSContext$.stop(AngelPSContext.scala:259)

what's ANGEL_HDFS_HOME?

Hey guys, thanks a lot for sharing this great work!
Could I ask a simple question: what is the ANGEL_HDFS_HOME environment variable on this page, and where should I point it to?

Error when running GBDTLocalExample in IntelliJ: FATAL localcluster.LocalCluster : worker WorkerAttempt_0_0_0 start failed

17/07/24 07:52:06 INFO worker.Worker : init and start worker
17/07/24 07:52:06 INFO worker.Worker : after init psagent
17/07/24 07:52:06 INFO worker.Worker : after init datablockmanager
Exception in thread "Thread-18" 17/07/24 07:52:07 FATAL localcluster.LocalCluster : worker WorkerAttempt_0_0_0 start failed.
java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1307)
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1195)
at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
at com.tencent.angel.psagent.matrix.transport.MatrixTransportClient.<init>(MatrixTransportClient.java:140)
at com.tencent.angel.psagent.PSAgent.initAndStart(PSAgent.java:307)
at com.tencent.angel.worker.Worker.initAndStart(Worker.java:268)
at com.tencent.angel.localcluster.LocalWorker.run(LocalWorker.java:53)
java.lang.NullPointerException
at com.tencent.angel.worker.Worker.workerError(Worker.java:431)
at com.tencent.angel.worker.Worker.error(Worker.java:616)
at com.tencent.angel.localcluster.LocalWorker.run(LocalWorker.java:56)
17/07/24 07:53:07 ERROR master.MasterService : WorkerAttempt_0_0_0 heartbeat timeout!!!
17/07/24 07:53:07 INFO attempt.WorkerAttempt : Diagnostics report from WorkerAttempt_0_0_0: heartbeat timeout
17/07/24 07:53:07 INFO master.MasterService : WorkerAttempt_0_0_0 is unregistered in monitor!
17/07/24 07:53:07 INFO attempt.WorkerAttempt : WorkerAttempt_0_0_0 psserver Transitioned from LAUNCHED to FAILED
17/07/24 07:53:07 INFO worker.Worker : stop workerService
17/07/24 07:53:07 INFO worker.Worker : stop psagent
17/07/24 07:53:07 INFO psagent.PSAgent : stop heartbeat thread!
17/07/24 07:53:07 INFO psagent.PSAgent : stop op log merger
17/07/24 07:53:07 INFO psagent.PSAgent : stop clock cache
17/07/24 07:53:07 INFO psagent.PSAgent : stop matrix cache
17/07/24 07:53:07 INFO psagent.PSAgent : stop matrix client adapater
17/07/24 07:53:07 INFO psagent.PSAgent : stop rpc dispacher
17/07/24 07:53:07 INFO worker.Worker : stop heartbeat thread
17/07/24 07:53:07 INFO worker.Worker : stop taskmanager
17/07/24 07:53:07 INFO worker.Worker : stop data block manager
17/07/24 07:53:07 INFO worker.AMWorker : scheduling WorkerAttempt_0_0_1
17/07/24 07:53:07 INFO attempt.WorkerAttempt : allocate worker attempt resource, worker attempt id = WorkerAttempt_0_0_1, resource = <memory:5120, vCores:1>, priority = 20, hosts = localhost
17/07/24 07:53:07 INFO attempt.WorkerAttempt : WorkerAttempt_0_0_1 psserver Transitioned from NEW to SCHEDULED
17/07/24 07:53:07 INFO local.LocalContainerAllocator : local allocator event=ContainerAllocatorEvent [taskId=WorkerAttempt_0_0_1]
17/07/24 07:53:07 INFO attempt.WorkerAttempt : WorkerAttempt_0_0_1 psserver Transitioned from SCHEDULED to ASSIGNED
17/07/24 07:53:07 INFO attempt.WorkerAttempt : add WorkerAttempt_0_0_1 to monitor!
17/07/24 07:53:07 INFO master.MasterService : WorkerAttempt_0_0_1 is registered in monitor!
17/07/24 07:53:07 INFO attempt.WorkerAttempt : WorkerAttempt_0_0_1 psserver Transitioned from ASSIGNED to LAUNCHED
17/07/24 07:53:07 INFO worker.Worker : init and start worker
17/07/24 07:53:07 INFO worker.Worker : after init psagent
17/07/24 07:53:07 INFO worker.Worker : after init datablockmanager
Exception in thread "Thread-31" java.lang.NullPointerException
at com.tencent.angel.worker.Worker.workerError(Worker.java:431)
at com.tencent.angel.worker.Worker.error(Worker.java:616)
at com.tencent.angel.localcluster.LocalWorker.run(LocalWorker.java:56)
17/07/24 07:53:07 FATAL localcluster.LocalCluster : worker WorkerAttempt_0_0_1 start failed.
java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1307)
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1195)
at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
at com.tencent.angel.psagent.matrix.transport.MatrixTransportClient.<init>(MatrixTransportClient.java:140)
at com.tencent.angel.psagent.PSAgent.initAndStart(PSAgent.java:307)
at com.tencent.angel.worker.Worker.initAndStart(Worker.java:268)
at com.tencent.angel.localcluster.LocalWorker.run(LocalWorker.java:53)
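
The IllegalArgumentException in this trace is thrown from the ThreadPoolExecutor constructor called by Executors.newFixedThreadPool. The JDK raises it when the requested pool size is zero or negative. The log does not show which value MatrixTransportClient passed or where it came from, so the sketch below only reproduces the JDK behavior; the class `FixedPoolSizeSketch` and the clamping helper are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical reproduction of the JDK behavior seen in the stack trace. */
final class FixedPoolSizeSketch {
    /** Returns true if Executors.newFixedThreadPool accepts the given size. */
    static boolean poolSizeAccepted(int nThreads) {
        try {
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            pool.shutdown(); // no tasks were submitted, so this returns immediately
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown by the ThreadPoolExecutor constructor for sizes <= 0.
            return false;
        }
    }

    /** Clamps a configured or computed size so that at least one thread is requested. */
    static int atLeastOne(int configured) {
        return Math.max(1, configured);
    }

    public static void main(String[] args) {
        for (int size : new int[] {-1, 0, 2}) {
            System.out.println("size=" + size + " accepted=" + poolSizeAccepted(size));
        }
        System.out.println("clamped 0 -> " + atLeastOne(0));
    }
}
```

If the pool size is derived from a configuration value or from the number of available cores, checking that value in the local IntelliJ run configuration would be a reasonable next step.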

Error connecting

17/06/28 16:00:39 INFO yarn.AngelYarnClient: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dlog4j.configuration=log/angel.properties -Dlog4j.logger.com.tencent.ml=DEBUG -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m com.tencent.angel.master.AngelApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
17/06/28 16:00:39 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename : default
17/06/28 16:00:39 INFO impl.YarnClientImpl: Submitted application application_1498124481939_0016
17/06/28 16:00:45 INFO yarn.AngelYarnClient: appMaster getTrackingUrl = http://stghadoop1:8088/cluster/app/application_1498124481939_0016/
17/06/28 16:00:45 INFO yarn.AngelYarnClient: master host=10.12.101.245, port=23469
17/06/28 16:00:45 INFO yarn.AngelYarnClient: start to create rpc client to am
17/06/28 16:02:00 ERROR yarn.AngelYarnClient: submit application to yarn failed.
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.12.101.245:23469
Even the most basic example does not run. How can this be solved? Also, 10.12.101.245 is not the master of the Yarn cluster.

Spark-on-Angel LogisticRegression throws an exception when trained on user data

We modified this Spark-on-Angel example:
https://github.com/Tencent/angel/blob/master/spark-on-angel/examples/src/main/scala/com/tencent/angel/spark/examples/ml/LogisticRegression.scala
We changed the training input to a Parquet file with the schema (label: Double, features: VectorUDT) and changed the code as follows:

import org.apache.spark.ml.classification.ps.{LogisticRegression => PSLR}

val dataSet = spark.read.parquet("/path/to/mysamples.snappy.parquet")
val lr = new PSLR()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setRegParam(0.01)
  .setElasticNetParam(0.0)
  .setMaxIter(10)
  .setTol(1e-6)

We submitted the job in yarn-cluster mode with num-executors set to 10
and got this exception from the Spark-on-Angel LogisticRegression implementation.

User class threw exception: java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.mllib.stat.PSMultivariateOnlineSummarizer.variance(PSMultivariateOnlineSummarizer.scala:190)
at org.apache.spark.mllib.stat.PSMultivariateOnlineSummarizer.std(PSMultivariateOnlineSummarizer.scala:207)
at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:366)
at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:332)
at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:203)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
at com.xiaomi.angel.spark.examples.ml.LogisticRegressionSparkOnAngelExample$.run(LogisticRegressionSparkOnAngelExample.scala:50)
at com.xiaomi.angel.spark.examples.ml.LogisticRegressionSparkOnAngelExample$$anonfun$main$1.apply(LogisticRegressionSparkOnAngelExample.scala:19)
at com.xiaomi.angel.spark.examples.ml.LogisticRegressionSparkOnAngelExample$$anonfun$main$1.apply(LogisticRegressionSparkOnAngelExample.scala:17)
at com.tencent.angel.spark.examples.util.PSExamples$.runWithSparkContext(PSExamples.scala:50)
at com.xiaomi.angel.spark.examples.ml.LogisticRegressionSparkOnAngelExample$.main(LogisticRegressionSparkOnAngelExample.scala:17)
at com.xiaomi.angel.spark.examples.ml.LogisticRegressionSparkOnAngelExample.main(LogisticRegressionSparkOnAngelExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:648)
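
The message "Nothing has been added to this summarizer" matches the guard in Spark's own MultivariateOnlineSummarizer, which refuses to report statistics before any sample has been added. Assuming PSMultivariateOnlineSummarizer behaves the same way (its source is not shown here), the exception would mean that no rows reached the summarizer. The toy one-dimensional summarizer below is a sketch of that guard only. `TinySummarizer` is a hypothetical class, not part of Spark or Angel.

```scala
// Toy one-dimensional online summarizer (Welford's algorithm).
// Hypothetical illustration; not the Spark or Angel implementation.
final class TinySummarizer {
  private var n: Long = 0L
  private var mean: Double = 0.0
  private var m2: Double = 0.0 // running sum of squared deviations from the mean

  def add(x: Double): this.type = {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    this
  }

  // Sample variance; fails in the same style as the trace when nothing was added.
  def variance: Double = {
    require(n > 0, "Nothing has been added to this summarizer.")
    if (n > 1) m2 / (n - 1) else 0.0
  }
}
```

Under that assumption, it may be worth checking that the Parquet input is non-empty and that the label and features columns are read as expected before calling fit.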

I just ran Angel LR in YARN mode and it failed (CentOS 6.5, Hadoop 2.2, protobuf 2.5, Java 1.8; HDFS and YARN both work fine).

17/06/25 00:23:54 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
AppMaster capability = <memory:2048, vCores:1>
17/06/25 00:23:54 INFO yarn.AngelYarnClient: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dlog4j.configuration=log/angel.properties -Dlog4j.logger.com.tencent.ml=DEBUG -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m com.tencent.angel.master.AngelApplicationMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
17/06/25 00:23:54 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename : default
17/06/25 00:23:54 INFO impl.YarnClientImpl: Submitted application application_1498320278135_0001 to ResourceManager at master/10.211.55.19:8032
17/06/25 00:23:59 INFO yarn.AngelYarnClient: appMaster getTrackingUrl = master:8088/cluster/app/application_1498320278135_0001/
17/06/25 00:23:59 INFO yarn.AngelYarnClient: master host=10.211.55.22, port=25695
17/06/25 00:23:59 INFO yarn.AngelYarnClient: start to create rpc client to am
17/06/25 00:28:55 ERROR yarn.AngelYarnClient: submit application to yarn failed.
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.211.55.22:25695
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:312)
at com.sun.proxy.$Proxy10.getAllPSLocation(Unknown Source)
at com.tencent.angel.client.AngelClient.waitForAllPS(AngelClient.java:710)
at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:167)
at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:46)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:28)
at com.tencent.angel.ml.classification.lr.LRRunner.train(LRRunner.scala:40)
at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:90)
at com.tencent.angel.ml.classification.lr.LRRunner.submit(LRRunner.scala:28)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:124)
at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:110)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:110)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.211.55.22:25695
at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:126)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:462)
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:290)
... 14 more
Caused by: java.io.IOException: Error connecting to /10.211.55.22:25695
at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:283)
at com.tencent.angel.ipc.NettyTransceiver.writeDataPack(NettyTransceiver.java:535)
at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:505)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:458)
... 15 more

[GBDT] The purpose of using a parameter server in GBDT

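(Originally written in Chinese, since the documentation is in Chinese.)

The GBDT documentation gives the following reasons for using a PS:

Very large models: The size of the gradient histograms used by GBDT is proportional to the number of features, so for high-dimensional large datasets the gradient histograms become very large. Angel partitions the gradient histograms across multiple PS nodes for storage, which removes the single-point bottleneck when aggregating parameters for high-dimensional models.

Communication overhead: By storing and partitioning the global gradient histogram sensibly, each Worker does not need to fetch the complete global gradient histogram. This greatly reduces communication overhead and speeds up the aggregation of local gradient histograms.

Two-phase tree splitting algorithm: The search for the best split point can run in parallel on multiple PS nodes, and only the locally optimal split points need to be returned to the Worker, so the communication overhead is almost negligible.

I do not really agree with the "very large models" point. The "partial histogram" built on each worker is the same size as the "global histogram" (each worker holds a horizontal partition of the data, so the histogram it builds still covers the full feature set). Using a PS does not reduce the memory footprint of the workers, so it does not solve the high-dimensional data problem.

As for "communication overhead" and the "two-phase tree splitting algorithm", both can be achieved with or without a PS. For example, LightGBM uses "Reduce Scatter" directly so that different workers merge different parts of the histogram, and then synchronizes the best split point; the effect is equivalent. See: https://github.com/Microsoft/LightGBM/wiki/Features#data-parallel-in-lightgbm

Also, in the second point, should it read "each PS does not need to fetch the complete global gradient histogram"?

Please correct me if I have misunderstood anything.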
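
To make the comparison above concrete, here is a toy, single-process Scala sketch of a reduce-scatter over per-worker histograms followed by a two-phase best-split search. Every name in it is hypothetical, and it is neither Angel's nor LightGBM's implementation. It keeps one value per feature and uses the summed value as a stand-in for a real split gain. Each simulated worker starts with a full-width local histogram, which is the memory concern raised in the post.

```scala
object HistogramReduceScatter {
  // One value per feature, for brevity; real histograms hold several bins per feature.
  type Histogram = Vector[Double]

  // Split feature indices [0, numFeatures) into numParts contiguous ranges.
  def ranges(numFeatures: Int, numParts: Int): Vector[Range] =
    (0 until numParts).toVector.map { p =>
      (p * numFeatures / numParts) until ((p + 1) * numFeatures / numParts)
    }

  // Simulated reduce-scatter: owner p ends up with the element-wise sum
  // of slice p taken from every worker's local histogram.
  def reduceScatter(local: Vector[Histogram], numParts: Int): Vector[Histogram] =
    ranges(local.head.length, numParts).map { r =>
      r.toVector.map(f => local.map(h => h(f)).sum)
    }

  // Phase 1: each owner finds the best entry in its own slice.
  // Phase 2: only those (featureIndex, score) pairs are compared globally.
  def bestSplit(slices: Vector[Histogram], numFeatures: Int): (Int, Double) = {
    val localBests = slices.zip(ranges(numFeatures, slices.length))
      .filter { case (slice, _) => slice.nonEmpty }
      .map { case (slice, r) =>
        val (score, offset) = slice.zipWithIndex.maxBy(_._1)
        (r.start + offset, score)
      }
    localBests.maxBy(_._2)
  }
}
```

In this sketch the slice owners play the same role whether they are PS nodes or peer workers, which is the equivalence the post argues for; the sketch says nothing about the relative cost of the two designs on a real cluster.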

Some problems with the Spark on Angel example

When I run the SONA example in local mode (SONA-example-local), the job seems to work fine. It ends with:

17/07/30 03:13:10 INFO ui.SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
17/07/30 03:13:10 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/30 03:13:10 INFO memory.MemoryStore: MemoryStore cleared
17/07/30 03:13:10 INFO storage.BlockManager: BlockManager stopped
17/07/30 03:13:10 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/07/30 03:13:10 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/30 03:13:10 INFO spark.SparkContext: Successfully stopped SparkContext

But the process does not seem to exit; I have to press Ctrl+C to end the script. Is that normal?

And when I run SONA-example, it throws:

17/07/30 03:26:26 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename :  root.default
17/07/30 03:26:26 ERROR yarn.AngelYarnClient: submit application to yarn failed.
org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1501399441343_0032 to YARN : Application application_1501399441343_0032 submitted by user h to unknown queue: root.default
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:271)
	at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:164)
	at com.tencent.angel.client.AngelPSClient.startPS(AngelPSClient.java:99)
	at com.tencent.angel.spark.context.AngelPSContext$.submit(AngelPSContext.scala:248)
	at com.tencent.angel.spark.context.AngelPSContext$.apply(AngelPSContext.scala:151)
	at com.tencent.angel.spark.PSContext$.liftedTree1$1(PSContext.scala:124)
	at com.tencent.angel.spark.PSContext$.instance(PSContext.scala:122)
	at com.tencent.angel.spark.PSContext$.getOrCreate(PSContext.scala:91)
	at com.tencent.angel.spark.PSContext$.getOrCreate(PSContext.scala:79)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$$anonfun$main$1.apply(BreezeSGD.scala:42)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$$anonfun$main$1.apply(BreezeSGD.scala:41)
	at com.tencent.angel.spark.examples.util.PSExamples$.runWithSparkContext(PSExamples.scala:50)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$.main(BreezeSGD.scala:41)
	at com.tencent.angel.spark.examples.ml.BreezeSGD.main(BreezeSGD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: PSContext init failed!
	at com.tencent.angel.spark.PSContext$.getOrCreate(PSContext.scala:92)
	at com.tencent.angel.spark.PSContext$.getOrCreate(PSContext.scala:79)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$$anonfun$main$1.apply(BreezeSGD.scala:42)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$$anonfun$main$1.apply(BreezeSGD.scala:41)
	at com.tencent.angel.spark.examples.util.PSExamples$.runWithSparkContext(PSExamples.scala:50)
	at com.tencent.angel.spark.examples.ml.BreezeSGD$.main(BreezeSGD.scala:41)
	at com.tencent.angel.spark.examples.ml.BreezeSGD.main(BreezeSGD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Am I missing something?

Here is my SONA-example:

#!/bin/bash

source ./spark-on-angel-env.sh

$SPARK_HOME/bin/spark-submit \
    --master local[2] \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.instances=1 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.memory=512m \
    --jars $SONA_SPARK_JARS \
    --name "BreezeSGD-spark-on-angel" \
    --driver-memory 512m \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 512m \
    --class com.tencent.angel.spark.examples.ml.BreezeSGD \
    ./../lib/spark-on-angel-examples-${ANGEL_VERSION}.jar

Here is my SONA-example-local:

source ./spark-on-angel-env.sh

$SPARK_HOME/bin/spark-submit \
    --jars $SONA_SPARK_JARS \
    --name "BreezeSGD-spark-on-angel" \
    --class com.tencent.angel.spark.examples.ml.BreezeSGD \
    ./../lib/spark-on-angel-examples-${ANGEL_VERSION}.jar

2. I also want to run the tests in org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite from IDEA.
I removed the ignore tag and ran the following part:


  test("logistic regression doesn't fit intercept when fitIntercept is off") {
    val lr = new LogisticRegression().setFamily("binomial")
    lr.setFitIntercept(false)
    val model = lr.fit(smallBinaryDataset)
    assert(model.intercept === 0.0)

    val mlr = new LogisticRegression().setFamily("multinomial")
    mlr.setFitIntercept(false)
    val mlrModel = mlr.fit(smallMultinomialDataset)
    assert(mlrModel.interceptVector === Vectors.sparse(3, Seq()))
  }

It throws:

17/07/30 03:04:57 INFO AngelYarnClient: libjars=
17/07/30 03:04:57 ERROR AngelYarnClient: submit application to yarn failed.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
	at com.tencent.angel.client.yarn.AngelYarnClient.copyAndConfigureFiles(AngelYarnClient.java:255)
	at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:150)
	at com.tencent.angel.client.AngelPSClient.startPS(AngelPSClient.java:99)
	at com.tencent.angel.spark.context.AngelPSContext$.submit(AngelPSContext.scala:248)
	at com.tencent.angel.spark.context.AngelPSContext$.apply(AngelPSContext.scala:151)
	at com.tencent.angel.spark.PSContext$.liftedTree1$1(PSContext.scala:124)
	at com.tencent.angel.spark.PSContext$.instance(PSContext.scala:122)
	at com.tencent.angel.spark.PSContext$.getOrCreate(PSContext.scala:91)
	at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:350)
	at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:332)
	at org.apache.spark.ml.classification.ps.LogisticRegression.train(LogisticRegression.scala:203)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
	at org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite$$anonfun$7.apply$mcV$sp(LogisticRegressionSuite.scala:268)
	at org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite$$anonfun$7.apply(LogisticRegressionSuite.scala:265)
	at org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite$$anonfun$7.apply(LogisticRegressionSuite.scala:265)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
	at org.scalatest.Suite$class.run(Suite.scala:1424)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
	at org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite.org$scalatest$BeforeAndAfterAll$$super$run(LogisticRegressionSuite.scala:45)
	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
	at org.apache.spark.ml.classificatiion.ps.LogisticRegressionSuite.run(LogisticRegressionSuite.scala:45)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
	at org.scalatest.tools.Runner$.run(Runner.scala:883)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)

It seems I need to configure a library path in the IDE. Can you tell me what I need to do, or point me to a tutorial?
Thanks.
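
The log line `libjars=` shows an empty jar list just before the failure. One plausible mechanism, offered as an assumption since AngelYarnClient.copyAndConfigureFiles is not shown here, is that splitting an empty string on commas still yields one empty entry, and Hadoop's Path rejects an empty string. The sketch below demonstrates the string behavior and a defensive parse. `LibJars` and `parse` are hypothetical names, not Angel code.

```scala
object LibJars {
  // Hypothetical helper: turn a comma-separated jar list into clean entries,
  // dropping blanks so that no empty path is ever handed on.
  def parse(libjars: String): Seq[String] =
    Option(libjars).toList
      .flatMap(_.split(",").toList)
      .map(_.trim)
      .filter(_.nonEmpty)
}
```

Note that `"".split(",")` returns an array with a single empty string, so code that maps every entry to a path would fail on an unset list. When running from an IDE, supplying the jar list that the launch scripts normally provide may avoid the error, though the exact setting depends on the Angel version.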

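About the runtime jar files

Hello,
Angel can be run in two ways, local mode and YARN mode. I looked at the two launch scripts under bin, angel-submit and angel-example. They essentially set environment variables, adding ANGEL_HOME and the classpath. However, I cannot run them because the jars are not on the CLASSPATH. According to the scripts, the jars live in "$bin"/../lib, but I could not find a lib folder, so I cannot add the jars to the CLASSPATH. I also searched the project and could not find the jar files.
Where are the jars located? Do I need to build the jars myself before running?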
