ruby-spark's People

Contributors

ddrscott, deric, lululau, ondra-m

ruby-spark's Issues

Passing function

I'm trying to understand what is described here in the wiki
https://github.com/ondra-m/ruby-spark/wiki/Passing-function

This seems to describe an issue I'm seeing: I have a method I want to use, but it isn't recognized when I call it within .map. I can't figure out a solution from the documentation.

Something like this doesn't work:

def manipulate_data(line)
  line + " more stuff"
end

api_keys = requests_with_key.map(lambda {|line| [manipulate_data(line), 1]})

But I am unsure how I would make it work. Is it as simple as adding the following?
requests_with_key.bind(method(:manipulate_data))
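
For what it's worth, the simplest workaround is to make the lambda self-contained, since the function is serialized and shipped to workers without your local method definitions. Below is a hedged sketch; the bind call assumes, per the wiki page above, that bind takes a Hash of names to values and makes them available inside the serialized function (I have not verified the exact signature, and a Method object itself may not be serializable, so only a plain value is bound here):

# Self-contained lambda: all logic lives inside the serialized function.
api_keys = requests_with_key.map(lambda { |line| [line + " more stuff", 1] })

# Hypothetical bind usage for shipping a value along with the function:
suffix = " more stuff"
api_keys = requests_with_key
             .map(lambda { |line| [line + suffix, 1] })
             .bind(suffix: suffix)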

Encoding::UndefinedConversionError: "\x8B" from ASCII-8BIT to UTF-8

2.1.3 :036 > rdd2 = sc.parallelize(["jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj"])
Encoding::UndefinedConversionError: "\x8B" from ASCII-8BIT to UTF-8
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/ext/io.rb:37:in `write'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/ext/io.rb:37:in `write_int'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/ext/io.rb:48:in `write_string'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/serializer/batched.rb:53:in `block in dump_to_io'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/serializer/batched.rb:51:in `each_slice'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/serializer/batched.rb:51:in `each'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/serializer/batched.rb:51:in `dump_to_io'
    from bundler/gems/ruby-spark-2287d5a71670/lib/spark/context.rb:215:in `parallelize'
    from (irb):36
    from bundler/gems/railties-3.2.13/lib/rails/commands/console.rb:47:in `start'
    from bundler/gems/railties-3.2.13/lib/rails/commands/console.rb:8:in `start'
    from  bundler/gems/railties-3.2.13/lib/rails/commands.rb:41:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

I also tried Oj as the serializer and got the same error. It seems to be coming from IO or StringIO.
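
For context, here is a minimal reproduction of the underlying Ruby behavior, independent of ruby-spark: writing ASCII-8BIT (binary) bytes into a UTF-8 string buffer forces a transcode, which fails on bytes like \x8B. If the gem's io.rb ends up writing serialized bytes into a non-binary IO or StringIO, that would produce exactly this error; a binary buffer (or binmode on a File) avoids it.

require 'stringio'

data = "\x8B".b                                # raw bytes, e.g. serializer output

utf8_io = StringIO.new(''.encode(Encoding::UTF_8))
begin
  utf8_io.write(data)                          # transcodes to the buffer's encoding
rescue Encoding::UndefinedConversionError => e
  puts e.message                               # => "\x8B" from ASCII-8BIT to UTF-8
end

binary_io = StringIO.new(''.b)                 # binary buffer: bytes appended verbatim
binary_io.write(data)                          # no error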

Toy input works but fails on bigger inputs

Hi Ondra:

I really appreciate your writing such a great gem. I'm using it to build a "friends you may know" feature like Facebook's. ruby-spark works for the toy input but fails on the bigger input. It also reports an error if I use RDD.take(2), even with the toy input. Do you have a guess where I should look? Thanks!

Logic: each line represents a user ID and that user's friends' IDs. For each line, make all of these friends share the leading user as a common friend. Remove pairs that are already friends. Aggregate and suggest the users with the most common friends.

Input: UserID FriendID, FriendID, FriendID....
Output: UserID, [[suggestedID, commonFriendCount], [suggestedID, commonFriendCount], [suggestedID, commonFriendCount] ..... ]
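
To make the described logic concrete, here is a hedged Ruby sketch of the per-line map step (the names are hypothetical, and the actual field separator in the dataset may differ):

# Each line: "user<TAB>friend1,friend2,..."; emit one record per candidate pair.
def candidate_pairs(line)
  user, friend_list = line.split("\t")
  friends = friend_list.to_s.split(",")
  out = []
  # Mark existing friendships so they can be filtered out after aggregation.
  friends.each { |f| out << [[user, f].sort, :friends] }
  # Every two friends of `user` share `user` as a common friend.
  friends.combination(2) { |a, b| out << [[a, b].sort, 1] }
  out
end
# Then: flat_map over the lines, group by pair, drop pairs marked :friends,
# sum the counts, and keep the top suggestions per user.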

toy input: https://github.com/xjlin0/cs246/blob/master/w2015/hw1/q1testdata.txt
bigger input: http://snap.stanford.edu/class/cs246-data/hw1q1.zip

Working PySpark code (runs in the PySpark shell by typing execfile('q1_people_you_may_know_spark.py')):
https://github.com/xjlin0/cs246/blob/master/w2015/hw1/q1_people_you_may_know_spark.py

My ruby-spark code, run in the ruby-spark shell (by typing load('q1_people_you_may_know_spark.rb')), works for the toy input but fails on the bigger input:
https://github.com/xjlin0/cs246/blob/master/w2015/hw1/q1_people_you_may_know_spark.rb

Thanks!

Running the tests

Hi, I cloned this project, ran bundle install using Ruby 2.1.3 on OS X, and then tried running the tests via rspec spec, but I got the following error. Am I missing an installation step?

/Users/gnilrets/git/ruby-spark/lib/spark.rb:177:in `require': cannot load such file -- ruby_spark_ext (LoadError)
        from /Users/gnilrets/git/ruby-spark/lib/spark.rb:177:in `<top (required)>'
        from /Users/gnilrets/git/ruby-spark/lib/ruby-spark.rb:1:in `require_relative'
        from /Users/gnilrets/git/ruby-spark/lib/ruby-spark.rb:1:in `<top (required)>'
        from /Users/gnilrets/git/ruby-spark/spec/spec_helper.rb:5:in `require'
        from /Users/gnilrets/git/ruby-spark/spec/spec_helper.rb:5:in `<top (required)>'
        from /Users/gnilrets/git/ruby-spark/spec/lib/collect_spec.rb:1:in `require'
        from /Users/gnilrets/git/ruby-spark/spec/lib/collect_spec.rb:1:in `<top (required)>'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/configuration.rb:1226:in `load'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/configuration.rb:1226:in `block in load_spec_files'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/configuration.rb:1224:in `each'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/configuration.rb:1224:in `load_spec_files'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/runner.rb:97:in `setup'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/runner.rb:85:in `run'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/runner.rb:70:in `run'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/lib/rspec/core/runner.rb:38:in `invoke'
        from /Users/gnilrets/git/ruby-spark/vendor/gems/rspec-core-3.2.3/exe/rspec:4:in `<top (required)>'
        from /Users/gnilrets/git/ruby-spark/vendor/bin/rspec:23:in `load'
        from /Users/gnilrets/git/ruby-spark/vendor/bin/rspec:23:in `<main>'
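
Possibly relevant: a later issue in this thread ("Class org.apache.spark.SparkConf is missing...") lists a compile step before running the suite, and ruby_spark_ext looks like the gem's native extension, so building it first may be the missing step (an inference from that report, not a verified fix):

bundle install
rake compile
bin/ruby-spark build
rspec spec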

Issue with Master::Base::NIO NameError

Really cool project. I've almost got things working, but I'm running into this error.

master.rb:39:in `run': uninitialized constant Master::Base::NIO (NameError)
    from master.rb:144:in `'
15/05/15 22:49:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.ruby.RubyWorker$.createWorker(RubyWorker.scala:70)
at org.apache.spark.api.ruby.RubyWorker$.create(RubyWorker.scala:51)
at org.apache.spark.api.ruby.RubyRDD.compute(RubyRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/15 22:49:23 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.ruby.RubyWorker$.createWorker(RubyWorker.scala:70)
at org.apache.spark.api.ruby.RubyWorker$.create(RubyWorker.scala:51)
at org.apache.spark.api.ruby.RubyRDD.compute(RubyRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
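
A hedged first check, assuming the NIO constant comes from the nio4r gem (which shows up in this project's bundle in the JRuby issue below): verify that nio4r is installed and loadable in the environment the worker master runs in.

begin
  require 'nio'                                  # the nio4r gem defines the NIO namespace
  puts "nio4r loaded: #{NIO::Selector.new.inspect}"
rescue LoadError => e
  puts "nio4r is not available: #{e.message}"
end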

Java SE 6 required on OS X

I ran ruby-spark build successfully, but when I tried ruby-spark shell I got a popup saying I need to install the Java SE 6 runtime.

I installed Java 6 and now it's working, but I'm very curious: why is Java 6 needed?

Ruby-spark shell: No FileSystem for scheme: file

I am trying ruby-spark and have successfully installed it on Linux Mint
(though there is an issue with the sbt/sbt script, which fails to detect a failed download; I ended up with a jar containing the HTML of a 404 page).

I have successfully run the Pi example on the local cluster using the ruby-spark shell, but I cannot use a local file as input for an RDD like this:
expo = Spark.sc.text_file("/home/yves/test.csv")
expo.collect

I keep getting:
Spark::RDDError: java.io.IOException: No FileSystem for scheme: file
from /home/yves/.rvm/gems/ruby-2.2.1/gems/ruby-spark-1.2.0/lib/spark/rdd.rb:207:in `rescue in collect'

I don't know what to fix in the sbt build script, but it is clearly linked to what is put into
ruby-spark-deps.jar, because replacing that file with an already-built spark.jar makes it work.

From what I have read, it seems to be a packaging/build issue with Hadoop
(see http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file).

Update: I understand that this is more a Spark issue than a ruby-spark issue,
but since ruby-spark ships its own Spark, it would be nice to have a fully working Spark version available,
or at least a pointer to a fix.
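
Until the bundled jar is fixed, a hedged workaround based on the Stack Overflow thread above may help: Spark forwards spark.hadoop.* settings into the Hadoop configuration, so the local filesystem implementation can be named explicitly instead of relying on the META-INF/services entry that jar merging tends to clobber (untested with this gem):

Spark.config do
  set 'spark.hadoop.fs.file.impl', 'org.apache.hadoop.fs.LocalFileSystem'
end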

Install issues

I'm having issues getting ruby-spark installed on my Mac running Yosemite. When I run the command
ruby-spark build

I get the following error:
Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.7.jar

I installed sbt with Homebrew and it seems to be working, but I always get this error.

Any suggestions on how to fix this and get ruby-spark running properly?
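
One lead worth checking: the "Ruby-spark shell: No FileSystem for scheme: file" issue above reports that the sbt/sbt script does not detect failed downloads and can leave an HTML 404 page saved as the jar. Inspecting the size of sbt/sbt-launch-0.13.7.jar (an HTML error page is only a few kilobytes) and deleting it to force a fresh download may be worth trying; this is an inference from that report, not a confirmed fix.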

Not making use of multiple cores

I've got a file with 8M records and I'm trying to split it into words and do a word count. Here's my code. When I run it, I see 4 new Ruby processes start on my machine, but only one of them shoots to 100% CPU; the others just sit idle. I don't think it's parallelizing properly. Am I missing a configuration setting somewhere?

require 'ruby-spark'
Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 2048
end
Spark.start
sc = Spark.sc

tfile = sc.text_file('work/Contact.csv')
words = tfile.flat_map('lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ")}')
words.count
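
One thing to check, assuming ruby-spark mirrors PySpark's API here (an assumption on my part): a single local text file may come back as one partition, in which case only one worker gets any data. If text_file accepts a minimum-partition argument like PySpark's textFile does, asking for more partitions should spread the work:

# Hypothetical second argument, assuming parity with PySpark's minPartitions.
tfile = sc.text_file('work/Contact.csv', 8)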

Maybe a better implementation of ruby binding for Apache Spark

Hi,

I have written a new prototype of a Ruby binding for Spark:

https://github.com/chyh1990/jruby-spark

Although this implementation only works on JRuby, I think this approach is more promising:

  • REAL closure/lambda serialization, with elegant syntax

https://github.com/chyh1990/jruby-spark/blob/master/examples/pagerank.rb

  • uses the JVM infrastructure and runs on YARN with the standard job-submission workflow
  • reuses the Java/Scala API, so we get Streaming/SQL/GraphX support nearly for free

https://github.com/chyh1990/jruby-spark/blob/master/examples/sqltest.rb

  • easier to maintain, even without merging into mainline Spark

The prototype is preliminary, but the concept is proven. I think Ruby would be a more elegant binding language for Spark than Python. I'm looking forward to more participants!

Support for dataframes

I'm really interested in using Spark and would love to be able to interact with it from Ruby; this gem looks like a great option. It doesn't look like it natively supports Spark DataFrames, right? Would there be any way to interact with DataFrames using this gem? If not, how much effort would you expect building it in to require?
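
For what it's worth, a later issue in this thread ("Does this project have a future?") mentions an open pull request for DataFrame support that has not been merged.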

Yarn and Ruby Spark

Is it possible to configure ruby-spark to run against a cluster that uses YARN rather than Mesos? If so, what is the syntax for configuring ruby-spark to do this?
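
I have not tried this, but since ruby-spark exposes set_master (it appears with 'local[*]' elsewhere in this thread), the standard Spark 1.x master URL for YARN client mode might work, assuming the value is passed straight through to Spark:

Spark.config do
  set_master 'yarn-client'   # hypothetical: the standard Spark 1.x YARN client-mode master URL
end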

SparkException: There was a problem with creating a server

I'm getting org.apache.spark.SparkException: There was a problem with creating a server when attempting to call rdd.take(..).

ruby 2.2.4p230
Windows 7
Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25
Spark version 2.0.2
Scala version 2.11.8

Any pointers are appreciated

C:\path\to\test>ruby spark_test.rb
16/11/22 12:29:22 INFO spark.SparkContext: Running Spark version 1.5.0
16/11/22 12:29:22 INFO spark.SecurityManager: Changing view acls to: username
16/11/22 12:29:22 INFO spark.SecurityManager: Changing modify acls to: username
16/11/22 12:29:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(username); users with modify permissions: Set(username)
16/11/22 12:29:23 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/11/22 12:29:23 INFO Remoting: Starting remoting
16/11/22 12:29:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.73.14.60:57673]
16/11/22 12:29:23 INFO util.Utils: Successfully started service 'sparkDriver' on port 57673.
16/11/22 12:29:23 INFO spark.SparkEnv: Registering MapOutputTracker
16/11/22 12:29:23 INFO spark.SparkEnv: Registering BlockManagerMaster
16/11/22 12:29:23 INFO storage.DiskBlockManager: Created local directory at C:\path\to\username\AppData\Local\Temp\blockmgr-20f72b69-042d-4bde-8c0a-be0b5e544925
16/11/22 12:29:23 INFO storage.MemoryStore: MemoryStore started with capacity 1955.5 MB
16/11/22 12:29:23 INFO spark.HttpFileServer: HTTP File server directory is C:\path\to\username\AppData\Local\Temp\spark-860dc58a-60af-4551-9c21-2857427e9259\httpd-24c4f7a0-7cca-44e2-a5db-f3c779f0fa73
16/11/22 12:29:23 INFO spark.HttpServer: Starting HTTP Server
16/11/22 12:29:23 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/11/22 12:29:23 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:57674
16/11/22 12:29:23 INFO util.Utils: Successfully started service 'HTTP file server' on port 57674.
16/11/22 12:29:23 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/11/22 12:29:23 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/11/22 12:29:23 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/11/22 12:29:23 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/11/22 12:29:23 INFO ui.SparkUI: Started SparkUI at http://10.73.14.60:4040
16/11/22 12:29:23 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/11/22 12:29:23 INFO executor.Executor: Starting executor ID driver on host localhost
16/11/22 12:29:23 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57693.
16/11/22 12:29:23 INFO netty.NettyBlockTransferService: Server created on 57693
16/11/22 12:29:23 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/11/22 12:29:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:57693 with 1955.5 MB RAM, BlockManagerId(driver, localhost, 57693)
16/11/22 12:29:23 INFO storage.BlockManagerMaster: Registered BlockManager
16/11/22 12:29:23 INFO spark.SparkContext: Added JAR /.ruby-spark.7c68fd15-69f6-4d3d-83b7-f34a4cbd7a06/ruby-spark.jar at http://10.73.14.60:57674/jars/ruby-spark.jar with timestamp 1479842963886
16/11/22 12:29:23 INFO Ruby: Ruby accumulator server is running on port 57694
16/11/22 12:29:24 WARN spark.SparkContext: sc.runJob with allowLocal=true is deprecated in Spark 1.5.0+
16/11/22 12:29:24 INFO spark.SparkContext: Starting job: Ruby
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Got job 0 (Ruby) with 1 output partitions
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(Ruby)
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Missing parents: List()
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (RubyRDD[1] at Ruby), which has no missing parents
16/11/22 12:29:24 INFO storage.MemoryStore: ensureFreeSpace(2632) called with curMem=0, maxMem=2050534932
16/11/22 12:29:24 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.6 KB, free 1955.5 MB)
16/11/22 12:29:24 INFO storage.MemoryStore: ensureFreeSpace(1664) called with curMem=2632, maxMem=2050534932
16/11/22 12:29:24 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1664.0 B, free 1955.5 MB)
16/11/22 12:29:24 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:57693 (size: 1664.0 B, free: 1955.5 MB)
16/11/22 12:29:24 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
16/11/22 12:29:24 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (RubyRDD[1] at Ruby)
16/11/22 12:29:24 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/11/22 12:29:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2125 bytes)
16/11/22 12:29:24 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/11/22 12:29:24 INFO executor.Executor: Fetching http://10.73.14.60:57674/jars/ruby-spark.jar with timestamp 1479842963886
16/11/22 12:29:24 INFO util.Utils: Fetching http://10.73.14.60:57674/jars/ruby-spark.jar to C:\path\to\username\AppData\Local\Temp\spark-860dc58a-60af-4551-9c21-2857427e9259\userFiles-1f4e6fa6-ebfa-41fd-9ebd-a7873e970559\fetchFileTemp3122839241910942098.tmp
16/11/22 12:29:24 INFO executor.Executor: Adding file:/C:/path/to/username/AppData/Local/Temp/spark-860dc58a-60af-4551-9c21-2857427e9259/userFiles-1f4e6fa6-ebfa-41fd-9ebd-a7873e970559/ruby-spark.jar to class loader
16/11/22 12:29:24 INFO ruby.FileCommand: New FileCommand at C:\path\to\username\AppData\Local\Temp\spark-860dc58a-60af-4551-9c21-2857427e9259\userFiles-1f4e6fa6-ebfa-41fd-9ebd-a7873e970559\command847668330604496435.cmd
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\path\to\test>16/11/22 12:29:34 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: There was a problem with creating a server
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:100)
        at org.apache.spark.api.ruby.RubyWorker$.create(RubyWorker.scala:47)
        at org.apache.spark.api.ruby.RubyRDD.compute(RubyRDD.scala:50)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.spark.SparkException: Ruby master did not connect back in time
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:161)
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:97)
        ... 10 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:198)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:154)
        ... 11 more
16/11/22 12:29:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.SparkException: There was a problem with creating a server
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:100)
        at org.apache.spark.api.ruby.RubyWorker$.create(RubyWorker.scala:47)
        at org.apache.spark.api.ruby.RubyRDD.compute(RubyRDD.scala:50)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.spark.SparkException: Ruby master did not connect back in time
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:161)
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:97)
        ... 10 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:198)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:154)
        ... 11 more

16/11/22 12:29:34 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/11/22 12:29:34 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/22 12:29:34 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
16/11/22 12:29:34 INFO scheduler.DAGScheduler: ResultStage 0 (Ruby) failed in 10.583 s
16/11/22 12:29:34 INFO scheduler.DAGScheduler: Job 0 failed: Ruby, took 10.757870 s
16/11/22 12:29:34 INFO Ruby: Ruby accumulator server was stopped
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/11/22 12:29:34 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/11/22 12:29:35 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/11/22 12:29:35 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/11/22 12:29:35 INFO ui.SparkUI: Stopped Spark web UI at http://10.73.14.60:4040
16/11/22 12:29:35 INFO scheduler.DAGScheduler: Stopping DAGScheduler
16/11/22 12:29:35 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/22 12:29:35 INFO storage.MemoryStore: MemoryStore cleared
16/11/22 12:29:35 INFO storage.BlockManager: BlockManager stopped
16/11/22 12:29:35 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/11/22 12:29:35 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/22 12:29:35 INFO spark.SparkContext: Successfully stopped SparkContext
16/11/22 12:29:35 INFO Ruby: Workers were stopped
16/11/22 12:29:35 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/11/22 12:29:35 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/11/22 12:29:35 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
C:/Rubies/Ruby224/lib/ruby/gems/2.2.0/gems/ruby-spark-1.2.1/lib/spark/context.rb:306:in `method_missing': Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.SparkException: There was a problem with creating a server (SparkException)
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:100)
        at org.apache.spark.api.ruby.RubyWorker$.create(RubyWorker.scala:47)
        at org.apache.spark.api.ruby.RubyRDD.compute(RubyRDD.scala:50)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.spark.SparkException: Ruby master did not connect back in time
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:161)
        at org.apache.spark.api.ruby.RubyWorker$.createServer(RubyWorker.scala:97)
        ... 10 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:198)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.spark.api.ruby.RubyWorker$.createMaster(RubyWorker.scala:154)
        ... 11 more

Driver stacktrace:
        from C:/Rubies/Ruby224/lib/ruby/gems/2.2.0/gems/ruby-spark-1.2.1/lib/spark/context.rb:306:in `run_job_with_command'
        from C:/Rubies/Ruby224/lib/ruby/gems/2.2.0/gems/ruby-spark-1.2.1/lib/spark/rdd.rb:256:in `take'
        from spark_test.rb:27:in `<main>'

JRuby Gemfile: C Extension

Great project!

I am having trouble installing ruby-spark via Bundler under JRuby. This is what happens:

=> bundle install
Fetching gem metadata from https://rubygems.org/........
Fetching version metadata from https://rubygems.org/..
Resolving dependencies......
Using coderay 1.1.0
Using highline 1.7.2
Using commander 4.3.4
Using diff-lcs 1.2.5
Using distribution 0.7.3
Using ffi 1.9.10
Using tins 0.13.2
Using file-tail 1.0.12
Using method_source 0.8.2
Using nio4r 1.1.0
Using slop 3.6.0
Using spoon 0.0.4
Using pry 0.10.1
Installing rjb 1.5.3 with native extensions

Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

    /Users/emanuel/.rvm/rubies/jruby-1.7.20.1/bin/jruby -r ./siteconf20150702-1012-k7ach.rb extconf.rb
NotImplementedError: C extension support is not enabled. Pass -Xcext.enabled=true to JRuby or set JRUBY_OPTS.

   (root) at /Users/emanuel/.rvm/rubies/jruby-1.7.20.1/lib/ruby/shared/mkmf.rb:8
  require at org/jruby/RubyKernel.java:1072
   (root) at /Users/emanuel/.rvm/rubies/jruby-1.7.20.1/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1
   (root) at extconf.rb:6

extconf failed, uncaught signal 1

Gem files will remain installed in /Users/emanuel/.rvm/gems/jruby-1.7.20.1@testing/gems/rjb-1.5.3 for inspection.
Results logged to /Users/emanuel/.rvm/gems/jruby-1.7.20.1@testing/extensions/universal-java-1.8/1.9/rjb-1.5.3/gem_make.out
An error occurred while installing rjb (1.5.3), and Bundler cannot continue.
Make sure that `gem install rjb -v '1.5.3'` succeeds before bundling.

=> ruby -v
jruby 1.7.20.1 (1.9.3p551) 2015-06-10 d7c8c27 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12 +jit [darwin-x86_64]

=> java -version
java version "1.8.0_11"
Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)

=> cat .rvmrc
rvm --create use jruby-1.7.20.1@testing

The bundle worked when I removed the following lines from ruby-spark's Gemfile:

platform :mri do
  gem 'rjb'
  gem 'msgpack'
  gem 'oj'
  gem 'narray'
end

Strange, of course, as I am not even using MRI.

Any ideas?

Block or Proc?

What is the better way to define a function on an RDD?

# as Proc
rdd.map(lambda{|x| x*2})

# as block
rdd.map {|x| x*2}

Which method should be supported?

As Proc:

  • the same way as in Python
  • currently implemented

As block:

  • what about aggregate(zero_value, seq_op, comb_op)?
  • this method needs two functions

Both:

  • what about reduce_by_key(f, num_partitions=nil)?

  • if you would like to use a block and num_partitions (a sketch of accepting both forms follows below):

      rdd.reduce_by_key(nil, 2){|x,y| x+y}
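
A hedged sketch of how both forms could be accepted (a design sketch, not the gem's current implementation):

def reduce_by_key(f = nil, num_partitions = nil, &block)
  func = f || block
  raise ArgumentError, 'pass a Proc or a block' unless func
  # ...existing Proc-based implementation continues with `func`
end

# Both call styles then reach the same code path:
rdd.reduce_by_key(lambda { |x, y| x + y })
rdd.reduce_by_key(nil, 2) { |x, y| x + y }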

Can't read file from local file system

file = $sc.text_file("test.txt")

When I try to use the file, it reports the following error:

Exception `RuntimeException' at /home/jacshen/.rvm/gems/ruby-2.2.1/gems/ruby-spark-1.2.1/lib/spark/rdd.rb:203 - java.io.IOException: No FileSystem for scheme: file
Exception `Spark::RDDError' at /home/jacshen/.rvm/gems/ruby-2.2.1/gems/ruby-spark-1.2.1/lib/spark/rdd.rb:207 - java.io.IOException: No FileSystem for scheme: file
Exception `Spark::RDDError' at /home/jacshen/.rvm/gems/ruby-2.2.1/gems/pry-0.10.3/lib/pry/pry_instance.rb:355 - java.io.IOException: No FileSystem for scheme: file
Spark::RDDError: java.io.IOException: No FileSystem for scheme: file
    from /home/jacshen/.rvm/gems/ruby-2.2.1/gems/ruby-spark-1.2.1/lib/spark/rdd.rb:207:in `rescue in collect'
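
This appears to be the same failure reported above in "Ruby-spark shell: No FileSystem for scheme: file"; the hedged spark.hadoop.fs.file.impl workaround suggested there may apply here as well.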

Class org.apache.spark.SparkConf is missing. Make sure that Spark and RubySpark is assembled. (Spark::JavaBridgeError)

I followed the installation steps, but got the following error when running tests. Is there anything I have missed? Thanks!

git clone ...
bundle install
rake compile
rake
bin/ruby-spark build
rake
/Users/david/.rvm/rubies/ruby-2.2.1/bin/ruby -I/Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib:/Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-support-3.3.0/lib /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/exe/rspec --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb
Coverage report generated for RSpec to /Users/david/gitrepos/ruby-spark/coverage. 316 / 458 LOC (69.0%) covered.
/Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/base.rb:198:in `raise_missing_class': Class org.apache.spark.SparkConf is missing. Make sure that Spark and RubySpark is assembled. (Spark::JavaBridgeError)
    from /Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/rjb.rb:20:in `rescue in import'
    from /Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/rjb.rb:18:in `import'
    from /Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/base.rb:53:in `block in import_all'
    from /Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/base.rb:52:in `each'
    from /Users/david/gitrepos/ruby-spark/lib/spark/java_bridge/base.rb:52:in `import_all'
    from /Users/david/gitrepos/ruby-spark/lib/spark.rb:214:in `load_lib'
    from /Users/david/gitrepos/ruby-spark/spec/spec_helper.rb:9:in `<top (required)>'
    from /Users/david/gitrepos/ruby-spark/spec/lib/collect_spec.rb:1:in `require'
    from /Users/david/gitrepos/ruby-spark/spec/lib/collect_spec.rb:1:in `<top (required)>'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/configuration.rb:1327:in `load'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/configuration.rb:1327:in `block in load_spec_files'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/configuration.rb:1325:in `each'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/configuration.rb:1325:in `load_spec_files'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/runner.rb:102:in `setup'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/runner.rb:88:in `run'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/runner.rb:73:in `run'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib/rspec/core/runner.rb:41:in `invoke'
    from /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/exe/rspec:4:in `<main>'
/Users/david/.rvm/rubies/ruby-2.2.1/bin/ruby -I/Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/lib:/Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-support-3.3.0/lib /Users/david/.rvm/gems/ruby-2.2.1/gems/rspec-core-3.3.1/exe/rspec --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb failed

Problems with running Spark with SQL

Just testing Spark for myself.
Installation was perfect, everything runs smoothly, and RDDs work just fine.
But I can't do a thing with SQL. Am I missing something? Should I import or require something?
Spark.start_sql - undefined method start_sql
Spark::SQL - uninitialized constant Spark::SQL

Any hints on how to run it?
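
One possible explanation, judging from the "Support for dataframes" issue above and the unmerged DataFrame pull request mentioned below: SQL/DataFrame support does not appear to ship in the released gem, which would explain why Spark.start_sql and Spark::SQL are undefined.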

Does this project have a future?

Hi, I'm really excited by Spark, but not so excited about transitioning my work to Python. I'd love to see this project succeed, but it looks like it might be stalled. Is that true?

I see there's a pull request for DataFrame support that hasn't been merged. Is there anything I could do to help move that along?
