
silex's Introduction

silex

something to help you spark

This is a library of reusable code for Spark applications, factored out of applications we've built at Red Hat. It will grow in the future but for now we have an application skeleton, some useful extensions to data frames and RDDs, utility functions for handling time and stemming text, and helpers for feature extraction.

Using silex

Add the following resolver to your project:

resolvers += "Will's bintray" at "https://dl.bintray.com/willb/maven/"

and then add Silex as a dependency:

libraryDependencies += "io.radanalytics" %% "silex" % "0.2.0"

Since version 0.0.9, Silex is built for both Scala 2.10 and Scala 2.11. Since version 0.1.0, Silex depends on Spark 2.0.

Documentation

The Silex web site includes some examples of Silex functionality in use and API docs.

Notes for developers

To cut a new release, use the git flow release workflow.

  1. Start a new release branch with git flow release start x.y.z
  2. Incorporate any release-specific patches that do not belong on the develop branch
  3. Bump version numbers in the README, build definition, and Jekyll configuration.
  4. Run tests for every cross build: sbt +test
  5. Publish binary artifacts to bintray for each cross-build: sbt +publish
  6. Publish an updated site for the project: sbt ghpages-push-site

CI Status

Build Status Coverage Status

silex's People

Contributors

erikerlandson, rnowling, willb

silex's Issues

Interface for per-file-parallelism with Hadoop FS

For many of the use cases we've seen, per-file parallelism (as opposed to per-line) performs well enough and makes parsing easier: file formats like JSON and CSV may require tracking state from the start of the file, which makes chunking files problematic.

We'd like to provide an API for per-file parallelism, reusing Hadoop's FileSystem interfaces, which provide a unified reader interface for a variety of sources including S3, FTP, HDFS, and POSIX filesystems. If lazy evaluation is possible, we wouldn't have to keep large files in memory. We may also want a way to list and filter the set of files (e.g., with predicates richer than glob expressions).
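As a local-filesystem sketch of the listing-and-filtering half of this idea (the name listFiles is hypothetical, and Hadoop's FileSystem would replace java.nio in a real implementation), each surviving path could then become one Spark task, e.g. via sc.parallelize(paths):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper: enumerate regular files under a root and keep only
// those matching an arbitrary predicate -- richer than glob-style patterns.
def listFiles(root: Path, keep: Path => Boolean): Seq[Path] =
  Files.walk(root).iterator().asScala
    .filter(p => Files.isRegularFile(p))
    .filter(keep)
    .toSeq
```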

Investigate exact p-value computation for Cramer's V (#56)

Issue derived from #56

The Wikipedia page on Cramer's V mentions this:
"The p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test."

Does this mean p-vals can be computed exactly, or is there a niche for the permutation-based estimator?

Spark configuration parameter `spark.kryoserializer.buffer.mb` is deprecated

Output from using ConsoleApp:

16/01/15 11:56:44 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
16/01/15 11:56:44 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

We need to update the ConsoleApp.
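A sketch of the likely one-line fix, assuming ConsoleApp currently sets the deprecated key; per the warning text above, "64k" is the equivalent of the old 0.064 MB default:

```scala
import org.apache.spark.SparkConf

// Replace the deprecated key with its Spark 1.4+ equivalent.
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer", "64k") // was spark.kryoserializer.buffer.mb = 0.064
```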

Improvements to histogramming api

  1. Change object HistogramRDD to package object histogramming, or something similar
  2. The input to countByFlat and histByFlat can be TraversableOnce
  3. The output of histogramming may as well be Seq, since a sequence is already created for sorting prior to output
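A sketch of what items 2 and 3 might look like together (countByFlat here is illustrative, not the actual Silex signature, and uses Iterable in place of the proposed TraversableOnce so it compiles on newer Scala):

```scala
// Count occurrences of each distinct value, returning a Seq already sorted
// by descending count -- the sort is why Seq output costs nothing extra.
def countByFlat[T](data: Iterable[T]): Seq[(T, Int)] =
  data.groupBy(identity)
    .map { case (k, v) => (k, v.size) }
    .toSeq
    .sortBy(-_._2)
```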

AWL: alternative hash function

The current implementation of tabulation hashing in ApproximateWhitelist will result in identical values for strings that are permutations of one another. This is likely not a big deal in general, but it is probably worth investigating a permutation-safe tabulation hash variant or an alternative hash function.
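To illustrate the problem, assume a simplified tabulation hash that XORs one table entry per character: XOR is commutative, so permuted strings collide. Mixing the character's position into the lookup (here via bit rotation) is one possible permutation-safe variant. None of this is the actual ApproximateWhitelist code; it is a minimal sketch of the issue.

```scala
// Shared random lookup table, one 64-bit entry per byte value.
val table: Array[Long] = {
  val rng = new scala.util.Random(42)
  Array.fill(256)(rng.nextLong())
}

// Permutation-UNSAFE: XOR is order-insensitive, so "abc" and "cba" collide.
def naiveHash(s: String): Long =
  s.foldLeft(0L)((h, c) => h ^ table(c & 0xff))

// One permutation-safe variant: rotate each table entry by the character's
// position before XORing, so order affects the result.
def positionalHash(s: String): Long =
  s.zipWithIndex.foldLeft(0L) { case (h, (c, i)) =>
    h ^ java.lang.Long.rotateLeft(table(c & 0xff), i % 64)
  }
```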

crash due to json4s incompatibility

Thanks to Tomasz Fruboes for reporting the following issue:


I was trying to play with your silex library. Unfortunately, it looks like there is some incompatibility between json4s-jackson 3.2.11 and Spark, leading to a crash:

[error] (run-main-0) java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;
       at org.apache.spark.sql.types.DataType.json(dataTypes.scala:264)

json4s v3.2.11 seems to be Spark-unfriendly; see json4s/json4s#212.

Steps to reproduce - create a build.sbt with:

resolvers += "Will's bintray" at "https://dl.bintray.com/willb/maven/"
libraryDependencies += "com.redhat.et" %% "silex" % "0.0.6"

and try reading a parquet file in spark:

         val conf = new SparkConf().setMaster("local[2]")
         val sc = new SparkContext(conf)
         val sqlContext = new SQLContext(sc)
         val df = sqlContext.parquetFile("pathToFile")
         df.show(5)

The 0.0.5 version seems to be OK. I have tried checking out the silex code from GitHub, changing the json4s-jackson version to 3.2.10 in build.sbt, and referencing it in my project in the following way:

lazy val silex = ProjectRef(file("../silex/"), "silex")

lazy val myapp = project.in(file(".")).
       dependsOn(silex).
       settings(
           scalaVersion := "2.10.4"
       )

Unfortunately the error persists, which is a bit confusing. Any idea why?

`sbt` script under `develop` branch is out of date

When I tried to use the sbt script to build silex, the script reported an error while retrieving sbt-launch.jar. I noticed that it tried to use the old artifactoryonline.com repository. Updating the sbt script fixed the issue for me.

Utilities for custom partitioning

Custom partitioning can make certain operations easier (e.g., grouping data to control the mapping between data and files). We should evaluate the space of ways custom partitioning can be used and provide utilities to make it easier. It may also be good to identify high-level tasks that need custom partitioning and present interfaces for those as well.
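As a plain-Scala sketch of the shape such a utility might take, this mirrors the numPartitions/getPartition contract of Spark's Partitioner and routes records by date so each partition (and hence each output file) holds one day's data. DatePartitioner is hypothetical, not a proposed Silex API:

```scala
// Minimal stand-in for org.apache.spark.Partitioner's contract.
trait SimplePartitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Assign each known date string its own partition index.
class DatePartitioner(dates: Seq[String]) extends SimplePartitioner {
  private val index = dates.zipWithIndex.toMap
  val numPartitions = dates.size
  def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}
```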

predict function in Kmedoid model

def predict(points: RDD[T]): RDD[Int]
Return an RDD produced by predicting the closest medoid for each row.

I am using it like this:

         // provided all the parameters
         val obj1 = new KMedoids(metric: (Vector, Vector) => Double, k: Int, maxIterations: Int, epsilon: Double, fractionEpsilon: Double, sampleSize: Int, numThreads: Int, seed: Long)
         // rows is an RDD of vectors
         val obj2 = obj1.run(rows)
         val predictions: RDD[Int] = obj2.predict(rows)

This throws a "Task not serializable" exception.
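One common cause of "Task not serializable" with a user-supplied metric is a closure that captures a non-serializable enclosing object. This sketch (all names hypothetical, unrelated to the Silex KMedoids internals) shows the difference, checked with plain Java serialization as Spark task shipping requires:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Check whether a value survives plain Java serialization.
def isSerializable(x: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(x); true }
  catch { case _: java.io.NotSerializableException => false }

// BAD: the metric references a field of a non-serializable enclosing class,
// so the closure drags the whole Driver instance along with it.
class Driver {
  val weights = Array(1.0, 1.0)
  val badMetric: (Array[Double], Array[Double]) => Double =
    (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) * weights(0) }.sum
}

// GOOD: a self-contained metric in a serializable object captures nothing.
object Metrics extends Serializable {
  val euclidean: (Array[Double], Array[Double]) => Double =
    (a, b) => math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}
```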

Implement Kendall's Tau

Implement Kendall's Tau, a measure of ordinal association.

Ping @erikerlandson -- do you have an implementation sitting around you could easily make into a PR? :)
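For reference, a minimal pure-Scala sketch of tau-a by direct pairwise comparison, O(n^2) and ignoring tie correction (kendallTau is an illustrative name, not part of Silex):

```scala
// Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs.
def kendallTau(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1)
  val n = xs.length
  var concordant = 0
  var discordant = 0
  for (i <- 0 until n; j <- i + 1 until n) {
    val s = math.signum(xs(i) - xs(j)) * math.signum(ys(i) - ys(j))
    if (s > 0) concordant += 1 else if (s < 0) discordant += 1
  }
  (concordant - discordant).toDouble / (n * (n - 1) / 2)
}
```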

Add implementation of Cramer's V

Cramer's V is a measure of association between nominal (categorical) variables. Useful for feature selection, comparing clusterings, potentially evaluating splits in Decision Trees trained on purely categorical data, etc.
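A minimal sketch of the statistic itself, computed from an r-by-c contingency table as V = sqrt(chi2 / (n * min(r-1, c-1))) (cramersV is an illustrative name, not a proposed API):

```scala
// Cramer's V from a contingency table of observed counts.
def cramersV(table: Array[Array[Double]]): Double = {
  val rowSums = table.map(_.sum)
  val colSums = table.transpose.map(_.sum)
  val n = rowSums.sum
  var chi2 = 0.0
  for (i <- table.indices; j <- table(i).indices) {
    val expected = rowSums(i) * colSums(j) / n
    val d = table(i)(j) - expected
    chi2 += d * d / expected
  }
  val k = math.min(table.length, table(0).length) - 1
  math.sqrt(chi2 / (n * k))
}
```

V is 0 for independent variables and 1 for perfect association.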

Can silex provide testing utilities / infra?

Silex has several utilities such as PerTestSparkContext which make writing unit tests for Spark applications easier. Could Silex provide similar utilities to the apps using it? If so, what should the scope be?

Not able to train SOMs

Hi,

I am trying to create a SOM model using the following code:

val model = SOM.train(xdim = 5, ydim = 5, examples = ffRDD, iterations = 10, fdim = 3)

Unfortunately I'm getting the following error. Could you please provide some documentation to help me understand whether I'm doing anything wrong in building the model?

Exception in thread "main" java.io.NotSerializableException: com.redhat.et.silex.util.SampleSink
Serialization stack:
- object not serializable (class: com.redhat.et.silex.util.SampleSink, value: SampleSink(count=0, min=Infinity, max=-Infinity, mean=0.0, variance=NaN))
- field (class: com.redhat.et.silex.som.SOM, name: mqsink, type: class com.redhat.et.silex.util.SampleSink)
- object (class com.redhat.et.silex.som.SOM, com.redhat.et.silex.som.SOM@7402c49f)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1370)
        at com.redhat.et.silex.som.SOM$$anonfun$train$1.apply(som.scala:204)
        at com.redhat.et.silex.som.SOM$$anonfun$train$1.apply(som.scala:201)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.immutable.Range.foreach(Range.scala:141)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
        at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105)
        at com.redhat.et.silex.som.SOM$.train(som.scala:201)
        at com.datarpm.dsl.som.som$.main(som.scala:42)
        at com.datarpm.dsl.som.som.main(som.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

I'm using Spark v2.0.0 & Scala v2.10.4

Thanks for your time.
