
silex's Introduction

silex

something to help you spark

This is a library of reusable code for Spark applications, factored out of applications we've built at Red Hat. It will grow in the future but for now we have an application skeleton, some useful extensions to data frames and RDDs, utility functions for handling time and stemming text, and helpers for feature extraction.

Using silex

Add the following resolver to your project:

resolvers += "Will's bintray" at "https://dl.bintray.com/willb/maven/"

and then add Silex as a dependency:

libraryDependencies += "io.radanalytics" %% "silex" % "0.2.0"

Since version 0.0.9, Silex is built for both Scala 2.10 and Scala 2.11. Since version 0.1.0, Silex depends on Spark 2.0.

Documentation

The Silex web site includes some examples of Silex functionality in use and API docs.

Notes for developers

To cut a new release, use the git flow release workflow.

  1. Start a new release branch with git flow release start x.y.z
  2. Incorporate any release-specific patches that do not belong on the develop branch
  3. Bump version numbers in the README, build definition, and Jekyll configuration.
  4. Run tests for every cross build: sbt +test
  5. Publish binary artifacts to bintray for each cross-build: sbt +publish
  6. Publish an updated site for the project: sbt ghpages-push-site

CI Status

Build Status Coverage Status

silex's People

Contributors

erikerlandson, rnowling, willb

silex's Issues

Interface for per-file-parallelism with Hadoop FS

For many of the use cases we've seen, per-file parallelism (as opposed to per-line) performs well enough and makes parsing easier: file formats like JSON and CSV may require tracking state from the start of the file, which makes chunking files problematic.

We'd like to provide an API for per-file parallelism, reusing Hadoop's FileSystem interfaces, which provide a unified reader interface for a variety of sources including S3, FTP, HDFS, and POSIX filesystems. If lazy evaluation is possible, we wouldn't have to keep large files in memory. We may also want a way to list and filter the set of files (e.g., with predicates richer than glob expressions).
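As a local-filesystem sketch of the listing-and-filtering half of this idea (the name listFiles is hypothetical, and Hadoop's FileSystem would replace java.nio in a real implementation), each surviving path could then become one Spark task, e.g. via sc.parallelize(paths):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper: enumerate regular files under a root and keep only
// those matching an arbitrary predicate -- richer than glob-style patterns.
def listFiles(root: Path, keep: Path => Boolean): Seq[Path] =
  Files.walk(root).iterator().asScala
    .filter(p => Files.isRegularFile(p))
    .filter(keep)
    .toSeq
```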

Investigate exact p-value computation for Cramer's V (#56)

Issue derived from #56

The Wikipedia page on Cramer's V mentions this:
"The p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test."

Does this mean p-vals can be computed exactly, or is there a niche for the permutation-based estimator?

Spark configuration parameter `spark.kryoserializer.buffer.mb` is deprecated

Output from using ConsoleApp:

16/01/15 11:56:44 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.
16/01/15 11:56:44 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

We need to update the ConsoleApp.
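A sketch of the likely one-line fix, assuming ConsoleApp currently sets the deprecated key; per the warning text above, "64k" is the equivalent of the old 0.064 MB default:

```scala
import org.apache.spark.SparkConf

// Replace the deprecated key with its Spark 1.4+ equivalent.
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer", "64k") // was spark.kryoserializer.buffer.mb = 0.064
```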

Improvements to histogramming api

  1. Change object HistogramRDD to package object histogramming, or something similar
  2. The input to countByFlat and histByFlat can be TraversableOnce
  3. The output of histogramming may as well be Seq, since a sequence is already created for sorting prior to output
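A sketch of what items 2 and 3 might look like together (countByFlat here is illustrative, not the actual Silex signature, and uses Iterable in place of the proposed TraversableOnce so it compiles on newer Scala):

```scala
// Count occurrences of each distinct value, returning a Seq already sorted
// by descending count -- the sort is why Seq output costs nothing extra.
def countByFlat[T](data: Iterable[T]): Seq[(T, Int)] =
  data.groupBy(identity)
    .map { case (k, v) => (k, v.size) }
    .toSeq
    .sortBy(-_._2)
```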

AWL: alternative hash function

The current implementation of tabulation hashing in ApproximateWhitelist will result in identical values for strings that are permutations of one another. This is likely not a big deal in general, but it is probably worth investigating a permutation-safe tabulation hash variant or an alternative hash function.
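To illustrate the problem, assume a simplified tabulation hash that XORs one table entry per character: XOR is commutative, so permuted strings collide. Mixing the character's position into the lookup (here via bit rotation) is one possible permutation-safe variant. None of this is the actual ApproximateWhitelist code; it is a minimal sketch of the issue.

```scala
// Shared random lookup table, one 64-bit entry per byte value.
val table: Array[Long] = {
  val rng = new scala.util.Random(42)
  Array.fill(256)(rng.nextLong())
}

// Permutation-UNSAFE: XOR is order-insensitive, so "abc" and "cba" collide.
def naiveHash(s: String): Long =
  s.foldLeft(0L)((h, c) => h ^ table(c & 0xff))

// One permutation-safe variant: rotate each table entry by the character's
// position before XORing, so order affects the result.
def positionalHash(s: String): Long =
  s.zipWithIndex.foldLeft(0L) { case (h, (c, i)) =>
    h ^ java.lang.Long.rotateLeft(table(c & 0xff), i % 64)
  }
```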

crash due to json4s incompatibility

Thanks to Tomasz Fruboes for reporting the following issue:


I was trying to play with your silex library. Unfortunately, it looks like there is some incompatibility between json4s-jackson 3.2.11 and Spark, leading to a crash:

[error] (run-main-0) java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;
       at org.apache.spark.sql.types.DataType.json(dataTypes.scala:264)

json4s v3.2.11 seems to be Spark-unfriendly; see json4s/json4s#212.

Steps to reproduce - create a build.sbt with:

resolvers += "Will's bintray" at "https://dl.bintray.com/willb/maven/"
libraryDependencies += "com.redhat.et" %% "silex" % "0.0.6"

and try reading a parquet file in spark:

         val conf = new SparkConf().setMaster("local[2]")
         val sc = new SparkContext(conf)
         val sqlContext = new SQLContext(sc)
         val df = sqlContext.parquetFile("pathToFile")
         df.show(5)

The 0.0.5 version seems to be OK. I have tried checking out the silex code from GitHub, changing the json4s-jackson version to 3.2.10 in build.sbt, and referencing it in my project in the following way:

lazy val silex = ProjectRef(file("../silex/"), "silex")

lazy val myapp = project.in(file(".")).
       dependsOn(silex).
       settings(
           scalaVersion := "2.10.4"
       )

Unfortunately the error persists, which is a bit confusing. Any idea why?

`sbt` script under `develop` branch is out of date

When I tried to use the sbt script to build silex, the script reported an error while retrieving sbt-launch.jar. I noticed that it tried to use the old artifactoryonline.com repository. Updating the sbt script fixed the issue for me.

Utilities for custom partitioning

Custom partitioning can make certain operations easier (e.g., grouping data to control the mapping between data and files). We should evaluate the space of ways custom partitioning can be used and provide utilities to make it easier. It may also be good to identify high-level tasks that need custom partitioning and present interfaces for those as well.
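As a plain-Scala sketch of the shape such a utility might take, this mirrors the numPartitions/getPartition contract of Spark's Partitioner and routes records by date so each partition (and hence each output file) holds one day's data. DatePartitioner is hypothetical, not a proposed Silex API:

```scala
// Minimal stand-in for org.apache.spark.Partitioner's contract.
trait SimplePartitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Assign each known date string its own partition index.
class DatePartitioner(dates: Seq[String]) extends SimplePartitioner {
  private val index = dates.zipWithIndex.toMap
  val numPartitions = dates.size
  def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}
```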

predict function in Kmedoid model

def predict(points: RDD[T]): RDD[Int]
Return an RDD produced by predicting the closest medoid for each row.

I am using it like this:

         // provided all the parameters
         val obj1 = new KMedoids(metric: (Vector, Vector) => Double, k: Int, maxIterations: Int, epsilon: Double, fractionEpsilon: Double, sampleSize: Int, numThreads: Int, seed: Long)
         // rows is an RDD of vectors
         val obj2 = obj1.run(rows)
         val predictions: RDD[Int] = obj2.predict(rows)

This throws a "Task not serializable" exception.
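One common cause of "Task not serializable" with a user-supplied metric is a closure that captures a non-serializable enclosing object. This sketch (all names hypothetical, unrelated to the Silex KMedoids internals) shows the difference, checked with plain Java serialization as Spark task shipping requires:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Check whether a value survives plain Java serialization.
def isSerializable(x: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(x); true }
  catch { case _: java.io.NotSerializableException => false }

// BAD: the metric references a field of a non-serializable enclosing class,
// so the closure drags the whole Driver instance along with it.
class Driver {
  val weights = Array(1.0, 1.0)
  val badMetric: (Array[Double], Array[Double]) => Double =
    (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) * weights(0) }.sum
}

// GOOD: a self-contained metric in a serializable object captures nothing.
object Metrics extends Serializable {
  val euclidean: (Array[Double], Array[Double]) => Double =
    (a, b) => math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}
```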

Implement Kendall's Tau

Implement Kendall's Tau, a measure of ordinal association.

Ping @erikerlandson -- do you have an implementation sitting around you could easily make into a PR? :)
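For reference, a minimal pure-Scala sketch of tau-a by direct pairwise comparison, O(n^2) and ignoring tie correction (kendallTau is an illustrative name, not part of Silex):

```scala
// Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs.
def kendallTau(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1)
  val n = xs.length
  var concordant = 0
  var discordant = 0
  for (i <- 0 until n; j <- i + 1 until n) {
    val s = math.signum(xs(i) - xs(j)) * math.signum(ys(i) - ys(j))
    if (s > 0) concordant += 1 else if (s < 0) discordant += 1
  }
  (concordant - discordant).toDouble / (n * (n - 1) / 2)
}
```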

Add implementation of Cramer's V

Cramer's V is a measure of association between nominal (categorical) variables. Useful for feature selection, comparing clusterings, potentially evaluating splits in Decision Trees trained on purely categorical data, etc.
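A minimal sketch of the statistic itself, computed from an r-by-c contingency table as V = sqrt(chi2 / (n * min(r-1, c-1))) (cramersV is an illustrative name, not a proposed API):

```scala
// Cramer's V from a contingency table of observed counts.
def cramersV(table: Array[Array[Double]]): Double = {
  val rowSums = table.map(_.sum)
  val colSums = table.transpose.map(_.sum)
  val n = rowSums.sum
  var chi2 = 0.0
  for (i <- table.indices; j <- table(i).indices) {
    val expected = rowSums(i) * colSums(j) / n
    val d = table(i)(j) - expected
    chi2 += d * d / expected
  }
  val k = math.min(table.length, table(0).length) - 1
  math.sqrt(chi2 / (n * k))
}
```

V is 0 for independent variables and 1 for perfect association.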

Can silex provide testing utilities / infra?

Silex has several utilities such as PerTestSparkContext which make writing unit tests for Spark applications easier. Could Silex provide similar utilities to the apps using it? If so, what should the scope be?

Not able to train SOMs

Hi,

I am trying to create a SOM model using the following code:

val model = SOM.train(xdim = 5, ydim = 5, examples = ffRDD, iterations = 10, fdim = 3)

Unfortunately I'm getting the following error. Could you please provide some documentation to help me understand whether I'm doing anything wrong in building the model?

Exception in thread "main" java.io.NotSerializableException: com.redhat.et.silex.util.SampleSink
Serialization stack:
- object not serializable (class: com.redhat.et.silex.util.SampleSink, value: SampleSink(count=0, min=Infinity, max=-Infinity, mean=0.0, variance=NaN))
- field (class: com.redhat.et.silex.som.SOM, name: mqsink, type: class com.redhat.et.silex.util.SampleSink)
- object (class com.redhat.et.silex.som.SOM, com.redhat.et.silex.som.SOM@7402c49f)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1370)
        at com.redhat.et.silex.som.SOM$$anonfun$train$1.apply(som.scala:204)
        at com.redhat.et.silex.som.SOM$$anonfun$train$1.apply(som.scala:201)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.immutable.Range.foreach(Range.scala:141)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
        at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105)
        at com.redhat.et.silex.som.SOM$.train(som.scala:201)
        at com.datarpm.dsl.som.som$.main(som.scala:42)
        at com.datarpm.dsl.som.som.main(som.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

I'm using Spark v2.0.0 & Scala v2.10.4

Thanks for your time.
