Spark2Cassandra

Spark Library for Bulk Loading into Cassandra

Requirements

Spark2Cassandra supports Spark 1.5 and above.

| Spark2Cassandra Version | Cassandra Version |
| --- | --- |
| 2.1.X | 2.1.5+ |
| 2.2.X | 2.1.X |

Downloads

SBT

libraryDependencies += "com.github.jparkie" %% "spark2cassandra" % "2.1.0"

Or:

libraryDependencies += "com.github.jparkie" %% "spark2cassandra" % "2.2.0"

Add the following resolvers if needed:

resolvers += "Sonatype OSS Releases" at "https://oss.sonatype.org/content/repositories/releases"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

Maven

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>spark2cassandra_2.10</artifactId>
  <version>x.y.z-SNAPSHOT</version>
</dependency>

It is planned for Spark2Cassandra to be available on the following:

Features

Usage

Bulk Loading into Cassandra

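// Spark imports needed by the setup below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Import the following to have access to the `bulkLoadToCass()` function for RDDs or DataFrames.
import com.github.jparkie.spark.cassandra.rdd._
import com.github.jparkie.spark.cassandra.sql._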

val sparkConf = new SparkConf()
val sc = SparkContext.getOrCreate(sparkConf)
val sqlContext = SQLContext.getOrCreate(sc)

val rdd = sc.parallelize(???)

val df = sqlContext.read.parquet("<PATH>")

// Specify the `keyspaceName` and the `tableName` to write.
rdd.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_date"
)

// Specify the `keyspaceName` and the `tableName` to write.
df.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_author"
)

For more, refer to SparkCassRDDFunction.scala and SparkCassDataFrameFunctions.scala.

Configurations

Spark2Cassandra uses https://github.com/datastax/spark-cassandra-connector for serialization from Spark and for session management. For the connector's configuration options, refer to https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md.

SparkCassWriteConf

For more, refer to SparkCassWriteConf.scala.

| Property Name | Default | Description |
| --- | --- | --- |
| spark.cassandra.bulk.write.partitioner | org.apache.cassandra.dht.Murmur3Partitioner | The 'partitioner' defined in cassandra.yaml. |
| spark.cassandra.bulk.write.throughput_mb_per_sec | Int.MaxValue | The maximum throughput to throttle. |
| spark.cassandra.bulk.write.connection_per_host | 1 | The number of connections per host to utilize when streaming SSTables. |
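
As a minimal sketch, the write properties above could be set on the `SparkConf` before the `SparkContext` is created. The application name and the numeric values here are illustrative assumptions, not recommendations from the project:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune them for your own cluster.
val sparkConf = new SparkConf()
  .setAppName("spark2cassandra-bulk-load") // hypothetical application name
  // Should match the 'partitioner' in the target cluster's cassandra.yaml.
  .set("spark.cassandra.bulk.write.partitioner", "org.apache.cassandra.dht.Murmur3Partitioner")
  // Throttle streaming (assumed value) instead of the Int.MaxValue default.
  .set("spark.cassandra.bulk.write.throughput_mb_per_sec", "50")
  // Stream SSTables over two connections per host (assumed value).
  .set("spark.cassandra.bulk.write.connection_per_host", "2")
```

The resulting `sparkConf` can then be passed to `SparkContext.getOrCreate(sparkConf)` as in the usage example.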

SparkCassServerConf

For more, refer to SparkCassServerConf.scala.

| Property Name | Default | Description |
| --- | --- | --- |
| spark.cassandra.bulk.server.storage.port | 7000 | The 'storage_port' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.sslStorage.port | 7001 | The 'ssl_storage_port' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.internode.encryption | "none" | The 'server_encryption_options:internode_encryption' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.keyStore.path | conf/.keystore | The 'server_encryption_options:keystore' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.keyStore.password | cassandra | The 'server_encryption_options:keystore_password' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.trustStore.path | conf/.truststore | The 'server_encryption_options:truststore' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.trustStore.password | cassandra | The 'server_encryption_options:truststore_password' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.protocol | TLS | The 'server_encryption_options:protocol' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.algorithm | SunX509 | The 'server_encryption_options:algorithm' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.store.type | JKS | The 'server_encryption_options:store_type' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.cipherSuites | TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA | The 'server_encryption_options:cipher_suites' defined in cassandra.yaml. |
| spark.cassandra.bulk.server.requireClientAuth | false | The 'server_encryption_options:require_client_auth' defined in cassandra.yaml. |
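
For a cluster with internode encryption enabled, the server properties above might be configured as in the following sketch. It assumes the `internode.encryption` property accepts the same values as `internode_encryption` in cassandra.yaml (such as `all`); the paths and passwords are hypothetical placeholders:

```scala
import org.apache.spark.SparkConf

// Assumption: values mirror the target cluster's cassandra.yaml.
val sparkConf = new SparkConf()
  // Assumed to accept the cassandra.yaml values; "all" encrypts all internode traffic.
  .set("spark.cassandra.bulk.server.internode.encryption", "all")
  // Port used for encrypted streaming ('ssl_storage_port').
  .set("spark.cassandra.bulk.server.sslStorage.port", "7001")
  // Hypothetical keystore and truststore locations and passwords.
  .set("spark.cassandra.bulk.server.keyStore.path", "/path/to/keystore")
  .set("spark.cassandra.bulk.server.keyStore.password", "<KEYSTORE_PASSWORD>")
  .set("spark.cassandra.bulk.server.trustStore.path", "/path/to/truststore")
  .set("spark.cassandra.bulk.server.trustStore.password", "<TRUSTSTORE_PASSWORD>")
  // Set to "true" only if the cluster sets 'require_client_auth: true'.
  .set("spark.cassandra.bulk.server.requireClientAuth", "false")
```

Properties left unset keep the defaults listed in the table.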

Documentation

Scaladocs are currently unavailable.
