hbase-rdd-examples's Introduction

HBase RDD examples

This is an example project for HBase RDD. It currently runs on CDH 5.5, although it will run on other versions of CDH with minor modifications.

Running

First, build the project with

sbt assembly

This will generate target/scala-2.12/hbase-rdd-examples-assembly-0.9.1.jar.

You can then copy this file, together with the files in the scripts directory, to a gateway machine of the cluster, and run the scripts to launch the jobs. You may have to adapt some parameters in the scripts.

Jobs

We assume the existence of a file test-input in the HDFS home directory of the user running the job. Each line contains four tab-separated fields, each one being a random printable string of length 10. Let us call the four fields k, col1, col2 and col3.

WriteSingleCf copies the contents of this file inside test-table, using k as rowkey, putting col1 and col2 under the column family cf1 and discarding col3.

WriteMultiCf copies the contents of this file inside test-table, using k as rowkey, putting col1 and col2 under the column family cf1 and col3 under cf2.
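The mapping from an input line to cells can be sketched in plain Scala, without Spark. The nested-map shapes below (qualifier -> value for a single family, family -> qualifier -> value for several families) and the qualifier names col1, col2 and col3 are assumptions made for illustration; the actual jobs may shape their data differently.

```scala
object LineShaping {
  // Splits one tab-separated line of test-input into its four fields.
  def parse(line: String): (String, String, String, String) =
    line.split("\t", -1) match {
      case Array(k, c1, c2, c3) => (k, c1, c2, c3)
      case other =>
        throw new IllegalArgumentException(s"expected 4 fields, found ${other.length}")
    }

  // WriteSingleCf-style record (assumed shape): rowkey plus
  // qualifier -> value for a single column family. col3 is discarded.
  def singleCf(line: String): (String, Map[String, String]) = {
    val (k, c1, c2, _) = parse(line)
    k -> Map("col1" -> c1, "col2" -> c2)
  }

  // WriteMultiCf-style record (assumed shape): rowkey plus
  // family -> qualifier -> value, with col3 stored under cf2.
  def multiCf(line: String): (String, Map[String, Map[String, String]]) = {
    val (k, c1, c2, c3) = parse(line)
    k -> Map(
      "cf1" -> Map("col1" -> c1, "col2" -> c2),
      "cf2" -> Map("col3" -> c3)
    )
  }
}
```

Keeping the shaping in pure functions like these makes it easy to check the layout on a few lines before mapping it over an RDD.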

WriteBulk does the same as WriteSingleCf (on table test-table-bulk), by writing to HFiles on HDFS and then submitting these HFiles to the HBase servers.

All the write jobs create the table if it does not already exist. WriteBulk also takes care of computing splits appropriate for the file contents and a desired region size (128M in the example).
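As a rough illustration of what computing splits for a desired region size involves, the sketch below derives the number of regions from the input size and picks evenly spaced rowkeys from a sorted sample as split points. The function names and the sampling approach are hypothetical; they are not taken from WriteBulk or from HBaseRDD.

```scala
object SplitSketch {
  // Number of regions needed so that each one holds at most regionSizeBytes
  // (ceiling division, with a minimum of one region).
  def regionCount(totalBytes: Long, regionSizeBytes: Long): Int = {
    require(regionSizeBytes > 0, "region size must be positive")
    math.max(1L, (totalBytes + regionSizeBytes - 1) / regionSizeBytes).toInt
  }

  // Picks up to (regions - 1) split keys at evenly spaced positions of the
  // sorted sample. String ordering matches byte ordering for ASCII keys,
  // which is what the printable test strings are.
  def splitKeys(sampleKeys: Seq[String], regions: Int): Seq[String] = {
    val sorted = sampleKeys.distinct.sorted.toIndexedSeq
    if (regions <= 1 || sorted.isEmpty) Seq.empty
    else
      (1 until regions)
        .map(i => sorted((i.toLong * sorted.size / regions).toInt))
        .distinct
  }
}
```

For a 300M input and a 128M region size, regionCount yields 3 regions, so splitKeys would return at most two split points.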

Read reads the contents of test-table and reassembles a TSV output on HDFS under test-output, in the same format as the original. It does this by specifying both the column families and the columns to read, as a Map[String, Set[String]]. ReadTS is the same as above but also reads timestamps.
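For the table layout used in these examples, the column specification and the reassembly of one row into a TSV line might look as follows. The qualifier names and the nested-map shape of a row (family -> qualifier -> value) are assumptions for illustration, not the library's documented return type.

```scala
object ReadSketch {
  // Column families mapped to the columns to read in each of them.
  val columns: Map[String, Set[String]] =
    Map("cf1" -> Set("col1", "col2"), "cf2" -> Set("col3"))

  // Rebuilds "k<TAB>col1<TAB>col2<TAB>col3" from a rowkey and its cells.
  // Cells that are missing from the row become empty strings.
  def toTsv(k: String, row: Map[String, Map[String, String]]): String = {
    def cell(cf: String, col: String): String =
      row.get(cf).flatMap(_.get(col)).getOrElse("")
    Seq(k, cell("cf1", "col1"), cell("cf1", "col2"), cell("cf2", "col3"))
      .mkString("\t")
  }
}
```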

ReadCf does the same as Read but only specifies the column families, as a Set[String]. The whole column families are read - this is useful in the cases where the set of columns in a family is not known a priori, e.g. when the column families are used as a set (using a dummy marker value for all columns). ReadTSCf is the same as above but also reads timestamps.
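The "column family as a set" pattern mentioned above can be sketched in a few lines. The family name tags and the row shape (family -> qualifier -> value) are hypothetical and only serve the illustration.

```scala
object FamilyAsSet {
  // Column families to read in full; no column names are listed.
  val families: Set[String] = Set("tags")

  // When a family is used as a set, its members are the column names.
  // The cell values are only a dummy marker and are ignored here.
  def members(row: Map[String, Map[String, String]], cf: String): Set[String] =
    row.get(cf).map(_.keySet).getOrElse(Set.empty[String])
}
```

Since the members are not known in advance, they cannot be listed in a Map[String, Set[String]], which is why reading the whole family is needed in this case.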

DeleteSingleCf deletes contents from test-table, using k as rowkey, deleting col1 and col2 under the column family cf1.

DeleteMultiCf deletes contents from test-table, using k as rowkey, deleting col1 and col2 under the column family cf1 and col3 under cf2.

DeleteRows deletes rows from test-table, using k as rowkey.

In all jobs we are using String values for the cells, but HBaseRDD is not limited to this. Any other type A is supported, provided there is an implicit Reads[A] or Writes[A] in scope. These are traits defined by HBaseRDD that essentially wrap conversions from Array[Byte] to A and vice versa.

By default, HBaseRDD ships converters for Array[Byte] (duh!), String and JValue from Json4s, but you can write your own implicit conversions as necessary.
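To illustrate the idea, here is a minimal sketch of a converter pair for Int. The Reads and Writes traits are redefined locally so that the snippet is self-contained; their method names (read, write) are assumptions and should be checked against the HBaseRDD sources before writing instances of the real traits.

```scala
import java.nio.ByteBuffer

object ConverterSketch {
  // Local stand-ins for the traits described above (assumed shape).
  trait Reads[A] { def read(data: Array[Byte]): A }
  trait Writes[A] { def write(data: A): Array[Byte] }

  // Int stored as 4 bytes in big-endian order (the ByteBuffer default).
  implicit val intReads: Reads[Int] = new Reads[Int] {
    def read(data: Array[Byte]): Int = ByteBuffer.wrap(data).getInt
  }
  implicit val intWrites: Writes[Int] = new Writes[Int] {
    def write(data: Int): Array[Byte] =
      ByteBuffer.allocate(4).putInt(data).array()
  }

  // Generic helpers that use whichever converter is in implicit scope.
  def encode[A](a: A)(implicit w: Writes[A]): Array[Byte] = w.write(a)
  def decode[A](bytes: Array[Byte])(implicit r: Reads[A]): A = r.read(bytes)
}
```

With these instances in scope, a call such as decode[Int](encode(42)) round-trips the value through its byte representation, which is the same mechanism a job relies on when reading or writing cells of type A.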

Test file

You can generate the test file as you prefer. A quick way would be to open a Scala console and write

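import java.io.{File, PrintWriter}
import scala.util.Random

// Opens a PrintWriter on path, runs op on it, and always closes the writer.
def printToFile(path: String)(op: PrintWriter => Unit): Unit = {
  val p = new PrintWriter(new File(path))
  try { op(p) } finally { p.close() }
}

// A random printable string of length 10.
def nextString = (1 to 10).map(_ => Random.nextPrintableChar).mkString
// Four such strings separated by tabs: k, col1, col2, col3.
def nextLine = (1 to 4).map(_ => nextString).mkString("\t")

// args is not defined in a console session, so the output path is given
// explicitly here; replace "test-input" with the path you want to write to.
printToFile("test-input") { p =>
  for (_ <- 1 to 100000000) {
    p.println(nextLine)
  }
}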

hbase-rdd-examples's People

Contributors

andreaferretti, fralken, mfirry

hbase-rdd-examples's Issues

Datatype: Any

I am trying to write a key-value pair where value can be of Any datatype. It seems .toHbase method is not working for such a RDD, any work around?

Dependencies problem "out of the box"

Since I am working on Hortonworks, I decided I will try a "naive" way, removed all of the "provided" from the build.sbt, and executed "sbt assembly".

I got the error:

java.lang.RuntimeException: deduplicate: different file contents found in the following:
/home/cloudbreak/.ivy2/cache/commons-logging/commons-logging/jars/commons-logging-1.2.jar:META-INF/maven/commons-logging/commons-logging/pom.properties
/home/cloudbreak/.ivy2/cache/org.apache.htrace/htrace-core/jars/htrace-core-3.1.0-incubating.jar:META-INF/maven/commons-logging/commons-logging/pom.properties

Since it was just for a PoC, I wasn't able to dive into solving it.
Is there any simple solution for that, or should I depend on "provided" JARs ?
