scalding-orc

This project provides read and write support for ORC file format in Scalding.

Basics

Define a case class whose schema matches that of your source or sink. Member names should match the column names in the schema, and member types should correspond to the column types. Nested schemas, Arrays/Lists, and Maps are supported.

// Import implicit macro conversions
import io.applicative.scalding.orc.MacroImplicits._

// Define your record as a case class
case class ReadSample(boolean1: Boolean, byte1: Byte, short1: Short, int1: Int, long1: Long)

// Read:
val myPipe = TypedPipe.from(TypedOrc[ReadSample]("/path/to/file.orc"))
// Write (outputPath holds the destination path):
myPipe.write(TypedOrc[ReadSample](outputPath))
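
The nested, list, and map support mentioned above could be used with a record shaped like the one below. This is only a sketch: the `Address` and `Customer` types and all of their field names are hypothetical, and the README does not list exactly which element types the macro accepts, so treat the field types as assumptions to verify against your own schema.

```scala
// Hypothetical record types; names and fields are illustrative only.
case class Address(street: String, city: String)

case class Customer(
  id: Long,
  name: String,
  address: Address,                // nested schema
  phones: List[String],            // list column
  attributes: Map[String, String]  // map column
)

object NestedSample {
  // One in-memory record, just to show the shape of the data.
  val sample: Customer = Customer(
    id = 1L,
    name = "Ada",
    address = Address("1 Main St", "Springfield"),
    phones = List("555-0100", "555-0101"),
    attributes = Map("tier" -> "gold")
  )

  def main(args: Array[String]): Unit =
    println(s"${sample.name} lives in ${sample.address.city}")
}
```

Following the pattern of the basic example, such a record would presumably be read with TypedOrc[Customer]("/path/to/file.orc"), provided the file's column names match the member names.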

Column pruning

To eliminate unneeded columns, define only the relevant fields in your case class. Make sure the member names match the column names. The ORC reader will skip the unneeded columns, improving IO performance.
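
As a rough sketch of column pruning, the job below reads only two of the five columns from the ReadSample file used in the Basics section. The `PrunedSample` and `PrunedReadJob` names and the `input`/`output` arguments are hypothetical, and the package of `TypedOrc` is assumed to be the same as that of `MacroImplicits`.

```scala
import com.twitter.scalding._
import io.applicative.scalding.orc.MacroImplicits._
import io.applicative.scalding.orc.TypedOrc // assumed package, not confirmed by the README

// Declares only int1 and long1; boolean1, byte1 and short1 are left out on purpose.
case class PrunedSample(int1: Int, long1: Long)

// Hypothetical job: --input and --output are illustrative argument names.
class PrunedReadJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedOrc[PrunedSample](args("input")))
    .write(TypedOrc[PrunedSample](args("output")))
}
```

The member names still have to match the column names in the file exactly, as with the full case class.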

Predicate pushdown

Predicate pushdown passes a search argument to the ORC reader as a hint, allowing it to skip rows that cannot match the predicate.

// Build a search argument: columnname == "value"
val fp = org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory.newBuilder
  .startAnd.equals("columnname", "value").end.build()
// Pass it as the second argument to TypedOrc
val myPipe = TypedPipe.from(TypedOrc[ReadSample]("/path/to/file.orc", fp))
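
Because the search argument is only a hint, ORC readers typically skip data in blocks based on column statistics rather than row by row, so rows that do not match may still come through. A cautious sketch therefore repeats the predicate as an ordinary filter. The `Event` type, its columns, the job name, and the `input`/`output` arguments are hypothetical; the package of `TypedOrc` is assumed; and the builder calls are copied from the snippet above, so their signatures may differ across Hive versions.

```scala
import com.twitter.scalding._
import io.applicative.scalding.orc.MacroImplicits._
import io.applicative.scalding.orc.TypedOrc // assumed package, not confirmed by the README
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory

// Hypothetical record with a string column to filter on.
case class Event(id: Long, country: String)

class PushdownJob(args: Args) extends Job(args) {
  // Hint for the reader: blocks that cannot contain country == "US" may be skipped.
  val sarg = SearchArgumentFactory.newBuilder
    .startAnd.equals("country", "US").end.build()

  TypedPipe.from(TypedOrc[Event](args("input"), sarg))
    // Exact filter: drops any non-matching rows the reader still returned.
    .filter(_.country == "US")
    .write(TypedOrc[Event](args("output")))
}
```

The filter is cheap when pushdown works well, since most non-matching data never reaches it.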

Common issues

Failed to generate proper converter/setter

This occurs when the macro could not generate a converter or setter for your case class. Check the member types of the case class, compile with the "-Xlog-implicits" flag, and look for 'materializeCaseClassTupleSetter' and 'materializeCaseClassTupleConverter' in the compiler output. If you still can't spot the error, file an issue and include your case class definition.
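
One way to pass the flag, assuming the project is built with sbt (the build tool is an assumption here; other build tools have their own way of setting scalac options):

```scala
// build.sbt
// Logs why implicit candidates were rejected during compilation.
scalacOptions += "-Xlog-implicits"
```

The extra output can be long, so it is usually worth removing the flag again once the failing member type has been found.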

readTypeInfo [...] does not match actualTypeInfo [...]

The schema of the file doesn't match the schema described by your case class. Double-check the column names and types.

Contributors

applicative-io, danosipov
