
datasketches-pig's Introduction

DataSketches Java UDF/UDAF Adaptors for Apache Pig

Please visit the main DataSketches website for more information.

If you are interested in making contributions to this site please see our Community page for how to contact us.

Hadoop Pig UDFs

See relevant sections under the different sketch types in Java Core Documentation.

Build Instructions

NOTE: This component accesses resource files for testing. As a result, every directory element in the full absolute path of the target installation directory must qualify as a Java identifier. In other words, no path element may contain space characters or any other non-Java-identifier characters. The Oracle Java Specification requires this to ensure location-independent access to resources. See Oracle's "Location-Independent Access to Resources".
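
As a rough illustration of the rule above, the following sketch checks whether each element of a directory path consists only of Java identifier characters. The class and method names (`InstallPathCheck`, `isJavaIdentifier`, `allElementsAreIdentifiers`) are hypothetical and not part of this project; the check relies only on the standard `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart` methods and does not exclude reserved words.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of datasketches-pig.
final class InstallPathCheck {

  /** Returns true if the name is non-empty and made only of Java identifier characters. */
  static boolean isJavaIdentifier(String name) {
    if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) {
      return false;
    }
    for (int i = 1; i < name.length(); i++) {
      if (!Character.isJavaIdentifierPart(name.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  /** Checks every name element of the path; Path iteration skips the root, if any. */
  static boolean allElementsAreIdentifiers(Path path) {
    for (Path element : path) {
      if (!isJavaIdentifier(element.toString())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Defaults to the current directory when no argument is given.
    Path dir = Paths.get(args.length > 0 ? args[0] : ".").toAbsolutePath().normalize();
    System.out.println(dir + " -> " + allElementsAreIdentifiers(dir));
  }
}
```

This is a strict reading of the note: besides spaces, it also rejects hyphens, dots, and names that start with a digit.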

JDK 8 is required to compile

This DataSketches component is pure Java and you must compile using JDK 8.

Recommended Build Tool

This DataSketches component is structured as a Maven project, and Maven is the recommended build tool.

There are two types of tests: normal unit tests and tests run by the strict profile.

To run normal unit tests:

$ mvn clean test

To run the strict profile tests:

$ mvn clean test -P strict

To install jars built from the downloaded source:

$ mvn clean install -DskipTests=true

This will create the following jars:

  • datasketches-pig-X.Y.Z.jar: The compiled main class files.
  • datasketches-pig-X.Y.Z-tests.jar: The compiled test class files.
  • datasketches-pig-X.Y.Z-sources.jar: The main source files.
  • datasketches-pig-X.Y.Z-test-sources.jar: The test source files.
  • datasketches-pig-X.Y.Z-javadoc.jar: The compressed Javadocs.

Dependencies

Run-time

This component has the following top-level run-time dependencies:

  • org.apache.datasketches : datasketches-java
  • org.apache.pig : pig
  • org.apache.hadoop : hadoop-common
  • org.apache.commons : commons-math3

Testing

See the pom.xml file for test dependencies.

datasketches-pig's People

Contributors

alexandersaydakov, dependabot[bot], jmalkin, joshwalters, leerho

datasketches-pig's Issues

ArrayOfDoubles P-Value Ratio Metric

The ArrayOfDoublesSketchesToPValueEstimates UDF generates p-values for given metrics. However, we also need p-values for ratio metrics. For example, the click-through rate (CTR) is computed as clicks / impressions, where clicks and impressions are each a metric. We need to compute the p-value for that ratio and be able to express the computation as a UDF.

I propose a UDF that takes the following parameters:

RATIO_PVALUE(sketch_A, sketch_B, numerator_index, denominator_index)

and returns a map:

{'pvalue': X, 'delta': Y}

The math required for ratio p-value computation is different from regular p-value computation.
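
The issue does not say which method it has in mind. As one possible illustration only, the sketch below assumes the delta method for the variance of a ratio of means, followed by a two-sided z-test on the difference of the two ratios. Everything here is an assumption: the class `RatioPValueSketch`, its method names, and the use of plain `double[]` arrays of per-key values (for example, values already extracted from each sketch) in place of the sketch arguments proposed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not an existing datasketches-pig UDF.
final class RatioPValueSketch {

  /** Returns {ratio, variance of ratio} for one group, using the delta method. */
  static double[] ratioAndVariance(double[] num, double[] den) {
    int n = num.length;
    double meanNum = 0.0;
    double meanDen = 0.0;
    for (int i = 0; i < n; i++) {
      meanNum += num[i];
      meanDen += den[i];
    }
    meanNum /= n;
    meanDen /= n;
    double varNum = 0.0;
    double varDen = 0.0;
    double cov = 0.0;
    for (int i = 0; i < n; i++) {
      double dn = num[i] - meanNum;
      double dd = den[i] - meanDen;
      varNum += dn * dn;
      varDen += dd * dd;
      cov += dn * dd;
    }
    // Unbiased sample variances and covariance.
    varNum /= (n - 1);
    varDen /= (n - 1);
    cov /= (n - 1);
    double ratio = meanNum / meanDen;
    // Delta method: Var(R) ~ (varNum - 2 R cov + R^2 varDen) / (n * meanDen^2).
    double variance = (varNum - 2.0 * ratio * cov + ratio * ratio * varDen)
        / (n * meanDen * meanDen);
    return new double[] {ratio, variance};
  }

  /** erfc(x) for x >= 0, Abramowitz and Stegun 7.1.26 (absolute error about 1.5e-7). */
  static double erfc(double x) {
    double t = 1.0 / (1.0 + 0.3275911 * x);
    double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
        + t * (-1.453152027 + t * 1.061405429))));
    return poly * Math.exp(-x * x);
  }

  /** Two-sided z-test on ratio_A - ratio_B; returns a map with keys "pvalue" and "delta". */
  static Map<String, Double> ratioPValue(
      double[] numA, double[] denA, double[] numB, double[] denB) {
    double[] a = ratioAndVariance(numA, denA);
    double[] b = ratioAndVariance(numB, denB);
    double delta = a[0] - b[0];
    double z = delta / Math.sqrt(a[1] + b[1]);
    Map<String, Double> result = new LinkedHashMap<>();
    // For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    result.put("pvalue", erfc(Math.abs(z) / Math.sqrt(2.0)));
    result.put("delta", delta);
    return result;
  }
}
```

The sketch assumes each group has at least two observations and a nonzero combined variance. A real UDF would also need to account for the sampling performed by the sketches themselves.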

EvalFunc Return Type

I am curious about the decision to have the return type of all of the *ToSketch UDFs be Tuple as opposed to DataByteArray.

When using these UDFs, it seems more natural for them to return the DataByteArray directly rather than wrapping it in a Tuple. I couldn't find any instance where these functions needed to return more than the single binary representation of the sketch.

Wrapping the result in a Tuple requires the user to call FLATTEN before being able to store the intermediate result or to pass the result into one of the UDFs to extract the estimate.

Shading of sketches-core

We need a separate sketches-pig-X.Y.Z-incCore.jar that includes the shaded core library. The normal sketches-pig-X.Y.Z.jar should be unshaded.
