TopNotch

What Is TopNotch?

TopNotch is a system for quality-controlling large-scale data sets. It addresses the following three problems:

  1. How to define and measure data quality
  2. How to efficiently ensure data quality across many data sets
  3. How to institutionalize existing knowledge of data sets

TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Together, the rules provide a complete definition of, and metrics for, quality in a data set. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.

Getting Started

Requirements

  1. The java command and the JAVA_HOME environment variable pointing to Java 8
  2. Spark 2.0.2
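
As a rough illustration, the Java requirement above could be checked before running anything. This is a minimal sketch, not part of TopNotch: the function name `missing_requirements` is hypothetical, and the check only confirms that `JAVA_HOME` is set and that a `java` command can be found. It does not verify that the installation is actually Java 8 or that Spark 2.0.2 is installed.

```python
import os
import shutil


def missing_requirements(env=None):
    """Return a list of problems found; an empty list means the basic checks passed.

    `env` is a mapping of environment variables (defaults to os.environ).
    This helper is hypothetical and is not shipped with TopNotch.
    """
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME should point to a Java 8 installation (the version is not checked here).
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set (it should point to Java 8)")

    # The java command must be discoverable on the PATH taken from `env`.
    if shutil.which("java", path=env.get("PATH", "")) is None:
        problems.append("the java command was not found on PATH")

    return problems


# Example usage: print any problems found in the current environment.
for problem in missing_requirements():
    print("Requirement not met:", problem)
```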

Quick Start Steps

  1. Clone this repo.
  2. Get the latest JAR, TopNotch-assembly-0.2.1.jar, either by building this project (see docs/DEVELOPMENT.md for guidance) or by downloading it from the releases section of TopNotch's GitHub page. Place it in this project's top-level bin folder.
  3. Create the configuration files to test your data set.
    1. See the example folder for a sample data set and configuration files.
  4. Run bin/TopNotchRunner.sh with the plan file passed in as an argument.
    1. To try the example, run bin/TopNotchRunner.sh --planPath example/plan.json.
    2. Note that you must set the SPARK_HOME variable, either in the script or as an external environment variable.
    3. Note that if you have configured your Spark installation to use an existing HDFS system, you must make an example folder in your home folder on HDFS and upload example/exampleAssertionInput.parquet to that folder.
  5. View the resulting report and parquet file in the topnotch folder in your home directory on HDFS.
    1. To view the results of the example, look at the JSON file topnotch/plan.json and the Parquet file example/exampleAssertionOutput.parquet. Note that if you have configured your Spark installation to use an existing HDFS system, the JSON and Parquet files will appear in the topnotch and example folders in your home directory on HDFS.

Please note that you must change bin/TopNotchRunner.sh in order to run TopNotch with a master other than local. It is currently recommended that you run TopNotch in local or client mode.
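
For readers who prefer to drive the run step from a script, the invocation described above could be wrapped as follows. This is only a sketch under stated assumptions: the helper names `build_command`, `build_env`, and `run_topnotch` are hypothetical, the script is assumed to be run from the repository root, and the only command-line option used is `--planPath`, as shown in the quick start. If SPARK_HOME is already set inside bin/TopNotchRunner.sh, the `spark_home` argument can be left out.

```python
import os
import subprocess


def build_command(plan_path, runner="bin/TopNotchRunner.sh"):
    """Build the argument list for the runner script, relative to the repo root."""
    return [runner, "--planPath", plan_path]


def build_env(spark_home=None, base_env=None):
    """Copy the environment and optionally set SPARK_HOME for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    if spark_home is not None:
        env["SPARK_HOME"] = spark_home
    return env


def run_topnotch(plan_path, spark_home=None):
    """Run the TopNotch runner script; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_command(plan_path),
        env=build_env(spark_home),
        check=True,
    )


# Example usage (the Spark path is a placeholder, not a real default):
# run_topnotch("example/plan.json", spark_home="/path/to/spark")
```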

What To Read Next

The docs folder contains the documentation. What documentation you should read depends on whether you want to use, deploy, or further develop TopNotch:

  1. CONCEPTS.md
    1. Target Audience: All
    2. Content: An overview of the parts of TopNotch and what they should be used for.
  2. USER_GUIDE.md
    1. Target Audience: Users
    2. Content: A guide for how to write the TopNotch JSON input and the specific options available for each feature.
  3. DEVELOPMENT.md
    1. Target Audience: Developers
    2. Content: A guide on how to set up TopNotch on your local computer for development and how to run the unit tests.
  4. CLUSTER_INSTALL.md
    1. Target Audience: Developers/DevOps/ProdOps
    2. Content: A guide on how to install TopNotch on your cluster.

Copyright © 2017 BlackRock, Inc. All Rights Reserved.
