Git Product home page Git Product logo

schema-guru's Introduction

Schema Guru

Build Status Release License

Schema Guru is a tool (CLI, Spark job and web) allowing you to derive [JSON Schemas] [json-schema] from a set of JSON instances process and transform it into different data definition formats.

Current primary features include:

  • Derivation of JSON Schema from set of JSON instances (schema command)

Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.

Schema Guru is used heavily in association with Snowplow's own [Snowplow] [snowplow], [Iglu] [iglu].

User Quickstart

Download the latest Schema Guru from Bintray:

$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.6.2.zip
$ unzip schema_guru_0.6.2.zip
$ mv schema-guru-0.6.2 schema-guru

Assuming you have a recent JVM installed.

CLI

Schema derivation

You can use as input either single JSON file or directory with JSON instances (it will be processed recursively).

Following command will print JSON Schema to stdout:

$ ./schema-guru schema {{input}}

Also you can specify output file for your schema:

$ ./schema-guru schema --output {{json_schema_file}} {{input}}

You can also switch Schema Guru into [NDJSON] [ndjson] mode, where it will look for newline delimited JSON files:

$ ./schema-guru schema --ndjson {{input}}

You can specify the enum cardinality tolerance for your fields. It means that all fields which are found to have less than the specified cardinality will be specified in the JSON Schema using the enum property.

$ ./schema-guru schema --enum 5 {{input}}

If you know that some particular set of values can appear, but don't want to set big enum cardinality, you may want to specify predefined enum set with --enum-sets multi-option, like this:

$ ./schema-guru schema --enum-sets iso_4217 --enum-sets iso_3166-1_aplha-3 /path/to/instances

Currently Schema Guru includes following built-in enum sets (written as they should appear in CLI):

If you need to include very specific enum set, you can define it by yourself in JSON file with array like this:

["Mozilla Firefox", "Google Chrome", "Netscape Navigator", "Internet Explorer"]

And pass path to this file instead of enum name:

$ ./schema-guru schema --enum-sets all --enum-sets /path/to/browsers.json /path/to/instances

Schema Guru will derive minLength and maxLength properties for strings based on shortest and longest strings. But this may be a problem if you process small amount of instances. To avoid this too strict Schema, you can use --no-length option.

$ ./schema-guru schema --no-length /path/to/few-instances

Web UI

To run it locally:

$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.6.2.zip
$ unzip schema_guru_webui_0.6.2.zip
$ ./schema-guru-webui-0.6.2

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. Interface and port can be specified by --interface and --port respectively.

Apache Spark

Since version 0.4.0 Schema Guru ships with a Spark job for deriving JSON Schemas. To help users getting started with Schema Guru on Amazon Elastic MapReduce we provide [pyinvoke] [pyinvoke] tasks.py.

The recommended way to start is install all requirements and assemble a fat JAR as described in the developer quickstart.

Before run you need:

  • An AWS CLI profile, e.g. my-profile
  • A EC2 key pair, e.g. my-ec2-keypair
  • At least one Amazon S3 bucket, e.g. my-bucket

To provision the cluster and start the job you need to use run_emr task:

$ cd sparkjob
$ inv run_emr my-profile my-bucket/input/ my-bucket/output/ my-bucket/errors/ my-bucket/logs my-ec2-keypair

If you need some specific options for Spark job, you can specify these in tasks.py. The Spark job accepts the same options as the CLI application, but note that --output isn't optional and we have a new optional --errors-path. Also, instead of specifying some of predefined enum sets you can just enable it with --enum-sets flag, so it has the same behaviour as --enum-sets all.

Developer Quickstart

Assuming git, [pyinvoke]: http://www.pyinvoke.org/

schema-guru's People

Contributors

chuwy avatar michaelmior avatar alexanderdean avatar jbeemster avatar camshaft avatar

Stargazers

Odd Möller avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.