
embulk-filter-expand_json's Introduction

Expand JSON filter plugin for Embulk


Expands a column containing JSON into multiple columns.

Overview

  • Plugin type: filter

Configuration

  • json_column_name: name of the column containing the JSON to expand (string, required)
  • root: root property from which each entry is fetched, specified in JsonPath style (string, default: "$.")
  • expanded_columns: columns to expand the JSON into (array of hash, required)
    • name: name of the column. You can specify it in JsonPath style.
    • type: type of the column (see below)
    • format: format of the timestamp if type is timestamp
    • timezone: time zone of this timestamp column if values don’t include a time zone description (UTC by default)
  • keep_expanding_json_column: if true, the JSON column being expanded is not removed from the input schema (false by default)
  • default_timezone: time zone of timestamp columns if values don’t include a time zone description (UTC by default)
  • stop_on_invalid_record: stop the bulk load transaction if an invalid record is included (false by default)
  • cache_provider: cache provider name for JsonPath. "LRU" and "NOOP" are built in. You can also specify a user-defined class. (string, default: "LRU")
    • "NOOP" will become the default in the future.

Type of the column

| name      | description                              |
|-----------|------------------------------------------|
| boolean   | true or false                            |
| long      | 64-bit signed integers                   |
| timestamp | Date and time with nanosecond precision  |
| double    | 64-bit floating point numbers            |
| string    | Strings                                  |

Example

filters:
  - type: expand_json
    json_column_name: json_payload
    root: "$."
    expanded_columns:
      - {name: "phone_numbers", type: string}
      - {name: "app_id", type: long}
      - {name: "point", type: double}
      - {name: "created_at", type: timestamp, format: "%Y-%m-%d", timezone: "UTC"}
      - {name: "profile.anniversary.et", type: string}
      - {name: "profile.anniversary.voluptatem", type: string}
      - {name: "profile.like_words[1]", type: string}
      - {name: "profile.like_words[2]", type: string}
      - {name: "profile.like_words[0]", type: string}
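To illustrate what the format and timezone options on the created_at column mean, here is a minimal Python sketch. It is not the plugin's code: Python's strptime stands in for Embulk's timestamp parser, whose format directives may differ in detail, and the helper name parse_timestamp and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(text, fmt="%Y-%m-%d", tz=timezone.utc):
    """Parse text with fmt; apply tz only when the value carries no zone.

    Hypothetical helper: mirrors the idea of `format` plus `timezone`,
    not the plugin's actual parser.
    """
    parsed = datetime.strptime(text, fmt)
    if parsed.tzinfo is None:
        # The value has no time zone description, so fall back to tz.
        parsed = parsed.replace(tzinfo=tz)
    return parsed


created_at = parse_timestamp("2015-10-07")  # sample value, not from the project
print(created_at.isoformat())  # 2015-10-07T00:00:00+00:00
```

A value that already includes a time zone (parsed with a format containing %z) keeps its own zone, which matches the "if values don’t include time zone description" wording of the option.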

Note

  • If the value that the JsonPath evaluates to is an Array or a Hash, the value is written as JSON.
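The behavior in the note can be sketched in Python as follows. This is a simplified stand-in, not the plugin's implementation: the plugin evaluates real JsonPath expressions in Java, while this sketch only handles dotted names with [n] indexes, skips type conversion, and assumes that a missing path yields None. The function names and the sample payload are hypothetical.

```python
import json
import re

# One path step: either a property name or an [index] part.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(doc, path):
    """Follow a path such as 'profile.like_words[1]' through parsed JSON.

    Returns None when any step is missing (an assumption of this sketch).
    """
    current = doc
    for key, index in _TOKEN.findall(path):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return current


def expand(json_text, column_names):
    """Build one output row: one entry per expanded column name."""
    doc = json.loads(json_text)
    row = {}
    for name in column_names:
        value = resolve(doc, name)
        if isinstance(value, (dict, list)):
            # Arrays and hashes are written back out as JSON text.
            value = json.dumps(value)
        row[name] = value
    return row


# Hypothetical sample payload for illustration only.
payload = json.dumps({
    "app_id": 7,
    "phone_numbers": ["555-0100", "555-0101"],
    "profile": {"like_words": ["maiores", "eum", "aut"]},
})
row = expand(payload, ["app_id", "phone_numbers", "profile.like_words[1]", "profile.missing"])
print(row)
```

Here phone_numbers resolves to an array, so the row holds its JSON text, while profile.like_words[1] resolves to a plain string and is stored unchanged.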

Dependencies

Development

Run Example

$ ./gradlew gem
$ embulk run -Ibuild/gemContents/lib ./example/config.yml

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Benchmark for cache_provider option

In some cases, cache_provider: NOOP makes this plugin about 3 times faster (embulk#41), so we benchmarked the cache_provider option. In our case, cache_provider: NOOP made the plugin about 1.5 times faster than cache_provider: LRU.

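| use expand_json filter | cache_provider | Time taken | records/s   |
|------------------------|----------------|------------|-------------|
| false                  | none           | 7.62s      | 1,325,459/s |
| true                   | "LRU"          | 2m9s       | 78,025/s    |
| true                   | "NOOP"         | 1m25s      | 118,476/s   |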
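The 1.5 times figure can be checked directly against the numbers in the table. The short Python sketch below only redoes that arithmetic; the variable names are ours, not part of the benchmark scripts.

```python
# Throughput figures copied from the benchmark table (records per second).
throughput = {"LRU": 78_025, "NOOP": 118_476}
# Elapsed times from the same table, converted to seconds (2m9s and 1m25s).
elapsed = {"LRU": 2 * 60 + 9, "NOOP": 1 * 60 + 25}

speedup_by_throughput = throughput["NOOP"] / throughput["LRU"]
speedup_by_time = elapsed["LRU"] / elapsed["NOOP"]

print(f"{speedup_by_throughput:.2f}x by records/s, {speedup_by_time:.2f}x by elapsed time")
# 1.52x by records/s, 1.52x by elapsed time
```

Both ratios agree, so the speedup of NOOP over LRU in this benchmark is roughly 1.5 times whichever column is used.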

You can reproduce the benchmark as follows.

./gradlew gem
./bench/run.sh

For Maintainers

Release

Modify the version in build.gradle on a detached commit, and then tag the commit with an annotation.

git checkout --detach master

(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)

git add build.gradle

git commit -m "Release vX.Y.Z"

git tag -a vX.Y.Z

(Edit: Write a tag annotation in the changelog format.)

See Keep a Changelog for the changelog format. We adopt part of it for Git tag annotations, as shown below.

## [X.Y.Z] - YYYY-MM-DD

### Added
- Added a feature.

### Changed
- Changed something.

### Fixed
- Fixed a bug.

Then push the annotated tag. This triggers a release operation on GitHub Actions after approval.

git push -u origin vX.Y.Z

Contributor

  • @Civitaspo
  • @muga
  • @sakama
