xiashuijun / flowman

This project forked from dimajix/flowman

Flowman is a Spark-based data build tool. By using high-level flow specifications in YAML files, Flowman simplifies the development of data pipelines.

Home Page: https://flowman.io

License: Apache License 2.0

Shell 0.27% Scala 97.72% Java 0.38% Dockerfile 0.03% HTML 0.29% JavaScript 0.28% Vue 0.96% Batchfile 0.07%

Flowman

Flowman.io

Flowman is a Spark-based ETL program that simplifies writing data transformations. Users write so-called specifications in purely declarative YAML files instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that many technical details of a correct and robust implementation are encapsulated, so the user can concentrate on the data transformations themselves.

In addition to writing and executing data transformations, Flowman can also manage physical data models such as Hive tables. Flowman can create such tables from a specification with the correct schema. This keeps all aspects (like transformations and schema information) in a single place, managed by a single program.

Notable Features

  • Declarative syntax in YAML files
  • Data model management (create and destroy Hive tables or file-based storage)
  • Flexible expression language
  • Jobs for managing build targets (like copying files or uploading data via SFTP)
  • Powerful yet simple command line tool
  • Extensible via plugins

Documentation

You can find the official homepage at Flowman.io and comprehensive documentation at Read the Docs.

Installation

You can either grab an appropriate pre-built package at https://github.com/dimajix/flowman/releases or build your own version via Maven with:

mvn clean install

Please also read BUILDING.md for detailed instructions, specifically on build profiles.

Installing the Packed Distribution

The packed distribution file is called flowman-{version}-bin.tar.gz and can be extracted at any location using:

tar xvzf flowman-{version}-bin.tar.gz
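The extraction step can be wrapped in a small helper. The following is only a sketch: the function names flowman_archive_name and flowman_install are hypothetical and not part of Flowman, and the version and target directory are placeholders you supply yourself.

```shell
#!/bin/sh
# Hypothetical helpers; names and arguments are illustrative, not part of Flowman.

# Print the archive name for a given version, following the
# flowman-{version}-bin.tar.gz naming pattern.
flowman_archive_name() {
    printf 'flowman-%s-bin.tar.gz\n' "$1"
}

# Extract the archive for version $1 (looked up in the current directory)
# into directory $2, which is created if it does not exist yet.
flowman_install() {
    archive=$(flowman_archive_name "$1")
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$2" && tar xzf "$archive" -C "$2"
}
```

Calling flowman_install with a version and a target directory of your choice extracts the distribution there and fails with an error message if the archive is missing from the current directory.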

Command Line Utils

The primary tool provided by Flowman is called flowexec and is located in the bin folder of the installation directory.

General Usage

The flowexec tool has several subcommands for working with objects and projects. The general pattern is as follows:

flowexec [generic options] <cmd> <subcommand> [specific options and arguments]

To work with flowexec, either your current working directory must contain a Flowman project with a project.yml file, or you must specify the path to a valid project via:

flowexec -f /path/to/project/folder <cmd>
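The project.yml requirement can be checked before flowexec is invoked. The wrapper below is a minimal sketch: the function name run_flowexec and its error message are hypothetical. It relies only on the -f option shown above and passes all remaining arguments through unchanged.

```shell
#!/bin/sh
# Hypothetical wrapper; the function name and message are illustrative.

# Run flowexec against project directory $1 (default: current directory),
# but only if that directory contains a project.yml file.
# Any further arguments are passed to flowexec as-is.
run_flowexec() {
    project_dir=${1:-.}
    [ $# -gt 0 ] && shift
    if [ ! -f "$project_dir/project.yml" ]; then
        echo "no project.yml in $project_dir" >&2
        return 1
    fi
    flowexec -f "$project_dir" "$@"
}
```

A call such as run_flowexec /path/to/project/folder followed by the desired command and subcommand then fails early with a clear message when the folder is not a Flowman project.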

Interactive Shell

Version 0.14.0 of Flowman introduced an interactive shell for executing data flows. The shell can be started via:

flowshell -f <project>

Within the shell, you can interactively build targets and inspect intermediate mappings.

Contributors

kupferk, dependabot[bot], jackusz
