Git Product home page Git Product logo

chevrons's Introduction

chevrons >>

Rapidly build pipelines for out-of-core data processing using higher order functions.

Table of Contents

#What is this?

chevrons is a collection of tools for building memory-efficient pipelines on iterators by composing functions. It introduces a clean syntax for passing data between functions which are composed. Data passing through each part of the pipeline is evaluated lazily, making it easy to work with very large (or infinite) streams of data in settings with limited resources.

chevrons is perfect for situations where you have a lot of data that you want to process out-of-core, but not enough to warrant a "big data" solution. It lets you build pipeline components that can be composed, reordered, and reused easily. The name of the package comes from the syntax of these pipelines, which looks like this:

output = input_data | function_1 >> function_2 >> function_3

This syntax makes pipelines readable and easy to understand. Similarly, subsections of the pipeline can be named and reused:

preprocess = function_1 >> function_2
output1 = input_data1 | preprocess >> function_3
output2 = input_data2 | preprocess >> function_4

The objects passed between functions are all generators/iterators, so output1 and output2 will be turned into generators which can be evaluated one element at a time (for example, written to a file or used to train a machine learning model).

You can construct your own pieces of the pipeline by writing a subclass of one of the classes in pipeline_base, or perform Map, Fold, and Reduce by importing those blocks from pipeline_hof.

Please note that at the moment, chevrons only supports Python 3 and later.

#Higher order Functions

Currently, the Map, Fold and Reduce functions are included as pipeline components. For a quick introduction to these functions and how they work, check out this page.

Each one takes a helper function which is user defined. For example, take the following (contrived) pipeline, which sums up the square roots of all inputs which are multiples of three:

from chevrons.pipeline_hof import Map, Filter, Fold
import math

def is_multiple_of_three(x):
    return x % 3 == 0
    
def add(x1,x2):
    return x1+x2
    
data = range(0,10)

sum_sqrt_threes = data | Filter(is_multiple_of_three) >> Map(math.sqrt) >> Fold(add)

#Parallelism

Note: Parallel processing is still in beta - while these classes are stable enough for most use cases, please report any bugs that you find.

One advantage of structuring computation as higher-order functions is that they are extremely easy to parallelize effectively. It's easy to make pipeline elements parallel using chevrons! The pipeline_parallel module defines pipeline pieces for running things in parallel. Note that all consecutive parallel elements should begin with a BeginParallel block and end with an EndParallel block or a FoldParallel block. The BeginParallel will take a batch size, which will be used throughout the pipeline. We can rewrite the above pipeline example to utilize parallelization very easily:

from chevrons.pipeline_parallel import MapParallel, FilterParallel, FoldParallel, BeginParallel, EndParallel

data = range(0,10)

data | BeginParallel(5) >> FilterParallel(is_multiple_of_three) >> MapParallel(math.sqrt) >> FoldParallel(add)

For pipelines where the individual functions are very computationally intensive, this can lead to very significant speedups when the correct batch size is chosen.

#Implementing Your Own Pipeline Pieces

#Example: Training a Scikit learn model on synthetic data

import random
from chevrons.pipeline_base import Zip
from chevrons.pipeline_hof import Map
from chevrons.pipe line_extra import TrainScikitModel
from sklearn.linear_model import LinearRegression

def add_noise(data_point):
    return random.gauss(data_point, 1)

def calculate_true_value(X):
    true_slope = 0.4
    true_intercept = 1
    return true_slope*X[0] + true_intercept

X = [[i] for i in range(1000)]
model = LinearRegression()

generate_synthetic_output = Map(calculate_true_value) >> Map(add_noise)
y = X | generate_synthetic_output
(X,y) | Zip() >> TrainScikitModel(model, batch_size=100)
print(model.coef_)

#Contact

I'd love to hear any feedback you have, or ideas for features you think would be interesting! Send me an email at [email protected] if you'd like to let me know of your happiness/displeasure/exuberance/vexation.

chevrons's People

Contributors

lmc2179 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.