chevrons `>>`

Rapidly build pipelines for out-of-core data processing using higher order functions.

What is this?
Higher order Functions
Parallelism
Implementing Your Own Pipeline Pieces
Example: Training a Scikit learn model on synthetic data

#What is this?

chevrons is a collection of tools for building memory-efficient pipelines on iterators by composing functions. It introduces a clean syntax for passing data between functions which are composed. Data passing through each part of the pipeline is evaluated lazily, making it easy to work with very large (or infinite) streams of data in settings with limited resources.

chevrons is perfect for situations where you have a lot of data that you want to process out-of-core, but not enough to warrant a "big data" solution. It lets you build pipeline components that can be composed, reordered, and reused easily. The name of the package comes from the syntax of these pipelines, which looks like this:

output = input_data | function_1 >> function_2 >> function_3

This syntax makes pipelines readable and easy to understand. Similarly, subsections of the pipeline can be named and reused:

preprocess = function_1 >> function_2
output1 = input_data1 | preprocess >> function_3
output2 = input_data2 | preprocess >> function_4

The objects passed between functions are all generators/iterators, so output1 and output2 will be turned into generators which can be evaluated one element at a time (for example, written to a file or used to train a machine learning model).

You can construct your own pieces of the pipeline by writing a subclass of one of the classes in pipeline_base, or perform Map, Fold, and Reduce by importing those blocks from pipeline_hof.

Please note that at the moment, chevrons only supports Python 3 and later.

#Higher order Functions

Currently, the Map, Fold and Reduce functions are included as pipeline components. For a quick introduction to these functions and how they work, check out this page.

Each one takes a helper function which is user defined. For example, take the following (contrived) pipeline, which sums up the square roots of all inputs which are multiples of three:

from chevrons.pipeline_hof import Map, Filter, Fold
import math

def is_multiple_of_three(x):
    return x % 3 == 0
    
def add(x1,x2):
    return x1+x2
    
data = range(0,10)

sum_sqrt_threes = data | Filter(is_multiple_of_three) >> Map(math.sqrt) >> Fold(add)

#Parallelism

Note: Parallel processing is still in beta - while these classes are stable enough for most use cases, please report any bugs that you find.

One advantage of structuring computation as higher-order functions is that they are extremely easy to parallelize effectively. It's easy to make pipeline elements parallel using chevrons! The pipeline_parallel module defines pipeline pieces for running things in parallel. Note that all consecutive parallel elements should begin with a BeginParallel block and end with an EndParallel block or a FoldParallel block. The BeginParallel will take a batch size, which will be used throughout the pipeline. We can rewrite the above pipeline example to utilize parallelization very easily:

from chevrons.pipeline_parallel import MapParallel, FilterParallel, FoldParallel, BeginParallel, EndParallel

data = range(0,10)

data | BeginParallel(5) >> FilterParallel(is_multiple_of_three) >> MapParallel(math.sqrt) >> FoldParallel(add)

For pipelines where the individual functions are very computationally intensive, this can lead to very significant speedups when the correct batch size is chosen.

#Implementing Your Own Pipeline Pieces

#Example: Training a Scikit learn model on synthetic data

import random
from chevrons.pipeline_base import Zip
from chevrons.pipeline_hof import Map
from chevrons.pipe line_extra import TrainScikitModel
from sklearn.linear_model import LinearRegression

def add_noise(data_point):
    return random.gauss(data_point, 1)

def calculate_true_value(X):
    true_slope = 0.4
    true_intercept = 1
    return true_slope*X[0] + true_intercept

X = [[i] for i in range(1000)]
model = LinearRegression()

generate_synthetic_output = Map(calculate_true_value) >> Map(add_noise)
y = X | generate_synthetic_output
(X,y) | Zip() >> TrainScikitModel(model, batch_size=100)
print(model.coef_)

#Contact

I'd love to hear any feedback you have, or ideas for features you think would be interesting! Send me an email at [email protected] if you'd like to let me know of your happiness/displeasure/exuberance/vexation.

mfkiwl / chevrons Goto Github PK

chevrons's Introduction

chevrons `>>`

Table of Contents

chevrons's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

mfkiwl / chevrons Goto Github PK

chevrons's Introduction

chevrons >>

Table of Contents

chevrons's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org

chevrons `>>`