cylondata / cylon

Cylon is a fast, scalable, distributed memory, parallel runtime with a Pandas like DataFrame.

License: Apache License 2.0

CMake 2.37% C++ 46.43% Java 0.95% Python 26.64% Makefile 0.10% Shell 0.94% Dockerfile 0.24% Cython 8.40% Cuda 0.06% C 0.02% Jupyter Notebook 13.86%

data join shuffle dataframe dataframes-api mpi deep-learning preprocessing

cylon's Introduction

Cylon

Cylon is a fast, scalable, distributed-memory, data-parallel library for processing structured data. Cylon implements a set of relational operators to process data. While "Core Cylon" is implemented in system-level C/C++, multiple language interfaces (Python and Java) are provided to integrate seamlessly with existing applications, enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. By default, it uses MPI to distribute applications.

Internally Cylon uses Apache Arrow to represent the data in a column format.

The documentation can be found at https://cylondata.org

Email - [email protected]

Mailing List - Join

Getting Started

We can use Conda to install PyCylon. At the moment, Cylon only works on Linux systems. The Conda binaries need Ubuntu 16.04 or higher.

conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7
conda activate cylon-0.4.0

Now let's run our first Cylon application inside the Conda environment. The following code creates two DataFrames and joins them.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

df1 = DataFrame([[1, 2, 3], [2, 3, 4]])
df2 = DataFrame([[1, 1, 1], [2, 3, 4]])

# local merge
df3 = df1.merge(right=df2, on=[0, 1])
print("Local Merge")
print(df3)

Now let's run a parallel version of this program. If we create n processes (the parallelism), n instances of the program will run. Each instance loads two DataFrames into its own memory and takes part in a distributed join across the DataFrames. The results are also created in the parallel processes.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
import random

# distributed join
env = CylonEnv(config=MPIConfig())

df1 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2.set_index([0], inplace=True)
print("Distributed Join")
df3 = df1.join(other=df2, on=[0], env=env)
print(df3)

You can run the above program in the Conda environment with the following command. It uses the mpirun command with 2 parallel processes.

mpirun -np 2 python <name of your python file>
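
To make the idea of a distributed join more concrete, the sketch below simulates one in a single process using plain Python. It assumes hash partitioning on the join key followed by a local join on each worker, which is one common strategy and is not taken from Cylon's source. The helpers `partition`, `local_join`, and `simulated_distributed_join` are hypothetical names, and no MPI is involved.

```python
def partition(rows, key_index, world_size):
    """Split rows into world_size buckets by hashing the join key."""
    buckets = [[] for _ in range(world_size)]
    for row in rows:
        buckets[hash(row[key_index]) % world_size].append(row)
    return buckets


def local_join(left, right, key_index=0):
    """Inner hash join of two lists of tuples held by one worker."""
    index = {}
    for r in right:
        index.setdefault(r[key_index], []).append(r)
    return [l + r for l in left for r in index.get(l[key_index], [])]


def simulated_distributed_join(left_parts, right_parts, key_index=0):
    """left_parts[w] and right_parts[w] are the rows initially held by worker w."""
    world_size = len(left_parts)
    left_recv = [[] for _ in range(world_size)]
    right_recv = [[] for _ in range(world_size)]
    # "Shuffle": every worker sends bucket w of its data to worker w.
    for part in left_parts:
        for w, bucket in enumerate(partition(part, key_index, world_size)):
            left_recv[w].extend(bucket)
    for part in right_parts:
        for w, bucket in enumerate(partition(part, key_index, world_size)):
            right_recv[w].extend(bucket)
    # Rows with equal keys now live on the same worker, so a local join suffices.
    return [local_join(left_recv[w], right_recv[w], key_index)
            for w in range(world_size)]


left_parts = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
right_parts = [[(4, "w"), (1, "x")], [(2, "y"), (5, "z")]]
per_worker = simulated_distributed_join(left_parts, right_parts)
print(per_worker)
```

Each entry of `per_worker` holds the part of the result that stays on that worker, which mirrors the statement above that the results are created in the parallel processes.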

Compiling Cylon

Refer to the documentation on how to compile Cylon

Compiling on Linux

License

Cylon uses the Apache License, Version 2.0.

cylon's Issues

Check the memory free

Change table API to free input tables when user doesn't need them after

Define all the types we support as TwisterX types

[python] add a requirements.txt file for the setup

for requirements such as numpy

add python3-dev as a pre-req under installation

Add Python Logging

Pip packaging for manylinux distributions

Reference: https://python-packaging-tutorial.readthedocs.io/en/latest/binaries_dependencies.html

Using memory pool correctly

We should pass the memory pool to methods. It is better to create a TwisterX context and pass the pool along with it to the methods that require it.
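
As a rough illustration of the context idea, here is a minimal Python sketch. The real code base is C++, so this only shows the shape of the design. `MemoryPool`, `Context`, and `make_column` are hypothetical names invented for the example and do not correspond to actual TwisterX or Cylon classes.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryPool:
    """Toy pool that only tracks how many bytes it has handed out."""
    bytes_allocated: int = 0

    def allocate(self, size):
        self.bytes_allocated += size
        return bytearray(size)


@dataclass
class Context:
    """Carries shared resources, so methods take one context argument."""
    pool: MemoryPool = field(default_factory=MemoryPool)


def make_column(ctx, num_values, value_width=8):
    """Allocate a column buffer from the pool owned by the context."""
    return ctx.pool.allocate(num_values * value_width)


ctx = Context()
left = make_column(ctx, 4)
right = make_column(ctx, 2)
print(ctx.pool.bytes_allocated)
```

Because every allocation goes through the same context, the pool sees all of them, and further shared state could later be added to the context without changing method signatures.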

Minor file refactors

Better if this can be moved into python folder

https://github.com/cylondata/cylon/blob/master/requirements.txt

check for null values in index columns (join, group by)

Fix make install command

Read cell values

Hash on multiple columns

glog library when installing C++ with the new build

/usr/bin/ld: cannot find -lglog

Support child data in arrow_all_to_all

Add docs in Twister2

Fixing the readme file

Union divide by zero

If I run the union example with the following CSV as both arguments, it gives a divide-by-zero error:

1,2
3,4
2,2
3,4
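
For reference, the result one would expect here, assuming the union has set semantics (duplicate rows removed), is simply the three distinct rows of the file. The pure-Python sketch below computes that expected result. It is not Cylon's implementation, and `read_rows` and `union_rows` are hypothetical helpers.

```python
import csv
import io

CSV_TEXT = "1,2\n3,4\n2,2\n3,4\n"


def read_rows(text):
    """Parse CSV text into a list of integer tuples."""
    return [tuple(int(v) for v in row)
            for row in csv.reader(io.StringIO(text)) if row]


def union_rows(left, right):
    """Return the distinct rows of left and right in first-seen order."""
    seen = set()
    out = []
    for row in list(left) + list(right):
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


rows = read_rows(CSV_TEXT)
result = union_rows(rows, rows)
print(result)
```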

Add cpplint to style check cpp

https://github.com/cpplint/cpplint

add integration tests to cmake

add multi index support for operations

Add comms APIs to Java

Disk based operations

This task is to track the progress of disk based operations

Integrate Travis

Integration Test

Fixed size binary type support

Add dates and biginteger types

Union(axis = 1) horizontal stacking

Concatenating horizontally (on axis = 1) using the index elements should be supported.
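
A minimal sketch of what index-aligned horizontal stacking could look like, in plain Python. It assumes outer alignment on the index with missing cells filled by `None`, similar to the default behaviour of `pandas.concat(..., axis=1)`. The function `concat_horizontal` and the dict-based table layout are hypothetical and only serve the illustration.

```python
def concat_horizontal(left, right, left_width, right_width):
    """Align two index -> row mappings on their keys and stack columns side by side."""
    out = {}
    for key in sorted(set(left) | set(right)):
        # Rows missing on one side are padded with None.
        l = left.get(key, (None,) * left_width)
        r = right.get(key, (None,) * right_width)
        out[key] = l + r
    return out


left = {0: ("a", 1), 1: ("b", 2)}
right = {1: (10.0,), 2: (20.0,)}
stacked = concat_horizontal(left, right, left_width=2, right_width=1)
for index, row in stacked.items():
    print(index, row)
```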

Windows support

Hey folks,
I read your paper and wanted to check out the library. I'd use the library with the Java bindings and would need cross-platform Linux/Windows/macOS support.

Looking at OpenMPI, I found that it natively supports Linux and macOS but only supports Windows through Cygwin.
I then tried to compile Cylon under Cygwin according to the docs, which gave me the following error:

$ ./build.sh -pyenv ~/cylon/ENV/ -bpath ~/cylon/build/ --java
PYTHON ENV PATH    = /home/Robin/cylon/ENV/
BUILD PATH         = /home/Robin/cylon/build/
FLAG CPP BUILD     = ON
FLAG PYTHON BUILD  = OFF
FLAG BUILD ALL     = OFF
FLAG BUILD DEBUG   = OFF
FLAG BUILD RELEASE = OFF
FLAG RUN TEST      = OFF
FLAG STYLE CHECK   = OFF
-ipath|--install_path is NOT set default to cmake
=================================================================
Building CPP in Release mode
=================================================================
mkdir: cannot create directory '/home/Robin/cylon/build/': File exists
~/cylon/build ~/cylon/repo
Running on Release mode...
Cylon Python Build [UNREAD]
Python Executable Path /home/Robin/cylon/ENV/
-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
CMake Error at /usr/share/cmake-3.17.3/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND)

      Reason given by package: MPI component 'Fortran' was requested, but language Fortran is not enabled.

Call Stack (most recent call first):
  /usr/share/cmake-3.17.3/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.17.3/Modules/FindMPI.cmake:1717 (find_package_handle_standard_args)
  CMakeLists.txt:92 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/Robin/cylon/build/CMakeFiles/CMakeOutput.log".
See also "/home/Robin/cylon/build/CMakeFiles/CMakeError.log".

As I'm somewhat new to compiling C++: is this error related to OpenMPI, or am I missing something else?
I tried compiling OpenMPI from source as stated in the docs, but that didn't work (at least with v4.0.1) because of some issues with Win32 header files. I did, however, add the OpenMPI development package for Cygwin.

If I can provide more logs or anything please feel free to ask! Thanks for your time!

As a side note, your docs state "Note: The default build mode is debug." This is not the case: I am not specifying a build-mode flag, and the script says "Building CPP in Release mode".

Explore compression for data transfer

Explore K-Sorted merge

Add Sphinx docs for Python

Python and C++ Objects wrapping and unwrapping for Cython

UCX integration

This task is to track the UCX integration progress

Read CSV with Infer Schema is not provided

Currently, we assume that the first line of a CSV file is the header.

That is not always true. To handle files without a header row, we need to provide data loading based on an inferred schema.

Here we load the data into the Arrow table and add a schema afterwards.
This schema is inferred, so its column names carry no real meaning.

This will help the seamless conversion of TwisterX tables to Pandas.
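
The idea can be sketched with the Python standard library alone. The example below reads CSV text that has no header row, generates placeholder column names, and infers a type per column. The names `infer_type` and `read_headerless_csv`, the `col_<n>` naming scheme, and the int/float/str type ladder are assumptions made for illustration, not the behaviour of any existing reader.

```python
import csv
import io


def infer_type(values):
    """Pick the first of int, float, str that parses every value in the column."""
    for caster in (int, float):
        try:
            for v in values:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def read_headerless_csv(text):
    """Read CSV text with no header row; return (schema, column data)."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    schema = []
    data = []
    for i, col in enumerate(zip(*rows)):
        typ = infer_type(col)
        # Placeholder name: the file gives us no real column names.
        schema.append((f"col_{i}", typ.__name__))
        data.append([typ(v) for v in col])
    return schema, data


schema, data = read_headerless_csv("1,2.5,x\n3,4,y\n")
print(schema)
print(data)
```

Treating the first line as data rather than as a header keeps every row, at the cost of column names that are only positional.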

product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
cylon_merge = customer.merge(product) #inner join

print(product)
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Delhi
1         102          Bag      Fashion   1350.5      Mumbai
2         103        Shoes      Fashion   2999.0     Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata
4         105        Books        Study    145.0       Delhi
5         106          Oil      Grocery    110.0     Chennai
6         107       Laptop  Electronics  79999.0   Bengalore

print(customer)
   id     name  age  Product_ID Purchased_Product       City
0   1   Olivia   20         101             Watch     Mumbai
1   2   Aditya   25           0                NA      Delhi
2   3     Cory   15         106               Oil  Bangalore
3   4  Isabell   10           0                NA    Chennai
4   5  Dominic   30         103             Shoes    Chennai
5   6    Tyler   65         104        Smartphone      Delhi
6   7   Samuel   35           0                NA    Kolkata
7   8   Daniel   18           0                NA      Delhi
8   9   Jeremy   23         107            Laptop     Mumbai

print(cylon_merge)
1,NA,20,101,NA,NA,101,NA,NA,299.000000,NA
3,NA,15,106,NA,NA,106,NA,NA,110.000000,NA
5,NA,30,103,NA,NA,103,NA,NA,2999.000000,NA
6,NA,65,104,NA,NA,104,NA,NA,14999.000000,NA
9,NA,23,107,NA,NA,107,NA,NA,79999.000000,NA

Strings are shown as NA. When I try to convert the 'cylon_merge' table to Arrow, I get this error:

arw_table = Table.to_arrow(cylon_table)
  File "pycylon/data/table.pyx", line 329, in pycylon.data.table.Table.to_arrow
  File "pyarrow/public-api.pxi", line 341, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 1 with type binary is inconsistent with schema string

Add write options to csv

Relational algebra upon schema

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})
pd.merge(product,customer,on='Product_ID')

Need to add this feature (join based on a named variable, as in the on='Product_ID' argument above).
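
A minimal sketch of a join keyed on a column name rather than a column position, written in plain Python. It uses a simple in-memory hash join over lists of dicts. `inner_join` is a hypothetical helper, not part of pandas or Cylon, and the two small tables are cut-down versions of the ones above.

```python
def inner_join(left_rows, right_rows, on):
    """Inner hash join of two lists of dict rows on the column named `on`."""
    index = {}
    for row in right_rows:
        index.setdefault(row[on], []).append(row)
    joined = []
    for lrow in left_rows:
        for rrow in index.get(lrow[on], []):
            merged = dict(lrow)
            # The key column appears once in the output.
            merged.update({k: v for k, v in rrow.items() if k != on})
            joined.append(merged)
    return joined


product = [
    {"Product_ID": 101, "Price": 299.0},
    {"Product_ID": 102, "Price": 1350.5},
]
customer = [
    {"Product_ID": 101, "id": 1},
    {"Product_ID": 0, "id": 2},
]
result = inner_join(product, customer, on="Product_ID")
print(result)
```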

Clean the build script

Add flags for multiple build settings (Python flag, CPP flag)
Unify the build into a single script

Integration Test

Check the applications with API changes
Check Python Imports for all developed packages

product=pd.DataFrame({
   'Product_ID':[101,102,103,104,105,106,107],
   'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
})


customer=pd.DataFrame({
   'id':[1,2,3,4,5,6,7,8,9],
   'age':[20,25,15,10,30,65,35,18,23],
   'Product_ID':[101,0,106,0,103,104,0,0,107],
})
modin_merge = customer.merge(product)
print(modin_merge)

   lt-0  lt-1  lt-2   rt-3   rt-4
0     1    20   101  101.0  299.0
1
2
3
4
5
6
7
8

Column names in the merged table should be the same as those of the original tables, rather than generated names such as lt-0 or rt-3.
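
One possible naming rule, sketched in plain Python: keep every original name, emit the join key once, and add a suffix only when a non-key name appears on both sides. The `_x`/`_y` suffixes follow the pandas default convention. `merged_column_names` is a hypothetical helper and not an existing Cylon function.

```python
def merged_column_names(left, right, on, suffixes=("_x", "_y")):
    """Name the output columns of a join while keeping the original names."""
    overlap = (set(left) & set(right)) - {on}
    names = [c + suffixes[0] if c in overlap else c for c in left]
    names += [c + suffixes[1] if c in overlap else c
              for c in right if c != on]
    return names


names = merged_column_names(
    ["id", "age", "Product_ID"],
    ["Product_ID", "Price"],
    on="Product_ID",
)
print(names)
```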

Test Environment Issues

CMake 3.x.x must be available in the build environment (the build script should check for it, and a warning must be added).
Check the lib path depending on the build environment

export LD_LIBRARY_PATH=$(pwd)/cpp/build/arrow/install/lib:$(pwd)/cpp/build/lib:$LD_LIBRARY_PATH

instead

export LD_LIBRARY_PATH=$(pwd)/cpp/build/arrow/install/lib64:$(pwd)/cpp/build/lib:$LD_LIBRARY_PATH

Verify this and customize build.

Found this issue when building on Future Systems (RHEL 7 with Python 3.6.8; the default CMake is 2.x, and 3.x is available through the cmake3 command).
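
One way a build helper could adapt to either layout is to probe for the directory instead of hard-coding it. The sketch below is a hedged illustration in Python: `arrow_lib_dir` and `build_ld_library_path` are hypothetical helpers, and the rule "prefer lib64 when it exists, otherwise use lib" is an assumption rather than something the build script is known to do. The directory layout is copied from the export lines above.

```python
import os
import tempfile


def arrow_lib_dir(install_root):
    """Return install_root/lib64 if that directory exists, else install_root/lib."""
    lib64 = os.path.join(install_root, "lib64")
    return lib64 if os.path.isdir(lib64) else os.path.join(install_root, "lib")


def build_ld_library_path(repo_root, current=""):
    """Compose an LD_LIBRARY_PATH value for a checkout rooted at repo_root."""
    parts = [
        arrow_lib_dir(os.path.join(repo_root, "cpp", "build", "arrow", "install")),
        os.path.join(repo_root, "cpp", "build", "lib"),
    ]
    if current:
        parts.append(current)
    return os.pathsep.join(parts)


# Demonstration on a throwaway directory tree that only contains lib64.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cpp", "build", "arrow", "install", "lib64"))
    demo_path = build_ld_library_path(root)
    print(demo_path)
```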