cylondata / cylon

Cylon is a fast, scalable, distributed-memory, parallel runtime with a Pandas-like DataFrame.

Home Page: https://cylondata.org

License: Apache License 2.0

CMake 2.37% C++ 46.43% Java 0.95% Python 26.64% Makefile 0.10% Shell 0.94% Dockerfile 0.24% Cython 8.40% Cuda 0.06% C 0.02% Jupyter Notebook 13.86%
data join shuffle dataframe dataframes-api mpi deep-learning preprocessing

cylon's Introduction

Cylon

Cylon is a fast, scalable, distributed-memory, data-parallel library for processing structured data. It implements a set of relational operators to process data. While "Core Cylon" is implemented in system-level C/C++, multiple language interfaces (Python and Java) are provided to integrate seamlessly with existing applications, enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. By default, it uses MPI to distribute applications.

Internally, Cylon uses Apache Arrow to represent data in a columnar format.

The documentation can be found at https://cylondata.org

Email - [email protected]

Mailing List - Join

Getting Started

We can use Conda to install PyCylon. At the moment, Cylon only works on Linux systems. The Conda binaries need Ubuntu 16.04 or higher.

conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7
conda activate cylon-0.4.0

Now let's run our first Cylon application inside the Conda environment. The following code creates two DataFrames and joins them.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

df1 = DataFrame([[1, 2, 3], [2, 3, 4]])
df2 = DataFrame([[1, 1, 1], [2, 3, 4]])

# local merge
df3 = df1.merge(right=df2, on=[0, 1])
print("Local Merge")
print(df3)

Now let's run a parallel version of this program. If we create n processes (the parallelism), n instances of the program will run. Each instance loads two DataFrames into its own memory, and the instances perform a distributed join across those DataFrames. The results are also created in the parallel processes.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
import random

# distributed join
env = CylonEnv(config=MPIConfig())

df1 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2.set_index([0], inplace=True)
print("Distributed Join")
df3 = df1.join(other=df2, on=[0], env=env)
print(df3)

You can run the above program in the Conda environment with the following command. It uses the mpirun command with 2 parallel processes.

mpirun -np 2 python <name of your python file>
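
To see why the distributed join can find matches across processes, it helps to look at the value ranges each rank samples from in the program above. The following is a small sketch in plain Python that needs neither MPI nor PyCylon; `key_range` and `shared_keys` are illustrative helper names, not part of the PyCylon API.

```python
def key_range(rank):
    """Range that a process with the given rank samples from in the example."""
    return range(10 * rank, 15 * (rank + 1))


def shared_keys(rank_a, rank_b):
    """Values both ranks can draw, i.e. candidate cross-process join matches."""
    return sorted(set(key_range(rank_a)) & set(key_range(rank_b)))


for rank in range(2):
    r = key_range(rank)
    print(f"rank {rank}: values {r.start}..{r.stop - 1}")

print("shared between rank 0 and rank 1:", shared_keys(0, 1))
```

With two processes the ranges overlap, so rows held by different ranks can share a key. Whether a particular run produces matches still depends on the random samples.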

Compiling Cylon

Refer to the documentation on how to compile Cylon

Compiling on Linux

License

Cylon uses the Apache License, Version 2.0.

cylon's Issues

Using memory pool correctly

We should pass the memory pool to the methods that need it. It is better to create a TwisterX context and pass the pool along with that context to the methods requiring it.

Union divide by zero

If I run the union example with the following CSV as both arguments, it gives a divide-by-zero error:

1,2
3,4
2,2
3,4
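
For reference, the result one would expect from this input can be sketched in plain Python. This assumes that union has set semantics (each distinct row appears once), which the issue does not state. `read_rows` and `union` are illustrative helpers and do not reflect Cylon's implementation.

```python
import csv
import io

CSV_TEXT = "1,2\n3,4\n2,2\n3,4\n"


def read_rows(text):
    """Parse header-less CSV text into a list of integer tuples."""
    return [tuple(int(v) for v in row) for row in csv.reader(io.StringIO(text)) if row]


def union(left, right):
    """Set union of two row lists: each distinct row once, in first-seen order."""
    seen = set()
    result = []
    for row in left + right:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


rows = read_rows(CSV_TEXT)
print(union(rows, rows))
```

Note that the input contains the duplicate row `3,4`, which may be relevant when reproducing the error.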

Windows support

Hey folks,
I read your paper and wanted to check out the library. I'd use the library with the Java bindings and would need cross-platform Linux/Windows/macOS support.

Looking at OpenMPI, I found that it natively supports Linux and macOS but only supports Windows through Cygwin.
I then tried to compile Cylon under Cygwin according to the docs, which gave me the following error:

$ ./build.sh -pyenv ~/cylon/ENV/ -bpath ~/cylon/build/ --java
PYTHON ENV PATH    = /home/Robin/cylon/ENV/
BUILD PATH         = /home/Robin/cylon/build/
FLAG CPP BUILD     = ON
FLAG PYTHON BUILD  = OFF
FLAG BUILD ALL     = OFF
FLAG BUILD DEBUG   = OFF
FLAG BUILD RELEASE = OFF
FLAG RUN TEST      = OFF
FLAG STYLE CHECK   = OFF
-ipath|--install_path is NOT set default to cmake
=================================================================
Building CPP in Release mode
=================================================================
mkdir: cannot create directory '/home/Robin/cylon/build/': File exists
~/cylon/build ~/cylon/repo
Running on Release mode...
Cylon Python Build [UNREAD]
Python Executable Path /home/Robin/cylon/ENV/
-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
CMake Error at /usr/share/cmake-3.17.3/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND)

      Reason given by package: MPI component 'Fortran' was requested, but language Fortran is not enabled.

Call Stack (most recent call first):
  /usr/share/cmake-3.17.3/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.17.3/Modules/FindMPI.cmake:1717 (find_package_handle_standard_args)
  CMakeLists.txt:92 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/Robin/cylon/build/CMakeFiles/CMakeOutput.log".
See also "/home/Robin/cylon/build/CMakeFiles/CMakeError.log".

As I'm somewhat new to compiling C++: is this error related to OpenMPI, or am I missing something else?
I tried compiling OpenMPI from source as stated in the docs; however, that didn't work (at least with v4.0.1) because of some issues with win32 header files. I did, however, add the OpenMPI development package for Cygwin.

If I can provide more logs or anything please feel free to ask! Thanks for your time!

As a side note, your docs state "Note: The default build mode is debug." This is not the case: I am not specifying a build-mode flag, and the script prints "Building CPP in Release mode".

Read CSV with Infer Schema is not provided

Currently, we assume that the first line of a CSV file is the header.

This is not always the case. To handle CSV files without a header, we need to provide data loading based on an inferred schema.

Here we load the data into the Arrow table and add a schema afterwards.
This schema is inferred, so its column names do not carry any real meaning.

This will help the seamless conversion of TwisterX tables to Pandas.
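
A minimal sketch of what inferred-schema loading could look like, using only the Python standard library. The placeholder names (`col_0`, `col_1`, ...), the type labels, and the helpers `infer_type` and `infer_schema` are assumptions made for illustration; they are not Cylon or Arrow APIs.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of int, float, str that parses every value."""
    for caster, name in ((int, "int64"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Return [(placeholder_name, type_name), ...] for header-less CSV text."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    columns = list(zip(*rows))
    return [(f"col_{i}", infer_type(col)) for i, col in enumerate(columns)]


print(infer_schema("1,2.5,Watch\n3,4,Bag\n"))
```

The first line is treated as data rather than as a header, and the generated names only identify column positions.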

String support for Cylon table operations

Example

product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
cylon_merge = customer.merge(product) #inner join
print(product)
   Product_ID Product_name     Category    Price Seller_City
0         101        Watch      Fashion    299.0       Delhi
1         102          Bag      Fashion   1350.5      Mumbai
2         103        Shoes      Fashion   2999.0     Chennai
3         104   Smartphone  Electronics  14999.0     Kolkata
4         105        Books        Study    145.0       Delhi
5         106          Oil      Grocery    110.0     Chennai
6         107       Laptop  Electronics  79999.0   Bengalore

print(customer)
   id     name  age  Product_ID Purchased_Product       City
0   1   Olivia   20         101             Watch     Mumbai
1   2   Aditya   25           0                NA      Delhi
2   3     Cory   15         106               Oil  Bangalore
3   4  Isabell   10           0                NA    Chennai
4   5  Dominic   30         103             Shoes    Chennai
5   6    Tyler   65         104        Smartphone      Delhi
6   7   Samuel   35           0                NA    Kolkata
7   8   Daniel   18           0                NA      Delhi
8   9   Jeremy   23         107            Laptop     Mumbai

print(cylon_merge)
1,NA,20,101,NA,NA,101,NA,NA,299.000000,NA
3,NA,15,106,NA,NA,106,NA,NA,110.000000,NA
5,NA,30,103,NA,NA,103,NA,NA,2999.000000,NA
6,NA,65,104,NA,NA,104,NA,NA,14999.000000,NA
9,NA,23,107,NA,NA,107,NA,NA,79999.000000,NA

Strings are shown as NA. When I try to convert the 'cylon_merge' table to Arrow, I get this error:

arw_table = Table.to_arrow(cylon_table)
  File "pycylon/data/table.pyx", line 329, in pycylon.data.table.Table.to_arrow
  File "pyarrow/public-api.pxi", line 341, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 1 with type binary is inconsistent with schema string

Relational algebra upon schema

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})
pd.merge(product,customer,on='Product_ID')

We need to add this feature (a join based on a variable, i.e. a named column).
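
The requested behaviour, joining on a column identified by name rather than by position, can be sketched in plain Python as a hash join over lists of dicts. The data is a reduced subset of the tables above, and `inner_join` is a hypothetical helper, not a Cylon function.

```python
def inner_join(left, right, on):
    """Hash join of two lists of dicts on the column named `on`."""
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[on], []):
            merged = dict(lrow)
            # Copy the right-hand columns, skipping the shared key column.
            merged.update({k: v for k, v in rrow.items() if k != on})
            joined.append(merged)
    return joined


product = [
    {"Product_ID": 101, "Product_name": "Watch"},
    {"Product_ID": 102, "Product_name": "Bag"},
    {"Product_ID": 103, "Product_name": "Shoes"},
]
customer = [
    {"id": 1, "name": "Olivia", "Product_ID": 101},
    {"id": 2, "name": "Aditya", "Product_ID": 0},
    {"id": 5, "name": "Dominic", "Product_ID": 103},
]

for row in inner_join(product, customer, on="Product_ID"):
    print(row)
```

Only the rows whose `Product_ID` appears in both tables survive, and the key column appears once in the result.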

Clean the build script

  • Add flags for multiple build settings
    • Python flag, CPP flag
  • Unify the build to a single script

Integration Test

  • Check the applications with API changes
  • Check Python Imports for all developed packages

Preserving original column names after merge

Example

product=pd.DataFrame({
   'Product_ID':[101,102,103,104,105,106,107],
   'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
})


customer=pd.DataFrame({
   'id':[1,2,3,4,5,6,7,8,9],
   'age':[20,25,15,10,30,65,35,18,23],
   'Product_ID':[101,0,106,0,103,104,0,0,107],
})
modin_merge = customer.merge(product)
print(modin_merge)

     lt-0    lt-1    lt-2    rt-3    rt-4
0     1      20      101    101.0    299.0
1
2
3
4
5
6
7
8

Column names in the merged table should be the same as those of the original tables.
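
One possible naming rule is sketched below: keep every original name and add a suffix only where both tables use the same name. The `_x`/`_y` suffixes follow the pandas default and are an assumption here, and `merged_column_names` is a hypothetical helper rather than anything Cylon provides.

```python
def merged_column_names(left_cols, right_cols, suffixes=("_x", "_y")):
    """Keep original names; add a suffix only where both sides share a name."""
    clashes = set(left_cols) & set(right_cols)
    left_out = [c + suffixes[0] if c in clashes else c for c in left_cols]
    right_out = [c + suffixes[1] if c in clashes else c for c in right_cols]
    return left_out + right_out


print(merged_column_names(["id", "age", "Product_ID"], ["Product_ID", "Price"]))
```

This keeps five output columns, matching the width of the table shown above, while replacing the positional `lt-`/`rt-` labels with names derived from the inputs.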

Test Environment Issues

  1. CMake 3.x.x must be in the build environment (the build script should check this, and warnings must be added).
  2. Check the lib path depending on the build environment
export LD_LIBRARY_PATH=$(pwd)/cpp/build/arrow/install/lib:$(pwd)/cpp/build/lib:$LD_LIBRARY_PATH 

instead

export LD_LIBRARY_PATH=$(pwd)/cpp/build/arrow/install/lib64:$(pwd)/cpp/build/lib:$LD_LIBRARY_PATH

Verify this and customize the build.

I found this issue when building on Future Systems (RHEL7 with Python 3.6.8; the default CMake is 2.x, and 3.x is available through the cmake3 command).
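
One way to avoid hard-coding `lib` or `lib64` is to check which directory actually exists before building the path. The sketch below assumes the directory layout shown in the two export lines above. `existing_lib_dirs` and `ld_library_path` are hypothetical helpers and are not part of the Cylon build script.

```python
import os
from pathlib import Path


def existing_lib_dirs(install_root):
    """Return whichever of install_root/lib and install_root/lib64 exist."""
    root = Path(install_root)
    return [root / name for name in ("lib", "lib64") if (root / name).is_dir()]


def ld_library_path(repo_root, current=""):
    """Build an LD_LIBRARY_PATH value from the directories that are present."""
    repo = Path(repo_root)
    dirs = existing_lib_dirs(repo / "cpp" / "build" / "arrow" / "install")
    dirs.append(repo / "cpp" / "build" / "lib")
    parts = [str(d) for d in dirs]
    if current:
        parts.append(current)
    return os.pathsep.join(parts)


print(ld_library_path(Path.cwd(), os.environ.get("LD_LIBRARY_PATH", "")))
```

Run from the repository root, this prints a value that could be exported as `LD_LIBRARY_PATH`. It includes the Arrow `lib` and/or `lib64` directory only if that directory is present.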
