Let's Learn Bodo through Examples!

Welcome to the Bodo Examples Repo! This is where you can find examples to help you get started using Bodo.

Bodo is a next-generation big-data processing engine that automatically brings supercomputing-style performance and scalability to native Python and SQL code. Bodo has several advantages over other big-data transformation systems that make it one of the most performant and cost-effective solutions for large-scale data analytics, particularly ETL and ELT.

This repository teaches you to use Bodo effectively through examples. If you know SQL and Python, you already know how to use Bodo, and you don't need any new language or API. You will just import bodo and learn some programming tricks to improve your existing applications to save $$$ on compute resources while delivering value in a much shorter time-frame. Benchmarks have shown that Bodo can be orders of magnitude faster than its competitors like Spark.

How to run these examples?

We recommend that you run these examples on the Bodo Platform. You can sign up to our platform to try it out. Some examples, such as modules 01 to 03, can run on small clusters, e.g., 2 nodes of c5.2xlarge with a total of 8 physical cores (16 vCPUs) and 32 GB of RAM; other examples need larger clusters. The description provided with each example indicates the cluster size required to run it.
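The cluster totals quoted above can be reproduced with a little arithmetic. The sketch below assumes the published c5.2xlarge figures (8 vCPUs and 16 GiB of memory per instance) and two hardware threads per physical core; the `InstanceType` class and `cluster_totals` helper are hypothetical names used only for illustration and are not part of Bodo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """Per-node resources of a cloud instance type (illustrative only)."""
    name: str
    vcpus: int
    memory_gib: int
    threads_per_core: int = 2  # assumption: two hardware threads per core


def cluster_totals(instance, nodes):
    """Sum physical cores, vCPUs, and memory over `nodes` identical nodes."""
    cores_per_node = instance.vcpus // instance.threads_per_core
    return {
        "physical_cores": cores_per_node * nodes,
        "vcpus": instance.vcpus * nodes,
        "memory_gib": instance.memory_gib * nodes,
    }


# Assumed per-instance figures for c5.2xlarge: 8 vCPUs, 16 GiB.
C5_2XLARGE = InstanceType("c5.2xlarge", vcpus=8, memory_gib=16)

print(cluster_totals(C5_2XLARGE, nodes=2))
# {'physical_cores': 8, 'vcpus': 16, 'memory_gib': 32}
```

The same helper can be used to check whether a larger cluster described in an example's notes matches the instance type you plan to use.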

You can also run these examples locally by installing bodo on your laptop. However, we recommend using the Bodo Platform for the best experience as it provides a notebook environment with all the code available and required packages already installed for you.

What if I want to test my code with my own data?

If you want to run your application code with your own data, please refer to the instructions here on how to set up the identity and access management, policies, and credentials needed to integrate your cloud provider with the Bodo Platform. This allows Bodo to spin up EC2 instances, create a cluster, and let you access your data within your VPC. Everything, including your data, stays in your VPC.

Modules outline

Modules 01 to 03 focus on compute-heavy data transformations through ETL applications. Module 01 has examples with operational databases such as PostgreSQL, Oracle, and MySQL; module 02 uses a data warehouse (Snowflake); and module 03 is a data lakehouse example with Iceberg.

Modules 04 and 05 contain larger-scale examples covering machine learning and business use cases (financial, transportation, etc.). Finally, module 06 contains a performance comparison of Bodo and Spark on a set of queries derived from the TPC-H benchmark suite.

This is an open-source repository, so please consider adding your Bodo examples to it! You can contribute by creating a feature branch and submitting a pull request for us to review.

Installation/runtime Issues with jupyterlab 3.X

Issue

Hi, I'm having issues installing and using Bodo in JupyterLab 3.X.

I've tried installing bodo via conda, but that doesn't work.
I've also tried the sample here: https://github.com/Bodo-inc/Bodo-examples/blob/master/docker/BodoNotebook.Dockerfile but conda is still unable to resolve the bodo package.

[+] Building 15.8s (5/5) FINISHED
 => [internal] load build definition from dockerfile                                                                                                                                        0.0s
 => => transferring dockerfile: 207B                                                                                                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                                                           0.0s
 => => transferring context: 2B                                                                                                                                                             0.0s
 => [internal] load metadata for docker.io/jupyter/minimal-notebook:latest                                                                                                                  1.1s
 => CACHED [1/2] FROM docker.io/jupyter/minimal-notebook:latest@sha256:cfeab9b91dfce03d9be1683f9d4728860c30757c593f9b9277da4ce7d1a4e7f3                                                     0.0s
 => ERROR [2/2] RUN conda install -y bodo ipyparallel -c bodo.ai -c conda-forge                                                                                                            14.7s
------
 > [2/2] RUN conda install -y bodo ipyparallel -c bodo.ai -c conda-forge:
#5 0.684 Collecting package metadata (current_repodata.json): ...working... done
#5 5.036 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#5 5.037 Collecting package metadata (repodata.json): ...working... done
#5 13.57 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#5 13.58
#5 13.58 PackagesNotFoundError: The following packages are not available from current channels:
#5 13.58
#5 13.58   - bodo
#5 13.58
#5 13.58 Current channels:
#5 13.58
#5 13.58   - https://conda.anaconda.org/bodo.ai/linux-aarch64
#5 13.58   - https://conda.anaconda.org/bodo.ai/noarch
#5 13.58   - https://conda.anaconda.org/conda-forge/linux-aarch64
#5 13.58   - https://conda.anaconda.org/conda-forge/noarch
#5 13.58
#5 13.58 To search for alternate channels that may provide the conda package you're
#5 13.58 looking for, navigate to
#5 13.58
#5 13.58     https://anaconda.org
#5 13.58
#5 13.58 and use the search bar at the top of the page.
#5 13.58
#5 13.58
------
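
The channel URLs in the log above end in linux-aarch64, so conda was resolving packages for 64-bit ARM Linux. One possible explanation, not confirmed anywhere in this thread, is that the channel had no bodo build for that platform. The sketch below shows one way to check which conda platform subdirectory a machine or container would use. The `conda_subdir` helper is a hypothetical name and a simplified mapping written for illustration; it is not conda's own detection logic.

```python
import platform

# Simplified mapping from (OS, CPU) to conda platform subdirectory names.
# Only common 64-bit platforms are covered; anything else returns None.
_SUBDIRS = {
    ("linux", "x86_64"): "linux-64",
    ("linux", "aarch64"): "linux-aarch64",
    ("darwin", "x86_64"): "osx-64",
    ("darwin", "arm64"): "osx-arm64",
    ("windows", "amd64"): "win-64",
}


def conda_subdir(system=None, machine=None):
    """Return the conda subdir for an OS/CPU pair, defaulting to this host."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return _SUBDIRS.get((system, machine))


print("This interpreter runs on:", conda_subdir())
```

If this prints linux-aarch64 inside the container, it is worth checking whether the channel publishes the package for that subdirectory before debugging anything else.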

I then installed ipyparallel and bodo via pip and set up JupyterLab as per the instructions: https://docs.bodo.ai/latest/source/installation_and_setup/ipyparallel.html#ipyparallelsetup

However, after running the sample code, importing the bodo package fails (see the output of the last notebook cell).

# input
import ipyparallel as ipp
import psutil; n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

# result
Using existing profile dir: '/opt/app-root/src/.ipython/profile_default'
Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100% 8/8 [00:12<00:00, 12.02s/engine]

# input
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
print(f"Hello World from rank {comm.Get_rank()}. total ranks={comm.Get_size()}")

# result
[stdout:0] Hello World from rank 0. total ranks=8
[stdout:1] Hello World from rank 1. total ranks=8
[stdout:7] Hello World from rank 7. total ranks=8
[stdout:4] Hello World from rank 4. total ranks=8
[stdout:6] Hello World from rank 6. total ranks=8
[stdout:2] Hello World from rank 2. total ranks=8
[stdout:5] Hello World from rank 5. total ranks=8
[stdout:3] Hello World from rank 3. total ranks=8

# input
import bodo

# result
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-7b53d0178f92> in <module>
----> 1 import bodo

/opt/conda/lib/python3.8/site-packages/bodo/__init__.py in <module>
     20 from numba.core.types import List
     21 import bodo.libs
---> 22 import bodo.libs.distributed_api
     23 import bodo.libs.timsort
     24 import bodo.io

/opt/conda/lib/python3.8/site-packages/bodo/libs/distributed_api.py in <module>
     20 from numba.parfors.array_analysis import ArrayAnalysis
     21 import bodo
---> 22 from bodo.hiframes.datetime_date_ext import datetime_date_array_type
     23 from bodo.hiframes.datetime_timedelta_ext import datetime_timedelta_array_type
     24 from bodo.hiframes.pd_categorical_ext import CategoricalArrayType

/opt/conda/lib/python3.8/site-packages/bodo/hiframes/datetime_date_ext.py in <module>
     17 import bodo
     18 from bodo.hiframes.datetime_datetime_ext import DatetimeDatetimeType
---> 19 from bodo.hiframes.datetime_timedelta_ext import datetime_timedelta_type
     20 from bodo.libs import hdatetime_ext
     21 from bodo.utils.indexing import array_getitem_bool_index, array_getitem_int_index, array_getitem_slice_index, array_setitem_bool_index, array_setitem_int_index, array_setitem_slice_index

/opt/conda/lib/python3.8/site-packages/bodo/hiframes/datetime_timedelta_ext.py in <module>
     15 import bodo
     16 from bodo.hiframes.datetime_datetime_ext import datetime_datetime_type
---> 17 from bodo.libs import hdatetime_ext
     18 from bodo.utils.indexing import get_new_null_mask_bool_index, get_new_null_mask_int_index, get_new_null_mask_slice_index, setitem_slice_index_null_bits
     19 from bodo.utils.typing import BodoError, get_overload_const_str, is_iterable_type, is_list_like_index_type, is_overload_constant_str

ImportError: libmpi-badaf374.so.12.1.8: cannot open shared object file: No such file or directory
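
The final error says the dynamic loader could not open a libmpi shared object. A first diagnostic step is to see whether any file with that name prefix exists in the Python environment at all. The sketch below is a minimal, standard-library-only search; `find_shared_objects` is a hypothetical helper written for this note, and it only reports where matching files are, without explaining why the loader cannot find them.

```python
import sysconfig
from pathlib import Path


def find_shared_objects(name_prefix, roots):
    """Return sorted paths under `roots` whose file name starts with `name_prefix`."""
    matches = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # skip roots that do not exist on this machine
        for path in root.rglob(name_prefix + "*"):
            if path.is_file():
                matches.append(path)
    return sorted(matches)


def site_packages_dir():
    """Directory where pip installs pure-Python packages for this interpreter."""
    return sysconfig.get_paths()["purelib"]


# Example usage (run in the same environment as the failing notebook):
#   for hit in find_shared_objects("libmpi", [site_packages_dir()]):
#       print(hit)
```

If the search returns nothing, the library was never installed alongside the package; if it returns a file with a different name or version suffix than the one in the error, the installed package and its bundled libraries may be out of sync.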

Packages

(2022-01-27 20:29:29) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i bodo
bodo 2021.10.1

(2022-01-27 20:30:37) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i ipyparallel
ipyparallel 8.1.0

(2022-01-27 20:30:44) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i jupyterlab
jupyterlab 3.2.8
