bodo-examples's Introduction

Let's Learn Bodo through Examples!

Welcome to the Bodo Examples Repo! This is where you can find examples to help you get started using Bodo.

Bodo is a next-generation big-data processing engine that automatically brings supercomputing-style performance and scalability to native Python and SQL code. Bodo has several advantages over other big data transformation systems that make it one of the most performant and cost-effective solutions for large-scale data analytics, particularly ETL and ELT.

This repository teaches you to use Bodo effectively through examples. If you know SQL and Python, you already know how to use Bodo; you don't need any new language or API. You just import bodo and learn a few programming techniques that improve your existing applications, saving $$$ on compute resources while delivering value in a much shorter time frame. Benchmarks have shown that Bodo can be orders of magnitude faster than competitors such as Spark.
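As a minimal sketch of what "just import bodo" can look like, the function below is ordinary Python decorated with bodo.jit, the decorator mentioned in this repository's issues. The function name `sum_of_squares` and the no-op fallback used when Bodo is not installed are illustrative assumptions, not code taken from the examples in this repo.

```python
# Fall back to a no-op decorator so the sketch still runs where
# Bodo is not installed (an assumption made for illustration only).
try:
    import bodo
    jit = bodo.jit
except ImportError:
    def jit(func):
        return func


@jit
def sum_of_squares(n):
    # Plain Python loop; with Bodo installed it is compiled on first call.
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The body of the function is unchanged Python; only the decorator is added.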

How to run these examples?

We recommend that you run these examples on the Bodo Platform. You can sign up for our platform to try it out. Some examples, like modules 1 to 3, can run on small clusters, e.g., 2 nodes of c5.2xlarge with a total of 8 physical cores (16 vCPU) and 32GB RAM, while other examples need larger clusters. The description provided with each example indicates the cluster size required to run it.
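The totals quoted above follow from the per-node size of the instance type. A quick sketch of the arithmetic, assuming the published c5.2xlarge shape of 8 vCPUs and 16 GiB of memory per node with two hardware threads per physical core (the helper name `cluster_totals` is hypothetical):

```python
def cluster_totals(nodes, vcpus_per_node, mem_gib_per_node, threads_per_core=2):
    """Return (physical cores, vCPUs, memory in GiB) for a homogeneous cluster."""
    vcpus = nodes * vcpus_per_node
    physical_cores = vcpus // threads_per_core
    memory_gib = nodes * mem_gib_per_node
    return physical_cores, vcpus, memory_gib


# Two c5.2xlarge nodes (assumed: 8 vCPUs and 16 GiB each).
print(cluster_totals(nodes=2, vcpus_per_node=8, mem_gib_per_node=16))
```

The same helper can be used to compare a larger cluster against the size stated in an example's description.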

You can also run these examples locally by installing bodo on your laptop. However, we recommend using the Bodo Platform for the best experience as it provides a notebook environment with all the code available and required packages already installed for you.

What if I want to test my code with my own data?

If you want to run your application code with your own data, please refer to the instructions here on how to set up identity and access management, policies, and credentials to integrate your cloud provider with the Bodo Platform. This allows Bodo to spin up EC2 instances, create a cluster, and give you access to your data within your VPC. Everything, including your data, stays in your VPC.

Modules outline

Modules 01 to 03 focus on compute-heavy data transformations through ETL applications. You will find examples with operational databases like PostgreSQL, Oracle, and MySQL in module 01, a data warehouse like Snowflake in module 02, and a data lakehouse example with Iceberg in module 03.

Modules 04 and 05 contain larger-scale examples covering machine learning and business use cases (financial, transportation, etc.). Finally, module 06 contains a performance comparison of Bodo vs Spark on a set of queries derived from the TPC-H benchmark suite.

This is an open-source repository, so please consider adding your Bodo examples to it! You can contribute by creating a feature branch and submitting a pull request for us to review.

bodo-examples's Issues

Installation/runtime Issues with jupyterlab 3.X

Issue

Hi, I'm having issues installing and using Bodo in JupyterLab 3.X.

I've tried installing bodo via conda, but that doesn't work.
I've also tried the sample here: https://github.com/Bodo-inc/Bodo-examples/blob/master/docker/BodoNotebook.Dockerfile but conda is still unable to resolve the bodo package.

[+] Building 15.8s (5/5) FINISHED
 => [internal] load build definition from dockerfile                                                                                                                                        0.0s
 => => transferring dockerfile: 207B                                                                                                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                                                           0.0s
 => => transferring context: 2B                                                                                                                                                             0.0s
 => [internal] load metadata for docker.io/jupyter/minimal-notebook:latest                                                                                                                  1.1s
 => CACHED [1/2] FROM docker.io/jupyter/minimal-notebook:latest@sha256:cfeab9b91dfce03d9be1683f9d4728860c30757c593f9b9277da4ce7d1a4e7f3                                                     0.0s
 => ERROR [2/2] RUN conda install -y bodo ipyparallel -c bodo.ai -c conda-forge                                                                                                            14.7s
------
 > [2/2] RUN conda install -y bodo ipyparallel -c bodo.ai -c conda-forge:
#5 0.684 Collecting package metadata (current_repodata.json): ...working... done
#5 5.036 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#5 5.037 Collecting package metadata (repodata.json): ...working... done
#5 13.57 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#5 13.58
#5 13.58 PackagesNotFoundError: The following packages are not available from current channels:
#5 13.58
#5 13.58   - bodo
#5 13.58
#5 13.58 Current channels:
#5 13.58
#5 13.58   - https://conda.anaconda.org/bodo.ai/linux-aarch64
#5 13.58   - https://conda.anaconda.org/bodo.ai/noarch
#5 13.58   - https://conda.anaconda.org/conda-forge/linux-aarch64
#5 13.58   - https://conda.anaconda.org/conda-forge/noarch
#5 13.58
#5 13.58 To search for alternate channels that may provide the conda package you're
#5 13.58 looking for, navigate to
#5 13.58
#5 13.58     https://anaconda.org
#5 13.58
#5 13.58 and use the search bar at the top of the page.
#5 13.58
#5 13.58
------
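The channel URLs in the log above all end in linux-aarch64 or noarch, so the solver only searched ARM64 and platform-independent package indexes. A small sketch for extracting the searched platforms from such a log and comparing them with the local machine follows. The helper name `searched_subdirs` is hypothetical, and whether the bodo.ai channel publishes builds for a given platform is something to verify on the channel itself.

```python
import platform
import re


def searched_subdirs(conda_log):
    """Collect the platform subdirectories (last URL segment) conda searched."""
    urls = re.findall(r"https://conda\.anaconda\.org/\S+", conda_log)
    return sorted({url.rstrip("/").rsplit("/", 1)[-1] for url in urls})


# Channel lines copied from the build log above.
log = """
#5 13.58   - https://conda.anaconda.org/bodo.ai/linux-aarch64
#5 13.58   - https://conda.anaconda.org/bodo.ai/noarch
#5 13.58   - https://conda.anaconda.org/conda-forge/linux-aarch64
#5 13.58   - https://conda.anaconda.org/conda-forge/noarch
"""

print("conda searched:", searched_subdirs(log))
print("this machine:", platform.system(), platform.machine())
```

If the package turns out to be published only for x86-64, building the image for that platform (for example with Docker's --platform linux/amd64 option) is one thing to try.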

I then installed ipyparallel and bodo via pip and tried setting up JupyterLab as per the instructions: https://docs.bodo.ai/latest/source/installation_and_setup/ipyparallel.html#ipyparallelsetup

However, after running the sample code there are import issues with the bodo package (see last notebook cell output)

# input
import ipyparallel as ipp
import psutil

# Use one engine per physical core, capped at 8
n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)
# result
Using existing profile dir: '/opt/app-root/src/.ipython/profile_default'
Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100% 8/8 [00:12<00:00, 12.02s/engine]
# input
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello World from rank {comm.Get_rank()}. total ranks={comm.Get_size()}")
# result
[stdout:0] Hello World from rank 0. total ranks=8
[stdout:1] Hello World from rank 1. total ranks=8
[stdout:7] Hello World from rank 7. total ranks=8
[stdout:4] Hello World from rank 4. total ranks=8
[stdout:6] Hello World from rank 6. total ranks=8
[stdout:2] Hello World from rank 2. total ranks=8
[stdout:5] Hello World from rank 5. total ranks=8
[stdout:3] Hello World from rank 3. total ranks=8
# input
import bodo
# result
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-7b53d0178f92> in <module>
----> 1 import bodo

/opt/conda/lib/python3.8/site-packages/bodo/__init__.py in <module>
     20 from numba.core.types import List
     21 import bodo.libs
---> 22 import bodo.libs.distributed_api
     23 import bodo.libs.timsort
     24 import bodo.io

/opt/conda/lib/python3.8/site-packages/bodo/libs/distributed_api.py in <module>
     20 from numba.parfors.array_analysis import ArrayAnalysis
     21 import bodo
---> 22 from bodo.hiframes.datetime_date_ext import datetime_date_array_type
     23 from bodo.hiframes.datetime_timedelta_ext import datetime_timedelta_array_type
     24 from bodo.hiframes.pd_categorical_ext import CategoricalArrayType

/opt/conda/lib/python3.8/site-packages/bodo/hiframes/datetime_date_ext.py in <module>
     17 import bodo
     18 from bodo.hiframes.datetime_datetime_ext import DatetimeDatetimeType
---> 19 from bodo.hiframes.datetime_timedelta_ext import datetime_timedelta_type
     20 from bodo.libs import hdatetime_ext
     21 from bodo.utils.indexing import array_getitem_bool_index, array_getitem_int_index, array_getitem_slice_index, array_setitem_bool_index, array_setitem_int_index, array_setitem_slice_index

/opt/conda/lib/python3.8/site-packages/bodo/hiframes/datetime_timedelta_ext.py in <module>
     15 import bodo
     16 from bodo.hiframes.datetime_datetime_ext import datetime_datetime_type
---> 17 from bodo.libs import hdatetime_ext
     18 from bodo.utils.indexing import get_new_null_mask_bool_index, get_new_null_mask_int_index, get_new_null_mask_slice_index, setitem_slice_index_null_bits
     19 from bodo.utils.typing import BodoError, get_overload_const_str, is_iterable_type, is_list_like_index_type, is_overload_constant_str

ImportError: libmpi-badaf374.so.12.1.8: cannot open shared object file: No such file or directory
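The final ImportError names a shared library, libmpi-badaf374.so.12.1.8, that the dynamic loader could not find. One way to narrow this down is to list which libmpi files actually exist under the installed packages. A rough sketch follows; the helper name `find_mpi_libs` and the idea that the library should sit somewhere under site-packages are assumptions.

```python
from pathlib import Path


def find_mpi_libs(root):
    """Return sorted paths under root whose file name starts with 'libmpi'."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.rglob("libmpi*") if p.is_file())


# Path taken from the traceback above; adjust for your environment.
site_packages = "/opt/conda/lib/python3.8/site-packages"
for path in find_mpi_libs(site_packages):
    print(path)
```

An empty listing would suggest the library was never installed alongside the package, while a file with a different name or version would point to a mismatch between the installed bodo build and its MPI dependency.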

Packages

(2022-01-27 20:29:29) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i bodo
bodo 2021.10.1

(2022-01-27 20:30:37) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i ipyparallel
ipyparallel 8.1.0

(2022-01-27 20:30:44) (/opt/conda/lib/python3.8/site-packages/bodo/libs)
-$ pip list | grep -i jupyterlab
jupyterlab 3.2.8
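
The versions listed above can also be collected from inside the notebook kernel, which helps confirm that the kernel sees the same environment as the shell. A minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["bodo", "ipyparallel", "jupyterlab"]).items():
    print(f"{name}: {ver or 'not installed'}")
```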

Does bodo support csv?

Hello, thanks for providing this great module.

I installed bodo with pip3 successfully.
However, when I use @bodo.jit, an error occurs saying that bodo has no attribute jit:
"AttributeError: partially initialized module 'bodo' has no attribute 'jit' (most likely due to a circular import)"
How do I fix this environment issue? Thank you.
The above issue is fixed.
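
The issue does not say how the error was resolved. In general, one common cause of a "partially initialized module" message is a local file or directory with the same name as the package (for example a script called bodo.py), which Python imports instead of the installed library. A small sketch for checking this, using the hypothetical helper name `shadowing_candidates`:

```python
from pathlib import Path


def shadowing_candidates(module_name, directory="."):
    """Return local files or packages that would shadow `module_name` on import."""
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name / "__init__.py"]
    return [str(p) for p in candidates if p.exists()]


# Check the current working directory for anything named like the package.
print(shadowing_candidates("bodo"))
```

If the list is not empty, renaming the local file or directory is the usual remedy for this class of error.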

I would like to ask whether bodo supports csv files, or does it only support parquet?
Thank you very much.
