
versioning-benchmarks

Here we will run benchmarks for different data-versioning tools.

Setup

S3 setup

TODO

Repositories

Create a XetHub repository and GitHub repositories for DVC, Git LFS, and git-annex, each with a README. Here they are named xethub-py, xethub-git, versioning-dvc, versioning-lfs, and versioning-lfs-github.

Clone them locally and set up the remotes. Set your Git user name and Python environment:

GITUSER=$(git config --global user.name) # or manually set your GitHub/XetHub user name

python -m venv .venv \
&& source .venv/bin/activate \
&& pip install -r requirements.txt

# Download data - takes time! 
python src/download.py --dir=data --download=all --limit=2

# For quick testing. Seed can be any number.
python src/generate.py --path=0.csv --rows=1000 [--seed=1]
python src/generate.py --path=0.parquet --rows=1000 [--seed=1]
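For context, a minimal stand-in for what `src/generate.py` does with a fixed seed (the column names here are hypothetical; the real script's schema may differ):

```python
import csv
import random

def generate_csv(path: str, rows: int, seed: int = 1) -> None:
    """Write a small mock CSV; a fixed seed makes runs reproducible."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])  # hypothetical schema
        for i in range(rows):
            writer.writerow([i, rng.random()])

generate_csv("0.csv", rows=1000, seed=1)
```

Running it twice with the same seed produces byte-identical files, which is what makes seeded mock data useful for versioning benchmarks.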

XetHub setup

  1. git xet clone https://xethub.com/$GITUSER/xethub-py.git xethub-pyxet # use your own repository

  2. git xet clone https://xethub.com/xdssio/xethub-git.git xethub-git

  3. Get a token and set it as environment variables:

      export XET_USER_NAME=<user-name>
      export XET_USER_TOKEN=<xethub-token>
  4. pip install pyxet

  5. (Optional) Install CLI

DVC setup

  1. git clone https://github.com/$GITUSER/versioning-dvc dvc
  2. Install CLI
  3. pip install dvc dvc-s3
  4. Set up the remote:
    cd dvc
    dvc init
    dvc remote add -d versioning-article s3://<your-bucket-name>/dvc
    dvc remote modify versioning-article region us-west-2

LFS - native GitHub setup

Warning: THIS WILL COST YOU MONEY!

Limitations:

  • GitHub Free and GitHub Pro have a maximum file size limit of 2 GB
  • GitHub Team has a maximum file size limit of 4 GB
  • GitHub Enterprise Cloud has a maximum file size limit of 5 GB
  • Bitbucket Cloud has a maximum file upload limit of 10 GB

Setup:

  1. git clone https://github.com/xdssio/versioning-lfs-github.git lfs-github
  2. Install CLI
  3. cd lfs-github
  4. git lfs install
  5. git lfs track '*.parquet'
  6. git lfs track '*.csv'
  7. git add .gitattributes && git commit -m "Enable LFS" && git push

LFS setup + S3

  1. git clone https://github.com/$GITUSER/versioning-lfs lfs-s3
  2. Install CLI
  3. cd lfs-s3
  4. git lfs install
  5. git lfs track '*.parquet'
  6. git lfs track '*.csv'
  7. git add .gitattributes && git commit -m "Enable LFS" && git push
  8. cd .. # so we can setup the server
  9. LFS server setup - Reference
    • Generating a random key is easy: openssl rand -hex 32

      • Keep this secret and save it in a password manager so you don't lose it. We will pass this to the server below.
    • Create a lfs-server/.env file with the following contents:

      AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX
      AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      AWS_DEFAULT_REGION=us-west-2
      LFS_ENCRYPTION_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # the result of the openssl command above
      LFS_S3_BUCKET=my-bucket
      LFS_MAX_CACHE_SIZE=10GB
    • Improve performance (optional)

      # Increase the number of worker threads
      git config --global lfs.concurrenttransfers 64
      # Use a global LFS cache to make re-cloning faster
      git config --global lfs.storage ~/.cache/lfs      
  10. Update the lfs-s3/.lfsconfig file:

      [lfs]
      url = "http://0.0.0.0:8081/api/my-org/my-project"

      • 0.0.0.0 - the host name of your server
      • 8081 - the port your server started with
      • api - required to be "api"
      • my-org - replace with your organization name
      • my-project - replace with your project's name
  11. Run locally: docker-compose up
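As a sanity check, the URL pieces described above can be pulled apart with the standard library (the org/project names are the placeholders from the example, not real values):

```python
from urllib.parse import urlparse

# The example URL from .lfsconfig, with placeholder org/project names.
url = "http://0.0.0.0:8081/api/my-org/my-project"
parts = urlparse(url)
host, port = parts.hostname, parts.port
# path is "/api/<org>/<project>"; the leading "/" yields an empty first piece
_, api, org, project = parts.path.split("/")
assert api == "api"  # this segment is fixed by the server's route layout
```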

LakeFS

  1. Install CLI
  • On macOS: brew tap treeverse/lakefs && brew install lakefs
  2. Run using Docker with metadata and credentials:
    mkdir ~/lakefs/metadata  # for persistency
    docker run --pull always -p 8000:8000 -e LAKEFS_BLOCKSTORE_TYPE='s3' -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e LAKEFS_DATABASE_LOCAL_PATH=/etc/lakefs/metadata -v ~/lakefs/metadata:/etc/lakefs/metadata treeverse/lakefs run --local-settings
  3. Copy credentials and save to ~/.lakefs.yaml.
  4. Create a repository and connect it to S3 in the UI.

Run

Prepare docker servers

# Terminal 1
(cd lfs-server && docker-compose up)
# Terminal 2
export AWS_ACCESS_KEY_ID
docker run --pull always -p 8000:8000 -e LAKEFS_BLOCKSTORE_TYPE='s3' -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e LAKEFS_DATABASE_LOCAL_PATH=/etc/lakefs/metadata -v ~/lakefs/metadata:/etc/lakefs/metadata treeverse/lakefs run --local-settings

Workflows

Numeric non-git

export PYTHONPATH="$(pwd):$PYTHONPATH"
XET_LOG_LEVEL=debug XET_LOG_PATH=`pwd`/numeric.log python src/main.py numeric -i=20 --show --upload

Append to blog csv

export PYTHONPATH="$(pwd):$PYTHONPATH"
XET_LOG_LEVEL=debug XET_LOG_PATH=`pwd`/append.log python src/main.py append -i=30 --show --upload
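The append workflow grows a single CSV between commits; a hypothetical sketch of that core step (the file name and columns are illustrative, not taken from `src/main.py`):

```python
import csv

def append_rows(path: str, rows) -> None:
    """Append rows to an existing CSV without rewriting the whole file."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

# Create a file with a header, then grow it over two "iterations",
# mimicking what each benchmark iteration does before committing.
with open("append.csv", "w", newline="") as f:
    csv.writer(f).writerow(["id", "value"])

append_rows("append.csv", [[1, "a"], [2, "b"]])
append_rows("append.csv", [[3, "c"]])
```

Appending rather than regenerating the file is what exercises each tool's delta/deduplication behavior.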

Mock data

export PYTHONPATH="$(pwd):$PYTHONPATH"
python src/generate.py --dir=mock --count=10 --rows=10
python src/main.py --dir=mock --show --upload

Taxi

export PYTHONPATH="$(pwd):$PYTHONPATH"
python src/download.py --dir=data --download=all --limit=40
python src/main.py --dir=data --show --upload
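Under the hood, a benchmark run like this amounts to timing each tool's upload step and comparing the results; a minimal sketch of that pattern (the operation names here are placeholders, not `src/main.py`'s actual code):

```python
import time

def benchmark(operations: dict) -> dict:
    """Time each named operation and return seconds elapsed per name."""
    results = {}
    for name, op in operations.items():
        start = time.perf_counter()
        op()
        results[name] = time.perf_counter() - start
    return results

# Placeholder operations standing in for per-tool commit/push calls.
timings = benchmark({
    "noop": lambda: None,
    "sleep": lambda: time.sleep(0.01),
})
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock, which matters when the operations being compared take milliseconds.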

Tests


PYTHONPATH="$(pwd):$PYTHONPATH" pytest tests

python src/generate.py --dir=mock --count=10 --rows=10
python src/main.py --dir=mock --show # for quick testing

# or
export PYTHONPATH="$(pwd):$PYTHONPATH" 
python src/main.py --dir=mock --rows=10000

Analyse results

Visualize the results with snakeviz:

snakeviz profile.prof

Use jupyter notebook to analyse the results:

pip install jupyter
jupyter notebook

# open the notebook in the browser

All data is pointers

# deprecated
AWS_ACCESS_KEY_ID='LAKEFS_ACCESS_KEY' AWS_SECRET_ACCESS_KEY='LAKEFS_SECRET' aws s3 ls --endpoint http://localhost:8000

CLI

Upload a file:
lakectl fs upload -s mock/0.parquet lakefs://versioning-article/main/0.parquet
Downloading:
lakectl fs download lakefs://<REPO>/<BRANCH>/path/to/object <DESTINATION>
You can also pass the --recursive flag to upload or download directories.



Contributors

  • xdssio
  • ylow
