d27x / lance

This project forked from lancedb/lance

Modern columnar data format for ML implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

Home Page: https://eto-ai.github.io/lance/

License: Apache License 2.0

Shell 0.28% C++ 0.17% Python 8.99% C 0.29% Rust 80.52% Makefile 0.08% CMake 0.24% Jupyter Notebook 9.41% Dockerfile 0.03%

lance's Introduction

Lance Logo

Modern columnar data format for ML. Convert from parquet in 2 lines of code for 100x faster random access, a vector index, data versioning, and more.
Compatible with pandas, duckdb, polars, pyarrow, with more integrations on the way.

Documentation · Blog · Discord · Twitter

CI Badge Docs Badge crates.io badge Python versions badge


Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:

  1. Building search engines and feature stores.
  2. Large-scale ML training requiring high-performance I/O and shuffles.
  3. Storing, querying, and inspecting deeply nested data for robotics, or large blobs like images, point clouds, and more.

The key features of Lance include:

  • High-performance random access: 100x faster than Parquet without sacrificing scan performance.

  • Vector search: find nearest neighbors in milliseconds and combine OLAP queries with vector search.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • Ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, and more on the way.

Quick Start

Installation

pip install pylance

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

# Write a small Parquet dataset first...
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

# ...then read it back as a pyarrow dataset and write it out as Lance
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()
df

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
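
Polars

Polars can consume the same Arrow data. This is a minimal sketch, not part of the upstream quick start, assuming the polars package is installed:

import polars as pl

# Materialize the Lance dataset as an Arrow table and hand it to Polars
pl_df = pl.from_arrow(dataset.to_table())
pl_df.head()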

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    # Skip the leading 4-byte header and unpack 1M x 128 little-endian float32 values
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]
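
The nearest query can also be tuned to trade latency for recall. The sketch below is illustrative rather than canonical: the nprobes and refine_factor keys are assumptions based on the pylance documentation, so verify the exact names and defaults for your installed version.

# Probe more IVF partitions and re-rank candidates with exact distances
# (parameter names assumed from the pylance docs; check your version)
rs_tuned = [dataset.to_table(nearest={
                "column": "vector",
                "k": 10,
                "q": q,
                "nprobes": 20,        # number of IVF partitions to search
                "refine_factor": 10,  # re-rank 10*k candidates using the raw vectors
            })
            for q in query_vectors]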

Directory structure

Directory | Description
rust      | Core Rust implementation
python    | Python bindings (pyo3)
docs      | Documentation source

What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.

Vector index: a built-in vector index for similarity search over the embedding space.

Encodings: to achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.
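
As a rough illustration of the point-query side, the sketch below reuses the SIFT dataset written in the quick start and fetches a handful of rows by position through the standard pyarrow Dataset API; exact behavior and performance depend on your pylance version.

import lance

ds = lance.dataset("vec_data.lance")
# Fetch arbitrary rows by position without scanning the whole dataset
rows = ds.take([0, 500000, 999999])
print(rows.num_rows)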

Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.

Versioning: a Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation; a short sketch follows.
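
The sketch below illustrates this with the quick-start dataset; it assumes the versions() method and the version argument of lance.dataset(), so check the documentation for your installed version.

import lance
import pyarrow as pa

uri = "/tmp/test.lance"

# Appending (or overwriting) writes a new version automatically
new_rows = pa.Table.from_pydict({"a": [6], "b": [12]})
lance.write_dataset(new_rows, uri, mode="append")

ds = lance.dataset(uri)
print(ds.versions())                   # snapshots recorded in the manifest
first = lance.dataset(uri, version=1)  # time-travel back to the first version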

Fast updates (ROADMAP): Updates will be supported via write-ahead logs.

Rich secondary indices (ROADMAP):

  • Inverted index for fuzzy search over many label / annotation fields

Benchmarks

Vector search

We used the SIFT dataset (1M vectors, 128 dimensions) to benchmark our results.

  1. For 100 randomly sampled query vectors, we get <1 ms average response time (on a 2023 M2 MacBook Air).

avg_latency.png

  2. ANN is always a trade-off between recall and performance.

avg_latency.png

Vs. Parquet

We created a Lance dataset from the Oxford Pet dataset to do some preliminary performance testing of Lance compared to Parquet and to raw images/XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files.

Why are you building yet another data format?!

The machine learning development cycle involves these steps:

graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

People use different data representations at different stages, either for performance or because of the tooling available. Academia mainly uses XML / JSON for annotations and zipped images / sensor data for deep learning, which is difficult to integrate into data infrastructure and slow to train over cloud storage. Industry uses data lakes (Parquet-based technologies such as Delta Lake and Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies from cloud storage to local training instances, have become common practice among ML teams.

While each existing data format excels at the workload it was originally designed for, we need a new data format tailored to the multistage ML development cycle to reduce the friction from tools and data silos.

A comparison of different data formats at each stage of the ML development cycle:

                    | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse
Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast
Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good
Training            | Fast  | Decent        | Slow       | Fast     | N/A      | N/A
Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent
Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich

Presentations and Talks

