lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

Home Page: https://lancedb.github.io/lance/

License: Apache License 2.0

Python 12.07% Dockerfile 0.01% Rust 76.51% CMake 0.04% Makefile 0.02% C 0.10% C++ 0.03% Shell 0.04% Jupyter Notebook 8.66% Java 2.53%
machine-learning computer-vision data-format deep-learning python apache-arrow duckdb mlops data-analysis data-analytics

lance's Introduction

Lance Logo

Modern columnar data format for ML. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, data versioning, and more.
Compatible with pandas, DuckDB, Polars, and pyarrow with more integrations on the way.

Documentation | Blog | Discord | Twitter



Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:

  1. Building search engines and feature stores.
  2. Large-scale ML training requiring high performance IO and shuffles.
  3. Storing, querying, and inspecting deeply nested data for robotics or large blobs like images, point clouds, and more.

The key features of Lance include:

  • High-performance random access: 100x faster than Parquet without sacrificing scan performance.

  • Vector search: find nearest neighbors in milliseconds and combine OLAP queries with vector search.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • Ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB and more on the way.

Tip

Lance is in active development and we welcome contributions. Please see our contributing guide for more information.

Quick Start

Installation

pip install pylance

To install a preview release:

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance

Tip

Preview releases are released more often than full releases and contain the latest features and bug fixes. They receive the same level of testing as full releases. We guarantee they will remain published and available for download for at least 6 months. When you want to pin to a specific version, prefer a stable release.

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()
df

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ",
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]

Directory structure

Directory   Description
rust        Core Rust implementation
python      Python bindings (pyo3)
docs        Documentation source

What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.

Vector index: a vector index for similarity search over the embedding space. Supports both CPUs (x86_64 and ARM) and GPUs (NVIDIA via CUDA and Apple Silicon via MPS).

Encodings: To achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.

Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.

Versioning: A Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.
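For illustration, a minimal sketch of inspecting and time-traveling between versions with the pylance API (assuming a dataset written as in the Quick Start below; exact method names may vary between releases):

import lance

ds = lance.dataset("/tmp/test.lance")
print(ds.version)     # current version number of this dataset
print(ds.versions())  # metadata for every recorded snapshot

# Appends create a new version automatically; older snapshots stay readable.
old = lance.dataset("/tmp/test.lance", version=1)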

Fast updates (ROADMAP): Updates will be supported via write-ahead logs.

Rich secondary indices (ROADMAP):

  • Inverted index for fuzzy search over many label / annotation fields.

Benchmarks

Vector search

We used the SIFT dataset to benchmark our results with 1M vectors of 128 dimensions.

  1. For 100 randomly sampled query vectors, we get <1 ms average response time (on a 2023 M2 MacBook Air).

  2. ANNs are always a trade-off between recall and performance.

Vs. Parquet

We created a Lance dataset from the Oxford Pet dataset to do some preliminary performance testing of Lance compared to Parquet and raw image/XML files. For analytics queries, Lance is 50-100x better than reading the raw metadata. For batched random access, Lance is 100x better than both Parquet and raw files.

Why are you building yet another data format?!

The machine learning development cycle involves the steps:

graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

People use different data representations at different stages, either for performance or because of the tooling available. Academia mainly uses XML / JSON for annotations and zipped images/sensor data for deep learning, which is difficult to integrate into data infrastructure and slow to train over cloud storage. Industry uses data lakes (Parquet-based technologies such as Delta Lake and Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice.

While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored for multistage ML development cycles to reduce data silos.

A comparison of different data formats in each stage of ML development cycle.

                      Lance   Parquet & ORC   JSON & XML   TFRecord   Database   Warehouse
Analytics             Fast    Fast            Slow         Slow       Decent     Fast
Feature Engineering   Fast    Fast            Decent       Slow       Decent     Good
Training              Fast    Decent          Slow         Fast       N/A        N/A
Exploration           Fast    Slow            Fast         Slow       Fast       Decent
Infra Support         Rich    Rich            Decent       Limited    Rich       Rich

Community Highlights

Lance is currently used in production by:

  • LanceDB, a serverless, low-latency vector database for ML applications
  • A self-driving car company, for large-scale storage, retrieval, and processing of multi-modal data.
  • An e-commerce company, for billion-scale+ vector personalized search.
  • and more.

Presentations and Talks


lance's Issues

Verify initial installation scenarios

  • Support python 3.8-3.10
  • Support mac x86, mac arm64, manylinux
  • pip install pylance into a clean venv (not conda) on a system that hasn't had libarrow* installed separately
  • Verify that we can import lance and pyarrow
  • Verify that we can call lance.dataset and lance.scan

Put this in a GH action so it can be re-run before every release and also at least once a day on a schedule

Support offset and limit pushdown

Problem Statement

A major use case for the Lance format is holding ML datasets that are easy to inspect. It should answer queries like SELECT * FROM dataset WHERE ... LIMIT 20 quickly. Combined with the fact that Lance is designed for embedded assets such as images, it is desirable to reduce the amount of data scanned.

Desired Behavior

Support offset and limit pushdown, so that large columns are only scanned when necessary.
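As a sketch of the intended call pattern (using the lance.scanner API shown elsewhere in this page; the column names and exact signature are illustrative, not final):

import lance

ds = lance.dataset("/tmp/test.lance")
# With pushdown, only the ~20 returned rows should be read from storage,
# even for large blob columns such as images.
tbl = lance.scanner(ds, columns=["image", "label"], limit=20, offset=50).to_table()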

Examples of lance files for verifying different implementations

Since we may implement lance parsers/generators in different languages (Rust/Scala/etc.), a collection of lance files would be useful for testing the correctness and interoperability of the different implementations.

s3://eto-public/datasets/oxford_pet/pet.lance is 1 GiB, which is too big for this purpose.

Change Var-length Binary Encoding to store the offset first then store the values.

Problem Statement

Currently, we are storing the values before the offsets:

| value1 | value2 | value3 | value4 | offset0 | offset1 | offset2 | offset3 | offset4 |

This makes decoding a batch of string values require two separate reads, one of them for the offset array. Moving the offsets in front of the values could allow a speculative read to fetch both offsets and values in a single I/O.

We can also evaluate other alternatives to reduce the number of I/Os needed to scan a var-length binary encoded page.

Related to #67
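For illustration only, a small Python sketch of decoding a page under the proposed offsets-first layout, where one contiguous read covers both the offsets and the values:

import struct

def decode_var_binary_page(buf, num_values):
    # Offsets block first: (num_values + 1) little-endian int32 offsets into the values block.
    offsets_size = 4 * (num_values + 1)
    offsets = struct.unpack(f"<{num_values + 1}i", buf[:offsets_size])
    values = buf[offsets_size:]
    # Values block second: slice each value out using consecutive offsets.
    return [values[offsets[i]:offsets[i + 1]] for i in range(num_values)]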

Fix CMake Warnings for not finding c-ares

Problem Statement

We have the following warnings on Ubuntu during cmake configuration.

CMake Warning at /usr/lib/x86_64-linux-gnu/cmake/arrow/Findc-aresAlt.cmake:25 (find_package):
  By not providing "Findc-ares.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "c-ares", but
  CMake did not find one.

  Could not find a package configuration file provided by "c-ares" with any
  of the following names:

    c-aresConfig.cmake
    c-ares-config.cmake

  Add the installation prefix of "c-ares" to CMAKE_PREFIX_PATH or set
  "c-ares_DIR" to a directory containing one of the above files.  If "c-ares"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/ArrowConfig.cmake:97 (find_dependency)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:223 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:344 (arrow_find_package_cmake_package_configuration)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:389 (arrow_find_package)
  CMakeLists.txt:61 (find_package)

Desired Behavior

No such warnings during CMake configuration.

Release intel and apple silicon wheels from Github Action

Problem Statement

Let's figure out how to build x86_64 and arm64 targets from GitHub Actions mac runners, so that we can do releases directly from GH Actions.

  • Install / Build arrow / protobuf for both x86_64 and arm64
  • Build lance cpp SDK for both x86_64 and arm64.
  • Build python wheel via cibuildwheel for both x86_64 and arm64

Improve column scan performance

Problem Statement

On commit fbc3645, running on EC2 (m6i.8xlarge)

import lance
import time
import pyarrow.dataset

ds = lance.dataset("s3://.../pet.lance")
parq_ds = pyarrow.dataset.dataset("s3://.../pet.parquet")

start = time.time()
parq_ds.scanner(columns=["label"]).to_table()
print(f"Parquet: Loading label column: {time.time() - start}")

start = time.time()
parq_ds.scanner(columns=["width"]).to_table()
print(f"Parquet: Loading width column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["label"]).to_table()
print(f"Lance: Loading label column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["width"]).to_table()
print(f"Lance: Loading width column: {time.time() - start}")
Parquet: Loading label column: 0.0885462760925293
Parquet: Loading width column: 0.05171823501586914
Lance: Loading label column: 0.3289780616760254
Lance: Loading width column: 0.262282133102417

Using parquet-utils to inspect the parquet schema:

############ Column(width) ############
name: width
path: width
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 31%)

############ Column(label) ############
name: label
path: label
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 6%)

Desired Behavior

Given that the storage space of these two columns is similar between lance/parquet formats, there should be some low-hanging fruit to improve lance scan performance.

Provide WriteTable API in Python.

Problem

Currently, the python package can read a lance file. We need to add write support in Python as well.

The WriteTable functionality has already been implemented in C++ side.

Desired behavior

Implement a pylance.write_table(...) API that wraps the C++ WriteTable API, takes in a Pyarrow Table, and writes it to disk.
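A sketch of what the proposed wrapper could look like from the user's side (this is the same lance.write_table call that appears in other issues on this page, not a finalized signature):

import pyarrow as pa
import lance

table = pa.Table.from_pydict({"id": [1, 2, 3], "label": ["cat", "dog", "cat"]})
# Proposed API: wrap the existing C++ WriteTable to persist a pyarrow Table to disk.
lance.write_table(table, "/tmp/example.lance")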

Reading COCO segmentation failed due to nulls

Problem Statement

Reading the COCO dataset with segmentation failed with the error message:

row_id=12472 Mismatching child array lengths

Adding more debug info into FileReader::GetListArray() and FileReader::GetStructArray(), I got the following messages:

Failed to build struct array: field=segmentation children=["height", "polygon", "rle", "type", "width"] batch=17: Mismatching child array lengths
Arrays: [[
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  ...
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478
], <Invalid array: Length spanned by list offsets (23) larger than values array (length 22)>, [
  [],
...

Compute min/max values for columns to support file-level and chunk-level pruning.

Problem Statement

Using min/max values to skip / prune chunks is a common practice in columnar storage. Let's add it as well to support coarse-grained filtering.

Open questions:

  • Should we use different bit widths to support different data types (i.e., 8 bits for int8/uint8, 64 bits for int64/uint64/double), or just one implementation for all?
  • How do we support string values efficiently? Do we only want to support min/max for dictionary-encoded string values, or for string values in general?

Desired behavior

Compute min/max values and store them in a fashion that does not hurt either full scans or point queries. Ideally, such indices are only loaded when necessary, while also taking advantage of optimal I/O sizes on S3/GCS and of vectorization.
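For reference, a minimal sketch of computing per-chunk statistics with pyarrow.compute, which is the kind of metadata the pruning described above would persist:

import pyarrow as pa
import pyarrow.compute as pc

widths = pa.array([640.0, 480.0, 1024.0])   # one chunk of a "width" column
stats = pc.min_max(widths)                  # StructScalar with "min" and "max" fields
chunk_min, chunk_max = stats["min"].as_py(), stats["max"].as_py()
# A filter such as width > 2000 can skip this whole chunk because chunk_max < 2000.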

segfault querying coco data

In [1]: import lance

In [2]: import duckdb

In [3]: uri_lance = "s3://eto-public/datasets/coco/coco.lance"

In [4]: dataset_lance = lance.dataset(uri_lance)

In [5]: duckdb.query("SELECT id FROM dataset_lance")
Out[5]: 
---------------------
--- Relation Tree ---
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- id (BIGINT)

---------------------
-- Result Preview  --
---------------------
id	
BIGINT	
[ Rows: 10]
31625	
257943	
153631	
461884	
376449	
534373	
486864	
94795	
193474	
482362	



In [6]: [1]    996655 segmentation fault (core dumped)  ipython

Protobuf build error on MacOS

Problem Statement

make failed on macOS after protobuf was upgraded by homebrew:

[ 82%] Built target io
[ 83%] Linking CXX shared library liblance.dylib
Undefined symbols for architecture arm64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()", referenced from:
      google::protobuf::MessageLite::~MessageLite() in format.pb.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [liblance.dylib] Error 1
make[1]: *** [CMakeFiles/lance.dir/all] Error 2
make: *** [all] Error 2

OS: macOS 12.4
Protobuf: 21.2 (Homebrew)
Arrow: 8.0.0_4 (Homebrew)

Benchmark Pet / COCO raw dataset performance

Problem Statement

We could keep a benchmark suite to test analytics and point query performance over raw ML datasets (e.g., COCO and Oxford Pet), and use this as a baseline to compare Lance performance against.

Conceptually, the benchmarks should answer queries like the following from raw JSON / XML metadata files:

# Label distribution
SELECT label, count(1) FROM dataset GROUP BY 1

# Point query
SELECT image, label, id FROM dataset WHERE label in ("dog", "cat") LIMIT 20 OFFSET 50;

Desired Behavior

Implement a benchmark that establishes a baseline for raw ML dataset performance and answers the desired queries.

Update README

Update README to prepare for public consumption.

  • Project goals and non-goals
  • Instructions for building the project
  • Simple code snippet / examples.
  • Link to docs / spec / references / talks / presentations

Support and enforce primary key

Problem Statement

We assume that each lance dataset has a primary key, to guarantee fast point queries later.
Currently, the WriteTable API takes a mandatory primary key parameter; however, we do not enforce uniqueness of, or sorting by, the primary key at the moment.

Desired Behavior

WriteTable should enforce:

  • A primary key is provided
  • The column of primary key exists
  • The uniqueness of the primary key in the column.
  • Optionally, sort the dataset by the primary key?
  • Open question: in the first version, do we want to limit the data types supported as a primary key?
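A minimal sketch of the checks listed above, written as a standalone pyarrow helper rather than the actual WriteTable implementation:

import pyarrow as pa
import pyarrow.compute as pc

def check_primary_key(table: pa.Table, primary_key: str) -> None:
    # The primary key column must exist.
    if primary_key not in table.column_names:
        raise ValueError(f"primary key column {primary_key!r} does not exist")
    column = table.column(primary_key)
    # Every value in the primary key column must be unique.
    if len(pc.unique(column)) != len(column):
        raise ValueError(f"primary key column {primary_key!r} contains duplicate values")
    # Optionally, the dataset could also be sorted by the primary key before writing.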

Be more graceful handling unknown data types

Currently, if you try to write an Arrow extension dtype using lance.write_table, it segfaults:

./cpp/src/arrow/result.cc:28: ValueOrDie called on an error: NotImplemented: FromLogicalType: logical_type "extension<my_package.uuid<UuidType>>" is not supported yet
/lib/x86_64-linux-gnu/libarrow.so.800(+0x156f1df)[0x7f433acac1df]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow4util8ArrowLogD1Ev+0xe1)[0x7f433a5a40f1]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x12a)[0x7f433a694b8a]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNO5arrow6ResultISt10shared_ptrINS_8DataTypeEEE10ValueOrDieEv+0x4b)[0x7f431290cf05]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNK5lance6format5Field4typeEv+0x369)[0x7f43129056a5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter10WriteArrayERKSt10shared_ptrINS_6format5FieldEERKS2_IN5arrow5ArrayEE+0x63)[0x7f4312950bab]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter5WriteERKSt10shared_ptrIN5arrow11RecordBatchEE+0x137)[0x7f43129509c5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance5arrow10WriteTableERKN5arrow5TableESt10shared_ptrINS1_2io12OutputStreamEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8optionalINS0_16FileWriteOptionsEE+0x2d8)[0x7f43128b842f]


Improve select by indices performance on list array

Problem Statement

The current implementation has FileReader read the full column between indices[0] and indices[-1] into memory first, and then apply arrow::compute::Function("take") to filter the data.

Desired Behavior

We can push the selection further down to reduce the amount of data read from S3 / disk

Use Apache Arrow's Executor to manage thread pool

Problem

We implemented a naive thread pool in Scanner to perform batch prefetch. Apache Arrow offers a thread executor for other data formats; we should probably just re-use their thread pool executor, making Lance play nicer with the ecosystem.

https://arrow.apache.org/docs/cpp/threading.html
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread_pool.h

Desired Behavior

Use Apache Arrow to manage threads in Lance Reader / Scanner.

Pyarrow Dataset Scanner has no public cython definition

Problem Statement

#61 attempts to create a new arrow Scanner object to support nested column parsing, as well as LIMIT / OFFSET pushdown. However, pyarrow._dataset.Scanner does not have a public definition in _dataset.pyd, which makes it difficult to wrap the CScanner object from C++ into a pyarrow.dataset.Scanner python object.

For example, we can not call https://github.com/apache/arrow/blob/02c8598d264c839a5b5cf3109bfd406f3b8a6ba5/python/pyarrow/_dataset.pyx#L2361-L2362

Desired Behavior

Short term, we'd like to be able to create something that DuckDB can consume correctly from Python.
Long term, we need to either be able to create the Scanner on our own, or fix arrow / duckdb to support nested column parsing.

parallel reads seem to be turned off in lance?

I ran the label distribution query against COCO for parquet and lance. For parquet, the total CPU time was greater than the wall time, but for lance that was not the case. I vaguely remember discussing this before; not sure if it's already a known issue.

Improve build and release process for the python package

Problem Statement

Currently, we build the python package using a manually compiled pyarrow. The reasons are:

  • We are using the official C++ packages via apt-get to install libarrow-*.dev. However, they are not compatible with the libarrow.so installed by pip install pyarrow.
  • The PyPI pyarrow was built via manylinux2014(?), but we are building Lance on Ubuntu 22.04 with C++20.

Desired behavior

Figure out the correct practice to build and release python package.

Establish performance baseline

Problem Statement

We should maintain a set of benchmarks to establish a performance baseline, observe regressions, and drive performance optimization.

Desired behavior

We have a set of representative benchmarks that we can run between releases to monitor performance improvement/degradation.

Support simple predicates push down

Problem Statement

Assuming a dataset with schema primary_key: string, label: dict[string], image: Image, if we run the query:

SELECT image WHERE label = "cat" Limit 20

We should expect that Lance does not actually scan all image data. Instead, it should use the filter (label = 'cat') to selectively read the image column.

Desired Behavior

Use predicate pushdown to reduce scan traffic over large columns (e.g., Tensor / Image).
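The equivalent call from Python might look like the sketch below; the dataset path and column names are hypothetical, and the filter/to_table arguments follow the pyarrow-style API used elsewhere on this page:

import lance
import pyarrow.compute as pc

ds = lance.dataset("/tmp/pets.lance")  # hypothetical dataset with label and image columns
tbl = ds.to_table(columns=["image"], filter=pc.field("label") == "cat")
# With predicate pushdown, Lance should evaluate the label filter first and only
# fetch the matching rows of the large image column.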

Duckdb issues

Keep track of a list of duckdb issues to support sophon's workload. We can later decide whether we want to report these issues/feature requests to upstream.

  • Duckdb does not pushdown LIMIT / OFFSET via Apache Arrow. Apache Arrow does not support LIMIT / OFFSET in Scanner in general.
    • We might need to tune the batch size to simulate the "limit" request.
  • Does not support projection over list<struct>
>>> duckdb.query("SELECT annotations.label FROM coco")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Cannot extract field 'label' from expression "annotations" because it is not a struct

Enable Prefetching when Limit / Offset clause is included.

Problem Statement

For correctness and simplicity, we disabled prefetch and parallel scan when a LIMIT / OFFSET clause is provided (#45), to make it easier to count the number of rows read. We expect that to serve a web app, we usually only apply LIMIT 20-100 for pagination purposes. However, there are more opportunities to improve the I/O process and reduce latency.

For example, we can still allow sequential prefetch threads to run in the background without sacrificing correctness. Furthermore, we can use hedged reads to improve parallelism as well, though that needs a fine-tuned schedule.

scanner api should take string filter expressions

automatically convert str to pyarrow._compute.Expression (pyarrow should have APIs to do this already)

Also, columns is typed as Optional[str] but currently actually requires Iterable[str]. It should be able to handle None, str, and list-likes of str.
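A sketch of the columns normalization part (hypothetical helper; the string-to-Expression conversion is left out):

from typing import Iterable, Optional, Union

def normalize_columns(columns: Union[None, str, Iterable[str]]) -> Optional[list]:
    # Accept None (all columns), a single column name, or any iterable of names.
    if columns is None:
        return None
    if isinstance(columns, str):
        return [columns]
    return list(columns)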

Prepare Python Release

Release Checklist

  • Python CI #4
  • Manylinux 2014 on Linux x86 (python 3.8 - 3.10)
  • Intel / Apple Silicon release on Mac (target latest MacOS and default python3 version, which is 3.10)
  • NO conda
  • Verify that we can pip install into clean env with no libarrow* pre-installed
  • Verify we can import pyarrow and lance
  • Verify we can create a pyarrow dataset with lance.dataset

Reduce cython compile warnings

Problem Statement

When compiling cython via python setup.py build, there are warnings like:

lance/_lib.cpp:4810:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor’ defined but not used [-Wunused-variable]
 4810 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor)(std::shared_ptr< arrow::Tensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4809:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix’ defined but not used [-Wunused-variable]
 4809 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix)(std::shared_ptr< arrow::SparseCSRMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4808:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor’ defined but not used [-Wunused-variable]
 4808 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor)(std::shared_ptr< arrow::SparseCSFTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4807:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix’ defined but not used [-Wunused-variable]
 4807 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix)(std::shared_ptr< arrow::SparseCSCMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4806:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor’ defined but not used [-Wunused-variable]
 4806 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor)(std::shared_ptr< arrow::SparseCOOTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4805:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array’ defined but not used [-Wunused-variable]
 4805 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array)(std::shared_ptr< arrow::ChunkedArray>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4804:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_array’ defined but not used [-Wunused-variable]
 4804 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_array)(std::shared_ptr< arrow::Array>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4803:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar’ defined but not used [-Wunused-variable]
 4803 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar)(std::shared_ptr< arrow::Scalar>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4802:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema’ defined but not used [-Wunused-variable]
 4802 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema)(std::shared_ptr< arrow::Schema>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4801:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_field’ defined but not used [-Wunused-variable]
 4801 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_field)(std::shared_ptr< arrow::Field>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4800:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type’ defined but not used [-Wunused-variable]
 4800 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type)(std::shared_ptr< arrow::DataType>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4799:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer’ defined but not used [-Wunused-variable]
 4799 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer)(std::shared_ptr< arrow::ResizableBuffer>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4798:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer’ defined but not used [-Wunused-variable]
 4798 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer)(std::shared_ptr< arrow::Buffer>  const &); /*proto*/

It is probably because we did cimport *, which imports more unused structures than needed.

from pyarrow.includes.common cimport *

Desired Behavior

We should expect no such unused-variable warnings during python build.

Lance filtering not working

Use case: build a scanner that filters on annotations.name using lance.scanner.
We cannot do that directly because of #60.
So as a workaround, I'm filtering for the image_ids using unnest, then setting up a scanner with the filtered ids.

However, the filter does not work on Lance but does for parquet.

Repro:

Works in parquet:

import pyarrow as pa
import pyarrow.compute as pc
from pyarrow.dataset import dataset

ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = dataset('s3://eto-public/datasets/coco/coco_links.parquet')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

Fails with Lance:

import lance
import pyarrow.compute as pc
ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = lance.dataset('s3://eto-public/datasets/coco/coco_links.lance')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

With error message:

---------------------------------------------------------------------------
ArrowIndexError                           Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:331, in pyarrow._dataset.Dataset.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:2577, in pyarrow._dataset.Scanner.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:127, in pyarrow.lib.check_status()

ArrowIndexError: Index 9 out of bounds

Expose filter APIs via Python

Problem Statement

The Lance format itself supports nested column projection pushdown and filter/predicate pushdown, as well as LIMIT / OFFSET pushdown (#45). However, the Apache Arrow ScanOptions and the DuckDB Pyarrow integration do not fully support these optimizations yet (#46).

We could expose these filters via our Python and C++ APIs first to make them usable.

Desired Behavior

Enrich the Python / C++ API to support nested column projection pushdown, filter / predicate pushdown, and LIMIT / OFFSET clause pushdown.

Proposed python API change:

# lance/__init__.py
def dataset(
   uri: str,
   columns: Optional[list[str]] = None,
   filters: Optional[pyarrow.Expression] = None,
   limit: Optional[int] = None,
   offset: int = 0
) -> pyarrow.dataset.Dataset:
    ...
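Hypothetical usage of the proposed signature above (the filter expression is only an example):

import lance
import pyarrow.compute as pc

ds = lance.dataset(
    "./coco.lance",
    columns=["id", "annotations.label"],
    filters=pc.field("id") > 100,   # pushed down instead of filtered after a full scan
    limit=20,
)
tbl = ds.to_table()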

Cannot write datetime column

In [7]: import pandas as pd

In [8]: import pyarrow as pa

In [9]: import lance

In [10]: df = pd.DataFrame({'date': pd.date_range('2022-01-01', periods=100, freq='D')})

In [11]: df.dtypes
Out[11]: 
date    datetime64[ns]
dtype: object

In [12]: table = pa.Table.from_pandas(df)

In [13]: table.schema
Out[13]: 
date: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 377

In [14]: lance.write_table(table, '/tmp/test.lance')
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 lance.write_table(table, '/tmp/test.lance')

File ~/code/eto/lance/python/lance/__init__.py:63, in write_table(table, destination)
     53 def write_table(table: pa.Table, destination: Union[str, Path]):
     54     """Write an Arrow Table into the destination.
     55 
     56     Parameters
   (...)
     61         The destination to write dataset to.
     62     """
---> 63     WriteTable(table, destination)

File ~/code/eto/lance/python/lance/_lib.pyx:117, in lance.lib.WriteTable()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: PlainEncoder:: does not support data type timestamp[ns]
> /home/chang/code/eto/lance/python/scripts/pyarrow/error.pxi(100)pyarrow.lib.check_status()

ipdb> 

Null supports in all encodings

Problem Statement

Apache Arrow supports arrays with nulls. We should expect to store them efficiently on disk as well.

  • PlainEncoding
  • VarBinaryEncoding

Dictionary encoding uses plain encoding for its indices, so null support can be handled via plain encoding.

projection design

When we project annotations.label, for example, the output schema is still a list of struct, so we're still faced with the same query syntax issues as before.

Would it make sense for the output schema to become just a list of string instead?

Expected Behavior

For a schema of list of struct, annotations.label returns list<string> instead of list<struct<label: string>>

Support Image, 2D bounding box and segmentation as Arrow extension types

Problem Statement

To support rich semantic types, we can utilize Arrow's extension types. This will help us reach feature parity with the Rikai storage format.

Questions to answer:

  • Do we want to only implement the data type in Lance?
  • How should we consolidate with Rikai's type system?
  • How to make the type system extensible?
  • Or should we keep the extension types out of the lance data format. Lance just accepts extension types?

Desired Behavior

  • Support Image type
  • Bounding Box
  • Segmentation
  • Label?
  • These objects can be serialized/deserialized transparently without manually providing a schema.
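A rough sketch of what an Image extension type could look like using pyarrow's extension-type machinery (the type name and binary storage layout here are hypothetical, not a decided design):

import pyarrow as pa

class ImageType(pa.ExtensionType):
    # Hypothetical image type stored as encoded bytes (e.g. PNG/JPEG).
    def __init__(self):
        super().__init__(pa.binary(), "lance.image")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""  # no parameters to persist in this sketch

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return ImageType()

pa.register_extension_type(ImageType())
images = pa.ExtensionArray.from_storage(ImageType(), pa.array([b"<png bytes>"], type=pa.binary()))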

Segment fault using lance.Scanner(offset=N)

Problem Statement

The following code causes python to crash, on commit 3264363:

start = time.time()
lance.scanner(ds, columns=["image", "label"], limit=20).to_table()
print(f"Executing: Lance {time.time() - start}")

Support dictionary encoding

Problem

Dictionary encoding can help save storage space as well as accelerate filtering/scans for columns with small cardinality. Arrow has a Dictionary type, and Pandas has a Categorical type. Parquet has dictionary (plain and RLE) encoding as well.

Desired behavior.

  • Lance supports dictionary encoding. P0: plain dictionary, P1: RLE dictionary.
  • Design the physical layout.
  • Understand the implementation of DictionaryArray in Arrow: is the dictionary at the array level or the table level?

The desired result: we can write a pyarrow DictionaryArray via the Python API.
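For context, this is how a dictionary-encoded array looks on the pyarrow side (the on-disk encoding in Lance is the open design question, not this API):

import pyarrow as pa

labels = pa.array(["cat", "dog", "cat", "cat", "dog"]).dictionary_encode()
print(labels.type)        # dictionary<values=string, indices=int32, ...>
print(labels.dictionary)  # the unique values, stored once
print(labels.indices)     # small integer codes pointing into the dictionary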

Fixed size binary encoding

Problem Statement

Fixed size binary encoding is very useful when the user knows that each blob has the same size, e.g., storing prepared Tensors. It allows fast point queries without lookups on the offsets.

Additionally, since the data is aligned on disk/cloud storage, it allows a dataset loader to read a batch of tensors (either in BCHW or BCWH form) without transposing them in memory again.
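A small pyarrow sketch of the idea, assuming each stored tensor is a fixed 16-byte blob:

import numpy as np
import pyarrow as pa

# Four float32 values per tensor -> every blob is exactly 16 bytes.
tensors = [np.arange(4, dtype=np.float32).tobytes() for _ in range(3)]
arr = pa.array(tensors, type=pa.binary(16))
# Row i starts at byte offset i * 16, so point lookups need no offsets array,
# and a batch of rows can be read back as one aligned, contiguous block.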

add project description to pypi

basically just do something like this:

from setuptools import setup

# read the contents of your README file
from pathlib import Path
this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text()

setup(
    name='an_example_package',
    # other arguments omitted
    long_description=long_description,
    long_description_content_type='text/markdown'
)

Apache Arrow does not support FieldRef to list of structs

Problem Statement

Apache Arrow does not support field reference to a list<struct>

import lance
import duckdb

ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])

Error:

Traceback (most recent call last):
  File "/Users/lei/work/lance/./query.py", line 6, in <module>
    ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])
  File "pyarrow/_dataset.pyx", line 271, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2328, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2174, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(annotations.label) in id: int64
width: int64
height: int64
file_name: string
image: struct<data: binary>
annotations: list<item: struct<area: double, box: struct<xmax: double, xmin: double, ymax: double, ymin: double>, label: string, label_id: int64, segmentation: struct<height: int64, polygon: list<item: list<item: double>>, rle: list<item: int64>, type: int64, width: int64>, supercategory: string>>
__index_level_0__: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Expected Behavior

Using annotations.label should return values with type list<struct<label: str>>, a subset view of the original annotations list<struct>.
