hazyresearch / meerkat
Creative interactive views of any dataset.
License: Apache License 2.0
Currently, missing values are supported only in the context of merge
. There, the solution is rather slipshod: we convert the column to a ListColumn
that can store a mix of None and other types.
Replace this with something faster and more robust.
Consider: a wrapper column around a smaller column holding only the non-missing values.
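One way to sketch that wrapper idea (a hypothetical design, not an existing meerkat class): store the non-missing values densely, plus a boolean mask and a row-to-dense-position map:

```python
import numpy as np

class MaskedColumn:
    """Sketch of a wrapper column: dense storage for the non-missing
    values plus a boolean mask, instead of a mixed-type ListColumn.
    (Hypothetical design, not the actual meerkat API.)"""

    def __init__(self, values):
        self.mask = np.array([v is None for v in values])
        # store only the non-missing values densely
        self.data = np.array([v for v in values if v is not None])
        # map each row to its position in the dense array (-1 = missing)
        self.index = np.full(len(values), -1)
        self.index[~self.mask] = np.arange(len(self.data))

    def __getitem__(self, i):
        return None if self.mask[i] else self.data[self.index[i]]

    def __len__(self):
        return len(self.mask)

col = MaskedColumn([1.0, None, 3.0])
```

Because the dense array stays homogeneous, vectorized ops can run on `col.data` directly, which is the speed win over a ListColumn.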
Describe the bug
TensorColumn is meant to operate like a torch.Tensor, but certain naming conventions may conflict with tensor method names.
For example, we often want to reshape a tensor without copying the underlying data (e.g. tensor.view
). Should TensorColumn.view()
call AbstractColumn.view()
(as it does currently) or torch.Tensor.view()
, which takes different arguments? If the former, we should be explicit that view
is not supported.
It would be nice to be able to call
dp.lz.map(...)
instead of
dp.map(..., materialize=False)
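A minimal sketch of how such an `lz` accessor could work; `DataPanelStub` and `_LazyAccessor` are hypothetical stand-ins, not the real meerkat classes:

```python
class _LazyAccessor:
    """Hypothetical `lz` accessor: forwards method calls with
    materialize=False, so dp.lz.map(fn) == dp.map(fn, materialize=False)."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        method = getattr(self._obj, name)

        def wrapped(*args, **kwargs):
            kwargs.setdefault("materialize", False)
            return method(*args, **kwargs)

        return wrapped

class DataPanelStub:
    # minimal stand-in for DataPanel, just to show the accessor pattern
    def map(self, fn, materialize=True):
        return ("materialize", materialize)

    @property
    def lz(self):
        return _LazyAccessor(self)

dp = DataPanelStub()
```

The accessor pattern mirrors pandas' `df.loc` / `df.iloc` style, so any method taking materialize gets a lazy variant for free.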
Describe the bug
Importing meerkat requires an updated version of pandas
To Reproduce
Steps and code snippet that reproduce the behavior:
Expected behavior
I expected the import to function as normal.
System Information
Context:
My current pandas version is 1.1.5. Updating pandas to version 1.2.4 resolves this issue.
Getting the following ValueError when using EmbeddingColumn
faiss_index.add(embs)
File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/__init__.py", line 104, in replacement_add
self.add_c(n, swig_ptr(x))
File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/swigfaiss.py", line 6016, in swig_ptr
return _swigfaiss.swig_ptr(a)
ValueError: array is not C-contiguous
Steps and code snippet that reproduce the behavior:
# entity_dp is datapanel with "emb" as EmbeddingColumn
embs = entity_dp["emb"].numpy()
# THIS WAS MY FIX! - it broke without it
embs = np.ascontiguousarray(embs)
index = faiss.IndexFlatL2
faiss_index = index(embs.shape[1])
faiss_index.add(embs)
thanks @lorr1
When trying to shuffle batches by calling DataPanel.batch
with the shuffle=True
argument, it fails: shuffle cannot be set to True when the sampler is not None.
Include the ability to add prediction columns directly to testbenches for evaluating models.
Potential design:
prediction mixin --> ClassifierMixin [logits, probs, argmax -- moving between things]
Need to support more than classification (e.g. segmentation, text generation).
Linking the prediction to the task.
One caveat for ClassifierMixin is supporting multi-label problems (i.e. imageA can be both class1 and class2)
values()
returns only the columns that are visible, while all_values()
could return all columns.
In block/manager.py
line 188, there's a call to os.makedirs(block_dirs)
. If the folder already exists, this throws an error. I wasn't sure how you folks want to handle this. I generally think people will commonly save over existing folders, so maybe turn on the exist_ok flag?
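For reference, a minimal sketch of the exist_ok behavior (block_dirs here is a throwaway temp path, not meerkat's actual layout):

```python
import os
import tempfile

block_dirs = os.path.join(tempfile.mkdtemp(), "blocks")
os.makedirs(block_dirs)
# saving over the same folder again: exist_ok=True makes this a no-op
# instead of raising FileExistsError
os.makedirs(block_dirs, exist_ok=True)
```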
Describe the bug
To use an ImageColumn, ImageColumn.from_filepaths(...) has to be called; simply calling ImageColumn(...) does not work.
To Reproduce
The following code from the README gives an error and ImageColumn needs to be replaced with ImageColumn.from_filepaths to get the correct functionality.
from mosaic import DataPanel, ImageColumn
# Images are NOT read from disk at DataPanel creation...
dp = DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': ImageColumn(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
})
# ...only at this point is "fox.png" read from disk
dp["image"][0]
Add a sort
function that can be used to sort the DataPanel by values in a column.
dp = mk.DataPanel({'a': [1, 3, 2], 'b': ['a', 'c', 'b']})
dp.sort('a') # sorted view into the dp
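A sketch of how sort could work under the hood, assuming numpy-backed columns (plain dicts stand in for DataPanels here):

```python
import numpy as np

def sort_by(data: dict, by: str, ascending=True):
    """Sort every column by the values in one column, returning a view-like
    reordering. Hypothetical sketch, not the meerkat API."""
    order = np.argsort(np.asarray(data[by]), kind="stable")
    if not ascending:
        order = order[::-1]
    return {k: [v[i] for i in order] for k, v in data.items()}

dp = {'a': [1, 3, 2], 'b': ['a', 'c', 'b']}
sorted_dp = sort_by(dp, 'a')
```

Using a single argsort and applying the permutation to every column keeps all columns aligned, which is the key invariant a DataPanel sort must preserve.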
I installed the latest version of meerkat, v0.2.1, using pip install meerkat-ml
.
I got the following permission-denied error when importing meerkat
in Python. By default it creates a directory under /tmp
, but I have no write access to /tmp
.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/__init__.py", line 10, in <module>
initialize_logging()
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/logging/utils.py", line 26, in initialize_logging
os.makedirs(log_path, exist_ok=True)
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/2021_10_29/17_06_57'
A temporary workaround is to set export TMPDIR=<some-dir>
.
The BlockManager (see #104) introduces a need for more robust DataPanel
testing that tests DataPanels with a diverse set of columns. As we add more columns, we don't want to have to update the DataPanel
tests for each new column. Instead, we should specify a TestBed
for each column that plugs in to the DataPanel
tests.
Started this for NumpyArrayColumn
with #108
Since columns are stored separately, it should be possible to only load in a subset of the columns in a DataPanel
with something like:
DataPanel.read("path_to_datapanel", columns=["a", "b", "d"])
Currently, renaming the columns in a DataPanel
is cumbersome. Say, for example, we want to rename a column "ppl" to "people", this might look like:
dp["people"] = dp["ppl"]
dp.remove("ppl")
Further, if we wanted to do this out of place, we'd have to call an additional dp.view()
at the top.
In pandas, this can be done with a single line of code: df.rename(columns={"ppl": "people"}).
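A sketch of what a DataPanel.rename could do internally, assuming columns are held in a dict; the helper name is hypothetical:

```python
def rename(data: dict, mapping: dict) -> dict:
    """Remap column names out of place without copying the underlying
    column data (values are shared, only the dict keys change)."""
    return {mapping.get(k, k): v for k, v in data.items()}

dp = {"ppl": [1, 2, 3], "height": [1.6, 1.8, 1.7]}
renamed = rename(dp, {"ppl": "people"})
```

Because only the key mapping changes, this is cheap even for large columns, and the original dp is left untouched (the out-of-place behavior the issue asks for).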
If a single DataPanel contains a chain of LambdaColumns, like so
dp["a_b"] = LambdaColumn(dp["a"], fn)
dp["a_b_c"] = LambdaColumn(dp["a_b"], fn_2)
then indexing the DataPanel with dp[0]
will perform the materialization of dp["a_b"]
twice.
Ideally, the DataPanel should be aware of these dependencies and only materialize things once.
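A sketch of a caching LambdaColumn that would give this behavior (hypothetical; the real class does not cache):

```python
class LambdaColumn:
    """Sketch of a LambdaColumn that memoizes materialized values, so a
    chain a -> a_b -> a_b_c computes each intermediate only once.
    (Hypothetical design, not the actual meerkat class.)"""

    def __init__(self, source, fn):
        self.source, self.fn = source, fn
        self._cache = {}

    def __getitem__(self, i):
        if i not in self._cache:
            self._cache[i] = self.fn(self.source[i])
        return self._cache[i]

calls = []
def fn(x):
    calls.append(x)  # record every materialization
    return x + 1

a = [0, 10, 20]
a_b = LambdaColumn(a, fn)
a_b_c = LambdaColumn(a_b, fn)

_ = a_b_c[0]  # materializes a_b[0] exactly once along the way
_ = a_b[0]    # cached: fn is not called again
```

With per-row memoization, indexing dp[0] would touch each LambdaColumn in the chain once, regardless of how many downstream columns depend on it.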
Since meerkat is changing quite quickly, and folks are often working off the dev branch, it's hard to ensure that every DataPanel that gets saved is backwards compatible with future versions of Meerkat. Eventually, once Meerkat is stable and folks are working off of major and minor versions, we should support backwards compatibility. But for the time being, when everyone's working off of dev, how should we support them? I see two options:
(1) Save the meerkat commit with every DataPanel and column, and create a conversion script that allows for converting saved DataPanel's between commits.
(2) Try to support backwards compatibility in the read
and write
code directly.
I think (2) will be quite challenging because we rely on Pickle, which runs into issues when classes and names change. One approach is to offer an "approximate" read that skips all data in pickles and raises appropriate warnings. This seems not ideal though and adds a lot of mess to the code.
I think my preference is for (1), but curious to hear other thoughts.
Meerkat imposes a ListColumn
index on every DataPanel
. In many cases, this is the slowest column in the dp and it bottlenecks performance, since all the other columns are backed by numpy, pandas, or torch.
We should do away with the index column in favor of something else...
It's also worth thinking about what purpose the "index" column serves in the DataPanel. It's not clear to me that it's providing anything important (though I may be missing something). It seems that its main purpose is to provide some sort of unique "id" for every row in the DataPanel, but I don't think this is something we should impose on the DataPanel. That being said, I see the appeal of being able to specify some "id" columns that have special properties (e.g. always get carried from one dp to the next when indexing).
I propose doing away with this single "index" column design in favor of a new design inspired by the idea of indexes in database management systems:
Allow any column of a DataPanel
to be designated as an index. For example:
mimic_dp = ...
mimic_dp.set_index(column="dicom_id")
row = mimic_dp.idx["dicom_id", "id_e324198"]
sub_dp = mimic_dp.idx["dicom_id", ["id_e324198", "id_e1236493"]]
def idx(self, index_name, index):
    return self[self[index_name] == index]
but columns can override this with faster implementations, e.g. based on a pandas.Index
object which is backed by a Cython dict, so provides O(1) lookups (https://pandas.pydata.org/pandas-docs/stable/development/internals.html).
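A plain-dict sketch of the O(1) lookup such an index provides (the real pandas.Index uses a Cython-backed hash table, but the access pattern is the same):

```python
class ColumnIndex:
    """Hypothetical sketch: a hash-backed index over a column's values,
    mapping each value to its row position for O(1) lookups."""

    def __init__(self, values):
        self._lookup = {v: i for i, v in enumerate(values)}

    def __getitem__(self, key):
        return self._lookup[key]

dicom_ids = ["id_e324198", "id_e1236493", "id_e555"]
index = ColumnIndex(dicom_ids)
row = index["id_e1236493"]
```

The generic fallback above is an O(n) boolean-mask scan; a column that maintains a structure like this can answer the same query in constant time.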
Pandas has an options module that lets the user specify settings for pandas. It would be great to have something similar in meerkat. This would allow users to customize their preferred default behavior (e.g. do strings go into list column or pandas column by default?)
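A minimal sketch of such an options registry, modeled on pandas.set_option/get_option; the option names here are hypothetical:

```python
# module-level registry of known options and their current values
_options = {"columns.default_string_column": "pandas"}

def set_option(key, value):
    """Set a registered option; unknown keys are rejected to catch typos."""
    if key not in _options:
        raise KeyError(f"unknown option: {key}")
    _options[key] = value

def get_option(key):
    return _options[key]

set_option("columns.default_string_column", "list")
```

Validating keys at set time (rather than silently accepting anything) mirrors pandas' behavior and catches misspelled option names early.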
Appending to a DataPanel along columns does not work without the suffixes
argument, even when the column names do not overlap.
dp = ms.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})
dp2 = ms.DataPanel({
    'string': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'target': [0, 1, 0]
})
dp.append(dp2, axis=1)
This code throws a ValueError
. It works when I provide any suffixes, even though they are not used.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-5f32282aa054> in <module>()
----> 1 dp.append(dp2, axis=1)
1 frames
/usr/local/lib/python3.7/dist-packages/mosaic/datapanel.py in append(self, dp, axis, suffixes, overwrite)
422 if not overwrite and shared:
423 if suffixes is None:
--> 424 raise ValueError()
425 left_suf, right_suf = suffixes
426 data = {
ValueError:
Using version 0.2.0.
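The intended behavior could look like this sketch, where suffixes are only required when column names actually overlap (plain dicts stand in for DataPanels; names are hypothetical):

```python
def append_columns(left: dict, right: dict, suffixes=None):
    """Sketch of the fix: only demand suffixes when column names overlap."""
    shared = set(left) & set(right)
    if shared and suffixes is None:
        raise ValueError(f"overlapping columns {shared} need suffixes")
    out = {}
    for k, v in left.items():
        out[(k + suffixes[0]) if k in shared else k] = v
    for k, v in right.items():
        out[(k + suffixes[1]) if k in shared else k] = v
    return out

# no overlap => no suffixes needed, unlike the current behavior
merged = append_columns(
    {"text": ["a"], "label": [0]},
    {"string": ["b"], "target": [1]},
)
```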
import meerkat as mk
import meerkat.nn
dp['emb_col'] = mk.nn.EmbeddingColumn(arr)  # arr is some 2-d np array
If I save the DataPanel and then load it back, I get this error. This was working previously, but I updated the meerkat package this week:
ConstructorError: while constructing a Python object
module 'meerkat.nn.embedding_column' is not imported
It is very natural for users (and developers) to construct new DataPanel objects from existing ones via DataPanel(dp)
.
A tricky consequence of supporting this is finding a good way to stratify which attributes should be recomputed and which should simply be shallow-copied over.
As an example, two attributes that every DataPanel has is _data
and _identifier
. _data
is typically large and heavy-weight, so we will almost always want to shallow copy it. _identifier
is quite lightweight and may be unique to different DataPanels, so maybe this is a property we recompute each time in __init__
. Note this is just an example, we may want the identifier to persist.
This is especially relevant for subclassing DataPanel
. As of PR #57, self.from_batch()
is used to construct new DataPanel containers from existing ones with shared underlying data. However, as the PR mentions, self.from_batch()
is called by many other ops (_get
, merge
, concat
, etc.), and none of these methods have a seamless way of passing arguments other than data
to __init__
.
An example of this is EntityDataPanel, where the index_column should be passed from the current instance to the newly constructed instance. Because there is no way to plumb that information through different calls, the initializer of EntityDataPanel gets called with EntityDataPanel(index_column=None) even if the current instance has an index column. This results in a new column "_ent_index" being added to the new EntityDataPanel.
Implement a private instance method _clone(data=None, visible_columns=None, ...) -> DataPanel/subclass
that implements the default logic for constructing a new DataPanel, plumbing the relevant arguments from the current instance to the new one. We can then call self._clone(data=data)
(optionally passing visible_columns) instead of self.from_batch()
in ops like _get
, merge
, concat
, etc.
Let's consider the EntityDataPanel
case. We want to plumb self.index_column
from a current EntityDataPanel
to all EntityDataPanel
s constructed in its image. ._clone
will look something like
class EntityDataPanel:
    def _clone(self, data=None) -> EntityDataPanel:
        if data is None:
            data = self.data
        return EntityDataPanel(
            data, identifier=self.identifier, index_column=self.index_column
        )
We can then have ops like DataPanel._get()
for example use self._clone()
instead of self.from_batch()
. For example
class DataPanel:
    def _get(self, idx, materialize=False):
        ...
        # example case where `index` returns a DataPanel
        elif isinstance(index, slice):
            # slice index => multiple row selection (DataPanel)
            # return self.from_batch({
            #     k: self._data[k]._get(index, materialize=materialize)
            #     for k in self.visible_columns
            # })
            return self._clone({
                k: self._data[k]._get(index, materialize=materialize)
                for k in self.visible_columns
            })
        ...
Instead of having developers reimplement ._clone()
, we can have them implement something like _state_keys()
but for init args. Something like ._clone_kwargs()
:
class EntityDataPanel:
    def _clone_kwargs(self) -> dict:
        default_kwargs = super()._clone_kwargs()
        default_kwargs.update({"index_column": self.index_column})
        return default_kwargs

class DataPanel:
    def _clone_kwargs(self) -> dict:
        return {"data": self.data, "identifier": self.identifier}

    def _clone(self, **kwargs):
        default_kwargs = self._clone_kwargs()
        if kwargs:
            default_kwargs.update(kwargs)
        return self.__class__(**default_kwargs)
Visualizing a DataPanel
in a Jupyter notebook can be frustratingly slow because we are currently: (1) converting the DataPanel to a special "visual" Pandas DataFrame (via _repr_pandas_
) and then (2) visualizing using Pandas visualization out of the box. Step 1 can be very slow for large dps.
Make our own HTML visualization module that circumvents the conversion to a DataFrame. We should borrow heavily from/plug into https://github.com/pandas-dev/pandas/blob/master/pandas/io/html.py
Test coverage for memmap on TensorColumn
and NumpyArrayColumn
is lacking.
Is your feature request related to a problem? Please describe.
When indexing a subset of columns in a DataPanel, we are currently always returning a view of the DataPanel with visible rows set. We should only be using visible columns when materialize is False, otherwise we should clone a new DataPanel and only have it contain a subset of columns. See
Describe the solution you'd like
if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")
    dp = self.view()
    dp.visible_columns = index
    return dp
Should become
if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")
    if materialize:
        dp = self._clone(data={k: self._data[k] for k in index})
    else:
        dp = self.view()
        dp.visible_columns = index
    return dp
Currently, users have to apply a partial
to the function they pass in. We can instead forward the kwargs
of each of these methods to the (multi-argument) function to avoid this.
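A sketch of the kwargs forwarding, with a plain list standing in for a column:

```python
def map_column(values, fn, **kwargs):
    """Forward extra **kwargs to the user function, so
    map_column(vals, fn, y=2) replaces map_column(vals, partial(fn, y=2))."""
    return [fn(v, **kwargs) for v in values]

def add(x, y=0):
    return x + y

out = map_column([1, 2, 3], add, y=10)
```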
Could have signature matching np.load
and look something like this:
@classmethod
def from_npy(
    cls, path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding="ASCII"
):
    data = np.load(
        path,
        mmap_mode=mmap_mode,
        allow_pickle=allow_pickle,
        fix_imports=fix_imports,
        encoding=encoding,
    )
    return cls(data)
I'm envisioning something in between a map
and a LambdaColumn
, where the computation happens lazily but is cached once it's computed. Right now, either you do it all up front or you don't get caching.
This idea was raised by @ANarayan, who pointed out that it would be helpful for caching feature preprocessing in NLP pipelines.
## Data from test_prediction_column.py
logits = torch.as_tensor(
    [
        [-100, -2, -50, 0, 1],
        [0, 3, -1, 5, 4],
        [100, 0, 0, -1, 5],
        [-100, -2, -50, 0, 1],
    ]
).type(torch.float32)
expected_preds = torch.as_tensor([4, 3, 0, 4])

logit_col = ClassificationOutputColumn(logits=logits)
probs_col = ClassificationOutputColumn(probs=logit_col.probabilities().data)
preds_col = ClassificationOutputColumn(preds=logit_col.predictions().data)

print(logit_col.num_classes, probs_col.num_classes, preds_col.num_classes)
The output is 5 5 4
.
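The likely culprit: num_classes can be read off the trailing dimension of logits or probs, but predictions alone cannot determine it. A sketch of safer inference (hypothetical helper, numpy in place of torch):

```python
import numpy as np

def infer_num_classes(logits=None, probs=None, preds=None, num_classes=None):
    """Infer the class count from the trailing dim of logits/probs; when
    only preds are given, require an explicit num_classes instead of
    guessing from the predictions' shape or values."""
    if num_classes is not None:
        return num_classes
    if logits is not None:
        return np.asarray(logits).shape[-1]
    if probs is not None:
        return np.asarray(probs).shape[-1]
    raise ValueError("num_classes must be given explicitly when only preds are provided")

logits = np.zeros((4, 5))  # 4 examples, 5 classes, as in the repro above
```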
When I create a datapanel, and run dp.head()
in a jupyter notebook, the values of the entries are completely displayed instead of a truncated version.
On the Domino repo, gdro branch, run the following in a jupyter notebook (on most recent dev branch of meerkat):
df = build_cxr_df.out(load=True)
dp = get_dp(df)
dp.head()
Add a filepath
property that points to .data in ImageCell
Describe the bug
Error: TypeError: 'numpy.ndarray' object is not callable
To Reproduce
from meerkat.contrib.imagenette import download_imagenette
BASE_DIR = "./datasets"
dataset_dir = download_imagenette(BASE_DIR)
dp = mk.DataPanel.from_csv(os.path.join(dataset_dir, "imagenette.csv"))
dp["img"] = mk.ImageColumn.from_filepaths(filepaths=dp["img_path"])
dp.head()
Run Code
Error: TypeError: 'numpy.ndarray' object is not callable
Expected behavior
I expected a new ImageColumn called "img" to be created.
System Information
MacOS, Linux
pandas version 1.2.4
Can probably use the RG identifier class for this
This would be helpful for outside tools that want to support all of the datasets in the meerkat contrib library
When running a map over a DataPanel
with multiple workers, I get this. This may be because we're creating a log directory for every batch. Consider changing this.
Traceback (most recent call last):
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
self._accessor.mkdir(self, mode)
FileExistsError: [Errno 17] File exists: '/root/mosaic/RGDataset'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sabri/code/terra/terra/__init__.py", line 230, in _run
out = self.fn(**args_dict)
File "/home/sabri/code/spr-21/notebooks/05-23_cxr_forager.py", line 23, in convert_cxr_to_png
paths = dp.map(
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 863, in map
return super().map(
File "/home/sabri/code/mosaic/mosaic/mixins/mapping.py", line 58, in map
for i, batch in tqdm(
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 702, in batch
yield DataPanel.from_batch({**cell_batch._data, **batch_batch._data})
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 550, in from_batch
return cls(batch, identifier=identifier)
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 123, in __init__
self._create_logdir()
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 384, in _create_logdir
self.logdir.mkdir(parents=True, exist_ok=True)
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
self._accessor.mkdir(self, mode)
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7921) is killed by signal: Killed.
When dp.map
is passed a function that returns a dict, the returned DataPanel
should use the same index
(i.e. example ids) as the original DataPanel
. Currently, it attaches a fresh index
column that may not match the index column from dp
.
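A minimal sketch of the desired behavior, with plain lists standing in for the index and the mapped column:

```python
def map_with_index(index, values, fn):
    """Sketch: a map that carries the original index column over to the
    result instead of minting a fresh 0..n-1 index. (Hypothetical helper,
    not the meerkat API.)"""
    return {"index": list(index), "out": [fn(v) for v in values]}

result = map_with_index(["r2", "r0", "r1"], [1, 2, 3], lambda x: x * 2)
```

Preserving the index keeps the mapped output joinable back to the original dp rows, which matters whenever rows were previously shuffled or filtered.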
When we write LambdaColumn
we also write the column or DataPanel underlying it. When writing a DataPanel with a LambdaColumn
dependent on other columns in the DataPanel
this can lead to the same columns being written to disk multiple times.
Currently conflicts with torch.nn
Right now map relies on columns specifying writer classes like the TorchWriter below for TensorColumn
class TorchWriter(AbstractWriter):
    def __init__(self, *args, **kwargs):
        super(TorchWriter, self).__init__(*args, **kwargs)

    def open(self) -> None:
        self.outputs = []

    def write(self, data, **kwargs) -> None:
        self.outputs.extend(data)

    def flush(self, *args, **kwargs):
        return torch.stack(self.outputs)

    def close(self, *args, **kwargs):
        pass

    def finalize(self, *args, **kwargs) -> None:
        pass
Except in the memmap case, these writers are basically just doing a concat, so they can be consolidated into one writer class based on concat.
Additionally, we need concat to preserve the subclass type, which we can accomplish by converting the static method into an instance method.
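A sketch of the consolidated writer; concat_fn here stands in for the instance-level concat described above (hypothetical design):

```python
class ConcatWriter:
    """Buffer batches, then let the output column's concat produce the
    final column. One class replaces the per-column writer zoo."""

    def __init__(self, concat_fn):
        self.concat_fn = concat_fn
        self.outputs = []

    def write(self, batch):
        self.outputs.append(batch)

    def flush(self):
        # delegating to concat_fn is what lets subclass types survive
        return self.concat_fn(self.outputs)

# e.g. for list-backed columns, concat is just flattening the batches
writer = ConcatWriter(lambda batches: [x for b in batches for x in b])
writer.write([1, 2])
writer.write([3])
result = writer.flush()
```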
The add_column
method, when used with overwrite=True
, produces a duplicate column.
Code:
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})
dp.add_column("label", [1, 2, 3], overwrite=True)
dp.columns
gives ['text', 'label', 'index', 'label']
. On printing this DataPanel
, both these label
columns have the new data.
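The intended semantics could look like this sketch, where a plain dict stands in for the DataPanel's column store and assignment replaces rather than appends:

```python
def add_column(data: dict, name: str, values, overwrite=False):
    """Sketch of the intended overwrite semantics: replace the existing
    column in place rather than appending a duplicate entry."""
    if name in data and not overwrite:
        raise ValueError(f"column {name!r} exists; pass overwrite=True")
    data[name] = values  # dict assignment replaces, never duplicates
    return data

dp = {"text": ["a", "b", "c"], "label": [0, 1, 0]}
add_column(dp, "label", [1, 2, 3], overwrite=True)
```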
Add a list_datasets() function to the registry that returns a list of dataset names, so that people using the registry from an outside library can see what's available programmatically.
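A minimal sketch of such a registry helper; the registry internals and dataset names here are hypothetical:

```python
# module-level mapping from dataset name to its builder function
_DATASETS = {}

def register_dataset(name, builder):
    _DATASETS[name] = builder

def list_datasets():
    """Return the registered dataset names, sorted for stable output."""
    return sorted(_DATASETS)

register_dataset("imagenette", lambda: "...")
register_dataset("mimic", lambda: "...")
```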