hazyresearch / meerkat
Creative interactive views of any dataset.
License: Apache License 2.0
Currently, missing values are supported only in the context of merge
. There, the solution is rather slipshod: we convert the column to a ListColumn
that can store a mix of None and other types.
Replace this with something faster and more robust.
Consider: a wrapper column around a smaller column holding only the non-missing values.
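One way to sketch that wrapper idea (a hypothetical design, not an existing meerkat class): store the non-missing values densely, plus a boolean mask and a row-to-dense-position map:

```python
import numpy as np

class MaskedColumn:
    """Sketch of a wrapper column: dense storage for the non-missing
    values plus a boolean mask, instead of a mixed-type ListColumn.
    (Hypothetical design, not the actual meerkat API.)"""

    def __init__(self, values):
        self.mask = np.array([v is None for v in values])
        # store only the non-missing values densely
        self.data = np.array([v for v in values if v is not None])
        # map each row to its position in the dense array (-1 = missing)
        self.index = np.full(len(values), -1)
        self.index[~self.mask] = np.arange(len(self.data))

    def __getitem__(self, i):
        return None if self.mask[i] else self.data[self.index[i]]

    def __len__(self):
        return len(self.mask)

col = MaskedColumn([1.0, None, 3.0])
```

Because the dense array stays homogeneous, vectorized ops can run on `col.data` directly, which is the speed win over a ListColumn.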
Describe the bug
TensorColumn is meant to operate like a torch.Tensor, but certain naming conventions may conflict with tensor method names.
For example, we often want to reshape a tensor without copying the underlying data (e.g. tensor.view
). Should TensorColumn.view()
call AbstractColumn.view()
(as it does currently) or torch.Tensor.view()
, which takes different arguments? If the former, we should be explicit that view
is not supported.
It would be nice to be able to call
dp.lz.map(...)
instead of
dp.map(..., materialize=False)
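A minimal sketch of how such an `lz` accessor could work; `DataPanelStub` and `_LazyAccessor` are hypothetical stand-ins, not the real meerkat classes:

```python
class _LazyAccessor:
    """Hypothetical `lz` accessor: forwards method calls with
    materialize=False, so dp.lz.map(fn) == dp.map(fn, materialize=False)."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        method = getattr(self._obj, name)

        def wrapped(*args, **kwargs):
            kwargs.setdefault("materialize", False)
            return method(*args, **kwargs)

        return wrapped

class DataPanelStub:
    # minimal stand-in for DataPanel, just to show the accessor pattern
    def map(self, fn, materialize=True):
        return ("materialize", materialize)

    @property
    def lz(self):
        return _LazyAccessor(self)

dp = DataPanelStub()
```

The accessor pattern mirrors pandas' `df.loc` / `df.iloc` style, so any method taking materialize gets a lazy variant for free.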
Describe the bug
Importing meerkat requires an updated version of pandas
To Reproduce
Steps and code snippet that reproduce the behavior:
Expected behavior
I expected the import to function as normal.
System Information
Context:
My current pandas version is 1.1.5. Updating pandas to version 1.2.4 resolves this issue.
Getting the following ValueError when using EmbeddingColumn
faiss_index.add(embs)
File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/__init__.py", line 104, in replacement_add
self.add_c(n, swig_ptr(x))
File "/dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/faiss/swigfaiss.py", line 6016, in swig_ptr
return _swigfaiss.swig_ptr(a)
ValueError: array is not C-contiguous
Steps and code snippet that reproduce the behavior:
# entity_dp is datapanel with "emb" as EmbeddingColumn
embs = entity_dp["emb"].numpy()
# THIS WAS MY FIX! - it broke without it
embs = np.ascontiguousarray(embs)
index = faiss.IndexFlatL2
faiss_index = index(embs.shape[1])
faiss_index.add(embs)
thanks @lorr1
When trying to shuffle batches by calling DataPanel.batch
with the shuffle=True
argument, it fails: shuffle cannot be set to True when the sampler is not None.
Include the ability to add prediction columns directly to testbenches for evaluating models.
Potential design:
prediction mixin --> ClassifierMixin [logits, probs, argmax -- moving between things]
Need to support more than classification (e.g. segmentation, text generation).
Linking the prediction to the task.
One caveat for ClassifierMixin is supporting multi-label problems (i.e. imageA can be both class1 and class2)
values()
returns only the columns that are visible, while all_values()
could return all columns.
In block/manager.py
line 188, there's a call to os.makedirs(block_dirs)
. If the folder already exists, this throws an error. I wasn't sure how you folks want to handle this. I generally think people will commonly save over existing folders, so maybe turn on the exist_ok flag?
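For reference, a minimal sketch of the exist_ok behavior (block_dirs here is a throwaway temp path, not meerkat's actual layout):

```python
import os
import tempfile

block_dirs = os.path.join(tempfile.mkdtemp(), "blocks")
os.makedirs(block_dirs)
# saving over the same folder again: exist_ok=True makes this a no-op
# instead of raising FileExistsError
os.makedirs(block_dirs, exist_ok=True)
```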
Describe the bug
To use an ImageColumn, ImageColumn.from_filepaths(...) has to be called; simply calling ImageColumn(...) does not work.
To Reproduce
The following code from the README gives an error and ImageColumn needs to be replaced with ImageColumn.from_filepaths to get the correct functionality.
from mosaic import DataPanel, ImageColumn
# Images are NOT read from disk at DataPanel creation...
dp = DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': ImageColumn(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
})
# ...only at this point is "fox.png" read from disk
dp["image"][0]
Add a sort
function that can be used to sort the DataPanel by values in a column.
dp = mk.DataPanel({'a': [1, 3, 2], 'b': ['a', 'c', 'b']})
dp.sort('a') # sorted view into the dp
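A sketch of how sort could work under the hood, assuming numpy-backed columns (plain dicts stand in for DataPanels here):

```python
import numpy as np

def sort_by(data: dict, by: str, ascending=True):
    """Sort every column by the values in one column, returning a view-like
    reordering. Hypothetical sketch, not the meerkat API."""
    order = np.argsort(np.asarray(data[by]), kind="stable")
    if not ascending:
        order = order[::-1]
    return {k: [v[i] for i in order] for k, v in data.items()}

dp = {'a': [1, 3, 2], 'b': ['a', 'c', 'b']}
sorted_dp = sort_by(dp, 'a')
```

Using a single argsort and applying the permutation to every column keeps all columns aligned, which is the key invariant a DataPanel sort must preserve.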
I installed the latest version of meerkat, v0.2.1, using pip install meerkat-ml
.
I got the following permission-denied error when importing meerkat
in Python. By default it creates a directory under /tmp
, but I have no write access to /tmp
.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/__init__.py", line 10, in <module>
initialize_logging()
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/site-packages/meerkat/logging/utils.py", line 26, in initialize_logging
os.makedirs(log_path, exist_ok=True)
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/siyitang/anaconda3/envs/s3/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/2021_10_29/17_06_57'
A temporary workaround is to set export TMPDIR=<some-dir>
.
The BlockManager (see #104) introduces a need for more robust DataPanel
testing that tests DataPanels with a diverse set of columns. As we add more columns, we don't want to have to update the DataPanel
tests for each new column. Instead, we should specify a TestBed
for each column that plugs in to the DataPanel
tests.
Started this for NumpyArrayColumn
with #108
Since columns are stored separately, it should be possible to only load in a subset of the columns in a DataPanel
with something like:
DataPanel.read("path_to_datapanel", columns=["a", "b", "d"])
Currently, renaming the columns in a DataPanel
is cumbersome. Say, for example, we want to rename a column "ppl" to "people", this might look like:
dp["people"] = dp["ppl"]
dp.remove("ppl")
Further, if we wanted to do this out of place, we'd have to call an additional dp.view()
at the top.
In pandas, this can be done with a single line of code: df.rename(columns={"ppl": "people"}).
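A sketch of what a DataPanel.rename could do internally, assuming columns are held in a dict; the helper name is hypothetical:

```python
def rename(data: dict, mapping: dict) -> dict:
    """Remap column names out of place without copying the underlying
    column data (values are shared, only the dict keys change)."""
    return {mapping.get(k, k): v for k, v in data.items()}

dp = {"ppl": [1, 2, 3], "height": [1.6, 1.8, 1.7]}
renamed = rename(dp, {"ppl": "people"})
```

Because only the key mapping changes, this is cheap even for large columns, and the original dp is left untouched (the out-of-place behavior the issue asks for).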
If a single DataPanel contains a chain of LambdaColumns, like so
dp["a_b"] = LambdaColumn(dp["a"], fn)
dp["a_b_c"] = LambdaColumn(dp["a_b"], fn_2)
then indexing the DataPanel with dp[0]
will perform the materialization of dp["a_b"]
twice.
Ideally, the DataPanel should be aware of these dependencies and only materialize things once.
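A sketch of a caching LambdaColumn that would give this behavior (hypothetical; the real class does not cache):

```python
class LambdaColumn:
    """Sketch of a LambdaColumn that memoizes materialized values, so a
    chain a -> a_b -> a_b_c computes each intermediate only once.
    (Hypothetical design, not the actual meerkat class.)"""

    def __init__(self, source, fn):
        self.source, self.fn = source, fn
        self._cache = {}

    def __getitem__(self, i):
        if i not in self._cache:
            self._cache[i] = self.fn(self.source[i])
        return self._cache[i]

calls = []
def fn(x):
    calls.append(x)  # record every materialization
    return x + 1

a = [0, 10, 20]
a_b = LambdaColumn(a, fn)
a_b_c = LambdaColumn(a_b, fn)

_ = a_b_c[0]  # materializes a_b[0] exactly once along the way
_ = a_b[0]    # cached: fn is not called again
```

With per-row memoization, indexing dp[0] would touch each LambdaColumn in the chain once, regardless of how many downstream columns depend on it.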
Since meerkat is changing quite quickly, and folks are often working off the dev branch, it's hard to ensure that every DataPanel that gets saved is backwards compatible with future versions of Meerkat. Eventually, once Meerkat is stable and folks are working off of major and minor versions, we should support backwards compatibility. But for the time being, when everyone's working off of dev, how should we support them? I see two options:
(1) Save the meerkat commit with every DataPanel and column, and create a conversion script that allows for converting saved DataPanel's between commits.
(2) Try to support backwards compatibility in the read
and write
code directly.
I think (2) will be quite challenging because we rely on Pickle, which runs into issues when classes and names change. One approach is to offer an "approximate" read that skips all data in pickles and raises appropriate warnings. This seems not ideal though and adds a lot of mess to the code.
I think my preference is for (1), but curious to hear other thoughts.
Meerkat imposes a ListColumn
index on every DataPanel
. In many cases, this is the slowest column in the dp and it bottlenecks performance, since all the other columns are backed by numpy, pandas, or torch.
We should do away with the index column in favor of something else...
It's also worth thinking about what purpose the "index" column serves in the DataPanel. It's not clear to me that it's providing anything important (though I may be missing something). It seems that its main purpose is to provide some sort of unique "id" for every row in the DataPanel, but I don't think this is something we should impose on the DataPanel. That being said, I see the appeal of being able to specify some "id" columns that have special properties (e.g. always get carried from one dp to the next when indexing).
I propose doing away with this single "index" column design in favor of a new design inspired by the idea of indexes in database management systems:
Allow any column of a DataPanel
to be designated as an index. For example:
mimic_dp = ...
mimic_dp.set_index(column="dicom_id")
row = mimic_dp.idx["dicom_id", "id_e324198"]
sub_dp = mimic_dp.idx["dicom_id", ["id_e324198", "id_e1236493"]]
def idx(self, index_name, index):
    return self[self[index_name] == index]
but columns can override this with faster implementations, e.g. based on a pandas.Index
object which is backed by a Cython dict, so provides O(1) lookups (https://pandas.pydata.org/pandas-docs/stable/development/internals.html).
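A plain-dict sketch of the O(1) lookup such an index provides (the real pandas.Index uses a Cython-backed hash table, but the access pattern is the same):

```python
class ColumnIndex:
    """Hypothetical sketch: a hash-backed index over a column's values,
    mapping each value to its row position for O(1) lookups."""

    def __init__(self, values):
        self._lookup = {v: i for i, v in enumerate(values)}

    def __getitem__(self, key):
        return self._lookup[key]

dicom_ids = ["id_e324198", "id_e1236493", "id_e555"]
index = ColumnIndex(dicom_ids)
row = index["id_e1236493"]
```

The generic fallback above is an O(n) boolean-mask scan; a column that maintains a structure like this can answer the same query in constant time.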
Pandas has an options module that lets the user specify settings for pandas. It would be great to have something similar in meerkat. This would allow users to customize their preferred default behavior (e.g. do strings go into list column or pandas column by default?)
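A minimal sketch of such an options registry, modeled on pandas.set_option/get_option; the option names here are hypothetical:

```python
# module-level registry of known options and their current values
_options = {"columns.default_string_column": "pandas"}

def set_option(key, value):
    """Set a registered option; unknown keys are rejected to catch typos."""
    if key not in _options:
        raise KeyError(f"unknown option: {key}")
    _options[key] = value

def get_option(key):
    return _options[key]

set_option("columns.default_string_column", "list")
```

Validating keys at set time (rather than silently accepting anything) mirrors pandas' behavior and catches misspelled option names early.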
Appending to a DataPanel along columns does not work without the suffixes
argument, even when the column names do not overlap.
dp = ms.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})
dp2 = ms.DataPanel({
    'string': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'target': [0, 1, 0]
})
dp.append(dp2, axis=1)
This code throws a ValueError
. It works when I provide any suffixes, even though they are not used.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-5f32282aa054> in <module>()
----> 1 dp.append(dp2, axis=1)
1 frames
/usr/local/lib/python3.7/dist-packages/mosaic/datapanel.py in append(self, dp, axis, suffixes, overwrite)
422 if not overwrite and shared:
423 if suffixes is None:
--> 424 raise ValueError()
425 left_suf, right_suf = suffixes
426 data = {
ValueError:
Using version 0.2.0.
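The intended behavior could look like this sketch, where suffixes are only required when column names actually overlap (plain dicts stand in for DataPanels; names are hypothetical):

```python
def append_columns(left: dict, right: dict, suffixes=None):
    """Sketch of the fix: only demand suffixes when column names overlap."""
    shared = set(left) & set(right)
    if shared and suffixes is None:
        raise ValueError(f"overlapping columns {shared} need suffixes")
    out = {}
    for k, v in left.items():
        out[(k + suffixes[0]) if k in shared else k] = v
    for k, v in right.items():
        out[(k + suffixes[1]) if k in shared else k] = v
    return out

# no overlap => no suffixes needed, unlike the current behavior
merged = append_columns(
    {"text": ["a"], "label": [0]},
    {"string": ["b"], "target": [1]},
)
```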
import meerkat as mk
import meerkat.nn
dp['emb_col'] = mk.nn.EmbeddingColumn(arr)  # arr is some 2-d np array
If I save the DataPanel and then load it back, I get this error. This was working previously, but I updated the meerkat package this week:
ConstructorError: while constructing a Python object
module 'meerkat.nn.embedding_column' is not imported
It is very natural for users (and developers) to construct new DataPanel objects from existing ones via DataPanel(dp)
.
A tricky consequence of supporting this is finding a good way to stratify which attributes should be recomputed and which should simply be shallow-copied over.
As an example, two attributes that every DataPanel has is _data
and _identifier
. _data
is typically large and heavy-weight, so we will almost always want to shallow copy it. _identifier
is quite lightweight and may be unique to different DataPanels, so maybe this is a property we recompute each time in __init__
. Note this is just an example, we may want the identifier to persist.
This is especially relevant for subclassing DataPanel
. As of PR #57, self.from_batch()
is used to construct new DataPanel containers from existing ones with shared underlying data. However, as the PR mentions, self.from_batch()
is called by many other ops (_get
, merge
, concat
, etc.), and none of these methods have a seamless way of passing arguments other than data
to __init__
.
An example of this is EntityDataPanel, where the index_column should be passed from the current instance to the newly constructed instance. Because there is no way to plumb that information through different calls, the initializer of EntityDataPanel gets called with EntityDataPanel(index_column=None) even if the current instance has an index column. This results in a new column "_ent_index" being added to the new EntityDataPanel.
Implement a private instance method _clone(data=None, visible_columns=None, ...) -> DataPanel/subclass
that implements the default logic for constructing a new DataPanel, plumbing the relevant arguments from the current instance to the new one. We can then call self._clone(data=data)
(optionally passing visible_columns) instead of self.from_batch()
in ops like _get
, merge
, concat
, etc.
Let's consider the EntityDataPanel
case. We want to plumb self.index_column
from a current EntityDataPanel
to all EntityDataPanel
s constructed in its image. ._clone
will look something like
class EntityDataPanel:
    def _clone(self, data=None) -> EntityDataPanel:
        if data is None:
            data = self.data
        return EntityDataPanel(
            data, identifier=self.identifier, index_column=self.index_column
        )
We can then have ops like DataPanel._get()
for example use self._clone()
instead of self.from_batch()
. For example
class DataPanel:
    def _get(self, idx, materialize=False):
        ...
        # example case where `index` returns a DataPanel
        elif isinstance(index, slice):
            # slice index => multiple row selection (DataPanel)
            # return self.from_batch({
            #     k: self._data[k]._get(index, materialize=materialize)
            #     for k in self.visible_columns
            # })
            return self._clone({
                k: self._data[k]._get(index, materialize=materialize)
                for k in self.visible_columns
            })
        ...
Instead of having developers reimplement ._clone()
, we can have them implement something like _state_keys()
but for init args. Something like ._clone_kwargs()
:
class EntityDataPanel:
    def _clone_kwargs(self) -> dict:
        default_kwargs = super()._clone_kwargs()
        default_kwargs.update({"index_column": self.index_column})
        return default_kwargs

class DataPanel:
    def _clone_kwargs(self) -> dict:
        return {"data": self.data, "identifier": self.identifier}

    def _clone(self, **kwargs):
        default_kwargs = self._clone_kwargs()
        if kwargs:
            default_kwargs.update(kwargs)
        return self.__class__(**default_kwargs)
Visualizing a DataPanel
in a Jupyter notebook can be frustratingly slow because we are currently: (1) converting the DataPanel to a special "visual" Pandas DataFrame (via _repr_pandas_
) and then (2) visualizing using Pandas visualization out of the box. Step 1 can be very slow for large dps.
Make our own HTML visualization module that circumvents the conversion to a DataFrame. We should borrow heavily from/plug into https://github.com/pandas-dev/pandas/blob/master/pandas/io/html.py
Test coverage for memmap on TensorColumn
and NumpyArrayColumn
is lacking.
Is your feature request related to a problem? Please describe.
When indexing a subset of columns in a DataPanel, we are currently always returning a view of the DataPanel with visible rows set. We should only be using visible columns when materialize is False, otherwise we should clone a new DataPanel and only have it contain a subset of columns. See
Describe the solution you'd like
if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")
    dp = self.view()
    dp.visible_columns = index
    return dp
Should become
if isinstance(index[0], str):
    if not set(index).issubset(self.visible_columns):
        missing_cols = set(index) - set(self.visible_columns)
        raise ValueError(f"DataPanel does not have columns {missing_cols}")
    if materialize:
        dp = self._clone(data={k: self._data[k] for k in index})
    else:
        dp = self.view()
        dp.visible_columns = index
    return dp
Currently, users have to apply a partial
to the function they pass in. We can instead forward the kwargs
of each of these methods to the (multi-argument) function to avoid this.
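A sketch of the kwargs forwarding, with a plain list standing in for a column:

```python
def map_column(values, fn, **kwargs):
    """Forward extra **kwargs to the user function, so
    map_column(vals, fn, y=2) replaces map_column(vals, partial(fn, y=2))."""
    return [fn(v, **kwargs) for v in values]

def add(x, y=0):
    return x + y

out = map_column([1, 2, 3], add, y=10)
```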
Could have signature matching np.load
and look something like this:
@classmethod
def from_npy(
    cls, path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding="ASCII"
):
    data = np.load(
        path,
        mmap_mode=mmap_mode,
        allow_pickle=allow_pickle,
        fix_imports=fix_imports,
        encoding=encoding,
    )
    return cls(data)
I'm envisioning something in between a map
and a LambdaColumn
, where the computation happens lazily but is cached once it's computed. Right now, either you do it all up front or you don't get caching.
This idea was raised by @ANarayan, who pointed out that it would be helpful for caching feature preprocessing in NLP pipelines.
## Data from test_prediction_column.py
logits = torch.as_tensor(
    [
        [-100, -2, -50, 0, 1],
        [0, 3, -1, 5, 4],
        [100, 0, 0, -1, 5],
        [-100, -2, -50, 0, 1],
    ]
).type(torch.float32)
expected_preds = torch.as_tensor([4, 3, 0, 4])

logit_col = ClassificationOutputColumn(logits=logits)
probs_col = ClassificationOutputColumn(probs=logit_col.probabilities().data)
preds_col = ClassificationOutputColumn(preds=logit_col.predictions().data)

print(logit_col.num_classes, probs_col.num_classes, preds_col.num_classes)
The output is 5 5 4
.
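The likely culprit: num_classes can be read off the trailing dimension of logits or probs, but predictions alone cannot determine it. A sketch of safer inference (hypothetical helper, numpy in place of torch):

```python
import numpy as np

def infer_num_classes(logits=None, probs=None, preds=None, num_classes=None):
    """Infer the class count from the trailing dim of logits/probs; when
    only preds are given, require an explicit num_classes instead of
    guessing from the predictions' shape or values."""
    if num_classes is not None:
        return num_classes
    if logits is not None:
        return np.asarray(logits).shape[-1]
    if probs is not None:
        return np.asarray(probs).shape[-1]
    raise ValueError("num_classes must be given explicitly when only preds are provided")

logits = np.zeros((4, 5))  # 4 examples, 5 classes, as in the repro above
```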
When I create a datapanel, and run dp.head()
in a jupyter notebook, the values of the entries are completely displayed instead of a truncated version.
On the Domino repo, gdro branch, run the following in a jupyter notebook (on most recent dev branch of meerkat):
df = build_cxr_df.out(load=True)
dp = get_dp(df)
dp.head()
Add a filepath
property that points to .data in ImageCell
Describe the bug
Error: TypeError: 'numpy.ndarray' object is not callable
To Reproduce
from meerkat.contrib.imagenette import download_imagenette
BASE_DIR = "./datasets"
dataset_dir = download_imagenette(BASE_DIR)
dp = mk.DataPanel.from_csv(os.path.join(dataset_dir, "imagenette.csv"))
dp["img"] = mk.ImageColumn.from_filepaths(filepaths=dp["img_path"])
dp.head()
Run Code
Error: TypeError: 'numpy.ndarray' object is not callable
Expected behavior
I expected a new ImageColumn called "img" to be created.
System Information
MacOS, Linux
pandas version 1.2.4
Can probably use the RG identifier class for this
This would be helpful for outside tools that want to support all of the datasets in the meerkat contrib library
When running a map over a DataPanel
with multiple workers, I get this. This may be because we're creating a log directory for every batch. Consider changing this.
Traceback (most recent call last):
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
self._accessor.mkdir(self, mode)
FileExistsError: [Errno 17] File exists: '/root/mosaic/RGDataset'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sabri/code/terra/terra/__init__.py", line 230, in _run
out = self.fn(**args_dict)
File "/home/sabri/code/spr-21/notebooks/05-23_cxr_forager.py", line 23, in convert_cxr_to_png
paths = dp.map(
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 863, in map
return super().map(
File "/home/sabri/code/mosaic/mosaic/mixins/mapping.py", line 58, in map
for i, batch in tqdm(
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 702, in batch
yield DataPanel.from_batch({**cell_batch._data, **batch_batch._data})
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 550, in from_batch
return cls(batch, identifier=identifier)
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 123, in __init__
self._create_logdir()
File "/home/sabri/code/mosaic/mosaic/datapanel.py", line 384, in _create_logdir
self.logdir.mkdir(parents=True, exist_ok=True)
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/pathlib.py", line 1287, in mkdir
self._accessor.mkdir(self, mode)
File "/home/common/envs/conda/envs/rg-sabri/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7921) is killed by signal: Killed.
When dp.map
is passed a function that returns a dict, the returned DataPanel
should use the same index
(i.e. example ids) as the original DataPanel
. Currently, it attaches a fresh index
column that may not match the index column from dp
.
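A minimal sketch of the desired behavior, with plain lists standing in for the index and the mapped column:

```python
def map_with_index(index, values, fn):
    """Sketch: a map that carries the original index column over to the
    result instead of minting a fresh 0..n-1 index. (Hypothetical helper,
    not the meerkat API.)"""
    return {"index": list(index), "out": [fn(v) for v in values]}

result = map_with_index(["r2", "r0", "r1"], [1, 2, 3], lambda x: x * 2)
```

Preserving the index keeps the mapped output joinable back to the original dp rows, which matters whenever rows were previously shuffled or filtered.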
When we write LambdaColumn
we also write the column or DataPanel underlying it. When writing a DataPanel with a LambdaColumn
dependent on other columns in the DataPanel
this can lead to the same columns being written to disk multiple times.
Currently conflicts with torch.nn
Right now map relies on columns specifying writer classes like the TorchWriter below for TensorColumn
class TorchWriter(AbstractWriter):
    def __init__(self, *args, **kwargs):
        super(TorchWriter, self).__init__(*args, **kwargs)

    def open(self) -> None:
        self.outputs = []

    def write(self, data, **kwargs) -> None:
        self.outputs.extend(data)

    def flush(self, *args, **kwargs):
        return torch.stack(self.outputs)

    def close(self, *args, **kwargs):
        pass

    def finalize(self, *args, **kwargs) -> None:
        pass
Except in the memmap case, these writers are basically just doing a concat, so they can be consolidated into one writer class based on concat.
Additionally, we need concat to preserve the subclass type, which we can accomplish by converting the static method into an instance method.
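A sketch of the consolidated writer; concat_fn here stands in for the instance-level concat described above (hypothetical design):

```python
class ConcatWriter:
    """Buffer batches, then let the output column's concat produce the
    final column. One class replaces the per-column writer zoo."""

    def __init__(self, concat_fn):
        self.concat_fn = concat_fn
        self.outputs = []

    def write(self, batch):
        self.outputs.append(batch)

    def flush(self):
        # delegating to concat_fn is what lets subclass types survive
        return self.concat_fn(self.outputs)

# e.g. for list-backed columns, concat is just flattening the batches
writer = ConcatWriter(lambda batches: [x for b in batches for x in b])
writer.write([1, 2])
writer.write([3])
result = writer.flush()
```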
The add_column
method, when used with overwrite=True
, produces a duplicate column.
Code:
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'label': [0, 1, 0]
})
dp.add_column("label", [1, 2, 3], overwrite=True)
dp.columns
gives ['text', 'label', 'index', 'label']
. On printing this DataPanel
, both these label
columns have the new data.
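The intended semantics could look like this sketch, where a plain dict stands in for the DataPanel's column store and assignment replaces rather than appends:

```python
def add_column(data: dict, name: str, values, overwrite=False):
    """Sketch of the intended overwrite semantics: replace the existing
    column in place rather than appending a duplicate entry."""
    if name in data and not overwrite:
        raise ValueError(f"column {name!r} exists; pass overwrite=True")
    data[name] = values  # dict assignment replaces, never duplicates
    return data

dp = {"text": ["a", "b", "c"], "label": [0, 1, 0]}
add_column(dp, "label", [1, 2, 3], overwrite=True)
```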
Add a list_datasets() function to the registry that returns a list of dataset names, so that people using the registry from an outside library can see what's available programmatically.
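A minimal sketch of such a registry helper; the registry internals and dataset names here are hypothetical:

```python
# module-level mapping from dataset name to its builder function
_DATASETS = {}

def register_dataset(name, builder):
    _DATASETS[name] = builder

def list_datasets():
    """Return the registered dataset names, sorted for stable output."""
    return sorted(_DATASETS)

register_dataset("imagenette", lambda: "...")
register_dataset("mimic", lambda: "...")
```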