
cudf's Issues

Install Error: symbol not found in library libgdf.so

Getting the following error when I try import pygdf. Appreciative of any and all guidance!

error: symbol 'gdf_left_join_generic' not found in library 'libgdf.so': /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/../../libgdf.so: undefined symbol: gdf_left_join_generic

MapD Thrift interactions in C++ (and C++ in general)

I am wondering if it would be faster and more maintainable to handle serialization, data movement, and other interactions between this system and other platforms (e.g. MapD) in a C++ library that is not Python specific. This would add some complexity to the build system for this project, but that seems inevitable given the nature of the problem (that you'll end up needing to use nvcc at some point to create some support libraries).

Multi-Column Joins

Currently, you can only join on a single column by setting that column as the index on both dataframes.

Creating this issue to open discussion and track progress for the implementation of multi-column joins.
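For concreteness, the desired surface is a pandas-style multi-key merge. The sketch below is hypothetical (pygdf has no such call today), with the current single-key workaround shown for contrast; the frame and column names are placeholders.

# Hypothetical pandas-style API this issue is asking for:
merged = gdf_left.merge(gdf_right, on=['key1', 'key2'], how='left')

# Today's only option: one key, set as the index on both frames.
gdf_left = gdf_left.set_index('key1')
gdf_right = gdf_right.set_index('key1')
merged = gdf_left.join(gdf_right)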

Binary operations between int and float view instead of cast

It appears that instead of casting to a common type, the bytes of other in a binary operation (e.g. __add__(self, other)) are viewed in the dtype of self:

In [54]: df = pd.DataFrame({'x': range(10), 'y': list(map(float, range(10)))})

In [55]: gdf = gd.DataFrame.from_pandas(df)

In [56]: gdf
Out[56]:
      x    y
 0    0  0.0
 1    1  1.0
 2    2  2.0
 3    3  3.0
 4    4  4.0
 5    5  5.0
 6    6  6.0
 7    7  7.0
 8    8  8.0
 9    9  9.0

In [57]: gdf.x + gdf.y  # Int + Float
Out[57]:

 0                   0
 1 4607182418800017409
 2 4611686018427387906
 3 4613937818241073155
 4 4616189618054758404
 5 4617315517961601029
 6 4618441417868443654
 7 4619567317775286279
 8 4620693217682128904
 9 4621256167635550217

In [58]: gdf.y + gdf.x  # Float + Int
Out[58]:

 0  0.0
 1  1.0
 2  2.0
 3  3.0
 4  4.0
 5  5.0
 6  6.0
 7  7.0
 8  8.0
 9  9.0

In [59]: df.y + df.x.view('f8')  # Comparison using pandas showing it's view instead of astype
Out[59]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

In [60]: df.x + df.y.view('i8')
Out[60]:
0                      0
1    4607182418800017409
2    4611686018427387906
3    4613937818241073155
4    4616189618054758404
5    4617315517961601029
6    4618441417868443654
7    4619567317775286279
8    4620693217682128904
9    4621256167635550217
dtype: int64
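For reference, the large integers above are exactly the float bit patterns reinterpreted as int64, which a plain NumPy check confirms (view reinterprets bytes, astype converts values):

import numpy as np

vals = np.arange(10, dtype='float64')
print(vals.view('i8'))    # bit patterns: 0, 4607182418800017409, 4611686018427387906, ...
print(vals.astype('i8'))  # value-preserving cast: 0, 1, 2, ...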

Mentioning Apache Arrow in the README

While there's no obligation to do so, it would be positive for the OSS community to mention that Arrow is an important component of how the GDF works.

For example, I have had people ask me about this article https://devblogs.nvidia.com/parallelforall/goai-open-gpu-accelerated-data-analytics/, and the lines

from pygdf.gpuarrow import GpuArrowReader
reader = GpuArrowReader(darr)

e.g. "is that the same Arrow?". It's mentioned nowhere in that article. In the blog post it says "Without shared GPU data structures provided by GOAI". This is a little bit misleading. It would be good to acknowledge that this isn't all an endogenous creation and there is a broader community at work on these problems (zero-copy columnar data interchange).

Broadcast operations between series and scalars

Currently these only work for comparison operators

In [31]: df = pd.DataFrame({'x': range(10), 'y': list(map(float, range(10)))})

In [32]: gdf = gd.DataFrame.from_pandas(df)

In [33]: gdf.x > 0
Out[33]:

 0 False
 1  True
 2  True
 3  True
 4  True
 5  True
 6  True
 7  True
 8  True
 9  True

In [34]: gdf.x + 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-7eff5c11afd0> in <module>()
----> 1 gdf.x + 0

TypeError: unsupported operand type(s) for +: 'Series' and 'int'
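For reference, pandas broadcasts the scalar for arithmetic just as it does for comparisons, which is the behaviour being requested here:

df.x + 0     # pandas: returns the int64 column unchanged
df.x * 2.5   # pandas: broadcasts and upcasts to float64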

Document build problem on CentOS 7

When I tried to build the documentation on CentOS 7 using the packaged version of python-sphinx, the toolchain did not work because the sphinx-build shipped in the RPM package is too old (python-sphinx-1.1.3-11.el7.noarch.rpm does not support the -M option).
It would be helpful to document the minimum required Sphinx version.

[kaigai@namazu docs]$ make html
Sphinx v1.1.3
Usage: /usr/bin/sphinx-build [options] sourcedir outdir [filenames...]
Options: -b <builder> -- builder to use; default is html
         -a        -- write all files; default is to only write new and changed files
         -E        -- don't use a saved environment, always read all files
         -t <tag>  -- include "only" blocks with <tag>
         -d <path> -- path for the cached environment and doctree files
                      (default: outdir/.doctrees)
         -c <path> -- path where configuration file (conf.py) is located
                      (default: same as sourcedir)
         -C        -- use no config file at all, only -D options
         -D <setting=value> -- override a setting in configuration
         -A <name=value>    -- pass a value into the templates, for HTML builder
         -n        -- nit-picky mode, warn about all missing references
         -N        -- do not do colored output
         -q        -- no output on stdout, just warnings on stderr
         -Q        -- no output at all, not even warnings
         -w <file> -- write warnings (and errors) to given file
         -W        -- turn warnings into errors
         -P        -- run Pdb on exception
Modi:
* without -a and without filenames, write new and changed files.
* with -a, write all files.
* with filenames, write these.
make: *** [html] Error 1

The documentation can be built by installing the latest Sphinx with pip and letting it overwrite the packaged version, but that is a bit inconvenient in a CentOS/RHEL environment.
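One low-effort mitigation, assuming the project wants to fail fast on old toolchains, is Sphinx's standard needs_sphinx setting in docs/conf.py (the exact minimum shown is illustrative):

# docs/conf.py
needs_sphinx = '1.3'   # Sphinx aborts with a clear message if the installed version is older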

Thanks,

pygdf fails py.test

I am re-installing pygdf on a Linux Ubuntu 16.04 instance with one NVIDIA K80 (driver version 384.111). I started getting a pygdf 'invalid shared memory key' error and decided to reinstall pygdf. I am now trying to re-create an environment from notebook_py35.yml.
Steps taken to reinstall:

  1. Removed the pygdf conda environment (conda remove)
  2. Did a conda clean to remove everything from the cache
  3. Re-created the pygdf environment
  4. Activated the pygdf environment
  5. Tried to run py.test but got the error 'no module named libgdf_cffi'

Not sure why it won't pass py.test anymore.

Downloading the packages while creating the pygdf environment produced no warnings or errors.

Does pygdf store data in VRAM? How to clean data in VRAM?

Hello, I am trying to use pygdf for big data analytics (billions of rows), but when I try to convert a pandas DataFrame with millions of rows, I get a CUDA_ERROR_OUT_OF_MEMORY error. It seems that pygdf's memory is limited to VRAM. Is there a solution to this problem?
Also, every time I display a DataFrame it uses more VRAM. Is there a way to clean VRAM?
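A minimal workaround sketch, assuming the device buffers behind a pygdf DataFrame are freed once the Python objects are unreferenced; the Numba deallocation call is an assumption about the installed Numba version:

import gc
from numba import cuda

del gdf                                        # drop the last reference to the frame
gc.collect()                                   # reclaim the Python-side wrappers
cuda.current_context().deallocations.clear()   # assumed Numba API: flush pending device frees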

Function to add categories to a categorical column

Right now there is no way to add categories to a categorical column without defining a new column. This would be useful for UDFs, fillna, etc. where a user may want to split a categorical column based on another column or fill in nulls.

Example:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf = pd.DataFrame({"cat_key": ["a", "b", None, "c", None], "value": [1, 2, 3, 4, 5]})
pdf['cat_key'] = pdf['cat_key'].astype("category")

gdf = DataFrame.from_pandas(pdf)

In the above, the only values we can fillna with are "a", "b", or "c". There should be an add_categories function similar to pandas.Series.cat.add_categories.
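For reference, pandas already exposes this on the .cat accessor; a pygdf method mirroring it (hypothetical here) would make the fillna above possible:

# pandas today: extend the category set, then fill with the new value.
pdf['cat_key'] = pdf['cat_key'].cat.add_categories(['missing'])
pdf['cat_key'] = pdf['cat_key'].fillna('missing')

# Hypothetical pygdf equivalent, mirroring pandas:
# gdf['cat_key'] = gdf['cat_key'].cat.add_categories(['missing'])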

cannot load library 'libgdf.so'

When I use

python setup.py install
python setup.py build

to install and build pygdf, and then run py.test, an OSError occurs:

cannot load library 'libgdf.so': libgdf.so: cannot open shared object file: No such file or directory

one_hot_encoding does not handle masked data properly

One-hot encoding is not respecting the mask and is considering invalid locations. This will lead to incorrect results if an invalid location happens to contain a value that matches one of the categories.

Current workaround:

masked_series.fillna(something).one_hot_encoding()

Series.unique_k output type

Most pygdf.Series methods return a pygdf.Series, but Series.unique_k returns a numpy array. Is there a good reason for this, or can the data remain on the gpu?

Comparisons with scalars don't work on series of length 1

In [15]: df = pd.DataFrame({'x': range(10), 'y': list(map(float, range(10)))})

In [16]: gdf = gd.DataFrame.from_pandas(df)

In [17]: x1 = gdf.x[:1]

In [18]: x2 = gdf.x[:2]

In [19]: x1
Out[19]:

0    0

In [20]: x2
Out[20]:

0    0
1    1

In [21]: x2 == 0
Out[21]:

0  True
1 False

In [22]: x1 == 0
<long traceback>
...
~/miniconda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/driver.py in host_pointer(obj)
   1505
   1506     forcewritable = isinstance(obj, np.void)
-> 1507     return mviewbuf.memoryview_get_buffer(obj, forcewritable)
   1508
   1509

TypeError: expected a writable bytes-like object

In [23]: x1 == x1  # Does work with series
Out[23]:

0 True

Installer: ResolvePackageNotFound

I just did a fresh conda install (OS X) and tried to install pygdf, but hit ResolvePackageNotFound on libgdf. Any pointers or recommendations on things to try? I will report back.

Thanks!

Leos-MBP:pygdf lmeyerov$ conda --version
conda 4.3.30
Leos-MBP:pygdf lmeyerov$ conda env create --name pygdf_dev --file conda_environments/testing_py35.yml
Fetching package metadata .................

ResolvePackageNotFound: 
  - libgdf 0.1.0a2.dev hb999fd6_2

Joining on categorical columns matches categorical code rather than categorical value

Reproducible use case:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf1 = pd.DataFrame({"join_col": ["a", "b", "c", "d", "e"], "data_col_left": [1, 2, 3, 4, 5]})
pdf2 = pd.DataFrame({"join_col": ["c", "e", "f"], "data_col_right": [6, 7, 8]})

pdf1["join_col"] = pdf1["join_col"].astype("category")
pdf2["join_col"] = pdf2["join_col"].astype("category")

gdf1 = DataFrame.from_pandas(pdf1)
gdf2 = DataFrame.from_pandas(pdf2)

gdf1 = gdf1.set_index("join_col")
gdf2 = gdf2.set_index("join_col")

join_gdf = gdf1.join(gdf2)
join_pdf = pdf1.join(pdf2)

GDF incorrect output:

   data_col_left  data_col_right
a              1               6
b              2               7
c              3               8
d              4              -1
e              5              -1

PDF correct output:

          data_col_left  data_col_right
join_col                               
a                     1             NaN
b                     2             NaN
c                     3             6.0
d                     4             NaN
e                     5             7.0

Update to delegate (some) functionality to libgdf

Much of the current functionality in PyGDF should be delegated to the C library libgdf. This will allow the functionality to be reused in other projects. PyGDF can then just provide a Python wrapper around those functions.
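A minimal sketch of that split, assuming the existing libgdf_cffi-style binding: the heavy lifting lives in the C library and Python only wraps it. The function name and signature below are placeholders, not the real libgdf API.

from cffi import FFI

ffi = FFI()
# Placeholder declaration; the real libgdf headers define the actual signatures.
ffi.cdef("int gdf_example_op(void *lhs, void *rhs, void *out);")
lib = ffi.dlopen("libgdf.so")

def example_op(lhs_ptr, rhs_ptr, out_ptr):
    # Thin Python wrapper: call into libgdf and translate error codes into exceptions.
    errcode = lib.gdf_example_op(lhs_ptr, rhs_ptr, out_ptr)
    if errcode != 0:
        raise RuntimeError("libgdf returned error code {}".format(errcode))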

CUDA_ERROR_INVALID_CONTEXT when calling `to_pandas` in parallel

In [1]: %load test.py

In [2]: # %load test.py
   ...: import pygdf as gd
   ...: import pandas as pd
   ...: import dask_gdf as dgd
   ...: import numpy as np
   ...:
   ...: df = pd.DataFrame({'x': np.random.randint(0, 5, size=10000),
   ...:                    'y': np.random.normal(size=10000)})
   ...:
   ...: gdf = gd.DataFrame.from_pandas(df)
   ...:
   ...: works = (dgd.from_pygdf(gdf, npartitions=2)
   ...:             .query('x > 2'))
   ...:
   ...: fails = works.to_dask_dataframe()
   ...:

In [3]: works.head()
Out[3]:
     x               y
2    4  -1.73270757966
4    3 -0.308836664379
6    3 -0.241514128025
8    3 -0.348121287014
15    3  0.377489009207

In [4]: fails.head()
<long traceback>
...
~/miniconda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
    321                     _logger.critical(msg, _getpid(), self.pid)
    322                     raise CudaDriverError("CUDA initialized before forking")
--> 323             raise CudaAPIError(retcode, msg)
    324
    325     def get_device(self, devnum=0):

CudaAPIError: [201] Call to cuMemcpyDtoH results in CUDA_ERROR_INVALID_CONTEXT

In [5]: import dask

In [6]: fails.compute(get=dask.get).head()  # Using no threads
Out[6]:
    x         y
2   4 -1.732708
4   3 -0.308837
6   3 -0.241514
8   3 -0.348121
15  3  0.377489

The only difference between works and fails is that fails calls to_pandas in parallel, resulting in pulling data off the gpu and back onto the host.

Define basic groupby functionality

What kind of groupby functionality is actually useful from the Python API? (This will almost certainly require Numba to JIT some operations.)
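As a starting point for discussion, here is a sketch of the pandas-style surface most users reach for first; the groupby/apply API shown is hypothetical, not something pygdf exposes today:

def mean_value(df):
    # Placeholder UDF; the issue notes Numba would need to JIT functions like this.
    return df['value'].mean()

grouped = gdf.groupby('key')          # hypothetical: group by one or more key columns
means = grouped.mean()                # built-in reductions: mean, sum, count, min, max
custom = grouped.apply(mean_value)    # user-defined reductions compiled by Numba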

GDF_CUDA_ERROR on binary operation with dataframe columns

Hi,

I am trying to do some basic add and multiply operations on dataframe columns, but it throws GDF_CUDA_ERROR every time.

Below is the code and stacktrace.

from pygdf.dataframe import DataFrame
import numpy as np

size = 100000000
df = DataFrame([('a', np.random.random(size)),('b', np.random.random(size))])
print('some rows',df.loc[:5])

df['a'] = df['a'] + df['b']

print('some rows',df.loc[:5])

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df['a'] = df['a'] + df['b']
  File "/home/test_proj/pygdf/pygdf/series.py", line 234, in __add__
    return self._binaryop(other, 'add')
  File "/home/test_proj/pygdf/pygdf/series.py", line 221, in _binaryop
    outcol = self._column.binary_operator(fn, other._column)
  File "/home/test_proj/pygdf/pygdf/numerical.py", line 65, in binary_operator
    out_dtype=self.dtype)
  File "/home/test_proj/pygdf/pygdf/numerical.py", line 180, in numeric_column_binop
    null_count = _gdf.apply_binaryop(op, lhs, rhs, out)
  File "/home/test_proj/pygdf/pygdf/_gdf.py", line 68, in apply_binaryop
    binop(*args)
  File "/root/miniconda3/envs/pygdf_dev/lib/python3.5/site-packages/libgdf_cffi/wrapper.py", line 28, in wrap
    raise GDFError(errcode, errname)
libgdf_cffi.wrapper.GDFError: GDF_CUDA_ERROR

Rethink class structure

While working to implement things like concat, I've noticed a few (potential) issues with the current design that may bite us in the future. I'll lay them out here. Feel free to ignore, I'm very new to working with this codebase and may be missing reasons for the existing layout.

Current Structure

  • A DataFrame contains:

    • A dictionary of columns to series
    • An index
  • A Series contains:

    • A buffer
    • A "mask" representing missingness
    • An implementation depending on the dtype.
  • An index could be many things, but the generic one just wraps a series.

  • A numeric implementation contains:

    • A dtype
  • A categorical implementation contains:

    • A dtype
    • A codes dtype
    • A numerical implementation for the codes dtype
    • A set of categories
    • Whether the categories have an order

This structure results in a few odd things:

  • A DataFrame and each of its series all contain references to the same index. This currently isn't checked, but must be true for correctness.
  • The numeric implementation contains no data, but the categorical implementation does. This makes figuring out where to put methods kind of confusing.

Proposed new structure

  • A DataFrame contains:

    • a dictionary of column names to data objects
    • an index
  • A series contains:

    • a data object
    • an index
  • A data object contains:

    • a dtype
    • a mask
    • all methods for working with that dtype
  • An Index is basically the same as before, but contains a data object instead of a series.

This new layout is nicer because:

  • A dataframe only has one index reference. A new series is created each time it's indexed out of that dataframe, but this is cheap since it's just a new python object wrapping an existing (not copied) buffer.
def __getitem__(self, key):
    return Series(self._columns[key], index=self.index)
  • Internally many algorithms are working with data objects instead of series. This makes keeping consistent dtypes, dealing with masks, and extra state from categoricals a lot easier. No longer does the categorical implementation object have to match the data buffer that's passed along inside a series. Correctly implementing concat becomes easier, as all necessary state is located on the same object.

Potential Example Class Hierarchy

class Data(object):
    pass

class NumericData(Data):
    def __init__(self, buffer, mask, dtype):
        pass

class CategoricalData(Data):
    def __init__(self, codes : NumericData, categories, ordered=False):
        pass

class Series(object):
    def __init__(self, data, index=None):
        pass

class DataFrame(object):
    def __init__(self, cols_and_data, index=None):
        pass

I don't think rearranging this would be that tricky. It basically amounts to moving the buffer storage into the implementation classes, and fixing existing code as necessary.

Typo in README?

You say to
conda env create --name pygdf_dev --file conda_environments/testing_py35.yml, but it seems to produce a conflict (Mac OS X), since the name field inside the yaml file already specifies a different name.
Using
conda env create --file conda_environments/testing_py35.yml just goes fine.

Function to append/concat GDFs

If you currently want to append GDFs, you have to create an empty GDF and build each column by appending the corresponding columns of the input dataframes.

Example:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf1 = pd.DataFrame({"col1": [1,2,3], "col2": [11,12,13]})
pdf2 = pd.DataFrame({"col1": [4,5,6], "col2": [14,15,16]})
pdf3 = pd.DataFrame({"col1": [7,8,9], "col2": [17,18,19]})

gdf1 = DataFrame.from_pandas(pdf1)
gdf2 = DataFrame.from_pandas(pdf2)
gdf3 = DataFrame.from_pandas(pdf3)

append_gdfs = [gdf1, gdf2, gdf3]
newgdf = DataFrame()
for col in gdf1.columns:
    new_col = None
    for gdf in append_gdfs:
        if new_col is None:
            new_col = gdf[col]
        else:
            new_col = new_col.append(gdf[col])
    newgdf[col] = new_col

There should be append and concat functions similar to pandas.DataFrame.append and pandas.concat for appending dataframe(s).
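The requested surface, mirroring pandas; the pygdf calls below are hypothetical:

# pandas today:
newpdf = pd.concat([pdf1, pdf2, pdf3])

# Hypothetical pygdf equivalents, mirroring pandas.concat / DataFrame.append:
# newgdf = pygdf.concat([gdf1, gdf2, gdf3])
# newgdf = gdf1.append(gdf2).append(gdf3)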

Outer joins aren't properly populated with NaNs when values don't exist

Minimal case to reproduce:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf1 = pd.DataFrame({"join_col": [1,2,3,4,5], "data_col_left": [10, 11, 12, 13, 14]})
pdf2 = pd.DataFrame({"join_col": [1,2], "data_col_right1": [15, 16]})
pdf3 = pd.DataFrame({"join_col": [1,2,3,4,5], "data_col_right2": [17, 18, 19, 20, 21]})

gdf1 = DataFrame.from_pandas(pdf1)
gdf2 = DataFrame.from_pandas(pdf2)
gdf3 = DataFrame.from_pandas(pdf3)

gdf1 = gdf1.set_index("join_col")
gdf2 = gdf2.set_index("join_col")
gdf3 = gdf3.set_index("join_col")

test = gdf1.join(gdf2).join(gdf3)
print(test.head().to_pandas())

Returns:

   data_col_left  data_col_right1  data_col_right2
1             10               15               17
2             11               16               18
3             12               12               19
4             13               13               20
5             14               14               21

Instead of the expected:

   data_col_left  data_col_right1  data_col_right2
1             10               15               17
2             11               16               18
3             12               NaN              19
4             13               NaN              20
5             14               NaN              21

Executing more code and doing other operations ends up returning arbitrary values for rows 3, 4, and 5 of data_col_right1.

error in creating env

conda env create --name pygdf_dev --file conda_environments/testing_py35.yml

which fails with the following error:

quant@quantaxis:~$ conda env create --name pygdf_dev --file conda_environments/testing_py35.yml
conda: command not found
quant@quantaxis:~$ 

DataFrame.nlargest/nsmallest fail on sliced frame

This works fine if the frame is sliced from the start, but fails if the slice is in the middle:

In [29]: import pandas as pd, pygdf as gd

In [30]: df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

In [31]: gdf = gd.DataFrame.from_pandas(df)

In [32]: gdf[:10].nlargest(5, 'x')
Out[32]:
     x    y
9    9  9.0
8    8  8.0
7    7  7.0
6    6  6.0
5    5  5.0

In [33]: gdf[10:20].nlargest(5, 'x')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-33-51dda14cf36d> in <module>()
----> 1 gdf[10:20].nlargest(5, 'x')

~/Code/pygdf/pygdf/dataframe.py in nlargest(self, n, columns, keep)
    439         * Only a single column is supported in *columns*
    440         """
--> 441         return self._n_largest_or_smallest('nlargest', n, columns, keep)
    442
    443     def nsmallest(self, n, columns, keep='first'):

~/Code/pygdf/pygdf/dataframe.py in _n_largest_or_smallest(self, method, n, columns, keep)
    465                 df[k] = sorted_series
    466             else:
--> 467                 df[k] = self[k].take(df.index.gpu_values)
    468         return df
    469

~/Code/pygdf/pygdf/dataframe.py in __setitem__(self, name, col)
    144             self._cols[name] = self._prepare_series_for_add(col)
    145         else:
--> 146             self.add_column(name, col)
    147
    148     def __delitem__(self, name):

~/Code/pygdf/pygdf/dataframe.py in add_column(self, name, data)
    304         if name in self._cols:
    305             raise NameError('duplicated column name {!r}'.format(name))
--> 306         series = self._prepare_series_for_add(data)
    307         self._cols[name] = series
    308

~/Code/pygdf/pygdf/dataframe.py in _prepare_series_for_add(self, col)
    290             return series
    291         else:
--> 292             raise NotImplementedError("join needed")
    293
    294     def add_column(self, name, data):

NotImplementedError: join needed

Binary operations between numeric & categorical should raise

Currently this only happens if the dispatch hits categorical first:

In [21]: df = pd.DataFrame({'x': range(10), 'y': list(map(float, range(10))), 'z': list('abcde')*2})

In [22]: df.z = df.z.astype('category')

In [23]: gdf = gd.DataFrame.from_pandas(df)

In [24]: gdf.x + gdf.z
Out[24]:

 0 144396680282898688
 1               1028
 2                  7
 3                  8
 4                  9
 5                 10
 6                 11
 7                 12
 8                 13
 9                 14

In [25]: gdf.z + gdf.x
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-01cf7f26eb92> in <module>()
----> 1 gdf.z + gdf.x

~/Code/pygdf/pygdf/series.py in __add__(self, other)
    356
    357     def __add__(self, other):
--> 358         return self._binaryop(other, 'add')
    359
    360     def __sub__(self, other):

~/Code/pygdf/pygdf/series.py in _binaryop(self, other, fn)
    345         if not isinstance(other, Series):
    346             return NotImplemented
--> 347         return self._impl.binary_operator(fn, self, other)
    348
    349     def _unaryop(self, fn):

~/Code/pygdf/pygdf/categorical.py in binary_operator(self, binop, lhs, rhs)
     73     def binary_operator(self, binop, lhs, rhs):
     74         msg = 'Categorical cannot perform the operation: {}'.format(binop)
---> 75         raise TypeError(msg)
     76
     77     def unary_operator(self, unaryop, series):

TypeError: Categorical cannot perform the operation: add

Allow for strings for comparison in UDFs / fillna for categorical columns

When using a categorical column, .query() and .fillna() require the categorical code rather than the categorical value.

Example:

import pandas as pd
from pygdf.dataframe import DataFrame

pdf = pd.DataFrame({"key": ["a", "b", "c", "d", "e", None, "f", None, "null_key"], "value": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
pdf['key'] = pdf['key'].astype('category')

gdf = DataFrame.from_pandas(pdf)

These work by giving the index of null_key in gdf['key'].cat.categories:

gdf['key'] = gdf['key'].fillna(6)
gdf.query('key == 6').head().to_pandas()

These do not work and fail with "Failed at nopython (nopython frontend)". I assume this is due to numba being unable to compile the function with a string type?

gdf['key'] = gdf['key'].fillna("null_key")
gdf.query('key == "null_key"').head().to_pandas()
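Until strings are supported, a workaround sketch that builds on the code-based calls above (which do work) is to translate the string to its categorical code first; iterating gdf['key'].cat.categories with list() is an assumption about the accessor:

# Workaround sketch: look up the integer code for the category value, then use
# the integer forms of fillna/query that already work.
code = list(gdf['key'].cat.categories).index("null_key")
gdf['key'] = gdf['key'].fillna(code)
matches = gdf.query('key == {}'.format(code)).head().to_pandas()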

Collaborations on columnar data structures

Excited to see this new org created. I am interested to see if Apache Arrow (i.e. contiguous columnar data, validity bitmap for nulls) is the appropriate data model for data on the GPU, and if we can collaborate on some aspects of the code. It seems that CUDA 7 now supports C++11, so in theory we could compile the Arrow C++ libraries with nvcc and provide necessary APIs to enable Numba to interact with the raw memory buffers. This might simplify IPC with GPU main memory (record batch loading and unloading) and make less work for you here. I have an NVIDIA GPU on my home desktop, so I could help with testing.

Support more series operators

The following are missing:

  • negation (-x)
  • inv (~x)
  • boolean operations (&, |, ^)
  • pow (x ** 2)

pow was added in #892. Negation and inversion are part of #1163. Boolean bitwise operations are in #1292.

Support for Windows?

I use Linux at work, but at home I have Windows and would like to run this on my main machine with conda. Currently, when I run:

conda env create --name pygdf_dev --file conda_environments/testing_py35.yml

I am seeing this error:

NoPackagesFoundError: Package missing in current win-64 channels:
  - libgdf_cffi >=0.1.0a1.dev

I am hoping that this could be easily added to the win-64 channels?
