
Comments (5)

etseidl commented on May 24, 2024

TL;DR There is a fixed overhead to using CUDA and cuDF, so very small files are not going to show any improvement, but you can cut down on the startup overhead somewhat, and there are other ways to improve efficiency.

I'll show some example traces on my RTX A6000 using the 1M integer example from above. First the output:

% python3 jnac.py 
0.00894021987915039
0.005298137664794922
0.0025205612182617188

and the associated trace
[trace screenshot: 2024-04-16 10-22-41]
The area highlighted in green is the actual decode time and is around 21ms. You can see the time is dominated by CUDA initialization (54ms) and setup involved in using kvikio/cuFile (187ms).

The second read_parquet call is then much faster (9ms), and is dominated by the decode kernel (6.5ms).
[trace screenshot: 2024-04-16 10-25-18]

Since the file is so small, we can skip using cuFile by setting the LIBCUDF_CUFILE_POLICY envvar to OFF. This has no impact on the measured read time, but greatly reduces the setup time.

% env LIBCUDF_CUFILE_POLICY=OFF python3 jnac.py 
0.009036540985107422
0.007204532623291016
0.0022492408752441406

[trace screenshot: 2024-04-16 10-29-08]
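
The same policy can also be set from inside Python; a minimal sketch (it has to happen before cudf does any file I/O):

import os
os.environ['LIBCUDF_CUFILE_POLICY'] = 'OFF'  # must be set before cudf touches the file

import cudf
df = cudf.read_parquet('/dev/shm/jnac.parquet')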

Another thing to notice from the above is that pandas is writing all 1M rows into a single page. But libcudf parallelizes parquet reads at the page level, so to see any improvement you'll want more pages. Parquet-mr and libcudf default to 20000 rows max per page, so using cudf to write the initial file has a measurable impact for the 1M row case.

import pandas as pd
import cudf

fname = '/dev/shm/jnac.parquet'  # path assumed, matching the earlier examples
df = pd.DataFrame({'jnac': [2333] * 1000000})
#df.to_parquet(fname, compression='ZSTD')  # pandas writer: all 1M rows in one page
cdf = cudf.from_pandas(df)
cdf.to_parquet(fname, compression='ZSTD')  # cudf writer: ~20k rows per page
% env LIBCUDF_CUFILE_POLICY=OFF python3 jnac.py
0.002679586410522461
0.005636692047119141
0.0023217201232910156

[trace screenshot: 2024-04-16 10-31-34]

Now the to_parquet() call is bearing the price of CUDA initialization. The decode time (again highlighted in green) has gone from around 20ms to around 11ms, and the decode kernel has gone from a single threadblock and 6.5ms to 50 threadblocks and 140us. And the second read_parquet is now down to around 2.6ms.
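
If your cudf version exposes the writer's page-size options (recent releases do), you can shrink the pages further to get even more decode parallelism; a sketch using the max_page_size_rows parameter:

import cudf

cdf = cudf.DataFrame({'jnac': [2333] * 1000000})
# smaller pages -> more pages -> more threadblocks in the decode kernel
cdf.to_parquet('/dev/shm/jnac.parquet', compression='ZSTD',
               max_page_size_rows=5000)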

In summary, we're seeing a pretty fixed cost of 50ms for CUDA setup, 190ms for cuFile setup, and around 10ms for cuDF setup (there are some buffer initializations and a stream pool to set up). Python adds its own latency too, and there's the actual file I/O to take into account. So the minimum time you'll see for a single file read is going to be something over 60ms. Even if you can read files in batches and amortize the startup penalty, you'll still only see performance on par with Arrow for such small files. As you've discovered already, to really see the benefit you need files that are large enough to move the bottleneck from setup/IO to the actual compute kernels that will show the parallelization benefit.
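
On the batching point: cudf.read_parquet accepts a list of paths, so a pile of small files can be read in one call that shares the one-time setup. A sketch with hypothetical file names:

import cudf

# one call over many small files: CUDA/cuDF startup is paid once, not per file
paths = [f'/dev/shm/part_{i}.parquet' for i in range(100)]  # hypothetical files
df = cudf.read_parquet(paths)  # result is the concatenation of all the files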


pmixer commented on May 24, 2024

[trace screenshot: cudf-na-issue]
[attachment: read_jnac_parquet_and_nsys_file.zip]
Profiled to get to know what's happening under the hood: the GPU kernel trace is very sparse, but the CUDA API calls take so long...
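
For anyone wanting to reproduce a trace like the attached one, Nsight Systems can capture it (the script name here is assumed from the zip):

% nsys profile -t cuda,nvtx -o read_jnac python3 read_jnac_parquet.py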


pmixer commented on May 24, 2024

There might be some tricks to avoid the long first-run cudaHostAlloc, which I haven't figured out yet; the code below may only handle GPU-side memory pre-allocation.

import rmm

# pre-allocate a 4GB GPU-side memory pool up front
rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 10**9)
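
One workaround sketch (not a dedicated cudf API): pay the one-time CUDA init and pinned-memory (cudaHostAlloc) cost up front with a throwaway read, before latency matters:

import pandas as pd
import cudf

# hypothetical warm-up file; the first cudf call absorbs the one-time setup cost
pd.DataFrame({'x': [0]}).to_parquet('/dev/shm/warmup.parquet')
cudf.read_parquet('/dev/shm/warmup.parquet')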


pmixer commented on May 24, 2024

I also tried a 1M-row column of all (the same) integers and a 1M-row column of all (the same) strings; cudf.read_parquet still suffers the perf issue, very likely due to the long cudaMallocHost call.

df = pandas.DataFrame({'j2333c': [2333] * 1000000})
df.to_parquet('/dev/shm/j2333c.parquet', compression='ZSTD')

>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>> 
>>> import time
>>> 
>>> # timing is not precise, but the difference is so obvious that more accurate timing isn't needed for now
>>> 
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.08919477462768555
>>> 
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.026215314865112305
>>> 
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.014030933380126953


>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>> 
>>> import time
>>> 
>>> import rmm
>>> 
>>> rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>> 
>>> # timing is not precise, but the difference is so obvious that more accurate timing isn't needed for now
>>> 
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()

>>> print(te - ts)
0.08475613594055176
>>> 
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.025774002075195312
>>> 
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.011544227600097656

df = pandas.DataFrame({'jstrc': ['2333'] * 1000000})
df.to_parquet('/dev/shm/jstrc.parquet', compression='ZSTD')

>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>> 
>>> import time
>>> 
>>> import rmm
>>> 
>>> # rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>> 
>>> # timing is not precise, but the difference is so obvious that more accurate timing isn't needed for now
>>> 
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)

>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.08581995964050293
>>> 
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.057205915451049805
>>> 
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.022694826126098633
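
Since the first call pays the one-time setup, a fairer micro-benchmark warms up once and averages repeats; a minimal sketch:

import time

import cudf

def bench(fn, repeats=10):
    fn()  # warm-up: absorbs CUDA/cuDF one-time initialization
    t0 = time.time()
    for _ in range(repeats):
        fn()
    return (time.time() - t0) / repeats

print(bench(lambda: cudf.read_parquet('/dev/shm/j2333c.parquet')))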


pmixer commented on May 24, 2024

Well, as GPUs are throughput machines, increasing the row count from millions to billions shows the advantage clearly:

>>> import cudf
>>> import pandas as pd
>>> 
>>> df = pd.DataFrame({'jnac': [None] * 1000000000})
>>> df.to_parquet('/dev/shm/jnac.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>> 
>>> import time
>>> 
>>> # timing is not precise, but the difference is so obvious that more accurate timing isn't needed for now
>>> 
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
0.15029525756835938
>>> 
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
7.30379843711853
>>> 
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
1.51247239112854
>>> 

So now the major problem is how to resolve the issue for tables chunked at the millions-of-rows scale.
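
One possible mitigation for the many-chunks case (a sketch, assuming the chunks sit in one directory) is to let dask_cudf read the whole set in a single call, so the fixed GPU setup cost is shared across all the million-row chunks:

import dask_cudf

# hypothetical directory of million-row chunk files
ddf = dask_cudf.read_parquet('/dev/shm/chunks/*.parquet')
df = ddf.compute()  # one concatenated cudf DataFrame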
