Comments (5)
TL;DR: There is a fixed overhead to using CUDA and cuDF, so very small files are not going to show any improvement. You can reduce the startup overhead somewhat, though, and there are a few other ways to improve efficiency.
I'll show some example traces on my RTX A6000 using the 1M integer example from above. First the output:
% python3 jnac.py
0.00894021987915039
0.005298137664794922
0.0025205612182617188
and the associated trace
The area highlighted in green is the actual decode time and is around 21ms. You can see the time is dominated by CUDA initialization (54ms) and setup involved in using kvikio/cuFile (187ms).
The second read_parquet call is then much faster (9ms), and is dominated by the decode kernel (6.5ms).
Since the file is so small, we can skip using cuFile by setting the LIBCUDF_CUFILE_POLICY envvar to OFF. This has no impact on the measured read time, but greatly reduces the setup time.
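The policy can also be set from inside Python rather than on the command line. A minimal sketch (assumption: libcudf reads LIBCUDF_CUFILE_POLICY when its cuFile integration initializes, so the assignment should happen before cudf does any file I/O, and to be safe before the import):

```python
import os

# Disable cuFile before cudf is imported; the policy is consulted when
# libcudf's cuFile integration initializes, so setting it after the first
# read would be too late.
os.environ["LIBCUDF_CUFILE_POLICY"] = "OFF"

# import cudf  # import only after the policy is set
```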
% env LIBCUDF_CUFILE_POLICY=OFF python3 jnac.py
0.009036540985107422
0.007204532623291016
0.0022492408752441406
Another thing to notice from the above is that pandas is writing all 1M rows into a single page. But libcudf parallelizes parquet reads at the page level, so to see any improvement you'll want more pages. Parquet-mr and libcudf default to 20000 rows max per page, so using cudf to write the initial file has a measurable impact for the 1M row case.
df = pd.DataFrame({'jnac': [2333] * 1000000})
#df.to_parquet(fname, compression='ZSTD')  # pandas writes all 1M rows into one page
cdf = cudf.from_pandas(df)
cdf.to_parquet(fname, compression='ZSTD')  # cudf splits the column into 20000-row pages
% env LIBCUDF_CUFILE_POLICY=OFF python3 jnac.py
0.002679586410522461
0.005636692047119141
0.0023217201232910156
Now the to_parquet() call is bearing the price of CUDA initialization. The decode time (again highlighted in green) has gone from 20ms to around 11ms, and the decode kernel has gone from a single threadblock and 6.5ms to 50 threadblocks and 140us. The second read_parquet call is now down to around 2.6ms.
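The 50-threadblock figure lines up with the page math (a quick check, assuming the 20000-rows-per-page default and one decode threadblock per page):

```python
import math

rows = 1_000_000
max_rows_per_page = 20_000   # parquet-mr / libcudf default
pages = math.ceil(rows / max_rows_per_page)

# With a single page the decode runs in one threadblock; with 50 pages
# libcudf can schedule 50 threadblocks in parallel.
print(pages)  # → 50
```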
In summary, we're seeing a pretty fixed cost of 50ms for CUDA setup, 190ms for cuFile setup, and around 10ms for cuDF setup (there are some buffer initializations and a stream pool to set up). Python adds its own latency too, and there's the actual file I/O to take into account. So the minimum time you'll see for a single file read is going to be something over 60ms. If you can read files in batches and amortize the startup penalty, you'll still only see performance on par with arrow for such small files. As you've discovered already, to really see the benefit you need files that are large enough to move the bottleneck from setup/IO to the actual compute kernels that will show the parallelization benefit.
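A back-of-envelope sketch of the amortization argument, using the rough numbers above (~60ms one-time setup with cuFile off, and the ~2.5ms warm per-file read seen for this file; both numbers are from this particular machine):

```python
setup_ms = 60.0      # one-time CUDA + cuDF initialization (cuFile disabled)
per_file_ms = 2.5    # warm per-file read time observed for the 1M-row file

def avg_read_ms(n_files):
    """Average per-file cost when the setup is amortized over a batch."""
    return (setup_ms + n_files * per_file_ms) / n_files

print(avg_read_ms(1))    # → 62.5
print(avg_read_ms(100))  # → 3.1
```

With one file the setup dominates; spread over 100 files the per-file cost approaches the warm read time, which is roughly where arrow sits for files this small.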
from cudf.
read_jnac_parquet_and_nsys_file.zip
To see what's happening under the hood: the GPU kernel timeline is very sparse, but the CUDA API calls take very long...
There might be some trick to avoid the long cudaHostAlloc call on the first run, which I haven't figured out yet; the code below probably only handles GPU-side memory pre-allocation.
import rmm
# rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
I also tried 1M rows of all-identical integers and a 1M-row column of all-identical strings; cudf.read_parquet still suffers the same performance issue, very likely due to the long cudaMallocHost call.
df = pandas.DataFrame({'j2333c': [2333] * 1000000})
df.to_parquet('/dev/shm/j2333c.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> # timing is not precise, but the difference is large enough that more accurate timing isn't needed for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.08919477462768555
>>>
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.026215314865112305
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.014030933380126953
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> import rmm
>>>
>>> rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>>
>>> # timing is not precise, but the difference is large enough that more accurate timing isn't needed for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.08475613594055176
>>>
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.025774002075195312
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.011544227600097656
df = pandas.DataFrame({'jstrc': ['2333'] * 1000000})
df.to_parquet('/dev/shm/jstrc.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> import rmm
>>>
>>> # rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>>
>>> # timing is not precise, but the difference is large enough that more accurate timing isn't needed for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.08581995964050293
>>>
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.057205915451049805
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.022694826126098633
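Since the one-liner timings above are admittedly rough, a slightly more careful helper could look like this (a sketch using time.perf_counter, which has better resolution than time.time for short intervals; `time_call` is a hypothetical name):

```python
import time

def time_call(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls to fn()."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()  # note: kernels launched here are measured only up to API return
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

# e.g. time_call(lambda: cudf.read_parquet('j2333c.parquet'))
```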
Well, since GPUs are throughput machines, increasing the row count from millions to a billion shows the advantage clearly:
>>> import cudf
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'jnac': [None] * 1000000000})
>>> df.to_parquet('/dev/shm/jnac.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> # timing is not precise, but the difference is large enough that more accurate timing isn't needed for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
0.15029525756835938
>>>
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
7.30379843711853
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
1.51247239112854
>>>
So now the main problem is how to resolve the issue for chunked tables at the millions-of-rows scale.