Comments (9)
Looked into this but I was unable to reproduce the issue on my machine. The reader seems to be working fine, as seen in the screenshot below.
*(screenshot: chunked reader completing without error)*
---
I did notice in `nvidia-smi` that we see a transient GPU memory spike of 21.4 GiB, which quickly goes down and saturates at around 10 GiB. I am assuming that this 21.4 GiB transient might be the culprit behind the OOM. It doesn't fail on my machine because my GPUs are otherwise free and can handle the transient. Here is my test code:
```python
from cudf._lib.parquet import ParquetReader

def test_parquet_chunked_reader_oom():
    reader = ParquetReader(["/home/coder/datasets/lineitem.parquet"],
                           chunk_read_limit=24000000)
    while reader._has_next():
        chunk = reader._read_chunk()
```
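For reference, one way to observe that transient peak from inside the process is to poll NVML (the same counters `nvidia-smi` reports) while the read runs. This is a minimal sketch of my own, assuming the `pynvml` package is installed; it is not part of the original test:

```python
import threading
import time

import pynvml

def watch_gpu_memory(stop_event, device_index=0, interval_s=0.05):
    # Poll NVML and track the peak used memory on the device.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    print(f"peak GPU memory: {peak / 2**30:.1f} GiB")

# Run the watcher alongside the chunked read from the test above.
stop = threading.Event()
watcher = threading.Thread(target=watch_gpu_memory, args=(stop,))
watcher.start()
test_parquet_chunked_reader_oom()  # defined in the snippet above
stop.set()
watcher.join()
```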
---
> I did notice in `nvidia-smi` that we see a transient GPU memory spike of 21.4 GiB and it quickly goes down and saturates at around 10 GiB. I am assuming that this 21.4 GiB transient might be the culprit behind the OOM.
Exactly, I noticed this too on a T4, and since T4s can't handle that much memory we end up with an OOM there as well.
---
Thank you @mhaseeb123 and @galipremsagar for testing this. I don't believe the C++ API encounters this memory spike in the `_has_next()` function. @mhaseeb123 would you please check? @nvdbaranec would you please share your thoughts?
---
First question: the code I'm seeing is just `read_parquet()`, not the chunked reader:

```python
In [4]: table = pa.Table.from_pandas(pd.read_parquet("lineitem.parquet"))
```
The regular reader is implemented in terms of the chunked reader, but with no limits set, i.e. infinite sizes. So if you're just using that, OOMs are absolutely possible.
If this code is somehow using the chunked reader, note that there are two parameters:
- the output chunk limit, which limits the total size in bytes of each output table chunk, but does nothing to control the memory usage of the decode process.
- the input chunk limit, which limits how much (temporary) memory will be used during the decoding process.
They can be set independently, but only the input chunk limit will work to keep OOMs down.
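To make the distinction concrete, here is a rough sketch of passing both limits through cudf's internal `ParquetReader` once the `pass_read_limit` parameter from the patch further down exists; the path and byte values are placeholders, and this is an internal Cython wrapper rather than a public API:

```python
from cudf._lib.parquet import ParquetReader

reader = ParquetReader(
    ["lineitem.parquet"],  # placeholder path
    # Output chunk limit: caps the size of each table chunk returned
    # to the caller, but not decode-time temporary memory.
    chunk_read_limit=24_000_000,
    # Input (pass) read limit: bounds the temporary memory used while
    # decoding, which is the knob that actually keeps OOMs down.
    pass_read_limit=16_000_000_000,
)
while reader._has_next():
    chunk = reader._read_chunk()
```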
---
Thank you @nvdbaranec, yes we are trying to use the chunked reader here in Python. It looks like we might not be setting the "input chunk limit".
---
Sorry, I misread. I thought the first block of code was where the bug was. It is odd that the one that uses the chunked reader directly would fail. There should be no difference between the two in overall memory usage, but maybe that small chunk value specified (24 MB) is throwing something for a loop.
In this case, I would not expect the input limit to make a difference since it clearly loads in the non-chunked case.
---
The following test code and the patch for #15728 make things smooth again:
```python
from cudf._lib.parquet import ParquetReader

def test_parquet_chunked_reader_oom():
    # pass_read_limit set to 16 GB here, but smaller values also work
    reader = ParquetReader(["/home/coder/datasets/lineitem.parquet"],
                           chunk_read_limit=24000000, pass_read_limit=16384000000)
    table = []
    while reader._has_next():
        chunk = reader._read_chunk()
        # table = table + chunk  # concatenation not needed for testing
```
```diff
diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index aa18002fe1..14c1d00c06 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -763,6 +763,7 @@ cdef class ParquetWriter:
 cdef class ParquetReader:
     cdef bool initialized
     cdef unique_ptr[cpp_chunked_parquet_reader] reader
+    cdef size_t pass_read_limit
     cdef size_t chunk_read_limit
     cdef size_t row_group_size_bytes
     cdef table_metadata result_meta
@@ -781,7 +782,7 @@ cdef class ParquetReader:
     def __cinit__(self, filepaths_or_buffers, columns=None, row_groups=None,
                   use_pandas_metadata=True,
-                  Expression filters=None, int chunk_read_limit=1024000000):
+                  Expression filters=None, size_t chunk_read_limit=1024000000, size_t pass_read_limit=0):
         # Convert NativeFile buffers to NativeFileDatasource,
         # but save original buffers in case we need to use
@@ -831,9 +832,10 @@ cdef class ParquetReader:
         self.allow_range_index &= filters is None
         self.chunk_read_limit = chunk_read_limit
+        self.pass_read_limit = pass_read_limit
         with nogil:
-            self.reader.reset(new cpp_chunked_parquet_reader(chunk_read_limit, args))
+            self.reader.reset(new cpp_chunked_parquet_reader(chunk_read_limit, pass_read_limit, args))
         self.initialized = False
         self.row_groups = row_groups
         self.filepaths_or_buffers = filepaths_or_buffers
```
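One detail worth noting, based on my reading of libcudf rather than anything stated in the patch itself: `cpp_chunked_parquet_reader` treats `pass_read_limit=0` as "no limit", so the new default preserves the previous unbounded decode behavior, and callers have to pass a nonzero value (as in the test above) to actually cap decode-time memory.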
---
The issue has been resolved by using `pass_read_limit`. Thanks @mhaseeb123 & @nvdbaranec!