Comments (2)
I've looked into this a bit more and found that TFIO can read some parquet files with compression.
Specifically, it is able to read parquet files created by fastparquet v2024.02.0.
However, it is unable to read parquet files with large compressed byte columns created by pyarrow or via Spark.
I say "large" because I don't know the exact point where the data corruption starts. I have observed that with 10K rows of 50 bytes each the read is successful, while with 10K rows of 500 bytes each the read returns corrupted data.
import random
import string

import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio

NUM_EXAMPLES = 1024 * 10

def make_bytes_list(num, length=None, maxlen=100):
    # Fixed-length strings when `length` is given, otherwise random lengths in [0, maxlen].
    lengths = [length] * num if length is not None else [random.randint(0, maxlen) for _ in range(num)]
    strings = [''.join(random.choices(string.ascii_lowercase, k=n)) for n in lengths]
    return [s.encode() for s in strings]

strings = make_bytes_list(NUM_EXAMPLES, length=492)
out_path_pa_compress = "/tmp/out.pa.gz"
df = pd.DataFrame(strings, columns=['str_f'])
# Write a gzip-compressed parquet file with pyarrow, then read it back with TFIO.
df.to_parquet(out_path_pa_compress, compression='gzip', engine='pyarrow')
ds_pa_recover = tfio.IODataset.from_parquet(out_path_pa_compress, columns={"str_f": tf.string})
# Convert each recovered tensor to bytes so it compares directly with the input.
str_pa_recover = [ex['str_f'].numpy() for ex in ds_pa_recover]
print("Recovered all rows:", len(str_pa_recover) == len(strings))
print("Recovered equal bytes size:", sum(len(rec) for rec in str_pa_recover) == sum(len(s) for s in strings))
print("All strings equal:", str_pa_recover == strings)
print("First mismatch at:", [rec == s for rec, s in zip(str_pa_recover, strings)].index(False))
This produces the following output:
Recovered all rows: True
Recovered equal bytes size: True
All strings equal: False
First mismatch at: 4096
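Since the exact length at which corruption starts is unknown (somewhere between 50 and 500 bytes per row), one way to narrow it down is a bisection over the row length. The sketch below is only an illustration: `first_failing_length` and `roundtrip_ok` are hypothetical names, and it assumes that failures are monotonic in the row length, which has not been verified. In practice `roundtrip_ok(n)` would wrap the write/read script above with `length=n` and return whether all strings came back equal.

```python
from typing import Callable


def first_failing_length(roundtrip_ok: Callable[[int], bool], lo: int = 50, hi: int = 500) -> int:
    """Return the smallest length in (lo, hi] for which roundtrip_ok is False.

    Assumes roundtrip_ok(lo) is True, roundtrip_ok(hi) is False, and that
    once a length fails every larger length fails too (an unverified assumption).
    """
    if not roundtrip_ok(lo) or roundtrip_ok(hi):
        raise ValueError("expected a successful read at lo and a failing read at hi")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if roundtrip_ok(mid):
            lo = mid  # still reads back correctly, so the boundary is above mid
        else:
            hi = mid  # corrupted, so the boundary is at or below mid
    return hi


# Stand-in predicate for demonstration only; the threshold 300 is made up.
def fake_roundtrip_ok(length: int) -> bool:
    return length < 300


print(first_failing_length(fake_roundtrip_ok))
```

If the failures turn out not to be monotonic, a plain linear scan over the lengths would be the safer choice.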
The data read error doesn't seem to be in the tensorflow_io Python layer. This call goes to the C++ parquet reader, which returns erroneous data.
To step through the C++ code, I was able to attach gdb to my Python process and hit a breakpoint inside ParquetReadableResource::Read, but the debugger could not step through the function; it complains of missing line number information.
Any notes or pointers on how to get gdb working with the tensorflow_io C++ core_ops would be really appreciated. I've tried the development build instructions with the addition of --compilation_mode=dbg to the bazel invocation to build a .so, followed by building and installing the Python wheel. After installing the built-from-source wheel, gdb can no longer find the ParquetReadableRead class to set the breakpoint. I have verified that the symbols in the compiled .so include the mangled class name, so I'm not sure why gdb is unable to find the names.
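One thing that may be worth ruling out is whether the Python process actually loads the freshly built .so rather than another copy on disk. The sketch below is only a rough aid under stated assumptions: it is Linux-only (it parses the `/proc/<pid>/maps` format), the helper names are hypothetical, and the `"tensorflow_io"` substring used to pick out the library paths is a guess at how the files are named.

```python
from typing import List


def loaded_libraries(maps_text: str, needle: str) -> List[str]:
    """Return the sorted, de-duplicated mapped file paths containing `needle`.

    `maps_text` is the content of /proc/<pid>/maps, where each line has the
    fields: address, perms, offset, dev, inode and an optional pathname.
    """
    paths = set()
    for line in maps_text.splitlines():
        parts = line.split(None, 5)
        if len(parts) == 6 and needle in parts[5]:
            paths.add(parts[5].strip())
    return sorted(paths)


def libraries_in_current_process(needle: str = "tensorflow_io") -> List[str]:
    """Inspect the running interpreter; call this after importing tensorflow_io."""
    with open("/proc/self/maps") as f:
        return loaded_libraries(f.read(), needle)
```

Calling `libraries_in_current_process()` in the same session, after `import tensorflow_io` and after the parquet dataset has been created, lists the paths to compare against the .so that was built with --compilation_mode=dbg.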