Comments (25)
The folder does not have a metadata file in it, which is what fastparquet expects. In #95, we implemented the ability to read from a set of files without an explicit metadata file, but this functionality isn't in dask yet. You would have to either use dask.delayed to read each file and combine them into a single dataframe with dask.dataframe.from_delayed, or use fastparquet.writer.merge to create the metadata file from the data files.
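For illustration, a minimal sketch of the dask.delayed route, assuming your files match a hypothetical mydir.parq/part*.parq pattern:

import glob

import dask.dataframe as dd
import fastparquet
from dask import delayed

@delayed
def load_one(fn):
    # each task lazily reads one parquet file into a pandas dataframe
    return fastparquet.ParquetFile(fn).to_pandas()

filelist = sorted(glob.glob('mydir.parq/part*.parq'))
df = dd.from_delayed([load_one(fn) for fn in filelist])

from_delayed only builds the task graph; nothing is read until the dask dataframe is computed.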
Thanks for letting me know. I will try both options, or seek an alternative route. Apache Drill does not have any option for specifying metadata during parquet file creation.
I could not find any example for fastparquet.writer.merge. Can you kindly point me to the correct location? Thanks.
Detailed documentation should be written indeed, but I find the docstring approachable. Here is a simple example:

import fastparquet, glob

filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)

This will write the file mydir.parq/_metadata, so that you can then load the whole dataset with:

pf = fastparquet.ParquetFile('mydir.parq')
df = pf.to_pandas(...)
But as you can see from #95, you can now skip one line and read the set of files directly:
pf = fastparquet.ParquetFile(filelist)
This throws an exception:
pf = ParquetFile(filelist)
The first version works.
Are you certain?
Something like this:
filelist = glob.glob('parquet/parquet_large55/*.parquet')
pf = ParquetFile(filelist)
Yes, I am.
In that case, please open a separate issue, with full details of your experience.
BTW, reading 5GB of parquet files using this method is almost as slow as reading HDF5 directly. I am starting to question the whole point of using fastparquet, since it is not really fast as far as I can tell.
I hope the dask version will be faster (once it works, I will check it). I have to process around 50GB every day.
You cannot really improve on the raw rate of loading binary packed data for standard types like float64. The documentation gives some hints on how to improve performance (categories, nulls, partitioning, etc), depending on what your data is like. The great strength of parquet is not the packing of the data, but being able to read only the columns and chunks that you actually need.
Yes, dask can load your chunks of data in parallel, but you will still have bottlenecks around disc IO and total memory footprint, plus some overhead for the scheduler. If you run with threads, there will be thread contention since only some operations release the GIL, and if you run with processes, then you may have serialization costs, depending on what you do with the data.
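To make the column-selection point concrete, here is a hedged sketch (the column names are hypothetical); fastparquet reads only the requested columns from disk:

import fastparquet

pf = fastparquet.ParquetFile('mydir.parq')
# only 'id' and 'value' are decoded; all other columns are never touched
df = pf.to_pandas(columns=['id', 'value'], categories=['id'])

The filters= argument of to_pandas can similarly skip whole row-groups based on their statistics.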
I'm having a similar error, such that I can't do it in one line:
$ python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from fastparquet import ParquetFile
>>> from glob import glob
>>> import os
>>> ec2_path = os.path.expanduser("~")
>>> trainData = ParquetFile(glob(ec2_path + '/trainData/part-*.parquet')).to_pandas()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/fastparquet/api.py", line 45, in __init__
fn2 = sep.join([fn, '_metadata'])
TypeError: sequence item 0: expected string, list found
I suspect you are using an older version of fastparquet. Can you please update from git master and try again?
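For reference, installing from git master typically looks something like this (assuming the current dask/fastparquet repository location):

pip install git+https://github.com/dask/fastparquet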
I'll schedule a release to conda-forge soon.
import fastparquet, glob
filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)
Can we do this with an S3 filesystem? With two different parquet file schemas?
Yes, I believe it should work with s3fs:

import fastparquet, s3fs

s3 = s3fs.S3FileSystem(...)
filelist = s3.glob('mybucket/mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist, open_with=s3.open)

You cannot merge parquet datasets with different schemas.
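Once the _metadata file exists, the merged dataset can then presumably be loaded straight from S3 (a sketch reusing the s3 object above):

pf = fastparquet.ParquetFile('mybucket/mydir.parq', open_with=s3.open)
df = pf.to_pandas()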
@martindurant
Thanks for the info. This might sound silly, but do I need enough RAM to merge two S3 files? If I have to merge two 300 MB parquet files (10 GB as JSON files, 8 GB as a dataframe), is it mandatory to have more than 10 GB of RAM?
Merging parquet files does not load them into memory, but rather loads their metadata (which is a relatively small amount of binary at the end of the file) and creates a common metadata file with pointers to the original files.
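As a rough illustration of how little data is involved: a parquet file ends with its Thrift-encoded footer metadata, a 4-byte little-endian footer length, and the magic bytes PAR1. A minimal sketch for a local file:

import struct

def footer_size(path):
    # read the last 8 bytes: <4-byte footer length><b"PAR1">
    with open(path, 'rb') as f:
        f.seek(-8, 2)
        length, magic = struct.unpack('<I4s', f.read(8))
    assert magic == b'PAR1', 'not a parquet file'
    return length  # size of the footer metadata in bytes

This footer, not the column data, is what merge reads from each file.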
@martindurant
Is the behaviour the same if the two files are written with the following option enabled?

file_scheme="simple"
merge does not have such an option. merge basically produces a metadata file from the metadata contained in the files you supply. Is it not behaving as intended for you?
parquet_file_path = []  # the list of glob patterns must be initialised first
for par_file_path in result_dir_structure_list:
    parquet_file_path.append(par_file_path.strip() + "*.parquet")

all_paths_from_s3 = []
for s3_path in parquet_file_path:
    all_paths_from_s3 += fs.glob(path=s3_path)
print(all_paths_from_s3)

myopen = s3.open
pf = fastparquet.ParquetFile(all_paths_from_s3, open_with=myopen)

I'm facing the issue below; can you suggest what the root cause is?
Traceback (most recent call last):
File "pr_file.py", line 80, in <module>
main()
File "pr_file.py", line 14, in main
historic_dir=get_historic_dir(bucket,prefix,2.5,input_date)
File "pr_file.py", line 70, in get_historic_dir
pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
fs=fs)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
pf0 = api.ParquetFile(f0, open_with=open_with)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
raise ValueError("No files in dir")
ValueError: No files in dir
It would be best to first check that all_paths_from_s3 does indeed have the paths of your data files.
These are the parquet files that I got in all_paths_from_s3:
['odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/20220531.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/abc.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/xyz.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-05/userdata1.parquet']
But there are a few corner cases here:
- The first entry, 20220531.parquet, is actually a folder, not a file, and it contains no parquet files.
- Only userdata1.parquet and userdata2.parquet contain data.
- xyz.parquet and abc.parquet are empty parquet files.
Error:

pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
fs=fs)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
pf0 = api.ParquetFile(f0, open_with=open_with)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
raise ValueError("No files in dir")
ValueError: No files in dir
Indeed, you should only pass valid parquet data files (not directories) to fastparquet.
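In a situation like this, one could pre-filter the glob results before handing them over; a hedged sketch using fsspec's isfile and size methods on the same fs object as above:

valid_paths = [
    p for p in all_paths_from_s3
    if fs.isfile(p) and fs.size(p) > 0  # skip directories and zero-byte placeholders
]
pf = fastparquet.ParquetFile(valid_paths, open_with=fs.open)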