
Comments (25)

martindurant commented on August 14, 2024

The folder does not have a metadata file in it, which is what fastparquet expects.
In #95, the ability to read from a set of files without an explicit metadata file was implemented, but this functionality isn't in dask yet. You would have to either use dask.delayed to read each file and combine them into a single dataframe with dask.dataframe.from_delayed, or use fastparquet.writer.merge to create the metadata file from the data files.
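
For the dask.delayed route, a minimal sketch (assuming your files match 'mydir.parq/*/part*.parq') might look like:

import glob
import dask
import dask.dataframe as dd
import fastparquet

@dask.delayed
def load_one(fn):
    # Read one parquet file into a pandas DataFrame
    return fastparquet.ParquetFile(fn).to_pandas()

parts = [load_one(fn) for fn in glob.glob('mydir.parq/*/part*.parq')]
df = dd.from_delayed(parts)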


commented on August 14, 2024

Thanks for letting me know. I will try both options, or seek an alternative route. Apache Drill does not have any options for specifying metadata during parquet file creation.


commented on August 14, 2024

Could not find any example for fastparquet.writer.merge.
Can you kindly point me to the correct location?
Thanks.


martindurant commented on August 14, 2024

Detailed documentation should indeed be written, but I find the docstring approachable. Here is a simple example:

import fastparquet, glob
filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)

will write the file mydir.parq/_metadata, so that you can then load the whole dataset with

pf = fastparquet.ParquetFile('mydir.parq')
df = pf.to_pandas(...)


martindurant commented on August 14, 2024

But as you can see from #95, you can now skip one line and read the set of files directly:

pf = fastparquet.ParquetFile(filelist)


commented on August 14, 2024

This throws an exception:
pf = ParquetFile(filelist)

The first version works.


martindurant commented on August 14, 2024

Are you certain?

Something like this should work:

filelist = glob.glob('parquet/parquet_large55/*.parquet')
pf = ParquetFile(filelist)


commented on August 14, 2024

Yes, I am.


martindurant commented on August 14, 2024

In that case, please open a separate issue, with full details of your experience.


commented on August 14, 2024

BTW, reading 5 GB of parquet files using this method is almost as slow as reading HDF5 directly. I am starting to question the whole point of using fastparquet, since it is not really fast as far as I can tell.
I hope the Dask version will be faster (once it works, I will check it). I have to process around 50 GB every day.


martindurant commented on August 14, 2024

You cannot really improve on the raw rate of loading binary packed data for standard types like float64. The documentation gives some hints on how to improve performance (categories, nulls, partitioning, etc), depending on what your data is like. The great strength of parquet is not the packing of the data, but being able to read only the columns and chunks that you actually need.
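
For example, column selection is a one-liner (the column names here are hypothetical):

import fastparquet

pf = fastparquet.ParquetFile('mydir.parq')
# Only the listed columns are read from disk; all other column chunks are skipped
df = pf.to_pandas(columns=['a', 'b'])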

Yes, dask can load your chunks of data in parallel, but you will still have bottlenecks around disc IO and total memory footprint, plus some overhead for the scheduler. If you run with threads, there will be thread contention since only some operations release the GIL, and if you run with processes, then you may have serialization costs, depending on what you do with the data.
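
As an illustration, with dask.dataframe you can choose the scheduler per compute call (a hedged sketch using the scheduler= keyword from newer dask releases; 'a' is a hypothetical column):

import dask.dataframe as dd

df = dd.read_parquet('mydir.parq')  # needs the _metadata file discussed above
result = df['a'].mean()
result.compute(scheduler='threads')    # shared memory, but GIL contention
result.compute(scheduler='processes')  # avoids the GIL, but pays serialization costs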


data-steve commented on August 14, 2024

I'm having a similar error, such that I can't do it in one line:

$ python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from fastparquet import ParquetFile
>>> from glob import glob
>>> import os
>>> ec2_path = os.path.expanduser("~")
>>> trainData = ParquetFile(glob(ec2_path + '/trainData/part-*.parquet')).to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/fastparquet/api.py", line 45, in __init__
    fn2 = sep.join([fn, '_metadata'])
TypeError: sequence item 0: expected string, list found


martindurant commented on August 14, 2024

I suspect you are using an older version of fastparquet. Can you please update from git master and try again?


data-steve commented on August 14, 2024


martindurant commented on August 14, 2024

I'll schedule a release to conda-forge soon.


fsck-mount commented on August 14, 2024

@martindurant

import fastparquet, glob
filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)

Can we do this with S3FileSystem?

And with 2 different parquet file schemas?


martindurant commented on August 14, 2024

Yes, I believe it should work with s3fs:

import fastparquet
import s3fs

s3 = s3fs.S3FileSystem(...)
filelist = s3.glob('mybucket/mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist, open_with=s3.open)

You cannot merge parquet datasets with different schemas.


fsck-mount commented on August 14, 2024

@martindurant
Thanks for the info.
This might sound silly: do I need to have enough RAM to merge two S3 files?

If I have to merge two 300 MB parquet files (10 GB as JSON files, 8 GB as a DataFrame), is it mandatory to have more than 10 GB of RAM?


martindurant commented on August 14, 2024

Merging parquet files does not load them into memory, but rather loads their metadata (which is a relatively small amount of binary at the end of the file) and creates a common metadata file with pointers to the original files.
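
So, after a merge, something like this should let you open the whole dataset lazily (a sketch; bucket and paths are assumptions):

import fastparquet
import s3fs

s3 = s3fs.S3FileSystem()
# The directory now contains _metadata pointing at the original data files
pf = fastparquet.ParquetFile('mybucket/mydir.parq', open_with=s3.open)
df = pf.to_pandas()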


fsck-mount commented on August 14, 2024

@martindurant
Is it the same behaviour if the two files are written with the following option enabled?

file_scheme="simple"


martindurant commented on August 14, 2024

merge does not have such an option. merge basically produces a metadata file from the metadata contained in the files you supply. Is it not behaving as intended for you?


navnee22 commented on August 14, 2024

@martindurant

parquet_file_path = []  # (initialization inferred from the .append below)
for par_file_path in result_dir_structure_list:
    parquet_file_path.append(par_file_path.strip() + "*.parquet")
# print(parquet_file_path)
all_paths_from_s3 = []
for s3_path in parquet_file_path:
    all_paths_from_s3 += fs.glob(path=s3_path)
print(all_paths_from_s3)
myopen = s3.open
pf = fastparquet.ParquetFile(all_paths_from_s3, open_with=myopen)

I am facing the issue below; can you suggest what the root cause is?

Traceback (most recent call last):
  File "pr_file.py", line 80, in <module>
    main()
  File "pr_file.py", line 14, in main
    historic_dir=get_historic_dir(bucket,prefix,2.5,input_date)
  File "pr_file.py", line 70, in get_historic_dir
    pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
    fs=fs)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
    pf0 = api.ParquetFile(f0, open_with=open_with)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
    raise ValueError("No files in dir")
ValueError: No files in dir


martindurant commented on August 14, 2024

It would be best to first check that all_paths_from_s3 does indeed have the paths of your data files.


navnee22 commented on August 14, 2024

@martindurant

These are the parquet files that I got in all_paths_from_s3:
['odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/20220531.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/abc.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/xyz.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-05/userdata1.parquet']

But there are some corner cases:

  1. The first entry, 20220531.parquet, is a folder, not a file, and it contains no parquet files.
  2. Only userdata1.parquet and userdata2.parquet contain data.
  3. xyz.parquet and abc.parquet are empty parquet files.

Error:

pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)

File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
  fs=fs)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
  pf0 = api.ParquetFile(f0, open_with=open_with)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
  raise ValueError("No files in dir")
ValueError: No files in dir


martindurant commented on August 14, 2024

Indeed, you should only pass valid parquet data files (not directories) to fastparquet.
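
A hedged sketch of pre-filtering the path list before handing it to fastparquet (fs is the s3fs.S3FileSystem instance from your snippet; screening on size is an assumption about what "empty" means here):

valid_paths = [
    p for p in all_paths_from_s3
    if p.endswith('.parquet')  # keep only parquet-named entries
    and fs.isfile(p)           # drops directories like 20220531.parquet/
    and fs.size(p) > 0         # drops zero-byte files
]
pf = fastparquet.ParquetFile(valid_paths, open_with=fs.open)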

