
Comments (25)

martindurant commented on August 14, 2024

The folder does not have a metadata file in it, which is what fastparquet expects.
In #95, the ability to read from a set of files without an explicit metadata file was implemented, but this functionality isn't in dask yet. You would have to either use dask.delayed to read each file and combine them into a single dataframe with dask.dataframe.from_delayed, or use fastparquet.writer.merge to create the metadata file from the data files.
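
For the dask.delayed route, a minimal sketch (assuming your files match 'mydir.parq/*/part*.parq') might look like:

import glob
import dask
import dask.dataframe as dd
import fastparquet

@dask.delayed
def load_one(fn):
    # Read one parquet file into a pandas DataFrame
    return fastparquet.ParquetFile(fn).to_pandas()

parts = [load_one(fn) for fn in glob.glob('mydir.parq/*/part*.parq')]
df = dd.from_delayed(parts)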


commented on August 14, 2024

Thanks for letting me know. I will try both options, or seek an alternative route. Apache Drill does not have any options for specifying metadata during parquet file creation.


commented on August 14, 2024

Could not find any example for fastparquet.writer.merge.
Can you kindly point me to the correct location?
Thanks.


martindurant commented on August 14, 2024

Detailed documentation should indeed be written, but I find the docstring approachable. Here is a simple example:

import fastparquet, glob
filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)

will write the file mydir.parq/_metadata, so that you can then load the whole dataset with

pf = fastparquet.ParquetFile('mydir.parq')
df = pf.to_pandas(...)


martindurant commented on August 14, 2024

But as you can see from #95, you can now skip one line and read the set of files directly:

pf = fastparquet.ParquetFile(filelist)


commented on August 14, 2024

This throws an exception:
pf = ParquetFile(filelist)

The first version works.


martindurant commented on August 14, 2024

Are you certain?

Something like this should work:

filelist = glob.glob('parquet/parquet_large55/*.parquet')
pf = ParquetFile(filelist)


commented on August 14, 2024

Yes, I am.


martindurant commented on August 14, 2024

In that case, please open a separate issue, with full details of your experience.


commented on August 14, 2024

BTW, reading 5 GB of parquet files using this method is almost as slow as reading HDF5 directly. I am starting to question the whole point of using fastparquet, since it is not really fast as far as I can tell.
I hope the Dask version will be faster (once it works, I will check it). I have to process around 50 GB every day.


martindurant commented on August 14, 2024

You cannot really improve on the raw rate of loading binary packed data for standard types like float64. The documentation gives some hints on how to improve performance (categories, nulls, partitioning, etc), depending on what your data is like. The great strength of parquet is not the packing of the data, but being able to read only the columns and chunks that you actually need.
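
For example, column selection is a one-liner (the column names here are hypothetical):

import fastparquet

pf = fastparquet.ParquetFile('mydir.parq')
# Only the listed columns are read from disk; all other column chunks are skipped
df = pf.to_pandas(columns=['a', 'b'])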

Yes, dask can load your chunks of data in parallel, but you will still have bottlenecks around disc IO and total memory footprint, plus some overhead for the scheduler. If you run with threads, there will be thread contention since only some operations release the GIL, and if you run with processes, then you may have serialization costs, depending on what you do with the data.
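
As an illustration, with dask.dataframe you can choose the scheduler per compute call (a hedged sketch using the scheduler= keyword from newer dask releases; 'a' is a hypothetical column):

import dask.dataframe as dd

df = dd.read_parquet('mydir.parq')  # needs the _metadata file discussed above
result = df['a'].mean()
result.compute(scheduler='threads')    # shared memory, but GIL contention
result.compute(scheduler='processes')  # avoids the GIL, but pays serialization costs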


data-steve commented on August 14, 2024

I'm having a similar error, such that I can't do it in one line:

$ python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from fastparquet import ParquetFile
>>> from glob import glob
>>> import os
>>> ec2_path = os.path.expanduser("~")
>>> trainData = ParquetFile(glob(ec2_path + '/trainData/part-*.parquet')).to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/fastparquet/api.py", line 45, in __init__
    fn2 = sep.join([fn, '_metadata'])
TypeError: sequence item 0: expected string, list found


martindurant commented on August 14, 2024

I suspect you are using an older version of fastparquet. Can you please update from git master and try again?


data-steve commented on August 14, 2024


martindurant commented on August 14, 2024

I'll schedule a release to conda-forge soon.


fsck-mount commented on August 14, 2024

@martindurant

import fastparquet, glob
filelist = glob.glob('mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist)

Can we do this with S3FileSystem?

And with 2 different parquet file schemas?


martindurant commented on August 14, 2024

Yes, I believe it should work with s3fs:

import fastparquet
import s3fs

s3 = s3fs.S3FileSystem(...)
filelist = s3.glob('mybucket/mydir.parq/*/part*.parq')
fastparquet.writer.merge(filelist, open_with=s3.open)

You cannot merge parquet datasets with different schemas.


fsck-mount commented on August 14, 2024

@martindurant
Thanks for the info.
This might sound silly: do I need to have enough RAM to merge two S3 files?

If I have to merge two 300 MB parquet files (10 GB as JSON files, 8 GB as a DataFrame), is it mandatory to have more than 10 GB of RAM?


martindurant commented on August 14, 2024

Merging parquet files does not load them into memory, but rather loads their metadata (which is a relatively small amount of binary at the end of the file) and creates a common metadata file with pointers to the original files.
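
So, after a merge, something like this should let you open the whole dataset lazily (a sketch; bucket and paths are assumptions):

import fastparquet
import s3fs

s3 = s3fs.S3FileSystem()
# The directory now contains _metadata pointing at the original data files
pf = fastparquet.ParquetFile('mybucket/mydir.parq', open_with=s3.open)
df = pf.to_pandas()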


fsck-mount commented on August 14, 2024

@martindurant
Is it the same behaviour if the two files are written with the following option enabled?

file_scheme="simple"


martindurant commented on August 14, 2024

merge does not have such an option. merge basically produces a metadata file from the metadata contained in the files you supply. Is it not behaving as intended for you?


navnee22 commented on August 14, 2024

@martindurant

parquet_file_path = []  # (initialization inferred from the .append below)
for par_file_path in result_dir_structure_list:
    parquet_file_path.append(par_file_path.strip() + "*.parquet")
# print(parquet_file_path)
all_paths_from_s3 = []
for s3_path in parquet_file_path:
    all_paths_from_s3 += fs.glob(path=s3_path)
print(all_paths_from_s3)
myopen = s3.open
pf = fastparquet.ParquetFile(all_paths_from_s3, open_with=myopen)

I am facing the issue below; can you suggest what the root cause is?

Traceback (most recent call last):
  File "pr_file.py", line 80, in <module>
    main()
  File "pr_file.py", line 14, in main
    historic_dir=get_historic_dir(bucket,prefix,2.5,input_date)
  File "pr_file.py", line 70, in get_historic_dir
    pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
    fs=fs)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
    pf0 = api.ParquetFile(f0, open_with=open_with)
  File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
    raise ValueError("No files in dir")
ValueError: No files in dir


martindurant commented on August 14, 2024

It would be best to first check that all_paths_from_s3 does indeed have the paths of your data files.


navnee22 commented on August 14, 2024

@martindurant

These are the parquet files that I got in all_paths_from_s3:
['odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/20220531.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/abc.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-01/xyz.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata1.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-02/userdata2.parquet',
'odp-us-innovation-ds-start/start-aisphere/training-error-log-data/MR/USCAN/main-model-hist-sr-model/dl_load_date=2022-01-05/userdata1.parquet']

But there are some corner cases:

  1. The first entry, 20220531.parquet, is a folder, not a file, and it contains no parquet files.
  2. Only userdata1.parquet and userdata2.parquet contain data.
  3. xyz.parquet and abc.parquet are empty parquet files.

Error:

pf = fp.ParquetFile(all_paths_from_s3, open_with=myopen)

File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 122, in __init__
  fs=fs)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/util.py", line 185, in metadata_from_many
  pf0 = api.ParquetFile(f0, open_with=open_with)
File "/usr/local/lib64/python3.7/site-packages/fastparquet/api.py", line 157, in __init__
  raise ValueError("No files in dir")
ValueError: No files in dir


martindurant commented on August 14, 2024

Indeed, you should only pass valid parquet data files (not directories) to fastparquet.
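
A hedged sketch of pre-filtering the path list before handing it to fastparquet (fs is the s3fs.S3FileSystem instance from your snippet; screening on size is an assumption about what "empty" means here):

valid_paths = [
    p for p in all_paths_from_s3
    if p.endswith('.parquet')  # keep only parquet-named entries
    and fs.isfile(p)           # drops directories like 20220531.parquet/
    and fs.size(p) > 0         # drops zero-byte files
]
pf = fastparquet.ParquetFile(valid_paths, open_with=fs.open)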

