pyorc's People

Contributors

blkerby, dbaxa, dirtysalt, noirello, odidev

pyorc's Issues

Not able to install pyorc on linux machine present on GCP

Error:
linux-x86_64-3.7/src/_pyorc/_pyorc.o -DVERSION_INFO="0.3.0" -std=c++17 -fvisibility=hidden
In file included from src/_pyorc/_pyorc.cpp:1:
src/_pyorc/Reader.h:7:10: fatal error: orc/OrcFile.hh: No such file or directory
#include "orc/OrcFile.hh"
^~~~~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------

Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-l9pn60e0/pyorc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-gxlj1vmw/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-install-l9pn60e0/pyorc/

Issue with date time conversion?


TypeError Traceback (most recent call last)
in
22 with pyorc.Writer(data, "struct<TERM_ID:double,LAST_UPDATE_DATE:timestamp,LAST_UPDATED_BY:double,CREATION_DATE:timestamp,CREATED_BY:double,LAST_UPDATE_LOGIN:double,NAME:string,ENABLED_FLAG:string,DUE_CUTOFF_DAY:double,DESCRIPTION:string,TYPE:string,START_DATE_ACTIVE:timestamp,END_DATE_ACTIVE:timestamp,RANK:double,ATTRIBUTE_CATEGORY:string,ATTRIBUTE1:string,ATTRIBUTE2:string,ATTRIBUTE3:string,ATTRIBUTE4:string,ATTRIBUTE5:string,ATTRIBUTE6:string,ATTRIBUTE7:string,ATTRIBUTE8:string,ATTRIBUTE9:string,ATTRIBUTE10:string,ATTRIBUTE11:string,ATTRIBUTE12:string,ATTRIBUTE13:string,ATTRIBUTE14:string,ATTRIBUTE15:string,LANGUAGE:string,SOURCE_LANG:string>") as writer:
23 for row in df.itertuples():
---> 24 writer.write(row)
25 writer.close()

/usr/local/anaconda/lib/python3.7/site-packages/pyorc/converters.py in to_orc(obj)
29 @staticmethod
30 def to_orc(obj: datetime) -> Tuple[int, int]:
---> 31 return int(obj.replace(microsecond=0).timestamp()), obj.microsecond * 1000
32
33

TypeError: replace() takes no keyword arguments
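
A possible reading of this traceback, offered as an assumption because the DataFrame contents are not shown: on the Python 3.7 in the traceback, "replace() takes no keyword arguments" is the message str.replace produces when it is called with a keyword, which would mean a string (not a datetime) reached the timestamp converter. Note also that df.itertuples() puts the index first by default, so every value may be shifted by one column unless index=False is passed. The sketch below copies the converter expression from the traceback and feeds it both kinds of value; try_convert is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timezone
from typing import Tuple


def to_orc(obj: datetime) -> Tuple[int, int]:
    # Same expression as the converter line shown in the traceback.
    return int(obj.replace(microsecond=0).timestamp()), obj.microsecond * 1000


def try_convert(value):
    """Return the converted pair, or the TypeError raised for unsuitable input."""
    try:
        return to_orc(value)
    except TypeError as exc:
        return exc


# A real datetime converts to (seconds, nanoseconds).
ok = try_convert(datetime(2020, 1, 1, tzinfo=timezone.utc))

# A string has a replace() method too, but it rejects microsecond=0.
bad = try_convert("2020-01-01 00:00:00")
```

If that is the cause, parsing the string columns into datetime objects before calling writer.write should avoid the error.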

memory leak problem

When I read two ORC files one after the other, the memory used by the first reader does not seem to be released. I found no __del__() in the Reader. How can I release the memory?
`
import os
import sys
import pyorc
from memory_profiler import profile
import gc


@profile
def main():
    # Read part of the first file, then drop every reference to it.
    fr = open("000748_0", "rb")
    r = pyorc.reader
    reader = r.Reader(fr)
    for i in range(100):
        pass
    a = reader.read(1000)
    b = reader.read(20000)
    c = reader.read(10000)
    del reader
    del r
    del a, b, c
    fr.close()
    gc.collect()

    # Open the second file and iterate over all of its rows.
    fr = open("001751_0", "rb")
    r = pyorc.reader
    reader = r.Reader(fr)
    cnt = len(reader)
    for i in range(cnt):
        a = next(reader)
    c = [1] * (10 ** 6)


if __name__ == '__main__':
    main()`

Conda?

This would be potentially usable by cuDF if it were uploaded to Anaconda/conda-forge. Could this be done soon?

Saving a Pandas Dataframe to orc

Hi,

First of all, cheers, a much needed library!

Could you perhaps add an example on how to save a Pandas dataframe to an ORC file using pyorc?

I'm not sure how to go about it.

Thanks,
Eli

Possible to control filesize ?

Hello,

Is there an easy way to get the current file size while the writer is running? I thought about tracemalloc, but that would only be a rough estimate. Thanks in advance.
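
One way to approach this, sketched under assumptions: wrap the output file in a small object that counts the bytes passed to write(). CountingFile is a hypothetical helper, not part of pyorc, and it has not been tested against pyorc.Writer here. The count only covers data that has actually been flushed to the file object, so rows still buffered in memory are not included.

```python
import io


class CountingFile:
    """Wrap a binary file object and count the bytes written through it."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_written = 0

    def write(self, data):
        written = self._raw.write(data)
        # Some file objects return None from write(); fall back to len(data).
        self.bytes_written += len(data) if written is None else written
        return written

    def __getattr__(self, name):
        # Delegate everything else (flush, close, seek, ...) to the real file.
        return getattr(self._raw, name)


# Stand-alone demonstration with an in-memory file.
raw = io.BytesIO()
counted = CountingFile(raw)
counted.write(b"abc")
counted.write(b"de")
counted.flush()
```

The wrapper would be passed to the writer in place of the raw file, and bytes_written checked between writes to decide when to start a new file.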

pyorc.errors.ParseError: Footer is corrupt: types(1701470799) not exists

When using pyorc with tensorflow (i.e., importing both in the same script) I get the footer is corrupt error. My investigations got me to the issue with different versions of protobuf. Tensorflow and some other packages rely on newer versions of protobuf and this causes the crash. Has anyone had this problem? Is there any workaround for this type of issue?

predicate to skip rows doesn't seem to work for timestamps

I have a script with a predicate that should skip rows whose timestamp is <= some value, but it does not seem to skip anything.

Here is an excerpt of my code:

ttime = datetime.datetime.strptime('2021-12-08 00:00:00 +00:00','%Y-%m-%d %H:%M:%S %z')
pred = pyorc.PredicateColumn(pyorc.TypeKind.TIMESTAMP, name="timestamp") > ttime
reader = pyorc.Reader(data,predicate=pred)

Resulting rows don't seem to skip anything at all.

can't find '__main__' module

Hi,

We're trying to create very large files using pyorc and it's way too slow, so I tried using pypy3. However, I get this strange error when installing the pyorc dependency. Only pyorc has this problem; everything else loads fine.

# pypy3 -m pip install pyorc
pypy3 wouldn't find modules
pypy3 -m pip install pyorc
Collecting pyorc
Using cached pyorc-0.6.0.tar.gz (54 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/pypy3 /usr/share/python-wheels/pep517-0.8.2-py2.py3-none-any.whl/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpd7wwooav
cwd: /tmp/pip-install-mki2amwp/pyorc
Complete output (1 lines):
/usr/bin/pypy3: can't find '__main__' module in '/usr/share/python-wheels/pep517-0.8.2-py2.py3-none-any.whl/pep517/_in_process.py'
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/pypy3 /usr/share/python-wheels/pep517-0.8.2-py2.py3-none-any.whl/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpd7wwooav Check the logs for full command output.

Missing /etc/localtime

When I run pyorc inside a Docker image that likely has no timezone configured, it complains about a missing timezone. Is there a workaround? Docker images without a local timezone are pretty common.

[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'Traceback (most recent call last):\n'
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'  File "/usr/local/bin/mapid_enrichment", line 8, in <module>\n'
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'    sys.exit(main())\n'
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.7/dist-packages/tasks/mapid_enrichment/app.py", line 229, in main\n'
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b"    orc_json = read_orc_in_json(f'{ETL_TEMP_DIR}/{s3_file}')\n"
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.7/dist-packages/tasks/mapid_enrichment/app.py", line 168, in read_orc_in_json\n'
[2020-06-10 01:39:48,258] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,258] {pod_launcher.py:125} INFO - b'    reader = pyorc.Reader(orc_data)\n'
[2020-06-10 01:39:48,259] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,259] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.7/dist-packages/pyorc/reader.py", line 67, in __init__\n'
[2020-06-10 01:39:48,259] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,259] {pod_launcher.py:125} INFO - b'    fileo, batch_size, column_indices, column_names, struct_repr, conv\n'
[2020-06-10 01:39:48,259] {logging_mixin.py:112} INFO - [2020-06-10 01:39:48,259] {pod_launcher.py:125} INFO - b"RuntimeError: Can't open /etc/localtime\n"

Line 67 is the C++ constructor call:

 66         super().__init__(
 67             fileo, batch_size, column_indices, column_names, struct_repr, conv

Perhaps, if /etc/localtime is not found, UTC can be assumed.

Issue with installation in ubuntu

The package will not build on ubuntu 18.04. The error seems to be related to missing files in _pyorc directory. The error message is this:

In file included from src/_pyorc/_pyorc.cpp:1: src/_pyorc/Reader.h:7:10: fatal error: 'orc/OrcFile.hh' file not found #include "orc/OrcFile.hh"

I could install the library on Mac (which I guess it was using wheel) but my docker container with ubuntu 18.04 keeps failing.

Reader can filter

I'd like to read only the data which matches some criteria but I don't want to implement code handling Stripes and filtering on every project.

Could it be implemented in the Reader?

Writer does not support field names with special characters

I'm seeing that field names containing special characters lead to an error when constructing a Writer. Here is an example, with a field name "abc.def" containing a period:

import pyorc
import io

schema = pyorc.Struct(**{"abc.def": pyorc.Int()})
writer = pyorc.Writer(io.BytesIO(), schema)

It leads to this error:

RuntimeError: Unrecognized character.

It appears that what is happening is that the TypeDescription in Python is getting converted into a string (here struct<abc.def:int>) and then parsed on the C++ side, but in this process there is no way to quote/escape the field names so that special characters could pass through.

I am thinking we could solve this by instead creating the C++ Type directly from the Python TypeDescription by walking over it recursively, using the C++ orc functions/methods like createPrimitiveType, createStructType, addChildType, addStructField, etc. Basically this would be an inverse of createTypeDescription in the pyorc reader. This would require more code than the existing solution but may be more robust than using a string as an intermediate representation here. I'd be happy to work on this and create a pull request for it, if this approach seems good?

Memory leak in Writer?

Hello! Thanks for pyorc; using it has been a pleasure so far, with the exception that we seem to be running into memory issues. I think Writer is leaking memory? Our workload is roughly:

  • Open ~100 writers to different files
  • Iterate over our input rows (in the millions) and send each row to exactly one writer
  • Close all writers
  • Repeat

Memory usage will grow without bound between iterations. This, coupled with the fact that lowering the stripe size all the way down to 1M has no effect, makes me suspect a memory leak. Below is a script that will reproduce -- around iteration 10 it gets to 20G and then killed by the OOM killer on my machine. Let me know if there's anything I can do to help track it down!

https://gist.github.com/JohnEmhoff/274f6e05cba3f17a16683eb394bfe6b5

set_metadata() casts to str()

This line incorrectly casts the value from py::bytes to py::str. If the data is in fact binary, this will fail; I believe the value should instead be cast directly to a std::string.

writer->addUserMetadata(py::cast<std::string>(key), py::str(value));

save orc contents to csv

Hi

Is there a way to save ORC file contents to CSV?
I want to save a_reader to a CSV file.

import pyorc

with open("./2.orc", "rb") as data:
    a_reader = pyorc.Reader(data)
    print(type(a_reader))
    i = 0
    for row in a_reader:
        i += 1
        if i < 10:
            payload = row[1]
            print(type(payload))
            #print(row[1])
            print("")

thanks.
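
A minimal sketch of one way to do this with the standard csv module. It relies only on what the snippet above already shows, namely that iterating the reader yields one row at a time; rows_to_csv is a hypothetical helper, and the header is passed in explicitly because how to obtain column names from the reader is not shown in this thread. Nested values (structs, lists, maps) would be written using their Python string form.

```python
import csv
import io


def rows_to_csv(rows, header, out):
    """Write a header and then every row from `rows` to the text file `out`.

    `rows` can be any iterable of tuples. Returns the number of data rows.
    """
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Stand-alone demonstration with plain tuples instead of an ORC reader.
buffer = io.StringIO(newline="")
written = rows_to_csv([(1, "a"), (2, "b")], ["id", "name"], buffer)
csv_text = buffer.getvalue()
```

With a real file, the output would be opened with open("out.csv", "w", newline="") and a_reader passed as rows.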

protocol-buffer collides with another libraries

Hey @noirello, thanks for the awesome lib

I have found an issue, which is really hard to debug:

import aioprometheus  # import protobuf under the hood
import pyorc

with open("./new_data.orc", "wb") as data:
    with pyorc.Writer(data, "struct<col0:int,col1:string>") as writer:
        writer.write((1, "ORC from Python"))

will fail in two different ways randomly

[libprotobuf FATAL /Users/runner/runners/2.163.1/work/1/s/deps/orc-1.6.2/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/message_lite.cc:71] CHECK failed: (bytes_produced_by_serialization) == (byte_size_before_serialization): Byte size calculation and serialization were inconsistent.  This may indicate a bug in protocol buffers or it may be caused by concurrent modification of orc.proto.Footer.
Traceback (most recent call last):
  File "crash.py", line 6, in <module>
    writer.write((1, "ORC from Python"))
  File "XXXX/env/lib/python3.7/site-packages/pyorc/writer.py", line 66, in __exit__
    super().close()
RuntimeError: CHECK failed: (bytes_produced_by_serialization) == (byte_size_before_serialization): Byte size calculation and serialization were inconsistent.  This may indicate a bug in protocol buffers or it may be caused by concurrent modification of orc.proto.Footer.
Segmentation fault: 11
python(50048,0x1094b35c0) malloc: *** error for object 0x7f834747a0f8: pointer being freed was not allocated
python(50048,0x1094b35c0) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Is there any chance that you will be able to take a look at it?

Footer is corrupt: malformed link from type 0 to 0

I'm getting the following error when reading the ORC file:
File "/Users/fritzbudiyanto/.local/share/virtualenvs/workflows-66nlkjZ7/lib/python3.7/site-packages/pyorc/reader.py", line 67, in __init__
fileo, batch_size, column_indices, column_names, struct_repr, conv
pyorc.errors.ParseError: Footer is corrupt: malformed link from type 0 to 0

pyorc.errors.ParseError: Footer is corrupt: types(272536112) not exists

The file can be read properly with the Java ORC client. How can I debug this further?

Missing ORC 1.6.3 version on apache download

Apache has released a newer version of ORC (1.6.4) and moved the older source releases to the archive repository, which causes an HTTP 404 failure while building the ORC binaries,
i.e. python setup.py build_orc fails.

Old URL: https://downloads.apache.org/orc/orc-1.6.3/orc-1.6.3.tar.gz
Moved to: https://archive.apache.org/dist/orc/orc-1.6.3/orc-1.6.3.tar.gz

$ python setup.py build_orc
running build_orc
Build ORC C++ Core library
error: HTTP Error 404: Not Found
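
A sketch of a fallback strategy, not a description of how pyorc's setup.py actually works: try the primary download location first and fall back to the archive URL quoted above. The helper names orc_source_urls and first_available are hypothetical, and the download function is injected so the logic can be exercised without network access.

```python
def orc_source_urls(version):
    """Return candidate download URLs for an ORC source release, newest host first."""
    name = f"orc-{version}"
    return [
        f"https://downloads.apache.org/orc/{name}/{name}.tar.gz",
        f"https://archive.apache.org/dist/orc/{name}/{name}.tar.gz",
    ]


def first_available(urls, fetch):
    """Return (url, payload) for the first URL that `fetch` can retrieve."""
    last_error = None
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # urllib.error.HTTPError is a subclass of OSError
            last_error = exc
    raise RuntimeError(f"no mirror served the archive: {last_error}")


def fake_fetch(url):
    # Stand-in for a real download: the primary host answers 404.
    if "downloads.apache.org" in url:
        raise OSError("HTTP Error 404: Not Found")
    return b"tarball"


urls = orc_source_urls("1.6.3")
chosen_url, payload = first_available(urls, fake_fetch)
```

In real use, fetch could be a function that opens the URL with urllib.request.urlopen and returns the response body.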

Retrieve loaded data as pandas.DataFrame

I think this library could be a great alternative to pyarrow and pyspark for reading and writing ORC files without requiring a big library just for that (not to mention the current issues that pyarrow is having, and the overhead of loading a SparkSession just to read/write data). To make it usable in most data processing applications, an easy connection with Pandas (the most popular library for local processing of tabular data) would be convenient.
I'm new to this library, but I was evaluating two options:

  1. A method that indicates the data should be retrieved as a pandas.DataFrame (this requires adding pandas as a dependency, which may not be desired)
  2. Just an example in the documentation that shows interested users how an ORC file can be loaded as a pandas.DataFrame.

So far, I've solved that through this snippet (from what I could understand of the library):

import pandas as pd
import pyorc 
 
path_to_data =  "path/to/data.orc"

with open(path_to_data, "rb") as f:
    reader = pyorc.Reader(f) 
    columns = reader.schema.fields
    # sort by column id to ensure correct order (since "fields" is a dict, order may not be correct)
    columns = [y for x, y in sorted([(reader.schema.find_column_id(c), c) for c in columns])] 
    df = pd.DataFrame(reader, columns=columns)

Unable to write array data type

I tried writing an array of strings and got the exception below:

    with open("/tmp/data.orc", "wb") as data:
        with pyorc.Writer(data, "struct<col0:array<string>>") as writer:
            writer.write((["a:b", "c:d"]))

Here is the trace back:

Traceback (most recent call last):
  File "/Users/fritzbudiyanto/vagrant/ubuntu/workflows/tasks/mapid_enrichment/app.py", line 120, in <module>
    main()
  File "/Users/fritzbudiyanto/vagrant/ubuntu/workflows/tasks/mapid_enrichment/app.py", line 114, in main
    writer.write((["a:b", "c:d"]))
TypeError: Item ['a:b', 'c:d'] is not an instance of tuple
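
The error message says the item is a list rather than a tuple, which matches how Python parses the call: parentheses alone do not create a tuple, a trailing comma does. The sketch below only demonstrates that language rule; that the writer accepts the one-element tuple form is an assumption based on the error text.

```python
# Parentheses without a comma are just grouping, so this is still a list.
not_a_row = (["a:b", "c:d"])

# A trailing comma makes a one-element tuple whose only field is the list.
row = (["a:b", "c:d"],)

kinds = (type(not_a_row).__name__, type(row).__name__)
```

Under that assumption, the call in the snippet would become writer.write((["a:b", "c:d"],)).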

Cannot install pyorc on Mac M1

First of all, thank you very much for this amazing package, it's super useful!

I'm afraid I'm unable to install it on MacOS 12.3.1 Apple chip, via python 3.10.4:

source /Users/ocarmi/PycharmProjects/shield/venv/bin/activate
pip install pyorc
Collecting pyorc
  Using cached pyorc-0.6.0.tar.gz (54 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pyorc
  Building wheel for pyorc (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for pyorc (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      running bdist_wheel
      running build
      running build_py
      running egg_info
      writing pyorc.egg-info/PKG-INFO
      writing dependency_links to pyorc.egg-info/dependency_links.txt
      writing requirements to pyorc.egg-info/requires.txt
      writing top-level names to pyorc.egg-info/top_level.txt
      reading manifest file 'pyorc.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching '*.css' under directory 'docs'
      warning: no previously-included files matching '*' found under directory 'docs/_build'
      adding license file 'LICENSE'
      running build_ext
      error: HTTP Error 404: Not Found
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyorc
Failed to build pyorc
ERROR: Could not build wheels for pyorc, which is required to install pyproject.toml-based projects

Allow whitespace and newline in ORC schema

Right now, the ORC schema must be specified in a single line without any whitespace, for example:

schema1 = """struct<col0:int,col1:string,col2:struct<col3:int,col4:string,col5:array<float>>,col6:map<string,int>,col7:bigint,col8:boolean,col9:timestamp>"""

Otherwise the module will throw an error. It would be really great to allow whitespace and newlines, for example:

schema1 = """struct<
  col0:int, 
  col1:string,
  col2:struct<
    col3:int,
    col4:string,
    col5:array<float>>,
  col6:map<string,int>,
  col7:bigint,
  col8:boolean,
  col9:timestamp>"""
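
Until something like this is supported, a possible workaround is to strip the whitespace before passing the schema to the writer. This is a sketch with a hypothetical helper name (compact_schema), and it assumes that no field name legitimately contains whitespace, since such whitespace would be removed too.

```python
def compact_schema(schema):
    """Remove all whitespace (spaces, tabs, newlines) from a schema string."""
    return "".join(schema.split())


pretty = """struct<
  col0:int,
  col2:struct<
    col3:int,
    col5:array<float>>,
  col6:map<string,int>>"""

compact = compact_schema(pretty)
```

The compacted string has the same single-line form as the first example above and could be passed wherever the schema string is expected.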

CompressionKind Issue

When I try to change CompressionKind from ZLIB to SNAPPY:

import pyorc

# Assumed schema: the original snippet did not show how `schema` was defined.
schema = "struct<col0:int,col1:string>"
tuples = (1, "hii")
with open("/path/toOrcFile", "wb") as writeData:
    with pyorc.Writer(writeData, schema, compression=pyorc.CompressionKind.SNAPPY) as orcWriter:
        orcWriter.write(tuples)

It throws this error:
File "/PycharmProjects/orcWriter/venv/lib/python3.7/site-packages/pyorc/writer.py", line 59, in __init__
conv,
RuntimeError: compression codec

Problem of building from the source: C++ 11 vs 14

When we try to build from the 0.3.0 source, there are issues with handling compiler options (C++ 11 compiler is installed on our Linux hosts), but the package has a bug in testing the feature. Then, the package has an issue with 2 build tasks, where one needs another, but they are not scheduled in order.
Finally, there is one line in the code that uses C++ 14 feature, despite the package claiming that it supports C++ 11. By using some #ifdef checks in the C++ code we are able to put an equivalent C++ 11 implementation in the code and after that the package builds on both Linux and macOS (where compiler actually uses C++ 17 features).

orc Minimum is error?

Write an ORC file:

import zoneinfo
from datetime import datetime

import pyorc

output = open("/tmp/new.orc", "wb")
writer = pyorc.Writer(output, "struct<col0:int,col1:string,col2:timestamp>", timezone=zoneinfo.ZoneInfo('Asia/Shanghai'))
t = datetime(2022, 4, 10, 9, 30, 30, 0)
writer.write((100, "1000", t))
writer.close()

Then inspect it with orc-statistics:

--- Column 3 ---
Data type: Timestamp
Values: 1
Has null: no
Minimum: 2022-04-10 03:30:58.800
LowerBound: 2022-04-10 03:30:58.800
Maximum: 2022-04-10 03:30:58.800
UpperBound: 2022-04-10 03:30:58.801
./orc-1.7.3/build/tools/src/orc-contents /tmp/new.orc
{"col0": 100, "col1": "1000", "col2": "2022-04-10 11:30:30.0"}

Why are Minimum and Maximum shown as 2022-04-10 03:30:58.800?

Metadata support?

Seems like it's not possible to read user metadata, would be nice to have such a feature. Thanks!

Question: pyorc.StructRepr difference between Tuple and Dict?

Hi,

Can you explain the difference in pyorc.StructRepr between Tuple and Dict?
Is the only difference how Writer.write accepts the values as parameters?
If so, can you please provide an example with a dict? Are the keys then the column names and the values the contents of the corresponding cells?

thank you.
Very nice library btw.: I use it to provide an exporter to Orc for the Scrapy framework under: https://github.com/zuinnote/scrapy-contrib-bigexporters

best regards
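
To illustrate the question, here is the same row in both shapes, using only plain Python. Another issue on this page passes struct_repr=pyorc.StructRepr.DICT together with a list of dicts keyed by field name, which suggests the dict form looks like row_as_dict below; the helper names and the two-column schema are assumptions made for the example.

```python
field_names = ["col0", "col1"]  # assumed schema: struct<col0:int,col1:string>


def to_dict_row(names, row):
    """Turn a positional row into a mapping from field name to value."""
    return dict(zip(names, row))


def to_tuple_row(names, row):
    """Turn a mapping back into a positional row, ordered by the schema."""
    return tuple(row[name] for name in names)


row_as_tuple = (1, "ORC from Python")
row_as_dict = to_dict_row(field_names, row_as_tuple)
round_trip = to_tuple_row(field_names, row_as_dict)
```

In the tuple form, values are matched to fields by position; in the dict form, they are matched by key.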

Support for missing values in integer types

Does the writer support writing null values to integer types? I know ORC supports null in all types but I can't seem to figure out how to make it work with the writer. A reproducible example is something like this:

import io
import random

import pandas as pd
import pyorc

# create empty dataframe
orc_sample = pd.DataFrame()
# make an int8 column (equivalent to tinyint)
orc_sample['int8_col'] = [random.randrange(0, 128, 2) for i in range(0, 10)]
# replicate that and add a missing value to create a second column
orc_sample['Int8_col'] = orc_sample['int8_col'].astype('Int8')
orc_sample.loc[6, 'Int8_col'] = None
# then write
strct = "struct<" + "int8_col:tinyint,Int8_col:tinyint" + ">"
bytefile = io.BytesIO()
with pyorc.Writer(bytefile,
                  strct,
                  struct_repr=pyorc.StructRepr.DICT,
                  compression=pyorc.CompressionKind.ZLIB) as writer:
    writer.writerows(orc_sample.to_dict(orient="records"))

I keep getting: Item cannot be cast to long int at struct field name 'Int8_col'
I'm wondering if anyone has successfully written integer types with missing values?! Or if there is any trick to make it work?!
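
One thing that might be worth trying, stated as an assumption and not verified here: the records produced from a nullable integer column may carry a pandas missing-value marker instead of None, and the test quoted in the dates issue on this page suggests None is what the writer treats as null. The sketch below replaces NaN floats and any caller-supplied sentinels with None before writing; normalise_record is a hypothetical helper, and a stand-in object is used in place of the pandas marker so the example runs without pandas.

```python
import math


def normalise_record(record, extra_missing=()):
    """Return a copy of `record` with missing-value markers replaced by None.

    NaN floats are always treated as missing; any object in `extra_missing`
    (compared by identity) is treated as missing as well.
    """
    clean = {}
    for key, value in record.items():
        is_nan = isinstance(value, float) and math.isnan(value)
        is_sentinel = any(value is marker for marker in extra_missing)
        clean[key] = None if (is_nan or is_sentinel) else value
    return clean


MISSING = object()  # stand-in for a library-specific marker such as pd.NA
record = {"int8_col": 4, "Int8_col": MISSING, "ratio": float("nan")}
cleaned = normalise_record(record, extra_missing=(MISSING,))
```

With pandas, the call would pass extra_missing=(pd.NA,) for each record returned by to_dict(orient="records").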

fatal error C1083: Cannot open include file: 'orc/OrcFile.hh': No such file or directory

Hello,

Does anyone have a solution for this: fatal error C1083: Cannot open include file: 'orc/OrcFile.hh': No such file or directory?
It seems to be a widespread issue that many are facing.

Help is appreciated.
thanks.

building 'pyorc._pyorc' extension
creating build\temp.win-amd64-3.8
creating build\temp.win-amd64-3.8\Release
creating build\temp.win-amd64-3.8\Release\src
creating build\temp.win-amd64-3.8\Release\src\_pyorc
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ic:\a\python\lib\site-packages\pybind11\include -Ic:\a\python\lib\site-packages\pybind11\include -Ideps/include/ -Ic:\a\python\include -Ic:\a\python\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tpsrc/_pyorc\_pyorc.cpp /Fobuild\temp.win-amd64-3.8\Release\src/_pyorc\_pyorc.obj /EHsc /DVERSION_INFO="0.3.0"
_pyorc.cpp
c:\users\narasas\appdata\local\temp\pip-install-aokjbm44\pyorc_2ceb6efca0c442ed869ea875a6fb7421\src\_pyorc\Reader.h(7): fatal error C1083: Cannot open include file: 'orc/OrcFile.hh': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\a\python\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\narasas\AppData\Local\Temp\pip-install-aokjbm44\pyorc_2ceb6efca0c442ed869ea875a6fb7421\setup.py'"'"'; __file__='"'"'C:\Users\narasas\AppData\Local\Temp\pip-install-aokjbm44\pyorc_2ceb6efca0c442ed869ea875a6fb7421\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\narasas\AppData\Local\Temp\pip-record-ns_oflt3\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\a\python\Include\pyorc' Check the logs for full command output.

Issue with dates

I like pyorc's interface. It's been easy to use. However, I'm running into a slight problem. If I try to write a date, then when I read the date, it is one day off. If I run the test_write_primitive_type test, then it fails on dates as well:

@pytest.mark.parametrize("orc_type,values", TESTDATA)
def test_write_primitive_type(orc_type, values):
    data = io.BytesIO()
    writer = Writer(data, orc_type)
    for rec in values:
        writer.write(rec)
    writer.close()

    data.seek(0)
    reader = Reader(data)
    if orc_type == "float":
        result = reader.read()
        assert len(result) == len(values)
        for res, exp in zip(result, values):
            if exp is None:
                assert res is None
            else:
                assert math.isclose(res, exp, rel_tol=1e-07, abs_tol=0.0)
    else:
      assert reader.read() == values

E assert [datetime.dat...2019, 11, 10)] == [datetime.dat...2019, 11, 11)]
E At index 0 diff: datetime.date(1909, 12, 7) != datetime.date(1909, 12, 8)
E Use -v to get the full diff

I encountered this problem in python 3.6 on a CentOS server, however, I replicated on a Mac using python 3.7.

Segmentation fault caused on reading files [negative scenario]

Hi,

I tried a few different Python ORC packages and found pyorc to be the best. I was testing pyorc against both the positive and the negative scenario files from the Apache ORC project. For the negative scenario files I expected an appropriate exception message, but pyorc causes a segmentation fault. Could you please look into this?

import pyorc

with open("./missing_blob_stream_in_string_dict.orc", "rb") as data:
    reader = pyorc.Reader(data)
    for row in reader:
        print(row)

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

The files below cause a segmentation fault when read by pyorc. Some details taken from the Apache ORC project may help:

missing_blob_stream_in_string_dict.orc => ORC-591 => [C++] Check missing blob stream for StringDictionaryColumnRe…

missing_length_stream_in_string_dict.orc => ORC-590 => [C++] added check for missing LENGTH stream in StringDiction…

negative_dict_entry_lengths.orc => ORC-589 => [C++] add checks about negative dictionary entry lengths

stripe_footer_bad_column_encodings.orc => ORC-580 => [C++] Verify ColumnEncodings in StripeFooter (#463)

Rec skips during sequential reads

Not sure if this is a result of deleted records in my ORC file, but when reading the AWS planet.orc (OpenStreetMap) file sequentially, row by row, the first column (row[0]) does not increment by 1 in all cases. Here is a sample output from my script: the "Rec:" value is from the ORC data (row[0]) and the "Recs:" value is simply an incremented counter.

Opening ORC Source: s3://osm-pds/planet/planet-latest.orc
Reading ORC data
Schema: struct<id:bigint,type:string,tags:map<string,string>,lat:decimal(9,7),lon:decimal(10,7),nds:array<struct<ref:bigint>>,members:array<struct<type:string,ref:bigint,role:string>>,changeset:bigint,timestamp:timestamp,uid:bigint,user:string,version:bigint,visible:boolean>
Stripes: 3085
Rows: 8222531885
Lengths: {'content_length': 88272336575, 'file_footer_length': 47320, 'file_postscript_length': 27, 'file_length': 88273129466, 'stripe_statistics_length': 745543}
Stripes 3085
Stripe Len 3085
Test SB True: True
Test SB True: True
Test SB False: False
Rec:1 Recs:1 Time:2021-07-10 00:53:52+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:0
Rec:2 Recs:2 Time:2021-05-18 09:03:40+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:1
Rec:3 Recs:3 Time:2021-10-19 17:46:10+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:2
Rec:10 Recs:4 Time:2020-04-13 13:27:47+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:3
Rec:54 Recs:5 Time:2021-09-20 09:43:13+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:4
Rec:100 Recs:6 Time:2021-01-06 19:54:29+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:5
Rec:110 Recs:7 Time:2018-07-21 22:01:43+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:6
Rec:111 Recs:8 Time:2020-03-29 17:27:08+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:7
Rec:112 Recs:9 Time:2016-09-18 15:36:55+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:8
Rec:113 Recs:10 Time:2018-07-21 22:01:43+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:9
Rec:114 Recs:11 Time:2021-01-05 21:02:32+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:10

Apple Silicon Support?

Do we support Apple Silicon (the new M1-chip Apple devices)? The installation seems to fail for some reason.

$ sw_vers
ProductName:	macOS
ProductVersion:	12.2
BuildVersion:	21D49

$ pip3 install pyorc
Collecting pyorc
  Using cached pyorc-0.5.0.tar.gz (52 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pyorc
  Building wheel for pyorc (pyproject.toml) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/dongjoon/.pyenv/versions/3.9.9/bin/python3.9 /Users/dongjoon/.pyenv/versions/3.9.9/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/xn/by0ddv_s7sd1gx235zj5vh300000gn/T/tmpawez8m7x
       cwd: /private/var/folders/xn/by0ddv_s7sd1gx235zj5vh300000gn/T/pip-install-l_0nvejb/pyorc_0c21ca5b99ed4e18834ea7834d18bb84
  Complete output (30 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-12.2-arm64-3.9
  creating build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/enums.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/predicates.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/__init__.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/reader.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/typedescription.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/writer.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/converters.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  copying src/pyorc/errors.py -> build/lib.macosx-12.2-arm64-3.9/pyorc
  running egg_info
  warning: no files found matching '*.css' under directory 'docs'
  warning: no previously-included files matching '*' found under directory 'docs/_build'
  writing manifest file 'pyorc.egg-info/SOURCES.txt'
  running build_ext
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Users/dongjoon/.pyenv/versions/3.9.9/include/python3.9 -c flagcheck.cpp -o flagcheck.o -std=c++17
  creating build/temp.macosx-12.2-arm64-3.9
  creating build/temp.macosx-12.2-arm64-3.9/src
  creating build/temp.macosx-12.2-arm64-3.9/src/_pyorc
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -Ideps/include -I/private/var/folders/xn/by0ddv_s7sd1gx235zj5vh300000gn/T/pip-build-env-2dqyrqg8/overlay/lib/python3.9/site-packages/pybind11/include -I/Users/dongjoon/.pyenv/versions/3.9.9/include/python3.9 -c src/_pyorc/Converter.cpp -o build/temp.macosx-12.2-arm64-3.9/src/_pyorc/Converter.o -std=c++17 -mmacosx-version-min=10.14 -fvisibility=hidden -g0 -stdlib=libc++
  In file included from src/_pyorc/Converter.cpp:3:
  src/_pyorc/Converter.h:6:10: fatal error: 'orc/OrcFile.hh' file not found
  #include "orc/OrcFile.hh"
           ^~~~~~~~~~~~~~~~
  1 error generated.
  error: command '/usr/bin/clang' failed with exit code 1
  ----------------------------------------
  ERROR: Failed building wheel for pyorc
Failed to build pyorc
ERROR: Could not build wheels for pyorc, which is required to install pyproject.toml-based projects
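
The compiler command in the log above passes -Ideps/include, and the build stops because orc/OrcFile.hh cannot be found. Below is a minimal sketch for checking whether that header is present before starting a source build. It assumes the script is pointed at the root of a pyorc source checkout and that the ORC C++ headers are expected under deps/include, as the -Ideps/include flag suggests. The helper name orc_header_present is hypothetical and not part of pyorc.

```python
from pathlib import Path


def orc_header_present(source_root: str = ".") -> bool:
    """Return True if orc/OrcFile.hh exists under <source_root>/deps/include.

    Assumption: the ORC C++ headers live in deps/include, as implied by the
    -Ideps/include flag in the failing compiler command.
    """
    header = Path(source_root) / "deps" / "include" / "orc" / "OrcFile.hh"
    return header.is_file()


if __name__ == "__main__":
    if orc_header_present():
        print("orc/OrcFile.hh found under deps/include")
    else:
        print("orc/OrcFile.hh is missing under deps/include")
```

If the check reports the header as missing, the compile step will likely fail in the same way as in the log, regardless of platform.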

Add support to release Linux aarch64 wheels

Problem

On aarch64, ‘pip install pyorc’ builds the wheel from source, which requires several dependencies to be installed. The source build fails with the error below:

In file included from src/_pyorc/_pyorc.cpp:1: 
  src/_pyorc/Reader.h:7:10: fatal error: orc/OrcFile.hh: No such file or directory 
      7 | #include "orc/OrcFile.hh" 
        |          ^~~~~~~~~~~~~~~~ 
  compilation terminated. 
  error: command 'aarch64-linux-gnu-gcc' failed with exit status 1 
  ---------------------------------------- 
  ERROR: Failed building wheel for pyorc 

Resolution

On aarch64, ‘pip install pyorc’ should download a prebuilt wheel from PyPI.

@noirello Please let me know if you are interested in releasing aarch64 wheels. I can help with this.

Needs C++ compiler to pip install on Windows

On Windows 10 with Python 3.8, pip install pyorc fails with:

Collecting pyorc
  Using cached pyorc-0.2.0.tar.gz (39 kB)
Collecting pybind11>=2.4
  Using cached pybind11-2.5.0-py2.py3-none-any.whl (296 kB)
Installing collected packages: pybind11, pyorc
    Running setup.py install for pyorc ... error
    ERROR: Command errored out with exit status 1:
     command: '...\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'...\\pyorc\\setup.py'"'"'; __file__='"'"'...\\pip-install-av36s2u9\\pyorc\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record '...\pip-record-87c7z330\install-record.txt' --single-version-externally-managed --compile --install-headers 
...
    building 'pyorc._pyorc' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
    ----------------------------------------
ERROR: Command errored out with exit status 1: 
...

As I cannot install Microsoft Visual C++, this keeps me from using pyorc.

Got installation error - Trying to install on Alpine

Hi

Has anyone successfully installed pyorc in an Alpine Docker container?
I am getting the error below.

/ # python3 --version
Python 3.8.7

/ # gcc --version
gcc (Alpine 10.2.1_pre1) 10.2.1 20201203
Copyright (C) 2020 Free Software Foundation, Inc.

/ # pip install pyorc==0.3.0
Collecting pyorc==0.3.0
Using cached pyorc-0.3.0.tar.gz (41 kB)
Requirement already satisfied: pybind11>=2.5 in /usr/lib/python3.8/site-packages (from pyorc==0.3.0) (2.6.1)
Using legacy 'setup.py install' for pyorc, since package 'wheel' is not installed.
Installing collected packages: pyorc
Running setup.py install for pyorc ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5dn_w2gy/pyorc_f3b26ecd51574b30bc2257a2e63b7d5e/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5dn_w2gy/pyorc_f3b26ecd51574b30bc2257a2e63b7d5e/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-wrlz7rvf/install-record.txt --single-version-externally-managed --compile --install-headers /usr/include/python3.8/pyorc
cwd: /tmp/pip-install-5dn_w2gy/pyorc_f3b26ecd51574b30bc2257a2e63b7d5e/
Complete output (42 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/writer.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/converters.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/reader.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/errors.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/enums.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/typedescription.py -> build/lib.linux-x86_64-3.8/pyorc
copying src/pyorc/__init__.py -> build/lib.linux-x86_64-3.8/pyorc
running egg_info
writing pyorc.egg-info/PKG-INFO
writing dependency_links to pyorc.egg-info/dependency_links.txt
writing requirements to pyorc.egg-info/requires.txt
writing top-level names to pyorc.egg-info/top_level.txt
reading manifest file 'pyorc.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.css' under directory 'docs'
warning: no previously-included files matching '*' found under directory 'docs/_build'
writing manifest file 'pyorc.egg-info/SOURCES.txt'
running build_ext
creating tmp
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fomit-frame-pointer -g -fno-semantic-interposition -fomit-frame-pointer -g -fno-semantic-interposition -fomit-frame-pointer -g -fno-semantic-interposition -DTHREAD_STACK_SIZE=0x100000 -fPIC -I/usr/include/python3.8 -c /tmp/tmp9meo18pg.cpp -o tmp/tmp9meo18pg.o -std=c++17
building 'pyorc._pyorc' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/src
creating build/temp.linux-x86_64-3.8/src/_pyorc
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fomit-frame-pointer -g -fno-semantic-interposition -fomit-frame-pointer -g -fno-semantic-interposition -fomit-frame-pointer -g -fno-semantic-interposition -DTHREAD_STACK_SIZE=0x100000 -fPIC -I/usr/lib/python3.8/site-packages/pybind11/include -I/usr/lib/python3.8/site-packages/pybind11/include -Ideps/include/ -I/usr/include/python3.8 -c src/_pyorc/_pyorc.cpp -o build/temp.linux-x86_64-3.8/src/_pyorc/_pyorc.o -DVERSION_INFO="0.3.0" -std=c++17 -fvisibility=hidden
In file included from /usr/lib/python3.8/site-packages/pybind11/include/pybind11/pytypes.h:12,
from /usr/lib/python3.8/site-packages/pybind11/include/pybind11/cast.h:13,
from /usr/lib/python3.8/site-packages/pybind11/include/pybind11/attr.h:13,
from /usr/lib/python3.8/site-packages/pybind11/include/pybind11/pybind11.h:45,
from src/_pyorc/Reader.h:4,
from src/_pyorc/_pyorc.cpp:1:
/usr/lib/python3.8/site-packages/pybind11/include/pybind11/detail/common.h:122:10: fatal error: Python.h: No such file or directory
122 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5dn_w2gy/pyorc_f3b26ecd51574b30bc2257a2e63b7d5e/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5dn_w2gy/pyorc_f3b26ecd51574b30bc2257a2e63b7d5e/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-wrlz7rvf/install-record.txt --single-version-externally-managed --compile --install-headers /usr/include/python3.8/pyorc Check the logs for full command output.
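
The fatal error in this log is about Python.h, not about the ORC headers: gcc is given -I/usr/include/python3.8 but the CPython development headers are not there. Below is a minimal sketch, using only the standard library, that reports where the interpreter expects its headers and whether Python.h exists. On Alpine these headers are typically shipped in the python3-dev package, though that is stated here as general background rather than something confirmed in this thread. The function names are hypothetical helpers.

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Return the location where Python.h is expected for this interpreter."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def python_headers_installed() -> bool:
    """Return True if Python.h exists in the interpreter's include directory."""
    return python_header_path().is_file()


if __name__ == "__main__":
    path = python_header_path()
    if python_headers_installed():
        print(f"Python.h found at {path}")
    else:
        print(f"Python.h not found at {path}; C extensions cannot be compiled")
```

Running this inside the container before pip install shows whether a build of any C or C++ extension can get past the first include.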

handle uniontype

I'm trying to use the uniontype type. Is this supported?

>>> fp = open("./new_data-6.orc", "wb")
>>> writer1 = pyorc.Writer(fp, "struct<col1:uniontype<int,double>>")
>>> writer1.write((0, 10))
>>> writer1.write((1, 11))
>>> writer1.write((22, 0))
>>> writer1.write((33, 1))
>>> writer1.write((0, 44, 44))
>>> writer1.write((1, 55, 55))
>>> writer1.close()
>>> fp.close()

I can write data, but the contents of the file look wrong when inspected with orc-contents. Every row is written with tag 0, and tag 1 never appears.

$ orc-contents new_data-6.orc 
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}
{"col1": {"tag": 0, "value": 22}}
{"col1": {"tag": 0, "value": 33}}
{"col1": {"tag": 0, "value": 0}}
{"col1": {"tag": 0, "value": 1}}

Schema looks OK:

$ orc-metadata new_data-6.orc 
{ "name": "new_data-6.orc",
  "type": "struct<col1:uniontype<int,double>>",
  "attributes": {},
...
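
One possible reading of the orc-contents output, offered as an assumption rather than confirmed pyorc behaviour: each tuple passed to write() is treated as a struct row, so only its first element becomes col1, and the union branch is chosen from the Python type of that element. Under that reading every value written above is an int and therefore lands in tag 0. The sketch below emulates that rule in plain Python without using pyorc; guess_union_tag and emulate_rows are hypothetical helpers.

```python
import json

# Branch order of uniontype<int,double>: tag 0 is int, tag 1 is double.
UNION_BRANCHES = (int, float)


def guess_union_tag(value):
    """Pick the union tag from the Python type of the value (assumed rule)."""
    for tag, py_type in enumerate(UNION_BRANCHES):
        # Exact type match, so bool (a subclass of int) is not accepted as int.
        if type(value) is py_type:
            return tag
    raise TypeError(f"no union branch for {type(value).__name__}")


def emulate_rows(rows):
    """Use the first element of each row tuple as col1, as a one-field struct would."""
    records = []
    for row in rows:
        value = row[0]
        records.append({"col1": {"tag": guess_union_tag(value), "value": value}})
    return records


if __name__ == "__main__":
    issue_rows = [(0, 10), (1, 11), (22, 0), (33, 1), (0, 44, 44), (1, 55, 55)]
    for record in emulate_rows(issue_rows):
        print(json.dumps(record))
    # Under the same assumption, a float in the first position selects tag 1.
    print(json.dumps(emulate_rows([(10.0,)])[0]))
```

Under the stated assumption, the six rows from the session above all come out with tag 0 and the values 0, 1, 22, 33, 0, 1, which matches the orc-contents listing; only a float in the first position would select the double branch.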
