Comments (9)
You're probably right, separate classes for reading and writing are overkill. Maybe two simple functions defined in a submodule would be sufficient.
I don't want to add pandas as a default requirement to the module.
from pyorc.
Honestly, I'm not that familiar with pandas. I have to look into it more, but the simplest solution would be to add pandas as an extra to the module and derive a new Reader/Writer from the existing ones, with methods that can read and write pandas DataFrames. Special converter functions will probably be needed for pandas' special types as well.
As you've already noticed, my goal with this library is to be a simple ORC reader and writer with as little overhead as possible. A few smaller tasks are on my todo list now, but I'm not against the idea. It would be best if someone with better knowledge of pandas could contribute. 😉
So what do you suggest to solve this issue? These are the options, I guess:
- PandasReader & PandasWriter (I don't like this one)
- "as_pandas" as a method or argument on the reader (but this adds pandas as a dependency)
- Just an example of how to get a pandas DataFrame from loaded data
- Nothing to do (just add a comment to let users understand the scope of pyorc)
Let me know your thoughts on this and I can make a PR accordingly.
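For the third option, a minimal sketch of what such an example might look like is given here. The helper names `rows_to_columns` and `orc_to_dataframe` are made up for illustration, and the sketch assumes the file's top-level schema is a struct and that the reader yields one tuple per row.

```python
def rows_to_columns(rows, column_names):
    """Turn an iterable of row tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


def orc_to_dataframe(fileo):
    """Load an ORC file object into a pandas DataFrame (hypothetical helper)."""
    # Imported lazily so that defining this function needs neither library.
    import pandas as pd
    import pyorc

    reader = pyorc.Reader(fileo)
    # Assumption: the top-level schema is a struct whose `fields` mapping
    # is keyed by column name.
    column_names = list(reader.schema.fields)
    return pd.DataFrame(rows_to_columns(reader, column_names))
```

Keeping the row-to-column step separate from the pandas call means pyorc itself would not need to know about pandas at all; the example could live in the docs only.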
OK, so you prefer to add functions/methods for that, but not to require pandas in order to use pyorc. I can add this behavior without making pandas a requirement and raise an error if pandas is not installed, but that would be very bad practice. Therefore, the best option is to just add an example with pandas for a start, and if the example is not enough you could consider expanding the scope of pyorc to be more integrated with pandas (which is my suggestion, since it's a pretty common requirement in most data processing libraries). Let me know your thoughts and I can make a PR from that (either adding an example or adding methods with pandas).
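For reference, the "raise an error if pandas is not installed" approach mentioned above usually looks something like the sketch below. The function name `require_optional` and the `pyorc[...]` extra in the message are hypothetical and not part of pyorc.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The extra name below is illustrative; pyorc defines no such extra here.
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install pyorc[{extra}]"
        ) from exc
```

A pandas-aware method would call `require_optional("pandas", "pandas")` at the point of use, so users who never touch that method are unaffected.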
I think an example would be great as a start. A PR about that would be much appreciated. Thank you.
Can someone please include a short example of how to use converters in the Reader? I tried really hard to figure this out but I couldn't. I'm reading a file like this where orc_bytes is of class bytes:
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes))
This works fine and I can convert the resulting Reader object to a pandas dataframe. Now I am trying to add a converter to Reader to convert decimal to float upon reading. I know it needs a dictionary with keys being TypeKind but I can't figure out how to pass the dictionary values. So I am stuck at:
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes), converters={pyorc.TypeKind.DECIMAL: ???})
I found an example in the test_reader.py file (https://github.com/noirello/pyorc/blob/master/tests/test_reader.py#L325) that uses ORCConverter to define a class with a from_orc method for TypeKind.TIMESTAMP, but I have no idea how this should be defined to convert decimal to float. Any help, please?
This is what I have so far but it is not doing any conversion:
```python
import io

import numpy as np
import pyorc
from pyorc.converters import ORCConverter

class TypeConverter(ORCConverter):
    @staticmethod
    def from_orc(decimal_input):
        return np.array(decimal_input, dtype=float)

# orc_bytes is a bytes object holding the ORC file contents
orc = pyorc.Reader(fileo=io.BytesIO(orc_bytes), converters={pyorc.TypeKind.DECIMAL: TypeConverter})
```
Any suggestions?!
I updated the docs about ORCConverter.
Your converter above should return a numpy array with a float in it, for every item in a decimal ORC column.
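If a plain Python float per value is what's wanted rather than a numpy array, a variant might look like the sketch below. It assumes pyorc passes each decimal value to `from_orc` as a string; the class name `DecimalToFloat` and the helper `read_rows` are made up for illustration, and only the read direction is covered.

```python
try:
    from pyorc.converters import ORCConverter
except ImportError:
    # Fallback so the sketch can be read and run without pyorc installed.
    ORCConverter = object


class DecimalToFloat(ORCConverter):
    """Hypothetical converter: decimal column values become Python floats."""

    @staticmethod
    def from_orc(decimal):
        # Assumption: pyorc hands each decimal value over as a string
        # such as "12.50".
        return float(decimal)


def read_rows(orc_bytes):
    """Read every row of an in-memory ORC file using the converter above."""
    import io
    import pyorc

    reader = pyorc.Reader(
        io.BytesIO(orc_bytes),
        converters={pyorc.TypeKind.DECIMAL: DecimalToFloat},
    )
    return list(reader)
```

Note that converting to float gives up the exactness of decimal values, which may matter for monetary data.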
@noirello Thanks a lot! Highly appreciated.