Comments (13)
I have some experience with Arrow (as an Arrow committer), so let me try to set this up.
The current plan is to split the work into two parts:
- Arrow schema reading
- Arrow file / data loading and off-heap memory management
Subsequent features can take more tangible form once reading is done, e.g. Arrow file writing, streaming, predicate push-down, etc.
from dataframe.
Hello again.
I am working on a more complex unit test for Arrow reading and will make a PR a little later.
For now, you can look at the data example and the code it was generated with here.
I suggest the following mapping if `Infer` is used as a parameter:
- `Infer.Nulls` — set the nullable flag in the DataFrame schema if and only if there are null values in the column; make this the default.
- `Infer.None` — copy the Arrow schema to the DataFrame and throw an exception like "notNullableColumn is marked not nullable in schema, but has nulls".
- `Infer.Type` — copy the Arrow schema to the DataFrame, changing not nullable to nullable if there are null values. Or it would actually be the same as `Infer.Nulls` (a single type is already guaranteed by Arrow).
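For illustration, here is a minimal Kotlin sketch of how these three modes could resolve a column's nullability from the Arrow schema flag and the actual data. This is not the actual DataFrame API; the enum and `resolveNullable` function are hypothetical:

```kotlin
// Hypothetical sketch: how the three proposed Infer modes could resolve
// a column's nullability from the Arrow schema flag and the actual data.
enum class Infer { Nulls, None, Type }

fun resolveNullable(mode: Infer, schemaNullable: Boolean, hasNulls: Boolean): Boolean =
    when (mode) {
        // Ignore the schema: nullable iff the data actually contains nulls
        Infer.Nulls -> hasNulls
        // Trust the schema; nulls in a not-nullable column are an error
        Infer.None ->
            if (!schemaNullable && hasNulls)
                error("Column marked not nullable in schema, but has nulls")
            else schemaNullable
        // Start from the schema, widening to nullable if nulls are present
        Infer.Type -> schemaNullable || hasNulls
    }
```

Note that `Infer.Type` and `Infer.Nulls` differ only when the schema says nullable but the data has no nulls: `Infer.Type` keeps the column nullable, while `Infer.Nulls` narrows it.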
Implemented in #129.
`Narrowing` was renamed to `Keeping`, because when the schema is ignored we can get no nulls in a nullable column as well as some nulls in a not nullable one.
Hi, Lundez!
Currently DataFrame doesn't use Arrow as a backend, but it's on the roadmap.
Until now we were mostly focused on the frontend part: a typesafe Kotlin API, code generation, schema inference and other tricks that provide a great experience when you work with data in Kotlin. But now the API and overall model are getting stable, so it's time to do more work on performance tuning and scalability, including Arrow support as a backend.
Currently the project has only two active contributors, so any help will be very much appreciated!
Hi, do you have any pointers on how to start?
Do you think the Java Arrow API can work with your "typing" (or whatever the typing used in DataFrame is called)? 😊
I think adding Arrow would give this project a big boost.
Adding a query optimizer as a follow-up would be a huge bonus, like pola.rs / Spark. Optimizing column access and other things when using Arrow makes a lot of sense! 😄
@jimexist incredibly excited to hear this!
> Currently the project has only two active contributors, so any help will be very much appreciated!

Hello @nikitinas, what do you think about my last PRs?
Also I have made some code for writing to Arrow, but it does not cover all DataFrame-supported column types (it was originally made for Krangl).
@koperagen, @nikitinas, I want your opinion on the following detail.
In the Arrow schema we have a `nullable` flag, but its value does not depend on the column content. So we may get a column that is marked as not nullable but actually contains null values. Here is an example.
So, we can:
- Ignore the nullable flag in the file, read all data, and set the nullable flag in the DataFrame schema if and only if there are null values in the column;
- Look at the nullable flag and always copy it to the DataFrame schema; reading data like the above will then produce an error;
- Look at the nullable flag, copy it to the DataFrame schema by default, and then change not nullable to nullable if there are null values.
Which behavior is best, and should we support several of them, in your opinion?
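To make the mismatch concrete, here is a tiny self-contained Kotlin sketch (plain Kotlin, no Arrow dependency; the `ArrowishColumn` class is invented purely for illustration) showing that the schema flag and the data are two independent pieces of information:

```kotlin
// The nullable flag lives in the schema metadata and is not validated
// against the data, so the two can disagree.
data class ArrowishColumn(
    val name: String,
    val schemaNullable: Boolean, // the flag stored in the Arrow schema
    val values: List<Int?>       // the actual column content
)

fun main() {
    val col = ArrowishColumn("age", schemaNullable = false, values = listOf(1, null, 3))
    val hasNulls = col.values.any { it == null }
    // schemaNullable == false, but hasNulls == true: the reader has to
    // decide which source of truth wins (the three options above).
    println("schema says nullable=${col.schemaNullable}, data has nulls=$hasNulls")
}
```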
> @koperagen, @nikitinas, I want your opinion on the following detail.
> In the Arrow schema we have a `nullable` flag, but its value does not depend on the column content. So we may get a column that is marked as not nullable but actually contains null values. So, we can:
> - Ignore the nullable flag in the file, read all data, and set the nullable flag in the DataFrame schema if and only if there are null values in the column;
> - Look at the nullable flag and always copy it to the DataFrame schema; reading data like the above will then produce an error;
> - Look at the nullable flag, copy it to the DataFrame schema by default, and then change not nullable to nullable if there are null values.
> Which behavior is best, and should we support several of them?
Could we support different read modes? Defaulting to the first or third makes sense, but a strict mode (the second) would also be great, exposed via a flag/read-mode, IMO.
> @koperagen, @nikitinas, I want your opinion on the following detail.
> In the Arrow schema we have a `nullable` flag, but its value does not depend on the column content. So we may get a column that is marked as not nullable but actually contains null values. So, we can:
> - Ignore the nullable flag in the file, read all data, and set the nullable flag in the DataFrame schema if and only if there are null values in the column;
> - Look at the nullable flag and always copy it to the DataFrame schema; reading data like the above will then produce an error;
> - Look at the nullable flag, copy it to the DataFrame schema by default, and then change not nullable to nullable if there are null values.
> Which behavior is best, and should we support several of them?
Hm, I would prefer 1 as the default, because in the REPL it can help avoid unnecessary null handling when there are no nulls. But we also need 3 for the Gradle plugin, which generates a schema declaration from a data sample.
Do I understand the second option right? Something like this would be possible?
```kotlin
val df = DataFrame.readArrow()
df.notNullableColumn.map { it / 2 } // null pointer exception
```
I think we shouldn't have this mode unless there is very strong evidence that it is very useful for someone :)
Or do you mean this?
```kotlin
val df = DataFrame.readArrow() // Exception: notNullableColumn marked not nullable in schema, but has nulls
```
All of that reminds me of `Infer`, which is used as a flag for some operations.
Thank you for highlighting the `Infer` enum. It can probably be used as a parameter.

> Hm, I would prefer 1 as the default

OK, thanks for sharing.
About 2, I expected something like
```kotlin
val df = DataFrame.readArrow() // Exception: notNullableColumn marked not nullable in schema, but has nulls
```
when calling
```kotlin
DataColumn.createValueColumn(field.name, listWithNulls, typeNotNullable, Infer.None)
```
but actually we have
```kotlin
val df = DataFrame.readArrow()
df.notNullableColumn.map { it / 2 } // null pointer exception
```
now. I will fix that.
Where can I read more about the Gradle plugin? How do you use it?
> Where can I read more about the Gradle plugin? How do you use it?

https://kotlin.github.io/dataframe/gradle.html
> I suggest the following mapping if `Infer` is used as a parameter:

I'm not sure about it anymore, because `Infer.Type` does a different thing in other operations, and `Infer.Nulls` means "actual data nullability" == "schema nullability". In our case, "set the nullable flag in the DataFrame schema if and only if there are null values in the column" is "narrow nullability if possible", and the third option is "widen nullability if needed".
What do you think about a new enum, let's say something like `SchemaVerification`? It describes the variants of this operation:
actual nullability (from data) + schema nullability (from file) -> nullability | error
Maybe some other name, idk.
edit: Colleagues suggested `NullabilityOptions`, `NullabilityTransformOptions`, `NullabilityOperatorOptions`, `NullabilityCompositionOptions`.
As for enum variants, they could be `WIDENING`, `NARROWING`, `CHECKING`.
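As a sketch of how the suggested variants could behave (hypothetical names and signature taken from this discussion, not a committed API):

```kotlin
// Hypothetical sketch of the proposed enum: each variant combines the
// actual data nullability with the schema flag, or fails on a mismatch.
enum class NullabilityOptions { Widening, Narrowing, Checking }

fun applyNullability(option: NullabilityOptions, schemaNullable: Boolean, hasNulls: Boolean): Boolean =
    when (option) {
        // Widen the schema flag to nullable if the data demands it
        NullabilityOptions.Widening -> schemaNullable || hasNulls
        // Take nullability from the data alone, narrowing where possible
        NullabilityOptions.Narrowing -> hasNulls
        // Verify the schema against the data, failing on a contradiction
        NullabilityOptions.Checking ->
            if (!schemaNullable && hasNulls)
                error("Column is marked not nullable in schema, but has nulls")
            else schemaNullable
    }
```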