Comments (6)
Since there is a recommended way, I'll close this Q&A issue. We can still continue the discussion in this thread, @sinkinben.
Thank you for sharing the background. I know spark-rapids. However, reading ORC files is different from attempting schema evolution like String-to-Decimal.

> And I am working on a feature, that is, to support reading ORC files as a cuDF (CUDA DataFrame). cuDF is an in-memory data format for the GPU.

IMO, you don't need to follow any behavior of Schema Evolution. It has never been robust nor fast. Not only in ORC; it's generally not good in other data formats either.
Hi, @sinkinben. You are trying Schema Evolution (Upcasting).

Both the Apache Spark and Apache ORC communities recommend using an explicit SQL CAST instead of depending on a data source's Schema Evolution. There are three reasons.

- First of all, if you use explicit CAST syntax, you will get the expected result.
scala> sql("select cast('2022-01-32' as DATE)").show()
+------------------------+
|CAST(2022-01-32 AS DATE)|
+------------------------+
| null|
+------------------------+
scala> sql("select cast('9808-02-30' as DATE)").show()
+------------------------+
|CAST(9808-02-30 AS DATE)|
+------------------------+
| null|
+------------------------+
scala> sql("select cast('2022-06-31' as DATE)").show()
+------------------------+
|CAST(2022-06-31 AS DATE)|
+------------------------+
| null|
+------------------------+
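The nulls above are the expected outcome: each literal names an impossible calendar date. As a minimal sanity check outside Spark (not part of the original discussion), plain `java.time` rejects all three:

```scala
import java.time.LocalDate
import scala.util.Try

// Each CAST argument above fails strict ISO calendar validation:
// January has 31 days, February has at most 29, and June has 30.
val candidates = Seq("2022-01-32", "9808-02-30", "2022-06-31")
candidates.foreach { s =>
  println(s"$s is a valid date: ${Try(LocalDate.parse(s)).isSuccess}")
}
// All three print "false", matching the null results of CAST above.
```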
- Second, Spark provides many data sources like CSV/Avro/Parquet/ORC, and schema evolution capability is heterogeneous across them. In other words, you cannot expect a consistent result when you change the file-based data source format; you will get different results from other data sources like Parquet. FYI, the Apache Spark community has test coverage for that feature-parity issue and has been tracking it:
* The reader schema is said to be evolved (or projected) when it changed after the data is
* written by writers. The followings are supported in file-based data sources.
* Note that partition columns are not maintained in files. Here, `column` means non-partition
* column.
*
* 1. Add a column
* 2. Hide a column
* 3. Change a column position
* 4. Change a column type (Upcast)
*
* Here, we consider safe changes without data loss. For example, data type changes should be
* from small types to larger types like `int`-to-`long`, not vice versa.
*
* So far, file-based data sources have the following coverages.
*
* | File Format | Coverage | Note |
* | ------------ | ------------ | ------------------------------------------------------ |
* | TEXT | N/A | Schema consists of a single string column. |
* | CSV | 1, 2, 4 | |
* | JSON | 1, 2, 3, 4 | |
* | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
* | PARQUET | 1, 2, 3 | |
* | AVRO | 1, 2, 3 | |
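To make the coverage difference concrete, here is a hypothetical spark-shell sketch of change (4), a type upcast, against ORC and Parquet. The `/tmp` paths and the `id` column are made up for illustration; per the table above, the native ORC reader accepts the widened schema, while Parquet's coverage does not include upcasts, so the Parquet read may fail at read time.

```scala
// Sketch only: assumes a running spark-shell session; paths are arbitrary.
import spark.implicits._

Seq(1, 2, 3).toDF("id").write.mode("overwrite").orc("/tmp/upcast_orc")
Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet("/tmp/upcast_parquet")

// Read back with a widened reader schema: int -> long (change 4, a safe upcast).
spark.read.schema("id long").orc("/tmp/upcast_orc").show()        // covered by the native ORC reader
spark.read.schema("id long").parquet("/tmp/upcast_parquet").show() // not covered: may throw at read time
```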
- Last but not least, Apache Spark has three ORC readers. For your use case, you can set spark.sql.orc.impl=hive to get a correct result if you really must depend on Apache ORC's Schema Evolution.
scala> sql("set spark.sql.orc.impl=hive")
scala> :paste
// Entering paste mode (ctrl-D to finish)
val data = Seq(
("", "2022-01-32"), // pay attention to this, null
("", "9808-02-30"), // pay attention to this, 9808-02-29
("", "2022-06-31"), // pay attention to this, 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.write.format("orc").mode("overwrite").save("/tmp/df")
spark.read.format("orc").schema("date_str date").load("/tmp/df").show(false)
// Exiting paste mode, now interpreting.
+--------+
|date_str|
+--------+
|null |
|null |
|null |
+--------+
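For completeness, the recommended pattern from reason (1) applied to the same file would be to read `date_str` with its on-disk type and cast explicitly. A hypothetical spark-shell sketch, reusing the `/tmp/df` ORC file written above:

```scala
// Sketch only: assumes the spark-shell session and the /tmp/df file from above.
val raw = spark.read.format("orc").load("/tmp/df")  // date_str keeps its written type, string

// Explicit CAST: invalid literals become null deterministically,
// regardless of which ORC reader implementation is active.
raw.selectExpr("CAST(date_str AS DATE) AS date_str").show(false)
```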
Hi, @dongjoon-hyun, many thanks for your reply. I have run more tests after setting the conf spark.sql.orc.impl.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val data = Seq(
("", "2002-01-01"),
("", "2022-08-29"),
("", "2022-08-31")
)
val cols = Seq("str", "date_str")
val df=spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
// Exiting paste mode, now interpreting.
root
|-- str: string (nullable = true)
|-- date_str: string (nullable = true)
+---+----------+
|str| date_str|
+---+----------+
| |2002-01-01|
| |2022-08-29|
| |2022-08-31|
+---+----------+
data: Seq[(String, String)] = List(("",2002-01-01), ("",2022-08-29), ("",2022-08-31))
cols: Seq[String] = List(str, date_str)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [str: string, date_str: string]
scala> sql("set spark.sql.orc.impl=hive")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+--------+
|date_str|
+--------+
| null|
| null|
| null|
+--------+
df: org.apache.spark.sql.DataFrame = [date_str: date]
These three cases are valid dates, but why are they converted to nulls?
- First, may I ask why you cannot follow the community recommendation (1), @sinkinben?
- Second, that's an independent question, because spark.sql.orc.impl=hive is not Apache ORC code. It's Apache Spark code that predates the Apache ORC adoption. You can file an Apache Spark JIRA issue for further investigation in the Apache Spark community. I guess it will go to Apache Hive's HiveDecimalWritable or related classes in the end.
- I am working on the project: https://github.com/NVIDIA/spark-rapids
- And I am working on a feature, that is, to support reading ORC files as a cuDF (CUDA DataFrame). cuDF is an in-memory data format for the GPU.
- So I need to follow the behavior of ORC reading on the CPU. Otherwise, the users of spark-rapids will find the results strange.
- Therefore I want to know why those happened.
- Thanks for your advice; I will have a look at ORC reading via Hive SQL.