Comments (6)

dongjoon-hyun commented on July 25, 2024

Since there is a recommended way, I'll close this Q&A issue. We can still continue the discussion in this thread, @sinkinben.

dongjoon-hyun commented on July 25, 2024

Thank you for sharing the background. I know spark-rapids. However, reading ORC files is different from attempting schema evolution like String-to-Decimal.

And I am working on a feature to support reading ORC files as a cuDF (CUDA DataFrame). cuDF is an in-memory GPU data format.

IMO, you don't need to follow any Schema Evolution behaviors. It has never been robust or fast; not only in ORC, it's generally not good in other data formats either.

dongjoon-hyun commented on July 25, 2024

Hi, @sinkinben. You are trying Schema Evolution (Upcasting).

Both the Apache Spark and ORC communities recommend using an explicit SQL CAST instead of depending on a data source's Schema Evolution. There are three reasons.

  • First of all, if you use explicit CAST syntax, you will get the expected result. (A sketch applying this recommendation to file reads follows after this list.)
scala> sql("select cast('2022-01-32' as DATE)").show()
+------------------------+
|CAST(2022-01-32 AS DATE)|
+------------------------+
|                    null|
+------------------------+


scala> sql("select cast('9808-02-30' as DATE)").show()
+------------------------+
|CAST(9808-02-30 AS DATE)|
+------------------------+
|                    null|
+------------------------+


scala> sql("select cast('2022-06-31' as DATE)").show()
+------------------------+
|CAST(2022-06-31 AS DATE)|
+------------------------+
|                    null|
+------------------------+
  • Second, Spark provides many data sources like CSV/Avro/Parquet/ORC. Each data source's schema evolution capability is different from the others. In other words, you cannot expect a consistent result when you change the file-based data source format; you will get different results from other data sources like Parquet. FYI, the Apache Spark community has test coverage for this feature-parity issue and has been tracking it.

https://github.com/apache/spark/blob/146f187342140635b83bfe775b6c327755edfbe1/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala#L40-L49

 * The reader schema is said to be evolved (or projected) when it changed after the data is
 * written by writers. The followings are supported in file-based data sources.
 * Note that partition columns are not maintained in files. Here, `column` means non-partition
 * column.
 *
 *   1. Add a column
 *   2. Hide a column
 *   3. Change a column position
 *   4. Change a column type (Upcast)
 *
 * Here, we consider safe changes without data loss. For example, data type changes should be
 * from small types to larger types like `int`-to-`long`, not vice versa.
 *
 * So far, file-based data sources have the following coverages.
 *
 *   | File Format  | Coverage     | Note                                                   |
 *   | ------------ | ------------ | ------------------------------------------------------ |
 *   | TEXT         | N/A          | Schema consists of a single string column.             |
 *   | CSV          | 1, 2, 4      |                                                        |
 *   | JSON         | 1, 2, 3, 4   |                                                        |
 *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
 *   | PARQUET      | 1, 2, 3      |                                                        |
 *   | AVRO         | 1, 2, 3      |                                                        |
  • Last but not least, Apache Spark has three ORC readers. For your use case, you can set spark.sql.orc.impl=hive to get a correct result if you really must depend on Apache ORC's Schema Evolution.
scala> sql("set spark.sql.orc.impl=hive")

scala> :paste
// Entering paste mode (ctrl-D to finish)

val data = Seq(
    ("", "2022-01-32"),  // pay attention to this, null
    ("", "9808-02-30"),  // pay attention to this, 9808-02-29
    ("", "2022-06-31"),  // pay attention to this, 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.write.format("orc").mode("overwrite").save("/tmp/df")
spark.read.format("orc").schema("date_str date").load("/tmp/df").show(false)

// Exiting paste mode, now interpreting.

+--------+
|date_str|
+--------+
|null    |
|null    |
|null    |
+--------+
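
Applying that first recommendation to file reads, here is a minimal sketch (the raw and date_col names are placeholders; the path reuses the /tmp/df file written above): read the column with the type it was written as, then CAST explicitly instead of passing an evolved schema to the reader.

scala> val raw = spark.read.format("orc").load("/tmp/df")   // date_str stays a string
scala> raw.selectExpr("CAST(date_str AS DATE) AS date_col").show(false)

This way the conversion happens in Spark's CAST expression, which behaves the same regardless of which ORC reader implementation is active.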

sinkinben commented on July 25, 2024

Hi, @dongjoon-hyun, many thanks for your reply.

I ran more tests after setting the conf spark.sql.orc.impl.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val data = Seq(
    ("", "2002-01-01"),
    ("", "2022-08-29"),
    ("", "2022-08-31")
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")

// Exiting paste mode, now interpreting.

root
 |-- str: string (nullable = true)
 |-- date_str: string (nullable = true)

+---+----------+
|str|  date_str|
+---+----------+
|   |2002-01-01|
|   |2022-08-29|
|   |2022-08-31|
+---+----------+

data: Seq[(String, String)] = List(("",2002-01-01), ("",2022-08-29), ("",2022-08-31))
cols: Seq[String] = List(str, date_str)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [str: string, date_str: string]

scala> sql("set spark.sql.orc.impl=hive")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+--------+
|date_str|
+--------+
|    null|
|    null|
|    null|
+--------+

df: org.apache.spark.sql.DataFrame = [date_str: date]

These three dates are all valid, but why are they converted to nulls?
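
To narrow this down, one comparison worth running (a sketch, assuming the /tmp/orc/data.orc file written above) is the same evolved-schema read under both reader implementations:

scala> sql("set spark.sql.orc.impl=native")
scala> spark.read.schema("date_str date").orc("/tmp/orc/data.orc").show()

scala> sql("set spark.sql.orc.impl=hive")
scala> spark.read.schema("date_str date").orc("/tmp/orc/data.orc").show()

If the native reader returns the dates while the hive reader returns nulls, the problem is isolated to the hive reader's code path.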

dongjoon-hyun commented on July 25, 2024
  1. First, may I ask why you cannot follow the community recommendation (1), @sinkinben?
  2. Second, as for your question, that is an independent issue because spark.sql.orc.impl=hive is not Apache ORC code; it is Apache Spark code that predates the Apache ORC adoption. You can file an Apache Spark JIRA issue for further investigation in the Apache Spark community. I guess it will go to Apache Hive's HiveDecimalWritable or related classes in the end.
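
As a quick sanity check, a sketch using Spark's runtime conf API, you can confirm which implementation a session is actually using before re-running the read:

scala> spark.conf.get("spark.sql.orc.impl")          // "native" (the default) or "hive"
scala> spark.conf.set("spark.sql.orc.impl", "hive")  // same effect as sql("set spark.sql.orc.impl=hive")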

sinkinben commented on July 25, 2024

@dongjoon-hyun

  • I am working on the project: https://github.com/NVIDIA/spark-rapids
    • And I am working on a feature to support reading ORC files as a cuDF (CUDA DataFrame). cuDF is an in-memory GPU data format.
    • So I need to follow the behavior of ORC reading on the CPU. Otherwise, the results will feel strange to spark-rapids users.
    • That is why I want to know why this happened.
  • Thanks for your advice; I will take a look at ORC reading via Hive SQL.
