Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.1.1
- Fix Version/s: None
- Component/s: None
- Environment:
  - Ubuntu 18.04 LTS
  - Spark 3.1.1
Description
I created an ORC file with the following code:

```scala
val data = Seq(
  ("", "2022-01-32"), // pay attention to this: null
  ("", "9808-02-30"), // pay attention to this: 9808-02-29
  ("", "2022-06-31")  // pay attention to this: 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols: _*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
```
Please note that all three of these date strings are invalid dates.
Then I read it back via:

```scala
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+----------+
|  date_str|
+----------+
|      null|
|9808-02-29|
|2022-06-30|
+----------+
```
Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29` and `2022-06-31` to `2022-06-30`?
Intuitively, since all three are invalid dates, the reader should return three nulls. Is this a bug or a feature?
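I am not certain which code path Spark's ORC reader takes here, but the observed pattern matches the SMART resolver style of the JVM's `java.time` API (the default for `DateTimeFormatter`): a day-of-month outside the field range 1..31 fails to parse entirely, while a day-of-month that is valid for the field but too large for the given month is clamped to the last day of that month. A minimal Java sketch reproducing all three cases (the class name `SmartDateDemo` is mine, not from Spark):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class SmartDateDemo {
    public static void main(String[] args) {
        // SMART resolution: day 32 exceeds DAY_OF_MONTH's valid range (1..31)
        // and throws; day 30/31 is in range, so it is clamped to the month's
        // actual last day (9808 is a leap year, hence Feb 29).
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd")
                .withResolverStyle(ResolverStyle.SMART);
        for (String s : new String[] {"2022-01-32", "9808-02-30", "2022-06-31"}) {
            try {
                System.out.println(s + " -> " + LocalDate.parse(s, fmt));
            } catch (DateTimeParseException e) {
                System.out.println(s + " -> null (unparseable)");
            }
        }
        // Prints:
        // 2022-01-32 -> null (unparseable)
        // 9808-02-30 -> 9808-02-29
        // 2022-06-31 -> 2022-06-30
    }
}
```

If Spark (or the ORC reader underneath it) parses the string-to-date conversion with a SMART-style resolver and maps parse failures to null, that would explain exactly the mixed null/clamped output above.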
Background
- I am working on the project: https://github.com/NVIDIA/spark-rapids
- And I am working on a feature to support reading ORC files as cuDF (CUDA DataFrame) tables; cuDF is an in-memory GPU data format.
- I need to match the behavior of Spark's CPU ORC reader; otherwise, users of spark-rapids will be confused by diverging results.
- Therefore I want to understand why this happened.