Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.1.1
- Fix Version/s: None
- Component/s: None
- Environment:
  - Ubuntu 18.04 LTS
  - Spark 3.1.1
Description
I created an ORC file with the following code:

```scala
val data = Seq(
  ("", "2022-01-32"), // pay attention to this: null
  ("", "9808-02-30"), // pay attention to this: 9808-02-29
  ("", "2022-06-31")  // pay attention to this: 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols: _*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
```
Please note that all three of these date strings are invalid dates.
Then I read it back via:

```scala
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+----------+
|  date_str|
+----------+
|      null|
|9808-02-29|
|2022-06-30|
+----------+
```
Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29` and `2022-06-31` to `2022-06-30`?
Intuitively, since all three are invalid dates, the reader should return three nulls. Is this a bug or a feature?
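I am not certain which code path Spark's ORC reader takes here, but the observed pattern matches the SMART resolver style of the JVM's `java.time` API (the default for `DateTimeFormatter`): a day-of-month outside the field range 1..31 fails to parse entirely, while a day-of-month that is valid for the field but too large for the given month is clamped to the last day of that month. A minimal Java sketch reproducing all three cases (the class name `SmartDateDemo` is mine, not from Spark):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class SmartDateDemo {
    public static void main(String[] args) {
        // SMART resolution: day 32 exceeds DAY_OF_MONTH's valid range (1..31)
        // and throws; day 30/31 is in range, so it is clamped to the month's
        // actual last day (9808 is a leap year, hence Feb 29).
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd")
                .withResolverStyle(ResolverStyle.SMART);
        for (String s : new String[] {"2022-01-32", "9808-02-30", "2022-06-31"}) {
            try {
                System.out.println(s + " -> " + LocalDate.parse(s, fmt));
            } catch (DateTimeParseException e) {
                System.out.println(s + " -> null (unparseable)");
            }
        }
        // Prints:
        // 2022-01-32 -> null (unparseable)
        // 9808-02-30 -> 9808-02-29
        // 2022-06-31 -> 2022-06-30
    }
}
```

If Spark (or the ORC reader underneath it) parses the string-to-date conversion with a SMART-style resolver and maps parse failures to null, that would explain exactly the mixed null/clamped output above.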
Background
- I am working on the project: https://github.com/NVIDIA/spark-rapids
- And I am working on a feature to support reading ORC files as cuDF (CUDA DataFrame) tables; cuDF is an in-memory GPU data format.
- I need to match the behavior of Spark's CPU ORC reader; otherwise, users of spark-rapids will be confused by diverging results.
- Therefore I want to understand why this happened.