Spark / SPARK-40289

The result is strange when casting string to date in ORC reading via Schema Evolution


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: Spark Shell
    • Labels: None
    • Environment:
      • Ubuntu 1804 LTS
      • Spark 311

    Description

      I created an ORC file with the following code.

      val data = Seq(
          ("", "2022-01-32"),  // invalid date; read back as null
          ("", "9808-02-30"),  // invalid date; read back as 9808-02-29
          ("", "2022-06-31")   // invalid date; read back as 2022-06-30
      )
      val cols = Seq("str", "date_str")
      // repartition(1) so that all rows land in a single ORC file
      val df = spark.createDataFrame(data).toDF(cols: _*).repartition(1)
      df.printSchema()
      df.show(100)
      df.write.mode("overwrite").orc("/tmp/orc/data.orc")
      

      Note that all three values are invalid dates.
      I then read the file back with date_str declared as a date column:

      scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
      +----------+
      |  date_str|
      +----------+
      |      null|
      |9808-02-29|
      |2022-06-30|
      +----------+

      Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29`?

      Intuitively, all three are invalid dates, so I would expect three nulls. Is this a bug or a feature?
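
      To make the expectation concrete, here is a minimal sketch of a strict check using java.time.LocalDate, which rejects all three values. The helper name parseStrict is hypothetical and is not part of Spark or ORC. It only illustrates the result one would intuitively expect, not how the ORC reader is implemented.

```scala
import java.time.LocalDate
import java.time.format.DateTimeParseException

// Hypothetical helper (not a Spark or ORC API): strictly parse a
// yyyy-MM-dd string. LocalDate.parse uses the ISO_LOCAL_DATE formatter,
// which resolves strictly and rejects days that do not exist.
def parseStrict(s: String): Option[LocalDate] =
  try Some(LocalDate.parse(s))
  catch { case _: DateTimeParseException => None }

// The three values written to the ORC file above
val inputs = Seq("2022-01-32", "9808-02-30", "2022-06-31")
val parsed = inputs.map(parseStrict)

// Each input maps to None, i.e. the "three nulls" expectation
inputs.zip(parsed).foreach { case (in, out) => println(s"$in -> $out") }
```

      Under this strict interpretation none of the three strings is a date, which is why the mixed result above (one null, two shifted dates) looks inconsistent.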

       

       

      Background

      • I am working on the project https://github.com/NVIDIA/spark-rapids
      • I am working on a feature to support reading ORC files as cuDF (CUDA DataFrame), an in-memory data format on the GPU.
      • I need to match the behavior of ORC reading on the CPU. Otherwise, spark-rapids users will be surprised by the results.
      • Therefore I want to understand why this happens.

          People

            Assignee: Unassigned
            Reporter: Jianbang Xian (only1kb)