Spark / SPARK-18510

Partition schema inference corrupts data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL, Structured Streaming
    • Labels: None

    Description

      Not sure if this is a regression from 2.0 to 2.1. I was investigating this for Structured Streaming, but it seems it affects batch data as well.

      Here's the issue:
      If I specify my schema when doing

      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      

      but partition inference would infer that column as IntegerType (or presumably LongType or DoubleType, i.e. any fixed-size type), then once UnsafeRows are generated, your data will be corrupted.
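For intuition, here is a loose model of how such a field can be misread (this is a sketch, not Spark's actual code; the names are hypothetical). UnsafeRow stores fixed-size values inline in an 8-byte field slot, while variable-length values such as strings store an (offset, length) pair pointing into a trailing byte region. A reader that expects a string therefore decodes an inlined integer as a tiny offset and length, and pulls garbage bytes:

```scala
import java.nio.charset.StandardCharsets

// Hypothetical model of an 8-byte UnsafeRow field slot.
object SlotDemo {
  // Writer inferred a fixed-size type: the value is stored inline.
  def writeIntSlot(v: Int): Long = v.toLong

  // Reader was told StringType: it decodes the slot as (offset, length)
  // into the variable-length byte region.
  def readStringSlot(slot: Long, varRegion: Array[Byte]): String = {
    val offset = (slot >>> 32).toInt
    val length = (slot & 0xFFFFFFFFL).toInt
    new String(varRegion.slice(offset, offset + length), StandardCharsets.UTF_8)
  }
}
```

Under this model, an inlined value `n` decodes as offset 0 and length `n`, so the reader returns the first `n` bytes of whatever happens to sit in the variable-length region, which is plausibly why the corrupted id column below shows runs whose width tracks the id value.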

      Reproduction:

      import java.nio.file.Files
      import org.apache.spark.sql.functions.udf
      import org.apache.spark.sql.types._

      // `src` is not defined in the original report; assume a temporary
      // directory holding the partitioned Parquet output.
      val src = Files.createTempDirectory("spark-18510").toFile

      val createArray = udf { (length: Long) =>
        for (i <- 1 to length.toInt) yield i.toString
      }
      spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part).coalesce(1).write
        .partitionBy("part", "id")
        .mode("overwrite")
        .parquet(src.toString)
      val schema = new StructType()
        .add("id", StringType)
        .add("part", IntegerType)
        .add("ex", ArrayType(StringType))
      spark.read
        .schema(schema)
        .format("parquet")
        .load(src.toString)
        .show()
      

      The UDF is useful for creating rows long enough that you don't instead hit other weird NullPointerExceptions, which I believe are caused by the same underlying issue.
      Output:

      +---------+----+--------------------+
      |       id|part|                  ex|
      +---------+----+--------------------+
      |�|   1|[1, 2, 3, 4, 5, 6...|
      | |   0|[1, 2, 3, 4, 5, 6...|
      |  |   3|[1, 2, 3, 4, 5, 6...|
      |   |   2|[1, 2, 3, 4, 5, 6...|
      |    |   1|  [1, 2, 3, 4, 5, 6]|
      |     |   0|     [1, 2, 3, 4, 5]|
      |      |   3|        [1, 2, 3, 4]|
      |       |   2|           [1, 2, 3]|
      |        |   1|              [1, 2]|
      |         |   0|                 [1]|
      +---------+----+--------------------+
      

      I was hoping to fix the issue as part of SPARK-18407, but it seems it's not only applicable to Structured Streaming and deserves its own JIRA.
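A possible mitigation in the meantime (a sketch, not verified against this specific bug): disable partition column type inference so that partition values are kept as strings, matching the user-specified StringType in the schema above.

```scala
// With inference off, partition columns are read as strings, so they
// agree with the StringType declared in the user-provided schema.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
```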

Attachments

Issue Links

Activity

People

    Assignee: brkyvz Burak Yavuz
    Reporter: brkyvz Burak Yavuz
    Votes: 0
    Watchers: 4

Dates

    Created:
    Updated:
    Resolved: