Spark / SPARK-18510

Partition schema inference corrupts data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL, Structured Streaming
    • Labels: None

    Description

      Not sure if this is a regression from 2.0 to 2.1. I was investigating this for Structured Streaming, but it seems it affects batch data as well.

      Here's the issue:
      If I specify my schema when doing

      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      

      but partition inference would infer that column as IntegerType (or presumably LongType or DoubleType, i.e. any fixed-size type), then once UnsafeRows are generated, your data will be corrupted.
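For intuition, here is a loose model of how such a field can be misread (this is a sketch, not Spark's actual code; the names are hypothetical). UnsafeRow stores fixed-size values inline in an 8-byte field slot, while variable-length values such as strings store an (offset, length) pair pointing into a trailing byte region. A reader that expects a string therefore decodes an inlined integer as a tiny offset and length, and pulls garbage bytes:

```scala
import java.nio.charset.StandardCharsets

// Hypothetical model of an 8-byte UnsafeRow field slot.
object SlotDemo {
  // Writer inferred a fixed-size type: the value is stored inline.
  def writeIntSlot(v: Int): Long = v.toLong

  // Reader was told StringType: it decodes the slot as (offset, length)
  // into the variable-length byte region.
  def readStringSlot(slot: Long, varRegion: Array[Byte]): String = {
    val offset = (slot >>> 32).toInt
    val length = (slot & 0xFFFFFFFFL).toInt
    new String(varRegion.slice(offset, offset + length), StandardCharsets.UTF_8)
  }
}
```

Under this model, an inlined value `n` decodes as offset 0 and length `n`, so the reader returns the first `n` bytes of whatever happens to sit in the variable-length region, which is plausibly why the corrupted id column below shows runs whose width tracks the id value.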

      Reproduction:

      import java.nio.file.Files
      import org.apache.spark.sql.functions.udf
      import org.apache.spark.sql.types._

      // `src` is not defined in the original report; assume a temporary
      // directory holding the partitioned Parquet output.
      val src = Files.createTempDirectory("spark-18510").toFile

      val createArray = udf { (length: Long) =>
        for (i <- 1 to length.toInt) yield i.toString
      }
      spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part).coalesce(1).write
        .partitionBy("part", "id")
        .mode("overwrite")
        .parquet(src.toString)
      val schema = new StructType()
        .add("id", StringType)
        .add("part", IntegerType)
        .add("ex", ArrayType(StringType))
      spark.read
        .schema(schema)
        .format("parquet")
        .load(src.toString)
        .show()
      

      The UDF is useful for creating rows long enough that you don't instead hit other weird NullPointerExceptions, which I believe are caused by the same underlying issue.
      Output:

      +---------+----+--------------------+
      |       id|part|                  ex|
      +---------+----+--------------------+
      |�|   1|[1, 2, 3, 4, 5, 6...|
      | |   0|[1, 2, 3, 4, 5, 6...|
      |  |   3|[1, 2, 3, 4, 5, 6...|
      |   |   2|[1, 2, 3, 4, 5, 6...|
      |    |   1|  [1, 2, 3, 4, 5, 6]|
      |     |   0|     [1, 2, 3, 4, 5]|
      |      |   3|        [1, 2, 3, 4]|
      |       |   2|           [1, 2, 3]|
      |        |   1|              [1, 2]|
      |         |   0|                 [1]|
      +---------+----+--------------------+
      

      I was hoping to fix the issue as part of SPARK-18407, but it seems it's not only applicable to Structured Streaming and deserves its own JIRA.
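A possible mitigation in the meantime (a sketch, not verified against this specific bug): disable partition column type inference so that partition values are kept as strings, matching the user-specified StringType in the schema above.

```scala
// With inference off, partition columns are read as strings, so they
// agree with the StringType declared in the user-provided schema.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
```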

Attachments

Issue Links

Activity

People

    Assignee: brkyvz Burak Yavuz
    Reporter: brkyvz Burak Yavuz
    Votes: 0
    Watchers: 4

Dates

    Created:
    Updated:
    Resolved: