Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 2.1.0
- Labels: None
Description
Not sure if this is a regression from 2.0 to 2.1. I was investigating this for Structured Streaming, but it seems it affects batch data as well.
Here's the issue:
If I specify my schema when reading, e.g.
spark.read.schema(someSchemaWherePartitionColumnsAreStrings)
but partition inference would infer those columns as IntegerType (or, I assume, LongType or DoubleType, i.e. fixed-size types), then once UnsafeRows are generated, the data will be corrupted.
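For context, the mismatch can be observed by comparing the user-supplied schema with what partition inference derives on its own. A minimal sketch (it relies on the src output path written by the reproduction below; the exact inferred types are my assumption for this data):

// Reading back without a user-specified schema lets partition inference
// type the directory-derived columns ("part", "id") from their values,
// so they come back as fixed-size numeric types rather than StringType.
spark.read.parquet(src.toString).printSchema()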
Reproduction:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import spark.implicits._

val createArray = udf { (length: Long) => for (i <- 1 to length.toInt) yield i.toString }

// src is the output directory for the partitioned Parquet data (e.g. a java.io.File)
spark.range(10)
  .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
  .coalesce(1)
  .write
  .partitionBy("part", "id")
  .mode("overwrite")
  .parquet(src.toString)

val schema = new StructType()
  .add("id", StringType)
  .add("part", IntegerType)
  .add("ex", ArrayType(StringType))

spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
The UDF is there to make each row long enough that you don't instead hit other weird NullPointerExceptions, which I believe are caused by the same underlying issue.
Output:
+---------+----+--------------------+
|       id|part|                  ex|
+---------+----+--------------------+
|        �|   1|[1, 2, 3, 4, 5, 6...|
|         |   0|[1, 2, 3, 4, 5, 6...|
|         |   3|[1, 2, 3, 4, 5, 6...|
|         |   2|[1, 2, 3, 4, 5, 6...|
|         |   1| [1, 2, 3, 4, 5, 6]|
|         |   0|    [1, 2, 3, 4, 5]|
|         |   3|       [1, 2, 3, 4]|
|         |   2|          [1, 2, 3]|
|         |   1|             [1, 2]|
|         |   0|                [1]|
+---------+----+--------------------+
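Note that part, which was declared with the IntegerType that partition inference would pick anyway, reads back fine; only id, declared as StringType, is corrupted. As a point of comparison (my own sketch, not part of the original report), declaring id with the fixed-size type that inference would choose avoids the mismatch and the data reads back intact:

import org.apache.spark.sql.types._

// Same read as in the reproduction, but with "id" declared using the
// fixed-size type that partition inference would pick for directory
// values like id=0..9 (IntegerType is an assumption here).
val matchingSchema = new StructType()
  .add("id", IntegerType)
  .add("part", IntegerType)
  .add("ex", ArrayType(StringType))

spark.read
  .schema(matchingSchema)
  .format("parquet")
  .load(src.toString)
  .show()

Alternatively, setting spark.sql.sources.partitionColumnTypeInference.enabled to false makes inference keep partition values as strings, which may be another way to bring a StringType declaration and the inferred types into agreement.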
I was hoping to fix the issue as part of SPARK-18407, but it seems it's not only applicable to Structured Streaming and deserves its own JIRA.
Issue Links
- supersedes: SPARK-18407 "Inferred partition columns cause assertion error" (Resolved)
- links to