[SPARK-17093] Roundtrip encoding of array<struct<>> fields is wrong when whole-stage codegen is disabled - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.1, 2.1.0
Component/s: SQL
Labels: correctness
Target Version/s: 2.0.1

Description

The following failing test demonstrates a bug where Spark mis-encodes array-of-struct fields if whole-stage codegen is disabled:

withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") {
  val data = Array(Array((1, 2), (3, 4)))
  val ds = spark.sparkContext.parallelize(data).toDS()
  assert(ds.collect() === data)
}

When whole-stage codegen is enabled (the default), this works fine. When it's disabled, as in the test above, Spark returns Array(Array((3,4), (3,4))). Because the last element of the array appears to be repeated, my best guess is that the interpreted evaluation codepath is missing a copy() call somewhere.
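To illustrate the kind of bug suspected here, the following is a minimal sketch in plain Scala of how skipping copy() on a reused mutable object makes the last element appear repeated. It is not Spark's code and does not claim to show where the actual defect lives; MutablePair, CopyAliasingSketch, and the defensiveCopy flag are hypothetical names invented for this example.

```scala
// Hypothetical stand-in for a mutable row that an evaluator reuses between
// elements. This is NOT a Spark class; it only models the aliasing pattern.
final class MutablePair(var a: Int, var b: Int) {
  def copy(): MutablePair = new MutablePair(a, b)
  def toTuple: (Int, Int) = (a, b)
}

object CopyAliasingSketch {
  // Writes every input element into one shared buffer object and collects it.
  // Without a defensive copy, every collected entry points at the same object.
  def collect(input: Seq[(Int, Int)], defensiveCopy: Boolean): Seq[(Int, Int)] = {
    val reused = new MutablePair(0, 0)
    val out = scala.collection.mutable.ArrayBuffer.empty[MutablePair]
    for ((a, b) <- input) {
      reused.a = a
      reused.b = b
      out += (if (defensiveCopy) reused.copy() else reused)
    }
    out.map(_.toTuple).toList
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 2), (3, 4))
    println(collect(data, defensiveCopy = false)) // List((3,4), (3,4))
    println(collect(data, defensiveCopy = true))  // List((1,2), (3,4))
  }
}
```

The first result matches the symptom in the report: every slot aliases the same object, so all of them show the values written last.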