[SPARK-18407] Inferred partition columns cause assertion error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.0.2
Fix Version/s: 2.1.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.1.0

Description

This assertion fails when you run a stream against JSON data stored in partitioned folders, if you manually specify a schema that omits the partition columns.

My hunch is that we still infer those columns, even though the schema is passed in manually, and append them to the end of the schema.

While we are fixing this bug, it would be nice to make the assertion better. Truncating is not terribly useful as, at least in my case, it truncated the most interesting part. I changed it to this while debugging:

          s"""
             |Batch does not have expected schema
             |Expected: ${output.mkString(",")}
             |Actual: ${newPlan.output.mkString(",")}
             |
             |== Original ==
             |$logicalPlan
             |
             |== Batch ==
             |$newPlan
           """.stripMargin

I also tried specifying the partition columns in the schema and now it appears that they are filled with corrupted data.
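
To make the suspected mismatch concrete, here is a minimal Spark-free sketch of it. It is a toy model, not Spark's actual code: `Attr`, `SchemaCheck`, and all of its methods are hypothetical names introduced only for illustration, and the "append inferred partition columns" step simply encodes the hunch above. The message follows the untruncated format quoted earlier, minus the plan dumps.

```scala
// Hypothetical stand-in for a schema attribute; not a Spark class.
final case class Attr(name: String, dataType: String) {
  override def toString: String = s"$name: $dataType"
}

object SchemaCheck {
  // Models the hunch: inferred partition columns are appended to the
  // end of the user-supplied schema (an assumption, not confirmed behavior).
  def appendInferredPartitions(userSchema: Seq[Attr], inferred: Seq[Attr]): Seq[Attr] =
    userSchema ++ inferred

  // Builds the full mismatch report without truncating either side.
  def mismatchMessage(expected: Seq[Attr], actual: Seq[Attr]): String =
    s"""
       |Batch does not have expected schema
       |Expected: ${expected.mkString(",")}
       |Actual: ${actual.mkString(",")}
     """.stripMargin.trim

  // Returns Left(message) when the batch schema differs from the expected one.
  def check(expected: Seq[Attr], actual: Seq[Attr]): Either[String, Unit] =
    if (expected == actual) Right(()) else Left(mismatchMessage(expected, actual))
}
```

With a user schema of `Seq(Attr("value", "string"))` and an inferred partition column `Attr("date", "string")` (both made-up values), `SchemaCheck.check` returns a `Left` whose message lists both columns on the `Actual` line, which is the kind of detail the truncated assertion hid.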

Issue Links

is superseded by

SPARK-18510 Partition schema inference corrupts data (Resolved)

links to

[GitHub] Pull Request #15942 (brkyvz)

People

Assignee: Burak Yavuz

Reporter: Michael Armbrust

Votes: 0

Watchers: 6

Dates

Created: 10/Nov/16 21:32

Updated: 28/Nov/16 10:09

Resolved: 28/Nov/16 10:09