The optimizer rule org.apache.spark.sql.catalyst.optimizer.ReorderJoin performs join reordering on inner joins. It was introduced in SPARK-12032 in December 2015.
After reordering the joins, however, it did not check whether the column order (i.e., the output attribute list) was still the same as before. As a result, the reordered column order can end up mismatched against the schema the DataFrame thinks it has.
This can be demonstrated with a small example.
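The following is a minimal sketch, assuming a stock spark-shell session; the table and column names (a1, b1, c1) are illustrative, not the original reproduction. The only join condition connects a and c, so on an affected version ReorderJoin pulls c forward to join it with a first:

```scala
import spark.implicits._

// Three trivial single-column tables (names are hypothetical).
val a = spark.range(3).selectExpr("id AS a1")
val b = spark.range(3).selectExpr("id AS b1")
val c = spark.range(3).selectExpr("id AS c1")

// b is cross-joined first, but the only equi-join condition links a and c,
// so ReorderJoin rewrites this to (a JOIN c) CROSS JOIN b.
val df = a.crossJoin(b).join(c, $"a1" === $"c1")
```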
Here's what the DataFrame thinks its schema is:
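With the sketch above, the user-facing schema is fixed at analysis time, so it still lists the columns in their original join order:

```scala
// The DataFrame's schema is resolved before the optimizer runs,
// so it reports the columns in the order they were joined.
df.schema.fieldNames   // Array(a1, b1, c1)
```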
Here's what the optimized plan thinks after join reordering:
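The optimized plan's output attributes can be inspected directly; on an affected version the reordered join surfaces them in a different order, for example:

```scala
// Hypothetical output on an affected Spark version: the attribute order
// follows the reordered joins and no longer matches df.schema.
df.queryExecution.optimizedPlan.output.map(_.name)   // e.g. Seq(a1, c1, b1)
```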
If we exclude the ReorderJoin rule (using Spark 2.4's optimizer rule exclusion feature), the column order is back to normal and the optimized plan's output matches the schema again.
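For reference, rule exclusion is driven by the spark.sql.optimizer.excludedRules configuration introduced in Spark 2.4 (SPARK-24802):

```scala
// Exclude ReorderJoin (and thus the reordering) for the current session.
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ReorderJoin")
```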
Note that this column-ordering problem leads to data corruption and can manifest itself in various symptoms:
- Silently corrupted data, if the reordered columns happen to have matching or sufficiently compatible types (e.g. all fixed-length primitive types are considered "sufficiently compatible" within an UnsafeRow; see the sketch after this list): the resulting data is simply wrong, but it might not trigger any alarms immediately. Or
- Weird Java-level exceptions such as java.lang.NegativeArraySizeException, or even SIGSEGVs.
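To illustrate the UnsafeRow point, here is a hypothetical sketch (using Catalyst's internal APIs, not the reported query) of what can happen when a row is written with one schema and read back with the columns transposed:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Write a row as (LongType, StringType)...
val write = UnsafeProjection.create(Array[DataType](LongType, StringType))
val row = write(InternalRow(42L, UTF8String.fromString("hello")))

// ...then read it back as if the columns were (StringType, LongType).
// Two fixed-length longs would merely swap values silently; here the long 42
// gets reinterpreted as a string's (offset, length) word, so the read may
// return garbage, throw something like NegativeArraySizeException, or SIGSEGV.
row.getLong(1)       // returns the string field's offset/length word, not a value
row.getUTF8String(0) // undefined: 42 interpreted as offset/length into the row
```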