Affects Version/s: 2.4.0
Fix Version/s: None
Spark version 2.4.0
Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 7, so it appears to be cross-platform.
Whenever a field in a PySpark Row requires serialization (such as a DateType or TimestampType), the DataFrame generated by the code below assigns column values in alphabetical order of the Row's field names, rather than assigning each value to its named column.
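A minimal sketch of the reproduction follows; the column names and values are illustrative stand-ins, not the exact originals:

{code:python}
import datetime

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The schema deliberately declares my_b before my_a, so the declared
# order differs from alphabetical order.
schema = StructType([
    StructField("date_column", DateType()),
    StructField("my_b", StringType()),
    StructField("my_a", StringType()),
])

# Row(**kwargs) stores its values sorted alphabetically by field name:
# (date_column, my_a, my_b).
data = [Row(my_a="my_a_value", my_b="my_b_value",
            date_column=datetime.date(2019, 1, 1))]

spark.createDataFrame(data, schema).show()
# +-----------+----------+----------+
# |date_column|      my_b|      my_a|
# +-----------+----------+----------+
# | 2019-01-01|my_a_value|my_b_value|
# +-----------+----------+----------+
{code}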
(Note that my_a_value and my_b_value are transposed.)
Reviewing the relevant code on GitHub, two conditional blocks are of interest:
Row is implemented as a tuple whose values are stored in alphabetical order of their field names, with dictionary-style access by name layered on top. In Block 2, there is a conditional that specifically handles serializing a Row object.
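Both snippets below are paraphrased from StructType.toInternal in python/pyspark/sql/types.py and may not match the 2.4 source verbatim. Block 2 looks roughly like this:

{code:python}
# Block 2 (paraphrased, not verbatim): taken when no field needs
# conversion.
if isinstance(obj, dict):
    return tuple(obj.get(n) for n in self.names)
elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
    # Rows built from kwargs are looked up by field name here,
    # so the schema's declared order is respected.
    return tuple(obj[n] for n in self.names)
elif isinstance(obj, (list, tuple)):
    return tuple(obj)
{code}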
There is no such condition in Block 1, so a Row falls through to the generic tuple branch instead:
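{code:python}
# Block 1 (paraphrased, not verbatim): taken when at least one field
# needs conversion (e.g. DateType or TimestampType).
if isinstance(obj, dict):
    return tuple(f.toInternal(obj.get(n)) if c else obj.get(n)
                 for n, f, c in zip(self.names, self.fields,
                                    self._needConversion))
elif isinstance(obj, (tuple, list)):
    # Row is a tuple subclass, so it lands here: values are paired
    # with schema fields purely by position, with no name lookup.
    return tuple(f.toInternal(v) if c else v
                 for f, v, c in zip(self.fields, obj,
                                    self._needConversion))
{code}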
The behaviour in the zip call is wrong: obj (the Row) yields its values in alphabetical field order, which can differ from the order of the schema fields. So we end up with:
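{code:python}
# Using the hypothetical repro above, zip(self.fields, obj, ...) pairs
# schema fields (declared order) with Row values (alphabetical order)
# purely by position:
#   date_column <- datetime.date(2019, 1, 1)  # happens to line up
#   my_b        <- "my_a_value"               # wrong column
#   my_a        <- "my_b_value"               # wrong column
{code}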
Correct behaviour is observed if you use a Python list or dict instead of PySpark's Row object:
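{code:python}
# Continuing the hypothetical repro above: dicts are looked up by
# name, and plain lists are expected to already be in schema order.
data = [{"my_a": "my_a_value", "my_b": "my_b_value",
         "date_column": datetime.date(2019, 1, 1)}]
spark.createDataFrame(data, schema).show()  # values land correctly

data = [[datetime.date(2019, 1, 1), "my_b_value", "my_a_value"]]
spark.createDataFrame(data, schema).show()  # values land correctly
{code}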
Correct behaviour is also observed if you have no fields that require serialization; in this example, changing date_column to StringType avoids the correctness issue.
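For example, with the hypothetical column names from above:

{code:python}
# With date_column as StringType, no field needs serialization, so
# Block 2's Row-specific branch applies and values are assigned by name.
schema_no_date = StructType([
    StructField("date_column", StringType()),
    StructField("my_b", StringType()),
    StructField("my_a", StringType()),
])
data = [Row(my_a="my_a_value", my_b="my_b_value",
            date_column="2019-01-01")]
spark.createDataFrame(data, schema_no_date).show()  # correct columns
{code}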