Affects Version/s: 1.6.2
Fix Version/s: None
Microsoft Windows 10
Standalone Spark Cluster
Caching a DataFrame with more than 200 columns causes its contents to be nulled out. This is quite a painful bug for us and has forced us to put band-aid workarounds throughout our production jobs.
Minimally reproducible example:
As shown in the example above, the underlying RDD seems to survive the caching, but as soon as we serialize to Parquet the corruption carries through completely.
This is occurring on Windows 10 with Python 3.5.x. We're running a Spark Standalone cluster. Everything works fine with fewer than 200 columns/fields. We currently have Kryo serialization turned on, but the same error manifested when we turned it off.
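(For reference, toggling Kryo on and off is normally done through Spark's serializer setting; a sketch of the relevant line is below, though whether it was set in spark-defaults.conf or programmatically here is an assumption.)

```
# spark-defaults.conf (sketch): enable Kryo serialization.
# Commenting this out falls back to the default JavaSerializer.
spark.serializer  org.apache.spark.serializer.KryoSerializer
```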
I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.
I searched for an existing issue about this and found nothing. My apologies if I missed one; there doesn't seem to be a good combination of keywords to describe this glitch.
Happy to provide more details.