Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.2
Fix Version/s: None
Component/s: None
Environment: Microsoft Windows 10, Python v3.5.x, Standalone Spark Cluster
Description
Caching a DataFrame with more than 200 columns causes its contents to be ~nulled. This is quite a painful bug for us and has recently forced us to put band-aid workarounds all over our production code.
Minimal reproducible example:
from pyspark.sql import SQLContext
import tempfile

sqlContext = SQLContext(sc)
path_fail_parquet = tempfile.mkdtemp() + '/fail_parquet.parquet'

list_df_varnames = []
list_df_values = []
for i in range(210):
    list_df_varnames.append('var' + str(i))
    list_df_values.append(str(i))

test_df = sqlContext.createDataFrame([list_df_values], list_df_varnames)

test_df.show()            # Still looks okay
print(test_df.collect())  # Still looks okay

test_df.cache()           # When everything goes awry

test_df.show()            # All values have been ~nulled
print(test_df.collect())  # Still looks okay

# Serialize and read back from parquet now
test_df.write.parquet(path_fail_parquet)
loaded_df = sqlContext.read.parquet(path_fail_parquet)

loaded_df.show()            # All values have been ~nulled
print(loaded_df.collect())  # All values have been ~nulled
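The snippet assumes the sc variable that the pyspark shell already provides. For completeness, here is a rough sketch of the context setup used when running it as a standalone script; the app name and master URL are placeholders, and the spark.serializer line is the Kryo setting mentioned below (we also re-ran with it commented out):

from pyspark import SparkConf, SparkContext

# Placeholder app name and standalone master URL -- substitute your own.
conf = (SparkConf()
        .setAppName('cache-nulling-repro')
        .setMaster('spark://<master-host>:7077'))

# Kryo on vs. off made no difference for us; comment this line out to fall
# back to the default Java serializer.
conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')

sc = SparkContext(conf=conf)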
As shown in the example above, the underlying RDD seems to survive the caching (collect() still returns the correct values even though show() displays nulls), but as soon as we serialize to Parquet and read it back, the corruption becomes complete: both show() and collect() return nulls.
This is occurring on Windows 10 with Python 3.5.x, running against a Spark Standalone cluster. Everything works fine with fewer than 200 columns/fields. We currently have Kryo serialization turned on, but the same error manifested when we turned it off.
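In case it helps with triage, here is a quick sketch of a probe for pinning down the exact column-count threshold; the helper name and the column counts tried are arbitrary choices on our part, and it reuses the sqlContext from the repro above:

import tempfile

def nulled_after_roundtrip(n_cols):
    # Build a one-row DataFrame with n_cols string columns, cache it,
    # round-trip it through Parquet, and report whether any value came
    # back as null.
    names = ['var' + str(i) for i in range(n_cols)]
    values = [str(i) for i in range(n_cols)]
    df = sqlContext.createDataFrame([values], names)
    df.cache()
    path = tempfile.mkdtemp() + '/probe_' + str(n_cols) + '.parquet'
    df.write.parquet(path)
    row = sqlContext.read.parquet(path).collect()[0]
    return any(row[name] is None for name in names)

# In our runs, fewer than 200 columns is fine and 210 is corrupted.
for n in (150, 190, 200, 210, 250):
    print(n, nulled_after_roundtrip(n))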
I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.
I tried to search for another issue related to this and came up with nothing. My apologies if I missed it; there doesn't seem to be a good combination of keywords to describe this glitch.
Happy to provide more details.
Issue Links
- duplicates SPARK-16664: "Spark 1.6.2 - Persist call on Data frames with more than 200 columns is wiping out the data." (Resolved)