Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.2
Fix Version/s: None
Component/s: None
Environment: Microsoft Windows 10, Python v3.5.x, Standalone Spark Cluster
Description
Caching a DataFrame with more than 200 columns causes its contents to be ~nulled. This is quite a painful bug for us and has recently forced us to put band-aid workarounds all over our production code.
Minimal reproducible example:
from pyspark.sql import SQLContext
import tempfile

sqlContext = SQLContext(sc)
path_fail_parquet = tempfile.mkdtemp() + '/fail_parquet.parquet'

list_df_varnames = []
list_df_values = []
for i in range(210):
    list_df_varnames.append('var' + str(i))
    list_df_values.append(str(i))

test_df = sqlContext.createDataFrame([list_df_values], list_df_varnames)

test_df.show()            # Still looks okay
print(test_df.collect())  # Still looks okay

test_df.cache()           # When everything goes awry

test_df.show()            # All values have been ~nulled
print(test_df.collect())  # Still looks okay

# Serialize and read back from parquet now
test_df.write.parquet(path_fail_parquet)
loaded_df = sqlContext.read.parquet(path_fail_parquet)

loaded_df.show()            # All values have been ~nulled
print(loaded_df.collect())  # All values have been ~nulled
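The snippet assumes the sc variable that the pyspark shell already provides. For completeness, here is a rough sketch of the context setup used when running it as a standalone script; the app name and master URL are placeholders, and the spark.serializer line is the Kryo setting mentioned below (we also re-ran with it commented out):

from pyspark import SparkConf, SparkContext

# Placeholder app name and standalone master URL -- substitute your own.
conf = (SparkConf()
        .setAppName('cache-nulling-repro')
        .setMaster('spark://<master-host>:7077'))

# Kryo on vs. off made no difference for us; comment this line out to fall
# back to the default Java serializer.
conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')

sc = SparkContext(conf=conf)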
As shown in the example above, the underlying RDD seems to survive the caching (collect() still returns the correct values even though show() displays nulls), but as soon as we serialize to Parquet and read it back, the corruption becomes complete: both show() and collect() return nulls.
This is occurring on Windows 10 with Python 3.5.x, running against a Spark Standalone cluster. Everything works fine with fewer than 200 columns/fields. We currently have Kryo serialization turned on, but the same error manifested when we turned it off.
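In case it helps with triage, here is a quick sketch of a probe for pinning down the exact column-count threshold; the helper name and the column counts tried are arbitrary choices on our part, and it reuses the sqlContext from the repro above:

import tempfile

def nulled_after_roundtrip(n_cols):
    # Build a one-row DataFrame with n_cols string columns, cache it,
    # round-trip it through Parquet, and report whether any value came
    # back as null.
    names = ['var' + str(i) for i in range(n_cols)]
    values = [str(i) for i in range(n_cols)]
    df = sqlContext.createDataFrame([values], names)
    df.cache()
    path = tempfile.mkdtemp() + '/probe_' + str(n_cols) + '.parquet'
    df.write.parquet(path)
    row = sqlContext.read.parquet(path).collect()[0]
    return any(row[name] is None for name in names)

# In our runs, fewer than 200 columns is fine and 210 is corrupted.
for n in (150, 190, 200, 210, 250):
    print(n, nulled_after_roundtrip(n))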
I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.
I tried to search for another issue related to this and came up with nothing. My apologies if I missed it; there doesn't seem to be a good combination of keywords to describe this glitch.
Happy to provide more details.
Issue Links
- duplicates SPARK-16664: "Spark 1.6.2 - Persist call on Data frames with more than 200 columns is wiping out the data." (Resolved)