[SPARK-17218] Caching a DataFrame with >200 columns ~nulls the contents


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: Microsoft Windows 10
      Python v3.5.x
      Standalone Spark Cluster

    Description

      Caching a DataFrame with >200 columns causes the contents to be ~nulled. This is quite a painful bug for us and has forced us to put all sorts of band-aid bypasses into our production work recently.

      Minimal reproducible example:

      from pyspark.sql import SQLContext
      import tempfile
      
      sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the pyspark shell
      path_fail_parquet = tempfile.mkdtemp() + '/fail_parquet.parquet'
      
      list_df_varnames = []
      list_df_values = []
      for i in range(210):
          list_df_varnames.append('var'+str(i))
          list_df_values.append(str(i))
      
      test_df = sqlContext.createDataFrame([list_df_values], list_df_varnames)
      test_df.show() # Still looks okay
      print(test_df.collect()) # Still looks okay
      
      test_df.cache() # When everything goes awry
      test_df.show() # All values have been ~nulled
      print(test_df.collect()) # Still looks okay
      
      # Serialize and read back from parquet now
      test_df.write.parquet(path_fail_parquet)
      loaded_df = sqlContext.read.parquet(path_fail_parquet)
      
      loaded_df.show() # All values have been ~nulled
      print(loaded_df.collect()) # All values have been ~nulled
      

      As shown in the example above, the underlying RDD seems to survive the caching (collect() still returns the correct values), but as soon as we serialize to Parquet the corruption becomes complete.
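
      Since the underlying RDD does still seem to hold the real values, the sketch below shows the kind of band-aid bypass mentioned earlier: inspect the RDD directly, then rebuild a fresh DataFrame from it before writing. The names rebuilt_df and path_bypass_parquet are purely illustrative, and I am not claiming this is a proper fix.

      # Sketch only: the underlying RDD still appears to carry the original values after cache()
      print(test_df.rdd.collect())
      
      # Illustrative band-aid: rebuild the DataFrame from the surviving RDD and its schema
      # before writing, rather than writing the cached DataFrame directly
      path_bypass_parquet = tempfile.mkdtemp() + '/bypass_parquet.parquet'
      rebuilt_df = sqlContext.createDataFrame(test_df.rdd, test_df.schema)
      rebuilt_df.write.parquet(path_bypass_parquet)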

      This is occurring on Windows 10 with Python 3.5.x. We're running a Spark Standalone cluster. Everything works fine with <200 columns/fields. We have Kryo serialization turned on at the moment, but the same error manifested when we turned it off.
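
      (For reference, Kryo is switched on via the standard spark.serializer setting; the snippet below is only a sketch of that configuration, not our exact production setup.)

      # Sketch of how Kryo gets enabled on the SparkConf; exact deployment config varies
      from pyspark import SparkConf
      conf = SparkConf().set("spark.serializer",
                             "org.apache.spark.serializer.KryoSerializer")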

      I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.

      I tried searching for an existing issue related to this and came up with nothing. My apologies if I missed it; there doesn't seem to be a good combination of keywords to describe this glitch.

      Happy to provide more details.

    People

      Assignee: Unassigned
      Reporter: Shea Parkes
      Votes: 0
      Watchers: 2
