Spark / SPARK-17218

Caching a DataFrame with >200 columns ~nulls the contents


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment:

      Microsoft Windows 10
      Python v3.5.x
      Standalone Spark Cluster

      Description

      Caching a DataFrame with >200 columns causes the contents to be ~nulled. This is quite a painful bug for us and has forced us to put all sorts of band-aid workarounds into our production jobs recently.

      Minimal reproducible example:

      from pyspark.sql import SQLContext
      import tempfile
      
      # `sc` is the SparkContext provided by the PySpark shell
      sqlContext = SQLContext(sc)
      path_fail_parquet = tempfile.mkdtemp() + '/fail_parquet.parquet'
      
      list_df_varnames = []
      list_df_values = []
      for i in range(210):
          list_df_varnames.append('var'+str(i))
          list_df_values.append(str(i))
      
      test_df = sqlContext.createDataFrame([list_df_values], list_df_varnames)
      test_df.show() # Still looks okay
      print(test_df.collect()) # Still looks okay
      
      test_df.cache() # When everything goes awry
      test_df.show() # All values have been ~nulled
      print(test_df.collect()) # Still looks okay
      
      # Serialize and read back from parquet now
      test_df.write.parquet(path_fail_parquet)
      loaded_df = sqlContext.read.parquet(path_fail_parquet)
      
      loaded_df.show() # All values have been ~nulled
      print(loaded_df.collect()) # All values have been ~nulled
      

      As the example above shows, the underlying RDD seems to survive the caching (collect() still returns the original values), but as soon as we serialize to Parquet the corruption becomes complete.
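
      As a rough illustration of the sort of band-aid we mean (simplified and somewhat hypothetical, not a verbatim copy of our production bypass), something along these lines avoids writing from the cached copy, assuming the uncached rows really are intact as collect() suggests:

      # Hypothetical band-aid sketch: drop the suspect cache and rebuild the
      # DataFrame from the underlying rows and schema before serializing.
      test_df.unpersist()
      rebuilt_df = sqlContext.createDataFrame(test_df.rdd, test_df.schema)
      rebuilt_df.write.parquet(tempfile.mkdtemp() + '/workaround.parquet')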

      This is occurring on Windows 10 with Python 3.5.x, running against a Spark Standalone cluster. Everything works fine with <200 columns/fields. We have Kryo serialization turned on at the moment, but the same error manifested when we turned it off.
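
      For reference, Kryo is toggled through the standard spark.serializer setting; a sketch of the sort of configuration involved is below (the master URL is just a placeholder, not our actual cluster address):

      from pyspark import SparkConf, SparkContext

      # Illustrative only (the shell already provides sc);
      # 'spark://master:7077' is a placeholder, not our real master URL.
      conf = (SparkConf()
              .setMaster('spark://master:7077')
              .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'))
      sc = SparkContext(conf=conf)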

      I will try to get this tested on Spark 2.0.0 in the near future, but I generally steer clear of x.0.0 releases as best I can.

      I tried to search for another issue related to this and came up with nothing. My apologies if I missed it; there doesn't seem to be a good combination of keywords to describe this glitch.

      Happy to provide more details.


    People

    • Assignee: Unassigned
    • Reporter: Shea Parkes (shea.parkes)
    • Votes: 0
    • Watchers: 2
