Spark / SPARK-17294

Caching invalidates data on mildly wide dataframes


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:
      None

      Description

Caching a dataframe with more than 200 columns causes the data within it to simply vanish under certain circumstances.

      Consider the following code, where we create a one-row dataframe containing the numbers from 0 to 200.

from pyspark.sql import functions as F

n_cols = 201
rng = range(n_cols)
# One row with 201 columns holding the values 0..200
df = spark.createDataFrame(
    data=[rng]
)

last = df.columns[-1]  # Spark auto-names the columns _1 .. _201
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
      

      Returns:

      [Row(_201=200)]
      
      +--------+
      |greatest|
      +--------+
      |     200|
      +--------+
      

As expected, column _201 contains the number 200, and the greatest value within that single row is 200.
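For reference, F.greatest over all columns of this one-row dataframe is just the row-wise maximum, which can be sanity-checked in plain Python (this is not Spark code, only a check of what the correct answer should be):

```python
# The single row holds the values 0..200 across 201 columns, so the
# row-wise maximum (what F.greatest computes here) should be 200.
n_cols = 201
row = list(range(n_cols))

expected_greatest = max(row)  # row-wise maximum over all columns
print(expected_greatest)      # 200
```

Any other result from the cached dataframe therefore indicates corrupted data rather than a misunderstanding of F.greatest.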

Now if we introduce a .cache() on df:

n_cols = 201
rng = range(n_cols)
df = spark.createDataFrame(
    data=[rng]
).cache()  # the only change: cache the dataframe

last = df.columns[-1]
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
      

      Returns:

      [Row(_201=200)]
      
      +--------+
      |greatest|
      +--------+
      |       0|
      +--------+
      

The last column _201 still seems to contain the correct value, but when I try to select the greatest value within the row, 0 is returned. When I issue .show() on the dataframe, all values are zero. As soon as I limit the selection to fewer than 200 columns, everything looks fine again.

When the number of columns is fewer than 200 from the beginning, even the cache does not break things and everything works as expected.

It doesn't matter whether the data is loaded from disk or created on the fly. This happens in both Spark 1.6.2 and 2.0.0 (I haven't tested anything else).

      Can anyone confirm this?

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
              • Reporter:
                kalle Karl-Johan Wettin
              • Votes:
                0
              • Watchers:
                1

                Dates

                • Created:
                  Updated:
                  Resolved: