Spark / SPARK-22538

SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: ML, PySpark, SQL, Web UI
    • Labels:
      None

      Description

      When the code below is run on PySpark v2.2.0, the cached input DataFrame df disappears from the Spark UI Storage tab after SQLTransformer.transform(...) is called on it.

      I don't yet know whether this is only a Spark UI bug or whether the input DataFrame df is actually unpersisted from memory. If the latter is true, this is a serious bug, because any new computation using new_df would have to redo all the work leading up to df.

      import pandas
      import pyspark
      from pyspark.ml.feature import SQLTransformer
      
      spark = pyspark.sql.SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))
      
      # after the step below, the Spark UI Storage tab shows 1 cached RDD
      df.cache()
      df.count()  # action that materializes the cache
      
      # after the step below, the cached RDD disappears from the Spark UI Storage tab
      new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)
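
One way to tell a UI-only glitch from a real unpersist is to ask Spark for the DataFrame's storage level before and after the transform. The sketch below is not part of the original report: the helper names `is_persisted` and `cache_survives` are hypothetical, and it assumes that `DataFrame.storageLevel` (available since Spark 2.1) reflects the JVM-side cache state. It deliberately avoids the Python-side `is_cached` flag, which, as far as I can tell, is only updated by `cache()`, `persist()`, and `unpersist()` calls made on the Python object itself.

```python
def is_persisted(df):
    """Return True if Spark reports any storage backing for df.

    Assumption: df.storageLevel exposes the useDisk, useMemory and
    useOffHeap flags, as pyspark.StorageLevel does.
    """
    level = df.storageLevel
    return bool(level.useDisk or level.useMemory or level.useOffHeap)


def cache_survives(df, transform):
    """Return (persisted_before, persisted_after) around transform(df).

    `transform` is any callable taking the DataFrame, for example the
    bound method SQLTransformer(...).transform. Its result is discarded;
    only the storage state of the input DataFrame is compared.
    """
    before = is_persisted(df)
    transform(df)
    after = is_persisted(df)
    return before, after
```

With the reproduction above, calling `cache_survives(df, SQLTransformer(statement='SELECT * FROM __THIS__').transform)` right after `df.cache(); df.count()` and getting `(True, False)` would suggest that df was really unpersisted, whereas `(True, True)` would point to a display problem in the Spark UI.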
      

            People

            • Assignee:
              Liang-Chi Hsieh (viirya)
            • Reporter:
              V Luong (MBALearnsToCode)
            • Votes:
              0
            • Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: