SPARK-22538: SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: ML, PySpark, SQL, Web UI
    • Labels: None

    Description

      When running the code below on PySpark v2.2.0, the cached input DataFrame df disappears from the Spark UI Storage tab after SQLTransformer.transform(...) is called on it.

      I don't yet know whether this is only a Spark UI bug or whether the input DataFrame df is actually unpersisted from memory. If the latter, this could be a serious bug, because any new computation using new_df would have to redo all the work leading up to df.

      import pandas
      import pyspark
      from pyspark.ml.feature import SQLTransformer
      
      spark = pyspark.sql.SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))
      
      # after the step below, the Spark UI Storage tab shows 1 cached RDD
      df.cache()
      df.count()  # action that materializes the cache
      
      # after the step below, the cached RDD disappears from the Spark UI Storage tab
      new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)
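
One way to tell a UI-only problem from a real unpersist is to ask Spark for the DataFrame's storage level before and after the transform, rather than relying on the Storage tab. The sketch below is not part of the original report: `cache_report` and `reproduce` are hypothetical helper names, and it assumes `DataFrame.storageLevel` (available since PySpark 2.1) reflects the JVM-side cache state.

```python
def cache_report(df):
    """Summarize whether Spark still reports `df` as persisted.

    Reads `DataFrame.storageLevel` instead of the Python-side `is_cached`
    flag, which is only set locally when `cache()` is called.
    """
    level = df.storageLevel
    return {
        "use_memory": bool(level.useMemory),
        "use_disk": bool(level.useDisk),
        "is_cached": bool(level.useMemory or level.useDisk),
    }


def reproduce(spark):
    """Hypothetical driver: rerun the steps from the report and compare."""
    # Imported lazily so cache_report can be used without pyspark.ml loaded.
    from pyspark.ml.feature import SQLTransformer

    df = spark.createDataFrame([(-1,), (0,), (1,)], ["x"])
    df.cache()
    df.count()  # action that materializes the cache
    before = cache_report(df)

    new_df = SQLTransformer(statement="SELECT * FROM __THIS__").transform(df)
    after = cache_report(df)  # storage level of the *input* df after transform
    return before, after, new_df
```

Calling `reproduce(spark)` with an active session returns the two reports; if `before["is_cached"]` is true and `after["is_cached"]` is false, the input DataFrame was probably really unpersisted and the issue is not limited to the UI.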
      

          People

            Assignee: viirya L. C. Hsieh
            Reporter: MBALearnsToCode V Luong
            Votes: 0
            Watchers: 4
