Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18608

Spark ML algorithms that check RDD cache level for internal caching double-cache data

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: ML
    • Labels:
      None

      Description

      Some algorithms in Spark ML (e.g. LogisticRegression, LinearRegression, and I believe now KMeans) handle persistence internally. They check whether the input dataset is cached, and if not they cache it for performance.

      However, the check is done using dataset.rdd.getStorageLevel == NONE. This will actually always be true, since even if the dataset itself is cached, the RDD returned by dataset.rdd will not be cached.

      Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.

      To see this:

      scala> import org.apache.spark.storage.StorageLevel
      import org.apache.spark.storage.StorageLevel
      
      scala> val df = spark.range(10).toDF("num")
      df: org.apache.spark.sql.DataFrame = [num: bigint]
      
      scala> df.storageLevel == StorageLevel.NONE
      res0: Boolean = true
      
      scala> df.persist
      res1: df.type = [num: bigint]
      
      scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
      res2: Boolean = true
      
      scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
      res3: Boolean = false
      
      scala> df.rdd.getStorageLevel == StorageLevel.NONE
      res4: Boolean = true
      

      Before SPARK-16063, there was no way to check the storage level of the input DataSet, but now we can, so the checks should be migrated to use dataset.storageLevel.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                podongfeng zhengruifeng
                Reporter:
                mlnick Nick Pentreath
              • Votes:
                0 Vote for this issue
                Watchers:
                11 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: