[SPARK-9141] DataFrame recomputed instead of using cached parent. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.4.0, 1.4.1
Fix Version/s: 1.5.0
Component/s: SQL
Labels: cache, dataframe
Target Version/s: 1.5.0
Sprint: Spark 1.5 release

Description

As I understand it, DataFrame.cache() is supposed to work the same way as RDD.cache(): repeated operations on a cached DataFrame should reuse the cached results rather than recompute the entire lineage. However, some DataFrame operations (e.g. withColumn) appear to change the underlying RDD lineage, so the cache is not used as expected.

Below is a Scala example that demonstrates this. First, I define two UDFs that call println, so it is easy to see when they are invoked. Next, I create a simple DataFrame with one row and two columns. Then I add a column, cache the result, and call count() to force the computation. Lastly, I add another column, cache that result, and call count() again.

I would have expected the last statement to only compute the last column, since everything else was cached. However, because withColumn() changes the lineage, the whole data frame is recomputed.

    // Imports needed for udf, toDF and the $"..." syntax (sqlContext is assumed to be in scope)
    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // Example UDFs that print a message each time they are called
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

    // Initial dataset: one row, two columns
    val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")

    // Add a column by applying the twice UDF, then cache and materialize
    val df2 = df1.withColumn("twice", twice($"value"))
    df2.cache()
    df2.count() // prints "Computed: twice(1)"

    // Add a column by applying the triple UDF to the cached parent
    val df3 = df2.withColumn("triple", triple($"value"))
    df3.cache()
    df3.count() // prints "Computed: twice(1)" and then "Computed: triple(1)"
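
One way to check this without relying on println output is to look at the physical plan of the child DataFrame and see whether it scans the in-memory cache. The following is only a sketch under assumptions: it is meant for a Spark 1.x spark-shell where `sc` and `sqlContext` are predefined, `readsFromCache` is a hypothetical helper (not a Spark API), and the names `base`, `parent`, and `child` are illustrative. The in-memory scan node is named differently across Spark versions, so the helper only matches on the substring "InMemory".

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Hypothetical helper (not part of Spark): reports whether the physical
// plan text mentions an in-memory scan. The exact node name depends on
// the Spark version, so this matches on a common substring only.
def readsFromCache(df: DataFrame): Boolean =
  df.queryExecution.executedPlan.toString.contains("InMemory")

val twiceUdf = udf { (x: Int) => x * 2 }

// Same shape as the example above: cache a parent, then derive a child.
val base = sc.parallelize(Seq(("a", 1))).toDF("name", "value")
val parent = base.withColumn("twice", twiceUdf($"value"))
parent.cache()
parent.count() // materialize the cache

val child = parent.withColumn("plusOne", $"value" + 1)

// Print the logical and physical plans, then the helper's verdict.
child.explain(true)
println(s"child reads from cache: ${readsFromCache(child)}")

// Release the cached data once the check is done.
parent.unpersist()
```

If the cached parent is being reused, the child's plan should contain an in-memory scan; if the whole lineage is recomputed, as described in this issue, it should not.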

I found a workaround, which helped me understand what was going on behind the scenes, but it doesn't seem like an ideal solution. Basically, I convert the DataFrame to an RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows the workaround for creating the second DataFrame so that cache works as expected.

    val df2 = {
      val tmp = df1.withColumn("twice", twice($"value"))
      sqlContext.createDataFrame(tmp.rdd, tmp.schema)
    }
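
The same workaround can be wrapped in a small function so it is easy to apply at each caching point. This is a minimal sketch, assuming a Spark 1.x spark-shell with `sc` and `sqlContext` predefined; `freezeLineage` is a hypothetical name introduced here, and `source` and `frozen` are illustrative.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Hypothetical helper (not part of Spark): rebuilds a DataFrame from its
// underlying RDD[Row] and schema, as in the workaround above.
def freezeLineage(df: DataFrame): DataFrame =
  df.sqlContext.createDataFrame(df.rdd, df.schema)

val twiceFn = udf { (x: Int) => x * 2 }

val source = sc.parallelize(Seq(("a", 1))).toDF("name", "value")

// Add the column, then round-trip through RDD before caching.
val frozen = freezeLineage(source.withColumn("twice", twiceFn($"value")))
frozen.cache()
frozen.count() // materialize the cache
```

Note that the round trip goes through an RDD of Row objects, so it likely adds conversion overhead; it is a stopgap for affected versions rather than a recommended pattern.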

Issue Links

links to

[Github] Pull Request #7920 (marmbrus)

[Github] Pull Request #7964 (yhuai)

People

Assignee: Michael Armbrust

Reporter: Nick Pritchard

Votes: 1

Watchers: 10

Dates

Created: 17/Jul/15 20:18

Updated: 02/Feb/16 01:23

Resolved: 05/Aug/15 16:02