Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
2.4.3
-
None
-
None
Description
There are three actions in this piece of code: reduceByKey, sortBy, and collect. But data is not persisted, which will cause recomputation.
private[fpm] def findFrequentItems[Item: ClassTag]( data: RDD[Array[Array[Item]]], minCount: Long): Array[Item] = { data.flatMap { itemsets => val uniqItems = mutable.Set.empty[Item] itemsets.foreach(set => uniqItems ++= set) uniqItems.toIterator.map((_, 1L)) }.reduceByKey(_ + _).filter { case (_, count) => count >= minCount }.sortBy(-_._2).map(_._1).collect() }
This issue is reported by our tool CacheCheck, which is used to dynamically detecting persist()/unpersist() api misuses.
Attachments
Issue Links
- duplicates
-
SPARK-29818 Missing persist on RDD
- Resolved
- links to