Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc.) call cache() on uncached input datasets to improve performance.
Unfortunately, these algorithms (a) check input persistence inaccurately (SPARK-18608) and (b) check the persistence level of the input dataset but not that of any of its parents. As a result, input data can be cached twice unnecessarily, which degrades performance (see SPARK-21799).
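The parent-lineage problem can be illustrated with a toy model. This is only a sketch in plain Python, not Spark code: ToyDataset, cached_lineage and fit_checking_input_only are hypothetical names invented here, and the model assumes that a transformation returns a new dataset that is not itself marked as cached.

```python
class ToyDataset:
    """Hypothetical stand-in for a Spark Dataset; not the Spark API."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def select(self, name):
        # Assumption: a transformation yields a new, uncached dataset
        # that is derived from this one.
        return ToyDataset(name, parent=self)


def cached_lineage(ds):
    """Return the names of all cached datasets in the lineage of ds."""
    names = []
    while ds is not None:
        if ds.cached:
            names.append(ds.name)
        ds = ds.parent
    return names


def fit_checking_input_only(ds):
    # Models the check described above: only the input's own persistence
    # is inspected, never that of its parents.
    if not ds.cached:
        ds.cache()
    return cached_lineage(ds)


raw = ToyDataset("raw").cache()      # the user caches the source data
features = raw.select("features")    # derived dataset, itself uncached
result = fit_checking_input_only(features)
print(result)  # ['features', 'raw']
```

In this model the algorithm caches the derived dataset even though its parent is already cached, so the same data ends up held twice.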
This ticket proposes adding a boolean handlePersistence param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache uncached input data. handlePersistence will be true by default, which corresponds to the existing behavior (always persisting uncached input), but users can take finer-grained control of input persistence by setting it to false.
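The intended semantics of the flag might look roughly like the following. Again this is a plain-Python sketch rather than the proposed Spark implementation: ToyDataset, ToyEstimator and the handle_persistence constructor argument are hypothetical stand-ins for a Dataset, an ML estimator and the handlePersistence param.

```python
class ToyDataset:
    """Hypothetical stand-in for a Spark Dataset; not the Spark API."""

    def __init__(self, cached=False):
        self.cached = cached
        self.cache_calls = 0  # counts how often cache() was invoked

    def cache(self):
        self.cached = True
        self.cache_calls += 1
        return self


class ToyEstimator:
    """Hypothetical estimator with a handlePersistence-style flag."""

    def __init__(self, handle_persistence=True):
        # True by default, matching the existing always-cache behavior.
        self.handle_persistence = handle_persistence

    def fit(self, ds):
        # Cache only if the user has not opted out and the input
        # is not already cached.
        if self.handle_persistence and not ds.cached:
            ds.cache()
        # ... iterative training would run here ...
        return self


default_input = ToyDataset()
ToyEstimator().fit(default_input)

opt_out_input = ToyDataset()
ToyEstimator(handle_persistence=False).fit(opt_out_input)

print(default_input.cache_calls, opt_out_input.cache_calls)  # 1 0
```

With the flag set to false, the estimator leaves the input untouched, so a user who has already cached a parent dataset can avoid the second copy.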