[SPARK-18356] KMeans should cache RDD before training - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0, 2.0.1
Fix Version/s: 2.2.0
Component/s: ML
Labels:
- easyfix

Description

Hello,

I'm new to Spark, but I think I have found a small problem that can affect Spark KMeans performance.
Before explaining the problem, I want to describe the warning that I encountered.

I tried to use Spark KMeans with DataFrames to cluster my data:

df_Part = assembler.transform(df_Part)
df_Part.cache()
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

But when I run the code, I receive this warning:
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.

I searched the Spark source code for the origin of this warning and found that two classes are involved:
(mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)
(mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala)

Even when my DataFrame is cached, the fit method converts it into an internal RDD, which is not cached:
DataFrame -> RDD -> run KMeans training algorithm (RDD)

-> The class in the ml package converts the DataFrame into an RDD and then calls the KMeans algorithm.
-> The class in the mllib package implements the KMeans algorithm; there Spark checks whether the RDD is cached and, if it is not, logs the warning.

So the solution to this problem is to cache the RDD before running the KMeans algorithm.
https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
All we need is to add two lines:
cache the RDD just after the DataFrame conversion, then uncache it after the training algorithm finishes.
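
The idea can be sketched in plain Python without Spark. This is only an illustration under stated assumptions, not the actual patch: StubRDD, run_kmeans, and fit are hypothetical stand-ins for the RDD, the mllib algorithm, and the ml fit method, and the "training" is a placeholder average. The sketch caches the data only when it is not already cached, so data cached by the caller is left untouched.

```python
import warnings


class StubRDD:
    """Hypothetical stand-in for a Spark RDD; it only tracks its cache state."""

    def __init__(self, rows, cached=False):
        self.rows = list(rows)
        self.cached = cached

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self


def run_kmeans(rdd):
    """Stand-in for the mllib-side algorithm: warns when the input is uncached."""
    if not rdd.cached:
        warnings.warn("The input data is not directly cached")
    # Placeholder for the real iterative training, which scans the data many times.
    return sum(rdd.rows) / len(rdd.rows)


def fit(rdd):
    """Stand-in for the ml-side fit: cache before training, release afterwards."""
    # Only manage caching that this function started itself.
    handle_persistence = not rdd.cached
    if handle_persistence:
        rdd.persist()
    try:
        return run_kmeans(rdd)
    finally:
        if handle_persistence:
            rdd.unpersist()
```

With this pattern, calling fit on an uncached StubRDD raises no warning and leaves it uncached afterwards, while an input that was already cached stays cached.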

I hope this is clear.
If you think I am wrong, please let me know.

Sincerely,
Zakaria HILI

Issue Links

links to

[Github] Pull Request #15964 (ZakariaHili)

[Github] Pull Request #15965 (ZakariaHili)

[Github] Pull Request #16295 (ZakariaHili)

People

Assignee: zakaria hili

Reporter: zakaria hili

Votes: 0

Watchers: 5

Dates

Created: 08/Nov/16 15:42

Updated: 24/Jul/19 10:50

Resolved: 25/Nov/16 13:20