[SPARK-12861] Changes to support KMeans with large feature space - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: ML, MLlib
Labels:
- bulk-closed
- patch

Description

The problem:
-----------------
In Spark's KMeans code the center vectors are always represented as dense vectors. As a result, when the feature space is large the algorithm quickly runs out of memory. In my example I have around 50000 features and k ~= 500. Stored as dense arrays of 8-byte doubles, this comes to around 200MB of RAM for the center vectors alone, while in fact the center vectors are very sparse and require far less RAM.
Since I am running on a system with relatively low resources, I keep getting OutOfMemory errors. In my setting it is OK to trade runtime for lower RAM usage. This is what I set out to do in my solution, while giving users the flexibility to choose.
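The 200MB figure above can be checked with a short calculation. The sketch below is only a back-of-the-envelope estimate: it counts 8 bytes per Double value and 4 bytes per Int index and ignores JVM object overhead. The object and method names are hypothetical, and the 500 nonzeros per center used in the usage note is an assumed figure for illustration, not a measurement from this issue.

```scala
object CenterMemoryEstimate {
  // Bytes for k dense centers: one 8-byte Double per dimension.
  def denseBytes(k: Long, dims: Long): Long = k * dims * 8L

  // Bytes for k sparse centers: each nonzero stores one 8-byte Double value
  // plus one 4-byte Int index.
  def sparseBytes(k: Long, nonZerosPerCenter: Long): Long =
    k * nonZerosPerCenter * (8L + 4L)
}
```

With k = 500 and 50000 dimensions, `denseBytes(500L, 50000L)` gives 200000000 bytes, which matches the estimate in the description. Under the assumed 500 nonzeros per center, `sparseBytes(500L, 500L)` gives 3000000 bytes.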

One solution could be to reduce the dimensionality of the feature space, but this is not always the best approach. Consider, for example, a case where the object space consists of users and the feature space of items, and we want to run kmeans over features that are a function of how many times user i clicked item j. If we reduce the dimensionality of the items, we will not be able to map the center vectors back to the items. Moreover, in a streaming context, detecting changes with respect to previous runs becomes more difficult.

My solution:
----------------
Allow the kmeans algorithm to accept a VectorFactory, which decides when vectors used inside the algorithm should be sparse and when they should be dense. For backward compatibility, the default behavior is to always make them dense (as is the case now). The user can now provide a SmartVectorFactory (or some proprietary VectorFactory) that can decide to make vectors sparse.
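To make the idea concrete, here is a minimal sketch of what such a factory could look like. It is not the code from the patch: `Vec`, `DenseVec`, and `SparseVec` are hypothetical stand-ins for MLlib's vector types so that the example runs without Spark, the `make` signature is assumed, and the density-threshold rule in `SmartVectorFactory` is only one possible policy.

```scala
// Hypothetical stand-ins for MLlib's dense and sparse vectors (illustration only).
sealed trait Vec { def size: Int }
final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Decides which representation a vector used inside the algorithm should take.
trait VectorFactory {
  def make(values: Array[Double]): Vec
}

// Backward-compatible default: always dense.
object DenseVectorFactory extends VectorFactory {
  def make(values: Array[Double]): Vec = DenseVec(values)
}

// Assumed policy: go sparse when the fraction of nonzeros is below maxDensity.
final class SmartVectorFactory(maxDensity: Double = 0.5) extends VectorFactory {
  def make(values: Array[Double]): Vec = {
    val nnz = values.count(_ != 0.0)
    if (nnz.toDouble / math.max(values.length, 1) < maxDensity) {
      // Keep only the positions that hold a nonzero value.
      val idx = values.indices.filter(values(_) != 0.0).toArray
      SparseVec(values.length, idx, idx.map(values(_)))
    } else {
      DenseVec(values)
    }
  }
}
```

For instance, `new SmartVectorFactory(0.5).make(Array(0.0, 0.0, 3.0, 0.0))` yields a sparse vector of size 4 with a single entry at index 2, while `DenseVectorFactory.make` always returns the dense form. A threshold somewhere below 2/3 is plausible because a sparse entry costs roughly 12 bytes against 8 bytes per dense entry, but the right value depends on the workload.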

For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing its indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory and two implementations of it that are used within the KMeans code
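Changes (1) and (2) fit together if the vector being updated by axpy is itself sparse: the result of a * x + y can have a different set of nonzero indices than y, so the indices and values need to be replaced. The following is a rough sketch of such a merge under that assumption. It works on plain arrays rather than MLlib's SparseVector, the `SparseAxpy` object is hypothetical, and it is not the implementation from the patch.

```scala
import scala.collection.mutable.ArrayBuffer

object SparseAxpy {
  // Computes a * x + y for two sparse vectors, each given as strictly
  // increasing indices with matching values. Returns the indices and values
  // of the result, which a reassign-style method could then store.
  def axpy(a: Double,
           xIdx: Array[Int], xVal: Array[Double],
           yIdx: Array[Int], yVal: Array[Double]): (Array[Int], Array[Double]) = {
    val outIdx = ArrayBuffer.empty[Int]
    val outVal = ArrayBuffer.empty[Double]
    var i = 0
    var j = 0
    while (i < xIdx.length || j < yIdx.length) {
      if (j >= yIdx.length || (i < xIdx.length && xIdx(i) < yIdx(j))) {
        // Index present only in x.
        outIdx += xIdx(i)
        outVal += a * xVal(i)
        i += 1
      } else if (i >= xIdx.length || yIdx(j) < xIdx(i)) {
        // Index present only in y.
        outIdx += yIdx(j)
        outVal += yVal(j)
        j += 1
      } else {
        // Index present in both vectors.
        outIdx += xIdx(i)
        outVal += a * xVal(i) + yVal(j)
        i += 1
        j += 1
      }
    }
    (outIdx.toArray, outVal.toArray)
  }
}
```

As a usage example, `SparseAxpy.axpy(2.0, Array(0, 3), Array(1.0, 1.0), Array(3, 5), Array(4.0, 1.0))` returns indices (0, 3, 5) with values (2.0, 6.0, 1.0). The sketch allocates new arrays on every call and keeps entries that cancel to zero, which trades some runtime for a smaller memory footprint, in line with the trade-off described above.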

To get the solution described above, do the following:

git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains

Note
------
Similar issues have been raised in the past, e.g.:
https://issues.apache.org/jira/browse/SPARK-4039
https://issues.apache.org/jira/browse/SPARK-1212
https://github.com/mesos/spark/pull/736

The difference is that in the problem I describe, reducing the dimensionality of the feature space to allow the use of dense vectors is not suitable. In addition, the solution I implemented handles this case while leaving full flexibility to the user, i.e., using the default dense vector implementation or selecting an alternative (only when the default is not desired).

Issue Links

duplicates: SPARK-4039 KMeans support sparse cluster centers (Resolved)

People

Assignee: Unassigned
Reporter: Roy Levin
Votes: 0
Watchers: 3

Dates

Created: 17/Jan/16 08:02
Updated: 21/May/19 04:37
Resolved: 21/May/19 04:37