[MAHOUT-823] RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5
Fix Version/s: 0.6
Component/s: classic
Labels:
- dot
- dot-product
- vector

Description

http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172

The complexity of the dot algorithm is O(number of non-default elements in this), while it could clearly be O(min(number of non-default elements in this, number of non-default elements in x)).

This can be fixed by adding the following check before line 189 of the linked file:

if (x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
  return x.dot(this);
}
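
To illustrate the idea outside of Mahout, here is a minimal sketch in which sparse vectors are modeled as plain HashMap instances. The class SparseDotSketch and its dot helper are hypothetical names and are not Mahout's implementation. The helper swaps its operands so that the loop always runs over the vector with fewer stored entries, which gives the O(min(...)) bound described above, assuming constant-time lookups.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical class and method names, not Mahout code.
final class SparseDotSketch {

    // Dot product of two sparse vectors stored as index -> value maps.
    // Iterates over the smaller map and looks each index up in the larger one.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        if (b.size() < a.size()) {
            return dot(b, a); // swap so the loop runs over the smaller operand
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // A sparse "point" with 2 entries (sizes chosen arbitrarily for the demo).
        Map<Integer, Double> point = new HashMap<>();
        point.put(3, 2.0);
        point.put(7, 1.0);

        // A dense-ish "centroid" with 1000 entries.
        Map<Integer, Double> centroid = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            centroid.put(i, 0.5);
        }

        // Both call orders loop over the 2 entries of point.
        System.out.println(dot(point, centroid));
        System.out.println(dot(centroid, point));
    }
}
```

When both maps have the same size the condition is false, so the swap cannot recurse more than once.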

An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.

In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.

So, if you mistakenly use centroid.dot(point) instead of point.dot(centroid) when computing the distance, performance becomes orders of magnitude worse. This is what we actually observed: clustering took a couple of minutes with the fix and over an hour without it.
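
The asymmetry can be made concrete by counting loop iterations. The sketch below is an assumption-laden illustration, not Mahout code: the class DotAsymmetryDemo, the naiveDot helper, and the vector sizes (a 100,000-entry centroid and a 10-entry point) are all hypothetical. naiveDot mimics the behaviour described in this issue by always iterating over the receiver.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names and sizes, not Mahout code.
final class DotAsymmetryDemo {

    // Number of loop iterations performed by the most recent naiveDot call.
    static long iterations;

    // Behaviour described in the issue: always iterate over the receiver ("self"),
    // regardless of which operand has fewer stored entries.
    static double naiveDot(Map<Integer, Double> self, Map<Integer, Double> x) {
        iterations = 0;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : self.entrySet()) {
            iterations++;
            sum += e.getValue() * x.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> centroid = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            centroid.put(i, 1.0);
        }
        Map<Integer, Double> point = new HashMap<>();
        for (int i = 0; i < 10; i++) {
            point.put(i * 1000, 1.0);
        }

        naiveDot(point, centroid);
        System.out.println("point.dot(centroid) iterations: " + iterations);
        naiveDot(centroid, point);
        System.out.println("centroid.dot(point) iterations: " + iterations);
    }
}
```

Both call orders return the same value, but the second one does work proportional to the size of the centroid rather than the size of the point.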

So if this fix is made, quite a few people who hit a similar case without noticing it may suddenly see a dramatic performance increase.

Attachments

MAHOUT-823.patch
30/Sep/11 12:24
6 kB
Sean R. Owen

People

Assignee: Sean R. Owen
Reporter: Eugene Kirpichov
Votes: 0
Watchers: 0

Dates

Created: 30/Sep/11 11:21
Updated: 31/Jan/24 22:16
Resolved: 01/Oct/11 13:30