[SPARK-4547] OOM when making bins in BinaryClassificationMetrics - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.3.0
Component/s: MLlib
Labels: None
Target Version/s: 1.3.0

Description

Also following up on http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoXUxFpMA@mail.gmail.com%3E – this one I intend to make a PR for a bit later. The conversation was basically:

Recently I was using BinaryClassificationMetrics to build an AUC curve for a classifier over a reasonably large number of points (~12M). The scores were all probabilities, so tended to be almost entirely unique.

The computation does some operations by key, and this ran out of memory. It's something you can solve with more than the default amount of memory, but in this case it did not seem useful to create an AUC curve with such fine-grained resolution.

I ended up just binning the scores so there were ~1000 unique values, and then it was fine.
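
The workaround described in this message can be illustrated with a minimal sketch in plain Python (not Spark code). It assumes scores are probabilities in [0, 1], labels are 0 or 1, and both classes are present; the fixed-width binning and the names `bin_scores` and `auc_from_bins` are illustrative assumptions, not what the reporter actually ran. The point is that the by-key aggregation sees at most `num_bins` distinct keys, however many points there are.

```python
from collections import defaultdict


def bin_scores(score_and_labels, num_bins=1000):
    """Aggregate [positives, negatives] counts per fixed-width score bin.

    Assumes every score lies in [0, 1] and every label is 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])
    for score, label in score_and_labels:
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        idx = min(int(score * num_bins), num_bins - 1)
        counts[idx][0 if label == 1 else 1] += 1
    return dict(counts)


def auc_from_bins(counts):
    """Trapezoidal area under the ROC curve built from binned counts.

    Assumes at least one positive and one negative example.
    """
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    # Walk the bins from highest score to lowest, as a threshold sweep would.
    for idx in sorted(counts, reverse=True):
        pos, neg = counts[idx]
        tp += pos
        fp += neg
        tpr, fpr = tp / total_pos, fp / total_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area
```

With `num_bins=1000`, the curve has at most 1000 points, which is usually more than enough resolution for an AUC estimate.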

and:

Yes, if there are many distinct values, we need binning to compute the AUC curve. Usually the scores are not evenly distributed, so we cannot simply truncate the digits. Estimating the quantiles for binning is necessary, similar to RangePartitioner:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104

Limiting the number of bins is definitely useful.
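
The quantile idea in this reply can be sketched roughly as follows, again in plain Python rather than Spark. This is a simplified sample-and-sort approach under stated assumptions; it is not how RangePartitioner is implemented, and `estimate_boundaries`, `assign_bin`, and `sample_size` are hypothetical names chosen for illustration. Boundaries are taken at evenly spaced ranks of a sorted sample, so each bin receives roughly the same number of scores even when the scores are heavily skewed.

```python
import bisect
import random


def estimate_boundaries(scores, num_bins, sample_size=10000, seed=0):
    """Estimate num_bins - 1 quantile boundaries from a random sample."""
    rng = random.Random(seed)
    if len(scores) <= sample_size:
        sample = list(scores)
    else:
        sample = rng.sample(scores, sample_size)
    ordered = sorted(sample)
    # Evenly spaced ranks in the sorted sample approximate the quantiles.
    return [ordered[(i * len(ordered)) // num_bins] for i in range(1, num_bins)]


def assign_bin(score, boundaries):
    """Return the index (0 .. len(boundaries)) of the bin containing score."""
    return bisect.bisect_right(boundaries, score)
```

For skewed scores, fixed-width bins would put most points into a few bins, whereas boundaries estimated this way keep the bins approximately balanced.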

Issue Links

links to

[Github] Pull Request #3702 (srowen)

People

Assignee: Sean R. Owen
Reporter: Sean R. Owen
Votes: 0
Watchers: 2

Dates

Created: 21/Nov/14 23:17
Updated: 31/Dec/14 21:37
Resolved: 31/Dec/14 21:37