[SPARK-1328] Current implementation of Standard Deviation in MLUtils may cause catastrophic cancellation, and loss precision. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0
Component/s: MLlib
Labels:
- MLLib,
- statistics
- vector

Description

Standard Deviation (SD) is used for dataset normalization, which is useful in the training process of Lasso, etc. Current implementation of SD is using the second-order expectations equation E^2( x )-E(x^2), which is not a stable algorithm facing with floating point computing.

Instead of that, the first-order equation performs better.

Moreover, MLutils is not a right place to hold standard statistics methods, It is more suitable that put it in the VectorRDDFunctions. Some other affected machine learning algorithms should also be refined.

Attachments

Activity

People

Assignee:: Xusen Yin

Reporter:: Xusen Yin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 26/Mar/14 03:28

Updated:: 12/Apr/14 02:45

Resolved:: 12/Apr/14 02:45