Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1328

Current implementation of Standard Deviation in MLUtils may cause catastrophic cancellation, and loss precision.

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: MLlib

      Description

      Standard Deviation (SD) is used for dataset normalization, which is useful in the training process of Lasso, etc. Current implementation of SD is using the second-order expectations equation E^2( x )-E(x^2), which is not a stable algorithm facing with floating point computing.

      Instead of that, the first-order equation performs better.

      Moreover, MLutils is not a right place to hold standard statistics methods, It is more suitable that put it in the VectorRDDFunctions. Some other affected machine learning algorithms should also be refined.

        Attachments

          Activity

            People

            • Assignee:
              yinxusen Xusen Yin
              Reporter:
              xusen Xusen Yin
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: