Spark / SPARK-1328

Current implementation of Standard Deviation in MLUtils may cause catastrophic cancellation and loss of precision.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: MLlib

    Description

      Standard deviation (SD) is used for dataset normalization, which is useful when training Lasso and similar models. The current implementation computes SD from the second-order expectation formula E(x^2) - E^2(x), which is not numerically stable in floating-point arithmetic: when the mean is large relative to the spread, the two terms are nearly equal and their difference loses most of its significant digits.

      The first-order equation, which works from deviations about the mean rather than from raw second moments, performs better.
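
      The difference can be sketched in a few lines. This is a minimal illustration, not the MLUtils code: the function names `naive_variance` and `two_pass_variance` are hypothetical, the data set is made up, and reading "first-order equation" as the two-pass, deviation-from-mean form is an assumption.

```python
import math


def naive_variance(xs):
    """Population variance as E(x^2) - E(x)^2 (the cancellation-prone form)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    # Both terms are around 1e18 for the data below, so the subtraction
    # cancels almost every significant digit that a double can hold.
    return mean_of_squares - mean * mean


def two_pass_variance(xs):
    """Population variance as E((x - E(x))^2): mean first, then deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Hypothetical data: small spread around a large mean.
# The deviations from the mean are -6, -3, 3, 6, so the exact
# population variance is (36 + 9 + 9 + 36) / 4 = 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("naive variance:   ", naive_variance(data))
print("two-pass variance:", two_pass_variance(data))
print("two-pass SD:      ", math.sqrt(two_pass_variance(data)))
```

      On this input the naive form cannot recover 22.5, because each squared value is already rounded to a multiple of 128 at that magnitude, and it may even come out negative, which makes the square root undefined. The two-pass form costs a second scan of the data; a one-pass, mergeable update would be another option in a distributed setting, but it is not shown here.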

      Moreover, MLUtils is not the right place for standard statistics methods; they would fit better in VectorRDDFunctions. Other machine learning algorithms affected by this change should be updated as well.

          People

            Assignee: yinxusen Xusen Yin
            Reporter: xusen Xusen Yin