  Spark / SPARK-13639

Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:

      Description

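      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // Three dense rows; the first column of the last row is NaN.
      val denseData = Array(
        Vectors.dense(3.8, 0.0, 1.8),
        Vectors.dense(1.7, 0.9, 0.0),
        Vectors.dense(Double.NaN, 0.0, 0.0)
      )

      // sc is an existing SparkContext (for example, in spark-shell).
      val rdd = sc.parallelize(denseData)
      println(Statistics.colStats(rdd).mean)

      Output:

      [NaN,0.3,0.6]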

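      This is a proposal for discussion on how colStats should handle NaN values in the input vectors. A single NaN in a column currently makes that column's mean NaN, as the output above shows. We could either ignore NaN values in the computation, or keep the current behavior and output NaN, which serves as a warning that the input contains NaN.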
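
To make the "ignore NaN" option concrete, here is a minimal sketch in plain Scala, without Spark. The object `NanAwareStats` and the method `colMeansIgnoringNaN` are hypothetical names chosen for illustration; they are not part of MLlib and do not describe how `colStats` is implemented. The sketch assumes each column's mean should be taken over its non-NaN entries only, and that a column with no valid entries should still report NaN.

```scala
// Hypothetical helper, not part of MLlib: column-wise mean that skips NaN entries.
object NanAwareStats {
  def colMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // per-column sum of non-NaN values
    val counts = new Array[Long](n)   // per-column count of non-NaN values
    for (row <- rows) {
      require(row.length == n, "all rows must have the same length")
      var j = 0
      while (j < n) {
        if (!row(j).isNaN) {
          sums(j) += row(j)
          counts(j) += 1
        }
        j += 1
      }
    }
    // A column with no valid entries has no defined mean, so report NaN.
    Array.tabulate(n)(j => if (counts(j) == 0) Double.NaN else sums(j) / counts(j))
  }
}
```

Applied to the three rows from the description, the first column's mean would be the average of 3.8 and 1.7 only, while the other two columns are unchanged. The trade-off is that the caller no longer sees any signal that the input contained NaN, which is the argument for keeping the current behavior.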

              People

              • Assignee: Unassigned
              • Reporter: yuhao yang (yuhaoyan)
              • Votes: 1
              • Watchers: 5

                Dates

                • Created:
                  Updated:
                  Resolved: