Spark / SPARK-13639

Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib

    Description

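      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // The third vector has NaN in its first column.
      val denseData = Array(
        Vectors.dense(3.8, 0.0, 1.8),
        Vectors.dense(1.7, 0.9, 0.0),
        Vectors.dense(Double.NaN, 0.0, 0.0)
      )

      // sc is an existing SparkContext (e.g. in spark-shell).
      val rdd = sc.parallelize(denseData)
      println(Statistics.colStats(rdd).mean)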

      [NaN,0.3,0.6]

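      This is a proposal for discussion on how Statistics.colStats should handle NaN values in the input vectors. As shown above, a single NaN in a column currently makes that column's mean NaN. We could either ignore NaN values in the computation, or keep the current behavior and output NaN, which acts as a warning that the input contains NaN.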
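
To make the "ignore NaN" option concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. The object `NaNAwareStats` and the method `columnMeansIgnoringNaN` are hypothetical names used for illustration only; they are not part of MLlib, and this is not how `colStats` is implemented. The sketch assumes that every row has the same length and that a column with no valid entries should still report NaN.

```scala
// Hypothetical helper, not part of MLlib: per-column mean that skips NaN entries.
object NaNAwareStats {
  /** Mean of each column over its non-NaN entries; a column with no valid entries yields NaN. */
  def columnMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // running sum of valid values per column
    val counts = new Array[Long](n)   // number of valid (non-NaN) values per column
    for (row <- rows; j <- 0 until n) {
      val v = row(j)
      if (!v.isNaN) {
        sums(j) += v
        counts(j) += 1
      }
    }
    Array.tabulate(n)(j => if (counts(j) == 0) Double.NaN else sums(j) / counts(j))
  }
}

// Same values as the example in the description.
val data = Seq(
  Array(3.8, 0.0, 1.8),
  Array(1.7, 0.9, 0.0),
  Array(Double.NaN, 0.0, 0.0)
)
val means = NaNAwareStats.columnMeansIgnoringNaN(data)
println(means.mkString("[", ",", "]"))
```

With this approach the first column's mean is taken over the two valid entries (approximately 2.75) instead of being NaN, while the other columns are unchanged. One consequence worth discussing is that the effective sample count then differs per column, so a single `count` in the summary would no longer describe every column.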

            People

              Assignee: Unassigned
              Reporter: yuhao yang (yuhaoyan)
              Votes: 1
              Watchers: 4