[SPARK-13639] Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed

Description

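import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// The third vector has NaN in its first column.
val denseData = Array(
  Vectors.dense(3.8, 0.0, 1.8),
  Vectors.dense(1.7, 0.9, 0.0),
  Vectors.dense(Double.NaN, 0.0, 0.0)
)

// sc is the SparkContext, e.g. the one provided by spark-shell.
val rdd = sc.parallelize(denseData)
println(Statistics.colStats(rdd).mean)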

Output: [NaN,0.3,0.6]

This is a proposal for discussion on how colStats should handle NaN values in the input vectors. We could either ignore NaN values when computing the statistics, or keep the current behavior and output NaN, which acts as a warning that the column contains NaN.
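
To make the first option concrete, here is a minimal sketch in plain Scala of a column mean that skips NaN entries. The object `NanAwareStats` and the method `colMeansIgnoringNaN` are hypothetical names invented for this illustration; they are not part of MLlib, and the sketch works on local arrays rather than an RDD. Returning NaN for a column with no valid entries is also an assumption, not something decided in this issue.

```scala
// Hypothetical helper, NOT part of MLlib: per-column means that skip NaN entries.
object NanAwareStats {
  def colMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.length
    val sums = new Array[Double](numCols)   // sum of the non-NaN values per column
    val counts = new Array[Long](numCols)   // number of non-NaN values per column
    for (row <- rows) {
      require(row.length == numCols, "all rows must have the same length")
      var j = 0
      while (j < numCols) {
        val v = row(j)
        if (!v.isNaN) {
          sums(j) += v
          counts(j) += 1
        }
        j += 1
      }
    }
    // Assumption: a column with no valid entries has no defined mean, so report NaN.
    Array.tabulate(numCols) { j =>
      if (counts(j) == 0) Double.NaN else sums(j) / counts(j)
    }
  }
}
```

Applied to the three vectors above, the first column would average only 3.8 and 1.7 (about 2.75) instead of producing NaN, while the other two columns stay at about 0.3 and 0.6. The trade-off is that the result no longer signals that the input contained NaN.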

Issue Links

is related to: SPARK-13568 Create feature transformer to impute missing values (Resolved)

People

Assignee: Unassigned
Reporter: yuhao yang
Votes: 1
Watchers: 4

Dates

Created: 03/Mar/16 05:46
Updated: 21/May/19 04:33
Resolved: 21/May/19 04:33