[SPARK-2012] PySpark StatCounter with numpy arrays - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: PySpark
Labels: None

Description

In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). This broke in 1.0.0 because the newly added minimum and maximum computation, as implemented, doesn't work on arrays.
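To illustrate the kind of failure described above, here is a minimal sketch. It assumes the running maximum is updated with Python's built-in max, which is a guess at the mechanism rather than a quote from the Spark source. The built-in compares its arguments with ">", and for multi-element numpy arrays that comparison yields a boolean array with no single truth value.

```python
import numpy as np

# Assumed starting value for a running maximum (illustrative only).
running_max = float("-inf")
value = np.array([1.0, 5.0])

try:
    # Built-in max evaluates `value > running_max`, which is a boolean
    # array; converting it to a single bool raises ValueError.
    running_max = max(running_max, value)
    builtin_max_works = True
except ValueError:
    builtin_max_works = False

# The same update on plain scalars is unaffected.
scalar_max = max(float("-inf"), 3.0)
```

With scalars the update behaves as expected, so the problem only shows up once the RDD elements are arrays.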

I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions "maximum" and "minimum", which work on both numpy arrays and scalars (and I've added new tests for this capability).
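The approach can be sketched roughly as follows. This is not the code from the PR: MiniStatCounter and its attribute names are hypothetical, and it tracks only the count, mean, minimum and maximum. The point is only that numpy.maximum and numpy.minimum accept scalars and arrays alike, so one update path can serve both.

```python
import numpy as np


class MiniStatCounter:
    """Hypothetical, stripped-down counter; not PySpark's StatCounter."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.max_value = float("-inf")
        self.min_value = float("inf")

    def merge(self, value):
        # Running mean update; works element-wise when value is an array.
        self.n += 1
        delta = value - self.mu
        self.mu = self.mu + delta / self.n
        # Element-wise max/min: valid for scalars and numpy arrays.
        self.max_value = np.maximum(self.max_value, value)
        self.min_value = np.minimum(self.min_value, value)
        return self


# Scalars
scalar_stats = MiniStatCounter().merge(1.0).merge(3.0)

# Arrays of the same shape
array_stats = MiniStatCounter()
array_stats.merge(np.array([1.0, 5.0]))
array_stats.merge(np.array([3.0, 2.0]))
```

For the array case the minimum, maximum and mean are themselves arrays, computed per component, which matches the per-vector statistics use case mentioned in the description.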

However, I realize this adds a dependency on NumPy outside of MLlib. If that's not OK, maybe it'd be worth adding this functionality as a util within PySpark MLlib?