Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2
Environment: Ubuntu 14.04, Python 2.7.6
Description
http://stackoverflow.com/q/34114585/4698425
In PySpark Streaming, the functions countByValue and countByValueAndWindow return a single number (the count of distinct elements) instead of a list of (k, v) pairs.
This is inconsistent with the documentation:
countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
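The discrepancy can be illustrated in plain Python (a sketch, not actual PySpark code): for one batch of elements, the documented behavior yields per-value frequencies, while the reported buggy behavior yields only the number of distinct values.

```python
from collections import Counter

# One batch (RDD) of the source DStream, for illustration.
batch = ["a", "b", "a", "c", "b", "a"]

# Documented semantics of countByValue: (value, count) pairs,
# where each value's count is its frequency in the batch.
documented = sorted(Counter(batch).items())

# Reported buggy behavior: a single number, the count of
# distinct elements in the batch.
buggy = len(set(batch))

print(documented)  # [('a', 3), ('b', 2), ('c', 1)]
print(buggy)       # 3
```

Under the documented contract, downstream code can treat the result as a keyed DStream (e.g. join or sort by value), which is impossible when only a scalar count is returned.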