Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2
Environment: Ubuntu 14.04, Python 2.7.6
Description
http://stackoverflow.com/q/34114585/4698425
In PySpark Streaming, the functions countByValue and countByValueAndWindow return a single number (the count of distinct elements) instead of a list of (k, v) pairs.
This is inconsistent with the documentation:
countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
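The discrepancy can be illustrated in plain Python (a sketch, not actual PySpark code): for one batch of elements, the documented behavior yields per-value frequencies, while the reported buggy behavior yields only the number of distinct values.

```python
from collections import Counter

# One batch (RDD) of the source DStream, for illustration.
batch = ["a", "b", "a", "c", "b", "a"]

# Documented semantics of countByValue: (value, count) pairs,
# where each value's count is its frequency in the batch.
documented = sorted(Counter(batch).items())

# Reported buggy behavior: a single number, the count of
# distinct elements in the batch.
buggy = len(set(batch))

print(documented)  # [('a', 3), ('b', 2), ('c', 1)]
print(buggy)       # 3
```

Under the documented contract, downstream code can treat the result as a keyed DStream (e.g. join or sort by value), which is impossible when only a scalar count is returned.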