Spark / SPARK-12353

wrong output for countByValue and countByValueAndWindow


Details

    Description

      http://stackoverflow.com/q/34114585/4698425

      In PySpark Streaming, the functions countByValue and countByValueAndWindow return a single number (the count of distinct elements) instead of a list of (k, v) pairs.

      This is inconsistent with the documentation:

      countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

      countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
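The documented per-batch behavior can be sketched in plain Python. Here `count_by_value` is a hypothetical helper, not the PySpark API; it shows what each RDD of the resulting DStream should contain, versus the single number the buggy implementation produces:

```python
from collections import Counter

def count_by_value(batch):
    # Documented behavior: for each batch of elements, emit
    # (value, frequency) pairs, one per distinct value.
    return sorted(Counter(batch).items())

batch = ["a", "b", "a", "c", "a", "b"]

# Expected output per the docs: [('a', 3), ('b', 2), ('c', 1)]
print(count_by_value(batch))

# The reported bug instead yields only the number of
# distinct elements, i.e. 3 for this batch.
print(len(set(batch)))
```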


          People

            Assignee: jerryshao Saisai Shao
            Reporter: krist.jin.apply@gmail.com Bo Jin


              Time Tracking

                Original Estimate: 2h
                Remaining Estimate: 2h
                Time Spent: Not Specified