Spark / SPARK-21057

Do not use a PascalDistribution in countApprox

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      I was reading the source of Spark, and found this:
      https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72

      This is the function that estimates the probability distribution of the total count of elements in an RDD given the count of only some partitions.

      This function does a strange thing: when the number of elements counted so far is less than 10,000, it models the total count with a negative binomial (Pascal) law; otherwise, it models it with a Poisson law.

      Modeling the count with a negative binomial law amounts to saying that we ran over the elements, counting only some, and stopped after having counted a fixed number of them.
      But that is not what actually happened: the counting was limited in time, not by the number of elements counted, and we cannot count only part of the elements in a partition.

      I propose to use the Poisson distribution in every case, as it can be justified under the hypothesis that the numbers of elements in the partitions are independent and each follows a Poisson law.
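
      To make the proposal concrete, here is a minimal illustrative sketch (in Python, not Spark's actual Scala implementation) of the Poisson-only approach: scale the partial count up to all partitions, then attach a confidence interval using the normal approximation to a Poisson distribution. The function name and the normal-approximation shortcut are assumptions for illustration only; Spark's CountEvaluator works directly with the Poisson quantiles.

      ```python
      import math
      from statistics import NormalDist

      def estimate_total_count(counted: int, counted_partitions: int,
                               total_partitions: int,
                               confidence: float = 0.95):
          """Hypothetical sketch: extrapolate a partial count and attach a
          Poisson-based confidence interval (normal approximation)."""
          # Extrapolate the count observed so far to all partitions.
          mean = counted * total_partitions / counted_partitions
          # For large means, Poisson(mean) is approximately
          # Normal(mean, sqrt(mean)), so the interval half-width is
          # z * sqrt(mean) for the chosen confidence level.
          z = NormalDist().inv_cdf(0.5 + confidence / 2)
          half_width = z * math.sqrt(mean)
          low = max(0.0, mean - half_width)
          high = mean + half_width
          return mean, low, high

      # Example: 5,000 elements counted in 50 of 100 partitions.
      est, low, high = estimate_total_count(5000, 50, 100)
      ```

      Note that this single code path applies regardless of how many elements have been counted so far, which is the point of the proposal: no switch to a Pascal distribution below a threshold.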

        Attachments

          Activity

            People

            • Assignee:
              srowen Sean Owen
            • Reporter:
              lovasoa Lovasoa
            • Votes:
              0
            • Watchers:
              3
