Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16484

Incremental Cardinality estimation operations with Hyperloglog

    XMLWordPrintableJSON

Details

    Description

      Efficient cardinality estimation is very important, and SparkSQL has had approxCountDistinct based on Hyperloglog for quite some time. However, there isn't a way to do incremental estimation. For example, if we want to get updated distinct counts of the last 90 days, we need to do the aggregation for the entire window over and over again. The more efficient way involves serializing the counter for smaller time windows (such as hourly) so the counts can be efficiently updated in an incremental fashion for any time window.
      With the support of custom UDAF, Binary DataType and the HyperloglogPlusPlus implementation in the current Spark version, it's easy enough to extend the functionality to include incremental counting, and even other general set operations such as intersection and set difference. Spark API is already as elegant as it can be, but it still takes quite some effort to do a custom implementation of the aforementioned operations which are supposed to be in high demand. I have been searching but failed to find an usable existing solution nor any ongoing effort for this. The closest I got is the following but it does not work with Spark 1.6 due to API changes.
      https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala

      I wonder if it worth to integrate such operations into SparkSQL. The only problem I see is it depends on serialization of a specific HLL implementation and introduce compatibility issues. But as long as the user is aware of such issue, it should be fine.

      Attachments

        Activity

          People

            Ryan Berti Ryan Berti
            yongjiaw Yongjia Wang
            Votes:
            2 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: