SPARK-6006
Optimize count distinct in case of high cardinality columns


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.2.1
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels: None

    Description

      When a column has many distinct values, count distinct becomes slow because all partial results are hashed into a single map. This can be improved by adding an intermediate stage that builds buckets (partial maps), so that the same key coming from the first stage's partial maps always hashes to the same bucket. Summing the sizes of these buckets then yields the total distinct count.
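
      As a minimal sketch of the approach (illustrative only, not the patch that resolved this issue; the sample data, bucket count, and object name are assumptions), the two stages can be expressed over RDDs in Scala:

      import org.apache.spark.{SparkConf, SparkContext}
      import scala.collection.mutable

      object BucketedCountDistinct {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("bucketed-count-distinct").setMaster("local[*]"))

          // Hypothetical high-cardinality input: ~3M distinct values.
          val values = sc.parallelize(1L to 10000000L, 16).map(_ % 3000000L)
          val numBuckets = 64 // illustrative; tune to cardinality and cluster size

          // Stage 1: partial maps -- deduplicate within each partition.
          val partial = values.mapPartitions { it =>
            val seen = mutable.HashSet.empty[Long]
            it.foreach(seen += _)
            seen.iterator
          }

          // Intermediate stage: hash each surviving value to a bucket and merge
          // per-bucket sets. Identical values from different partial maps always
          // meet in the same bucket, so no single map ever holds every value.
          val total = partial
            .map(v => ((v.hashCode & Int.MaxValue) % numBuckets, v))
            .aggregateByKey(mutable.HashSet.empty[Long], numBuckets)(
              (set, v) => { set += v; set },
              (a, b) => { a ++= b; a })
            .map { case (_, set) => set.size.toLong }
            .reduce(_ + _) // sum of bucket sizes = total distinct count

          println(s"distinct count = $total")
          sc.stop()
        }
      }

      Each bucket only ever sees the values that hash to it, so peak memory on any single task is roughly 1/numBuckets of the naive single-map approach.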

People

    • Assignee: davies (Davies Liu)
    • Reporter: saucam (Yash Datta)
    • Votes: 0
    • Watchers: 5
