  1. Kylin
  2. KYLIN-1186

Support precise Count Distinct using bitmap (under limited conditions)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v1.1
    • Fix Version/s: v1.3.0, v1.5.0
    • Component/s: Job Engine
    • Labels:
      None

      Description

      Currently, Kylin only supports approximate count distinct, implemented with HyperLogLog.
      In our production scenario there are strong requirements for precise count distinct, mainly on columns of type int or bigint, such as user-id, product-id, etc.
      Implementing precise count distinct for all types is difficult and inefficient. However, supporting only int or bigint makes this much easier: the values can be projected into a bitmap, which is easy to compress and store, and easy to count.
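
      To illustrate the idea of projecting integer values into a bitmap, here is a minimal sketch in Python. It is not the POC code and does not use RoaringBitmap: the class name IntBitmap is hypothetical, and a plain Python int stands in for an uncompressed bitmap. Each distinct value sets one bit, so duplicates collapse, the distinct count is the number of set bits, and partial results merge with a bitwise OR.

```python
class IntBitmap:
    """Hypothetical, uncompressed bitmap over non-negative ints.

    A Python int is used as the bit array; a real implementation would
    use a compressed structure such as RoaringBitmap instead.
    """

    def __init__(self):
        self.bits = 0

    def add(self, value):
        # Setting the same bit twice has no effect, so duplicates collapse.
        if value < 0:
            raise ValueError("only non-negative ints are supported")
        self.bits |= 1 << value

    def merge(self, other):
        # The union of two partial results is a bitwise OR.
        self.bits |= other.bits

    def count(self):
        # The precise distinct count is the number of set bits.
        return bin(self.bits).count("1")


# Two partitions of user ids, with duplicates inside and across partitions.
part_a = IntBitmap()
for user_id in [7, 42, 7, 1000]:
    part_a.add(user_id)

part_b = IntBitmap()
for user_id in [42, 5]:
    part_b.add(user_id)

part_a.merge(part_b)
distinct_users = part_a.count()
```

      Because merging is an exact union, the count stays precise when partial results are combined, which is not the case for simply summing per-partition counts.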
      I've created a POC based on RoaringBitmap that proves this works. There is some more work to be done:

      • RoaringBitmap only supports int, so a solution is needed to support bigint;
      • Add a new measure and codec, similar to HyperLogLogPlusCounter, to make it easy to use;
      • Add the new measure to the web UI, and check whether the column type is int or bigint;
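
      For the bigint item above, one possible approach (an assumption for illustration, not necessarily what the attached patches do) is to split each 64-bit value into a high part used as a key and a low part stored in a small per-key bitmap. The sketch below is in Python; the names LongBitmap, CHUNK_BITS and MASK64 are hypothetical, and plain Python ints stand in for the per-key bitmaps.

```python
MASK64 = (1 << 64) - 1   # reinterpret a signed 64-bit value as unsigned
CHUNK_BITS = 16          # low bits kept inside each per-key bitmap
LOW_MASK = (1 << CHUNK_BITS) - 1


class LongBitmap:
    """Hypothetical bitmap for signed 64-bit values (bigint)."""

    def __init__(self):
        # Maps the high 48 bits of a value to an int used as a 65536-bit bitmap.
        self.chunks = {}

    def add(self, value):
        if not -(1 << 63) <= value < (1 << 63):
            raise ValueError("value does not fit in a signed 64-bit integer")
        unsigned = value & MASK64
        key, low = unsigned >> CHUNK_BITS, unsigned & LOW_MASK
        self.chunks[key] = self.chunks.get(key, 0) | (1 << low)

    def merge(self, other):
        # Union chunk by chunk; keys missing on this side are copied over.
        for key, bits in other.chunks.items():
            self.chunks[key] = self.chunks.get(key, 0) | bits

    def count(self):
        # Sum the set bits of every chunk.
        return sum(bin(bits).count("1") for bits in self.chunks.values())


ids = LongBitmap()
for product_id in [1, 1, 2**40, -1, 2**63 - 1]:
    ids.add(product_id)
distinct_products = ids.count()
```

      Only keys that actually occur allocate a chunk, so sparse bigint ids do not require a bitmap spanning the whole 64-bit range.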

        Attachments

        1. KYLIN-1186-1.x-staging.patch
          22 kB
          Yerui Sun
        2. KYLIN-1186-1.x-staging.2.patch
          44 kB
          Yerui Sun
        3. KYLIN-1186-2.x-staging.2.patch
          47 kB
          Yerui Sun
        4. KYLIN-1186-2.x-staging.3.patch
          48 kB
          Yerui Sun

              People

              • Assignee:
                sunyerui Yerui Sun
                Reporter:
                sunyerui Yerui Sun
              • Votes:
                0
                Watchers:
                7

                  Time Tracking

                  Estimated:
                  168h
                  Remaining:
                  168h
                  Logged:
                  Not Specified