  1. Kylin
  2. KYLIN-1186

Support precise Count Distinct using bitmap (under limited conditions)

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v1.1
    • Fix Version/s: v1.3.0, v1.5.0
    • Component/s: Job Engine
    • Labels: None

    Description

      Currently, Kylin only supports approximate count distinct, based on HyperLogLog.
      In our production scenarios there is strong demand for precise count distinct, mainly on columns of type int or bigint, such as user-id and product-id.
      Implementing precise count distinct for all types is difficult and inefficient. Supporting only int and bigint, however, makes this much easier: the values can be projected into a bitmap, which is easy to compress, store, and count.
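
To illustrate the bitmap idea, here is a minimal sketch of exact distinct counting over int values. It is not the code from the attached patches: `java.util.BitSet` stands in for a compressed bitmap such as RoaringBitmap, and the class name `BitmapDistinctCounter` and its methods are hypothetical names chosen for this example. Negative values are rejected because `BitSet` has no negative indexes.

```java
import java.util.BitSet;

// Sketch only: java.util.BitSet is an uncompressed stand-in for RoaringBitmap.
final class BitmapDistinctCounter {
    private final BitSet bits = new BitSet();

    // Record one non-negative int value (e.g. a user-id); duplicates set the same bit.
    void add(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        bits.set(value);
    }

    // Merge another counter, as would happen when combining partial aggregates.
    void merge(BitmapDistinctCounter other) {
        bits.or(other.bits);
    }

    // Exact number of distinct values recorded so far.
    long count() {
        return bits.cardinality();
    }

    public static void main(String[] args) {
        BitmapDistinctCounter a = new BitmapDistinctCounter();
        for (int v : new int[] {7, 42, 7, 1000}) a.add(v);
        BitmapDistinctCounter b = new BitmapDistinctCounter();
        for (int v : new int[] {42, 5}) b.add(v);
        a.merge(b);
        System.out.println(a.count()); // prints 4 (distinct values: 5, 7, 42, 1000)
    }
}
```

Because merging is a bitwise OR, partial results can be combined in any order without double counting, which is what makes the count exact rather than estimated.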
      I've created a POC based on RoaringBitmap which shows that this works. Some work remains to be done:

      • RoaringBitmap only supports int, so a solution is needed to support bigint;
      • Add a new measure and codec, similar to HyperLogLogPlusCounter, to make it easy to use;
      • Add the new measure to the web UI, and check whether the column type is int or bigint.
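
For the bigint item above, one possible approach (an assumption made for illustration, not necessarily what the attached patches do) is to split each 64-bit value into a high part that selects a chunk and a low part that indexes a bitmap inside that chunk. The sketch below again uses `java.util.BitSet` in place of RoaringBitmap; `LongBitmapSketch` and its methods are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a map of small bitmaps keyed by the upper bits of each long value.
final class LongBitmapSketch {
    // Upper 48 bits of the value select a chunk; lower 16 bits index into it.
    private final Map<Long, BitSet> chunks = new HashMap<>();

    void add(long value) {
        long high = value >>> 16;          // unsigned shift, so negative values work too
        int low = (int) (value & 0xFFFF);  // always in the range 0..65535
        chunks.computeIfAbsent(high, k -> new BitSet()).set(low);
    }

    // Merge chunk by chunk with a bitwise OR.
    void merge(LongBitmapSketch other) {
        for (Map.Entry<Long, BitSet> e : other.chunks.entrySet()) {
            chunks.computeIfAbsent(e.getKey(), k -> new BitSet()).or(e.getValue());
        }
    }

    // Exact distinct count is the sum of the per-chunk cardinalities.
    long count() {
        long total = 0;
        for (BitSet chunk : chunks.values()) {
            total += chunk.cardinality();
        }
        return total;
    }

    public static void main(String[] args) {
        LongBitmapSketch s = new LongBitmapSketch();
        for (long v : new long[] {5_000_000_000L, 5_000_000_000L, 1L, -1L}) s.add(v);
        System.out.println(s.count()); // prints 3
    }
}
```

Each chunk covers a disjoint range of values, so the per-chunk counts can be summed without any overlap.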

      Attachments

        1. KYLIN-1186-1.x-staging.patch
          22 kB
          Yerui Sun
        2. KYLIN-1186-1.x-staging.2.patch
          44 kB
          Yerui Sun
        3. KYLIN-1186-2.x-staging.2.patch
          47 kB
          Yerui Sun
        4. KYLIN-1186-2.x-staging.3.patch
          48 kB
          Yerui Sun

            People

              Assignee: Yerui Sun (sunyerui)
              Reporter: Yerui Sun (sunyerui)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: 168h
                  Remaining Estimate: 168h
                  Time Spent: Not Specified