Kylin / KYLIN-1186

Support precise Count Distinct using bitmap (under limited conditions)

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v1.1
    • Fix Version/s: v1.3.0, v1.5.0
    • Component/s: Job Engine
    • Labels: None

    Description

      Currently, Kylin supports only approximate count distinct, implemented with HyperLogLog.
      In our production scenarios there is strong demand for precise count distinct, mainly on columns of type int or bigint, such as user-id and product-id.
      Implementing precise count distinct for all types is difficult and inefficient. Restricting support to int and bigint makes the problem much easier: the values can be mapped into a bitmap, which is easy to compress, store, and count.
      I've created a POC based on RoaringBitmap that shows the approach works. Some work remains:

      • RoaringBitmap supports only int, so a solution is needed to support bigint;
      • Add a new measure and codec, similar to HyperLogLogPlusCounter, to make it easy to use;
      • Add the new measure to the web UI, and check whether the column type is int or bigint.
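
      To illustrate the bitmap idea described above, here is a minimal sketch that is not taken from the POC or the attached patches. It uses java.util.BitSet as an uncompressed stand-in for RoaringBitmap, and it handles bigint by splitting each value into a high-bits key and a 16-bit offset, which is only one possible answer to the bigint question. The class name PreciseDistinctCounter and its methods are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical exact distinct counter for long values.
 * Each value is split into a key (upper 48 bits) and an offset (lower 16 bits);
 * the offset is recorded as one bit in the BitSet stored under that key.
 * BitSet is an uncompressed stand-in for a compressed bitmap such as RoaringBitmap.
 */
final class PreciseDistinctCounter {
    private final Map<Long, BitSet> chunks = new HashMap<>();

    /** Records a value; adding the same value twice has no further effect. */
    public void add(long value) {
        long key = value >>> 16;            // unsigned shift, so negative values get distinct keys
        int offset = (int) (value & 0xFFFF); // always in [0, 65535], a valid BitSet index
        chunks.computeIfAbsent(key, k -> new BitSet()).set(offset);
    }

    /** Merges another counter into this one with a per-chunk bitwise OR. */
    public void merge(PreciseDistinctCounter other) {
        for (Map.Entry<Long, BitSet> e : other.chunks.entrySet()) {
            chunks.computeIfAbsent(e.getKey(), k -> new BitSet()).or(e.getValue());
        }
    }

    /** Returns the exact number of distinct values seen so far. */
    public long count() {
        long total = 0;
        for (BitSet bits : chunks.values()) {
            total += bits.cardinality();
        }
        return total;
    }
}
```

      The merge step is what makes a bitmap usable as a pre-aggregated measure: partial results from different segments can be OR-ed together and still yield an exact count, which is not possible when only the per-segment counts are stored.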

      Attachments

        1. KYLIN-1186-2.x-staging.3.patch
          48 kB
          Yerui Sun
        2. KYLIN-1186-2.x-staging.2.patch
          47 kB
          Yerui Sun
        3. KYLIN-1186-1.x-staging.patch
          22 kB
          Yerui Sun
        4. KYLIN-1186-1.x-staging.2.patch
          44 kB
          Yerui Sun

          People

            Assignee: Yerui Sun (sunyerui)
            Reporter: Yerui Sun (sunyerui)
            Votes: 0
            Watchers: 7

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified
