Phoenix / PHOENIX-4164

APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid

    Description

      0: jdbc:phoenix:localhost> select count(*) from test;
      +-----------+
      | COUNT(1)  |
      +-----------+
      | 26931816  |
      +-----------+
      1 row selected (14.604 seconds)
      0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
      +----------------------------+
      | APPROX_COUNT_DISTINCT(V1)  |
      +----------------------------+
      | 17221394                   |
      +----------------------------+
      1 row selected (21.619 seconds)
      

      The table is populated with random numbers, so the cardinality of v1 should be close to the number of rows, yet the estimate (17,221,394) is about 36% below the row count (26,931,816).
      (I cannot run COUNT(DISTINCT v1) to get the exact figure: it uses up all the memory on my machine and eventually kills the RegionServer. That is a separate issue for a separate JIRA.)
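
      Since the exact distinct count is not available, one way to sanity-check the assumption that the cardinality is close to the row count is to compute how many distinct values uniform random draws are expected to produce. The sketch below is only illustrative: the report does not say how v1 was generated, so the domain sizes are hypothetical, and `expected_distinct` is a helper introduced here, not part of Phoenix.

```python
import math

ROWS = 26_931_816      # COUNT(*) from the report
ESTIMATE = 17_221_394  # APPROX_COUNT_DISTINCT(v1) from the report


def expected_distinct(draws: int, domain: int) -> float:
    """Expected number of distinct values after `draws` uniform draws
    with replacement from `domain` equally likely values.

    Formula: m * (1 - (1 - 1/m)^n), evaluated with log1p/expm1 so that
    very large domains do not lose precision.
    """
    return -domain * math.expm1(draws * math.log1p(-1.0 / domain))


ratio = ESTIMATE / ROWS
print(f"estimate / rows = {ratio:.3f}")

# Hypothetical domain sizes (assumptions, not taken from the report).
for domain in (ROWS, 2**31, 2**63):
    exp = expected_distinct(ROWS, domain)
    print(f"domain={domain:>26,}  expected distinct / rows = {exp / ROWS:.3f}")
```

      Under these assumptions, a generator whose range is about as large as the row count would leave only roughly 63% of the rows distinct, while a 64-bit range would make the cardinality essentially equal to the row count. Comparing the reported ratio against such figures helps separate estimator error from duplicates in the test data.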

      aertoria

    People

    • Assignee: Unassigned
    • Reporter: Lars Hofhansl (larsh)
    • Votes: 0
    • Watchers: 2
