Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7228

Histogram end points show high deviation for a sample data set

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.16.0
    • 1.17.0
    • None

    Description

      There are couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set.

      +Scenario 1: +
      There are total 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.

      select c_float from `table_stats/alltypes_with_nulls` order by c_float;
      +--------------+
      |   c_float    |
      +--------------+
      | -4.6873795E9 |
      | 8.1855632E7  |
      | 2.65311632E8 |
      | 4.50677952E8 |
      | 4.6864464E8  |
      | 5.7848493E8  |
      | 6.6793114E8  |
      | 7.1175571E8  |
      | 9.0065581E8  |
      | 9.2245773E8  |
      ...
      ...
      <100 rows>
      

      Here the minimum value is a small negative number. Here's the output of the histogram after running ANALYZE command:

       "buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.691596143999998E9, 9.983783792E9 ]
      

      Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list and the small negative number is not represented in the histogram at all.

      Scenario 2:
      Histogram for the c_bigint column is as below:

       {
            "column" : "`c_bigint`",
            "majortype" : {
              "type" : "BIGINT",
              "mode" : "OPTIONAL"
            },
            "schema" : 1.0,
            "rowcount" : 100.0,
            "nonnullrowcount" : 87.0,
            "ndv" : 46,
            "avgwidth" : 8.0,
            "histogram" : {
              "category" : "numeric-equi-depth",
              "numRowsPerBucket" : 8,
              "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ]
            }
          }
      

      This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data.
      This is the output of the ntile function:

      SELECT bucket_num,
       min(c_bigint) as min_amount,
      max(c_bigint) as max_amount,
       count(*) as total_count
         FROM (
           SELECT c_bigint,
               NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
               FROM `table_stats/alltypes_with_nulls`
        )
      GROUP BY bucket_num
       ORDER BY bucket_num;
      +------------+----------------------+----------------------+-------------+
      | bucket_num |      min_amount      |      max_amount      | total_count |
      +------------+----------------------+----------------------+-------------+
      | 1          | -8804872880253829120 | -6983033704176156672 | 10          |
      | 2          | -6772904422084182016 | -5326061597989273600 | 10          |
      | 3          | -5111449881868763136 | -2561061038367703040 | 10          |
      | 4          | -2424523650070740992 | -449093763428515840  | 10          |
      | 5          | 0                    | 0                    | 10          |
      | 6          | 0                    | 0                    | 10          |
      | 7          | 0                    | 0                    | 10          |
      | 8          | 0                    | 884838034226544640   | 10          |
      | 9          | 884838034226544640   | 4644147690488201216  | 10          |
      | 10         | null                 | null                 | 10          |
      +------------+----------------------+----------------------+-------------+
      

      Attachments

        Issue Links

          Activity

            People

              amansinha100 Aman Sinha
              amansinha100 Aman Sinha
              Gautam Parai Gautam Parai
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: