Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.16.0
-
None
Description
There are couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set.
+Scenario 1: +
There are total 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.
select c_float from `table_stats/alltypes_with_nulls` order by c_float; +--------------+ | c_float | +--------------+ | -4.6873795E9 | | 8.1855632E7 | | 2.65311632E8 | | 4.50677952E8 | | 4.6864464E8 | | 5.7848493E8 | | 6.6793114E8 | | 7.1175571E8 | | 9.0065581E8 | | 9.2245773E8 | ... ... <100 rows>
Here the minimum value is a small negative number. Here's the output of the histogram after running ANALYZE command:
"buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.691596143999998E9, 9.983783792E9 ]
Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list and the small negative number is not represented in the histogram at all.
Scenario 2:
Histogram for the c_bigint column is as below:
{ "column" : "`c_bigint`", "majortype" : { "type" : "BIGINT", "mode" : "OPTIONAL" }, "schema" : 1.0, "rowcount" : 100.0, "nonnullrowcount" : 87.0, "ndv" : 46, "avgwidth" : 8.0, "histogram" : { "category" : "numeric-equi-depth", "numRowsPerBucket" : 8, "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ] } }
This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data.
This is the output of the ntile function:
SELECT bucket_num, min(c_bigint) as min_amount, max(c_bigint) as max_amount, count(*) as total_count FROM ( SELECT c_bigint, NTILE(10) OVER (ORDER BY c_bigint) as bucket_num FROM `table_stats/alltypes_with_nulls` ) GROUP BY bucket_num ORDER BY bucket_num; +------------+----------------------+----------------------+-------------+ | bucket_num | min_amount | max_amount | total_count | +------------+----------------------+----------------------+-------------+ | 1 | -8804872880253829120 | -6983033704176156672 | 10 | | 2 | -6772904422084182016 | -5326061597989273600 | 10 | | 3 | -5111449881868763136 | -2561061038367703040 | 10 | | 4 | -2424523650070740992 | -449093763428515840 | 10 | | 5 | 0 | 0 | 10 | | 6 | 0 | 0 | 10 | | 7 | 0 | 0 | 10 | | 8 | 0 | 884838034226544640 | 10 | | 9 | 884838034226544640 | 4644147690488201216 | 10 | | 10 | null | null | 10 | +------------+----------------------+----------------------+-------------+
Attachments
Issue Links
- links to