Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
4.7.0
-
None
-
None
-
Phoenix version: 4.7.0
HBase version: 1.1.2
Description
The LAST_VALUE aggregation UDF doesn't give correct result when group by columns does not include all primary key columns.
The function works in a way that it ignores all the group by columns except the column in the order by clause in finding the last value. For example
Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value on m1
pk1 . | pk2 | pk3 . | pk4 | pk5 . | m1 | m2 | m3
------------------------------------------------------------------------------------------------------
test 201808010000 app1 plat1 lang1 1 10 100 test 201808010000 app1 plat1 lang2 2 10 100 test 201808010000 app1 plat2 lang1 3 10 100 test 201808010000 app2 plat1 lang1 4 10 100 test 201808010030 app1 plat1 lang1 10 10 100 test 201808010030 app1 plat1 lang2 20 10 100 test 201808010030 app1 plat2 lang1 30 10 100 test 201808010030 app2 plat1 lang1 40 10 100 test 201808010100 app1 plat1 lang1 100 10 100 test 201808010100 app1 plat1 lang2 200 10 100 test 201808010100 app1 plat2 lang1 300 10 100 test 201808010100 app2 plat1 lang1 400 10 100 test 201808010130 app1 plat1 lang1 1000 10 100 test 201808010130 app1 plat1 lang2 2000 10 100 test 201808010130 app1 plat2 lang1 3000 10 100 test 201808010130 app2 plat1 lang1 4000 10 100 test 201808010200 app1 plat1 lang1 10000 10 100 test 201808010200 app1 plat1 lang2 20000 10 100 test 201808010200 app1 plat2 lang1 30000 10 100 test 201808010200 app2 plat1 lang1 40000 10 100
If I run the following query (using all primary key columns in group by)
select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;
I get the correct result for last value as for each pk2 value which is used in order by there is only 1 row
------------
--------------------------------------------------
T pk3 pk4 pk5 OI SUM(M2) SUM(M3) ------------
--------------------------------------------------
2018080100 app1 plat1 lang1 10 20 200 2018080100 app1 plat1 lang2 20 20 200 2018080100 app1 plat2 lang1 30 20 200 2018080100 app2 plat1 lang1 40 20 200 2018080101 app1 plat1 lang1 1000 20 200 2018080101 app1 plat1 lang2 2000 20 200 2018080101 app1 plat2 lang1 3000 20 200 2018080101 app2 plat1 lang1 4000 20 200 2018080102 app1 plat1 lang1 10000 20 100 2018080102 app1 plat1 lang2 20000 10 100 2018080102 app1 plat2 lang1 30000 10 100 2018080102 app2 plat1 lang1 40000 10 100 ------------
--------------------------------------------------
However if I do I group by on less than all the primary columns the LAST_VALUE function ignores the rest of the group by columns in sorting and returns incorrect last_value
select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3, last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from test_table group by pk1,t,pk3;
------------
------------------------------
T pk3 OI SUM(M2) SUM(M3) ------------
------------------------------
2018080100 app1 10 60 600 2018080100 app2 40 20 200 2018080101 app1 1000 60 600 2018080101 app2 4000 20 200 2018080102 app1 10000 40 300 2018080102 app2 40000 10 100 ------------
------------------------------
So instead of taking the last value of the group formed by 2018080100 and app1 i.e 30 it is picking the first i.e 10.