Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.1.0, 2.3.0
-
None
Description
For query below:
select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform
We intended to group by 100 and get distinct deviceid number.
By spark sql,we get:
+-----------+-----------+-----------+--+ | date | platform | new_user | +-----------+-----------+-----------+--+ | 20180528 | 100 | 521 | | 20180528 | 100 | 82 | | 20180528 | 100 | 3 | | 20180528 | 100 | 2 | | 20180528 | 100 | 7 | | 20180528 | 100 | 870 | | 20180528 | 100 | 3 | | 20180528 | 100 | 8 | | 20180528 | 100 | 3 | | 20180528 | 100 | 2204 | | 20180528 | 100 | 1123 | | 20180528 | 100 | 1 | | 20180528 | 100 | 54 | | 20180528 | 100 | 440 | | 20180528 | 100 | 4 | | 20180528 | 100 | 478 | | 20180528 | 100 | 34 | | 20180528 | 100 | 195 | | 20180528 | 100 | 17 | | 20180528 | 100 | 18 | | 20180528 | 100 | 2 | | 20180528 | 100 | 2 | | 20180528 | 100 | 84 | | 20180528 | 100 | 1616 | | 20180528 | 100 | 15 | | 20180528 | 100 | 7 | | 20180528 | 100 | 479 | | 20180528 | 100 | 50 | | 20180528 | 100 | 376 | | 20180528 | 100 | 21 | | 20180528 | 100 | 842 | | 20180528 | 100 | 444 | | 20180528 | 100 | 538 | | 20180528 | 100 | 1 | | 20180528 | 100 | 2 | | 20180528 | 100 | 7 | | 20180528 | 100 | 17 | | 20180528 | 100 | 133 | | 20180528 | 100 | 7 | | 20180528 | 100 | 415 | | 20180528 | 100 | 2 | | 20180528 | 100 | 318 | | 20180528 | 100 | 5 | | 20180528 | 100 | 1 | | 20180528 | 100 | 2060 | | 20180528 | 100 | 1217 | | 20180528 | 100 | 2 | | 20180528 | 100 | 60 | | 20180528 | 100 | 22 | | 20180528 | 100 | 4 | +-----------+-----------+-----------+--+
Actually sum of the deviceid is below:
0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform)t1;
+----------------+--+
| sum(new_user) |
+----------------+--+
| 14816 |
+----------------+--+
1 row selected (4.934 seconds)
And the real distinct deviceid value is below:
0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528;
+-----------+-----------+--+
| platform | new_user |
+-----------+-----------+--+
| 100 | 14773 |
+-----------+-----------+--+
1 row selected (2.846 seconds)
In impala,with the first query we can get result below:
[xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform +----------+----------+----------+ | date | platform | new_user | +----------+----------+----------+ | 20180528 | 100 | 14773 | +----------+----------+----------+ Fetched 1 row(s) in 1.00s