Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24440

When use constant as column we may get wrong answer versus impala

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.1.0, 2.3.0
    • None
    • SQL

    Description

      For query below:

      select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform
      

      We intended to group by 100 and get distinct deviceid number.
      By spark sql,we get:

      +-----------+-----------+-----------+--+
      |   date    | platform  | new_user  |
      +-----------+-----------+-----------+--+
      | 20180528  | 100       | 521       |
      | 20180528  | 100       | 82        |
      | 20180528  | 100       | 3         |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 7         |
      | 20180528  | 100       | 870       |
      | 20180528  | 100       | 3         |
      | 20180528  | 100       | 8         |
      | 20180528  | 100       | 3         |
      | 20180528  | 100       | 2204      |
      | 20180528  | 100       | 1123      |
      | 20180528  | 100       | 1         |
      | 20180528  | 100       | 54        |
      | 20180528  | 100       | 440       |
      | 20180528  | 100       | 4         |
      | 20180528  | 100       | 478       |
      | 20180528  | 100       | 34        |
      | 20180528  | 100       | 195       |
      | 20180528  | 100       | 17        |
      | 20180528  | 100       | 18        |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 84        |
      | 20180528  | 100       | 1616      |
      | 20180528  | 100       | 15        |
      | 20180528  | 100       | 7         |
      | 20180528  | 100       | 479       |
      | 20180528  | 100       | 50        |
      | 20180528  | 100       | 376       |
      | 20180528  | 100       | 21        |
      | 20180528  | 100       | 842       |
      | 20180528  | 100       | 444       |
      | 20180528  | 100       | 538       |
      | 20180528  | 100       | 1         |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 7         |
      | 20180528  | 100       | 17        |
      | 20180528  | 100       | 133       |
      | 20180528  | 100       | 7         |
      | 20180528  | 100       | 415       |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 318       |
      | 20180528  | 100       | 5         |
      | 20180528  | 100       | 1         |
      | 20180528  | 100       | 2060      |
      | 20180528  | 100       | 1217      |
      | 20180528  | 100       | 2         |
      | 20180528  | 100       | 60        |
      | 20180528  | 100       | 22        |
      | 20180528  | 100       | 4         |
      +-----------+-----------+-----------+--+
      

      Actually sum of the deviceid is below:

      0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform)t1; 
      +----------------+--+
      | sum(new_user)  |
      +----------------+--+
      | 14816          |
      +----------------+--+
      1 row selected (4.934 seconds)
      

      And the real distinct deviceid value is below:

      0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528;
      +-----------+-----------+--+
      | platform  | new_user  |
      +-----------+-----------+--+
      | 100       | 14773     |
      +-----------+-----------+--+
      1 row selected (2.846 seconds)
      

      In impala,with the first query we can get result below:

      [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform
      +----------+----------+----------+
      | date     | platform | new_user |
      +----------+----------+----------+
      | 20180528 | 100      | 14773    |
      +----------+----------+----------+
      Fetched 1 row(s) in 1.00s
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            cane zhoukang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: