Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-1592

COUNT(DISTINCT(field)) is not deterministic

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Incomplete
    • Impala 1.2.4
    • None
    • Clients
    • Amazon EMR Cluster:

      AMI version:3.2.1
      Hadoop distribution:Amazon 2.4.0
      Applications:Hive latest, Impala latest

      Master:Running1m3.xlarge
      Core:Running1m3.2xlarge

    Description

      I create this kind of table:

      ------------------------------------------
      CREATE EXTERNAL TABLE IF NOT EXISTS o2 (
      log_date BIGINT,
      service_key STRING,
      user_id STRING,
      user_country STRING
      )
      PARTITIONED BY (year SMALLINT, month TINYINT)
      STORED AS PARQUET;
      ------------------------------------------

      Afterwards I import data from HIVE:

      ------------------------------------------
      SET hive.exec.dynamic.partition=true;
      SET hive.exec.dynamic.partition.mode=nonstrict;
      add jar s3://.../parquet-hive-bundle-1.6.0rc2.jar;

      INSERT OVERWRITE TABLE o2 PARTITION(year, month) SELECT * FROM o2_hive;
      ------------------------------------------

      Then I run queries:

      ------------------------------------------
      [ip-10-169-190-25.ec2.internal:21000] > select COUNT(DISTINCT(t.user_id)) FROM o2 t WHERE (t.service_key = "my_key") AND month=3 AND year=2014;
      Query: select COUNT(DISTINCT(t.user_id)) FROM o2 t WHERE (t.service_key = "my_key") AND month=3 AND year=2014
      -----------------------------

      count(distinct (t.user_id))

      -----------------------------

      3

      -----------------------------
      Returned 1 row(s) in 0.97s
      [ip-10-169-190-25.ec2.internal:21000] > select COUNT(DISTINCT(t.user_id)) FROM o2 t WHERE (t.service_key = "my_key") AND month=3 AND year=2014;
      Query: select COUNT(DISTINCT(t.user_id)) FROM o2 t WHERE (t.service_key = "my_key") AND month=3 AND year=2014
      -----------------------------

      count(distinct (t.user_id))

      -----------------------------

      4

      -----------------------------
      ------------------------------------------

      As you can see I run the very same query but I get different results. Do you have any idea what might be wrong?

      Thank you!

      Attachments

        Activity

          People

            Unassigned Unassigned
            larsxschneider_impala_37e2 larsxschneider
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: