IMPALA-2750

Wrong query results for COUNT(*) from an external delimited table

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.5.0
    • Component/s: Backend
    • Environment: nightly cluster

    Description

      The attached CSV file has 7300 rows. When I register it as an external table and run SELECT COUNT(*) on the nightly cluster, the result is 2347 instead of 7300.

      To reproduce, log in to root@nightly-2.vpc.cloudera.com. The CSV file is already in /tmp on that host.

      Place the attached CSV file on HDFS:

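      # Check the local line count of the file
      wc -l /tmp/0.csv
      # Create the table directory on HDFS and upload the file into it
      hadoop fs -mkdir -p /tmp/my_csv
      hadoop fs -put /tmp/0.csv /tmp/my_csv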
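
Before comparing against the query result, it can help to confirm that the local file looks the way the table definition expects. The following is a minimal sketch, not part of the original report: the helper names `count_lines` and `count_bad_lines` are hypothetical, and the field check is deliberately naive (it treats every comma as a separator, so a backslash-escaped comma inside a value would be flagged as a candidate, not a confirmed problem).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original report) for sanity-checking
# a comma-delimited file before uploading it to HDFS.

# Print the number of newline-terminated lines in the file given as $1.
count_lines() {
  wc -l < "$1" | tr -d '[:space:]'
}

# Print how many lines in file $1 do NOT split into exactly $2
# comma-separated fields. Naive: escaped commas are not recognized.
count_bad_lines() {
  awk -F',' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Example usage (path from the report; 13 is the column count of my_csv):
#   count_lines /tmp/0.csv
#   count_bad_lines /tmp/0.csv 13
```

If `count_lines` matches the expected row count and `count_bad_lines` reports nothing unusual, the input file itself is less likely to be the cause of the discrepancy.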
      

      Open impala-shell and execute:

      CREATE EXTERNAL TABLE my_csv
      (`id` int,
       `bool_col` boolean,
       `tinyint_col` tinyint,
       `smallint_col` smallint,
       `int_col` int,
       `bigint_col` bigint,
       `float_col` float,
       `double_col` double,
       `date_string_col` string,
       `string_col` string,
       `timestamp_col` timestamp,
       `year` int,
       `month` int)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LOCATION '/tmp/my_csv'
      TBLPROPERTIES('serialization.null.format'='#NULL');
      
      SELECT COUNT(*) FROM my_csv;
      

      The query returns 2347 instead of the expected 7300.

      Attachments

        1. 0.csv
          611 kB
          Uri Laserson

          People

            Assignee: Martin Grund (mgrund_impala_bb91)
            Reporter: Uri Laserson (laserson)
            Votes: 0
            Watchers: 7
