Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4153

count(*) is not working properly on all blank('') columns in RCFile stored tables

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      The aggregate function count is not working properly on those table columns that contain '' (blank) values only. It returns incorrect values seemingly handleing it as NULLs. The issue is reproducible in multiple environments (CDH 5.5.2, 5.7.2) and it seems it is Impala and RCFile specific:

      [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue;
      Query: select * from impala_rcfile_count_issue
      +------+------+------+
      | col1 | col2 | col3 |
      +------+------+------+
      | 1    |      | a    |
      |      |      | b    |
      | 2    |      | c    |
      +------+------+------+
      Fetched 3 row(s) in 0.22s
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col1='';
      Query: select * from impala_rcfile_count_issue where col1=''
      +------+------+------+
      | col1 | col2 | col3 |
      +------+------+------+
      |      |      | b    |
      +------+------+------+
      Fetched 1 row(s) in 0.12s
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col1='';
      Query: select count(*) from impala_rcfile_count_issue where col1=''
      +----------+
      | count(*) |
      +----------+
      | 1        |
      +----------+
      Fetched 1 row(s) in 0.67s
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col2='';
      Query: select * from impala_rcfile_count_issue where col2=''
      +------+------+------+
      | col1 | col2 | col3 |
      +------+------+------+
      | 1    |      | a    |
      |      |      | b    |
      | 2    |      | c    |
      +------+------+------+
      Fetched 3 row(s) in 0.12s
      
      INCORRECT:
      
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col2='';
      Query: select count(*) from impala_rcfile_count_issue where col2=''
      +----------+
      | count(*) |
      +----------+
      | 0        |
      +----------+
      Fetched 1 row(s) in 0.76s
      
      
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col1 is NULL;
      Query: select * from impala_rcfile_count_issue where col1 is NULL
      
      Fetched 0 row(s) in 0.22s
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col1 is NULL;
      Query: select count(*) from impala_rcfile_count_issue where col1 is NULL
      +----------+
      | count(*) |
      +----------+
      | 0        |
      +----------+
      Fetched 1 row(s) in 0.68s
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select * from impala_rcfile_count_issue where col2 is NULL;
      Query: select * from impala_rcfile_count_issue where col2 is NULL
      
      Fetched 0 row(s) in 0.12s
      
      INCORRECT:
      
      [morhidi-552-sec-2.gce.cloudera.com:21000] > select count(*) from impala_rcfile_count_issue where col2 is NULL;
      Query: select count(*) from impala_rcfile_count_issue where col2 is NULL
      +----------+
      | count(*) |
      +----------+
      | 3        |
      +----------+
      Fetched 1 row(s) in 0.64s
      

      Here is the command to create the table from Hive:

      create table impala_rcfile_count_issue stored as rcfile as
      select * from
      (
      select '1' as col1,'' as col2, 'a' as col3
      union all
      select '' as col1,'' as col2, 'b' as col3
      union all
      select '2' as col1,'' as col2, 'c' as col3
      ) t;
      

        Activity

        Hide
        alex.behm Alexander Behm added a comment -

        Michael, can you take a first look? Feel free to re-assign.

        Show
        alex.behm Alexander Behm added a comment - Michael, can you take a first look? Feel free to re-assign.
        Hide
        kwho Michael Ho added a comment -

        FWIW, the underlying problem of this bug should have been fixed on Impala 2.6.0 and beyond.

        The problem is not with the logic in the scanner but more related to TextConverter and MemPool. Specifically, a zero-length allocation from MemPool() always returns NULL. In this case, since all rows of col2 contains an empty string (i.e. not NULL as the null indicator is "\N"), the overall 'row_group_length_' sums up to zero. Therefore, 'row_group_buffer_' will be NULL as MemPool::Allocate() will return NULL for zero-length allocation in the old code and that tricks TextConverter() into believing that the slot is NULL.

        As part of commit 9ed3b685a109559fad8dd94f6c03af93e164305a, Allocate() will return a non-NULL dummy pointer for zero-length allocations. It was done to differentiate between allocation failure and zero-length allocations. It may be arguable whether TextConvertert::WriterSlot() is handling the case in which the parameter data being NULL well.

        Show
        kwho Michael Ho added a comment - FWIW, the underlying problem of this bug should have been fixed on Impala 2.6.0 and beyond. The problem is not with the logic in the scanner but more related to TextConverter and MemPool. Specifically, a zero-length allocation from MemPool() always returns NULL. In this case, since all rows of col2 contains an empty string (i.e. not NULL as the null indicator is "\N"), the overall 'row_group_length_' sums up to zero. Therefore, 'row_group_buffer_' will be NULL as MemPool::Allocate() will return NULL for zero-length allocation in the old code and that tricks TextConverter() into believing that the slot is NULL. As part of commit 9ed3b685a109559fad8dd94f6c03af93e164305a, Allocate() will return a non-NULL dummy pointer for zero-length allocations. It was done to differentiate between allocation failure and zero-length allocations. It may be arguable whether TextConvertert::WriterSlot() is handling the case in which the parameter data being NULL well.
        Hide
        kwho Michael Ho added a comment -

        Han Xu, please backport commit 9ed3b685a109559fad8dd94f6c03af93e164305a to 2.5.x. Thanks.

        Show
        kwho Michael Ho added a comment - Han Xu , please backport commit 9ed3b685a109559fad8dd94f6c03af93e164305a to 2.5.x. Thanks.
        Hide
        kwho Michael Ho added a comment -

        FWIW, the problem will also occur with query: select col2 from impala_rcfile_count_issue where col2 is NULL; The reason why the query select * from impala_rcfile_count_issue where col2 is NULL; shows the right answer is due to the fact that other columns have non-zero lengths so 'row_group_length_' > 0.

        Show
        kwho Michael Ho added a comment - FWIW, the problem will also occur with query: select col2 from impala_rcfile_count_issue where col2 is NULL; The reason why the query select * from impala_rcfile_count_issue where col2 is NULL; shows the right answer is due to the fact that other columns have non-zero lengths so 'row_group_length_' > 0.
        Hide
        HuaisiXu Huaisi Xu added a comment -

        resolve this since it is fixed by commit 9ed3b685a109559fad8dd94f6c03af93e164305a. adding target version to track releases.

        Show
        HuaisiXu Huaisi Xu added a comment - resolve this since it is fixed by commit 9ed3b685a109559fad8dd94f6c03af93e164305a. adding target version to track releases.
        Hide
        dhecht Dan Hecht added a comment -

        Since this was fixed "accidentally" we don't have a good regression test for it, so reopening to remember to add the test case.

        Show
        dhecht Dan Hecht added a comment - Since this was fixed "accidentally" we don't have a good regression test for it, so reopening to remember to add the test case.
        Hide
        kwho Michael Ho added a comment -

        Reassigning to Laszlo Gaal as he already wrote some tests for this JIRA when backporting the fix.

        Show
        kwho Michael Ho added a comment - Reassigning to Laszlo Gaal as he already wrote some tests for this JIRA when backporting the fix.
        Hide
        laszlog Laszlo Gaal added a comment -

        Review passed, GVO running.

        Show
        laszlog Laszlo Gaal added a comment - Review passed, GVO running.
        Hide
        laszlog Laszlo Gaal added a comment -

        First GVO pass failed due to a data-load problem caused by the Dyn DNS outage last week.
        The change was rebased to catch up with the current state of the source code and a new GVO run was kicked off, which passed.
        Currently waiting for the previous +2 to be carried forward.

        Show
        laszlog Laszlo Gaal added a comment - First GVO pass failed due to a data-load problem caused by the Dyn DNS outage last week. The change was rebased to catch up with the current state of the source code and a new GVO run was kicked off, which passed. Currently waiting for the previous +2 to be carried forward.
        Hide
        laszlog Laszlo Gaal added a comment -

        commit 9af59bfe2bb2747e71de45cb1fb96b7da3811ad1

        This change adds test coverage for the fixes committed for
        IMPALA-2399 in commit 9ed3b685a109559fad8dd94f6c03af93e164305a.
        It uses the table nulltable in the workload functional-query
        to verify the materialization and counting of NULL and empty-
        valued columns. The test can be run on any supported storage
        and compression combination.

        Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f
        Reviewed-on: http://gerrit.cloudera.org:8080/4755
        Tested-by: Internal Jenkins
        Reviewed-by: Jim Apple <jbapple@cloudera.com>

        Show
        laszlog Laszlo Gaal added a comment - commit 9af59bfe2bb2747e71de45cb1fb96b7da3811ad1 This change adds test coverage for the fixes committed for IMPALA-2399 in commit 9ed3b685a109559fad8dd94f6c03af93e164305a. It uses the table nulltable in the workload functional-query to verify the materialization and counting of NULL and empty- valued columns. The test can be run on any supported storage and compression combination. Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f Reviewed-on: http://gerrit.cloudera.org:8080/4755 Tested-by: Internal Jenkins Reviewed-by: Jim Apple <jbapple@cloudera.com>

          People

          • Assignee:
            laszlog Laszlo Gaal
            Reporter:
            morhidi Matyas Orhidi
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development