Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-12359

Hive ORC table reports different counts between select * and count(*)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.2.1
    • None
    • CBO, HiveServer2, ORC, Statistics
    • None
    • HDP 2.3 + Kerberos

    Description

      I have an ORC table which is giving different figures between select count( * ) and select *:

      > select count(*) from myTable;
      +--------+--+
      |  _c0   |
      +--------+--+
      | 56471  |
      +--------+--+
      
      > select * from myTable;
      ...
      109,295 rows selected (62.993 seconds)
      

      At first I thought this was obvious just "analyze table ... compute statistics" and it'll correct itself, however I've tried that as well as adding "for columns" but the results remain the same. The select count( * ) is very fast so it must be using the pre-computed stats.

      When I transform the table to text or to another orc table the count star on that new tables returns the correct number.

      I've even tried disabling stats, CBO, the works, restart, same result, with very fast return each time for select count( * ), indicating it's using either pre-computed stats stored in Metastore or ORC stats in file format, but I'm not sure how ORC could store the wrong count, especially as doing a CTAS to another ORC table returns the correct count when I select count( * ) that new ORC table.

      Attachments

        Activity

          People

            Unassigned Unassigned
            harisekhon Hari Sekhon
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: