Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-4160 Vectorized Query Execution in Hive
  3. HIVE-4676

Optimize COUNT(*) aggregate over vectorized ORC execution path

    XMLWordPrintableJSON

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: vectorization-branch
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      The COUNT(*) aggregate with the vectorized execution path over ORC should be optimized because it is a very common case. The same problem described below also happens with COUNT(1) and that should be optimized as well.

      Given a table factsqlengineam_vec_orc with about 25 columns and 218 million rows, this query

      select count(*) from factsqlengineam_vec_orc;

      runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555

      and this query

      select count(mrowflag) from factsqlengineam_vec_orc;

      runs in 42 seconds, with HDFS Read 1,207,855

      Because the column mrowflag is non-null, both queries return the same result.

      We should optimize count(*) so that it, say, chooses the most-compressed column from the ORC file (or even a single random column) and counts those values, but logically counts null values too so the meaning is the same as count(*). The vectorized iterator should not have to load all columns, just one column minimum, and any columns being filtered in the WHERE clause.

      For scalar count(*) aggregates (i.e. without group-by) we can simply tally up the total number of remaining rows in each batch, without even looking at the data. Maybe we're already doing that but we are reading more data than necessary now.

      The query

      select count(*) from factsqlengineam_vec_orc where mrowflag > 0;

      Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when there is no WHERE clause, more data is read.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                ehans Eric N. Hanson
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated: