  1. Hive
  2. HIVE-4160 Vectorized Query Execution in Hive
  3. HIVE-4701

Optimize filter Column IN ( list-of-constants ) for vectorized execution

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Release Note: Not for the current sprint. Schedule for after June 2013.

      Description

      OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN is not currently vectorized, so queries that use it fall back to the slower row-at-a-time path.

      E.g.

      select ddate, count(*) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate;

      takes about 23 seconds of CPU and

      select ddate, count(*) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate;

      takes about 153 seconds of CPU.

      A simple fix would be to rewrite short IN lists (say, <= 64 elements) as OR expressions by manipulating the query tree before the planner decides whether vectorization can be used.
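      The IN-to-OR rewrite could look roughly like the following sketch. It uses a hypothetical mini expression tree (`Expr`, `Equals`, `Or`, `In`) invented for illustration; these are not Hive's planner classes, and the 64-element threshold is simply the value suggested above.

```java
import java.util.List;

/** Illustrative only: a hypothetical mini expression tree, not Hive's planner API. */
final class InToOrRewrite {
    // Threshold suggested in the description; longer lists are left untouched.
    static final int MAX_IN_LIST_SIZE = 64;

    interface Expr { String toSql(); }

    static final class Equals implements Expr {
        final String column;
        final String constant;
        Equals(String column, String constant) { this.column = column; this.constant = constant; }
        public String toSql() { return column + " = " + constant; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
        public String toSql() { return "(" + left.toSql() + " OR " + right.toSql() + ")"; }
    }

    static final class In implements Expr {
        final String column;
        final List<String> constants;
        In(String column, List<String> constants) { this.column = column; this.constants = constants; }
        public String toSql() { return column + " IN (" + String.join(", ", constants) + ")"; }
    }

    /** Rewrites a short IN node into a left-deep chain of ORed equalities. */
    static Expr rewrite(Expr expr) {
        if (!(expr instanceof In)) {
            return expr;
        }
        In in = (In) expr;
        if (in.constants.isEmpty() || in.constants.size() > MAX_IN_LIST_SIZE) {
            return expr; // empty or too long: keep the original IN node
        }
        Expr result = new Equals(in.column, in.constants.get(0));
        for (int i = 1; i < in.constants.size(); i++) {
            result = new Or(result, new Equals(in.column, in.constants.get(i)));
        }
        return result;
    }
}
```

      Applied to the example query, the three-element IN list becomes the same predicate as the hand-written OR form, so the existing vectorized OR filters could handle it unchanged.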

      A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast.

      An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or a high-performance hash table (for example, one based on cuckoo hashing), so each row costs one lookup rather than one comparison per list element.
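      A minimal sketch of the sorted-array variant is shown below. It assumes the column values are already available as a `long[]` (for instance, timestamps encoded as longs) and that matching rows are reported through a selection array of row indices. The class name `LongInFilter` and its method signature are hypothetical, not Hive's vectorization API.

```java
import java.util.Arrays;

/** Illustrative only: a hypothetical IN filter over a batch of long values. */
final class LongInFilter {
    private final long[] sortedConstants;

    LongInFilter(long[] constants) {
        // Sort a private copy once, so every later probe is a binary search.
        this.sortedConstants = constants.clone();
        Arrays.sort(this.sortedConstants);
    }

    /**
     * Scans the first {@code size} entries of {@code column}, writes the
     * indices of rows whose value is in the constant list into
     * {@code selected}, and returns how many rows matched.
     */
    int filter(long[] column, int size, int[] selected) {
        int matched = 0;
        for (int i = 0; i < size; i++) {
            if (Arrays.binarySearch(sortedConstants, column[i]) >= 0) {
                selected[matched++] = i;
            }
        }
        return matched;
    }
}
```

      Each probe costs O(log n) in the list length instead of the O(n) comparisons of a chained OR. A hash-based variant would trade the binary search for an expected constant-time lookup at the cost of more memory.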

            People

            • Assignee: Unassigned
            • Reporter: Eric N. Hanson (ehans)
            • Votes: 0
            • Watchers: 1
