Hive / HIVE-9983

Vectorizer doesn't vectorize (1) partitions with different schema anywhere (2) any MapWork with >1 table scans in MR


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Vectorization
    • Labels: None

    Description

      For one test, the tables are created as follows:

      CREATE TABLE orc_llap_part(
          csmallint SMALLINT,
          cint INT,
          cbigint BIGINT,
          cfloat FLOAT,
          cdouble DOUBLE,
          cstring1 STRING,
          cstring2 STRING,
          ctimestamp1 TIMESTAMP,
          ctimestamp2 TIMESTAMP,
          cboolean1 BOOLEAN,
          cboolean2 BOOLEAN
      ) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;
      
      CREATE TABLE orc_llap_dim_part(
          cbigint BIGINT
      ) PARTITIONED BY (ctinyint TINYINT) STORED AS ORC;
      
      
      INSERT OVERWRITE TABLE orc_llap_part PARTITION (ctinyint)
      SELECT csmallint, cint, cbigint, cfloat, cdouble, cstring1, cstring2, ctimestamp1, ctimestamp2, cboolean1, cboolean2, ctinyint FROM alltypesorc;
      
      INSERT OVERWRITE TABLE orc_llap_dim_part PARTITION (ctinyint)
      SELECT sum(cbigint) as cbigint, ctinyint FROM alltypesorc WHERE ctinyint > 10 AND ctinyint < 21 GROUP BY ctinyint;
      

      The query is:

      explain
        SELECT oft.ctinyint, oft.cint FROM orc_llap_part oft
        INNER JOIN orc_llap_dim_part od ON oft.ctinyint = od.ctinyint;
      

      This results in a failure to vectorize in MR:

      Could not vectorize partition pfile:/Users/sergey/git/hive3/itests/qtest/target/warehouse/orc_llap_dim_part/ctinyint=11.  Its column names cbigint do not match the other column names csmallint,cint,cbigint,cfloat,cdouble,cstring1,cstring2,ctimestamp1,ctimestamp2,cboolean1,cboolean2
      

      The validation here is comparing schemas from different tables, because the MapWork has two TableScans; in Tez this error will never happen, as a MapWork will not have two scans.
      The other case, namely partitions of the same table having different schemas, can happen in Tez (and in MR as well).
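
      To make the failure mode concrete, here is a minimal sketch of the kind of check described above. It is not the actual Vectorizer code: the function name first_schema_mismatch and the shortened partition paths are illustrative assumptions; only the column lists come from the tables in this report.

```python
# Hypothetical model of the validation: every partition feeding one MapWork
# is expected to have the same column names as the first partition seen.
FACT_COLUMNS = [
    "csmallint", "cint", "cbigint", "cfloat", "cdouble", "cstring1",
    "cstring2", "ctimestamp1", "ctimestamp2", "cboolean1", "cboolean2",
]
DIM_COLUMNS = ["cbigint"]


def first_schema_mismatch(partitions):
    """Return (path, columns, expected) for the first partition whose column
    names differ from those of the first partition, or None if all match."""
    expected = None
    for path, columns in partitions:
        if expected is None:
            expected = columns
        elif columns != expected:
            return path, columns, expected
    return None


# MR-style MapWork with two table scans: partitions of both tables are
# validated together, so the dimension table's partition is rejected.
mixed = [
    ("orc_llap_part/ctinyint=11", FACT_COLUMNS),      # illustrative path
    ("orc_llap_dim_part/ctinyint=11", DIM_COLUMNS),   # illustrative path
]
mixed_result = first_schema_mismatch(mixed)

# One table scan per MapWork (the Tez situation): each table is validated
# on its own, so no cross-table comparison takes place.
fact_result = first_schema_mismatch([mixed[0]])
dim_result = first_schema_mismatch([mixed[1]])
```

      In this model the mixed list is rejected purely because two unrelated tables share one check, which matches the error message above; checking each table separately finds no mismatch.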

      The Tez case can be solved by building a super-schema that includes all variations and handling missing columns where necessary.
      The MR case may be harder to solve.
      Of note: although the schemas are different (and one is not a prefix of the other by coincidence or some such), the query passes if the validation is commented out. Perhaps it can work in some cases?
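
      The super-schema idea for partitions of a single table could look roughly like the sketch below. This is an assumption about one possible approach, not an existing Hive API: build_super_schema, project_row, and the two example partition schemas are hypothetical, and missing columns are simply filled with None.

```python
def build_super_schema(partition_schemas):
    """Union the (name, type) columns of all partitions in first-seen order.

    Raises ValueError if the same column name appears with two different
    types, since this sketch does not attempt any type conversion.
    """
    super_schema = {}
    for schema in partition_schemas:
        for name, col_type in schema:
            known_type = super_schema.setdefault(name, col_type)
            if known_type != col_type:
                raise ValueError(
                    "column %s has conflicting types %s and %s"
                    % (name, known_type, col_type)
                )
    return list(super_schema.items())


def project_row(row, partition_schema, super_schema):
    """Lay a partition's row out in super-schema order, using None for
    columns that the partition does not have."""
    values = {name: value for (name, _), value in zip(partition_schema, row)}
    return [values.get(name) for name, _ in super_schema]


# Hypothetical example: an older partition written before a column was added.
old_partition = [("cint", "int"), ("cbigint", "bigint")]
new_partition = [("cint", "int"), ("cbigint", "bigint"), ("cstring1", "string")]

super_schema = build_super_schema([old_partition, new_partition])
old_row = project_row([1, 10], old_partition, super_schema)
new_row = project_row([2, 20, "x"], new_partition, super_schema)
```

      Rows from the older partition are padded with None for cstring1, so both partitions present the same column layout to downstream operators. Conflicting types for the same column name are rejected rather than converted in this sketch.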


          People

            mmccline Matt McCline
            sershe Sergey Shelukhin