Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23597

VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories multiple times

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None

    Description

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562

      try {
              final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
              if (deleteDeltaDirs.length > 0) {
                int totalDeleteEventCount = 0;
                for (Path deleteDeltaDir : deleteDeltaDirs) {
      

       

      Consider a directory layout like the following. This was created by having simple set of "insert --> update --> select" queries.

       

      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
      /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000 

       

      Orcsplit contains all the delete delta folder information. For the directory layout like this, it would create ~12 splits. For every split, it constructs "ColumnizedDeleteEventRegistry" in VRBAcidReader and ends up reading all these delete delta folders multiple times.
      In this case, it would read it approximately 121 times!.

      This causes huge delay in running simple queries like "select * from tab_x" in cloud storage. 

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rajesh.balamohan Rajesh Balamohan
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 40m
                  1h 40m