Hive / HIVE-23763

Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator



      Description

      How to reproduce:

      • Create an unbucketed ACID table
      • Insert enough data into the table so that multiple bucket files are created.
        The files in the table should look like this:
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00000_0
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00001_0
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00002_0
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00003_0
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00004_0
        /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00005_0
      • Delete rows with different bucket Ids.
        The files in the delete deltas should look like this:
        /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000002_0000002_0000/bucket_00000
        /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00003
        /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00001
      • Run the query-based minor compaction.
      • After the compaction, the newly created delete delta contains only one bucket file. This file contains rows from all buckets, and the table becomes unusable:
        /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000001_0000007_v0000066/bucket_00000

      The issue happens only if rows with different bucket Ids are processed by the same FileSinkOperator.
      In the FileSinkOperator.process method, the files for the compaction table are created like this:

          if (!bDynParts && !filesCreated) {
            if (lbDirName != null) {
              if (valToPaths.get(lbDirName) == null) {
                createNewPaths(null, lbDirName);
              }
            } else {
              if (conf.isCompactionTable()) {
                int bucketProperty = getBucketProperty(row);
                bucketId = BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
              }
              createBucketFiles(fsp);
            }
          }
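In this excerpt, bucketProperty packs a codec version, the bucket (writer) id, and a statement id into a single int, and FileSinkOperator recovers the bucket id via BucketCodec.determineVersion(...).decodeWriterId(...). As a rough illustration, here is a simplified stand-in modeled on BucketCodec's V1 bit layout (an assumption for illustration, not the actual Hive class):

```java
// Simplified sketch of an ACID bucketProperty codec.
// Assumed bit layout (modeled on Hive's BucketCodec.V1, not the real class):
//   bits 29-31: codec version, bits 16-27: bucket (writer) id, low bits: statement id
public class BucketPropertySketch {
    static final int VERSION_V1 = 1;

    // Pack a bucket id and statement id into a single int property.
    static int encode(int bucketId, int statementId) {
        return (VERSION_V1 << 29) | (bucketId << 16) | statementId;
    }

    // Recover the bucket (writer) id, analogous to
    // BucketCodec.determineVersion(prop).decodeWriterId(prop).
    static int decodeWriterId(int property) {
        return (property & 0x0FFF0000) >>> 16;
    }

    public static void main(String[] args) {
        // Rows from buckets 0, 1 and 3 (as in the delete deltas above)
        // decode to distinct writer ids, so each should get its own bucket file.
        for (int bucket : new int[] {0, 1, 3}) {
            int prop = encode(bucket, 0);
            System.out.println("bucket " + bucket + " -> writerId " + decodeWriterId(prop));
        }
    }
}
```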
      

      When the first row is processed, the file is created and the filesCreated variable is set to true. When the subsequent rows are processed, the first if statement is false, so no new file is created; instead, every row is written into the file that was created for the first row, regardless of its bucket Id.
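The effect can be sketched with a toy writer model (hypothetical names, not Hive code or the actual fix): a single filesCreated flag routes every row into the first row's file, whereas keying the open files by bucket id keeps the buckets separate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the FileSinkOperator behavior described above.
// (Hypothetical sketch for illustration, not the actual Hive patch.)
public class CompactionSinkSketch {
    // Buggy variant: one flag, so only one file is ever created.
    static Map<Integer, List<Integer>> writeWithSingleFlag(int[] bucketIds) {
        Map<Integer, List<Integer>> files = new LinkedHashMap<>();
        boolean filesCreated = false;
        int currentFile = -1;
        for (int bucketId : bucketIds) {
            if (!filesCreated) {
                currentFile = bucketId;           // file created for the first row's bucket
                files.put(currentFile, new ArrayList<>());
                filesCreated = true;              // later rows skip file creation
            }
            files.get(currentFile).add(bucketId); // every row lands in the same file
        }
        return files;
    }

    // Fixed variant: one file per distinct bucket id.
    static Map<Integer, List<Integer>> writePerBucket(int[] bucketIds) {
        Map<Integer, List<Integer>> files = new LinkedHashMap<>();
        for (int bucketId : bucketIds) {
            files.computeIfAbsent(bucketId, b -> new ArrayList<>()).add(bucketId);
        }
        return files;
    }

    public static void main(String[] args) {
        int[] rows = {0, 3, 1, 3, 0};
        System.out.println("single flag: " + writeWithSingleFlag(rows)); // one file, all rows
        System.out.println("per bucket:  " + writePerBucket(rows));      // three files
    }
}
```

With the single flag, rows from buckets 0, 1 and 3 all end up in one bucket file, matching the single delete_delta bucket file seen after compaction.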


              People

              • Assignee: kuczoram Marta Kuczora
              • Reporter: kuczoram Marta Kuczora
              • Votes: 0
              • Watchers: 3

                Dates

                • Created:
                • Updated:
                • Resolved:

                Time Tracking

                • Original Estimate: Not Specified
                • Remaining Estimate: 0h
                • Time Spent: 2h 10m