  1. Hive
  2. HIVE-20901

running compactor when there is nothing to do produces duplicate data


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0-alpha-2
    • Component/s: Transactions
    • Labels: None

    Description

      Suppose we run minor compaction twice via ALTER TABLE.

      The second compaction request should have nothing to do, but I don't think there is a check for that. This is visible in the context of HIVE-20823, where each compactor run produces a delta with a new visibility suffix, so we end up with something like:

      target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
      
      ├── delete_delta_0000001_0000002_v0000019
      │   ├── _orc_acid_version
      │   └── bucket_00000
      ├── delete_delta_0000001_0000002_v0000021
      │   ├── _orc_acid_version
      │   └── bucket_00000
      ├── delta_0000001_0000001_0000
      │   ├── _orc_acid_version
      │   └── bucket_00000
      ├── delta_0000001_0000002_v0000019
      │   ├── _orc_acid_version
      │   └── bucket_00000
      ├── delta_0000001_0000002_v0000021
      │   ├── _orc_acid_version
      │   └── bucket_00000
      └── delta_0000002_0000002_0000
          ├── _orc_acid_version
          └── bucket_00000

      That is, there are 2 deltas (and 2 delete deltas) with the same write ID range.
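To make the symptom easy to spot in a directory listing, here is a minimal sketch that groups delta directory names by kind and write ID range, ignoring any statement ID or `_v` visibility suffix, and reports ranges that appear more than once. `DeltaRangeCheck` and `findDuplicateRanges` are hypothetical names for illustration and are not part of Hive; the name pattern is inferred only from the listing above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Hive: finds delta directories that share a write ID range. */
final class DeltaRangeCheck {
    // Matches names such as delta_0000001_0000002_v0000019 or delete_delta_0000001_0000001_0000.
    // Group 1: delta kind, group 2: min write ID, group 3: max write ID.
    // An optional statement ID (_NNNN) and an optional visibility suffix (_vNNNN) are ignored.
    // Assumption: this naming scheme is inferred from the listing above.
    private static final Pattern DELTA_DIR =
        Pattern.compile("^(delete_delta|delta)_(\\d+)_(\\d+)(?:_\\d+)?(?:_v\\d+)?$");

    /** Returns only the (kind, range) keys that map to more than one directory. */
    static Map<String, List<String>> findDuplicateRanges(List<String> dirNames) {
        Map<String, List<String>> byRange = new LinkedHashMap<>();
        for (String name : dirNames) {
            Matcher m = DELTA_DIR.matcher(name);
            if (!m.matches()) {
                continue; // not a delta directory; skip it
            }
            String key = m.group(1) + "[" + Long.parseLong(m.group(2))
                + "," + Long.parseLong(m.group(3)) + "]";
            byRange.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Keep only ranges that occur in two or more directories.
        byRange.values().removeIf(dirs -> dirs.size() < 2);
        return byRange;
    }

    public static void main(String[] args) {
        List<String> listing = List.of(
            "delete_delta_0000001_0000002_v0000019",
            "delete_delta_0000001_0000002_v0000021",
            "delta_0000001_0000001_0000",
            "delta_0000001_0000002_v0000019",
            "delta_0000001_0000002_v0000021",
            "delta_0000002_0000002_0000");
        findDuplicateRanges(listing)
            .forEach((range, dirs) -> System.out.println(range + " -> " + dirs));
    }
}
```

For the listing above, this reports the range 1..2 twice: once for `delta` and once for `delete_delta`.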

      This is bad. It probably happens today as well, except that the new run produces a delta with the same name and clobbers the previous one, which may interfere with writers.

       

      Need to investigate.
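As a starting point for that investigation, here is a rough sketch of what a "nothing to do" pre-check for minor compaction might look like. It is not Hive's compactor logic. It rests on two assumptions: that a delta with a `_v` visibility suffix was written by a compactor, and that looking only at insert deltas (not delete deltas) is enough for illustration. `MinorCompactionPrecheck` and `nothingToCompact` are hypothetical names.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-check, not Hive's actual compactor logic. */
final class MinorCompactionPrecheck {
    // Group 1: min write ID, group 2: max write ID, group 3: visibility ID if present.
    // Assumption: naming scheme inferred from the listing in this issue.
    private static final Pattern INSERT_DELTA =
        Pattern.compile("^delta_(\\d+)_(\\d+)(?:_\\d+)?(?:_v(\\d+))?$");

    /**
     * Returns true when a delta with a _v suffix already spans the full
     * write ID range covered by all insert deltas, so another minor
     * compaction would only rewrite the same range.
     */
    static boolean nothingToCompact(List<String> dirNames) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        // First pass: overall write ID range across all insert deltas.
        for (String name : dirNames) {
            Matcher m = INSERT_DELTA.matcher(name);
            if (m.matches()) {
                min = Math.min(min, Long.parseLong(m.group(1)));
                max = Math.max(max, Long.parseLong(m.group(2)));
            }
        }
        if (min > max) {
            return true; // no insert deltas at all
        }
        // Second pass: look for a compacted delta covering exactly [min, max].
        for (String name : dirNames) {
            Matcher m = INSERT_DELTA.matcher(name);
            if (m.matches() && m.group(3) != null
                    && Long.parseLong(m.group(1)) == min
                    && Long.parseLong(m.group(2)) == max) {
                return true;
            }
        }
        return false;
    }
}
```

With the directory state after the first compaction in this issue, `nothingToCompact` returns true. Once a later write adds a delta outside the compacted range, it returns false again. A real check would also have to account for delete deltas, base directories, and aborted transactions.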

       

      My initial guess was that AcidUtils.getAcidState() then returns both deltas as if they were distinct, effectively duplicating data. That turns out not to be the case: there is no data duplication, because getAcidState() will not use 2 deltas with the same write ID range.

       

       

      Attachments

        1. HIVE-20901.1.patch
          2 kB
          Abhishek Somani
        2. HIVE-20901.2.patch
          4 kB
          Abhishek Somani


          People

            Assignee: Abhishek Somani (asomani)
            Reporter: Eugene Koifman (ekoifman)
            Votes: 0
            Watchers: 4
