HBASE-3481

max seq id in flushed file can be larger than its correct value causing data loss during recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.1, 0.92.0
    • Component/s: None
    • Labels: None

      Description

      [While doing some cluster kill tests, I noticed some missing data after log recovery. Upon investigating further, and pretty printing contents of HFiles and recovered logs, this is my analysis of the situation/bug. Please confirm the theory and pitch in with suggestions.]

      When memstores are flushed, the max sequence id recorded in the HFile should be the max sequence id of all KVs in the memstore. However, we seem to simply obtain the current sequence id from the HRegion, and stamp the HFile's MAX_SEQ_ID with it.

      From HRegion.java:

          sequenceId = (wal == null)? myseqid: wal.startCacheFlush();
      

      where, startCacheFlush() is:

      public long startCacheFlush() {
        this.cacheFlushLock.lock();
        return obtainSeqNum();
      }
      

      where, obtainSeqNum() is simply:

      private long obtainSeqNum() {
        return this.logSeqNum.incrementAndGet();
      }
      

      So let's say a memstore contains edits with sequence numbers 1..10.

      Meanwhile, say more Puts come along, and are going through this flow (in pseudo-code)

      1. HLog.append()
         1.1 obtainSeqNum()
         1.2 writeToWAL()
      2. updateMemStore()
      

      So it is possible that the sequence number has already been incremented to, say, 15 if there are 5 more outstanding puts. Say the writeToWAL() is still in progress for these puts. In this case, none of these edits (11..15) would have been written to the memstore yet.

      At this point if a cache flush of the memstore happens, then we'll record its MAX_SEQ_ID as 16 in the store file instead of 10 (because that's what obtainSeqNum() would return as the next sequence number to use, right?).

      Assume that the edits 11..15 eventually complete. And so HLogs do contain the data for edits 11..15.

      Now, at this point if the region server were to crash, and we run log recovery, the splits all go through correctly, and a correct recovered.edits file is generated with the edits 11..15.

      Next, when the region is opened, the HRegion notes that one of the store file says MAX_SEQ_ID is 16. So, when it replays the recovered.edits file, it skips replaying edits 11..15. Or in other words, data loss.
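      To make the failure mode concrete, here is a hypothetical sketch of the replay-skip decision described above (illustrative names, not the actual HRegion API): edits whose sequence id is at or below the store file's MAX_SEQ_ID are assumed to already be in the store file and are dropped, so an inflated MAX_SEQ_ID silently discards recovered edits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the recovered.edits replay decision; the method and
// class names are illustrative, not HBase code.
public class ReplaySkipDemo {
  // An edit is replayed only if its seq id is newer than the store's MAX_SEQ_ID.
  static List<Long> replay(long maxSeqIdInStore, List<Long> recoveredEditSeqIds) {
    List<Long> applied = new ArrayList<Long>();
    for (long seqId : recoveredEditSeqIds) {
      if (seqId > maxSeqIdInStore) {
        applied.add(seqId);  // edit gets written back into the memstore
      }                      // otherwise assumed already present in the store file
    }
    return applied;
  }

  public static void main(String[] args) {
    List<Long> recovered = Arrays.asList(11L, 12L, 13L, 14L, 15L);
    // Correct MAX_SEQ_ID (10): edits 11..15 are replayed.
    System.out.println(replay(10L, recovered));  // [11, 12, 13, 14, 15]
    // Inflated MAX_SEQ_ID (16): every recovered edit is skipped -> data loss.
    System.out.println(replay(16L, recovered));  // []
  }
}
```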


      Attachments

      1. HBASE-3481.txt
         3 kB
         ryan rawson

        Activity

        ryan rawson added a comment -

        Your analysis sounds correct. The correct thing to do here is to provide the MAX_SEQ_ID from the memstore KVs, not from the "current" HLog seqid.

        good find!

        Kannan Muthukkaruppan added a comment -

        Maybe the quickest fix is to avoid the "skip" optimization during replaying of recovered.edits.

        I think this should restore correctness.

        And with regards to HLog reclamation (i.e., an HLog should only be reclaimed if it contains no data for an active memstore), I don't think it relies on this MAX_SEQ_ID inside store files, but rather on a separate mechanism tracking what the min edit contained in each memstore is. So that case is probably ok.

        ryan rawson added a comment -

        If we avoid the skip behaviour, which isn't just an optimization, we will
        introduce duplicate KVs into the HFiles, which will be trouble for people
        who depend on the version count to keep a reasonable, important, number
        of versions.

        Why not just nab the largest seqid during flush and provide it at the
        end? The footer is written out at close() time, and we have time
        between the last KV being appended and close() being called to add a
        correct SEQ_ID.
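        Ryan's suggestion could look roughly like the following sketch (a hypothetical flush writer, not the actual StoreFile writer API): track the largest seq id of the KVs actually appended, and stamp that value into the footer at close() instead of asking the HLog for its current counter.

```java
// Hypothetical flush writer illustrating the proposed fix: MAX_SEQ_ID is
// derived from the KVs actually written, not from the HLog's current counter.
public class TrackingFlushWriter {
  private long maxSeqIdSeen = -1L;

  // Called once per KV during the flush; kvSeqId is the KV's memstore seq id.
  public void append(long kvSeqId) {
    if (kvSeqId > maxSeqIdSeen) {
      maxSeqIdSeen = kvSeqId;
    }
  }

  // Read at close() time, when the footer is written; there is time between
  // the last append and close() to record the correct value.
  public long footerMaxSeqId() {
    return maxSeqIdSeen;
  }

  public static void main(String[] args) {
    TrackingFlushWriter w = new TrackingFlushWriter();
    for (long seq : new long[] {3, 10, 7}) {  // KVs from the memstore snapshot
      w.append(seq);
    }
    System.out.println(w.footerMaxSeqId());   // 10, even if the HLog is already at 16
  }
}
```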

        Kannan Muthukkaruppan added a comment -

        Is the last seq id readily available from the memstore KVs, or is it already stashed away somewhere? I agree that that would be the cleanest/best fix.

        (Happy to accept a patch if you want to post one up. Else, I'll look further on this tomorrow morning).

        ryan rawson added a comment -

        In HRegion.internalFlushCache we have this logic:

        this.updatesLock.writeLock().lock();
        final long currentMemStoreSize = this.memstoreSize.get();
        List<StoreFlusher> storeFlushers = new ArrayList<StoreFlusher>(stores.size());
        try {
          sequenceId = (wal == null) ? myseqid : wal.startCacheFlush();
          completeSequenceId = this.getCompleteCacheFlushSequenceId(sequenceId);

          for (Store s : stores.values()) {
            storeFlushers.add(s.getStoreFlusher(completeSequenceId));
          }

          // prepare flush (take a snapshot)
          for (StoreFlusher flusher : storeFlushers) {
            flusher.prepare();
          }
        } finally {
          this.updatesLock.writeLock().unlock();
        }

        We take the write lock; no more puts/deletes/whatever can be done to this HRegion.

        We then grab a seqid (wal.startCacheFlush). We now snapshot everything.

        We then release the updates lock, and mutations can happen to the region again.

        The flush sequence id should lie exactly between the snapshot and the memstore.

        Given this code, I'm not sure how to explain what you are seeing...
        but this logic seems spot on and correct.


        Kannan Muthukkaruppan added a comment -

        Ryan: Your observation seems quite correct! So the outstanding puts case I was worried about shouldn't happen because of the updatesLock.

        So the mystery thickens; I'll need to dig further as to why these edits got skipped during replay. The edits in question were definitely present in the recovered.edits file and had older sequence numbers than the HFile's MAX_SEQ_ID.

        Kannan Muthukkaruppan added a comment -

        Dug some more... the bad MAX_SEQ_ID that the HFile got stamped with corresponds to an edit that looks special.

        <METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >
        

        Another interesting thing is that in the .logs, this seq id appears out of order...

        ...
        #1081, pos=6403894 kannan1/kannan1,,1296023211936.dfe45b39b1d8c360b8e7f554b576ab17./71170147=[#edits: 18 = <0992bdb67af8d2b2172618496243433c:287686/actions:0\/1296023308555/Put/vlen=597; 0992bdb67af8d2b2172618496243433c:287686/actions:1/1296023308555/Put/vlen=416; 0992bdb67af8d2b2172618496243433c:287686/actions:2\/1296023308555/Put/vlen=436; 0992bdb67af8d2b2172618496243433c:287686/actions:3/1296023308555/Put/vlen=329; 0992bdb67af8d2b2172618496243433c:287686/actions:4\/1296023308555/Put/vlen=256; 0992bdb67af8d2b2172618496243433c:287686/actions:5/1296023308555/Put/vlen=792; 0992bdb67af8d2b2172618496243433c:287686/actions:6\/1296023308555/Put/vlen=231; 0992bdb67af8d2b2172618496243433c:287686/actions:7/1296023308555/Put/vlen=790; 0992bdb67af8d2b2172618496243433c:287686/actions:8\/1296023308555/Put/vlen=800; 0992bdb67af8d2b2172618496243433c:287686/actions:9/1296023308555/Put/vlen=664; 0992bdb67af8d2b2172618496243433c:287686/actions:10\/1296023308555/Put/vlen=618; 0992bdb67af8d2b2172618496243433c:287686/actions:11/1296023308555/Put/vlen=533; 0992bdb67af8d2b2172618496243433c:287686/actions:12\/1296023308555/Put/vlen=636; 0992bdb67af8d2b2172618496243433c:287686/actions:13/1296023308555/Put/vlen=554; 0992bdb67af8d2b2172618496243433c:287686/actions:14\/1296023308555/Put/vlen=353; 0992bdb67af8d2b2172618496243433c:287686/actions:15/1296023308555/Put/vlen=307; 0992bdb67af8d2b2172618496243433c:287686/actions:16\/1296023308555/Put/vlen=410; 0992bdb67af8d2b2172618496243433c:287686/actions:17/1296023308555/Put/vlen=579; >]
        
        #1082, pos=6404082 kannan1/kannan1,bbbbbbbb,1296023211937.0a3f2362b0264b0192c311f10c006d84./71169066{=[#edits: 1 = <METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >]
        
        #1083, pos=6406845 kannan1/kannan1,99999999,1296023211936.8ff83822490b1f8175dc16aaf4c170a6./71170148=[#edits: 4 = <a2adc743be4a350b7009af5143acfbf3:287685/actions:0/1296023308555/Put/vlen=552; a2adc743be4a35\
        0b7009af5143acfbf3:287685/actions:1/1296023308555/Put/vlen=715; a2adc743be4a350b7009af5143acfbf3:287685/actions:2/1296023308555/Put/vlen=806; a2adc743be4a350b7009af5143acfbf3:287685/actions:3/1296023308555/P\
        ut/vlen=296; >]
        

        The flushed file's MAX_SEQ_ID is recorded as 71169066 (corresponding to entry #1082 above). However, that edit is not a regular put key, but this special row: <METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >

        Also, notice that the sequence ids for the edits surrounding it (entries #1081 & #1083) are both higher (namely, 71170147 & 71170148).

        Looks like this sequence id is being obtained outside of the updatesLock, but I haven't confirmed that in the code.

        Kannan Muthukkaruppan added a comment -

        Ryan wrote: <<< We take a write lock, no more puts/deletes/whatever can be done to this hregion>>>.

        Interestingly, I don't see incrementColumnValue() taking the region's updatesLock, which seems like a problem too (but not the one I am running into, since in my test I didn't use incrementColumnValue()).

        We can file the incrementColumnValue() issue as a separate JIRA (if I am not missing something obvious there).

        regards,
        Kannan

        Kannan Muthukkaruppan added a comment -

        The multiPut() code path also doesn't seem to take updatesLock. I believe this is the bug I am hitting...

        Kannan Muthukkaruppan added a comment -

        Since a single Put on the client gets promoted to multiPut(), this bug affects both regular puts and increment column value.

        ryan rawson added a comment -

        Here is a patch adding in those locks.
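        The patch itself isn't inlined in this thread, but the fix pattern is the standard read/write-lock discipline: each mutation path (put, multiPut, incrementColumnValue) holds updatesLock.readLock() while it obtains a seq id and applies the edit, so mutations can proceed concurrently, while the flush takes writeLock() so the seq id it grabs really does sit between the snapshot and the live memstore. A minimal sketch with hypothetical method names:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the locking discipline the fix enforces; the method names are
// illustrative, not the real HRegion signatures.
public class UpdatesLockDemo {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
  private long logSeqNum = 0;
  private long memstoreMaxSeqId = 0;

  // Every mutation path (put, multiPut, ICV) must hold the read lock while it
  // obtains its seq id and applies the edit to the memstore.
  public void applyEdit() {
    updatesLock.readLock().lock();
    try {
      long seqId = ++logSeqNum;   // obtainSeqNum()
      memstoreMaxSeqId = seqId;   // updateMemStore()
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  // The flush takes the write lock, so no edit can be "in flight" between
  // getting a seq id and reaching the memstore when the flush seq id is chosen.
  public long startFlush() {
    updatesLock.writeLock().lock();
    try {
      return ++logSeqNum;         // wal.startCacheFlush()
    } finally {
      updatesLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    UpdatesLockDemo region = new UpdatesLockDemo();
    region.applyEdit();
    region.applyEdit();
    long flushSeqId = region.startFlush();
    // The flush seq id is strictly greater than everything in the memstore.
    System.out.println(flushSeqId > region.memstoreMaxSeqId);  // true
  }
}
```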

        Kannan Muthukkaruppan added a comment -

        You beat me to the patch, Ryan. Thanks. Patch looks good. I imported the patch and am running unit tests, and will also run some kill tests shortly.

        stack added a comment -

        Marking as a blocker because of data loss (and since it seems like Kannan and Ryan figured it out; smile). Bringing into 0.90.1.

        Kannan Muthukkaruppan added a comment -

        Unit tests ran well (I ran against 0.89), and so did the kill testing. I think we are good with respect to this patch.

        stack added a comment -

        Marking patch available and assigning to Ryan. Want to commit RR?

        ryan rawson added a comment -

        I'll commit in the AM.

        Todd Lipcon added a comment -

        Any way we can build a test for this? (Even if it doesn't fail reliably every time, it would be good to catch bugs like this in the future.)

        ryan rawson added a comment -

        That is a good point; I opened HBASE-3486 to do that. We already have
        durability unit tests; they should be fixed to fail without this patch.

        Todd Lipcon added a comment -

        BTW, I think I ran into another ramification of this bug. I had a region split,
        but even after the parent was closed, it remained in the lastSeqWritten
        map in HLog. So, my log shows:

        2011-01-27 11:34:21,784 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=46, maxlogs=32; forcing flush of 5 regions(s): 1064fef659ccf634bfb6dbd24a3d3e32, 2e84570f830560a49276e236ebfb53ed, 61b89db95b24fbc6cabf6661cfbc9534, 6f219526af6c8a9e180324bd5d03a08b, d7e912ab428
        2011-01-27 11:34:21,784 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of d7e912ab428b25dbb388edda934591a0r=null, requester=null
        2011-01-27 11:34:45,904 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=47, maxlogs=32; forcing flush of 1 regions(s): d7e912ab428b25dbb388edda934591a0
        2011-01-27 11:34:45,934 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of d7e912ab428b25dbb388edda934591a0r=null, requester=null
        ...
        and I ended up with hundreds of HLogs. This was with a mixed put/ICV workload.


          People

          • Assignee: ryan rawson
          • Reporter: Kannan Muthukkaruppan
          • Votes: 0
          • Watchers: 4
