Bookkeeper / BOOKKEEPER-93

bookkeeper doesn't work correctly on OpenLedgerNoRecovery

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: None
    • Labels:
      None

      Description

      1) BookKeeper hangs on openLedgerNoRecovery, because LedgerOpenOp never triggers the callback when a ledger is opened without recovery.

      2) Race condition in ReadLastConfirmedOp

      ReadLastConfirmedOp handles each response in readEntryComplete, which:
      a) first decrements numResponsesPending
      b) then increments validResponses
      c) checks validResponses to call back with OK
      d) checks numResponsesPending to call back with LedgerRecoveryException

      Suppose two callbacks, A and B, arrive in readEntryComplete (quorum/ensemble size: 2):

      a) A first decrements numResponsesPending from 2 to 1.
      b) A increments validResponses from 0 to 1.
      c) B then decrements numResponsesPending from 1 to 0.
      d) A checks numResponsesPending before B checks validResponses, and A finds that numResponsesPending is now 0, so A calls back with an exception. The right outcome is for B to check validResponses and call back with OK.

      3) If a LedgerHandle is opened by openLedgerNoRecovery, lastAddConfirmed is set to -1, so every read request fails because the requested entry id > lastAddConfirmed.

      So I suggest that a LedgerHandle opened by openLedgerNoRecovery be put in unsafeRead mode: close/write operations fail, and read operations do not check the condition entry_id > lastAddConfirmed.

      1. bookkeeper-93_v2.patch
        8 kB
        Sijie Guo
      2. bookkeeper-93_v3.patch
        8 kB
        Sijie Guo
      3. bookkeeper-93.patch
        12 kB
        Sijie Guo

          Activity

          Ivan Kelly added a comment -

          1) Yikes, that's a big oversight. There is actually a test for it, BookieReadWriteTest#testReadFromOpenLedger, but the @Test annotation is missing from it, so it never gets run. Also, the actual checking code seems to be wrong, as it tries to read from lh, not lhOpen (line 861). Could you break the fix for this problem into a single patch along with the fix for the test, and I'll commit that as BOOKKEEPER-91.

          2) This is unrelated to 1), so it should be in a separate JIRA. Also, I'm unsure the race you describe can occur. ReadLastConfirmedOp#readEntryComplete is already synchronized.

          3) Actually, this could go into BOOKKEEPER-91. However, I think a better solution may be to do a ReadLastConfirmedOp in the else part of LedgerOpenOp#processResult.

                  if (!unsafe) {
                      // Normal open: recover the ledger before handing it to the caller.
                      lh.recover(new GenericCallback<Void>() {
                          @Override
                          public void operationComplete(int rc, Void result) {
                              if (rc != BKException.Code.OK) {
                                  cb.openComplete(BKException.Code.LedgerRecoveryException, null, LedgerOpenOp.this.ctx);
                              } else {
                                  cb.openComplete(BKException.Code.OK, lh, LedgerOpenOp.this.ctx);
                              }
                          }
                      });
                  } else {
                      // No-recovery open: only learn the last confirmed entry, then complete the open.
                      lh.asyncReadLastConfirmed(new ReadLastConfirmedCallback() {
                          @Override
                          public void readLastConfirmedComplete(int rc, long lastConfirmed, Object ctx) {
                              lh.lastAddConfirmed = lh.lastAddPushed = lastConfirmed;
                              cb.openComplete(rc, lh, LedgerOpenOp.this.ctx);
                          }
                      });
                  }
          

          This way, a ledger opened without recovery can read entries up to the point at which it was opened and no further. I think this is the correct behaviour, as otherwise the reader could read an entry which hasn't been confirmed to the writer. If the writer closed at that point, the reader would have read more than the writer, which I don't think affects correctness, but is a little ugly.
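
The bound described above can be sketched as follows. This is a minimal illustration, not the BookKeeper source: the class BoundedReadSketch and its methods canRead and setLastAddConfirmed are hypothetical names, and only the lastAddConfirmed field and its initial value of -1 come from the discussion.

```java
// Hypothetical sketch of a reader handle whose reads are bounded by lastAddConfirmed.
class BoundedReadSketch {
    // Value right after a non-recovery open, before the last confirmed entry is known.
    private long lastAddConfirmed = -1;

    // Would be called with the result of a read-last-confirmed operation at open time.
    void setLastAddConfirmed(long lastConfirmed) {
        this.lastAddConfirmed = lastConfirmed;
    }

    // A read is allowed only if the whole range is at or below lastAddConfirmed.
    boolean canRead(long firstEntry, long lastEntry) {
        return firstEntry >= 0 && firstEntry <= lastEntry && lastEntry <= lastAddConfirmed;
    }

    public static void main(String[] args) {
        BoundedReadSketch handle = new BoundedReadSketch();
        System.out.println("before LAC is known: " + handle.canRead(0, 0));
        handle.setLastAddConfirmed(9);
        System.out.println("entries 0..9: " + handle.canRead(0, 9));
        System.out.println("entries 0..10: " + handle.canRead(0, 10));
    }
}
```

With lastAddConfirmed left at -1 every range is rejected, which matches the failure reported in point 3 of the description. Once it is set at open time, entries up to that point become readable and later ones stay out of reach.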

          Sijie Guo added a comment -

          Ivan,

          > 2) This is unrelated to 1), so it should be in a separate JIRA. Also, I'm unsure the race you describe can occur. ReadLastConfirmedOp#readEntryComplete is already synchronized.

          You are right. readEntryComplete is synchronized, so there is no race condition on it.

          The issue is that readLastConfirmedComplete will be triggered twice.

          ReadLastConfirmedOp.java
                  // other return codes dont count as valid responses
                  if ((validResponses >= lh.metadata.quorumSize) &&
                          notComplete) {
                      notComplete = false;
                      if (LOG.isDebugEnabled()) {
                          LOG.debug("Read Complete with enough validResponses");
                      }
                      cb.readLastConfirmedComplete(BKException.Code.OK, maxAddConfirmed, this.ctx);
                      return;
                  }
          
                  if (numResponsesPending == 0) {
                      // Have got all responses back but was still not enough, just fail the operation
                      LOG.error("While readLastConfirmed ledger: " + ledgerId + " did not hear success responses from all quorums");
                      cb.readLastConfirmedComplete(BKException.Code.LedgerRecoveryException, maxAddConfirmed, this.ctx);
                  }
          

          The last check triggers readLastConfirmedComplete whether or not there were enough valid responses, so the callback can fire a second time after OK has already been reported.

          2011-10-26 09:34:48,874 - DEBUG - [pool-174-thread-1:ReadLastConfirmedOp@90] - Read Complete with enough validResponses
          2011-10-26 09:34:48,874 - ERROR - [pool-174-thread-1:ReadLastConfirmedOp@97] - While readLastConfirmed ledger: 1 did not hear success responses from
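
One way to avoid the second callback is to guard both exit paths with a single completion flag. The sketch below assumes that approach and is not the committed patch: ReadLastConfirmedSketch, its Callback interface, the completed flag and the numeric return codes are hypothetical stand-ins, while numResponsesPending, validResponses and maxAddConfirmed follow the names in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ReadLastConfirmedOp; not the BookKeeper source.
class ReadLastConfirmedSketch {
    static final int OK = 0;
    static final int RECOVERY_ERROR = -1; // placeholder for LedgerRecoveryException

    interface Callback {
        void readLastConfirmedComplete(int rc, long lastConfirmed);
    }

    private final int quorumSize;
    private final Callback cb;
    private int numResponsesPending;
    private int validResponses = 0;
    private long maxAddConfirmed = -1;
    private boolean completed = false;

    ReadLastConfirmedSketch(int ensembleSize, int quorumSize, Callback cb) {
        this.numResponsesPending = ensembleSize;
        this.quorumSize = quorumSize;
        this.cb = cb;
    }

    // Called once per bookie response; 'valid' marks a response that counts toward the quorum.
    synchronized void readEntryComplete(boolean valid, long lastConfirmed) {
        numResponsesPending--;
        if (valid) {
            validResponses++;
            maxAddConfirmed = Math.max(maxAddConfirmed, lastConfirmed);
        }
        if (completed) {
            return; // a result was already reported; late responses are ignored
        }
        if (validResponses >= quorumSize) {
            completed = true;
            cb.readLastConfirmedComplete(OK, maxAddConfirmed);
        } else if (numResponsesPending == 0) {
            // All responses are in and there were still not enough valid ones.
            completed = true;
            cb.readLastConfirmedComplete(RECOVERY_ERROR, maxAddConfirmed);
        }
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        ReadLastConfirmedSketch op = new ReadLastConfirmedSketch(2, 1, (rc, lac) -> results.add(rc));
        op.readEntryComplete(true, 5L);
        op.readEntryComplete(true, 7L);
        System.out.println("callbacks fired: " + results.size());
    }
}
```

In the quoted code the failure branch is not protected by notComplete, so a response that arrives after success and brings numResponsesPending to 0 reports an error as well. Checking the flag before either branch keeps the operation to exactly one callback.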

          Sijie Guo added a comment -

          Thanks for the suggestions, Ivan.

          Fixes:

          1) Avoid two callbacks in ReadLastConfirmedOp.

          2) Use ReadLastConfirmedOp to set lastAddConfirmed when opening a ledger without recovery, so every entry that can be read has been confirmed to the writer.

          3) Add unsafeRead to LedgerHandle to reject close/write on it.

          Ivan Kelly added a comment -

          I see you created BOOKKEEPER-94 for the test change. That change should actually be part of this JIRA. It's part 1) (the two-callback changes) which should be in the other JIRA, as it's unrelated, whereas 2) & 3) and the fix to testing are all the same thing.

          Regarding 2 & 3, these changes look good. However, I'd change the unsafeRead flag to be called readOnly. Also, add a logging line before the addComplete in asyncAddEntry saying that the client tried to write on a read only ledger handle.
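
A read-only handle along these lines might look like the sketch below. It is only an illustration under stated assumptions: ReadOnlyHandleSketch, its AddCallback interface and the ILLEGAL_OP code are hypothetical, and java.util.logging stands in for whatever logger the client actually uses. Only the readOnly flag, asyncAddEntry and addComplete are names taken from the comment.

```java
import java.util.logging.Logger;

// Hypothetical sketch of a ledger handle that rejects writes when opened read-only.
class ReadOnlyHandleSketch {
    static final int OK = 0;
    static final int ILLEGAL_OP = -100; // hypothetical error code for a rejected write

    interface AddCallback {
        void addComplete(int rc, long entryId);
    }

    private static final Logger LOG = Logger.getLogger(ReadOnlyHandleSketch.class.getName());

    private final boolean readOnly;
    private long lastAddPushed = -1;

    ReadOnlyHandleSketch(boolean readOnly) {
        this.readOnly = readOnly;
    }

    void asyncAddEntry(byte[] data, AddCallback cb) {
        if (readOnly) {
            // Log before completing the callback so the misuse is visible to the operator.
            LOG.warning("Tried to add an entry on a read-only ledger handle");
            cb.addComplete(ILLEGAL_OP, -1);
            return;
        }
        long entryId = ++lastAddPushed; // a real client would send 'data' to the bookies here
        cb.addComplete(OK, entryId);
    }

    public static void main(String[] args) {
        ReadOnlyHandleSketch handle = new ReadOnlyHandleSketch(true);
        handle.asyncAddEntry(new byte[] {1}, (rc, id) -> System.out.println("rc=" + rc));
    }
}
```

Failing the add through the callback, rather than throwing, keeps the asynchronous contract intact: the caller always hears back exactly once.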

          Ivan Kelly added a comment -

          My previous comment was incomplete. The changes should also be tested. The whole reason the bug exists is a lack of testing in the first place. The easiest thing is to simply extend the BookieReadWriteTest for this case to ensure that add fails on lhOpen, and that the ledger metadata isn't closed after lhOpen is called.

          I'm still confused by the callback issue in ReadLastConfirmedOp. The only scenario where the callback can be called twice is one where it receives more responses than it made requests. This discussion should continue on BOOKKEEPER-94.

          Sijie Guo added a comment -

          Attached a new patch.

          It adds tests for close/write on a read-only LedgerHandle to BookieReadWriteTest#testReadFromOpenLedger.

          Ivan Kelly added a comment -

          +1

          Committed as r1189867, Thanks Sijie.


            People

            • Assignee:
              Ivan Kelly
              Reporter:
              Sijie Guo