Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.0
    • Component/s: contrib-bookkeeper
    • Labels:
      None

      Description

      there is a race condition between the zookeeper completion thread and the bookeeper processing queue during create. if the zookeeper completion thread falls behind due to scheduling, the action counter of the create operation may go backwards.

      1. ZOOKEEPER-503.patch
        154 kB
        Flavio Junqueira
      2. ZOOKEEPER-503.patch
        151 kB
        Benjamin Reed

        Issue Links

          Activity

          Hide
          Flavio Junqueira added a comment -

          A patch for ZOOKEEPER-461 should fix this problem.

          Show
          Flavio Junqueira added a comment - A patch for ZOOKEEPER-461 should fix this problem.
          Hide
          Benjamin Reed added a comment -

          this patch fixes a range of projects. it is a big simplification. it has a net removal of 700 lines of code. the meta data for a ledger was collapsed into a single znode. here is a description of the changes:

          Index calculation in QuorumEngine must be synchronized on the LedgerHandle to avoid changes to the ensemble while trying to submit an operation. Such changes happen upon crashes of bookies.

          I initialized thought it was not necessary, but now I think this synchronization block is necessary.

          If a writer adds just a few entries to a ledger, it may end up with hints that say "empty ledger" when trying to recover a ledger. In this case, if we receive an empty ledger flag as a hint, we have to switch the hint to zero, which means that the client will start recovery from entry zero. If no entry has been written, it still works as the client won't be able to read anything.

          I have changed LedgerRecoveryTest to test for: many entries written, one entry written, no entry written.

          I have been able to identify the problem that was causing BookieFailureTest to hang on Utkarsh's computer. Basically, when the queue of a BookieHandle is full and the corresponding bookie has crashed, we are not able to add a read operation to the queue incoming queue of the bookie handle because the BookieHandle is not processing new requests anymore and it is waiting to fail the handle. In this case, the BookieHandle throws an exception after timing out the call to add the read operation to the queue. We were propagating this exception to the application.

          The main problem is that we have to add the operation to the queue of ClientCBWorker so that we guarantee that it knows about the operation once we receive responses from bookies. If we throw an exception without removing the operation from the ClientCBWorker queue, the worker will wait forever, which I believe is the case Utkarsh was observing.

          If I reasoned about the code correctly, then my modifications fix this problem by retrying a few times and erroring out after a number of retries. Erroring out in this case means notifying the CBWorker so that we can release the operation.

          Fixing log level in LedgerConfig. -F

          I have mainly worked on the ledger recovery machinery. I made it asynchronous by transforming LedgerRecovery into a thread and moving some calls. We have to revisit this way of making it asynchronous as it might not be acceptable for this patch.

          I'm still to check why BookieFailureTest is failing for Utkarsh. It passes fine every time for me, so we have to find a way to reproduce it reliably in my machine so that I can debug it.

          Took a pass over asynchronous ledger operations: create, open, close. Some parts are still blocking, work on those next.

          Show
          Benjamin Reed added a comment - this patch fixes a range of projects. it is a big simplification. it has a net removal of 700 lines of code. the meta data for a ledger was collapsed into a single znode. here is a description of the changes: Index calculation in QuorumEngine must be synchronized on the LedgerHandle to avoid changes to the ensemble while trying to submit an operation. Such changes happen upon crashes of bookies. I initialized thought it was not necessary, but now I think this synchronization block is necessary. If a writer adds just a few entries to a ledger, it may end up with hints that say "empty ledger" when trying to recover a ledger. In this case, if we receive an empty ledger flag as a hint, we have to switch the hint to zero, which means that the client will start recovery from entry zero. If no entry has been written, it still works as the client won't be able to read anything. I have changed LedgerRecoveryTest to test for: many entries written, one entry written, no entry written. I have been able to identify the problem that was causing BookieFailureTest to hang on Utkarsh's computer. Basically, when the queue of a BookieHandle is full and the corresponding bookie has crashed, we are not able to add a read operation to the queue incoming queue of the bookie handle because the BookieHandle is not processing new requests anymore and it is waiting to fail the handle. In this case, the BookieHandle throws an exception after timing out the call to add the read operation to the queue. We were propagating this exception to the application. The main problem is that we have to add the operation to the queue of ClientCBWorker so that we guarantee that it knows about the operation once we receive responses from bookies. If we throw an exception without removing the operation from the ClientCBWorker queue, the worker will wait forever, which I believe is the case Utkarsh was observing. If I reasoned about the code correctly, then my modifications fix this problem by retrying a few times and erroring out after a number of retries. Erroring out in this case means notifying the CBWorker so that we can release the operation. Fixing log level in LedgerConfig. -F I have mainly worked on the ledger recovery machinery. I made it asynchronous by transforming LedgerRecovery into a thread and moving some calls. We have to revisit this way of making it asynchronous as it might not be acceptable for this patch. I'm still to check why BookieFailureTest is failing for Utkarsh. It passes fine every time for me, so we have to find a way to reproduce it reliably in my machine so that I can debug it. Took a pass over asynchronous ledger operations: create, open, close. Some parts are still blocking, work on those next.
          Hide
          Benjamin Reed added a comment -

          i should have also mentioned that this patch was done by flavio and utkarsh. i will be reviewing it.

          Show
          Benjamin Reed added a comment - i should have also mentioned that this patch was done by flavio and utkarsh. i will be reviewing it.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12416647/ZOOKEEPER-503.patch
          against trunk revision 803300.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 183 release audit warnings (more than the trunk's current 177 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416647/ZOOKEEPER-503.patch against trunk revision 803300. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 183 release audit warnings (more than the trunk's current 177 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/187/console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Some new files do not contain license header.

          Show
          Flavio Junqueira added a comment - Some new files do not contain license header.
          Hide
          Flavio Junqueira added a comment -

          Added license headers.

          Show
          Flavio Junqueira added a comment - Added license headers.
          Hide
          Flavio Junqueira added a comment -

          There are some TODO comments we left throughout this patch as reminders for what to work on next. Is this acceptable?

          Show
          Flavio Junqueira added a comment - There are some TODO comments we left throughout this patch as reminders for what to work on next. Is this acceptable?
          Hide
          Mahadev konar added a comment -

          Committed in ZOOKEEPER-507.

          Show
          Mahadev konar added a comment - Committed in ZOOKEEPER-507 .

            People

            • Assignee:
              Benjamin Reed
              Reporter:
              Benjamin Reed
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development