this patch fixes a range of projects. it is a big simplification. it has a net removal of 700 lines of code. the meta data for a ledger was collapsed into a single znode. here is a description of the changes:
Index calculation in QuorumEngine must be synchronized on the LedgerHandle to avoid changes to the ensemble while trying to submit an operation. Such changes happen upon crashes of bookies.
I initialized thought it was not necessary, but now I think this synchronization block is necessary.
If a writer adds just a few entries to a ledger, it may end up with hints that say "empty ledger" when trying to recover a ledger. In this case, if we receive an empty ledger flag as a hint, we have to switch the hint to zero, which means that the client will start recovery from entry zero. If no entry has been written, it still works as the client won't be able to read anything.
I have changed LedgerRecoveryTest to test for: many entries written, one entry written, no entry written.
I have been able to identify the problem that was causing BookieFailureTest to hang on Utkarsh's computer. Basically, when the queue of a BookieHandle is full and the corresponding bookie has crashed, we are not able to add a read operation to the queue incoming queue of the bookie handle because the BookieHandle is not processing new requests anymore and it is waiting to fail the handle. In this case, the BookieHandle throws an exception after timing out the call to add the read operation to the queue. We were propagating this exception to the application.
The main problem is that we have to add the operation to the queue of ClientCBWorker so that we guarantee that it knows about the operation once we receive responses from bookies. If we throw an exception without removing the operation from the ClientCBWorker queue, the worker will wait forever, which I believe is the case Utkarsh was observing.
If I reasoned about the code correctly, then my modifications fix this problem by retrying a few times and erroring out after a number of retries. Erroring out in this case means notifying the CBWorker so that we can release the operation.
Fixing log level in LedgerConfig. -F
I have mainly worked on the ledger recovery machinery. I made it asynchronous by transforming LedgerRecovery into a thread and moving some calls. We have to revisit this way of making it asynchronous as it might not be acceptable for this patch.
I'm still to check why BookieFailureTest is failing for Utkarsh. It passes fine every time for me, so we have to find a way to reproduce it reliably in my machine so that I can debug it.
Took a pass over asynchronous ledger operations: create, open, close. Some parts are still blocking, work on those next.