  1. Bookkeeper
  2. BOOKKEEPER-215

Deadlock occurs under high load


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.1.0
    • Fix Version/s: 4.1.0
    • Component/s: hedwig-server
    • Labels: None

    Description

      LedgerHandle uses a Semaphore (opCounterSem) with a default of 5000 permits to throttle outstanding requests. This causes a deadlock under high load. What I've observed is the following: OrderedSafeExecutor (mainWorkerPool in BookKeeper) creates a fixed number of threads, and PerChannelBookieClient uses this pool to execute operations. Under high load, the bookies cannot satisfy requests at the rate at which they are generated. This exhausts all permits in the Semaphore, and any further operations block on lh.opCounterSem.acquire(). If the connection to the bookies is shut down in this state, channelDisconnected in PerChannelBookieClient tries to error out all outstanding entries. The errorOutReadKey and errorOutAddKey functions enqueue these operations in the same mainWorkerPool, all of whose threads are blocked on acquire(). So handleBookieFailure is never executed and the server stops responding.
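
      The scenario above can be reproduced in miniature. The following is only a sketch with plain java.util.concurrent classes, not BookKeeper code: the class name, method name, pool size and permit count are all made up for illustration, and a plain fixed thread pool stands in for OrderedSafeExecutor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the mainWorkerPool / opCounterSem interaction.
public class SemaphorePoolDeadlockSketch {

    /**
     * Returns true if the "error out" task (the only code that would free
     * permits) gets to run within timeoutMs.
     */
    static boolean errorOutRuns(int poolThreads, int permits, long timeoutMs)
            throws InterruptedException {
        ExecutorService workerPool = Executors.newFixedThreadPool(poolThreads);
        Semaphore opCounterSem = new Semaphore(permits);
        CountDownLatch errorOutRan = new CountDownLatch(1);
        try {
            // Requests the bookies never answer: every permit is taken.
            opCounterSem.acquire(permits);

            // Each pool thread picks up one more operation and blocks in acquire().
            CountDownLatch allStarted = new CountDownLatch(poolThreads);
            for (int i = 0; i < poolThreads; i++) {
                workerPool.execute(() -> {
                    allStarted.countDown();
                    try {
                        opCounterSem.acquire();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allStarted.await();

            // The disconnect path queues its cleanup on the SAME pool,
            // so it sits behind threads that are waiting for it.
            workerPool.execute(() -> {
                opCounterSem.release(permits);
                errorOutRan.countDown();
            });
            return errorOutRan.await(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            workerPool.shutdownNow(); // interrupt the blocked workers
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // prints "error-out task ran: false"
        System.out.println("error-out task ran: " + errorOutRuns(2, 5, 500));
    }
}
```

      With 2 threads and 5 permits the cleanup task never starts, which mirrors handleBookieFailure never being executed.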

      Blocking inside a fixed-size thread pool doesn't sound quite right. As a temporary fix, I gave every PerChannelBookieClient its own ExecutorService and queued the operations from the errorOut* functions in it, but this is just a quick fix. I feel that the server shouldn't rely on LedgerHandle to throttle connections, but should do this itself. Any other ideas on how to fix this? I'd be happy to contribute a patch.
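
      A rough sketch of the quick fix described above, again using only java.util.concurrent and hypothetical names (errorOutExecutor, workersUnblock): the cleanup that releases permits runs on its own executor, so it cannot be starved by workers blocked on the semaphore. This illustrates the idea only and is not the content of any attached patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: error-out work gets a dedicated executor.
public class SeparateErrorExecutorSketch {

    /**
     * Returns true if all blocked workers get a permit within timeoutMs.
     * Assumes permits >= poolThreads so that one release frees every worker.
     */
    static boolean workersUnblock(int poolThreads, int permits, long timeoutMs)
            throws InterruptedException {
        ExecutorService workerPool = Executors.newFixedThreadPool(poolThreads);
        // Separate executor, standing in for the per-client ExecutorService.
        ExecutorService errorOutExecutor = Executors.newSingleThreadExecutor();
        Semaphore opCounterSem = new Semaphore(permits);
        CountDownLatch workersDone = new CountDownLatch(poolThreads);
        try {
            // All permits are held by unanswered requests.
            opCounterSem.acquire(permits);

            // Every worker thread blocks waiting for a permit.
            for (int i = 0; i < poolThreads; i++) {
                workerPool.execute(() -> {
                    try {
                        opCounterSem.acquire();
                        workersDone.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Cleanup runs off the worker pool, so it is not queued behind
            // the blocked workers and can release the permits.
            errorOutExecutor.execute(() -> opCounterSem.release(permits));

            return workersDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            workerPool.shutdownNow();
            errorOutExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // prints "workers unblocked: true"
        System.out.println("workers unblocked: " + workersUnblock(2, 5, 2000));
    }
}
```

      This removes the starvation but still leaves blocking calls inside the worker pool, which is why it is only a stopgap.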

      Attachments

        1. hedwig_ts.log
          52 kB
          Aniruddha
        2. BK-215.patch
          21 kB
          Sijie Guo
        3. BK-215.patch_v2
          22 kB
          Sijie Guo
        4. DeadlockCheckOrderedSafeExecutor.java
          3 kB
          Sijie Guo
        5. BK-215-check-deadlock.patch
          52 kB
          Sijie Guo
        6. BK-215.patch_v3
          29 kB
          Sijie Guo
        7. BK-215.patch_v4
          21 kB
          Sijie Guo


            People

              Assignee: Sijie Guo (hustlmsp)
              Reporter: Aniruddha (i0exception)