Geode / GEODE-8357

Exhausting the high priority message thread pool can result in deadlock


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
    • Fix Version/s: None
    • Component/s: messaging

    Description

      The system property "DistributionManager.MAX_THREADS" defaults to 100:

      int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
      

      The property was originally defined in geode-core ClusterDistributionManager and has since moved to geode-core OperationExecutors.
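      To make the lookup above concrete, here is a minimal sketch of how Integer.getInteger resolves the value. The class and method names (MaxThreadsProperty, resolveMaxThreads) are hypothetical and not part of Geode. The sketch re-reads the property on every call so the effect is visible; a constant initialized once at class load would normally only see a value supplied at JVM startup, e.g. with -DDistributionManager.MAX_THREADS=1000.

```java
public class MaxThreadsProperty {

  // Property name taken from the issue description.
  static final String KEY = "DistributionManager.MAX_THREADS";

  // Hypothetical helper: same lookup as the MAX_THREADS constant,
  // but evaluated on each call instead of once at class load.
  static int resolveMaxThreads() {
    // Integer.getInteger returns the default when the property is
    // missing or is not a valid integer.
    return Integer.getInteger(KEY, 100);
  }

  public static void main(String[] args) {
    System.clearProperty(KEY);
    System.out.println(resolveMaxThreads()); // property unset: default 100

    System.setProperty(KEY, "1000");
    System.out.println(resolveMaxThreads()); // explicit override: 1000

    System.setProperty(KEY, "lots");
    System.out.println(resolveMaxThreads()); // unparsable: falls back to 100
  }
}
```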

      The value caps the number of threads in both the ClusterOperationExecutors threadPool and highPriorityPool:

      threadPool =
          CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
              thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
              MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
              INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
      
      highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
          "Pooled High Priority Message Processor ",
          thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
          MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
          INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
      

      I have seen server startup hang while recovering a large number of expired entries from disk with PDX in use. The hang appears to be a dlock request for the PDX lock that never receives a response. In VSD, the distributionStats#highPriorityQueueSize statistic is maxed out and never drops.

      The dlock response granting the PDX lock is stuck in the high priority queue because no high priority pool threads are free to process it. Stack dumps show that every high priority pool thread is running a task, such as recovering a bucket from disk, that is blocked waiting for the PDX lock.
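      The failure mode described above can be reproduced in miniature with plain java.util.concurrent. This is only a sketch under stated assumptions: it does not use Geode's CoreLoggingExecutors or dlock code, the class PoolStarvationDemo and its method names are hypothetical, and a CountDownLatch stands in for the PDX lock. Every pool thread blocks waiting for the "lock", and the task that would grant it sits in the queue of the same pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {

  // Waits on a latch; restores the interrupt flag and gives up if interrupted.
  private static boolean await(CountDownLatch latch, long timeoutMillis) {
    try {
      return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Returns true if the "grant" task ran within the timeout. */
  static boolean grantProcessed(int poolSize, int blockedTasks, long timeoutMillis) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    CountDownLatch lockGranted = new CountDownLatch(1);
    CountDownLatch workersBlocked =
        new CountDownLatch(Math.min(poolSize, blockedTasks));
    try {
      for (int i = 0; i < blockedTasks; i++) {
        pool.execute(() -> {
          workersBlocked.countDown();
          // Stand-in for a recovery task blocked waiting for the PDX lock.
          await(lockGranted, Long.MAX_VALUE);
        });
      }
      // Make sure the workers are occupied before the response "arrives".
      await(workersBlocked, 10_000);
      // Stand-in for the dlock response: it is queued on the same pool.
      pool.execute(lockGranted::countDown);
      return await(lockGranted, timeoutMillis);
    } finally {
      pool.shutdownNow(); // interrupt blocked workers so the demo can exit
    }
  }

  public static void main(String[] args) {
    // All threads are blocked, so the grant is never processed.
    System.out.println("pool=2, blocked=2: " + grantProcessed(2, 2, 500));
    // One spare thread is enough for the grant to get through.
    System.out.println("pool=3, blocked=2: " + grantProcessed(3, 2, 500));
  }
}
```

      With a pool of 2 and 2 blocked tasks the grant should time out; with one spare thread it should be processed. A larger limit only moves the threshold, since enough blocked tasks will still exhaust the pool.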

      Several changes could improve this situation, either individually or in combination:

      1. Improve observability so that support can identify when this situation has occurred
      2. Increase the MAX_THREADS default to 1000
      3. Automatically detect this situation and warn the user with a log statement
      4. Automatically prevent this situation
      5. Identify the messages that are prone to causing deadlocks and move them to a dedicated thread pool with a higher limit
      6. Move dlock messages to a new dedicated thread pool
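      As one possible shape for items 1 and 3, the sketch below samples a standard java.util.concurrent.ThreadPoolExecutor and flags the pool as stalled when all threads are busy, work is queued, and no task has completed since the previous sample. This is an assumption-laden illustration, not Geode code: PoolStallDetector, sample, and runScenario are hypothetical names, and a real implementation would likely hook into the existing statistics and thread monitoring instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolStallDetector {
  private final ThreadPoolExecutor pool;
  private long lastCompleted = -1; // no baseline yet

  PoolStallDetector(ThreadPoolExecutor pool) {
    this.pool = pool;
  }

  /**
   * Call periodically. Returns true when the pool is saturated, has queued
   * work, and completed nothing since the previous call.
   */
  synchronized boolean sample() {
    long completed = pool.getCompletedTaskCount();
    boolean saturated = pool.getActiveCount() >= pool.getMaximumPoolSize()
        && !pool.getQueue().isEmpty();
    boolean stalled = saturated && completed == lastCompleted;
    lastCompleted = completed;
    return stalled;
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Builds a starved two-thread pool and returns two consecutive samples. */
  static boolean[] runScenario() {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    CountDownLatch started = new CountDownLatch(2);
    PoolStallDetector detector = new PoolStallDetector(pool);
    try {
      for (int i = 0; i < 2; i++) {
        pool.execute(() -> {
          started.countDown();
          awaitQuietly(release); // blocked, like a task waiting for a lock
        });
      }
      awaitQuietly(started);
      pool.execute(release::countDown); // queued behind the blocked tasks
      boolean first = detector.sample();  // establishes the baseline
      boolean second = detector.sample(); // no progress since the baseline
      return new boolean[] {first, second};
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    boolean[] samples = runScenario();
    if (samples[1]) {
      // In a real system this would be a log warning with the pool name.
      System.out.println("WARNING: thread pool saturated with no progress");
    }
  }
}
```

      The first sample only records a baseline, so a warning needs at least two samples. The sampling interval would have to be long enough that a busy but healthy pool completes at least one task between samples.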

          People

            klund Kirk Lund