[GEODE-8357] Exhausting the high priority message thread pool can result in deadlock - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
Fix Version/s: None
Component/s: messaging
Labels:
- GeodeOperationAPI

Description

The system property "DistributionManager.MAX_THREADS" defaults to 100:

int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);

The system property used to be defined in geode-core ClusterDistributionManager and has moved to geode-core OperationExecutors.
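As a rough illustration of how that lookup behaves, here is a minimal standalone sketch. The class name MaxThreadsProperty and the helper readMaxThreads are made up for this example and are not part of Geode; only the Integer.getInteger call mirrors the line quoted above.

```java
/** Hypothetical helper, not Geode code: shows how the quoted lookup resolves its value. */
final class MaxThreadsProperty {

  // Same lookup pattern as the line quoted in the issue: use the system
  // property if it is set to a valid integer, otherwise fall back to 100.
  static int readMaxThreads() {
    return Integer.getInteger("DistributionManager.MAX_THREADS", 100);
  }

  public static void main(String[] args) {
    System.out.println("current value: " + readMaxThreads());

    // Setting the property programmatically only affects later lookups.
    System.setProperty("DistributionManager.MAX_THREADS", "1000");
    System.out.println("after override: " + readMaxThreads());
  }
}
```

Since the issue shows the value being assigned to a field, it is presumably read once when the owning class is initialized. If so, an override would have to be in place before that point, for example as -DDistributionManager.MAX_THREADS=1000 on the JVM command line.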

The value caps the number of threads in both the threadPool and the highPriorityPool of ClusterOperationExecutors:

threadPool =
    CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
        thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
        MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
        INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());

highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
    "Pooled High Priority Message Processor ",
    thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
    MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
    INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());

I have seen server startup hang when recovering lots of expired entries from disk while using PDX. The hang looks like a dlock request for the PDX lock that never receives a response. Checking the distributionStats#highPriorityQueueSize statistic (in VSD) shows the value maxed out and never dropping.

The dlock response granting the PDX lock is stuck in the highPriorityQueue because there are no more high priority pool threads available to process it. The stack dumps show that every high priority pool thread is running a task, such as recovering a bucket from disk, that is blocked waiting for the PDX lock.
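The shape of this hang can be reproduced outside Geode with plain java.util.concurrent classes. The sketch below is a simplified model, not the Geode code path: PoolExhaustionDemo, its tiny MAX_THREADS of 4, and the latch standing in for the PDX lock grant are all assumptions made for illustration. Every pool thread runs a task that waits for the grant, and the task that delivers the grant is queued behind them in the same bounded pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the hang; not Geode code. */
final class PoolExhaustionDemo {
  static final int MAX_THREADS = 4; // small stand-in for DistributionManager.MAX_THREADS

  /**
   * Fills the pool with tasks that wait for a "lock granted" signal, then submits
   * the task that delivers that signal. Returns true if the waiting tasks finished
   * within the timeout, false if they were still blocked.
   */
  static boolean run(boolean dedicatedPoolForGrant) {
    ExecutorService highPriorityPool = Executors.newFixedThreadPool(MAX_THREADS);
    ExecutorService grantPool =
        dedicatedPoolForGrant ? Executors.newSingleThreadExecutor() : highPriorityPool;
    CountDownLatch lockGranted = new CountDownLatch(1);
    CountDownLatch workersDone = new CountDownLatch(MAX_THREADS);
    try {
      for (int i = 0; i < MAX_THREADS; i++) {
        // Models "recover bucket from disk, blocked waiting for the PDX lock".
        highPriorityPool.execute(() -> {
          try {
            lockGranted.await();
            workersDone.countDown();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Models the dlock response. In the shared case it sits in the queue
      // because every pool thread is already occupied by a waiter.
      grantPool.execute(lockGranted::countDown);
      return workersDone.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      // Interrupt any still-blocked waiters so the JVM can exit.
      highPriorityPool.shutdownNow();
      grantPool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("shared pool completed: " + run(false));
    System.out.println("dedicated grant pool completed: " + run(true));
  }
}
```

With the shared pool the waiters should never complete, because the one task that could release them cannot get a thread. Routing the grant through its own executor lets it run regardless of how many waiters occupy the main pool.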

Several changes could improve this situation, either in conjunction or individually:

- improve observability to enable support to identify that this situation has occurred
- increase MAX_THREADS default to 1000
- automatically identify this situation and warn the user with a log statement
- automatically prevent this situation
- identify the messages that are prone to causing deadlocks and move them to a dedicated thread pool with a higher limit
- move dlock messages to a new dedicated thread pool
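For the "automatically identify this situation" option, one possible approach is a periodic check that flags a pool when all threads are busy, work is queued, and nothing has completed since the previous check. The sketch below assumes the pool is (or wraps) a java.util.concurrent.ThreadPoolExecutor. PoolSaturationCheck and looksStuck are hypothetical names, and nothing here reflects how Geode's thread monitoring is actually implemented.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog; not part of Geode. */
final class PoolSaturationCheck {
  private final ThreadPoolExecutor pool;
  private long lastCompleted = -1; // completed-task count seen by the previous check

  PoolSaturationCheck(ThreadPoolExecutor pool) {
    this.pool = pool;
  }

  /**
   * Intended to be called periodically. Returns true when every thread is busy,
   * work is waiting in the queue, and no task finished since the previous call.
   * The first call only records a baseline and always returns false.
   */
  synchronized boolean looksStuck() {
    long completed = pool.getCompletedTaskCount();
    boolean saturated =
        pool.getActiveCount() >= pool.getMaximumPoolSize() && !pool.getQueue().isEmpty();
    boolean stuck = saturated && completed == lastCompleted;
    lastCompleted = completed;
    return stuck;
  }

  /** Self-check: two blocked workers plus one queued task should be reported. */
  static boolean demo() {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(2, 2, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    CountDownLatch started = new CountDownLatch(2);
    CountDownLatch release = new CountDownLatch(1);
    Runnable blocked = () -> {
      started.countDown();
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    try {
      pool.execute(blocked);
      pool.execute(blocked);
      pool.execute(() -> { }); // queued behind the two blocked workers
      started.await(); // both workers are now running their tasks
      PoolSaturationCheck check = new PoolSaturationCheck(pool);
      boolean first = check.looksStuck(); // establishes the baseline
      boolean second = check.looksStuck(); // no progress since the baseline
      return !first && second;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("stuck pool detected: " + demo());
  }
}
```

A real detector would probably require the condition to persist across several intervals before logging a warning, since getActiveCount is documented as approximate and a briefly saturated pool is not necessarily deadlocked.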

People

Assignee: Kirk Lund

Reporter: Kirk Lund

Votes: 0

Watchers: 3

Dates

Created: 13/Jul/20 17:03

Updated: 15/Jul/21 20:28