Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Invalid
Description
While beasting some facet-related cloud tests on master, I noticed a pattern of occasional failures that seemed to crop up...
- test ultimately fails due to a timeout (usually the client threads time out waiting for a server response)
- if I notice my CPU isn't spinning very hard before the test fails, I can capture a jstack and inspect some threads
- there will be multiple jetty/solr request threads (ex: "qtp82184175-145") whose stack traces show various stages of DocSet collection, and which are reported as "... in Object.wait()" but also RUNNABLE
...this isn't a thread summary+state combination that I'm used to seeing when looking at thread dumps, and some research into when/why this might happen led me to:
...while the comments/status of JDK-8037567 suggest "nothing wrong here", the overall symptoms/description of the problem in the SO answer and linked blog, and their summation that this is essentially a "deadlock" situation in the class loader, do seem to correlate with some of the specifics I can see in the stack traces when this happens while running solr tests...
- at least one "RUNNABLE / Object.wait" thread trying to do class init; class: DocSet...
"qtp1535326437-68" #68 prio=5 os_prio=0 cpu=72.48ms elapsed=241.69s tid=0x00007fc08c0a4000 nid=0x864 in Object.wait() [0x00007fc0adedd000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSet.<clinit>(DocSet.java:118)
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
- other "RUNNABLE / Object.wait" threads are on lines that involve instantiating a subclass of DocSet:
"qtp1535326437-67" #67 prio=5 os_prio=0 cpu=801.44ms elapsed=241.69s tid=0x00007fc08c0a1800 nid=0x863 in Object.wait() [0x00007fc0adfdf000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)

"qtp82184175-65" #65 prio=5 os_prio=0 cpu=137.76ms elapsed=241.69s tid=0x00007fc088092000 nid=0x860 in Object.wait() [0x00007fc0ae2e2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:84) // "new SortedIntDocSet(..)"
        at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
- etc...
- DocSet has a static reference to a concrete subclass...
- {{public static final DocSet EMPTY = new SortedIntDocSet(new int[0], 0);}}
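The pattern described above (a base class whose static initializer instantiates one of its own subclasses) can be reproduced in isolation. What follows is only a minimal sketch, not Solr code: {{Base}} and {{Sub}} are hypothetical stand-ins for DocSet and SortedIntDocSet, and the latches and the sleep are assumptions added purely to force the unlucky interleaving that would normally depend on timing. One thread starts initializing {{Base}} while another starts initializing {{Sub}}; each then needs the class the other thread is in the middle of initializing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class ClassInitDeadlockDemo {
    // Latches used only to force the unlucky interleaving (illustrative, not from Solr).
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);
    static final CountDownLatch subInitRequested = new CountDownLatch(1);

    // Hypothetical stand-in for DocSet: its static initializer instantiates a subclass.
    static class Base {
        static final Base EMPTY;
        static {
            baseInitStarted.countDown();     // this thread is now initializing Base
            awaitQuietly(subInitRequested);  // wait until the other thread is about to touch Sub
            sleepQuietly(300);               // give it time to begin initializing Sub
            EMPTY = new Sub();               // needs Sub, which the other thread is initializing
        }
    }

    // Hypothetical stand-in for SortedIntDocSet.
    static class Sub extends Base {}

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads are still blocked after the timeout. */
    static boolean deadlocks(long timeoutMillis) {
        // Thread A: the first access to Base triggers Base's static initializer.
        Thread a = new Thread(() -> { Object unused = Base.EMPTY; }, "init-base");
        // Thread B: begins initializing Sub, which must first wait for its superclass Base.
        Thread b = new Thread(() -> {
            awaitQuietly(baseInitStarted);
            subInitRequested.countDown();
            new Sub();
        }, "init-sub");
        a.setDaemon(true); // daemon threads, so the JVM can still exit afterwards
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(timeoutMillis);
            b.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads stuck in class init: " + deadlocks(1500));
    }
}
```

Capturing a jstack while the two threads are stuck is one way to compare their reported state against the traces above. Both threads are daemons, so the demo JVM exits once {{main}} returns instead of hanging the way the test runs do.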
I should point out:
- While this particular "class loading deadlock" issue seems more likely to happen in a "test" situation where the JVMs/classloaders are short-lived, there's no reason to assume this type of failure couldn't happen in a production solr instance when handling a burst of queries right after startup.
- This type of failure (either specifically due to "DocSet vs SortedIntDocSet", or due to similar patterns in other classes) may also be the root cause of various other hard-to-reproduce "timed out" test failures we've seen over the years.
Issue Links
- relates to SOLR-14256 Remove HashDocSet (Closed)