Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 1.6.6
Fix Version/s: None
Component/s: None
Description
So, I've been trying to test the 1.6.6 release candidate, and I keep getting frequent failures of ConditionalWriterIT.testTrace.
I've seen two kinds of failures, both of which result in timeouts (because the IT continues to retry until the timeout).
The first is a very simple case: it can't find any traces. This is probably due to a startup race condition where the client runs before the tracer is up. The spammy output looks like:
2016-09-14 17:04:26,799 [test.ConditionalWriterIT] INFO : Ignoring trace output as traceCount not greater than zero: 0
2016-09-14 17:04:27,802 [test.ConditionalWriterIT] INFO : Trace output:Did not find any traces!
2016-09-14 17:04:27,802 [test.ConditionalWriterIT] INFO : Ignoring trace output as traceCount not greater than zero: 0
2016-09-14 17:04:28,805 [test.ConditionalWriterIT] INFO : Trace output:Did not find any traces!
2016-09-14 17:04:28,805 [test.ConditionalWriterIT] INFO : Ignoring trace output as traceCount not greater than zero: 0
2016-09-14 17:04:29,808 [test.ConditionalWriterIT] INFO : Trace output:Did not find any traces!
The TraceServer's log file contains no activity after startup (there's no activity indicated for a successful run either; the tracer probably just doesn't log much, or I don't have the log threshold set low enough).
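For context, the retry behavior that produces the spam above follows roughly this pattern (a minimal sketch of my understanding, not the actual IT code; dumpTraces and countRootedSpans are hypothetical stand-ins):

import java.util.concurrent.TimeUnit;

// Rough sketch (assumption, not the real test) of the retry-until-timeout loop
// implied by the log output: keep re-reading the trace table until a usable
// trace shows up or the deadline passes.
public class TraceRetrySketch {

  interface TraceSource {
    String dumpTraces();               // hypothetical: read and format the trace table
    int countRootedSpans(String out);  // hypothetical: <= 0 means no usable trace yet
  }

  static String waitForTrace(TraceSource source, long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      String output = source.dumpTraces();
      int traceCount = source.countRootedSpans(output);
      if (traceCount > 0) {
        return output; // found a trace whose spans are all rooted
      }
      System.out.println("Ignoring trace output as traceCount not greater than zero: " + traceCount);
      TimeUnit.SECONDS.sleep(1);
    }
    throw new AssertionError("Timed out waiting for traces");
  }
}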
The second, and more annoying, failure is that sometimes some spans appear to be unrooted.
The output of a failed run looks like:
2016-09-13 22:15:06,464 [test.ConditionalWriterIT] INFO : Ignoring trace output as traceCount not greater than zero: -1
2016-09-13 22:15:07,475 [test.ConditionalWriterIT] INFO : Trace output:Trace started at 2016/09/13 22:15:01.046
Time  Start  Service@Location  Name
 322+0     testTrace@localhost traceTest
   0+35    testTrace@localhost client:startScan
  27+40    tserver@0.0.0.0 startScan
   0+67    testTrace@localhost client:closeScan
   0+68    tserver@0.0.0.0 closeScan
   0+68    testTrace@localhost client:startScan
   3+69    tserver@0.0.0.0 startScan
   1+70    tserver@0.0.0.0 metadata tablets read ahead 7
   0+72    testTrace@localhost client:closeScan
   0+73    tserver@0.0.0.0 closeScan
Warning: the following spans are not rooted!
98074c50fb57bce3 44a1f6292a6ec858 client:startScan
803c22e524cfaab4 c0944de077dcfbaa metadata tablets read ahead 2
c0944de077dcfbaa 44a1f6292a6ec858 startScan
e87ec35eb1a659ee 44a1f6292a6ec858 client:closeScan
75e4a9576afc000b 98bfb7e65f27e99 client:startConditionalUpdate
bd6250a4d3050aba 44a1f6292a6ec858 closeScan
98bfb7e65f27e99 44a1f6292a6ec858 ConditionalWriterImpl 2
d83a024f5e4dbae5 98bfb7e65f27e99 startConditionalUpdate
45472c5b3fc34b0b 98bfb7e65f27e99 client:conditionalUpdate
42d81ffa77149a94 98bfb7e65f27e99 conditionalUpdate
2d6c20f144b406cd 42d81ffa77149a94 Check conditions
715be33f3216fe99 42d81ffa77149a94 apply conditional mutations
ff11b4df0e8ddd94 715be33f3216fe99 prep
9acf9f34eeb4c7c1 715be33f3216fe99 wal
191305dbf1fb40ff 9acf9f34eeb4c7c1 update
3dd49cdc4b5a2d7d 9acf9f34eeb4c7c1 client:update
e31c17c89e1b3cdc 191305dbf1fb40ff prep
84dbf7ebd59be75 191305dbf1fb40ff commit
198bc516183d8e7f 191305dbf1fb40ff wal
61b91d25c997c748 715be33f3216fe99 commit
For comparison, the output of a successful run looks like:
2016-09-14 17:07:57,125 [test.ConditionalWriterIT] INFO : Ignoring trace output as traceCount not greater than zero: -1
2016-09-14 17:07:58,134 [test.ConditionalWriterIT] INFO : Trace output:Trace started at 2016/09/14 17:07:54.663
Time  Start  Service@Location  Name
 428+0     testTrace@localhost traceTest
   3+360   testTrace@localhost client:startScan
   4+365   tserver@0.0.0.0 startScan
   0+377   testTrace@localhost client:closeScan
   0+378   tserver@0.0.0.0 closeScan
   0+380   testTrace@localhost client:startScan
   5+381   tserver@0.0.0.0 startScan
   0+385   tserver@0.0.0.0 metadata tablets read ahead 4
   0+388   tserver@0.0.0.0 closeScan
   0+388   testTrace@localhost client:closeScan
  39+389   testTrace@localhost ConditionalWriterImpl 1
   1+397   testTrace@localhost client:startConditionalUpdate
   2+402   tserver@0.0.0.0 startConditionalUpdate
   2+408   testTrace@localhost client:conditionalUpdate
   8+417   tserver@0.0.0.0 conditionalUpdate
   2+419   tserver@0.0.0.0 Check conditions
   4+421   tserver@0.0.0.0 apply conditional mutations
   1+421   tserver@0.0.0.0 prep
   3+422   tserver@0.0.0.0 wal
   0+423   tserver@0.0.0.0 client:update
   2+423   tserver@0.0.0.0 update
   0+424   tserver@0.0.0.0 wal
   0+424   tserver@0.0.0.0 prep
   0+425   tserver@0.0.0.0 commit
   0+425   tserver@0.0.0.0 commit
The following is all the nodes!
c63771e049ce64df 0 traceTest
346315d5c08dacd c63771e049ce64df client:startScan
83a6b294967ec2c2 c63771e049ce64df startScan
c31f792dd3cf0479 c63771e049ce64df client:closeScan
6847784061285c9f c63771e049ce64df closeScan
6675da322b33924f c63771e049ce64df client:startScan
874c41927959277a c63771e049ce64df startScan
7f26ece0e74cebd3 874c41927959277a metadata tablets read ahead 4
7e4d4904313a0096 c63771e049ce64df closeScan
9324832a57fbe3c1 c63771e049ce64df client:closeScan
e6cccd862b014fc2 c63771e049ce64df ConditionalWriterImpl 1
e8c2e500c3b87beb e6cccd862b014fc2 client:startConditionalUpdate
1bbe4c4bc22565d9 e6cccd862b014fc2 startConditionalUpdate
f13d4a0dfbb322ed e6cccd862b014fc2 client:conditionalUpdate
2ec39d24415847b1 e6cccd862b014fc2 conditionalUpdate
2b076238cd26cb8f 2ec39d24415847b1 Check conditions
c0addb5c7a6b3fca 2ec39d24415847b1 apply conditional mutations
2dd63e5ea321119d c0addb5c7a6b3fca prep
d6958323b0cb7f5e c0addb5c7a6b3fca wal
616296d843722d0a d6958323b0cb7f5e client:update
db05c61214e16ce5 d6958323b0cb7f5e update
9d3417d7b56a21e0 db05c61214e16ce5 prep
2d66569d22a19871 db05c61214e16ce5 wal
21cd8265be1e6c94 c0addb5c7a6b3fca commit
b381168631ad019c db05c61214e16ce5 commit
I added code to dump the span IDs and parent IDs for the rooted spans, to make the two runs easier to compare.
The biggest thing that stands out to me is that the span described as ConditionalWriterImpl 2 is not rooted at "testTrace", whereas ConditionalWriterImpl 1 is in the successful case.
As far as I can tell, the failure case always involves ConditionalWriterImpl 2, and the successful case always involves ConditionalWriterImpl 1.
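For reference, here is roughly how I read "rooted" vs. "unrooted" in these dumps (a sketch based only on the output format above, where the root span's parent shows as 0; the Span class and method names are stand-ins, not the real trace classes):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: a span is rooted if following parentId links reaches the root span
// (parent "0" in the node dump above); if the chain hits a span ID that was
// never collected, the span is reported as unrooted.
public class RootedSpanSketch {

  static class Span {
    final String spanId;
    final String parentId;
    final String description;
    Span(String spanId, String parentId, String description) {
      this.spanId = spanId;
      this.parentId = parentId;
      this.description = description;
    }
  }

  static final String ROOT_PARENT = "0";

  static boolean isRooted(Span span, Map<String, Span> spansById) {
    Set<String> seen = new HashSet<>(); // guard against cycles in bad data
    Span current = span;
    while (current != null && seen.add(current.spanId)) {
      if (ROOT_PARENT.equals(current.parentId)) {
        return true; // reached the trace's root span
      }
      current = spansById.get(current.parentId); // null if the parent was never collected
    }
    return false; // parent chain broke before reaching the root
  }
}

By that reading, ConditionalWriterImpl 2 gets flagged because its parent chain (via 44a1f6292a6ec858) never reaches the traceTest root, and everything underneath it is flagged transitively.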
It looks like the problem is caused by NamingThreadFactory wrapping Runnables in a new TraceRunnable based on the current thread's currentSpan. There is no guarantee about which thread triggers creation of new worker threads inside the thread pool's executor, so the span captured at thread-creation time can belong to the wrong trace. Instead of having NamingThreadFactory do the wrapping when it creates a new executor thread, we need to wrap tasks before they are placed into the thread pool (when execute() is called), based on the originating thread's currentSpan.
NamingThreadFactory is used in multiple places, so it may not be easy to address this everywhere.
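To illustrate the direction I have in mind (a sketch only; Span, Tracer.currentSpan(), and TraceRunnable below are stand-ins rather than the real tracing classes), the wrap would capture the submitting thread's span at execute() time instead of whatever span the factory thread happens to have when a worker is created:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of submit-time wrapping: the span is captured from the thread that
// hands the task to the pool, not from the thread that happened to trigger
// creation of a new worker thread via the ThreadFactory.
public class SubmitTimeWrapSketch {

  interface Span {}

  static class Tracer {
    // Stand-in for the thread-local "current span" lookup.
    static Span currentSpan() { return null; }
  }

  // Stand-in for a Runnable that re-activates a captured span on the worker thread.
  static class TraceRunnable implements Runnable {
    private final Span parent;
    private final Runnable task;
    TraceRunnable(Span parent, Runnable task) {
      this.parent = parent;
      this.task = task;
    }
    @Override
    public void run() {
      // In the real implementation this would make 'parent' the current span
      // for the duration of task.run(), then restore the previous span.
      task.run();
    }
  }

  // Wrap when the task is submitted, so the captured span belongs to the
  // originating thread regardless of how the pool creates its workers.
  static void submitTraced(ExecutorService pool, Runnable task) {
    pool.execute(new TraceRunnable(Tracer.currentSpan(), task));
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    submitTraced(pool, () -> System.out.println("runs under the submitter's span"));
    pool.shutdown();
  }
}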