Seeing this a bunch on Jenkins - the previous leader's ephemeral node is still around for some reason.
Overseer threads are only being interrupted and not closed.
Occasional OverseerTest#testOverseerFailure fail due to missing election node.
When registering as the leader, if an existing ephemeral registration exists, wait a short time to see if it goes away.
Commit 1572370 from Mark Miller in branch 'dev/trunk'
[ https://svn.apache.org/r1572370 ]
SOLR-5596: Improve this test.
Commit 1572371 from Mark Miller in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1572371 ]
That last attempt did not work - I just saw this again locally.
SOLR-5799 may solve this. My best guess is that the previous leader is just taking a little longer than we would expect to have its ephemeral leader registration node removed.
SOLR-5799 was just committed. We now wait a short time if an ephemeral leader registration node already exists. If we are simply catching it briefly before it goes away, we wait and, once it is gone, create our own ephemeral registration node.
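For anyone following along, here is a rough sketch of the "wait a short time, then register" idea. It is not the SOLR-5799 code: the class name, method name, and the `BooleanSupplier` standing in for a ZooKeeper existence check are all made up for illustration, and the timeout and poll interval are arbitrary.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Solr. It only models the polling logic.
class LeaderRegistrationSketch {

    /**
     * Polls until the stale leader node is gone or the timeout elapses.
     * nodeExists is a stand-in for a real ZooKeeper "exists" check.
     * Returns true if the node disappeared in time, false otherwise.
     */
    static boolean waitForNodeToVanish(BooleanSupplier nodeExists, long timeoutMs, long pollMs) {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (nodeExists.getAsBoolean()) {
            if (System.nanoTime() - start >= timeoutNanos) {
                return false; // still there: the caller should fail or retry
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true; // gone: safe to create our own ephemeral node
    }
}
```

A caller would create its own ephemeral registration node only when this returns true, and treat false as the "leader node already exists" failure.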
So we still hit this - pretty surprising. I've gone over the test a couple times and have not spotted the problem yet, but I think it must be an issue with the test.
Move issue to Solr 4.9.
I was looking into the logs of this fail today:
[junit4] 2> 472241 T2893 oazsp.FileTxnLog.commit WARN fsync-ing the write ahead log in SyncThread:0 took 11588ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
This warning can be due to a slow machine, but it also happens on fast machines if you do a lot of writes to ZooKeeper very quickly, which is what testShardLeaderChange does. Perhaps we should add a small wait between operations?
Would it make sense to set forceSync to no for ZooKeeper in our tests? At the very least, it would reduce the spurious failures and let us concentrate on fixing real bugs.
Would it make sense to set forceSync to no for ZooKeeper in our tests?
I think I tried it many months ago and still saw the problem. I can't remember exactly what settings I tried though, so feel free to see if you can get it to work. We don't need to worry about this type of thing with ZooKeeper for 99.9% of our tests.
I'll take a crack at it.
Commit 1608555 from firstname.lastname@example.org in branch 'dev/trunk'
[ https://svn.apache.org/r1608555 ]
SOLR-5596: Set system property zookeeper.forceSync=no for Solr test cases
Commit 1608559 from email@example.com in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1608559 ]
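To illustrate what the forceSync commits above amount to, here is a minimal sketch of setting and restoring the system property around a test run. `zookeeper.forceSync` is the property named in the commit message; the helper class and method names are hypothetical and this is not the committed change. As far as I know, ZooKeeper reads the property once when its transaction log class is loaded, so it would need to be set before the embedded server starts.

```java
// Hypothetical test helper, not the committed Solr change.
class ZkTestSettingsSketch {

    static final String FORCE_SYNC_PROP = "zookeeper.forceSync";

    /** Turns fsync forcing off and returns the previous value (null if unset). */
    static String disableForceSync() {
        return System.setProperty(FORCE_SYNC_PROP, "no");
    }

    /** Restores whatever value was in place before disableForceSync(). */
    static void restore(String previous) {
        if (previous == null) {
            System.clearProperty(FORCE_SYNC_PROP);
        } else {
            System.setProperty(FORCE_SYNC_PROP, previous);
        }
    }
}
```

In a JUnit-style test this would typically be called from class-level setup and teardown so that the setting does not leak into other suites.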
Commit 1608562 from firstname.lastname@example.org in branch 'dev/trunk'
[ https://svn.apache.org/r1608562 ]
SOLR-5596: Remove initCore call from afterClass
Commit 1608565 from email@example.com in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1608565 ]
Yeah, I think this is the same result as when I tried to remove the forceSync - still happens: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/4201/
I think this may actually be due to SOLR-6426: SolrZkClient clean can fail due to a race with children nodes.
Commit 1620247 from Mark Miller in branch 'dev/trunk'
[ https://svn.apache.org/r1620247 ]
SOLR-5596: Raise zk client timeout for mock objects.
Commit 1620248 from Mark Miller in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1620248 ]
No, it can still happen.
Commit 1620319 from Mark Miller in branch 'dev/trunk'
[ https://svn.apache.org/r1620319 ]
SOLR-6428: Occasional OverseerTest#testOverseerFailure fail due to missing election node.
SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.
Commit 1620320 from Mark Miller in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1620320 ]
Okay, now I think this will stop. We will see.
Bulk close after 5.0 release.