  Solr / SOLR-10032

Create report to assess Solr test quality at a commit point.

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 8.0
    • Component/s: Tests
    • Labels: None

    Description

      We have many Jenkins instances blasting tests: some official, some Policeman's, and some that I and others have or have had ourselves. The email trail proves the power of the Jenkins cluster to find test fails.

      However, I still have a very hard time with some basic questions:

      What tests are flakey right now? Which test fails actually affect devs most? Did I break it? Was that test already flakey? Is that test still flakey? What are our worst tests right now? Is that test getting better or worse?

      We really need a way to see exactly which tests are the problem, not because of OS or environmental issues, but because of more basic test quality issues: which tests are flakey, and how flakey they are, at any point in time.

      Reports:
      https://drive.google.com/drive/folders/0ByYyjsrbz7-qa2dOaU1UZDdRVzg?usp=sharing

      01/24/2017 - https://docs.google.com/spreadsheets/d/1JySta2j2s7A_p16wA1UO-l6c4GsUHBIb4FONS2EzW9k/edit?usp=sharing
      02/01/2017 - https://docs.google.com/spreadsheets/d/1FndoyHmihaOVL2o_Zns5alpNdAJlNsEwQVoJ4XDWj3c/edit?usp=sharing
      02/08/2017 - https://docs.google.com/spreadsheets/d/1N6RxH4Edd7ldRIaVfin0si-uSLGyowQi8-7mcux27S0/edit?usp=sharing
      02/14/2017 - https://docs.google.com/spreadsheets/d/1eZ9_ds_0XyqsKKp8xkmESrcMZRP85jTxSKkNwgtcUn0/edit?usp=sharing
      02/17/2017 - https://docs.google.com/spreadsheets/d/1LEPvXbsoHtKfIcZCJZ3_P6OHp7S5g2HP2OJgU6B2sAg/edit?usp=sharing

          Activity

            markrmiller@gmail.com Mark Miller added a comment -

            I'll create sub issues off this JIRA to address the worst tests I find.

            ctargett Cassandra Targett added a comment -

            +1. Flakey test clean up is an arduous and unforgiving task, but it's hard to overstate its importance.

            Do you have ideas on how we might track this beyond trying to fix the biggest offenders? If there's stuff we can do to raise the visibility of such tests, that will be a start to getting all of us more aware of the problem and moving toward a long-term solution.
            dsmiley David Smiley added a comment -

            +1 I'm so glad you're looking into this Mark; thank you!

            markrmiller@gmail.com Mark Miller added a comment -

            I'm still building the first report, but here is a partial sample attached. At the moment, the output of my script is a tsv file and I just paste that into Google spreadsheets.

            markrmiller@gmail.com Mark Miller added a comment -

            If there's stuff we can do to raise the visibility of such tests, that will be a start to getting all of us more aware of the problem and moving toward a long-term solution.

            I think it really depends on how much effort we can maintain over time on this, but ideally we would do this:

            Take the first report and file JIRA issues (or find current ones) for all of the tests that are beyond a bit flakey. Push on the author and contributors to those tests to get them solid. Generating reports for just the worst tests is actually pretty fast for a feedback loop. For tests that have fails, I have all the logs to provide to the JIRA issue.

            If we can get out a report with all tests within a certain flakey cutoff range, and if we can regularly generate this report over time, it will be relatively easy to spot new bad tests or tests that re-enter a bad state, and we can file high-priority JIRAs (or reopen JIRAs) and/or ignore them with AwaitsFix annotations.

            Once we get a report under a basic level, we can be harder on tests that creep into a danger zone. It really depends on if we get enough momentum, but I'm willing to give it a try.

            One thing I've tried to do is create a rating for different failure rates, the idea being that we first work on the tests worse than the 'flakey' rating. Once we achieve that, we can be very hard on tests that go above that rating, while also working on hardening the least flakey tests over time.
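
            As a rough illustration of what such a rating could look like (the thresholds and labels below are made up for the example, not the report's actual cutoffs):

            {code:java}
            // Hypothetical flakiness rating derived from a beasting run; the thresholds
            // and labels are illustrative only, not the report's real cutoffs.
            public final class FlakeyRating {

              static String rate(int fails, int runs) {
                double failPercent = 100.0 * fails / runs;
                if (fails == 0)          return "rock-solid";  // no observed fails (not a guarantee)
                if (failPercent <= 5.0)  return "flakey";
                if (failPercent <= 20.0) return "very flakey";
                return "broken";
              }

              public static void main(String[] args) {
                System.out.println(rate(0, 30));   // rock-solid
                System.out.println(rate(1, 30));   // flakey (~3.3%)
                System.out.println(rate(4, 30));   // very flakey (~13.3%)
                System.out.println(rate(12, 30));  // broken (40%)
              }
            }
            {code}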

            markrmiller@gmail.com Mark Miller added a comment -

            How the report is generated:

            I am running each of the tests on a beefy Ubuntu machine, 30 runs per test, 12 in parallel at a time.
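
            (Not the actual report script, which isn't attached here, but a minimal sketch of the mechanics for one test class, assuming the stock ant test -Dtestcase=... entry point; the class and log-file names are made up, and in practice each parallel run needs its own checkout or container so the builds don't collide.)

            {code:java}
            import java.io.File;
            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            // Illustrative only: run one test class RUNS times, PARALLEL at a time,
            // keep a log per run, and print one TSV row with the fail rate.
            public class BeastOne {
              static final int RUNS = 30;
              static final int PARALLEL = 12;

              public static void main(String[] args) throws Exception {
                String testClass = args[0]; // e.g. "ShardSplitTest"
                ExecutorService pool = Executors.newFixedThreadPool(PARALLEL);
                List<Future<Integer>> exits = new ArrayList<>();
                for (int i = 0; i < RUNS; i++) {
                  final int run = i;
                  exits.add(pool.submit(() -> new ProcessBuilder("ant", "test", "-Dtestcase=" + testClass)
                      .redirectErrorStream(true)
                      .redirectOutput(new File(testClass + "-run" + run + ".log")) // logs to attach to JIRAs
                      .start()
                      .waitFor())); // non-zero exit code = failed run
                }
                int fails = 0;
                for (Future<Integer> exit : exits) {
                  if (exit.get() != 0) fails++;
                }
                pool.shutdown();
                // One TSV row per test, ready to paste into a spreadsheet.
                System.out.printf("%s\t%d\t%d\t%.1f%%%n", testClass, RUNS, fails, 100.0 * fails / RUNS);
              }
            }
            {code}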

            markrmiller@gmail.com Mark Miller added a comment -

            If there's stuff we can do to raise the visibility of such tests,

            I'm hoping having this report with all the tests compared will help give us backing to do just this. Right now, it's hard to be objective about which tests are the problem and how hard you should push back on a flakey test. Fails for you, not for me, fails on jenkins but so do many others, etc.

            With this report and the logs of all 30 runs for each test, I think it will be much easier to create high priority JIRA issues for bad tests and pressure contributors to improve them. We can also send an occasional email to the dev list and highlight problem tests.

            Open to other ideas as well of course.

            yseeley@gmail.com Yonik Seeley added a comment -

            We can also send an occasional email to the dev list and highlight problem tests.

            +1 !
            Would be great to eventually automate (e.g. a daily report wouldn't be noise compared to what we have now), but whatever can be done in the meantime will be a huge improvement.

            markrmiller@gmail.com Mark Miller added a comment -

            This will also help address getting our nightly runs to a useful state. I am not currently running the non-nightly tests with 'nightly cranked up' variants, but we do get a report on tests that only run nightly, and tests like that tend to get little to no visibility currently (and my guess is we may find many of them fairly failure prone).


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            I think a hammer approach (and probably an effective one) for now would be to disable all flaky tests. While someone would still need to work on them, the work of getting them resolved would not get in the way of a regular developer trying to figure out the basic questions Mark mentioned.
            markrmiller@gmail.com Mark Miller added a comment - - edited

            I think there is likely too much of a test coverage problem if we take that approach.

            I'd like to instead push gradually, though perhaps 'Apache time' quickly.

            First, I will create critical issues for the worst offenders; if they cannot be fixed pretty much right away, I will BadApple or AwaitsFix them.

            I'll also create critical issues for other fails above a certain threshold and ping appropriate JIRA issues to try and bring attention to them. Over time we can ignore these as well if they are not addressed and someone doesn't find them important enough to keep coverage.

            We can then tighten this net down to a certain level.

            I think if we commit to following through on some progress, we can take an iterative approach that gives people ample time to fix important tests and us time to evaluate loss of important test coverage (even flakey test coverage is very valuable info to us right now, and some flakey tests pass 90%+ of the time - we want to harden them, but they provide critical coverage in many cases).

            I'll also ping the dev list with a summary occasionally to bring attention to this and the current state.

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the first report. There may still be some kinks to work out. I'll summarize the report and add additional commentary later. I'll also send that to the dev list. We can make or surface JIRA issues for any test not solid and prompt fixes or badapple/awaitsfix annotations.

            You can see the attached report or here: https://docs.google.com/spreadsheets/d/1JySta2j2s7A_p16wA1UO-l6c4GsUHBIb4FONS2EzW9k/edit?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            This report covers 876 tests. If any tests share the same name, the results for one would currently be missed. I will have to start using full package names to avoid that.

            markrmiller@gmail.com Mark Miller added a comment -

            One thing that has surprised me is how stable the tests actually are. I ran this on a 16 core machine, but it is also generally 12 of the same test at a time, 12 again, and then 6 at a time. It's a fairly aggressive environment to survive for 30 runs. Given 876 tests, the number that have passed this test is very, very high when you consider the many ignored and nightly tests listed as failed.

            Our jenkins cluster will probably keep finding rarer fails over time, but getting the average developer in good shape looks very tractable.


            jira-bot ASF subversion and git services added a comment -

            Commit 730df22e40cdfb51dd466d44332631fa8fa87f42 in lucene-solr's branch refs/heads/master from markrmiller
            [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=730df22 ]

            SOLR-10032: Ignore tests that run no test methods.

            markrmiller@gmail.com Mark Miller added a comment -

            In some cases, because the machine I have been using has no swap space, RAM usage could be an issue in the first report. I'll be adjusting and reducing load in the next run.

            markrmiller@gmail.com Mark Miller added a comment - - edited

            Here is a second test report for a commit from 2/1.

            A couple fails in the first run had to do with RAM issues, so for the second report I used a lot more RAM and did 10 at a time instead of 12.

            I've been making other small iterative improvements as well.

            https://docs.google.com/spreadsheets/d/1FndoyHmihaOVL2o_Zns5alpNdAJlNsEwQVoJ4XDWj3c/edit?usp=sharing


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            The report looks great, and much needed. However, I was just wondering whether it accounts for all flaky tests. ShardSplitTest.testSplitAfterFailedSplit() fails quite often on my local test runs, and also fails often at Apache/Policeman Jenkins, but you seem to have had no problems with that test ("rock-solid"). Could it be that the test (and potentially other tests like that) fails only on slower computers?
            markrmiller@gmail.com Mark Miller added a comment - - edited

            This report is not made to reproduce all fails.

            Some tests will fail for a variety of reasons. Resources are too low, Java/OS version, which tests they happen to run against, etc, etc.

            So this will not exhaustively produce all flakey tests, nor is it trying to. In fact, I've tried to make sure there are plenty of resources to run the tests reasonably. My goal is to find flakey tests that pop out easily, and not due to very specific conditions. This should target and find obvious problems and then help clamp down on minor flakey tests in time. Jenkins and individual devs will still play an important role in outliers and other, hopefully much less common, fails.

            That said, most things still end up popping out if you beast long enough, in my experience. Beasting for 100 runs would probably surface even more flakey tests. Producing this report with 30 runs is already quite time expensive, though I'll eventually do some longer reports as we whittle down the obvious issues. It's really a judgment call of time vs coverage, and in these early reports 30 seemed like a reasonable condition to pass.

            The other tests are not all cleared, but here is a very reasonable list of tests we should focus on - that even in a good clean env appear to fail too much.

            I will also focus 100 run or more beasting on the tests that this report surfaces as flakey, and likely some tests will enter and drop off the report from one to the next. Those tests will end up needing more extensive individual beasting to pass as 100% clean.

            rock-solid is not really a definitive judgment, just the rating for no fails. If you did a single run and it passed it would be rock-solid. I can change that to something a little less confusing.

            If you do have a specific test that seems to fail for you, I'm happy to beast it more extensively and let you know if fails pop out. I'll try ShardSplitTest. It may be that it has more severe resource problems when it ends up running against some other intensive tests in ant test.

            It does give us some more info when looking at ShardSplitTest - we know it fails fairly often for you and Jenkins, but that in a clean, resource-friendly env, it can pass for 30 runs, 10 running at a time. That gives some clues for hardening that test.

            markrmiller@gmail.com Mark Miller added a comment -

            In other words: I am looking for tests that fail in a medium challenging but resource plentiful scenario first. Once we have a handle on those tests, there are steps we can take to improve hunting for deeper issues.

            markrmiller@gmail.com Mark Miller added a comment -

            but you seemed to have had no problems with that test

            This test did actually fail .07% of the time in the previous report. In future reports I will include the previous fail rate as well.

            markrmiller@gmail.com Mark Miller added a comment - - edited

            For this next report I have switched to an 8-core machine from a 16-core machine. It looks like that may have made some of the more resource/env sensitive tests pop out a little more. The first report was created on a single machine, so I went with 16 cores just to try and generate it as fast as possible. 16 cores were not strictly needed; I run 10 tests at a time on my 6-core machine with similar results. It may even be a little too much CPU for our use case, even when running 10 instances of the test in parallel.

            I have moved on from just using one machine though. It actually took 2-3 days to generate the first report, as I was still working out some speed issues. The first run had roughly 2 minutes and 40 seconds of 'build' overhead per test run for most of the report, and just barely enough RAM to handle 10 tests at a time - for a few test fails on heavy tests (e.g. HDFS), there was not enough RAM, as there is also no swap space on those machines. Anyway, beasting ~900 tests is time consuming even in the best case.

            Two tests also hung, and that slowed things up a bit. Now I am more on the lookout for that - I've @BadAppled a test method involved in producing one of the hangs, and for this report I locally @BadAppled the other. They both look like legit bugs to me. I should have used @Ignore for the second hang, since the test report still runs @BadApple and @AwaitsFix tests. Losing one machine for a long time when you are using 10 costs you a lot in report creation time. Now I at least know to pay attention to my email while running reports. Luckily, the instances I'm using will auto-pause after 30 minutes of no real activity and I get an email, so now I can be a bit more vigilant while creating the report. It also helps that I've gotten down to about 4 hours to create the report.

            I used 5 16-core machines for the second report. I can't recall about how long that took, but it was still in the realm of an all night job.

            For this third report I am using 10 8-core machines.

            I think we should be using those annotations like this:

            • @AwaitsFix - we basically know something key is broken and it's fairly clear what the issue is - we are waiting for someone to fix it - you don't expect this to be run regularly, but you can just pass a system property to run them.
            • @BadApple - test is too flakey, fails too much for unknown or varied reasons - you do expect that some test runs would still or could still include these tests and give some useful coverage information - flakiness in many more integration type tests can be the result of unrelated issues and clear up over time. Or get worse.
            • @Ignore - test is never run, it can hang, OOM, or does something negative to other tests.
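
            As a rough Java sketch of that split (the test class, method names, and bug URLs below are hypothetical; @BadApple and @AwaitsFix come from the Lucene test framework, @Ignore from JUnit):

            {code:java}
            import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
            import org.apache.lucene.util.LuceneTestCase.BadApple;
            import org.apache.solr.SolrTestCaseJ4;
            import org.junit.Ignore;

            // Hypothetical test class illustrating how the three annotations divide up.
            public class SomeFlakeyTest extends SolrTestCaseJ4 {

              @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-XXXXX")
              public void testKnownBrokenThing() {
                // We know what's wrong; skipped until someone fixes it.
                // Can still be enabled explicitly via a system property (-Dtests.awaitsfix=true).
              }

              @BadApple(bugUrl = "https://issues.apache.org/jira/browse/SOLR-XXXXX")
              public void testTooFlakeyForNow() {
                // Fails too much for unknown or varied reasons; some runs may still
                // include these (-Dtests.badapples=true) for the extra coverage signal.
              }

              @Ignore("Hangs/OOMs or interferes with other tests")
              public void testNeverRun() {
                // Never run at all.
              }
            }
            {code}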

            I'll put up another report soon. I probably won't do another one until I have tackled the above flakey rating issues, hoping that's just a couple to a few weeks at most, but may be wishful.

            markrmiller@gmail.com Mark Miller added a comment -

            I'd also been wondering why the test framework didn't bail on those hangs like I've seen it do many times with ant test. Finally dug it up and found the default when you run a single test is no timeout. I'll add an appropriate timeout.
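
            (The exact knob will depend on how the build invokes the runner, but for illustration, the randomizedtesting framework the tests already use supports a suite-level timeout annotation; the class below is hypothetical.)

            {code:java}
            import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
            import org.apache.lucene.util.TimeUnits;
            import org.apache.solr.SolrTestCaseJ4;

            // Hypothetical example: cap the whole suite so a hung run fails
            // instead of silently stalling a beasting machine for hours.
            @TimeoutSuite(millis = 30 * TimeUnits.MINUTE)
            public class SomeBeastedTest extends SolrTestCaseJ4 {
              // test methods...
            }
            {code}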

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the latest report, run against a commit from yesterday.

            I now include up to 3 of the last fail percentage results from previous reports.

            https://docs.google.com/spreadsheets/d/1N6RxH4Edd7ldRIaVfin0si-uSLGyowQi8-7mcux27S0/edit?usp=sharing


            arafalov Alexandre Rafalovitch added a comment -

            Just thinking out loud, I wonder if this beasting could be tied into something that instruments the code to automatically dump the stack and all related variables on failure.

            Maybe with something like http://byteman.jboss.org/ or - for commercial options - https://www.overops.com/ or http://chrononsystems.com/

            markrmiller@gmail.com Mark Miller added a comment -

            I suppose it depends on the cost of integrating and maintaining it vs how much the information it provides is consumed or useful. I think of things like Clover, which a lot of effort and time was spent on in the past, but who is looking at Clover coverage? I'd still prefer to have test coverage reports, but again, usually I'll just generate those on demand. Often I won't even look at the logs for beast fails initially. I will move to the latest code, beast out the latest fails with the latest code, etc. Usually beast fails are very replicable (beast longer!), and running 8-12 in parallel allows you to blast out a few hundred runs on a test in no time (one test, no time; 900 tests, not so much). For example, a couple of tests hung in my first report and a couple had RAM issues. I didn't dig into the data much first though; I just reproduced with YourKit attached to see what was up and addressed it.


            tflobbe Tomas Eduardo Fernandez Lobbe added a comment -

            Interesting that PeerSyncReplicationTest is rock-solid for you; it fails quite often for me (and seemingly also on Jenkins).
            markrmiller@gmail.com Mark Miller added a comment -

            It's what I mention above. Surviving 30 runs is an arbitrary initial bar. If that test has low-resource env issues, fails will pop up over time, and longer beast runs will draw out its true stability.

            This will not match Jenkins or everyone's local environments, but that is not really the point either. I promise that if a test is truly flakey, I will draw it out over time. These other tests fail often even in good conditions and short beasting, and will need attention first.

            A lot of these tests that pass 30 runs will often still have 1-out-of-60, 1-out-of-100, or 1-out-of-200 fails that can be much more common under certain conditions.

            markrmiller@gmail.com Mark Miller added a comment -

            Back to my point that devs and Jenkins will play a very important role in testing regardless of the progress made on this front.

            Please keep reporting tests that you know fail and are not yet popping out in these reports. Those are two I will make sure have JIRA issues, and I can quickly and easily beast them a few hundred times and report the results.

            As tests are flagged, special attention can make fails much more likely. Simply adding many more iterations for a test is the big hammer, but you can also look at running even more in parallel or creating other fake load. Some tests don't create very aggressive competition with themselves even at 10 in parallel.

            I have plenty of experience pulling out failures from single tests, plenty of knobs to twist, this is just about getting a list of those tests to focus on and naming names.

            So if your favorite flakey test has not made the report yet, please report it here and I'll give it some special attention.

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the latest report. I now include the average fail % of all tracked report runs and started tagging tests with JIRA ids.

            2/14/2017 https://docs.google.com/spreadsheets/d/1eZ9_ds_0XyqsKKp8xkmESrcMZRP85jTxSKkNwgtcUn0/edit?usp=sharing

            (PDF attached to issue)

            erickerickson Erick Erickson added a comment -

            Mark:

            First of all, really appreciate the work you're doing here. I hope to address some of the tests I'm familiar with, honest. Real Soon Now.

            Would it be convenient for you to publicly share the folder these reports go into so I could bookmark that?

            Erick

            markrmiller@gmail.com Mark Miller added a comment -

            Would it be convenient for you to publicly share the folder these reports go into so I could bookmark that?

            https://drive.google.com/drive/folders/0ByYyjsrbz7-qa2dOaU1UZDdRVzg?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            Here is a special 3 day weekend edition, 100 iterations, 12 at a time instead of 10, trying to draw out more tests: https://docs.google.com/spreadsheets/d/1LEPvXbsoHtKfIcZCJZ3_P6OHp7S5g2HP2OJgU6B2sAg/edit?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            Here are some notes and color for the last test report: https://medium.com/@heismark/solr-test-report-results-notes-02-17-2017-3fd66db1adeb#.gity816cj

            markrmiller@gmail.com Mark Miller added a comment -

            I've got a few issues to look into, but when I'm done I'll run a new report and work on removing any tests failing over 5% of the time.

            elyograg Shawn Heisey added a comment -

            If graphs can be generated showing the time it takes for test runs (excluding those that take two hours and hit the timeout), it could serve as a minimal benchmark. Perhaps SOLR-10130 might not have slipped through the cracks. A full-fledged benchmark (like the one McCandless does for Lucene) would be best, but graphing times for test runs could serve as a stopgap.


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            work on removing any tests failing over 5% of the time.

            +1. This would result in a much more stable build, and might force us all to start tackling the important disabled tests more seriously.

            markrmiller@gmail.com Mark Miller added a comment -

            I've recorded a talk on this topic that also demos creating a test report a bit: https://youtu.be/A2VXU-JVoGY

            markrmiller@gmail.com Mark Miller added a comment -

            I'm a little behind on my latest run because I first want to look into a test or two that has been hanging recently (a hang that the test suite timeout can't kill can make a test run take much, much longer unless I notice it fast, and I'm not usually monitoring for hours at a time). I'll generate a new report by Monday though.

            markrmiller@gmail.com Mark Miller added a comment -

            I actually bailed on spending more time with report generation for a bit. The process was still a bit too cumbersome - it really needs to be one command to kick off and then everything else is automatic so that it can be run regularly with ease. I finally came up with a plan to automate from test running to table generation and publishing, so I have spent the time instead finishing up full automation. I still have a bit to do, but I'll be done soon and then will run a report on a regular schedule.

            I've also spent a little time planning out using Docker for parallel test runs instead of the built in project support. This will enable the possibility of supporting other projects and there are some nice benefits in truly isolating parallel test runs.

            markrmiller@gmail.com Mark Miller added a comment -

            As part of fully automating everything, I've been experimenting with sending results to BitBalloon rather than cutting and pasting into Google sheets. It's dead simple to automate and a nicer experience, I think. Here is a demo of the results: http://solr-tests.bitballoon.com

            erickerickson Erick Erickson added a comment -

            Sweet!

            anshum Anshum Gupta added a comment -

            This is looking really good!

            markrmiller@gmail.com Mark Miller added a comment -

            Okay, I'm pretty much ready here. I have a generic project called BeastIt that can generate a report for pretty much anything that runs on linux, has test files and allows launching a single test via command line. I'm going to resolve this issue once I post a new report and then I'll start regularly generating test reports now that it's just a command line call.

            anshum Anshum Gupta added a comment -

            markrmiller@gmail.com do you intend to open source BeastIt?

            markrmiller@gmail.com Mark Miller added a comment -

            Yeah, I'll move it to a public repo at some point.

            First report is done, they will show up here: http://apache-solr.bitballoon.com/

            One thing to note is that SharedFSAutoReplicaFailoverTest looks broken. That may be the same on the 7.0 release branch. One of the things I'm looking forward to with this is some more visibility on nightlies - too easy to break them and not care or notice as it is.


            ctargett Cassandra Targett added a comment -

            markrmiller@gmail.com - This is really great, thank you.

            I have one question, out of curiosity: Why do a few tests show up as failing more than 100% of the time?

            markrmiller@gmail.com Mark Miller added a comment -

            Those are special codes to order and identify tests with annotations.

            So if a test is ignored, it's not run at all and gets a 122 or whatever. If it's @BadApple and fails 100%, it gets a 112; if it's @AwaitsFix and fails 100%, it gets a 113. So those over-100% values are basically expected. If a test shows exactly 100%, it's a test that won't pass and doesn't have one of these annotations, so that's really bad.
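
            A small sketch of that scheme (the sentinel values follow the loose description above and have not been checked against the actual report script):

            {code:java}
            public final class ReportCodes {
              // Sentinel values above 100 sort annotated/ignored tests away from real failures.
              static int reportValue(int failPercent, boolean badApple, boolean awaitsFix, boolean ignored) {
                if (ignored) return 122;                          // @Ignore: never run at all
                if (failPercent == 100 && badApple) return 112;   // @BadApple failing 100%: expected
                if (failPercent == 100 && awaitsFix) return 113;  // @AwaitsFix failing 100%: expected
                return failPercent;                               // otherwise, the plain fail percentage
              }
            }
            {code}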

            erickerickson Erick Erickson added a comment -

            This is excellent, thanks for all your hard work here!

            markrmiller@gmail.com Mark Miller added a comment -

            Thanks! It actually ended up being a ton of work. It wasn't so bad just to stitch something together for Solr with me to fill in gaps, but to make it generic for any project using docker, to allow it to have tests (docker within docker!), to allow you to point it at 10 freshly provisioned machines with no setup on your part, to make it easy to debug and add new project support easily, etc, was actually many, many, many hours of effort. Still some polish and minor things to do, but very happy it's ready to start pushing out reports regularly now.


            shalin Shalin Shekhar Mangar added a comment -

            Hi markrmiller@gmail.com, the last report was in April. Is this still something you intend to publish, and/or do you intend to publish the code?

            markrmiller@gmail.com Mark Miller added a comment -

            I don't own the code and had not had a chance to open source it before changing jobs, so it remains to be seen what will happen.


            People

              Assignee: markrmiller@gmail.com Mark Miller
              Reporter: markrmiller@gmail.com Mark Miller
              Votes: 2
              Watchers: 19
