  Solr / SOLR-10032

Create report to assess Solr test quality at a commit point.

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 8.0
    • Component/s: Tests
    • Labels: None

    Description

      We have many Jenkins instances blasting tests: some official, some Policeman's, and some that I and others have or have had ourselves. The email trail proves the power of the Jenkins cluster to find test fails.

      However, I still have a very hard time with some basic questions:

      What tests are flakey right now? Which test fails actually affect devs most? Did I break it? Was that test already flakey? Is that test still flakey? What are our worst tests right now? Is that test getting better or worse?

      We really need a way to see exactly which tests are the problem, not because of OS or environmental issues, but because of more basic test quality issues: which tests are flakey, and how flakey they are, at any point in time.

      Reports:
      https://drive.google.com/drive/folders/0ByYyjsrbz7-qa2dOaU1UZDdRVzg?usp=sharing

      01/24/2017 - https://docs.google.com/spreadsheets/d/1JySta2j2s7A_p16wA1UO-l6c4GsUHBIb4FONS2EzW9k/edit?usp=sharing
      02/01/2017 - https://docs.google.com/spreadsheets/d/1FndoyHmihaOVL2o_Zns5alpNdAJlNsEwQVoJ4XDWj3c/edit?usp=sharing
      02/08/2017 - https://docs.google.com/spreadsheets/d/1N6RxH4Edd7ldRIaVfin0si-uSLGyowQi8-7mcux27S0/edit?usp=sharing
      02/14/2017 - https://docs.google.com/spreadsheets/d/1eZ9_ds_0XyqsKKp8xkmESrcMZRP85jTxSKkNwgtcUn0/edit?usp=sharing
      02/17/2017 - https://docs.google.com/spreadsheets/d/1LEPvXbsoHtKfIcZCJZ3_P6OHp7S5g2HP2OJgU6B2sAg/edit?usp=sharing

          Activity

            markrmiller@gmail.com Mark Miller added a comment -

            I'll create sub issues off this JIRA to address the worst tests I find.

            ctargett Cassandra Targett added a comment -

            +1. Flakey test clean up is an arduous and unforgiving task, but it's hard to overstate its importance.

            Do you have ideas on how we might track this beyond trying to fix the biggest offenders? If there's stuff we can do to raise the visibility of such tests, that will be a start to getting all of us more aware of the problem and moving toward a long-term solution.
            dsmiley David Smiley added a comment -

            +1 I'm so glad you're looking into this Mark; thank you!

            markrmiller@gmail.com Mark Miller added a comment -

            I'm still building the first report, but here is a partial sample attached. At the moment, the output of my script is a tsv file and I just paste that into Google spreadsheets.

            markrmiller@gmail.com Mark Miller added a comment -

            If there's stuff we can do to raise the visibility of such tests, that will be a start to getting all of us more aware of the problem and moving toward a long-term solution.

            I think it really depends on how much effort we can maintain over time on this, but ideally we would do this:

            Take the first report and file JIRA issues (or find current ones) for all of the tests that are beyond a bit flakey. Push on the author and contributors to those tests to get them solid. Generating reports for just the worst tests is actually pretty fast for a feedback loop. For tests that have fails, I have all the logs to provide to the JIRA issue.

            If we can get out a report with all tests within a certain flakey cutoff range, and if we can regularly generate this report over time, it will be relatively easy to spot new bad tests or tests that re-enter a bad state, and we can file high-priority JIRAs (or reopen JIRAs) and/or ignore them with AwaitsFix annotations.

            Once we get a report under a basic level, we can be harder on tests that creep into a danger zone. It really depends on if we get enough momentum, but I'm willing to give it a try.

            One thing I've tried to do is create a rating for different failure rates, the idea being that we first work on the tests worse than the 'flakey' rating. Once we achieve that, we can be very hard on tests that go above that rating, while also working on hardening the least flakey tests over time.
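
            As a rough illustration of what such a rating could look like (the thresholds and labels below are made up for the example, not the report's actual cutoffs):

            {code:java}
            // Hypothetical flakiness rating derived from a beasting run; the thresholds
            // and labels are illustrative only, not the report's real cutoffs.
            public final class FlakeyRating {

              static String rate(int fails, int runs) {
                double failPercent = 100.0 * fails / runs;
                if (fails == 0)          return "rock-solid";  // no observed fails (not a guarantee)
                if (failPercent <= 5.0)  return "flakey";
                if (failPercent <= 20.0) return "very flakey";
                return "broken";
              }

              public static void main(String[] args) {
                System.out.println(rate(0, 30));   // rock-solid
                System.out.println(rate(1, 30));   // flakey (~3.3%)
                System.out.println(rate(4, 30));   // very flakey (~13.3%)
                System.out.println(rate(12, 30));  // broken (40%)
              }
            }
            {code}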

            markrmiller@gmail.com Mark Miller added a comment -

            How the report is generated:

            I am running each of the tests on a beefy Ubuntu machine, 30 runs per test, 12 in parallel at a time.
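
            (Not the actual report script, which isn't attached here, but a minimal sketch of the mechanics for one test class, assuming the stock ant test -Dtestcase=... entry point; the class and log-file names are made up, and in practice each parallel run needs its own checkout or container so the builds don't collide.)

            {code:java}
            import java.io.File;
            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            // Illustrative only: run one test class RUNS times, PARALLEL at a time,
            // keep a log per run, and print one TSV row with the fail rate.
            public class BeastOne {
              static final int RUNS = 30;
              static final int PARALLEL = 12;

              public static void main(String[] args) throws Exception {
                String testClass = args[0]; // e.g. "ShardSplitTest"
                ExecutorService pool = Executors.newFixedThreadPool(PARALLEL);
                List<Future<Integer>> exits = new ArrayList<>();
                for (int i = 0; i < RUNS; i++) {
                  final int run = i;
                  exits.add(pool.submit(() -> new ProcessBuilder("ant", "test", "-Dtestcase=" + testClass)
                      .redirectErrorStream(true)
                      .redirectOutput(new File(testClass + "-run" + run + ".log")) // logs to attach to JIRAs
                      .start()
                      .waitFor())); // non-zero exit code = failed run
                }
                int fails = 0;
                for (Future<Integer> exit : exits) {
                  if (exit.get() != 0) fails++;
                }
                pool.shutdown();
                // One TSV row per test, ready to paste into a spreadsheet.
                System.out.printf("%s\t%d\t%d\t%.1f%%%n", testClass, RUNS, fails, 100.0 * fails / RUNS);
              }
            }
            {code}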

            markrmiller@gmail.com Mark Miller added a comment -

            If there's stuff we can do to raise the visibility of such tests,

            I'm hoping having this report with all the tests compared will help give us backing to do just this. Right now, it's hard to be objective about which tests are the problem and how hard you should push back on a flakey test. Fails for you, not for me, fails on jenkins but so do many others, etc.

            With this report and the logs of all 30 runs for each test, I think it will be much easier to create high priority JIRA issues for bad tests and pressure contributors to improve them. We can also send an occasional email to the dev list and highlight problem tests.

            Open to other ideas as well of course.

            yseeley@gmail.com Yonik Seeley added a comment -

            We can also send an occasional email to the dev list and highlight problem tests.

            +1 !
            Would be great to eventually automate (e.g. a daily report wouldn't be noise compared to what we have now), but whatever can be done in the meantime will be a huge improvement.

            markrmiller@gmail.com Mark Miller added a comment -

            This will also help address getting our nightly runs to a useful state. I am not currently running the non-nightly tests with 'nightly cranked up' variants, but we do get a report on tests that only run nightly, and tests like that tend to get little to no visibility currently (and my guess is we may find many of them fairly failure prone).


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            I think a hammer approach (and probably an effective one) for now would be to disable all flaky tests. While someone would still need to work on them, the work of getting them resolved would not get in the way of a regular developer trying to figure out the basic questions Mark mentioned.
            markrmiller@gmail.com Mark Miller added a comment - - edited

            I think there is likely too much of a test coverage problem if we take that approach.

            I'd like to instead push gradually, though perhaps 'Apache time' quickly.

            First, I will create critical issues for the worst offenders; if they cannot be fixed pretty much right away, I will BadApple or AwaitsFix them.

            I'll also create critical issues for other fails above a certain threshold and ping appropriate JIRA issues to try and bring attention to them. Over time we can ignore these as well if they are not addressed and someone doesn't find them important enough to keep coverage.

            We can then tighten this net down to a certain level.

            I think if we commit to following through on some progress, we can take an iterative approach that gives people ample time to fix important tests and us time to evaluate loss of important test coverage (even flakey test coverage is very valuable info to us right now, and some flakey tests pass 90%+ of the time - we want to harden them, but they provide critical coverage in many cases).

            I'll also ping the dev list with a summary occasionally to bring attention to this and the current state.

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the first report. There may still be some kinks to work out. I'll summarize the report and add additional commentary later. I'll also send that to the dev list. We can make or surface JIRA issues for any test not solid and prompt fixes or badapple/awaitsfix annotations.

            You can see the attached report or here: https://docs.google.com/spreadsheets/d/1JySta2j2s7A_p16wA1UO-l6c4GsUHBIb4FONS2EzW9k/edit?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            This report covers 876 tests. If any tests share the same name, the results for one would currently be missed. I will have to start using full package names to avoid that.

            markrmiller@gmail.com Mark Miller added a comment -

            One thing that has surprised me is how stable the tests actually are. I ran this on a 16 core machine, but it is also generally 12 of the same test at a time, 12 again, and then 6 at a time. It's a fairly aggressive environment to survive for 30 runs. Given 876 tests, the number that have passed this test is very, very high when you consider the many ignored and nightly tests listed as failed.

            Our jenkins cluster will probably keep finding rarer fails over time, but getting the average developer in good shape looks very tractable.


            jira-bot ASF subversion and git services added a comment -

            Commit 730df22e40cdfb51dd466d44332631fa8fa87f42 in lucene-solr's branch refs/heads/master from markrmiller
            [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=730df22 ]

            SOLR-10032: Ignore tests that run no test methods.

            markrmiller@gmail.com Mark Miller added a comment -

            In some cases, because the machine I have been using has no swap space, RAM usage could be an issue in the first report. I'll be adjusting and reducing load in the next run.

            markrmiller@gmail.com Mark Miller added a comment - - edited

            Here is a second test report for a commit from 2/1.

            A couple fails in the first run had to do with RAM issues, so for the second report I used a lot more RAM and did 10 at a time instead of 12.

            I've been making other small iterative improvements as well.

            https://docs.google.com/spreadsheets/d/1FndoyHmihaOVL2o_Zns5alpNdAJlNsEwQVoJ4XDWj3c/edit?usp=sharing


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            The report looks great, and much needed. However, I was just wondering whether it accounts for all flaky tests. ShardSplitTest.testSplitAfterFailedSplit() fails quite often on my local test runs, and also fails often at Apache/Policeman Jenkins, but you seem to have had no problems with that test ("rock-solid"). Could it be that the test (and potentially other tests like that) fails only on slower computers?
            markrmiller@gmail.com Mark Miller added a comment - - edited

            This report is not made to reproduce all fails.

            Some tests will fail for a variety of reasons. Resources are too low, Java/OS version, which tests they happen to run against, etc, etc.

            So this will not exhaustively produce all flakey tests, nor is it trying to. In fact, I've tried to make sure there are plenty of resources to run the tests reasonably. My goal is to find flakey tests that pop out easily, and not due to very specific conditions. This should target and find obvious problems and then help clamp down on minor flakey tests in time. Jenkins and individual devs will still play an important role in outliers and other, hopefully much less common, fails.

            That said, most things still end up popping out if you beast long enough, in my experience. Beasting for 100 runs would probably surface even more flakey tests. Producing this report with 30 runs is already quite time expensive, though I'll eventually do some longer reports as we whittle down the obvious issues. It's really a judgment call of time vs coverage, and in these early reports 30 seemed like a reasonable condition to pass.

            The other tests are not all cleared, but here is a very reasonable list of tests we should focus on - that even in a good clean env appear to fail too much.

            I will also focus 100 run or more beasting on the tests that this report surfaces as flakey, and likely some tests will enter and drop off the report from one to the next. Those tests will end up needing more extensive individual beasting to pass as 100% clean.

            rock-solid is not really a definitive judgment, just the rating for no fails. If you did a single run and it passed it would be rock-solid. I can change that to something a little less confusing.

            If you do have a specific test that seems to fail for you, I'm happy to beast it more extensively and let you know if fails pop out. I'll try ShardSplitTest. It may be that it has more severe resource problems when it ends up running against some other intensive tests in ant test.

            It does give us some more info when looking at ShardSplitTest - we know it fails fairly often for you and Jenkins, but that in a clean, resource-friendly env, it can pass for 30 runs, 10 running at a time. That gives some clues for hardening that test.

            markrmiller@gmail.com Mark Miller added a comment -

            In other words: I am looking for tests that fail in a medium challenging but resource plentiful scenario first. Once we have a handle on those tests, there are steps we can take to improve hunting for deeper issues.

            markrmiller@gmail.com Mark Miller added a comment -

            but you seemed to have had no problems with that test

            This test did actually fail .07% of the time in the previous report. In future reports I will include the previous fail rate as well.

            markrmiller@gmail.com Mark Miller added a comment - - edited

            For this next report I have switched to an 8-core machine from a 16-core machine. It looks like that may have made some of the more resource/env sensitive tests pop out a little more. The first report was created on a single machine, so I went with 16 cores just to try and generate it as fast as possible. 16 cores were not strictly needed; I run 10 tests at a time on my 6-core machine with similar results. It may even be a little too much CPU for our use case, even when running 10 instances of the test in parallel.

            I have moved on from just using one machine though. It actually took 2-3 days to generate the first report, as I was still working out some speed issues. The first run had roughly 2 minutes and 40 seconds of 'build' overhead per test run for most of the report, and just barely enough RAM to handle 10 tests at a time - for a few test fails on heavy tests (e.g. HDFS), there was not enough RAM, as there is also no swap space on those machines. Anyway, beasting ~900 tests is time consuming even in the best case.

            Two tests also hung, and that slowed things up a bit. Now I am more on the lookout for that - I've @BadAppled a test method involved in producing one of the hangs, and for this report I locally @BadAppled the other. They both look like legit bugs to me. I should have used @Ignore for the second hang, since the test report still runs @BadApple and @AwaitsFix tests. Losing one machine for a long time when you are using 10 costs you a lot in report creation time. Now I at least know to pay attention to my email while running reports. Luckily, the instances I'm using will auto-pause after 30 minutes of no real activity and I get an email, so now I can be a bit more vigilant while creating the report. It also helps that I've gotten down to about 4 hours to create the report.

            I used 5 16-core machines for the second report. I can't recall about how long that took, but it was still in the realm of an all night job.

            For this third report I am using 10 8-core machines.

            I think we should be using those annotations like this:

            • @AwaitsFix - we basically know something key is broken and it's fairly clear what the issue is - we are waiting for someone to fix it - you don't expect this to be run regularly, but you can just pass a system property to run them.
            • @BadApple - test is too flakey, fails too much for unknown or varied reasons - you do expect that some test runs would still or could still include these tests and give some useful coverage information - flakiness in many more integration type tests can be the result of unrelated issues and clear up over time. Or get worse.
            • @Ignore - test is never run, it can hang, OOM, or does something negative to other tests.
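
            As a rough Java sketch of that split (the test class, method names, and bug URLs below are hypothetical; @BadApple and @AwaitsFix come from the Lucene test framework, @Ignore from JUnit):

            {code:java}
            import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
            import org.apache.lucene.util.LuceneTestCase.BadApple;
            import org.apache.solr.SolrTestCaseJ4;
            import org.junit.Ignore;

            // Hypothetical test class illustrating how the three annotations divide up.
            public class SomeFlakeyTest extends SolrTestCaseJ4 {

              @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-XXXXX")
              public void testKnownBrokenThing() {
                // We know what's wrong; skipped until someone fixes it.
                // Can still be enabled explicitly via a system property (-Dtests.awaitsfix=true).
              }

              @BadApple(bugUrl = "https://issues.apache.org/jira/browse/SOLR-XXXXX")
              public void testTooFlakeyForNow() {
                // Fails too much for unknown or varied reasons; some runs may still
                // include these (-Dtests.badapples=true) for the extra coverage signal.
              }

              @Ignore("Hangs/OOMs or interferes with other tests")
              public void testNeverRun() {
                // Never run at all.
              }
            }
            {code}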

            I'll put up another report soon. I probably won't do another one until I have tackled the above flakey rating issues, hoping that's just a couple to a few weeks at most, but may be wishful.

            markrmiller@gmail.com Mark Miller added a comment -

            I'd also been wondering why the test framework didn't bail on those hangs like I've seen it do many times with ant test. Finally dug it up and found the default when you run a single test is no timeout. I'll add an appropriate timeout.
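
            (The exact knob will depend on how the build invokes the runner, but for illustration, the randomizedtesting framework the tests already use supports a suite-level timeout annotation; the class below is hypothetical.)

            {code:java}
            import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
            import org.apache.lucene.util.TimeUnits;
            import org.apache.solr.SolrTestCaseJ4;

            // Hypothetical example: cap the whole suite so a hung run fails
            // instead of silently stalling a beasting machine for hours.
            @TimeoutSuite(millis = 30 * TimeUnits.MINUTE)
            public class SomeBeastedTest extends SolrTestCaseJ4 {
              // test methods...
            }
            {code}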

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the latest report, run against a commit from yesterday.

            I now include up to 3 of the last fail percentage results from previous reports.

            https://docs.google.com/spreadsheets/d/1N6RxH4Edd7ldRIaVfin0si-uSLGyowQi8-7mcux27S0/edit?usp=sharing


            arafalov Alexandre Rafalovitch added a comment -

            Just thinking out loud, I wonder if this beasting could be tied into something that instruments the code to automatically dump the stack and all related variables on failure.

            Maybe with something like http://byteman.jboss.org/ or - for commercial options - https://www.overops.com/ or http://chrononsystems.com/

            markrmiller@gmail.com Mark Miller added a comment -

            I suppose it depends on the cost of integrating and maintaining it vs how much the information it provides is consumed or useful. I think of things like Clover, which a lot of effort and time was spent on in the past, but who is looking at Clover coverage? I'd still prefer to have test coverage reports, but again, usually I'll just generate those on demand. Often I won't even look at the logs for beast fails initially. I will move to the latest code, beast out the latest fails with the latest code, etc. Usually beast fails are very replicable (beast longer!), and running 8-12 in parallel allows you to blast out a few hundred runs on a test in no time (one test, no time; 900 tests, not so much). For example, a couple of tests hung in my first report and a couple had RAM issues. I didn't dig into the data much first though; I just reproduced with YourKit attached to see what was up and addressed it.


            tflobbe Tomas Eduardo Fernandez Lobbe added a comment -

            Interesting that PeerSyncReplicationTest is rock-solid for you; it fails quite often for me (and seemingly also on Jenkins).
            markrmiller@gmail.com Mark Miller added a comment -

            It's what I mention above. Surviving 30 runs is an arbitrary initial bar. If that test has low-resource env issues, fails will pop up over time, and longer beast runs will draw out its true stability.

            This will not match Jenkins or everyone's local environments, but that is not really the point either. I promise that if a test is truly flakey, I will draw it out over time. These other tests fail often even in good conditions and short beasting, and will need attention first.

            A lot of these tests that pass 30 runs will often still have 1-out-of-60, 1-out-of-100, or 1-out-of-200 fails that can be much more common under certain conditions.

            markrmiller@gmail.com Mark Miller added a comment -

            Back to my point that devs and Jenkins will play a very important role in testing regardless of the progress made on this front.

            Please keep reporting tests that you know fail and are not yet popping out in these reports. Those are two I will make sure have JIRA issues, and I can quickly and easily beast them a few hundred times and report the results.

            As tests are flagged, special attention can make fails much more likely. Simply adding many more iterations for a test is the big hammer, but you can also look at running even more in parallel or creating other fake load. Some tests don't create very aggressive competition with themselves even at 10 in parallel.

            I have plenty of experience pulling out failures from single tests, plenty of knobs to twist, this is just about getting a list of those tests to focus on and naming names.

            So if your favorite flakey test has not made the report yet, please report it here and I'll give it some special attention.

            markrmiller@gmail.com Mark Miller added a comment -

            Here is the latest report. I now include the average fail % of all tracked report runs and started tagging tests with JIRA ids.

            2/14/2017 https://docs.google.com/spreadsheets/d/1eZ9_ds_0XyqsKKp8xkmESrcMZRP85jTxSKkNwgtcUn0/edit?usp=sharing

            (PDF attached to issue)

            erickerickson Erick Erickson added a comment -

            Mark:

            First of all, really appreciate the work you're doing here. I hope to address some of the tests I'm familiar with, honest. Real Soon Now.

            Would it be convenient for you to publicly share the folder these reports go into so I could bookmark that?

            Erick

            markrmiller@gmail.com Mark Miller added a comment -

            Would it be convenient for you to publicly share the folder these reports go into so I could bookmark that?

            https://drive.google.com/drive/folders/0ByYyjsrbz7-qa2dOaU1UZDdRVzg?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            Here is a special 3 day weekend edition, 100 iterations, 12 at a time instead of 10, trying to draw out more tests: https://docs.google.com/spreadsheets/d/1LEPvXbsoHtKfIcZCJZ3_P6OHp7S5g2HP2OJgU6B2sAg/edit?usp=sharing

            markrmiller@gmail.com Mark Miller added a comment -

            Here are some notes and color for the last test report: https://medium.com/@heismark/solr-test-report-results-notes-02-17-2017-3fd66db1adeb#.gity816cj

            markrmiller@gmail.com Mark Miller added a comment -

            I've got a few issues to look into, but when I'm done I'll run a new report and work on removing any tests failing over 5% of the time.

            elyograg Shawn Heisey added a comment -

            If graphs can be generated showing the time it takes for test runs (excluding those that take two hours and hit the timeout), it could serve as a minimal benchmark. Perhaps SOLR-10130 might not have slipped through the cracks. A full-fledged benchmark (like the one McCandless does for Lucene) would be best, but graphing times for test runs could serve as a stopgap.


            ichattopadhyaya Ishan Chattopadhyaya added a comment -

            work on removing any tests failing over 5% of the time.

            +1. This would result in a much more stable build, and might force us all to start tackling the important disabled tests more seriously.

            markrmiller@gmail.com Mark Miller added a comment -

            I've recorded a talk on this topic that also demos creating a test report a bit: https://youtu.be/A2VXU-JVoGY

            markrmiller@gmail.com Mark Miller added a comment -

            I'm a little behind on my latest run because I first want to look into a test or two that has been hanging recently (a hang that the test suite timeout can't kill can make a test run take much, much longer unless I notice it fast, and I'm not usually monitoring for hours at a time). I'll generate a new report by Monday though.

            markrmiller@gmail.com Mark Miller added a comment -

            I actually bailed on spending more time with report generation for a bit. The process was still a bit too cumbersome - it really needs to be one command to kick off and then everything else is automatic so that it can be run regularly with ease. I finally came up with a plan to automate from test running to table generation and publishing, so I have spent the time instead finishing up full automation. I still have a bit to do, but I'll be done soon and then will run a report on a regular schedule.

            I've also spent a little time planning out using Docker for parallel test runs instead of the built in project support. This will enable the possibility of supporting other projects and there are some nice benefits in truly isolating parallel test runs.

            markrmiller@gmail.com Mark Miller added a comment -

            As part of fully automating everything, I've been experimenting with sending results to BitBalloon rather than cutting and pasting into Google sheets. It's dead simple to automate and a nicer experience, I think. Here is a demo of the results: http://solr-tests.bitballoon.com

            erickerickson Erick Erickson added a comment -

            Sweet!

            anshum Anshum Gupta added a comment -

            This is looking really good!

            markrmiller@gmail.com Mark Miller added a comment -

            Okay, I'm pretty much ready here. I have a generic project called BeastIt that can generate a report for pretty much anything that runs on linux, has test files and allows launching a single test via command line. I'm going to resolve this issue once I post a new report and then I'll start regularly generating test reports now that it's just a command line call.

            anshum Anshum Gupta added a comment -

            markrmiller@gmail.com do you intend to open source BeastIt?

            markrmiller@gmail.com Mark Miller added a comment -

            Yeah, I'll move it to a public repo at some point.

            First report is done, they will show up here: http://apache-solr.bitballoon.com/

            One thing to note is that SharedFSAutoReplicaFailoverTest looks broken. That may be the same on the 7.0 release branch. One of the things I'm looking forward to with this is some more visibility on nightlies - too easy to break them and not care or notice as it is.


            ctargett Cassandra Targett added a comment -

            markrmiller@gmail.com - This is really great, thank you.

            I have one question, out of curiosity: Why do a few tests show up as failing more than 100% of the time?

            markrmiller@gmail.com Mark Miller added a comment -

            Those are special codes to order and identify tests with annotations.

            So if a test is ignored, it's not run at all and gets a 122 or whatever. If it's @BadApple and fails 100%, it gets a 112; if it's @AwaitsFix and fails 100%, it gets a 113. So those over-100% values are basically expected. If a test shows exactly 100%, it's a test that won't pass and doesn't have one of these annotations, so that's really bad.
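
            A small sketch of that scheme (the sentinel values follow the loose description above and have not been checked against the actual report script):

            {code:java}
            public final class ReportCodes {
              // Sentinel values above 100 sort annotated/ignored tests away from real failures.
              static int reportValue(int failPercent, boolean badApple, boolean awaitsFix, boolean ignored) {
                if (ignored) return 122;                          // @Ignore: never run at all
                if (failPercent == 100 && badApple) return 112;   // @BadApple failing 100%: expected
                if (failPercent == 100 && awaitsFix) return 113;  // @AwaitsFix failing 100%: expected
                return failPercent;                               // otherwise, the plain fail percentage
              }
            }
            {code}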

            erickerickson Erick Erickson added a comment -

            This is excellent, thanks for all your hard work here!

            markrmiller@gmail.com Mark Miller added a comment -

            Thanks! It actually ended up being a ton of work. It wasn't so bad just to stitch something together for Solr with me to fill in gaps, but to make it generic for any project using docker, to allow it to have tests (docker within docker!), to allow you to point it at 10 freshly provisioned machines with no setup on your part, to make it easy to debug and add new project support easily, etc, was actually many, many, many hours of effort. Still some polish and minor things to do, but very happy it's ready to start pushing out reports regularly now.


            shalin Shalin Shekhar Mangar added a comment -

            Hi markrmiller@gmail.com, the last report was in April. Is this still something you intend to publish, and/or do you intend to publish the code?

            markrmiller@gmail.com Mark Miller added a comment -

            I don't own the code and had not had a chance to open source it before changing jobs, so it remains to be seen what will happen.


            People

              Assignee: markrmiller@gmail.com Mark Miller
              Reporter: markrmiller@gmail.com Mark Miller
              Votes: 2
              Watchers: 19
