SOLR-6290: CollectionsAPIAsyncDistributedZkTest failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud, Tests
    • Labels: None

      Description

      Found this failure on my local jenkins instance:

      Error Message
      
      CreateCollection task did not complete! expected:<[completed]> but was:<[running]>
      Stacktrace
      
      org.junit.ComparisonFailure: CreateCollection task did not complete! expected:<[completed]> but was:<[running]>
      	at __randomizedtesting.SeedInfo.seed([D4EC4F9D698903B5:550AC1851ED66389]:0)
      	at org.junit.Assert.assertEquals(Assert.java:125)
      	at org.apache.solr.cloud.CollectionsAPIAsyncDistributedZkTest.testSolrJAPICalls(CollectionsAPIAsyncDistributedZkTest.java:75)
      	at org.apache.solr.cloud.CollectionsAPIAsyncDistributedZkTest.doTest(CollectionsAPIAsyncDistributedZkTest.java:61)
      	at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
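      For reference, the failure message above has the shape produced by a JUnit string-equality assertion on the async task's reported state. A minimal sketch, not the actual test code; pollTaskState is a hypothetical stand-in for whatever the test reads back from the async Collections API:

      import static org.junit.Assert.assertEquals;

      public class AsyncTaskAssertionSketch {

          /** Hypothetical stand-in for the state the test reads back for the async request. */
          static String pollTaskState() {
              return "running"; // illustrative value only
          }

          public static void main(String[] args) {
              String state = pollTaskState();
              // Throws a ComparisonFailure like the one in the stacktrace when the task
              // is still "running" after the test's wait budget is exhausted.
              assertEquals("CreateCollection task did not complete!", "completed", state);
          }
      }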
      
      Attachments

      1. SOLR-6290.patch
        3 kB
        Shalin Shekhar Mangar
      2. SOLR-6290.patch
        2 kB
        Shalin Shekhar Mangar
      3. SOLR-6290.patch
        2 kB
        Shalin Shekhar Mangar

        Activity

        Shalin Shekhar Mangar added a comment -

        There were a couple of calls which had a timeout of 10 seconds and one with just 3 seconds. These were causing the failure. I bumped up the timeout to 5 minutes – it won't ever take that long but we should wait for as long as possible.

        Mark Miller added a comment -

        had a timeout of 10 seconds and one with just 3 seconds.

        I bumped up the timeout to 5 minutes – it won't ever take that long but we should wait for as long as possible.

        I don't agree that we want to wait as long as possible. We still want tests to fail as fast as possible as well. It's a balancing act - we want to wait as long as is necessary not to get false fails and no more. I didn't dig into this, but if tests were usually passing with 3 and 10 seconds, it seems to make more sense to raise them to 2-5x that or something. Just making it huge is annoying when people have to debug these fails later when doing development.

        Shalin Shekhar Mangar added a comment -

        I don't agree that we want to wait as long as possible. We still want tests to fail as fast as possible as well.

        I don't buy that. Our tests should fail when there's actually a problem. Out of all the tests that I've looked at – all except the SyncSliceTest were false negatives. That sort of flakiness doesn't buy us anything. I'd rather have our tests take 30 minutes more and make us more confident than have them force us to ignore (sometimes real) failures just because they fail regularly on slow jenkins and we can't tell the difference easily.

        That being said, 5 minutes may be too much and we can make it something like a minute or so.

        Mark Miller added a comment -

        we want to wait as long as is necessary not to get false fails and no more.

        It's a true statement for the reason I said. If the test won't fail at 30 seconds (and it seems it won't if it usually passed at 3 or 10 seconds), then making it 5 minutes is just painful for no gain. We could also make it an hour, and it would be really painful. I've spent a lot of time debugging SolrCloud tests, and making unnecessarily huge timeouts is just mean to future devs.

        Mark Miller added a comment -

        I'd rather have our tests take 30 minutes more

        -1! That would really hurt development!

        Shalin Shekhar Mangar added a comment -

        You're getting very generous with vetoes these days.

        No, it won't hurt development because these tests take a long time only on slow boxes. You and I, and most people running on decent 2-3 year old machines will not face a problem.

        Mark Miller added a comment -

        A -1 is not normally a veto. It's simply an indication of direction. -1, -0, 0, +0, +1 all have somewhat defined meanings in general Apache discussion.

        A veto is a very specific -1 to a specific code change. It needs to be announced as an official veto and it needs to be supported by a valid technical reason.

        Someone just saying -1 is not a veto.

        Mark Miller added a comment -

        You and I, and most people running on decent 2-3 year old machines will not face a problem.

        We will face a problem. Longer timeouts do not make the tests longer. I don't think you are understanding me.

        When developers make future changes, it will often cause tests to fail. If those tests take 5 minutes to fail when they could take 30 seconds or even a minute, it's very painful. I've experienced it. Even with my fast computer - in fact computer speed makes little to no difference in this case.

        Dawid Weiss added a comment -

        > I'd rather have our tests take 30 minutes more

        Oh, please no.

        I really appreciate the work you're doing, Shalin, but I agree with Mark – many people have stopped running Solr tests at all because they take forever. Adding delays is just making it worse. I understand it's complex system interaction, lots of code, etc., but wall-time delays are really the worst way of waiting for a condition to happen... If there's any other way (an HTTP ping request for status or something like this) then spin-looping this for 30 seconds without any progress is I think an indication of a real problem.

        Mark Miller added a comment -

        To clarify on my -1 comment, I'll translate:

        Shalin: I'd rather have our tests take 30 minutes more

        Mark: I don't agree with that at all! That would really hurt development!

        Shalin Shekhar Mangar added a comment -

        When developers make future changes, it will often cause tests to fail. If those tests take 5 minutes to fail when they could take 30 seconds or even a minute, it's very painful. I've experienced it. Even with my fast computer - in fact computer speed makes little to no difference in this case.

        I totally get that but I'd rather have a test which fails late sometimes (when someone made a mistake and introduced a bug) than a test which gives up quickly and causes spurious failures often on our jenkins instance, the results of which are broadcasted to our whole dev community. We can agree to disagree.

        If there's any other way (an HTTP ping request for status or something like this) then spin-looping this for 30 seconds without any progress is I think an indication of a real problem.

        This. That's exactly the problem here. We're not sleeping for 5 minutes. We are willing to wait for up to 5 minutes (we can tune it lower if it makes people happier) for an async task to succeed, invoking a ping status call every 1 second. Unfortunately, instead of going over the technical merits of the patch, we've decided to have a philosophical discussion here.
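        A minimal sketch of the wait-with-polling approach described here, not the actual test code; fetchTaskState is a hypothetical stand-in for the SolrJ request-status call, and the configurable wait budget and 1-second ping interval mirror the values under discussion:

        import java.util.concurrent.TimeUnit;

        public class StatusPollSketch {

            /** Hypothetical stand-in for the SolrJ call that asks Solr for an async task's state. */
            static String fetchTaskState(String requestId) {
                return "running"; // illustrative value only
            }

            /** Polls once per second until the task completes or the wait budget runs out. */
            static String waitForCompletion(String requestId, long maxWaitSeconds) throws InterruptedException {
                long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);
                String state = fetchTaskState(requestId);
                while (!"completed".equals(state) && System.nanoTime() < deadline) {
                    TimeUnit.SECONDS.sleep(1); // ping interval, not a blanket sleep
                    state = fetchTaskState(requestId);
                }
                return state; // the caller asserts this equals "completed"
            }

            public static void main(String[] args) throws InterruptedException {
                // 5-second budget here purely for demonstration; the discussion is about 60s vs 300s.
                System.out.println(waitForCompletion("1001", 5));
            }
        }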

        Dawid Weiss added a comment -

        > We are willing to wait for up to 5 minutes (we can tune it lower if it makes people happier) for an async task to succeed, invoking a ping status call every 1 second.

        I think these are insane timeouts though, Shalin. I mean – the tests never write tons of data; even on heavily loaded systems there should be some detectable progress (or a sensible, short-lived timeout after which you consider this a failure).

        FreeBSD jenkins and its failures are a whole different subject. I don't think timeouts will help it, to be honest – I had many concerns with respect to the openjdk port we're (forced) to use...

        Mark Miller added a comment -

        we've decided to have a philosophical discussion here.

        This is not philosophical at all. It's purely technical.

        You raised timeouts of 3 and 10 seconds to 5 minutes and declared essentially that we can't make these timeouts high enough. I'm saying we sure as heck can!

        I totally get that

        If you totally got that, you wouldn't just assume 5 minutes is a good number and say that the timeouts should just be super high.

        The chances that 2 minutes or 1 minute or 30 seconds is better seem pretty high. Why not make it an hour with your philosophy?

        If you are debugging, there is a big difference between 1 minute, 2 minutes, and 5 minutes.

        So you see, I'm focusing on the very specific and technical timeout raise that you made. Your philosophy of "the timeouts can't be high enough" also concerns me, but what has my focus is that this issue indicates tests usually pass with timeouts of 3 and 10 seconds, and you are raising them to 5 minutes, which seems like overkill and comes with negative consequences that are pretty likely unnecessary.

        It's pretty easy to spot these kinds of failures. It's way better to do a more sensible timeout, and if there is another fail, double it again or something.

        Shalin Shekhar Mangar added a comment -

        I think these are insane timeouts though, Shalin. I mean – the tests never write tons of data; even on heavily loaded systems there should be some detectable progress (or a sensible, short-lived timeout after which you consider this a failure).

        Yeah, perhaps I went overboard with the timeout. It can be set to a minute and that should avoid this particular failure. Some tasks such as split or migrate require more time, e.g. I saw a simple shard split on an empty index fail to complete within 90 seconds, and there was nothing in the logs which showed any problems.

        The chances that 2 minutes or 1 minute or 30 seconds is better seem pretty high.

        And I agreed with you and said that (twice): "That being said, 5 minutes may be too much and we can make it something like a minute or so." and then again "we can tune it lower if it makes people happier"

        Why not make it an hour with your philosophy?

        Because I am not insane.

        See, all the while, you've been talking as if:

        1. I am actually increasing the time taken by our suite by 30 minutes
        2. Increasing a timeout to 5 minutes here will actually slow the test down for everyone by 5 minutes

        Both of the above are false. The problem I am trying to solve is that our test suite fails needlessly, and often enough that many of us ignore failures on jenkins just because we can't make the distinction quickly enough. I think there is more real benefit in solving that problem right now than in worrying about an imagined problem which might manifest when you are debugging this particular test (and you can always change the timeout if you really think it is a problem).

        My quotes "it won't ever take that long but we should wait for as long as possible" and "I'd rather have our tests take 30 minutes more" have been twisted to mean something that I did not intend.

        Shalin Shekhar Mangar added a comment -

        I have toned down the maximum wait to 1 minute. I'll commit this shortly.

        Shalin Shekhar Mangar added a comment -

        Here's an updated patch which makes the test faster:

        1. The default collection was fixed at 2 slices and 4 shards, but the test doesn't really use the default collection at all. I set the slice and shard counts to 1.
        2. The 'testasynccollectioncreation' collection was created with 2 shards, but 1 shard is enough to test this functionality.

        I increased the timeout for the splitshard call to 2 minutes because I have seen shard splits take longer than 90 seconds on jenkins.
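        A rough sketch of the shape of these changes, not the literal patch; it assumes the test tunes the slice/shard counters used by Solr's distributed test base class and passes per-call wait budgets to its status-polling helper, and the names below are illustrative:

        /** Sketch of the test-speedup settings described in the comment above. */
        public class AsyncTestSpeedupSketch {

            // The default collection is not exercised by the test, so one slice and one shard
            // suffice (down from 2 slices and 4 shards).
            int sliceCount = 1;
            int shardCount = 1;

            // Per-call wait budgets (in seconds) for the status-polling helper.
            static final long CREATE_COLLECTION_WAIT_SECONDS = 60;  // toned down from 5 minutes
            static final long SPLIT_SHARD_WAIT_SECONDS = 120;       // splits have been seen to take >90s on jenkins

            public static void main(String[] args) {
                System.out.println("createcollection wait: " + CREATE_COLLECTION_WAIT_SECONDS
                        + "s, splitshard wait: " + SPLIT_SHARD_WAIT_SECONDS + "s");
            }
        }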

        ASF subversion and git services added a comment -

        Commit 1614194 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1614194 ]

        SOLR-6290: Harden and speed up CollectionsAPIAsyncDistributedZkTest

        ASF subversion and git services added a comment -

        Commit 1614195 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1614195 ]

        SOLR-6290: Harden and speed up CollectionsAPIAsyncDistributedZkTest


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Shalin Shekhar Mangar
          • Votes: 0
          • Watchers: 4
