Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      TestReplicationHandler seems to fail often.

      Attachments

      1. TestReplicationHandler.FAILED.210743 (971 kB, Yonik Seeley)
      2. fail1.txt (1.03 MB, Yonik Seeley)
      3. SOLR-1469.patch (6 kB, Yonik Seeley)
      4. SOLR-1469.patch (16 kB, Yonik Seeley)

        Activity

        Yonik Seeley added a comment -

        Attaching one particularly worrying failure.

        Testcase: testReplicateAfterWrite2Slave took 2.656 sec
        FAILED
        expected:<1> but was:<0>
        junit.framework.AssertionFailedError: expected:<1> but was:<0>
        at org.apache.solr.handler.TestReplicationHandler.testReplicateAfterWrite2Slave(TestReplicationHandler.java:424)

        So replication was disabled, a doc was added to the slave, but then the search for it failed. How can that be?
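
        (For reference, a minimal sketch of the flow under test, using SolrJ-style calls; the client variables and the disableReplication helper are illustrative, not the actual TestReplicationHandler code:)

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.response.QueryResponse;
          import org.apache.solr.common.SolrInputDocument;
          import static org.junit.Assert.assertEquals;

          // 1. stop the slave from pulling a new index from the master
          disableReplication(masterClient);            // illustrative helper

          // 2. write a document directly to the slave and commit it
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "555");
          slaveClient.add(doc);
          slaveClient.commit();

          // 3. the doc should now be visible on the slave...
          QueryResponse rsp = slaveClient.query(new SolrQuery("id:555"));
          assertEquals(1L, rsp.getResults().getNumFound());  // this is the assert that fails with 0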

        Mark Miller added a comment -

        Are you using Windows? I haven't gotten it to fail on Linux yet.

        Perhaps we should add a check of the status returned from the index call in there?
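
        (One way to do that check, sketched with SolrJ's UpdateResponse; the client variable is illustrative:)

          import org.apache.solr.client.solrj.response.UpdateResponse;
          import org.apache.solr.common.SolrInputDocument;
          import static org.junit.Assert.assertEquals;

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "555");
          UpdateResponse addRsp = slaveClient.add(doc);
          assertEquals("add failed on slave", 0, addRsp.getStatus());       // 0 = success
          UpdateResponse commitRsp = slaveClient.commit();
          assertEquals("commit failed on slave", 0, commitRsp.getStatus());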

        Yonik Seeley added a comment -

        OK, I think this particular one might be a bug in the test.
        Replication is disabled only after the master commit, and in the test log provided it looks like the start of a replication sneaks in there and finishes after the addDoc() on the slave.

        I'll leave the tests running in a loop tonight and see if moving the disableReplication before the master commit fixes things.
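
        (The reordering in question, sketched; the helper names are illustrative:)

          // Racy order: the commit can trigger a slave poll that is still in
          // flight when replication is disabled, and that finishes later.
          masterClient.commit();
          disableReplication(masterClient);

          // Proposed order: disable first, so the commit can't be replicated.
          disableReplication(masterClient);
          masterClient.commit();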

        Yonik Seeley added a comment -

        I just committed the fix for the assertion failure above.
        The failures left now are due to connection refused exceptions.

        Yonik Seeley added a comment -

        Jetty still fails to come up occasionally, even after waiting 2 minutes. I also tested with the latest Jetty 6.1.21 - same results.
        At this point, it could still be a Solr bug or a Jetty bug.

        Yonik Seeley added a comment -

        The Jetty bug has been fixed by SOLR-2019.

        The failures that Mark & I recently saw (http://search.lucidimagination.com/search/document/6d3f4d23cde4f1bd/solr_replication_test_case_failure) look to be due to sharing servers across test cases.

        setUp() cleans up in between, but that's part of the problem. The commit from setUp() can start replication before it can be disabled... and then it can complete right after the last add on the slave, wiping out the adds and causing the test to fail.
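
        (The race as a timeline sketch; the steps are illustrative:)

          // t0  setUp(): cleanup commit on the master between test methods
          // t1  slave poll fires; a pull of the master's index begins
          // t2  test: disableReplication()  -- too late, the pull from t1 is in flight
          // t3  test: add a doc directly to the slave and commit
          // t4  the pull from t1 completes and installs the master's index,
          //     wiping out the doc added at t3, so the final query finds 0 docs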

        Yonik Seeley added a comment -

        Here's the log of the failure (fail1.txt).
        The fix would seem to be simple... but I'm seeing some other strange stuff that looks off.

        Line 12818: you can see replication start right before it's disabled.
        It's going after master index version 1280686155645.
        But if we look back to Line 16046, we see that that is not the newest master index!

        Yonik Seeley added a comment -

        OK, it looks like the previous test method set up the master to replicate only on startup, which would explain why the wrong index version was being replicated. The test only checked that, once replication was enabled again, the doc added directly to the slave was gone; it did not check that the slave got the correct version of the index.
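
        (For context, "replicate on startup" vs. "replicate on commit" comes down to the master-side replicateAfter setting of the ReplicationHandler in solrconfig.xml; a sketch of the relevant stanza:)

          <requestHandler name="/replication" class="solr.ReplicationHandler">
            <lst name="master">
              <!-- "startup" publishes only the index as of server start;
                   "commit" publishes every new commit point to polling slaves -->
              <str name="replicateAfter">startup</str>
            </lst>
          </requestHandler>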

        Yonik Seeley added a comment -

        OK, here's a patch that fixes some issues.

        • To avoid the commit() in setUp() causing an unwanted replication event, the test now first queries the master and, if there are any docs, deletes the index on the master and waits for that to replicate to the slave (see the sketch after this list).
        • I moved testReplicateAfterWrite2Slave up in the file so it will run before anything that changes the default server configs (this gets around the problem where it was only replicating on startup).
        • I added some test code at the end of testReplicateAfterWrite2Slave to ensure that the correct index was replicated.
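
        A rough sketch of the cleanup idea from the first bullet (SolrJ-style; the numDocs helper and the poll interval are illustrative):

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.SolrServer;

          // illustrative helper: number of docs currently visible on a server
          static long numDocs(SolrServer client) throws Exception {
            return client.query(new SolrQuery("*:*")).getResults().getNumFound();
          }

          // instead of a blind commit() in setUp():
          if (numDocs(masterClient) > 0) {
            masterClient.deleteByQuery("*:*");
            masterClient.commit();
            // wait (bounded) for the now-empty index to replicate to the slave
            for (int i = 0; i < 120 && numDocs(slaveClient) > 0; i++) {
              Thread.sleep(500);
            }
          }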
        Yonik Seeley added a comment -

        Committed to trunk and 3x.

        Michael McCandless added a comment -

        Hmm, TestReplicationHandler now takes much longer for me (~149 seconds); previously it was ~30 seconds. Is this expected?

        Yonik Seeley added a comment -

        Grrr... no, it's not expected.
        I moved one test case around (that shouldn't matter).
        The only thing I can figure is that it might be the wait for the empty index to replicate to the slave. I'll look into it.

        Yonik Seeley added a comment -

        That was the problem. I'm making slow progress... the test methods have so many side effects that it's hard to order them correctly and restore a correct enough state between them.

        All of the methods also indexed and tested for 500 docs, so it was possible to get false passes. I'm changing this to use a different doc count for each test method.
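
        (Illustratively, with made-up counts and helper names:)

          // If every method indexes 500 docs, finding 500 docs on the slave can't
          // distinguish this test's index from one left over by an earlier method.
          // A distinct count per method makes any leakage show up as a wrong count:
          indexDocs(masterClient, 603);               // illustrative helper and count
          masterClient.commit();
          assertEquals(603L, numDocs(slaveClient));   // a stale 500-doc index now fails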

        Yonik Seeley added a comment -

        OK, here's a patch that removes setUp()... we can't really clean the indexes automatically when the servers are in unknown states, as they are at the end of many test methods. The cleanup calls are now explicit.

        There were some tests that rearranging didn't fix, and I had to add more server bounces to restore the correct state, so the test will run a few seconds longer than before.
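
        (With setUp() gone, each method restores a known state explicitly; a rough sketch with illustrative names:)

          @Test
          public void testReplicateAfterWrite2Slave() throws Exception {
            clearIndexes(masterClient, slaveClient);  // explicit, was implicit in setUp()
            // ... test body proceeds from a known-clean state
          }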

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee: Unassigned
          • Reporter: Yonik Seeley
          • Votes: 0
          • Watchers: 1
