SOLR-13811

possible autoAddReplicas bug and/or (Hdfs)AutoAddReplicasIntegrationTest refactoring / fixes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix

    Description

      I've noticed a pattern of failures in Jenkins runs of AutoAddReplicasIntegrationTest (mostly manifesting in the subclass HdfsAutoAddReplicasIntegrationTest, probably due to timing) which indicates that either:

      1. the test is too contrived, and expects autoAddReplicas to kick in in a situation that the current implementation of NodeLostTrigger isn't smart enough to handle, or
      2. NodeLostTrigger should be smart enough to handle this, but isn't.

      The test failure is currently somewhat finicky to reproduce: it depends on a node being stopped, restarted, and stopped again, while an affected collection is changed from autoAddReplicas=false to autoAddReplicas=true before the second "stop".
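To make the ordering constraint concrete, here is a minimal, self-contained sketch that checks whether a list of events matches the sequence described above. It is only an illustration: the class, enum, and method names are hypothetical, nothing here is Solr API, and it assumes a single node and a single collection that starts with autoAddReplicas=false.

```java
import java.util.List;

// Illustrative model only; every name here is hypothetical, not a Solr API.
class ReproSequenceSketch {
    enum Event { STOP_NODE, START_NODE, SET_AUTO_ADD_REPLICAS_TRUE }

    /**
     * Returns true if the node is stopped while autoAddReplicas=false,
     * later restarted, and then stopped again after the collection has
     * been switched to autoAddReplicas=true.
     */
    static boolean matchesSuspectSequence(List<Event> events) {
        boolean autoAdd = false;              // assumption: collection starts with autoAddReplicas=false
        boolean nodeUp = true;                // assumption: node starts out running
        boolean stoppedWhileDisabled = false; // first stop happened with autoAddReplicas=false
        boolean restartedAfter = false;       // node came back after that first stop
        for (Event e : events) {
            switch (e) {
                case SET_AUTO_ADD_REPLICAS_TRUE:
                    autoAdd = true;
                    break;
                case STOP_NODE:
                    if (nodeUp) {
                        if (restartedAfter && autoAdd) {
                            return true;      // second stop, now with autoAddReplicas=true
                        }
                        if (!autoAdd) {
                            stoppedWhileDisabled = true;
                        }
                        nodeUp = false;
                    }
                    break;
                case START_NODE:
                    if (!nodeUp) {
                        nodeUp = true;
                        if (stoppedWhileDisabled) {
                            restartedAfter = true;
                        }
                    }
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            Event.STOP_NODE, Event.START_NODE,
            Event.SET_AUTO_ADD_REPLICAS_TRUE, Event.STOP_NODE);
        System.out.println(matchesSuspectSequence(events));
    }
}
```

The sketch accepts the switch to autoAddReplicas=true either before or after the restart, since the only stated requirement is that it happens before the second stop.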

      Regardless of which of the two is true, the test itself is somewhat convoluted. It creates a sequence of events (some randomized, some static) and asserts specific outcomes after each, but the timing of scheduled triggers like NodeLostTrigger, and the interplay of things like "pick a random node to shut down" with a subsequent "explicitly shut down node2" (even if it was the node randomly shut down earlier), is confusing.

      I'm creating this issue to track two tightly dependent objectives:

      1. refactoring this test to:
        • better isolate the specific things it's trying to test in individual test methods.
        • have a single test method that triggers the specific sequence of events that is currently problematic (ideally in such a way that it reliably fails).
      2. marking this new test method as AwaitsFix until someone with a better understanding of the autoAddReplicas / NodeLostTrigger code can assess whether the test or the code being tested is faulty.

      Attachments

        1. apache_Lucene-Solr-NightlyTests-8.x_221.log.txt
          4.51 MB
          Chris M. Hostetter
        2. hoss_local_failure_after_refactoring.log.txt
          3.11 MB
          Chris M. Hostetter


          People

            Assignee: Unassigned
            Reporter: Chris M. Hostetter (hossman)