Solr
SOLR-7134

Replication can still cause index corruption.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: replication (java)
    • Labels: None

      Description

      While we have plugged most of these holes, there appears to be another that is fairly rare.

      I've seen it play out a couple of ways in tests, but it looks like part of the problem is that even if we decide we need a file and download it, we ignore the failure when we then cannot move it into place because a file with that name already exists.

      I'm working on a fix that does two things (roughly sketched below):

      • Fail a replication attempt if we cannot move a file into place because it already exists.
      • If a replication attempt during recovery fails, force a full replication to a new directory on the next attempt.
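      A rough, hypothetical sketch of the idea (the names below are illustrative only and are not the identifiers used in the attached patches):

      import java.io.File;
      import java.io.IOException;
      import java.nio.file.Files;

      // Illustrative sketch only - not the code from the attached SOLR-7134 patches.
      class ReplicationFixSketch {

        // Point 2: remember whether the last replication attempt succeeded so the
        // next recovery attempt can force a full copy into a fresh index directory.
        private volatile boolean lastReplicationSucceeded = true;

        // Point 1: if a downloaded file cannot be moved into place because a file
        // with the same name already exists, fail the whole replication attempt
        // instead of silently continuing and risking a mixed (corrupt) index.
        void moveIntoPlace(File downloaded, File indexDir) throws IOException {
          File target = new File(indexDir, downloaded.getName());
          if (target.exists()) {
            lastReplicationSucceeded = false;
            throw new IOException("Cannot move " + downloaded.getName()
                + " into place: a file with that name already exists in " + indexDir);
          }
          Files.move(downloaded.toPath(), target.toPath());
        }

        boolean shouldForceFullReplication() {
          return !lastReplicationSucceeded;
        }
      }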
      Attachments

      1. SOLR-7134.patch
        27 kB
        Mark Miller
      2. SOLR-7134.patch
        27 kB
        Mark Miller
      3. SOLR-7134.patch
        23 kB
        Mark Miller
      4. SOLR-7134.patch
        22 kB
        Mark Miller
      5. SOLR-7134.patch
        17 kB
        Mark Miller

          Activity

          Mark Miller added a comment -

          This seems to be working well.

          While looking into chaosmonkey test failures, I've also found another issue I'll file a JIRA for - we wait for 2 seconds in RecoveryStrategy to make sure any updates that saw the old cloudstate are done - that doesn't appear to be enough in some cases. I'm trying with 5 seconds.

          Mark Miller added a comment -

          I've also found another issue

          I think raising the time is a good start, but this is a hard problem to solve as nicely as I'd like - what if you get a long stop-the-world GC pause at the wrong time?

          Mark Miller added a comment -

          I've raised SOLR-7141 for this other issue.

          Mark Miller added a comment -

          First patch. Includes SOLR-7141.

          I've been mostly testing this on our Cloudera search branch so far (4.4 with lots of backports, especially around SolrCloud). It has resolved all of the most worrisome trouble spots I could find with HdfsChaosMonkeySafeLeader with thousands and thousands of runs so far.

          Mark Miller added a comment -

          New patch. I think this is about ready.

          Mike Drob added a comment -
          RealTimeGetComponent.java
          +        } catch (Exception e) {
          +          throw new SolrException(ErrorCode.SERVER_ERROR, null, e);
          +        }
          

          Is it ok to kill the whole operation from inside of a debug block? Maybe just debug log that we couldn't get correct debug logging for some reason (log exception too).

          HdfsTestUtil.java
          +      try {
          +        dfsCluster.shutdown();
          +      } catch (Error e) {
          +        log.warn("Exception shutting down dfsCluster", e);
          +      }
          

          Is this related, or just incidental test cleanup?

          StopableIndexingThread.java
          -  public StopableIndexingThread(SolrClient controlClient, SolrClient cloudClient, String id, boolean doDeletes, int numCycles) {
          

          No reason to remove this constructor, I think.

          ChaosMonkeySafeLeaderTest.java
          +    if (!pauseBetweenUpdates) {
          +      maxUpdates = 10000 + random().nextInt(1000);
          +    } else {
          +      maxUpdates = 15000;
          +    }
          

          Why is there a difference?

          SnapPuller.java
          +          LOG.info("Reloading SolrCore");
          

          Possibly worth logging which core?

          Mark Miller added a comment - edited

          Is it ok to kill the whole operation from inside of a debug block?

          From my perspective I'd rather have that, as I'm more likely to notice there is an issue. It's probably nicer for the end user to chug along, though.

          Is this related, or just incidental test cleanup?

          Rarely, shutting down HDFS can throw an exception - in some cases because it cannot find a class it wants, in other cases a NullPointerException. Issues starting and stopping HDFS in tests are outside of what we are testing, though - they shouldn't randomly fail our chaosmonkey tests.

          No reason to remove this constructor, I think.

          I'd rather force devs to pay attention to the other params they use than offer more constructors here.

          Why is there a difference?

          Tests can last much too long when we have no pauses between updates and we allow too many updates. When there are pauses, it's not so bad, but the pauses can be so short (it's random) that we still want some upper limit. This is probably a result of log replay not being able to keep up with incoming updates.

          Possibly worth logging which core?

          It's always worth doing this everywhere, but since we do it so little already (except when you use our special test logger, which tries to figure it out), I've been waiting for a more holistic fix rather than doing ad hoc fixes. No real pain adding it here though.

          Mark Miller added a comment -

          New patch that changes all of the solrcloud_debug sections to only debug log caught exceptions, adds the core name to the reload log line, and adds a note about ignoring rare hdfs shutdown issues.
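          Roughly, the solrcloud_debug change amounts to something like this (a sketch, not the exact code in the patch):

          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          // Sketch of the change described above: failures inside the debug-only
          // block are now debug-logged rather than rethrown as a SERVER_ERROR.
          class SolrCloudDebugSketch {
            private static final Logger log = LoggerFactory.getLogger(SolrCloudDebugSketch.class);

            void runDebugBlock(Runnable debugWork) {
              try {
                debugWork.run();
              } catch (Exception e) {
                // Only the debug output is affected, so the request keeps going.
                log.debug("Could not gather solrcloud_debug information", e);
              }
            }
          }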

          Mike Drob added a comment -

          Tests can last much too long if when we have no pauses between updates and we allow too many updates. When there are pauses, its not so bad, but the pauses can be so short (it's random), we still want to have some upper limit. This is probably a result of log replay not being able to keep up with updates coming.

          It just doesn't look like there is an appreciable difference between [10000,11000) and [15000]. Is the supposition here that the pauses slow things down enough that we want to raise how many we do?

          +1 on the rest.

          Shalin Shekhar Mangar added a comment -

          Thanks Mark. Some comments on your latest patch:

          1. The SolrCoreState.setLastReplicateIndexSuccess is not used anywhere?
          2. The first time a recovery is requested, it will always force replication because SolrCoreState.getLastReplicateIndexSuccess will return false
          3. A replication failure because of a connect exception, timeout, etc. shouldn't necessarily force a full replication, but it looks like it will in this patch.
          4. The SnapPuller.cleanup method releases tmpIndexDir even if deleteTmpIdxDir=false.
          Mark Miller added a comment -

          It just doesn't look like there is an appreciable difference between [10000,11000) and [15000]

          Whoops - with no pause it should be 1000+ not 10000+.
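          For reference, the ChaosMonkeySafeLeaderTest fragment quoted above would then read roughly as follows (same identifiers as the quoted diff, with the value corrected):

          if (!pauseBetweenUpdates) {
            maxUpdates = 1000 + random().nextInt(1000);
          } else {
            maxUpdates = 15000;
          }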

          Mark Miller added a comment -

          Thanks Shalin.

          1,2

          Whoops - lost these in the refactoring for the new patch.

          3

          Yup. There are already so many bugs when trying to replicate partially into an index that I don't think it's currently worth trying to distinguish here yet. We need to back off to something safer and more conservative until we can have greater confidence in our tests catching more of these issues more easily. Right now they are not catching tons of horrible stuff, and rather than introduce more bugs trying to be careful about when we try a full replication, I'd rather start very conservative - one chance at partial replication, then do the safe thing.

          4

          I'll look closer, but that's the same code that's currently there.

          Mark Miller added a comment -

          New patch that addresses Mike's test comment and Shalin's comments 1 & 2.

          4 is not an issue - the release is for reference counting, hasn't been changed, and should remain as is.
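          The reference-counting pattern being referred to looks roughly like this (hypothetical names only, not the actual Solr DirectoryFactory/SnapPuller code):

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicInteger;

          // Hypothetical sketch of reference-counted directory handles: releasing a
          // handle only drops a reference, it does not delete the directory. That is
          // why cleanup() can release tmpIndexDir even when deleteTmpIdxDir == false.
          class RefCountedDirs {
            private final Map<String, AtomicInteger> refCounts = new ConcurrentHashMap<>();

            void incRef(String path) {
              refCounts.computeIfAbsent(path, p -> new AtomicInteger()).incrementAndGet();
            }

            void release(String path) {
              AtomicInteger count = refCounts.get(path);
              if (count != null && count.decrementAndGet() <= 0) {
                refCounts.remove(path);   // actual deletion of files is handled elsewhere
              }
            }
          }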

          Mark Miller added a comment -

          One more - moves setting the last-replication-success flag into the cleanup method so it is always set.

          Mark Miller added a comment -

          There's been a fair amount of flux since I last 'test beasted' this work - I'll spend some time doing that with the latest patch this morning.

          HdfsChaosMonkeySafeLeader test has been better at catching a lot of these issues than ChaosMonkeySafeLeader test for some reason.

          ASF subversion and git services added a comment -

          Commit 1668779 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1668779 ]

          SOLR-7134: Replication can still cause index corruption.

          Ramkumar Aiyengar added a comment -

          Mark, do you plan to merge this to branch_5x soon? I wanted to merge my commit for SOLR-7291, which I still can, but it will end up causing a few conflicts for you when you merge - hence wanted to check.

          Mark Miller added a comment -

          Yeah, thanks, give me a couple hours, I'll do it shortly.

          ASF subversion and git services added a comment -

          Commit 1669600 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669600 ]

          SOLR-7134: Replication can still cause index corruption.

          Timothy Potter added a comment -

          Bulk close after 5.1 release


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 6
