Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-4531

Add tests to ensure that recovery does not fail on corrupted tlogs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 6.3, 7.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      One of the solr nodes in our SolrCloud was killed. It caused tlog was corrupted. Now the node can't finish recoverying. There is an excepion:
      Caused by: java.lang.IndexOutOfBoundsException: Index: 14, Size: 13
      at java.util.ArrayList.RangeCheck(ArrayList.java:547)
      at java.util.ArrayList.get(ArrayList.java:322)
      at org.apache.solr.update.TransactionLog$LogCodec.readExternString(TransactionLog.java:128)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
      at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:120)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:184)
      at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:451)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:182)
      at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:121)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:184)
      at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:451)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:182)
      at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:451)
      at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:182)
      at org.apache.solr.update.TransactionLog$ReverseReader.next(TransactionLog.java:708)
      at org.apache.solr.update.UpdateLog$RecentUpdates.update(UpdateLog.java:906)
      at org.apache.solr.update.UpdateLog$RecentUpdates.access$000(UpdateLog.java:846)
      at org.apache.solr.update.UpdateLog.getRecentUpdates(UpdateLog.java:996)
      at org.apache.solr.update.UpdateLog.init(UpdateLog.java:241)
      at org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:94)
      at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:123)
      at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:97)
      ... 31 more

      I check the code in UpdateLog.java. I find that only IOException is catched when the above expception happens.

      solr\core\src\java\org\apache\solr\update\UpdateLog.java
          private void update() {
            int numUpdates = 0;
            updateList = new ArrayList<List<Update>>(logList.size());
            deleteByQueryList = new ArrayList<Update>();
            deleteList = new ArrayList<DeleteUpdate>();
            updates = new HashMap<Long,Update>(numRecordsToKeep);
      
            for (TransactionLog oldLog : logList) {
              List<Update> updatesForLog = new ArrayList<Update>();
      
              TransactionLog.ReverseReader reader = null;
              try {
                reader = oldLog.getReverseReader();
      
                while (numUpdates < numRecordsToKeep) {
                  Object o = reader.next();
                  if (o==null) break;
                  try {
      
                    // should currently be a List<Oper,Ver,Doc/Id>
                    List entry = (List)o;
      
                    // TODO: refactor this out so we get common error handling
                    int opAndFlags = (Integer)entry.get(0);
                    if (latestOperation == 0) {
                      latestOperation = opAndFlags;
                    }
                    int oper = opAndFlags & UpdateLog.OPERATION_MASK;
                    long version = (Long) entry.get(1);
      
                    switch (oper) {
                      case UpdateLog.ADD:
                      case UpdateLog.DELETE:
                      case UpdateLog.DELETE_BY_QUERY:
                        Update update = new Update();
                        update.log = oldLog;
                        update.pointer = reader.position();
                        update.version = version;
      
                        updatesForLog.add(update);
                        updates.put(version, update);
                        
                        if (oper == UpdateLog.DELETE_BY_QUERY) {
                          deleteByQueryList.add(update);
                        } else if (oper == UpdateLog.DELETE) {
                          deleteList.add(new DeleteUpdate(version, (byte[])entry.get(2)));
                        }
                        
                        break;
      
                      case UpdateLog.COMMIT:
                        break;
                      default:
                        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,  "Unknown Operation! " + oper);
                    }
                  } catch (ClassCastException cl) {
                    log.warn("Unexpected log entry or corrupt log.  Entry=" + o, cl);
                    // would be caused by a corrupt transaction log
                  } catch (Exception ex) {
                    log.warn("Exception reverse reading log", ex);
                    break;
                  }
                }
      
              } catch (IOException e) {
                // failure to read a log record isn't fatal
                log.error("Exception reading versions from log",e);
              } finally {
                if (reader != null) reader.close();
              }
              updateList.add(updatesForLog);
            }
      
          }
      
      1. SOLR-4531.patch
        10 kB
        Shalin Shekhar Mangar
      2. SOLR-4531.patch
        11 kB
        Cao Manh Dat

        Activity

        Hide
        caomanhdat Cao Manh Dat added a comment - - edited

        Mark Miller Is this issue duplicated with SOLR-4359?

        Show
        caomanhdat Cao Manh Dat added a comment - - edited Mark Miller Is this issue duplicated with SOLR-4359 ?
        Hide
        caomanhdat Cao Manh Dat added a comment -

        Patch contain a test for this issue.

        In the test, It create corrupted tlog files by remove some last byte of the file and make sure the cluster still healthy.

        This patch also modify TestLeaderRecoverFromLogOnStartup to make a cleaner test for SOLR-5401 based on Varun Thacker hint.

        Shalin Shekhar Mangar Please kindly review this patch.

        Show
        caomanhdat Cao Manh Dat added a comment - Patch contain a test for this issue. In the test, It create corrupted tlog files by remove some last byte of the file and make sure the cluster still healthy. This patch also modify TestLeaderRecoverFromLogOnStartup to make a cleaner test for SOLR-5401 based on Varun Thacker hint. Shalin Shekhar Mangar Please kindly review this patch.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I don't see how this issue was/is fixed. Seems like it can still happen? It's good that you randomize the number of bytes to truncate. Can you beast this test to see if we can get it to fail?

        Looking at the patch – the cluster.startJettySolrRunner() calls are redundant?

        +
        +    ChaosMonkey.start(cluster.getJettySolrRunners());
        +    cluster.startJettySolrRunner();
        +    cluster.startJettySolrRunner();
        +    cluster.startJettySolrRunner();
        +    cluster.startJettySolrRunner();
        
        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I don't see how this issue was/is fixed. Seems like it can still happen? It's good that you randomize the number of bytes to truncate. Can you beast this test to see if we can get it to fail? Looking at the patch – the cluster.startJettySolrRunner() calls are redundant? + + ChaosMonkey.start(cluster.getJettySolrRunners()); + cluster.startJettySolrRunner(); + cluster.startJettySolrRunner(); + cluster.startJettySolrRunner(); + cluster.startJettySolrRunner();
        Hide
        caomanhdat Cao Manh Dat added a comment -

        The patch just contain the test, because this issue already fixed by SOLR-4359.

        Show
        caomanhdat Cao Manh Dat added a comment - The patch just contain the test, because this issue already fixed by SOLR-4359 .
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I looked at the SOLR-4359 commit but I don't see where an IndexOutOfBoundsException is caught. The only error handling that SOLR-4359 added was to skip the tlog if the next record is null.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I looked at the SOLR-4359 commit but I don't see where an IndexOutOfBoundsException is caught. The only error handling that SOLR-4359 added was to skip the tlog if the next record is null.
        Hide
        caomanhdat Cao Manh Dat added a comment -

        The reader.next() move inside the try catch block in SOLR-4359 solve this issue, because the general Exception is caught for this block so IndexOutOfBoundsException will also be caught by this block. The outer block just catch IOException so the exception will be throw outside of the method.

        Show
        caomanhdat Cao Manh Dat added a comment - The reader.next() move inside the try catch block in SOLR-4359 solve this issue, because the general Exception is caught for this block so IndexOutOfBoundsException will also be caught by this block. The outer block just catch IOException so the exception will be throw outside of the method.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Ah, I missed that inner try/catch block. Thanks for clarifying. I'll commit this shortly.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Ah, I missed that inner try/catch block. Thanks for clarifying. I'll commit this shortly.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Patch which removes the redundant cluster.startJettySolrRunner() calls. I'll commit this after running precommit and tests.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Patch which removes the redundant cluster.startJettySolrRunner() calls. I'll commit this after running precommit and tests.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit b7aa582dffd7a0bae3246e43c66a20a9c2e5341d in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b7aa582 ]

        SOLR-4531: Add tests to ensure that recovery does not fail on corrupted tlogs

        Show
        jira-bot ASF subversion and git services added a comment - Commit b7aa582dffd7a0bae3246e43c66a20a9c2e5341d in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b7aa582 ] SOLR-4531 : Add tests to ensure that recovery does not fail on corrupted tlogs
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 529654c02be519accf0e83a5919570fa2d9bc69d in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=529654c ]

        SOLR-4531: Add tests to ensure that recovery does not fail on corrupted tlogs

        (cherry picked from commit b7aa582)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 529654c02be519accf0e83a5919570fa2d9bc69d in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=529654c ] SOLR-4531 : Add tests to ensure that recovery does not fail on corrupted tlogs (cherry picked from commit b7aa582)
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Simon and Dat!

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Simon and Dat!
        Hide
        romseygeek Alan Woodward added a comment -

        I've seen this test fail a couple of times, I think due to collections not being fully active when the @Before method is run: see https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/2078/consoleFull for example. If the collection isn't fully active, then the DBQ and commit get buffered, and the query to check that the collection is empty is served by an old searcher and consequently fails.

        To fix, I'd suggest either adding a waitForState() call in the @Before method, or just creating a new collection per-method.

        Show
        romseygeek Alan Woodward added a comment - I've seen this test fail a couple of times, I think due to collections not being fully active when the @Before method is run: see https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/2078/consoleFull for example. If the collection isn't fully active, then the DBQ and commit get buffered, and the query to check that the collection is empty is served by an old searcher and consequently fails. To fix, I'd suggest either adding a waitForState() call in the @Before method, or just creating a new collection per-method.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            icanfly0421 Simon Scofield
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development