SOLR-6969

When opening an HDFSTransactionLog for append we must first attempt to recover its lease to prevent data loss.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: hdfs
    • Labels: None

      Description

      This can happen after a hard crash and restart. The current workaround is to stop, wait it out, and start again. We should instead retry and wait a given amount of time, as we do when we detect safe mode.
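
      A minimal sketch of that bounded retry, assuming the append open fails with a wrapped RecoveryInProgressException as in the stack trace in the first comment below (the thread also mentions AlreadyBeingCreatedException, which would need similar handling); the class, method, and timeout constants here are hypothetical, not the actual patch:

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.protocol.RecoveryInProgressException;
        import org.apache.hadoop.ipc.RemoteException;

        // Sketch: bounded retry when the append open fails because HDFS lease
        // recovery is still in progress, mirroring the retry-and-wait behavior
        // already used when safe mode is detected.
        public class TlogAppendOpener {
          private static final long TIMEOUT_MS = 90000; // hypothetical bound
          private static final long PAUSE_MS = 1000;    // hypothetical pause

          public static FSDataOutputStream openForAppend(FileSystem fs, Path tlog)
              throws IOException, InterruptedException {
            long start = System.currentTimeMillis();
            while (true) {
              try {
                return fs.append(tlog);
              } catch (RemoteException e) {
                // Keep waiting only while recovery is in progress and we are
                // still inside the bounded window; otherwise rethrow.
                boolean recovering =
                    e.unwrapRemoteException() instanceof RecoveryInProgressException;
                if (!recovering || System.currentTimeMillis() - start > TIMEOUT_MS) {
                  throw e;
                }
              }
              Thread.sleep(PAUSE_MS);
            }
          }
        }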

      Attachments

      1. SOLR-6969.patch (16 kB) by Mark Miller
      2. SOLR-6969.patch (11 kB) by Mark Miller
      3. SOLR-6969-4.10.4-backport.patch (9 kB) by Anshum Gupta

          Activity

          Mark Miller added a comment -
          ERROR - 2015-01-12 17:49:43.992; org.apache.solr.common.SolrException; Failure to open existing log file (non fatal) hdfs://localhost:8020/solr_test/collection1/core_node1/data/tlog/tlog.0000000000000000000:org.apache.solr.common.SolrException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to close file /solr_test/collection1/core_node1/data/tlog/tlog.0000000000000000000. Lease recovery is in progress. Try again later.
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2626)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2462)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2700)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2663)
          	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:559)
          	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:388)
          	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:422)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
          
          	at org.apache.solr.update.HdfsTransactionLog.<init>(HdfsTransactionLog.java:121)
          	at org.apache.solr.update.HdfsUpdateLog.init(HdfsUpdateLog.java:190)
          	at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:134)
          	at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:94)
          
          Mark Miller added a comment -

          Praneeth also mentions seeing AlreadyBeingCreatedException in SOLR-6367 - we should deal with that as well.

          Mike Drob added a comment -

          Ignoring the error itself for a moment, the error message is also really confusing. It starts with "Failure to open existing log file" and then later in the same line is "Failed to close file."

          For SafeMode it looks like we retry indefinitely. Do we want to exhibit the same behaviour here?

          Mark Miller added a comment -

          Safe mode can be turned on and off by the user, so we retry without limit. It seems this should be a bounded wait, though.

          Anshum Gupta added a comment -

          Mark Miller do you intend to get this in for 5.0? I'm asking as this has been marked as critical for this release.

          Mike Drob added a comment -

          Is retrying always going to be safe? That works fine after we've lost a server and started a new one (albeit too quickly) but what about the case where two servers both think they are responsible for that tlog? This can happen if the original server partially dies, but still has some threads that are doing work and haven't been cleaned up.

          Looking at how other projects handle similar issues - HBase moves the entire directory [1] to break any existing leases and ensure any other processes get kicked out. Maybe a retry is a good stop-gap, but is it going to be a full solution?

          [1]: https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java#L310
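
          For illustration only, a rough sketch of that directory-move idea applied to a tlog directory; the class, method, and path suffix are hypothetical, not anything Solr ships:

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            // Sketch of the HBase-style fencing: rename the log directory so a
            // half-dead previous owner can no longer write where the new owner
            // reads. Names here are hypothetical.
            public class TlogFencing {
              public static Path moveAsideTlogDir(Configuration conf, Path tlogDir)
                  throws IOException {
                FileSystem fs = tlogDir.getFileSystem(conf);
                Path recovering = new Path(tlogDir.getParent(),
                    tlogDir.getName() + "-recovering");
                if (!fs.rename(tlogDir, recovering)) {
                  throw new IOException("Could not rename " + tlogDir
                      + " to " + recovering);
                }
                return recovering; // replay/recover logs from the moved dir
              }
            }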

          Mark Miller added a comment -

          You can't have two Solr servers trying to access the same files when configured properly. The index lock will prevent this before you get to tlog files.

          I do think a long term / full solution requires some discussion with some hdfs guys. I talked to Colin a little about it today. Sounds like it's kind of an ugly situation.

          Mark Miller added a comment -

          Part of the problem is that we want to diverge as little as possible from how things work with the local file system.

          Anshum Gupta added a comment -

          So it's certainly not for 5.0, right?

          Mark Miller added a comment -

          Colin pointed us to https://github.com/apache/hbase/blob/833feefbf977a8208f8f71f5f3bd9b027d54961f/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
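
          The linked FSHDFSUtils loop looks roughly like the condensed sketch below; the timeout constants are hypothetical, and HBase actually probes for isFileClosed() via reflection since it only exists on newer Hadoop clients (HDFS-4525):

            import java.io.IOException;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DistributedFileSystem;

            // Condensed sketch of an FSHDFSUtils-style lease recovery loop.
            public final class LeaseRecoverySketch {
              private static final long TIMEOUT_MS = 60000; // hypothetical
              private static final long PAUSE_MS = 1000;    // hypothetical

              public static boolean recoverFileLease(DistributedFileSystem fs,
                  Path p) throws IOException, InterruptedException {
                boolean recovered = fs.recoverLease(p); // start recovery
                long start = System.currentTimeMillis();
                while (!recovered
                    && System.currentTimeMillis() - start < TIMEOUT_MS) {
                  Thread.sleep(PAUSE_MS);
                  // Poll isFileClosed() rather than calling recoverLease()
                  // again, which would preempt the recovery in progress.
                  recovered = fs.isFileClosed(p);
                }
                return recovered;
              }
            }
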
          Mark Miller added a comment -

          Here is a first patch.

          Mark Miller added a comment -

          Ignoring the error itself for a moment, the error message is also really confusing. It starts with "Failure to open existing log file" and then later in the same line is "Failed to close file."

          It's two different things. The high-level log item says that there was a "Failure to open existing log file". It shows the cause of this as RecoveryInProgressException, and that exception's message lets you know the problem was "Failed to close file," which is what happened. If the file had been properly closed previously (no kill -9, bug, OOM, etc.), you would not have an outstanding lease that causes this issue.

          Mark Miller added a comment -

          Patch adds a basic test to hit the lease recovery path.

          ASF subversion and git services added a comment -

          Commit 1655754 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1655754 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          ASF subversion and git services added a comment -

          Commit 1655756 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1655756 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          Mark Miller added a comment -

          The testing could be improved, but I have added a basic test to hit the code path and did a bunch of manual testing so that we can get this into 5.0.

          ASF subversion and git services added a comment -

          Commit 1655761 from Mark Miller in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1655761 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          Mark Miller added a comment -

          The lease renewer here can end up running through other tests because we don't shut it down.

          Mark Miller added a comment -

          I filed SOLR-7092 for that issue.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Anshum Gupta added a comment -

          Reopening for backporting to 4.10.4.

          Anshum Gupta added a comment -

          4.10.4 backport patch.

          Anshum Gupta added a comment -

          This adds a dependency to solr/core, so I'm not sure if we should be backporting this to a bug-fix release:

          <dependency org="commons-collections" name="commons-collections" rev="${/commons-collections/commons-collections}" conf="compile.hadoop"/>
          

          What's the general practice?

          Anshum Gupta added a comment -

          I'm opting not to backport this to the 4.10.x branch due to the added dependency.
          In case someone wants to use it, feel free to use the patch.


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 5
