Solr / SOLR-6969

When opening an HDFSTransactionLog for append we must first attempt to recover its lease to prevent data loss.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: hdfs
    • Labels: None

      Description

      This can happen after a hard crash and restart. The current workaround is to stop, wait it out, and start again. We should instead retry and wait a bounded amount of time, as we already do when we detect safe mode.
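
      For context, here is a minimal sketch of that retry-and-wait idea. It is hypothetical, not the committed fix, and it assumes the RecoveryInProgressException arrives wrapped in a RemoteException, as in the stack trace at the bottom of this issue:

          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.protocol.RecoveryInProgressException;
          import org.apache.hadoop.ipc.RemoteException;

          import java.io.IOException;

          class AppendRetry {
            // Retry the append while the NameNode reports that lease recovery
            // is still in progress, giving up after a bounded number of attempts.
            static FSDataOutputStream appendWithRetry(FileSystem fs, Path tlog, int maxAttempts)
                throws IOException, InterruptedException {
              for (int attempt = 1; ; attempt++) {
                try {
                  return fs.append(tlog);
                } catch (RemoteException e) {
                  IOException unwrapped = e.unwrapRemoteException(RecoveryInProgressException.class);
                  if (!(unwrapped instanceof RecoveryInProgressException) || attempt >= maxAttempts) {
                    throw e; // a different failure, or we have waited long enough
                  }
                  Thread.sleep(1000); // lease recovery still running; wait and retry
                }
              }
            }
          }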

      Attachments

      1. SOLR-6969.patch
        16 kB
        Mark Miller
      2. SOLR-6969.patch
        11 kB
        Mark Miller
      3. SOLR-6969-4.10.4-backport.patch
        9 kB
        Anshum Gupta

          Activity

          Anshum Gupta added a comment -

          I'm opting not to backport this to the 4.10.x branch due to the added dependency.
          If someone wants to use it, though, feel free to apply the patch.

          Anshum Gupta added a comment -

          This adds a dependency to solr/core, so I'm not sure we should be backporting it to a bug-fix release:

          <dependency org="commons-collections" name="commons-collections" rev="${/commons-collections/commons-collections}" conf="compile.hadoop"/>
          

          What's the general practice?

          Anshum Gupta added a comment -

          4.10.4 backport patch.

          Anshum Gupta added a comment -

          Reopening for backporting to 4.10.4.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Mark Miller added a comment -

          I filed SOLR-7092 for that issue.

          Mark Miller added a comment -

          The lease renewer started here can end up running through other tests because we never shut it down.

          ASF subversion and git services added a comment -

          Commit 1655761 from Mark Miller in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1655761 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          Mark Miller added a comment -

          The testing could be improved, but I have added a basic test to hit the code path and did a bunch of manual testing so that we can get this into 5.0.

          ASF subversion and git services added a comment -

          Commit 1655756 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1655756 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          ASF subversion and git services added a comment -

          Commit 1655754 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1655754 ]

          SOLR-6969: When opening an HDFSTransactionLog for append we must first attempt to recover it's lease to prevent data loss.

          Mark Miller added a comment -

          Patch adds a basic test to hit the lease recovery path.

          Mark Miller added a comment -

              Ignoring the error itself for a moment, the error message is also really confusing. It starts with "Failure to open existing log file" and then later in the same line is "Failed to close file."

          These are two different things. The high-level log entry says there was a "Failure to open existing log file". Its cause is shown as a RecoveryInProgressException, whose message lets you know that the underlying problem was "Failed to close file." Which is what happened: if the file had been properly closed previously (no kill -9, bug, OOM, etc.), you would not have an outstanding lease causing this issue.

          Mark Miller added a comment -

          Here is a first patch.

          Mark Miller added a comment -

          Colin pointed us to https://github.com/apache/hbase/blob/833feefbf977a8208f8f71f5f3bd9b027d54961f/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
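
          For reference, a condensed sketch of what that HBase utility does: explicitly ask the NameNode to start lease recovery, then poll until the file is reported closed. The method name here is hypothetical; DistributedFileSystem.recoverLease and isFileClosed are the HDFS calls involved:

              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.hdfs.DistributedFileSystem;

              import java.io.IOException;

              class LeaseRecovery {
                // Trigger lease recovery and poll until HDFS reports the file
                // closed, bounded by a caller-supplied timeout.
                static boolean recoverFileLease(DistributedFileSystem dfs, Path file, long timeoutMs)
                    throws IOException, InterruptedException {
                  long deadline = System.currentTimeMillis() + timeoutMs;
                  boolean recovered = dfs.recoverLease(file); // true if the file is already closed
                  while (!recovered && System.currentTimeMillis() < deadline) {
                    Thread.sleep(1000); // give the NameNode time to finish block recovery
                    recovered = dfs.isFileClosed(file);
                  }
                  return recovered;
                }
              }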
          Anshum Gupta added a comment -

          So it's certainly not for 5.0, right?

          Mark Miller added a comment -

          Part of the problem is that we want to diverge as little as possible from how things work with the local file system.

          Mark Miller added a comment -

          When configured properly, you can't have two Solr servers trying to access the same files. The index lock will prevent this before you ever get to the tlog files.

          I do think a long-term / full solution requires some discussion with the HDFS folks. I talked to Colin a little about it today. Sounds like it's kind of an ugly situation.

          Mike Drob added a comment -

          Is retrying always going to be safe? That works fine after we've lost a server and started a new one (albeit too quickly), but what about the case where two servers both think they are responsible for that tlog? This can happen if the original server partially dies but still has some threads that are doing work and haven't been cleaned up.

          Looking at how other projects handle similar issues: HBase moves the entire directory [1] to break any existing leases and ensure any other processes get kicked out. Maybe a retry is a good stop-gap, but is it going to be a full solution?

          [1]: https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java#L310
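
          A minimal sketch of that directory-move fencing, with a hypothetical tlog directory layout (FileSystem.rename is the only HDFS call assumed):

              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              import java.io.IOException;

              class TlogFencing {
                // Rename the tlog directory before replaying it; the idea is that
                // a half-dead writer can no longer reach the files at their old
                // path while we recover them.
                static Path fenceTlogDir(FileSystem fs, Path tlogDir) throws IOException {
                  Path fenced = new Path(tlogDir.getParent(), tlogDir.getName() + "-recovering");
                  if (!fs.rename(tlogDir, fenced)) { // a single metadata operation on the NameNode
                    throw new IOException("Could not fence off " + tlogDir);
                  }
                  return fenced;
                }
              }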

          Anshum Gupta added a comment -

          Mark Miller, do you intend to get this in for 5.0? I'm asking since this has been marked as critical for the release.

          Mark Miller added a comment -

          Safe mode can be turned on and off by the user, so we retry without limit. It seems like this should be a bounded wait, though.

          Mike Drob added a comment -

          Ignoring the error itself for a moment, the error message is also really confusing. It starts with "Failure to open existing log file" and then later in the same line is "Failed to close file."

          For SafeMode it looks like we retry indefinitely. Do we want to exhibit the same behaviour here?

          Mark Miller added a comment -

          Praneeth also mentions seeing AlreadyBeingCreatedException in SOLR-6367 - we should deal with that as well.

          Mark Miller added a comment -
          ERROR - 2015-01-12 17:49:43.992; org.apache.solr.common.SolrException; Failure to open existing log file (non fatal) hdfs://localhost:8020/solr_test/collection1/core_node1/data/tlog/tlog.0000000000000000000:org.apache.solr.common.SolrException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to close file /solr_test/collection1/core_node1/data/tlog/tlog.0000000000000000000. Lease recovery is in progress. Try again later.
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2626)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2462)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2700)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2663)
          	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:559)
          	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:388)
          	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:422)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
          
          	at org.apache.solr.update.HdfsTransactionLog.<init>(HdfsTransactionLog.java:121)
          	at org.apache.solr.update.HdfsUpdateLog.init(HdfsUpdateLog.java:190)
          	at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:134)
          	at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:94)
          

            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 7
