HBase
  1. HBase
  2. HBASE-3515

[replication] ReplicationSource can miss a log after RS comes out of GC

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.0
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is from Hudson build 1738, if a log is about to be rolled and the ZK connection is already closed then the replication code will fail at adding the new log in ZK but the log will still be rolled and it's possible that some edits will make it in.

      From the log:

      2011-02-08 10:21:20,618 FATAL [RegionServer:0;vesta.apache.org,46117,1297160399378.logRoller] regionserver.HRegionServer(1383):
      ABORTING region server serverName=vesta.apache.org,46117,1297160399378, load=(requests=1525, regions=12,
      usedHeap=273, maxHeap=1244): Failed add log to list
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for
      /1/replication/rs/vesta.apache.org,46117,1297160399378/2/vesta.apache.org%3A46117.1297160480509
      ...

      2011-02-08 10:21:22,444 DEBUG [MASTER_META_SERVER_OPERATIONS-vesta.apache.org:56008-0] wal.HLogSplitter(258):
      Splitting hlog 8 of 8: hdfs://localhost:55474/user/hudson/.logs/vesta.apache.org,46117,1297160399378/vesta.apache.org%3A46117.1297160480509, length=0

      2011-02-08 10:21:22,862 DEBUG [MASTER_META_SERVER_OPERATIONS-vesta.apache.org:56008-0] wal.HLogSplitter(436):
      Pushed=31 entries from hdfs://localhost:55474/user/hudson/.logs/vesta.apache.org,46117,1297160399378/vesta.apache.org%3A46117.1297160480509

      The easiest thing to do would be let the exception out and cancel the log roll.

      1. HBASE-3515-v2.patch
        3 kB
        Jean-Daniel Cryans
      2. HBASE-3515-v2-0.92.patch
        4 kB
        Jean-Daniel Cryans
      3. HBASE-3515.patch
        0.6 kB
        Jean-Daniel Cryans

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2398 (See https://builds.apache.org/job/HBase-TRUNK/2398/)
          HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC
          (fixing a missing handling of IOE in Replication)
          HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC

          jdcryans :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java

          jdcryans :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2398 (See https://builds.apache.org/job/HBase-TRUNK/2398/ ) HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC (fixing a missing handling of IOE in Replication) HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC jdcryans : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java jdcryans : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #96 (See https://builds.apache.org/job/HBase-0.92/96/)
          HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC

          jdcryans :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALActionsListener.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #96 (See https://builds.apache.org/job/HBase-0.92/96/ ) HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC jdcryans : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALActionsListener.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/Replication.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
          Hide
          Jean-Daniel Cryans added a comment -

          Committed to 0.92 and trunk, thanks for looking at my patch Stack!

          Show
          Jean-Daniel Cryans added a comment - Committed to 0.92 and trunk, thanks for looking at my patch Stack!
          Hide
          Jean-Daniel Cryans added a comment -

          And the patch for trunk.

          Show
          Jean-Daniel Cryans added a comment - And the patch for trunk.
          Hide
          stack added a comment -

          +1 on patch.

          Show
          stack added a comment - +1 on patch.
          Hide
          Jean-Daniel Cryans added a comment -

          Attaching the patch for 0.92, for trunk you just need to remove the adding of the IOE since it's already there.

          Show
          Jean-Daniel Cryans added a comment - Attaching the patch for 0.92, for trunk you just need to remove the adding of the IOE since it's already there.
          Hide
          Ted Yu added a comment -

          But for 0.92, we still have:

            public void logRolled(Path newFile);
          
          Show
          Ted Yu added a comment - But for 0.92, we still have: public void logRolled(Path newFile);
          Hide
          Jean-Daniel Cryans added a comment -

          It seems the code changed a lot because now in trunk postLogRoll throws an IOE that kills the server. All I need to do is to bubble it up!

          Show
          Jean-Daniel Cryans added a comment - It seems the code changed a lot because now in trunk postLogRoll throws an IOE that kills the server. All I need to do is to bubble it up!
          Hide
          Ted Yu added a comment - - edited

          +1 on stopping region server if this situation happens.

          Show
          Ted Yu added a comment - - edited +1 on stopping region server if this situation happens.
          Hide
          Jean-Daniel Cryans added a comment -

          To reiterate the problem, it's possible to not be able to add an HLog to replicate if the session is expired when log rolling. HLog currently doesn't get any feedback from the WALActionListeners, even if they fail at doing their job.

          One way of fixing it would be to throw an exception and stop the log rolling, but it means that if there's many listeners that some may already have processed the adding of the log. We could also kill the region server plain and simple if it happens.

          I'm in favor of the latter.

          Show
          Jean-Daniel Cryans added a comment - To reiterate the problem, it's possible to not be able to add an HLog to replicate if the session is expired when log rolling. HLog currently doesn't get any feedback from the WALActionListeners, even if they fail at doing their job. One way of fixing it would be to throw an exception and stop the log rolling, but it means that if there's many listeners that some may already have processed the adding of the log. We could also kill the region server plain and simple if it happens. I'm in favor of the latter.
          Hide
          Jean-Daniel Cryans added a comment -

          I'll try to fix it today and will punt if I don't go anywhere.

          Show
          Jean-Daniel Cryans added a comment - I'll try to fix it today and will punt if I don't go anywhere.
          Hide
          stack added a comment -

          @J-D What you thinking regards this issue now? Should it hold up 0.92?

          Show
          stack added a comment - @J-D What you thinking regards this issue now? Should it hold up 0.92?
          Hide
          Jean-Daniel Cryans added a comment -

          Punting, both a hackish fix and a good fix would be very involved.

          Show
          Jean-Daniel Cryans added a comment - Punting, both a hackish fix and a good fix would be very involved.
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #1795 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1795/)
          Reverting HBASE-3515

          Show
          Hudson added a comment - Integrated in HBase-TRUNK #1795 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1795/ ) Reverting HBASE-3515
          Hide
          Jean-Daniel Cryans added a comment -

          Thanks for reminding me my lazyness Todd, trunk is now clean.

          Show
          Jean-Daniel Cryans added a comment - Thanks for reminding me my lazyness Todd, trunk is now clean.
          Hide
          Todd Lipcon added a comment -

          hmm, this was reverted from branch but not trunk? is that right?

          Show
          Todd Lipcon added a comment - hmm, this was reverted from branch but not trunk? is that right?
          Hide
          Jean-Daniel Cryans added a comment -

          I'm thinking of a more radical way of solving this issue, considering that the problem is that the RS is able to roll a log even tho we already lost our session, I'm thinking that we should call fs.close() from inside HRS.abort() thus preventing any other call from reaching HDFS. The downside is that it's going to make a big BOOOM and every call to close regions will fail in the ugliest fashion.

          Show
          Jean-Daniel Cryans added a comment - I'm thinking of a more radical way of solving this issue, considering that the problem is that the RS is able to roll a log even tho we already lost our session, I'm thinking that we should call fs.close() from inside HRS.abort() thus preventing any other call from reaching HDFS. The downside is that it's going to make a big BOOOM and every call to close regions will fail in the ugliest fashion.
          Hide
          Jean-Daniel Cryans added a comment -

          I mistakenly thought that hlog.closed was set when the region server was on it's way out, but I'm wrong... it's just set when the HLog is closing. Need to find an alternative.

          Show
          Jean-Daniel Cryans added a comment - I mistakenly thought that hlog.closed was set when the region server was on it's way out, but I'm wrong... it's just set when the HLog is closing. Need to find an alternative.
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #1739 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1739/)
          HBASE-3515 [replication] ReplicationSource can miss a log after RS comes
          out of GC

          Show
          Hudson added a comment - Integrated in HBase-TRUNK #1739 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1739/ ) HBASE-3515 [replication] ReplicationSource can miss a log after RS comes out of GC
          Hide
          Jean-Daniel Cryans added a comment -

          Committed to branch and trunk.

          Show
          Jean-Daniel Cryans added a comment - Committed to branch and trunk.
          Hide
          Jean-Daniel Cryans added a comment -

          Yes, I was planning to putting this in branch and trunk.

          Show
          Jean-Daniel Cryans added a comment - Yes, I was planning to putting this in branch and trunk.
          Hide
          stack added a comment -

          +1 Is it needed in 0.90 too?

          Show
          stack added a comment - +1 Is it needed in 0.90 too?
          Hide
          Jean-Daniel Cryans added a comment -

          This patch simply adds a check if the region server is going down before closing the file. In the replication case it fixes the issue, since if it fails it will set that flag to true.

          The issue with throwing an exception on i.logRolled(newPath) is that since there's potentially many of them, throwing midway would mean that you have to implement a roll back. There's nothing at the moment that requires that level of complexity.

          Show
          Jean-Daniel Cryans added a comment - This patch simply adds a check if the region server is going down before closing the file. In the replication case it fixes the issue, since if it fails it will set that flag to true. The issue with throwing an exception on i.logRolled(newPath) is that since there's potentially many of them, throwing midway would mean that you have to implement a roll back. There's nothing at the moment that requires that level of complexity.

            People

            • Assignee:
              Jean-Daniel Cryans
              Reporter:
              Jean-Daniel Cryans
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development