Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.5, 1.7.1
    • Fix Version/s: 1.6.6, 1.7.2
    • Component/s: gc
    • Labels: None

      Description

      Ran into a situation where, right after a tserver died, the master started logging "Unable to initiate log sort" because the WAL could not be found. The WAL was for a tablet in the metadata table, with a key extent like !0;t;endrow;prevrow, hosted by a tabletserver like tserver1. The log sort happens in the master before a tablet can be reassigned and brought back online, so Accumulo is in a really bad state when this happens: the tablet stays unhosted until someone intervenes manually. Luckily, the WAL was in the HDFS trash and could be copied back to the expected location.

      Piecing together the logs showed something like the following

      This suggests that the GarbageCollectWriteAheadLogs process is too aggressively removing WALs for a server it thinks is dead but that may actually still be doing work. The tserver was under heavy load before it went down.

      Studying the logs with Keith Turner and brainstorming, here are some things that could be fixed/checked

      1. When the gc doesn't see a reference to a WAL in the metadata table, it asks the tablet server to delete the log. The gc process then logs at DEBUG that the WAL was deleted, regardless of whether it actually was. Maybe change the message to "asking tserver to delete WAL" or something similar. We found these messages in the gc log 45 minutes before this event, and they were misleading: further investigation shows the tserver logs "Deleting wal" when a WAL is truly deleted, and there were no such messages in the tserver log 45 minutes earlier, indicating the WAL was not actually deleted.
      2. The gc logs "Removing WAL for offline" at DEBUG. These messages can roll off pretty quickly, so change that to INFO. This will keep history around longer to aid troubleshooting.
      3. Verify that "adding 1 logs for extent" is using the srv:lock column to enforce the constraint. It looks like it is, but if zooLock is null in the MetadataTableUtil update, badness may be happening.
      4. In the gc, maybe keep a map of the first time we see that a tablet server is down, and don't actually remove WALs for offline tablet servers until they have been down for an hour or so; see the sketch after this list. We would need to make sure that map is cleared when a tserver comes back online.
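
      Regarding item 4, here is a minimal sketch of what such a grace-period map could look like. The class and method names and the one-hour threshold are assumptions for illustration, not the actual Accumulo GC code.

          import java.util.HashMap;
          import java.util.Map;
          import java.util.concurrent.TimeUnit;

          // Hypothetical sketch for item 4: only allow WAL removal for a tserver that
          // has been considered down for longer than a grace period. Names and the
          // one-hour value are illustrative assumptions.
          public class DeadTserverGracePeriod {

            private static final long GRACE_MILLIS = TimeUnit.HOURS.toMillis(1);

            // tserver address -> first time the gc noticed the server was down
            private final Map<String,Long> firstSeenDown = new HashMap<>();

            /** Returns true only once the server has been down longer than the grace period. */
            public boolean okToRemoveWals(String server, long nowMillis) {
              Long first = firstSeenDown.putIfAbsent(server, nowMillis);
              return first != null && nowMillis - first > GRACE_MILLIS;
            }

            /** Clear the entry when a tserver is seen alive again, as item 4 requires. */
            public void serverSeenAlive(String server) {
              firstSeenDown.remove(server);
            }
          }

      The putIfAbsent call records the first sighting of a down server, so its WALs only become eligible for removal after the full grace period, and serverSeenAlive clears the entry if the tserver comes back online.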

        Issue Links

          Activity

          busbey Sean Busbey added a comment -

          For folks who hit this issue, check for the WAL in the HDFS trash and restore it if possible. If the missing WAL has already been GCd from the HDFS trash, then you will have data loss. You can restore the remaining data in the impacted tablets (minus whatever data was lost along with the WALs) by giving the recovery system an empty WAL to recover.

          Remember that following these instructions only restores the functionality of impacted tablets; any missing data in the WAL will be permanently lost once the recovery system thinks it has processed the missing WAL files.

          For systems that have ACCUMULO-3182 in place, you can just place a 0-length file in place of each missing WAL.

          Given a repeating set of log messages in the master about recovering a given file:

          2017-02-22 14:59:32,183 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://nameservice1.example.com:8020/accumulo/wal/tserver-4.example.com+10011/26abec5b-63e7-40dd-9fa1-b8ad2436606e (in : 300s), tablet 2<;user515 holds a reference
          

          Placing a 0-length file is done like this, presuming an edge node with the appropriate cluster configs and an Accumulo service user named "accumulo":

          # sudo -s -u accumulo
          [accumulo@gateway-1 ~]$ hdfs dfs -touchz /accumulo/wal/tserver-4.example.com+10011/26abec5b-63e7-40dd-9fa1-b8ad2436606e
          [accumulo@gateway-1 ~]$
          

          Presuming this has been failing for a while, the master will have backed off retrying (i.e. the "in : 300s" part of the recovery message). To avoid waiting for the next retry, you can restart the master process. If there is more than one WAL, you'll want to repeat the touchz command for each one prior to restarting the master process.

          If your Accumulo version does not contain ACCUMULO-3182, then you should follow the workaround in that issue for creating a complete WAL file with 0 entries. If you aren't sure, it's safe to try the above 0-length-file approach first. If your version lacks ACCUMULO-3182, the recovery process will fail with an EOFException instead of recovering the WAL and bringing the tablet online. When that happens, delete the 0-length file and follow the workaround from ACCUMULO-3182.

          elserj Josh Elser added a comment -

          Only change in core was adding a property

          Ok, thanks! Looks like I got through the core tests fine this time (after killing those processes). Thanks for the responsiveness.

          mjwall Michael Wall added a comment -

          Only change in core was adding a property https://github.com/apache/accumulo/commit/e0426c51cd6991741e9be321aaa1e4f5361e0e3e#diff-8779c197f6f1473081f0799198733022

          Let me know if you still get issues.

          elserj Josh Elser added a comment -

          Hrm, I did have two rogue java processes sitting around (thanks, jenkins). Let me see if re-running them passes.

          elserj Josh Elser added a comment -

          Michael Wall, not sure if it's my server or your changes, but I just had UT failures across 1.6 and 1.7 after these changes went in:

          https://secure.penguinsinabox.com/jenkins/job/Accumulo-Test-1.6/1053/console
          https://secure.penguinsinabox.com/jenkins/job/Accumulo-Test-1.6-Hadoop-1/1036/console
          https://secure.penguinsinabox.com/jenkins/job/Accumulo-Test-1.7/231/console

          Did you change anything in accumulo-core (or in an earlier module) that might have affected heap usage?

          mjwall Michael Wall added a comment -

          Fix merged to 1.6 and 1.7. Merged without changes into 1.8 and master.

          mjwall Michael Wall added a comment -

          So I have been unable to replicate the bug in 1.8. I still want to do more testing on 1.8, but I'll write another ticket for that. I'll merge my change to 1.6, then merge 1.6 into 1.7, then merge 1.7 into 1.8 with -sours, and finally merge 1.8 into master. Then I'll close this ticket.

          elserj Josh Elser added a comment -

          If we're not doing that, we should

          We do.

          busbey Sean Busbey added a comment -

          The timing is a little tricky here and I have been unable to reproduce in a test.

          • tserver 1 becomes unresponsive, master declares it dead and removes the zk lock.

          Before removing the zk lock, the master should be recovering the lease on the WAL to prevent further writes. If we're not doing that, we should. Beyond this issue, there are other ways to lose data if we don't do that (like a tserver handling recovery and splitting a log that then gets more writes). If HDFS is allowing further writes after the lease is lost, that's a way bigger problem.
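
          For reference, here is a minimal sketch of what recovering the HDFS lease on a WAL could look like using the public DistributedFileSystem API. It illustrates the general technique only and is not the actual master code path.

              import java.io.IOException;

              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.hdfs.DistributedFileSystem;

              public class WalLeaseRecoverySketch {
                /**
                 * Ask the NameNode to recover the lease on a WAL so the presumed-dead tserver
                 * can no longer append to it. Blocks until the file is closed.
                 */
                public static void recoverLease(FileSystem fs, Path wal) throws IOException, InterruptedException {
                  if (!(fs instanceof DistributedFileSystem)) {
                    return; // non-HDFS filesystems have no lease to recover
                  }
                  DistributedFileSystem dfs = (DistributedFileSystem) fs;
                  // recoverLease returns true once the file is closed; poll until then
                  while (!dfs.recoverLease(wal)) {
                    Thread.sleep(1000);
                  }
                }
              }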

          ecn Eric Newton added a comment -

          It is possible, since so much of the WAL recovery metadata management has changed in 1.8, that the bug no longer exists in 1.8.

          mjwall Michael Wall added a comment -

          It does help, thanks Josh Elser. Still thinking through this

          elserj Josh Elser added a comment -

          I merged (but didn't push) my fix for ACCUMULO-4157 to 1.6 and then 1.7

          One thing you could consider is a merge -sours into 1.8 and we could just do a fresh commit for 1.8 with your changes. That would at least let you move on past 1.6 and 1.7.

          But when I got to 1.8, that class has changed quite a bit. Mostly from Eric's ticket

          I'm trying to remember what exactly changed WRT cleaning up WALs. The original change is that each tablet row in metadata no longer contains pointers to any WALs. Instead, WALs are tracked, by server, in ZooKeeper. When the master notices that a server died, it needs to remove the current assignment (the "loc" column) from that tablet (row). During that update, it will also copy up the necessary WAL references from ZK into the relevant tablet (row) for recovery to read as it always has. This was originally all tracked in a new rowspace in metadata, but, while chatting with Eric Newton, he told me that it was difficult to do correctly because you might need WALs to bring the metadata table online while the WAL you need to recover is only stored in the metadata table (oops).

          Anyways, I believe the only big change in GCWAL would be determining what the "current" WALs are. This is now a two-step check: read ZK for the WALs in use by servers as well as read all the tablets to see if any are using a WAL for recovery.
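
          To make that two-step check concrete, a purely illustrative sketch follows; the two helper methods are hypothetical placeholders, not actual Accumulo 1.8 APIs.

              import java.util.HashSet;
              import java.util.Set;
              import java.util.UUID;

              // Illustrative only: combine the two sources of "current" WALs described above.
              // walsInUsePerZooKeeper and walsReferencedForRecovery are hypothetical placeholders.
              public class CurrentWalsSketch {

                public Set<UUID> currentWals() {
                  Set<UUID> current = new HashSet<>();
                  // Step 1: WALs that live tservers have registered in ZooKeeper
                  current.addAll(walsInUsePerZooKeeper());
                  // Step 2: WALs copied into tablet rows in the metadata table for recovery
                  current.addAll(walsReferencedForRecovery());
                  return current;
                }

                private Set<UUID> walsInUsePerZooKeeper() {
                  return new HashSet<>(); // placeholder
                }

                private Set<UUID> walsReferencedForRecovery() {
                  return new HashSet<>(); // placeholder
                }
              }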

          Does this help?

          mjwall Michael Wall added a comment -

          I merged (but didn't push) my fix for ACCUMULO-4157 to 1.6 and then 1.7. The only difference in GarbageCollectWriteAheadLogs.java between 1.6 and 1.7 was the addition of replication. But when I got to 1.8, that class has changed quite a bit. Mostly from Eric's ticket https://issues.apache.org/jira/browse/ACCUMULO-3423.

          I am trying to determine if the issue I was fixing is still an issue in 1.8, where the GCWAL simply deletes a WALOG for a downed tserver. I see this code is still present, which was the problem in 1.6. See https://github.com/apache/accumulo/blob/1.8/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L241

                try {
                  if (!useTrash || !fs.moveToTrash(path)) {
                    fs.deleteRecursively(path);
                  }
                  status.currentLog.deleted++;
                } catch (FileNotFoundException ex) {
                  // ignored
                } catch (IOException ex) {
                  log.error("Unable to delete wal " + path + ": " + ex);
                }
          

          But in 1.6 and 1.7, that was only called for a WAL from a "dead" tserver. Still working through what gets to this code. Anyone have any thoughts on this? Josh Elser Keith Turner

          mjwall Michael Wall added a comment -

          I agree, it is important, as anyone running Accumulo today is subject to data loss and corruption. The change I have after talking with Keith is a pretty minor one. I've been stuck on testing, as the current GarbageCollectWriteAheadLogTest uses a lot of mocking and many of the methods are private. I'll push up a PR tomorrow morning and make another ticket to refactor this class so we can better test it. My vote is to get it out as soon as possible.

          mdrob Mike Drob added a comment -

          How do you feel about bumping this to 1.7.3? This is an important issue to take care of, but it also seems incredibly delicate and may take a long time to get right.

          mjwall Michael Wall added a comment -

          This was found on a live cluster. It has been there for a long time.

          mdrob Mike Drob added a comment -

          Michael Wall - Do you have any suspicions on what caused this issue? Is this a regression from something?

          mjwall Michael Wall added a comment -

          The timing is a little tricky here and I have been unable to reproduce in a test.

          • tserver 1 becomes unresponsive, master declares it dead and removes the zk lock.
          • tserver 1 continues working for 8 more minutes and continues to write data to the WAL
          • The WAL gc comes along; there is no zk lock for the server, so the gc process deletes the file
          • tserver 2 gets assigned the tablet from the master and tries to replay the WAL shown in the metadata, but the file is gone

          Another issue is that the log messages are misleading, which makes this harder to trace.

          There is definitely a chance of data loss with this bug if you don't recover the deleted WAL.

          busbey Sean Busbey added a comment -

          Shouldn't recovering the file have included doing an HDFS lease recovery, which prevents tserver1 from writing more data?

          mjwall Michael Wall added a comment -

          Yeah, that is a great question Drew Farris. Maybe the first commit is to prevent the GC from deleting a WAL at all.

          drew.farris Drew Farris added a comment -

          I'm wondering if it is ever appropriate for GC to remove a write-ahead log for an offline tablet (or a tserver that's down) - doesn't this almost always indicate a problem that will require recovery?


            People

            • Assignee: mjwall Michael Wall
            • Reporter: mjwall Michael Wall
            • Votes: 0
            • Watchers: 6


                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 7h
