Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-3772

WAL prematurely deleted for root table

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: gc, tserver
    • Labels:
    • Environment:

      SHA1 2436eec237dc23608e91cae56777a323e2f77295 (I believe). Running with Kerberos on. HDFS 2.7.1-SNAPSHOT

      Description

      I have a 5-node test system which I ran all of the ITs against.

      When I came back to the cluster later, I observed the following:

      • Multiple running MinCs against accumulo.metadata that weren't running
      • accumulo.root without an online tablet
      • Multiple queued FATE ops (PopulateMetadata blocked).

      Ultimately, found that recovery was repeatedly failing (without any notification on the monitor, mind you).

      2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-
      f295-40d7-888e-f4b7dde1f966
      2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-
      f90d-4542-ab24-3ebdfa5d6fe4
      2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-
      f19e-4f43-bfc6-82f978b8ccb2
      2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
      2015-05-04 21:44:16,152 [master.Master] DEBUG: Root Table location State: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
      2015-05-04 21:44:16,152 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-4
      0d7-888e-f4b7dde1f966 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/fc0e8426-f295-40d7-888e-f4b7dde1f966
      2015-05-04 21:44:16,171 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4
      542-ab24-3ebdfa5d6fe4 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
      2015-05-04 21:44:16,174 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4
      f43-bfc6-82f978b8ccb2 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
      

      Note 088e3c39-f19e-4f43-bfc6-82f978b8ccb2. Eventually we see that the file is gone:

      2015-05-04 21:45:15,647 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:666)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
      

      The most likely candidate to direct hatred towards is the GC

      [root@os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2 accumulo]# fgrep 088e3c39-f19e-4f43-bfc6-82f978b8ccb2 gc_os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal.debug.log
      2015-05-04 08:30:15,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
      2015-05-04 08:30:15,223 [gc.GarbageCollectWriteAheadLogs] DEBUG: WAL not needed for replication hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
      2015-05-04 08:30:15,268 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing unused WAL for server os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000] log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
      

      Sure enough, the GC deleted the file. I'll be digging from here...

        Attachments

        1. ACCUMULO-3772.patch
          18 kB
          Josh Elser

          Issue Links

            Activity

              People

              • Assignee:
                elserj Josh Elser
                Reporter:
                elserj Josh Elser
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m