HBase / HBASE-20330

ProcedureExecutor.start() gets stuck in recover lease on store.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-beta-2
    • Fix Version/s: 2.0.0
    • Component/s: proc-v2
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We have an instance in our internal testing where the master log is filling up with the following messages:

      2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log
      2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log after 1ms
      2018-04-02 17:11:17,574 WARN org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log - Invalid Trailer version. got 111 expected 1
      2018-04-02 17:11:17,576 ERROR org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with id=19 already exists
      org.apache.hadoop.fs.FileAlreadyExistsException: /hbase/MasterProcWALs/pv2-00000000000000000019.log for client 10.17.202.11 already exists
              at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
      

      Debugging further with appy, avirmani and xiaochen, we found that when WALProcedureStore#rollWriter() fails and returns false for some reason, the lease recovery keeps looping continuously, so ProcedureExecutor.start() never returns.
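
      To illustrate the failure mode, here is a minimal sketch of a retry loop that gives up after a fixed number of attempts instead of spinning forever. It is not the committed patch and does not use HBase classes: RecoverLeaseSketch, recoverLeaseBounded and maxAttempts are hypothetical names, and the BooleanSupplier only stands in for a call like WALProcedureStore#rollWriter() that returns false on failure.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in HBase.
final class RecoverLeaseSketch {

    /**
     * Retries rollWriter until it succeeds or maxAttempts is reached.
     * rollWriter stands in for a call that returns false on failure.
     * Returns the 1-based attempt number that succeeded.
     * Throws IllegalStateException if every attempt fails, so the caller
     * can surface the error instead of looping forever.
     */
    static int recoverLeaseBounded(BooleanSupplier rollWriter, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (rollWriter.getAsBoolean()) {
                return attempt; // writer rolled, lease recovery is done
            }
            // An unbounded loop would simply retry here; if the failure is
            // permanent (for example the next log file already exists),
            // that retry never terminates.
        }
        throw new IllegalStateException(
            "rollWriter failed " + maxAttempts + " times; giving up");
    }

    public static void main(String[] args) {
        // A roll that succeeds on the second try.
        int[] calls = {0};
        int used = recoverLeaseBounded(() -> ++calls[0] >= 2, 5);
        System.out.println("succeeded on attempt " + used);

        // A roll that always fails now ends with an exception.
        try {
            recoverLeaseBounded(() -> false, 3);
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

      Bounding the attempts is only one possible way out of the loop; re-reading the log directory before each retry, so that the next log id is recomputed, would be another. Which approach the attached patches take is not stated in this description.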

      Attachments

        1. hbase-20330.master.001.patch
          1 kB
          Umesh Agashe
        2. hbase-20330.master.002.patch
          5 kB
          Umesh Agashe
        3. hbase-20330.master.003.patch
          5 kB
          Umesh Agashe
        4. hbase-20330.master.004.patch
          5 kB
          Umesh Agashe
        5. hbase-20330.master.005.patch
          6 kB
          Umesh Agashe

            People

              Assignee: uagashe Umesh Agashe
              Reporter: uagashe Umesh Agashe