Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-8749

Potential race condition between FSUtils.renameAndSetModifyTime() and HFile/LogCleaner

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.95.1
    • 0.98.0, 0.95.2, 0.94.9
    • None
    • None

    Description

      We recently saw the following running 0.95.1 on hadoop 2.0:

      2013-06-11 21:51:22,199 ERROR [IPC Server handler 18 on 60000] master.HMaster: Region server ip-10-138-2-28.ec2.internal,60020,1370887927492 reported a fatal error:
      ABORTING region server ip-10-138-2-28.ec2.internal,60020,1370887927492: IOE in log roller
      Cause:
      java.io.FileNotFoundException: File/Directory /hbase/.oldlogs/ip-10-138-2-28.ec2.internal%2C60020%2C1370887927492.1370986265468 does not exist.
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimesInt(FSNamesystem.java:1488)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1453)
      	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:798)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:704)
      	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:43194)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:910)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1694)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1690)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1367)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1688)
      
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
      	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
      	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
      	at org.apache.hadoop.hdfs.DFSClient.setTimes(DFSClient.java:2097)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.setTimes(DistributedFileSystem.java:813)
      	at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1596)
      	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:705)
      	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:595)
      	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:536)
      	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
      	at java.lang.Thread.run(Thread.java:722)
      Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File/Directory /hbase/.oldlogs/ip-10-138-2-28.ec2.internal%2C60020%2C1370887927492.1370986265468 does not exist.
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimesInt(FSNamesystem.java:1488)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1453)
      	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:798)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:704)
      	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:43194)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:910)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1694)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1690)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1367)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1688)
      
      	at org.apache.hadoop.ipc.Client.call(Client.java:1164)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
      	at com.sun.proxy.$Proxy11.setTimes(Unknown Source)
      	at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
      	at com.sun.proxy.$Proxy11.setTimes(Unknown Source)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setTimes(ClientNamenodeProtocolTranslatorPB.java:685)
      	at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
      	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:266)
      	at com.sun.proxy.$Proxy12.setTimes(Unknown Source)
      	at org.apache.hadoop.hdfs.DFSClient.setTimes(DFSClient.java:2095)
      	... 7 more
      
      2013-06-11 21:51:32,976 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [ip-10-138-2-28.ec2.internal,60020,1370887927492]
      

      One scenario that could have led to the above exception was that the LogCleaner thread deleted the log file which had just been moved to .oldlogs dir before the call to fs.setTimes() executed.
      When fs.setTimes() ran, the file was no longer there.

      Attachments

        1. HBASE-8749-v0.patch
          0.9 kB
          Matteo Bertozzi
        2. HBase-8749-0.94.txt
          0.8 kB
          Liang Xie

        Issue Links

          Activity

            People

              mbertozzi Matteo Bertozzi
              yuzhihong@gmail.com Ted Yu
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: