ACCUMULO-3575: Accumulo GC ran out of memory

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      During a CI run (with agitation) on a 20-node EC2 cluster, the Accumulo GC died with the following errors.

      The following was in the GC .out file:

      #
      # java.lang.OutOfMemoryError: Java heap space
      # -XX:OnOutOfMemoryError="kill -9 %p"
      #   Executing /bin/sh -c "kill -9 20970"...
      

      The following were the last lines of the .log file:

      2015-02-10 20:19:03,255 [gc.SimpleGarbageCollector] INFO : Collect cycle took 13.07 seconds
      2015-02-10 20:19:03,258 [gc.SimpleGarbageCollector] INFO : Beginning garbage collection of write-ahead logs
      2015-02-10 20:19:03,265 [zookeeper.ZooUtil] DEBUG: Trying to read instance id from hdfs://ip-10-1-2-11:9000/accumulo/instance_id
      

      Restarted the GC and the same thing happened. Looked in the walog directory and saw there were 333K walogs. This is the problem: the GC tries to read the entire list of files into memory (see the sketch after the listing below).

      $ hadoop fs -ls -R /accumulo/wal | wc
      15/02/10 20:31:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
       333053 2664424 43629314
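
      To illustrate the memory pattern (a hypothetical sketch, not the actual SimpleGarbageCollector code), holding every FileStatus from the WAL tree in one list scales linearly with file count; each entry carries a path string, owner, group, timestamps, etc., so 333K entries can overwhelm a small heap before any real collection work starts:

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class WalListingSketch {
        // Hypothetical illustration of the failure mode: materialize the
        // entire /accumulo/wal listing (one directory per tserver, one file
        // per walog) into a single in-memory list.
        public static List<FileStatus> listAllWals(FileSystem fs, Path walRoot) throws IOException {
          List<FileStatus> all = new ArrayList<>();
          for (FileStatus server : fs.listStatus(walRoot)) {        // one dir per tserver
            for (FileStatus wal : fs.listStatus(server.getPath())) {
              all.add(wal);                                         // whole listing resident at once
            }
          }
          return all;
        }
      }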
      

      I suspect the reason there were so many walogs is that there were many, many failures like the following, which resulted in zero-length walogs (only 199 of the 333K have non-zero length; a counting sketch follows the stack trace below). The following error is from a tserver, and is probably a result of killing datanodes.

      2015-02-10 03:45:00,447 [log.TabletServerLogger] ERROR: Unexpected error writing to log, retrying attempt 122
      java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only be replicated to 0 nodes instead of minReplication (=1).  There are 16 datanode(s) running and no node(s) are excluded in this operation.
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
      
              at org.apache.accumulo.tserver.log.TabletServerLogger.createLoggers(TabletServerLogger.java:190)
              at org.apache.accumulo.tserver.log.TabletServerLogger.access$300(TabletServerLogger.java:53)
              at org.apache.accumulo.tserver.log.TabletServerLogger$1.withWriteLock(TabletServerLogger.java:148)
              at org.apache.accumulo.tserver.log.TabletServerLogger.testLockAndRun(TabletServerLogger.java:115)
              at org.apache.accumulo.tserver.log.TabletServerLogger.initializeLoggers(TabletServerLogger.java:137)
              at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:245)
              at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:230)
              at org.apache.accumulo.tserver.log.TabletServerLogger.log(TabletServerLogger.java:345)
              at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.update(TabletServer.java:1817)
              at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
              at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47)
              at com.sun.proxy.$Proxy22.update(Unknown Source)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2394)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2378)
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
              at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
              at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
              at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
              at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
              at java.lang.Thread.run(Thread.java:744)
      Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only be replicated to 0 nodes instead of minReplication (=1).  There are 16 datanode(s) running and no node(s) are excluded in this operation.
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
      
              at org.apache.hadoop.ipc.Client.call(Client.java:1468)
              at org.apache.hadoop.ipc.Client.call(Client.java:1399)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
              at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
              at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
              at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
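
      To double-check the zero-length claim, the recursive listing can be streamed instead of buffered (a hypothetical verification sketch, not part of this issue):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.LocatedFileStatus;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.RemoteIterator;

      public class CountEmptyWals {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          long empty = 0, nonEmpty = 0;
          // listFiles(..., true) returns a RemoteIterator, so the walk uses
          // constant memory no matter how many walogs exist.
          RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/accumulo/wal"), true);
          while (it.hasNext()) {
            if (it.next().getLen() == 0) empty++; else nonEmpty++;
          }
          System.out.printf("empty=%d nonEmpty=%d%n", empty, nonEmpty);
        }
      }

      The same iterator-based approach is the obvious direction for the GC itself: process walog candidates incrementally rather than buffering the whole tree.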
      

      Upped the GC max memory from 256M to 2G and it ran OK.
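
      For reference, in 1.6 the GC heap is set via ACCUMULO_GC_OPTS in accumulo-env.sh, so the change amounts to something like the following (exact defaults vary by which example config was copied):

      test -z "$ACCUMULO_GC_OPTS" && export ACCUMULO_GC_OPTS="-Xmx2g"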

People

    Assignee: Unassigned
    Reporter: Keith Turner (kturner)