ACCUMULO-939: Writing to HDFS gets stuck


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4.3, 1.5.0
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None
    • Environment: hadoop-1.0.1, hadoop-1.1.2 / accumulo 1.4.3

    Description

      Attempting to test ACCUMULO-575 with the following test framework:

      Test bench:
      1 node running the hadoop namenode and 1 datanode
      1 slave node running 1 datanode and the accumulo stack, with an 8GB in-memory map (see the config sketch after the patch below)
      Running a patched version of accumulo with the following patch to provide helper debug output:

      Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java
      ===================================================================
      --- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(revision 1429057)
      +++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java	(working copy)
      @@ -81,6 +81,7 @@
         private FileSystem fs;
         protected KeyExtent extent;
         private List<IteratorSetting> iterators;
      +  protected boolean minor= false;
         
         Compactor(Configuration conf, FileSystem fs, Map<String,DataFileValue> files, InMemoryMap imm, String outputFile, boolean propogateDeletes,
             TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List<IteratorSetting> iterators) {
      @@ -158,7 +159,7 @@
               log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex);
               throw ex;
             }
      -      
      +      log.info("Just completed minor? " + minor + " for table " + extent.getTableId());
             log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(),
                 majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0));
             
      Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java
      ===================================================================
      --- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(revision 1429057)
      +++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java	(working copy)
      @@ -88,6 +88,7 @@
           
           do {
             try {
      +        this.minor = true;
               CompactionStats ret = super.call();
               
               // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted,
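
      For context, the 8GB in-memory map mentioned in the test bench above would be set along these lines in accumulo-site.xml (this is my assumption about how it was configured; the actual config files are not included in this report):

        <property>
          <name>tserver.memory.maps.max</name>
          <value>8G</value>
        </property>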
      

      I stood up a new instance and created a table named test, then ran the following -

      tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh 

      where ifttt.sh is

       #!/bin/sh
       
       # grab the pid of the local datanode process
       dnpid=`jps -m | grep DataNode | awk '{print $1}'`
       
       # loop forever over log lines piped in from tail -f
       while true; do
         # with no argument, `[ -e $1 ]` reduces to `[ -e ]`, which is true,
         # so each line is read from stdin (the tail -f pipe)
         if [ -e $1 ] ;then read str; else str=$1;fi
         # once the helper debug line shows a minor compaction finished for
         # table 2, kill the local datanode
         if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then
           echo "I'm gonna kill datanode, pid $dnpid"
           kill -9 $dnpid
         fi
       done
      

      Then I ran the following -

      accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10 

      Eventually the in-memory map filled, a minor compaction happened, the local datanode was killed, and things died. The logs filled with -

       org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/08b9f1b4-26d5-4b07-a260-3334c2013576 could only be replicated to 0 nodes, instead of 1
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556)
      	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
      	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:616)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:416)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
      

      and

      Unexpected error writing to log, retrying attempt 1
      	java.io.IOException: DFSOutputStream is closed
      		at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3666)
      		at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
      		at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229)
      		at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330)
      ... repeats...
      


      Bringing the datanode back up did NOT fix it, either.
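
      As a sanity check when the datanode is brought back, something like the following (generic hadoop 1.x commands, not part of the original test run) should show whether the namenode sees a live datanode again and whether blocks are still under-replicated:

       hadoop dfsadmin -report            # live/dead datanode counts and capacity
       hadoop fsck / -files -blocks       # missing or under-replicated blocks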

      UPDATE: I reran the test and never killed the datanode, and it still died. So this isn't an issue with my datanode killing; it's something with hadoop 1.0.1 and the new write-ahead logs.
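
      One thing worth double-checking on hadoop-1.0.1 (an assumption on my part, not something verified above): the HDFS write-ahead logs depend on working sync/append support, which hadoop 1.0.x does not enable by default. The relevant hdfs-site.xml setting would look like:

        <property>
          <name>dfs.support.append</name>
          <value>true</value>
        </property>

      (On hadoop 1.1.x, I believe the equivalent knob is dfs.durable.sync.)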

      Attachments

        1. tserver_jstack (446 kB, John Vines)


            People

              Assignee: Unassigned
              Reporter: John Vines (vines)
              Votes: 0
              Watchers: 5
