Accumulo / ACCUMULO-1685

bench testing shows that the NN loses the WAL



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: tserver
    • Labels:
    • Environment:

      Hadoop 1.0.4, single-node development system


      While bench testing, I built Accumulo:

      $ mvn -Pnative package -DskipTests

      I went into the assembly area, then configured and started Accumulo:

      $ cd assemble/target/accumulo-1.6.0-SNAPSHOT-dev/accumulo-1.6.0-SNAPSHOT
      $ cp ~/conf/* conf
      $ hadoop fs -rmr /accumulo
      Moved to trash: hdfs://somehost:9000/accumulo
      $ ( echo test ; echo Y ; echo secret ; echo secret ) | ./bin/accumulo init
      $ 2013-09-04 12:23:51,558 [util.Initialize] INFO : Hadoop Filesystem is hdfs://somehost:9000
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Accumulo data dirs are [hdfs://somehost:9000/accumulo]
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Zookeeper server is localhost:2181
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Checking if Zookeeper is available. If this hangs, then you need to make sure zookeeper is running
      Instance name : test
      Instance name "test" exists. Delete existing entry from zookeeper? [Y/N] : Y
      Enter initial password for root (this may not be applicable for your security setup): ******
      Confirm initial password for root: ******
      $ ./bin/start-all.sh 
      Starting monitor on localhost
      Starting tablet servers .... done
      Starting tablet server on localhost
      2013-09-04 12:26:24,545 [server.Accumulo] INFO : Attempting to talk to zookeeper
      2013-09-04 12:26:24,675 [server.Accumulo] INFO : Zookeeper connected and initialized, attemping to talk to HDFS
      2013-09-04 12:26:24,679 [server.Accumulo] INFO : Connected to HDFS
      Starting master on localhost
      Starting garbage collector on localhost
      Starting tracer on localhost

      Next, I created a table:

      $ ./bin/accumulo shell -u root -p secret
      2013-09-04 12:27:01,628 [shell.Shell] WARN : Specifying a raw password is deprecated.
      Shell - Apache Accumulo Interactive Shell
      - version: 1.6.0-SNAPSHOT
      - instance name: test
      - instance id: 1967c1ec-cc0f-439b-b4da-4029debd16e3
      - type 'help' for a list of available commands
      root@test> createtable t
      root@test t> 

      Then I checked the tserver log for the write-ahead log created for this update to the root table:

      $ fgrep -a /wal/ logs/tserver_*.debug.log
      2013-09-04 12:26:27,130 [log.DfsLogger] DEBUG: Got new write-ahead log: localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
      2013-09-04 12:26:58,264 [tabletserver.Tablet] DEBUG: Logs for memory compacted: !!R<< localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9

      Now, let's check for the file:

      $ hadoop fs -ls hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
      ls: Cannot access hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9: No such file or directory.
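
      The manual check above can be repeated for every write-ahead log a tserver reports. The following is a minimal sketch, not part of Accumulo: the function names `extract_wal_paths` and `check_wal_paths` are hypothetical, and it assumes the "Got new write-ahead log: <server>/<hdfs-uri>" DEBUG line format shown in the tserver log excerpt and a `hadoop` CLI on the PATH.

```shell
# extract_wal_paths LOGFILE...
# Prints each distinct hdfs:// WAL URI found in "Got new write-ahead log"
# lines. Assumes the DEBUG line format shown in the tserver log excerpt.
extract_wal_paths() {
  grep -ah 'Got new write-ahead log: ' "$@" \
    | sed -n 's|.*Got new write-ahead log: [^/]*/\(hdfs://.*\)$|\1|p' \
    | sort -u
}

# check_wal_paths LOGFILE...
# Reports whether each extracted WAL still exists in HDFS.
# Requires the hadoop CLI; "hadoop fs -test -e" exits 0 when the path exists.
check_wal_paths() {
  local wal
  extract_wal_paths "$@" | while IFS= read -r wal; do
    # Redirect stdin so hadoop cannot consume the list being read by the loop.
    if hadoop fs -test -e "$wal" </dev/null; then
      echo "present $wal"
    else
      echo "MISSING $wal"
    fi
  done
}
```

      After sourcing these functions, `check_wal_paths logs/tserver_*.debug.log` would print one line per WAL. In the situation described here, the WAL ending in e56f7d8020a9 would be expected to show up as MISSING.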


      Check the NN logs:

      $ fgrep 1dd2727f /some/log/dir/hadoop-ecnewt2-local-namenode-somehost.log 
      2013-09-04 12:26:27,075 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9. blk_-6011963215434912690_971163
      2013-09-04 12:26:27,113 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.fsync: file /accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9 for DFSClient_-787226921

      So the NN allocated a block for the file and handled an fsync on it, but the file is not there when we go to look!
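
      To see at a glance which NameNode operations touched a given WAL, the log lines can be reduced to a timeline. This is a rough sketch under stated assumptions: `nn_events` is a hypothetical helper, and it only recognizes lines in the `org.apache.hadoop.hdfs.StateChange ... NameSystem.<operation>:` format shown in the excerpt above. Lines in any other format are silently skipped.

```shell
# nn_events LOGFILE WAL_ID
# Prints "<timestamp> <operation>" for each NameNode StateChange line that
# mentions WAL_ID. Only the line format shown in the NN log excerpt is
# recognized; anything else is ignored.
nn_events() {
  grep -aF -- "$2" "$1" \
    | sed -n 's|^\([0-9-]* [0-9:,]*\) INFO org\.apache\.hadoop\.hdfs\.StateChange: [A-Z]*\* NameSystem\.\([A-Za-z]*\):.*|\1 \2|p'
}
```

      For example, `nn_events /some/log/dir/hadoop-ecnewt2-local-namenode-somehost.log 1dd2727f` would list the allocateBlock and fsync events from the excerpt. If no later operation on the file appears in the list, the NN log alone does not explain why the file is missing.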

      Here's my hdfs-site.xml file:

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <!-- Put site-specific property overrides in this file. -->

      I have written an integration test, which I put into RestartIT.java, but it doesn't seem to fail in the same way.


              • Assignee:
                ecn Eric C. Newton