ACCUMULO-1685

bench testing shows that the NN loses the WAL


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: tserver
    • Labels: None
    • Environment: Hadoop 1.0.4, single-node development system

    Description

      While doing bench testing, I build Accumulo:

      $ mvn -Pnative package -DskipTests
      

      I go into the assembly area, configure, and run Accumulo:

      $ cd assemble/target/accumulo-1.6.0-SNAPSHOT-dev/accumulo-1.6.0-SNAPSHOT
      $ cp ~/conf/* conf
      $ hadoop fs -rmr /accumulo
      Moved to trash: hdfs://somehost:9000/accumulo
      $ ( echo test ; echo Y ; echo secret ; echo secret ) | ./bin/accumulo init
      2013-09-04 12:23:51,558 [util.Initialize] INFO : Hadoop Filesystem is hdfs://somehost:9000
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Accumulo data dirs are [hdfs://somehost:9000/accumulo]
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Zookeeper server is localhost:2181
      2013-09-04 12:23:51,559 [util.Initialize] INFO : Checking if Zookeeper is available. If this hangs, then you need to make sure zookeeper is running
      Instance name : test
      Instance name "test" exists. Delete existing entry from zookeeper? [Y/N] : Y
      Enter initial password for root (this may not be applicable for your security setup): ******
      Confirm initial password for root: ******
      $ ./bin/start-all.sh 
      Starting monitor on localhost
      Starting tablet servers .... done
      Starting tablet server on localhost
      2013-09-04 12:26:24,545 [server.Accumulo] INFO : Attempting to talk to zookeeper
      2013-09-04 12:26:24,675 [server.Accumulo] INFO : Zookeeper connected and initialized, attemping to talk to HDFS
      2013-09-04 12:26:24,679 [server.Accumulo] INFO : Connected to HDFS
      Starting master on localhost
      Starting garbage collector on localhost
      Starting tracer on localhost
      

      Next, create a table:

      $ ./bin/accumulo shell -u root -p secret
      2013-09-04 12:27:01,628 [shell.Shell] WARN : Specifying a raw password is deprecated.
      
      Shell - Apache Accumulo Interactive Shell
      - 
      - version: 1.6.0-SNAPSHOT
      - instance name: test
      - instance id: 1967c1ec-cc0f-439b-b4da-4029debd16e3
      - 
      - type 'help' for a list of available commands
      - 
      root@test> createtable t
      root@test t> 
      

      Then I check the tserver log for the write-ahead log created for this update to the root table:

      $ fgrep -a /wal/ logs/tserver_*.debug.log
      2013-09-04 12:26:27,130 [log.DfsLogger] DEBUG: Got new write-ahead log: localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
      2013-09-04 12:26:58,264 [tabletserver.Tablet] DEBUG: Logs for memory compacted: !!R<< localhost+9997/hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
      

      Now, let's check for the file:

      $ hadoop fs -ls hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9
      ls: Cannot access hdfs://somehost:9000/accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9: No such file or directory.
      

      What?

      Check the NN logs:

      $ fgrep 1dd2727f /some/log/dir/hadoop-ecnewt2-local-namenode-somehost.log 
      2013-09-04 12:26:27,075 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9. blk_-6011963215434912690_971163
      2013-09-04 12:26:27,113 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.fsync: file /accumulo/wal/localhost+9997/1dd2727f-1de9-417b-a5a2-e56f7d8020a9 for DFSClient_-787226921
      

      So, the NN seems to be creating the file, but it's not there when we go to look!
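
      For reference, the create-then-sync sequence behind those two NN lines looks roughly like the sketch below. This is against the Hadoop 1.0.x API and is not the actual DfsLogger code; the path is a placeholder.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class WalVisibilityCheck {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Placeholder name; the real WAL name is a UUID chosen by the tserver.
              Path wal = new Path("/accumulo/wal/localhost+9997/example-uuid");

              // Writing the first bytes makes the client ask the NN for a block
              // (the NameSystem.allocateBlock log line); sync() persists the
              // block list on the NN (the NameSystem.fsync log line).
              FSDataOutputStream out = fs.create(wal);
              out.write(new byte[] {0});
              out.sync(); // Hadoop 1.x API; hflush()/hsync() in Hadoop 2+

              // The surprising part: the equivalent of this check fails from a
              // second client even though the NN logged the create above.
              System.out.println("exists? " + fs.exists(wal));
              out.close();
          }
      }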

      Here's my hdfs-site.xml file:

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.name.dir</name>
            <value>/local/ecn/data/hadoop/nn</value>
        </property>
        <property>
            <name>dfs.data.dir</name>
            <value>/disk01/data/hadoop/dn,/disk02/data/hadoop/dn,/disk03/data/hadoop/dn</value>
        </property>
        <property>
            <name>dfs.support.append</name>
            <value>true</value>
        </property>
        <property>
            <name>dfs.data.synconclose</name>
            <value>true</value>
        </property>
      </configuration>
      
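      In case it matters, here is a quick sketch to dump what a client-side Configuration resolves for the two sync-related properties above (property names copied verbatim from this file; whether the NN and DNs actually honor them is a separate question):

      import org.apache.hadoop.conf.Configuration;

      public class DumpSyncSettings {
          public static void main(String[] args) {
              // Picks up hdfs-site.xml from the classpath, like any HDFS client.
              Configuration conf = new Configuration();
              for (String key : new String[] {"dfs.support.append", "dfs.data.synconclose"}) {
                  System.out.println(key + " = " + conf.get(key));
              }
          }
      }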

      I have written an integration test and dumped it into RestartIT.java, but it doesn't seem to fail in the same way.
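
      For the record, the shape of that test looks roughly like the sketch below (hypothetical; not the code actually dropped into RestartIT.java). It uses MiniAccumuloCluster and writes one mutation so a WAL gets created for the tablet.

      import java.io.File;
      import org.apache.accumulo.core.client.BatchWriter;
      import org.apache.accumulo.core.client.BatchWriterConfig;
      import org.apache.accumulo.core.client.Connector;
      import org.apache.accumulo.core.client.Instance;
      import org.apache.accumulo.core.client.ZooKeeperInstance;
      import org.apache.accumulo.core.client.security.tokens.PasswordToken;
      import org.apache.accumulo.core.data.Mutation;
      import org.apache.accumulo.minicluster.MiniAccumuloCluster;

      public class WalLossSketch {
          public static void main(String[] args) throws Exception {
              // MAC wants an existing, empty directory to run in.
              File dir = new File(System.getProperty("java.io.tmpdir"), "mac-" + System.currentTimeMillis());
              dir.mkdirs();
              MiniAccumuloCluster mac = new MiniAccumuloCluster(dir, "secret");
              mac.start();
              try {
                  Instance inst = new ZooKeeperInstance(mac.getInstanceName(), mac.getZooKeepers());
                  Connector conn = inst.getConnector("root", new PasswordToken("secret"));
                  conn.tableOperations().create("t");

                  // One mutation is enough to force a WAL to be created.
                  BatchWriter bw = conn.createBatchWriter("t", new BatchWriterConfig());
                  Mutation m = new Mutation("row");
                  m.put("cf", "cq", "value");
                  bw.addMutation(m);
                  bw.close();

                  // The interesting step -- killing the tserver so recovery must
                  // replay the WAL, then scanning "t" to confirm the row
                  // survived -- needs test-harness hooks that are elided here.
              } finally {
                  mac.stop();
              }
          }
      }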

            People

              Assignee: Eric C. Newton
              Reporter: Eric C. Newton