Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None
    • Environment:

      testing 1.5.1rc1 on a 10-node cluster, Hadoop 2.2.0, ZooKeeper 3.4.5

      Description

      I was running Accumulo 1.5.1rc1 on a 10-node cluster. After two days, I saw that several tservers had died with an OOME and several hundred tablets were offline.

      The master was attempting to recover the write lease on the WAL file, and this kept failing.
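
      In newer Hadoop releases (2.7 and later; not available in the 2.2.0 used here), lease recovery can also be triggered by hand from the CLI. A sketch, reusing the WAL path from the cat attempt below:

      hdfs debug recoverLease -path /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14 -retries 3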

      Attempts to examine the log file failed:

      $ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
      Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
      

      Looking at the DN logs, I see this:

      2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
      2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
        getNumBytes()     = 634417185
        getBytesOnDisk()  = 634417113
        getVisibleLength()= 634417113
        getVolume()       = /srv/hdfs4/hadoop/dn/current
        getBlockFile()    = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
        bytesAcked=634417113
        bytesOnDisk=634417113
      

      I'm guessing that the /srv/hdfs4 partition filled up, and that the disagreement between the replica's expected length and the bytes actually on disk (634417185 vs. 634417113 above) is causing the recovery failures.
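
      A quick way to check that theory (a sketch; the fsck path here is illustrative):

      # free space on the partition holding the replica
      df -h /srv/hdfs4
      # list files under the WAL tree that HDFS still considers open for write
      hadoop fsck /accumulo/wal -openforwrite -files -blocks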

      Restarting HDFS made no difference.

      To make any progress, I manually copied the block file up into HDFS in place of the WAL.
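
      The exact commands aren't recorded here; a sketch of that workaround, with <wal-file> and <local-block-file> as placeholders for the paths in the logs above (run from a node that can read the DN's block file):

      # move the unrecoverable WAL aside, then upload the raw block file in its place
      hadoop fs -mv <wal-file> <wal-file>.bad
      hadoop fs -put <local-block-file> <wal-file>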

        Activity

        Eric Newton created issue -
        Josh Elser added a comment -

        Interesting stuff, Eric Newton. I inadvertently ran into similar situations where I filled up the local partition that the DNs were writing to. The difference though is that after I freed up some space on disk, things happily recovered once they could successfully complete log recovery.

        I assume you were running with the dfs.datanode.synconclose option set to true?
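
        For reference, that option lives in hdfs-site.xml; a minimal snippet:

        <property>
          <name>dfs.datanode.synconclose</name>
          <value>true</value>
        </property>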

        Josh Elser made changes -
        Issue Type: New Feature [ 2 ] → Bug [ 1 ]
        Josh Elser made changes -
        Fix Version/s: (none) → 1.5.1 [ 12324399 ]
        Eric Newton added a comment -

        Yes, synconclose was set to true. I was only speculating that the FS filled up. If I can reproduce the bug, I'll file an HDFS ticket.

        Eric Newton made changes -
        Fix Version/s: 1.5.1 [ 12324399 ] → (none)
        Eric Newton made changes -
        Affects Version/s: (none) → 1.5.0 [ 12318645 ]
        Josh Elser added a comment -

        Sounds good. My intention was to hold up the next 1.5.1 RC until we have a better idea of what happened here. Should I take your removal of the fixVersion to mean you don't believe that's necessary?


          People

          • Assignee:
            Unassigned
          • Reporter:
            Eric Newton
          • Votes:
            0
          • Watchers:
            1
