Hadoop Common / HADOOP-1046

Datanode should periodically clean up /tmp from partially received (and not completed) block files

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.2, 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None
    • Environment: Cluster of 10 machines, running Hadoop 0.9.2 + Nutch

    Description

      The cluster is set up with tasktrackers running on the same machines as the datanodes. Tasks put a heavy load on local CPU, RAM and disk I/O. Under that load I see many messages like the following from the datanodes:

      2007-02-15 05:30:53,298 WARN dfs.DataNode - Failed to transfer blk_-4590782726923911824 to xxx.xxx.xxx/10.10.16.109:50010
      java.net.SocketException: Connection reset
      ....
      java.io.IOException: Block blk_71053993347675204 has already been started (though not completed), and thus cannot be created.

      My reading of the code in DataNode.DataXceiver.writeBlock(), FSDataset.writeToBlock() and FSDataset.java:459 suggests the following scenario. The temporary files in /tmp that store incomplete blocks during a transfer are never cleaned up. If the target datanode is CPU-starved and drops the connection while writing such a temp file, the source datanode attempts the transfer again, but a file with that name already exists in /tmp, because the target datanode did not clean up when the connection was dropped.
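
      The failure mode described above can be reproduced in miniature. The sketch below is not the real FSDataset code: the class TmpBlockDemo and its startBlock() method are hypothetical stand-ins that only mimic the "refuse if the temp file already exists" behaviour implied by the log message.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical stand-in for the datanode's temp-file bookkeeping (not the real FSDataset).
class TmpBlockDemo {
    private final File tmpDir;

    TmpBlockDemo(File tmpDir) {
        this.tmpDir = tmpDir;
    }

    // Refuses to start a block whose temp file already exists, as in the reported log.
    File startBlock(String blockName) throws IOException {
        File f = new File(tmpDir, blockName);
        if (f.exists()) {
            throw new IOException("Block " + blockName
                + " has already been started (though not completed), and thus cannot be created.");
        }
        if (!f.createNewFile()) {
            throw new IOException("Could not create temp file " + f);
        }
        return f;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("dn-tmp").toFile();
        TmpBlockDemo dn = new TmpBlockDemo(dir);

        // First transfer starts; assume the connection is then reset and nothing is cleaned up.
        dn.startBlock("blk_1");

        // The source datanode retries the same block and is rejected.
        try {
            dn.startBlock("blk_1");
        } catch (IOException e) {
            System.out.println("retry rejected: " + e.getMessage());
        }
    }
}
```

      Because nothing ever removes the leftover file, every later retry for the same block fails the same way.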

      I also see that this section is unchanged in trunk/.

      A possible fix is to check the age of the physical file in the /tmp dir, in FSDataset.java:436: if it is older than a few hours or so, we should delete it and proceed as if there were no ongoing create operation for this block.
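
      A minimal sketch of that age check follows. It is not the attached patch: the class StaleTmpBlockCheck, its methods and the 3-hour threshold are assumptions made for illustration (the report only says "a few hours or so"), and the current time is passed in as a parameter so the behaviour is easy to exercise.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper illustrating the proposed age check (not the actual fsdataset.patch).
class StaleTmpBlockCheck {
    // Assumed threshold; the report does not give an exact value.
    static final long MAX_AGE_MILLIS = 3L * 60 * 60 * 1000;

    // A temp file is stale if it exists and was last modified more than maxAgeMillis ago.
    static boolean isStale(File f, long nowMillis, long maxAgeMillis) {
        return f.exists() && nowMillis - f.lastModified() > maxAgeMillis;
    }

    // Creates the temp file for a block, first deleting a stale leftover if there is one.
    static File prepareTmpFile(File tmpDir, String blockName, long nowMillis, long maxAgeMillis)
            throws IOException {
        File f = new File(tmpDir, blockName);
        if (f.exists()) {
            if (!isStale(f, nowMillis, maxAgeMillis)) {
                // Recent file: treat it as a create operation that is still in progress.
                throw new IOException("Block " + blockName
                    + " has already been started (though not completed), and thus cannot be created.");
            }
            if (!f.delete()) {
                throw new IOException("Could not delete stale temp file " + f);
            }
        }
        if (!f.createNewFile()) {
            throw new IOException("Could not create temp file " + f);
        }
        return f;
    }
}
```

      A caller would use it as prepareTmpFile(tmpDir, "blk_1", System.currentTimeMillis(), StaleTmpBlockCheck.MAX_AGE_MILLIS). Using the file's modification time means a slow but still active transfer keeps refreshing its own timestamp, so only abandoned files cross the threshold.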

      Attachments

        1. fsdataset.patch (1.0 kB, Andrzej Bialecki)

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Andrzej Bialecki (ab)
            Votes: 0
            Watchers: 0
