Hadoop Common / HADOOP-124

don't permit two datanodes to run from same dfs.data.dir


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None
    • Environment: ~30 node cluster

    Description

      DFS files are still rotting.

      I suspect there's a problem with block accounting / detection of identical hosts in the namenode. I have 30 physical nodes with varying numbers of local disks, so after a full restart 'bin/hadoop dfs -report' currently shows 80 nodes. However, when I discovered the problem (which cost about 500 GB of temporary data to missing blocks in some of the larger chunks), -report showed 96 nodes. I suspect extra datanodes were somehow running against the same paths, that the namenode counted those as replicated instances, and that blocks then appeared over-replicated, so one datanode was told to delete its local copy and the block was actually lost.
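
      A guard against the suspected failure mode is to have each datanode register a persistent per-storage identifier with the namenode, so that two daemons serving the same dfs.data.dir are detected at registration time rather than double-counted as replica holders. The following is only a minimal sketch of that idea, not the attached patch; the DatanodeRegistry class name, the storageID value, and the host:port key are illustrative assumptions.

          import java.util.HashMap;
          import java.util.Map;

          // Sketch: namenode-side bookkeeping keyed by a unique per-storage ID,
          // so a second daemon serving the same dfs.data.dir is rejected instead
          // of being counted as an extra replica holder. (Hypothetical class.)
          public class DatanodeRegistry {

            // storageID -> "host:port" of the datanode that registered it
            private final Map<String, String> storageOwner = new HashMap<>();

            // Returns false when a different datanode has already registered the
            // same storage; the caller should refuse the registration rather
            // than treat the duplicate as an independent replica.
            public synchronized boolean register(String storageID, String hostPort) {
              String owner = storageOwner.get(storageID);
              if (owner != null && !owner.equals(hostPort)) {
                return false; // duplicate storage: would inflate replica counts
              }
              storageOwner.put(storageID, hostPort); // first owner, or a re-registration
              return true;
            }
          }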

      I will debug it more the next time the situation arises. This is at least the 5th time I've had a large amount of file data "rot" in DFS since January.
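
      The issue title points at the other half of the fix: refuse to start a second datanode on a dfs.data.dir that is already in use. One common way to enforce this locally is an exclusive java.nio file lock held for the lifetime of the daemon. The sketch below assumes a lock file named in_use.lock inside the data directory; both the file name and the class are illustrative, not taken from the attached patches.

          import java.io.File;
          import java.io.IOException;
          import java.io.RandomAccessFile;
          import java.nio.channels.FileLock;
          import java.nio.channels.OverlappingFileLockException;

          // Sketch: take an exclusive OS-level lock on a marker file inside the
          // data directory before starting, and abort if some other process
          // already holds it. (Hypothetical class and lock-file name.)
          public class StorageDirectoryLock {

            // Returns the lock if acquired, or null if another process
            // (e.g. a second datanode) already owns this directory.
            static FileLock tryLock(File dataDir) throws IOException {
              File lockFile = new File(dataDir, "in_use.lock"); // assumed name
              RandomAccessFile raf = new RandomAccessFile(lockFile, "rws");
              try {
                FileLock lock = raf.getChannel().tryLock();
                if (lock == null) {
                  raf.close(); // lock is held by another process
                }
                return lock;   // kept open for the daemon's lifetime if non-null
              } catch (OverlappingFileLockException e) {
                raf.close();   // already locked within this JVM
                return null;
              }
            }

            public static void main(String[] args) throws IOException {
              File dataDir = new File(args[0]); // the dfs.data.dir path
              if (tryLock(dataDir) == null) {
                System.err.println(dataDir + " is in use by another datanode; aborting.");
                System.exit(1);
              }
              System.out.println("Locked " + dataDir + "; safe to start datanode.");
            }
          }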

      Attachments

        1. DatanodeRegister.txt (4 kB, Konstantin Shvachko)
        2. Hadoop-124.patch (81 kB, Konstantin Shvachko)
        3. Hadoop-124-v3.patch (90 kB, Konstantin Shvachko)


            People

              Assignee: Konstantin Shvachko (shv)
              Reporter: Bryan Pendleton (bpendleton)
              Votes: 2
              Watchers: 0
