  Hadoop Common / HADOOP-124

Don't permit two datanodes to run from the same dfs.data.dir

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None
    • Environment: ~30-node cluster

      Description

      DFS files are still rotting.

      I suspect a problem in the namenode with block accounting or with detecting identical hosts. I have 30 physical nodes with varying numbers of local disks, so 'bin/hadoop dfs -report' currently shows 80 nodes after a full restart. However, when I discovered the problem (which cost about 500 GB of temporary data because of missing blocks in some of the larger chunks), -report showed 96 nodes. I suspect that extra datanodes were somehow running against the same paths and that the namenode counted them as separate replicas. The affected blocks then appeared over-replicated, one of the datanodes was told to delete its local block, and the block was actually lost.

      I will debug it further the next time the situation arises. This is at least the fifth time since January that a large amount of file data has "rotted" in DFS.
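
      One way to enforce the rule in the issue title is to have each datanode take an exclusive lock on a file inside its data directory before serving blocks. The following is a minimal Java sketch of that idea, not the content of the attached patches: the class name DataDirLock and the lock file name "datanode.lock" are hypothetical, chosen only for illustration. It relies on java.nio's FileChannel.tryLock(), which returns null when another process holds the lock and throws OverlappingFileLockException when the same JVM already holds it.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

/** Hypothetical helper: guards one data directory with an exclusive lock file. */
public class DataDirLock implements AutoCloseable {
    // Assumed lock file name; not taken from the issue or its patches.
    private static final String LOCK_FILE_NAME = "datanode.lock";

    private final RandomAccessFile file;
    private final FileLock lock;

    private DataDirLock(RandomAccessFile file, FileLock lock) {
        this.file = file;
        this.lock = lock;
    }

    /**
     * Tries to claim the directory. Returns a held lock on success,
     * or null if some other holder already owns the directory.
     */
    public static DataDirLock tryAcquire(File dataDir) throws IOException {
        File lockFile = new File(dataDir, LOCK_FILE_NAME);
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rws");
        try {
            FileLock held = raf.getChannel().tryLock();
            if (held == null) {
                // Another process holds the lock.
                raf.close();
                return null;
            }
            return new DataDirLock(raf, held);
        } catch (OverlappingFileLockException e) {
            // This JVM already holds the lock through another channel.
            raf.close();
            return null;
        } catch (IOException e) {
            raf.close();
            throw e;
        }
    }

    /** Releases the lock so that another datanode may claim the directory. */
    @Override
    public void close() throws IOException {
        lock.release();
        file.close();
    }
}
```

      A datanode using this sketch would call DataDirLock.tryAcquire(dir) for each configured directory at startup and refuse to start (or skip that directory) when it gets null. File locks of this kind are advisory on some platforms, so this only protects against other processes that follow the same convention.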

        Attachments

        1. Hadoop-124-v3.patch (90 kB, Konstantin Shvachko)
        2. Hadoop-124.patch (81 kB, Konstantin Shvachko)
        3. DatanodeRegister.txt (4 kB, Konstantin Shvachko)

              People

              • Assignee: Konstantin Shvachko (shv)
              • Reporter: Bryan Pendleton (bpendleton)
              • Votes: 2
              • Watchers: 0
