Hadoop HDFS / HDFS-3044

fsck move should be non-destructive by default

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0, 2.0.0-alpha
    • Component/s: namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The fsck "move" option is no longer destructive. It copies the accessible blocks of corrupt files to lost and found as before, but no longer deletes the corrupt files after copying the blocks. The original, destructive behavior can be enabled by specifying both the "move" and "delete" options.

      Description

      The fsck move behavior, as implemented in the code and originally articulated in HADOOP-101, is:

      Current failure modes for DFS involve blocks that are completely missing. The only way to "fix" them would be to recover chains of blocks and put them into lost+found

      A directory is created with the file's name, the blocks that are accessible are written as individual files in this directory, and then the original file is removed.

      I suspect the rationale for this behavior was that you can't use files that are missing locations, and copying the blocks out as files at least makes part of the data accessible. However, this behavior can also result in permanent data loss. For example:

      • Some datanodes fail to come up and check in on cluster startup (e.g. due to a hardware issue); files whose blocks have all their replicas on this set of datanodes are marked corrupt
      • The admin runs fsck move, which saves whatever blocks were available and deletes the "corrupt" files
      • The hardware issues are resolved and the datanodes are started and rejoin the cluster. The NN tells them to delete their blocks for the corrupt files, since the files were deleted.

      I think we should:

      • Make fsck move non-destructive by default (e.g. it just copies blocks into lost+found)
      • Make the destructive behavior optional (e.g. a "--destructive" flag, so admins think about what they're doing)
      • Provide better sanity checks and warnings. E.g. if you're running fsck and not all the slaves have checked in (if using dfs.hosts), fsck should print a warning to that effect, which an admin has to override before doing anything destructive
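The semantics proposed above can be sketched with ordinary shell commands on a local filesystem. This is only an analogy of the copy-vs-delete distinction, not real HDFS commands; the `demo` directory layout is invented for illustration:

```shell
# Local-filesystem analogy of the fsck "move" semantics (not actual HDFS).
demo=/tmp/fsck_move_demo
rm -rf "$demo"
mkdir -p "$demo/data" "$demo/lost+found"
printf 'block0\n' > "$demo/data/corruptfile"

# Non-destructive default: copy the accessible blocks into lost+found
# and leave the original file in place.
cp "$demo/data/corruptfile" "$demo/lost+found/corruptfile"

# Destructive variant (must be requested explicitly), shown but not run:
#   cp "$demo/data/corruptfile" "$demo/lost+found/corruptfile" \
#     && rm "$demo/data/corruptfile"
```

Per the release note above, the fix as released keeps "move" copy-only, and the old destructive behavior requires passing both the "move" and "delete" options to fsck.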
      Attachments

      1. HDFS-3044.002.patch
        5 kB
        Colin Patrick McCabe
      2. HDFS-3044.003.patch
        7 kB
        Colin Patrick McCabe
      3. HDFS-3044-b1.002.patch
        8 kB
        Colin Patrick McCabe
      4. HDFS-3044-b1.004.patch
        8 kB
        Colin Patrick McCabe

        Issue Links

          Activity

          • Eli Collins created issue
          • Colin Patrick McCabe: Attachment HDFS-3044.001.patch [ 12517587 ]
          • Colin Patrick McCabe: Attachment HDFS-3044.001.patch [ 12517587 ]
          • Colin Patrick McCabe: Attachment HDFS-3044.002.patch [ 12518071 ]
          • Colin Patrick McCabe: Attachment HDFS-3044.003.patch [ 12519115 ]
          • Eli Collins: Hadoop Flags Reviewed [ 10343 ]; Target Version/s 0.23.3 [ 12320052 ]
          • Eli Collins: Link added: this issue is related to HDFS-3045
          • Eli Collins: Status Open [ 1 ] → Patch Available [ 10002 ]
          • Eli Collins: Status Patch Available [ 10002 ] → Resolved [ 5 ]; Target Version/s 0.23.3 [ 12320052 ]; Fix Version/s 0.23.3 [ 12320052 ]; Resolution Fixed [ 1 ]
          • Eli Collins: Hadoop Flags Reviewed [ 10343 ] → Incompatible change, Reviewed [ 10342,10343 ]; Release Note added (quoted above)
          • Arun C Murthy: Fix Version/s 2.0.0 [ 12320353 ]; Fix Version/s 0.23.3 [ 12320052 ]
          • Colin Patrick McCabe: Attachment HDFS-3050-b1.001.patch [ 12520513 ]
          • Colin Patrick McCabe: Attachment HDFS-3050-b1.001.patch [ 12520513 ]
          • Colin Patrick McCabe: Attachment HDFS-3044-b1.002.patch [ 12520514 ]
          • Eli Collins: Fix Version/s 1.1.0 [ 12317959 ]
          • Colin Patrick McCabe: Attachment HDFS-3044-b1.004.patch [ 12520650 ]
          • Suresh Srinivas: Target Version/s 1.1.0 [ 12317959 ]
          • Matt Foley: Status Resolved [ 5 ] → Closed [ 6 ]

            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 8
