  1. Hadoop HDFS
  2. HDFS-10788

fsck NullPointerException when it encounters corrupt replicas

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      CDH5.5.2, CentOS 6.7

      Description

      We have ended up with blocks whose corrupt replica count is inconsistent between the blockMap and the corrupt replicas map (I haven't found the root cause yet). If we run hdfs fsck on any ancestor directory of a file that has one of these blocks, fsck exits with output like this:

      $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
      Connecting to namenode via http://mynamenode:50070
      FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
      .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
      null
      
      Fsck on path '/path/to/parent/dir/' FAILED
      

      So I start at the top and run fsck on every subdirectory until I find one or more that fail. I then repeat the process inside those directories (our top-level directories all have subdirectories with date directories in them, which in turn contain the files). Once I reach a directory that contains files, I run a checksum on the files in that directory. The name of the failing file is not printed; instead I get:
      checksum: java.lang.NullPointerException

      Since the files are processed in order, I can work out which file is affected by seeing which one came just before the NPE. At this point, the namenode log shows the following when I try to checksum the corrupt file:

      2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
      2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
      java.lang.NullPointerException

      At that point I can delete the file, but the whole process is very tedious.
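
      To shorten the log-reading step, the WARN line quoted above can be pulled out of a namenode log mechanically. The following is only a sketch: the regular expression is modelled on the single log line shown in this report, the wording may differ in other Hadoop versions, and the function name find_inconsistent_blocks is made up for illustration.

```python
import re

# Modelled on the WARN line quoted in this report; other Hadoop versions
# may word the message differently (assumption).
INCONSISTENT_RE = re.compile(
    r"Inconsistent number of corrupt replicas for (?P<block>blk_-?\d+_\d+)\s+"
    r"blockMap has (?P<blockmap>\d+) but corrupt replicas map has (?P<corrupt>\d+)"
)


def find_inconsistent_blocks(log_lines):
    """Return (block, blockMap count, corrupt replicas map count) tuples.

    Each block is reported once, in the order it first appears.
    """
    seen = {}
    for line in log_lines:
        match = INCONSISTENT_RE.search(line)
        if match and match.group("block") not in seen:
            seen[match.group("block")] = (
                int(match.group("blockmap")),
                int(match.group("corrupt")),
            )
    return [(block, bm, cr) for block, (bm, cr) in seen.items()]
```

      Any iterable of lines works as input, including an open log file. Note that this only yields block names; mapping a block back to its file still requires the drill-down described above.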

      Ideally, shouldn't fsck emit the name of the file that is causing the problem and, if -delete is specified, remove that file, instead of exiting without saying why?
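
      Until fsck reports the file itself, the manual drill-down described above could be scripted. This is a rough sketch under several assumptions: the names find_failing_paths, fsck_passes and list_children are hypothetical, a failure is detected by looking for the "FAILED" text shown in the output above, and children are read from the last column of hdfs dfs -ls output, which breaks for paths containing spaces.

```python
import subprocess


def find_failing_paths(root, is_healthy, list_children):
    """Depth-first drill-down: return the narrowest paths under root that fail."""
    if is_healthy(root):
        return []
    failing = []
    for child in list_children(root):
        failing.extend(find_failing_paths(child, is_healthy, list_children))
    # If no child fails (or root has no children), root is the narrowest hit.
    return failing or [root]


def fsck_passes(path):
    """Assumption: a failed run prints "Fsck on path ... FAILED" as shown above."""
    result = subprocess.run(["hdfs", "fsck", path], capture_output=True, text=True)
    return "FAILED" not in result.stdout


def list_children(path):
    """Assumption: the path is the 8th whitespace-separated field of each -ls line."""
    result = subprocess.run(["hdfs", "dfs", "-ls", path], capture_output=True, text=True)
    children = []
    for line in result.stdout.splitlines():
        fields = line.split(None, 7)
        # Skip the "Found N items" header; listing a file returns the file itself.
        if len(fields) == 8 and fields[7] != path:
            children.append(fields[7])
    return children
```

      A call such as find_failing_paths("/path/to/parent/dir", fsck_passes, list_children) would then return the deepest paths on which fsck still fails. Passing the check and the listing in as functions keeps the traversal testable without a cluster; whether fsck on a single affected file fails the same way is not confirmed in this report.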

              People

              • Assignee:
                Unassigned
              • Reporter:
                Jeff Field (jfield)
              • Votes:
                0
              • Watchers:
                10