Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.18.0, 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When a DataNode reports a block that the NameNode does not have, the result is an NPE at the NameNode. In one of these cases the errors repeat in an infinite loop, because the DataNode keeps invoking the same RPC that caused the NPE.

      One way to reproduce:

      • On a single-DataNode cluster, start writing a large file (something like 'bin/hadoop fs -put 5Gb 5Gb').
      • From a different shell, delete the file while the write is in progress (bin/hadoop fs -rm 5Gb).
      • Most likely you will hit the NPE.

      The cause: by the time the DataNode invokes blockReceived() to report the last block it received, the file has already been deleted, which results in an NPE at the NameNode. The DataNode then keeps invoking the same RPC with the same block and gets the same error each time.

      When a block does not exist in the NameNode's blocksMap, it does not belong to the cluster. Let me know if you need the trace. The NPE is at FSNamesystem.java:2800 (on trunk).
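The failure mode described above can be illustrated with a minimal sketch. This is not the actual FSNamesystem code: the class `BlockReceivedSketch`, the `Map`-based `blocksMap`, and the simplified `blockReceived(long, String)` signature are all hypothetical stand-ins chosen for illustration. It shows the general idea of rejecting a block that is missing from the map rather than dereferencing a null lookup result.

```java
import java.util.HashMap;
import java.util.Map;

public class BlockReceivedSketch {
    // Hypothetical stand-in for the NameNode's blocksMap: block id -> owning file path.
    static final Map<Long, String> blocksMap = new HashMap<>();

    /** Returns true if the block was accepted, false if it was rejected as unknown. */
    static boolean blockReceived(long blockId, String datanode) {
        String owner = blocksMap.get(blockId);
        if (owner == null) {
            // Unknown block: it does not belong to the cluster, so reject it
            // explicitly instead of dereferencing the null lookup result.
            System.out.println("Rejecting block " + blockId + " from " + datanode
                + ": not in blocksMap");
            return false;
        }
        System.out.println("Block " + blockId + " of " + owner + " stored on " + datanode);
        return true;
    }

    public static void main(String[] args) {
        blocksMap.put(1L, "/user/a/5Gb");
        blockReceived(1L, "dn1");   // block is known: accepted
        blocksMap.remove(1L);       // file deleted while the write is in flight
        blockReceived(1L, "dn1");   // block is unknown: rejected, no exception
    }
}
```

Because the rejection is an ordinary return value rather than an exception, a caller that retries on failure would not loop on the same error.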

      Attachments

        1. HADOOP-3776.patch
          2 kB
          Raghu Angadi
        2. HADOOP-3776.patch
          3 kB
          Raghu Angadi
        3. HADOOP-3776.patch
          2 kB
          Raghu Angadi
        4. HADOOP-3776-branch-018.patch
          2 kB
          Raghu Angadi

          Activity

            rangadi Raghu Angadi added a comment -

            > When block does not exist in NameNode's blocksMap, it basically does not belong to the cluster.
            As Nicholas pointed out in HADOOP-3804, this check was removed in HADOOP-3002. The fix might be to just bring the check back.

            szetszwo Tsz-wo Sze added a comment -

            The patches for 0.18 and trunk in HADOOP-3002 have this problem. The HADOOP-3002 0.17 patch does not have it.

            rangadi Raghu Angadi added a comment -

            The patch essentially reverts the hunk "@@ -2780,17 +2751,8 @@" from the patch for HADOOP-3002. It moves the check to the beginning of addStoredBlock().

            The removal of this check was ok in the case of processReport() but not in the case of blockReceived().

            szetszwo Tsz-wo Sze added a comment -

            I think we had better also check whether fileINode == null, since fileINode can be null even if the stored block is not null.

            rangadi Raghu Angadi added a comment -

            You are right. A block can exist in blocksMap without an INode after its file is deleted. Such blocks seem to be removed when the datanodes send block reports.

            Updated patch includes the check.
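The two conditions discussed here can be sketched as a single guard. This is an illustration only, not the committed patch: `AddStoredBlockSketch`, `BlockInfo`, and the simplified `addStoredBlock(long)` signature are hypothetical, and the real method in FSNamesystem takes different arguments. The point is that a block entry can outlive its file, so the guard tests both the map lookup and the entry's inode.

```java
import java.util.HashMap;
import java.util.Map;

public class AddStoredBlockSketch {
    /** Hypothetical, simplified block record; inode becomes null once the file is deleted. */
    static final class BlockInfo {
        String inode;
        BlockInfo(String inode) { this.inode = inode; }
    }

    // Hypothetical stand-in for the NameNode's blocksMap.
    static final Map<Long, BlockInfo> blocksMap = new HashMap<>();

    /** Returns true if the block was recorded, false if it does not belong to any file. */
    static boolean addStoredBlock(long blockId) {
        BlockInfo stored = blocksMap.get(blockId);
        // Check both cases up front: the block is missing from the map,
        // or it is still in the map but its file has been deleted.
        if (stored == null || stored.inode == null) {
            return false;
        }
        // ... the replica would be recorded here ...
        return true;
    }

    public static void main(String[] args) {
        blocksMap.put(1L, new BlockInfo("/user/a/5Gb"));
        System.out.println(addStoredBlock(1L));  // block with a live file
        blocksMap.get(1L).inode = null;           // file deleted, entry still present
        System.out.println(addStoredBlock(1L));  // rejected by the inode check
        System.out.println(addStoredBlock(2L));  // rejected: never in the map
    }
}
```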

            szetszwo Tsz-wo Sze added a comment -

            +1 patch looks good

            rangadi Raghu Angadi added a comment -

            Thanks Nicholas. The updated patch is smaller and similar to the first version attached.


            shv Konstantin Shvachko added a comment -

            +1 I like the smaller patch.
            I guess I missed the blockReceived() case.

            szetszwo Tsz-wo Sze added a comment -

            It seems that JIRA has a sorting bug:

            Sort the File Attachments table above by "Attached Date". We get

            HADOOP-3776.patch    2008-07-22 06:01 PM    Raghu Angadi    2 kb
            HADOOP-3776.patch    2008-07-23 01:59 PM    Raghu Angadi    2 kb
            HADOOP-3776.patch    2008-07-23 12:34 PM    Raghu Angadi    3 kb

            01:59 PM comes before 12:34 PM. Does it sort the timestamps as strings? Or am I missing something?

            rangadi Raghu Angadi added a comment -

            > Does it sort the timestamps as strings?
            Yes, that's why 12:30 PM appears below 1:30 PM.
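The ordering in the attachment table can be reproduced with a short sketch. JIRA's actual sorting code is not shown in this thread, so this only demonstrates the general effect: comparing 12-hour timestamps as plain strings puts "01:59 PM" before "12:34 PM", while parsing them as date-times gives chronological order. The class name `TimestampSortSketch` and its helper methods are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class TimestampSortSketch {
    // 12-hour clock with an AM/PM marker, matching the format in the attachment table.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd hh:mm a", Locale.US);

    /** Sorts the timestamps lexicographically, as plain strings. */
    static List<String> sortAsStrings(List<String> stamps) {
        List<String> out = new ArrayList<>(stamps);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    /** Sorts the timestamps chronologically by parsing them first. */
    static List<String> sortAsTimes(List<String> stamps) {
        List<String> out = new ArrayList<>(stamps);
        out.sort(Comparator.comparing((String s) -> LocalDateTime.parse(s, FMT)));
        return out;
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
            "2008-07-23 12:34 PM",
            "2008-07-22 06:01 PM",
            "2008-07-23 01:59 PM");
        System.out.println("As strings: " + sortAsStrings(stamps));
        System.out.println("As times:   " + sortAsTimes(stamps));
    }
}
```

In the string sort, "0" compares lower than "1" in the hour position, so 01:59 PM lands ahead of 12:34 PM even though it is later in the day.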

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12386754/HADOOP-3776.patch
            against trunk revision 679286.

            +1 @author. The patch does not contain any @author tags.

            -1 tests included. The patch doesn't appear to include any new or modified tests.
            Please justify why no tests are needed for this patch.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 findbugs. The patch does not introduce any new Findbugs warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed core unit tests.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/testReport/
            Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/console

            This message is automatically generated.

            rangadi Raghu Angadi added a comment -

            The patch for 0.18 is attached. The only change from the trunk patch is the path to FSNamesystem.java.

            The core test failure during Hudson is expected (HADOOP-3809).

            rangadi Raghu Angadi added a comment -

            I just committed this.

            hudson Hudson added a comment -

            Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

            People

              Assignee: rangadi Raghu Angadi
              Reporter: rangadi Raghu Angadi