[HADOOP-3776] NPE in NameNode with unknown blocks - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.18.0
Fix Version/s: 0.18.0, 0.19.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

When a DataNode has a block that the NameNode does not know about, the result is an NPE at the NameNode. In one of these cases the errors repeat in an infinite loop, because the DataNode keeps invoking the same RPC that caused the NPE.

One way to reproduce:

1. On a single-DN cluster, start writing a large file (something like 'bin/hadoop fs -put 5Gb 5Gb').
2. From a different shell, delete this file (bin/hadoop fs -rm 5Gb).
3. Most likely you will hit this.

The cause is that by the time the DataNode invokes blockReceived() to report the last block it received, the file has already been deleted, which results in an NPE at the NameNode. The DataNode then keeps invoking the same RPC with the same block and gets the same error.

When a block does not exist in the NameNode's blocksMap, it does not belong to the cluster. Let me know if you need the trace. The NPE is at FSNamesystem.java:2800 (on trunk).
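The failure mode described above can be sketched in a few lines of Java. This is a simplified, hypothetical model and not the actual FSNamesystem code: the class `UnknownBlockDemo`, its `blocksMap` of block ID to file path, and the methods `blockReceivedUnchecked` and `countFailedAttempts` are all invented for illustration. It only shows why an unchecked lookup combined with a sender that resends the same report produces the same error on every attempt.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; not the actual FSNamesystem code.
class UnknownBlockDemo {
    // Stand-in for the NameNode's blocksMap: block ID -> owning file path.
    static final Map<Long, String> blocksMap = new HashMap<>();

    // Mimics the failing path: dereferences the lookup result without a check.
    static int blockReceivedUnchecked(long blockId) {
        String owner = blocksMap.get(blockId); // null for an unknown block
        return owner.length();                 // NullPointerException if owner is null
    }

    // Mimics a sender that resends the same report whenever the call fails.
    // The retry count is capped here so that the demo terminates.
    static int countFailedAttempts(long blockId, int maxAttempts) {
        int failures = 0;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                blockReceivedUnchecked(blockId);
                break; // success, stop retrying
            } catch (NullPointerException e) {
                failures++; // same block, same error
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        blocksMap.put(1L, "/user/demo/5Gb");
        blocksMap.remove(1L); // the file is deleted while the write is in flight
        System.out.println("failed attempts: " + countFailedAttempts(1L, 5));
    }
}
```

Without the cap on attempts, the loop in this model would never end, which mirrors the infinite loop reported in the description.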

Attachments

HADOOP-3776.patch               23/Jul/08 20:59    2 kB    Raghu Angadi
HADOOP-3776.patch               23/Jul/08 19:34    3 kB    Raghu Angadi
HADOOP-3776.patch               23/Jul/08 01:01    2 kB    Raghu Angadi
HADOOP-3776-branch-018.patch    24/Jul/08 16:36    2 kB    Raghu Angadi

Issue Links

is duplicated by

HADOOP-3804 NPE in FSNamesystem.addStoredBlock(...)

Closed

Activity

Raghu Angadi added a comment - 22/Jul/08 18:24

> When block does not exist in NameNode's blocksMap, it basically does not belong to the cluster.
As Nicholas pointed out in HADOOP-3804, this check was removed in HADOOP-3002. The fix might be just to bring the check back.

Tsz-wo Sze added a comment - 22/Jul/08 18:29

The patches for 0.18 and trunk in HADOOP-3002 have this problem. The HADOOP-3002 0.17 patch does not have it.

Raghu Angadi added a comment - 23/Jul/08 01:01

The patch essentially reverts the hunk "@@ -2780,17 +2751,8 @@" from the patch for HADOOP-3002. It moves the check to the beginning of addStoredBlock().

The removal of this check was ok in the case of processReport() but not in the case of blockReceived().

Tsz-wo Sze added a comment - 23/Jul/08 18:12

I think we had better check whether fileINode == null, since fileINode could be null even if the stored block is not null.

Raghu Angadi added a comment - 23/Jul/08 19:34

You are right. A block can exist in blocksMap without an INode when a file is deleted. Such blocks seem to be removed when the datanodes send block reports.

Updated patch includes the check.
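The two conditions discussed here (block missing from blocksMap, or block present but without an INode) can be illustrated with a small Java sketch. It is a hypothetical model, not the committed patch: `GuardedAddStoredBlock`, `BlockInfo`, and the boolean return value are assumptions made for the example. What the real NameNode does with a rejected block is not stated in this thread, so the sketch simply returns false.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the guard; not the actual patch.
class GuardedAddStoredBlock {
    // Stand-in for a stored block entry. The inode field is null once the
    // owning file has been deleted but the entry has not yet been cleaned up.
    static final class BlockInfo {
        final String inode;
        BlockInfo(String inode) { this.inode = inode; }
    }

    // Stand-in for the NameNode's blocksMap: block ID -> stored block entry.
    static final Map<Long, BlockInfo> blocksMap = new HashMap<>();

    // Returns true if the block is accepted, false if it is rejected.
    static boolean addStoredBlock(long blockId) {
        BlockInfo stored = blocksMap.get(blockId);
        // Check both cases up front, before anything is dereferenced.
        if (stored == null || stored.inode == null) {
            return false; // the block does not belong to any file
        }
        return true;
    }

    public static void main(String[] args) {
        blocksMap.put(2L, new BlockInfo(null));     // file deleted, entry remains
        blocksMap.put(3L, new BlockInfo("/a/file")); // normal block
        System.out.println(addStoredBlock(1L)); // unknown block
        System.out.println(addStoredBlock(2L)); // block without an INode
        System.out.println(addStoredBlock(3L)); // block with an INode
    }
}
```

Placing the check at the top of the method means every caller is covered, whichever RPC led to the call.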

Tsz-wo Sze added a comment - 23/Jul/08 20:26

+1 patch looks good

Raghu Angadi added a comment - 23/Jul/08 20:57

Thanks Nicholas. The updated patch is smaller and similar to the first version attached.

Konstantin Shvachko added a comment - 23/Jul/08 21:19

+1 I like the smaller patch.
I guess I missed the blockReceived() case.

Tsz-wo Sze added a comment - 23/Jul/08 23:41

It seems that JIRA has a sorting bug:

Sort the File Attachments table above by "Attached Date". We get

HADOOP-3776.patch  	2008-07-22 06:01 PM  	Raghu Angadi  	2 kb
HADOOP-3776.patch 	2008-07-23 01:59 PM 	Raghu Angadi 	2 kb
HADOOP-3776.patch 	2008-07-23 12:34 PM 	Raghu Angadi 	3 kb

01:59 PM comes before 12:34 PM. Does it sort the timestamps as strings? Or am I missing something?

Raghu Angadi added a comment - 23/Jul/08 23:46

> Does it sort the timestamps as strings?
Yes, that's why 12:30 PM appears below 1:30 PM.
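The difference between the two orderings is easy to reproduce in Java. The sketch below is illustrative only and says nothing about how JIRA is implemented: `TimestampSortDemo` and its methods are invented names, and the pattern "yyyy-MM-dd hh:mm a" is an assumption chosen to match the timestamps quoted above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only; unrelated to JIRA's actual implementation.
class TimestampSortDemo {
    // 12-hour clock with an AM/PM marker, matching the table in the comment.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm a", Locale.ENGLISH);

    // Lexicographic order: "01:59 PM" sorts before "12:34 PM".
    static List<String> sortAsStrings(List<String> stamps) {
        List<String> out = new ArrayList<>(stamps);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    // Chronological order: parse each string first, then compare the times.
    static List<String> sortAsTimes(List<String> stamps) {
        List<String> out = new ArrayList<>(stamps);
        out.sort(Comparator.comparing((String s) -> LocalDateTime.parse(s, FMT)));
        return out;
    }

    public static void main(String[] args) {
        List<String> stamps = List.of(
            "2008-07-23 12:34 PM",
            "2008-07-22 06:01 PM",
            "2008-07-23 01:59 PM");
        System.out.println("as strings: " + sortAsStrings(stamps));
        System.out.println("as times:   " + sortAsTimes(stamps));
    }
}
```

Sorting the strings puts 01:59 PM ahead of 12:34 PM on the same day, as in the table; sorting the parsed values puts 12:34 PM first.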

Hadoop QA added a comment - 24/Jul/08 10:05

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386754/HADOOP-3776.patch
against trunk revision 679286.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2934/console

This message is automatically generated.

Raghu Angadi added a comment - 24/Jul/08 16:36

The patch for 0.18 is attached. The only change from the patch for trunk is the path to FSNamesystem.java.

The core test failure on Hudson is expected (HADOOP-3809).

Raghu Angadi added a comment - 24/Jul/08 16:52

I just committed this.

Hudson added a comment - 22/Aug/08 12:34

Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)

People

Assignee: Raghu Angadi
Reporter: Raghu Angadi
Votes: 0
Watchers: 2

Dates

Created: 16/Jul/08 23:02
Updated: 08/Jul/09 16:43
Resolved: 24/Jul/08 16:52

Hadoop Common
