[HADOOP-1256] Dfs image loading and edits loading creates multiple instances of DatanodeDescriptor for the same datanode - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.3
Fix Version/s: 0.13.0
Component/s: None
Labels: None

Description

This leads to multiple instances of DatanodeDescriptors for the same datanode stored in Host2DatanodeMap.

Attachments

nodeMap.patch (1 kB), 13/Apr/07 22:33, Hairong Kuang

Issue Links

duplicates: HADOOP-1254 TestCheckpoint fails intermittently (Closed)

is related to: HDFS-175 FSNamesystem.startFile throws an IOException when the number of chosen targets is less than the required minimum number (Open)

Activity

Hairong Kuang added a comment - 12/Apr/07 21:19

This patch makes sure that host2DatanodeMap is consistent with datanodeMap and thus eliminates the possibility that host2DatanodeMap has more than one instance of the datanode.
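The consistency requirement described here can be illustrated with a minimal sketch. It is not the contents of nodeMap.patch: `NodeDescriptor` and `NodeRegistry` are hypothetical stand-ins for DatanodeDescriptor and the name-node's bookkeeping, and the two maps only play the roles of datanodeMap and host2DatanodeMap. The idea shown is that every add or remove goes through one place that updates both maps, so a repeated add for the same storage ID evicts the old descriptor instead of leaving a second instance in the host map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for DatanodeDescriptor: a node name ("host:port") plus a storage ID. */
class NodeDescriptor {
    final String name;
    final String storageId;

    NodeDescriptor(String name, String storageId) {
        this.name = name;
        this.storageId = storageId;
    }

    /** Host part of the name, i.e. everything before the first ':'. */
    String host() {
        int colon = name.indexOf(':');
        return colon < 0 ? name : name.substring(0, colon);
    }
}

/** Hypothetical registry that keeps a storage-ID map and a host map in step. */
class NodeRegistry {
    // Plays the role of datanodeMap: one descriptor per storage ID.
    private final Map<String, NodeDescriptor> byStorageId = new HashMap<>();
    // Plays the role of host2DatanodeMap: descriptors grouped by host.
    private final Map<String, List<NodeDescriptor>> byHost = new HashMap<>();

    /** Registers a node, first evicting any descriptor already known for its storage ID. */
    void add(NodeDescriptor node) {
        NodeDescriptor old = byStorageId.put(node.storageId, node);
        if (old != null) {
            removeFromHostIndex(old);
        }
        byHost.computeIfAbsent(node.host(), h -> new ArrayList<>()).add(node);
    }

    /** Removes the node with the given storage ID from both maps, if present. */
    void remove(String storageId) {
        NodeDescriptor old = byStorageId.remove(storageId);
        if (old != null) {
            removeFromHostIndex(old);
        }
    }

    private void removeFromHostIndex(NodeDescriptor node) {
        List<NodeDescriptor> nodes = byHost.get(node.host());
        if (nodes == null) {
            return;
        }
        nodes.remove(node); // identity match: NodeDescriptor does not override equals
        if (nodes.isEmpty()) {
            byHost.remove(node.host());
        }
    }

    /** Descriptors currently indexed under the given host (empty if none). */
    List<NodeDescriptor> nodesOn(String host) {
        return byHost.getOrDefault(host, Collections.emptyList());
    }

    /** Total number of descriptors held in the host index. */
    int hostEntryCount() {
        int count = 0;
        for (List<NodeDescriptor> nodes : byHost.values()) {
            count += nodes.size();
        }
        return count;
    }

    /** Number of distinct storage IDs registered. */
    int storageCount() {
        return byStorageId.size();
    }
}
```

With this shape, adding two descriptors for the same storage ID in a row leaves `hostEntryCount()` equal to `storageCount()`, which is the invariant the comment describes.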

Konstantin Shvachko added a comment - 13/Apr/07 01:51

In my test the image file contains a data-node D0 = <name0, storageID>.
And the edits file has two records, [remove D0] and [add D1], where D1 = <name1, storageID>.
The storageID is the same, meaning that I'm starting the same data-node on different ip addresses/ports.

I start the name-node, and I get an empty edits file and the image containing D1, which means that the
edits have been applied correctly, everything is as expected.

Then I start data-node D0 and see 2 problems that I believe are related to this issue.
1. The edits file contains 5 add/remove records in it.
There should be just 2: [remove D1], [add D0]
2. The first record in the edits file is [remove D0].
If I then try to restart the name-node, it throws an UnregisteredDatanodeException:

07/04/12 17:39:40 ERROR dfs.NameNode: org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node <name0> is attempting to report storage ID DS1537505994. Node <name0> is expected to serve this storage.
    at org.apache.hadoop.dfs.FSNamesystem.getDatanode(FSNamesystem.java:3461)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:311)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:672)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:585)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:220)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:346)
    at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:251)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:173)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:211)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:820)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:828)

I tried the patch; it did not fix this problem.
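The replay scenario in this comment can be modeled with a toy example. This is only a sketch under stated assumptions: `EditReplaySketch`, `Edit`, and `Op` are hypothetical names, the image is reduced to a storageID-to-name map, and the `IllegalStateException` merely stands in for the UnregisteredDatanodeException seen in the stack trace. It shows why [remove D0], [add D1] replays cleanly against an image holding D0, while an edits file that begins with [remove D0] fails against an image that now holds D1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EditReplaySketch {
    enum Op { ADD, REMOVE }

    /** One hypothetical edits record: an operation on a <name, storageID> pair. */
    static final class Edit {
        final Op op;
        final String name;
        final String storageId;

        Edit(Op op, String name, String storageId) {
            this.op = op;
            this.name = name;
            this.storageId = storageId;
        }
    }

    /** Applies the edits in order to a copy of the image (storageID -> node name). */
    static Map<String, String> replay(Map<String, String> image, List<Edit> edits) {
        Map<String, String> state = new LinkedHashMap<>(image);
        for (Edit e : edits) {
            if (e.op == Op.ADD) {
                // An add replaces whatever name was registered for this storage ID.
                state.put(e.storageId, e.name);
            } else {
                String current = state.get(e.storageId);
                // Assumption: a remove naming a different node than the registered
                // one is rejected, loosely mirroring the exception in the trace.
                if (current != null && !current.equals(e.name)) {
                    throw new IllegalStateException("Data node " + e.name
                            + " is attempting to report storage ID " + e.storageId
                            + ", which is registered to " + current);
                }
                state.remove(e.storageId);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> image = new LinkedHashMap<>();
        image.put("DS1537505994", "name0");

        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(Op.REMOVE, "name0", "DS1537505994"));
        edits.add(new Edit(Op.ADD, "name1", "DS1537505994"));

        System.out.println(replay(image, edits)); // prints {DS1537505994=name1}
    }
}
```

Replaying a single [remove name0] record against the resulting state throws, which corresponds to problem 2 above: the first record in the new edits file refers to a node the image no longer contains.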

Hairong Kuang added a comment - 13/Apr/07 02:15

This is the cause of ~~HADOOP-1254~~.

Konstantin Shvachko added a comment - 13/Apr/07 02:16

+1
the new patch solved the problem.

Hadoop QA added a comment - 13/Apr/07 02:44

+1. http://issues.apache.org/jira/secure/attachment/12355474/nodeMap.patch applied and successfully tested against trunk revision r528230.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/41/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/41/console

Konstantin Shvachko added a comment - 13/Apr/07 17:50

General comment:

We should treat optimization patches differently.
It would be good to have some measurements that prove the optimization really
works and is worth the complexity involved.

These bugs are related to the optimization issue ~~HADOOP-971~~, which introduced
a new data-structure in the name-node in order to accelerate access to data-nodes by name.
It involves more complexity in synchronizing the new map with the other structures.
We still don't know what the benefits are.

We have TestDFSIO and NNBench to measure the performance.
In this particular case the cluster should have a lot of data-nodes, so it probably needs
a custom benchmark, which should run imo both on a small cluster to show the existing
performance does not degrade and on a large one to show the advantages.

I'd propose to make it a requirement for committing optimization patches.

Raghu Angadi added a comment - 13/Apr/07 18:06

Just adding to the discussion:

> These bugs are related to the optimization issue ~~HADOOP-971~~, which introduced
> a new data-structure in the name-node in order to accelerate access to data-nodes by name.
> It involves more complexity in synchronizing the new map with the other structures.
> We still don't know what the benefits are.

Would this bug be any easier to find or fix if it had been introduced as part of a feature improvement? Or do you think optimization patches tend to get less rigorously tested and reviewed?

In some cases, performance improvements are really obvious even just by looking at the code. I do not want to take credit for any part of ~~HADOOP-971~~, but it looks like it was tested to show improvement.

Doug Cutting added a comment - 13/Apr/07 18:30

The benefits of new features should be evaluated differently than those of performance optimizations. Both should be weighed against their added complexity, but the improvements offered by optimizations can and should be quantitatively measured before they're committed. Optimizing things by eye is known to be error-prone. Evaluating features is more subjective.

So, +1 for requiring benchmark results for optimizations.

Nigel Daley added a comment - 13/Apr/07 18:38

Some code comments and a unit test would be good.

Hairong Kuang added a comment - 13/Apr/07 22:37

As I discussed with Nigel and Konstantin, a deterministic junit test is hard to create. So I submitted the patch with comments but without a unit test.

Hadoop QA added a comment - 13/Apr/07 23:09

+1. http://issues.apache.org/jira/secure/attachment/12355529/nodeMap.patch applied and successfully tested against trunk revision r528230.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/49/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/49/console

Hairong Kuang added a comment - 13/Apr/07 23:56

I want to make it clear that the ~~HADOOP-971~~ patch was tested on a 1800-node cluster before it got committed. But this bug occurs only when fsimage & fsedits contain multiple add entries for the same storage id. So unfortunately the bug was not caught.

I understand Konstantin's concern that ~~HADOOP-971~~ adds complexity to NameNode. I'd be happy if anybody comes up with an idea that removes the getDatanodeByHost bottleneck without introducing the host2DatanodeMap.
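The trade-off discussed here can be sketched as a lookup by host done two ways. This is an illustration only: it assumes the lookup without the extra map amounts to a scan over all registered node names, which this thread does not confirm, and `HostLookupSketch` and its methods are hypothetical names rather than Hadoop APIs. The scan needs no extra state but touches every node on each call, while the index answers in one map lookup at the cost of a second structure that must be kept in sync.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HostLookupSketch {
    /** Host part of a "host:port" node name. */
    static String hostOf(String nodeName) {
        int colon = nodeName.indexOf(':');
        return colon < 0 ? nodeName : nodeName.substring(0, colon);
    }

    /** Linear scan: cost grows with the number of registered nodes on every call. */
    static List<String> scanByHost(Collection<String> nodeNames, String host) {
        List<String> matches = new ArrayList<>();
        for (String name : nodeNames) {
            if (hostOf(name).equals(host)) {
                matches.add(name);
            }
        }
        return matches;
    }

    /** Builds a host -> node names index once; later lookups are a single map get. */
    static Map<String, List<String>> indexByHost(Collection<String> nodeNames) {
        Map<String, List<String>> index = new HashMap<>();
        for (String name : nodeNames) {
            index.computeIfAbsent(hostOf(name), h -> new ArrayList<>()).add(name);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("host0:50010", "host1:50010", "host1:50011");
        System.out.println(scanByHost(nodes, "host1"));       // prints [host1:50010, host1:50011]
        System.out.println(indexByHost(nodes).get("host1"));  // prints [host1:50010, host1:50011]
    }
}
```

Both calls return the same nodes; the difference only shows up in cost at cluster scale, which is what a benchmark of the kind proposed earlier in the thread would have to measure.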

Nigel Daley added a comment - 15/Apr/07 00:39

+1. Tom, Doug, let's get this committed ASAP so that trunk unit testing (and thus the patch process) is no longer broken.

Doug Cutting added a comment - 16/Apr/07 17:19

I just committed this. Thanks, Hairong!

Hadoop QA added a comment - 17/Apr/07 11:21

Integrated in Hadoop-Nightly #60 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/60/)

People

Assignee: Hairong Kuang

Reporter: Hairong Kuang

Votes: 0

Watchers: 0

Dates

Created: 12/Apr/07 20:37

Updated: 08/Jul/09 16:42

Resolved: 16/Apr/07 17:19

Project: Hadoop Common