What is the status of this Jira?
I believe that I am also running into this issue. I am using the yahoo_merge branch, but it should be the same in all branches.
When running stress tests, the NameNode daemon receives a ConcurrentModificationException and exits during certain race conditions.
This seems to be a fairly critical bug that could cause the NameNode to exit under stress conditions.
My configuration runs a single independent NameNode on one machine and hundreds of simulated DataNodes (simulated by MiniDFSCluster) on each of 9 other machines, for a total of up to 2000 simulated DataNodes.
Then, in this environment, the DataNodeGenerator test is run, which performs random reads, creates, writes, and deletes. The goal is to stress the NameNode with hundreds of operations per second.
Under certain thread interleavings, while the ReplicationMonitor thread is calculating invalid blocks, the recentInvalidateSets TreeMap within BlockManager is modified by another thread while ReplicationMonitor is still iterating over it.
Here is the exception and stack traceback:
2011-06-08 15:33:41,551 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception.
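For illustration, the failure mode described above can be reproduced in isolation. This is a minimal sketch, not the BlockManager code: the class InvalidateSetsRaceDemo, the method iterateWhileModifying, the "datanode-N" keys, and the use of Long block IDs are all hypothetical stand-ins. It modifies the map from inside the iteration loop on a single thread, which trips the same fail-fast check in TreeMap's iterator that a second thread would trip in the real race.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it only mimics the shape of recentInvalidateSets.
class InvalidateSetsRaceDemo {
    // Simplified stand-in: storage ID -> block IDs awaiting invalidation.
    static final Map<String, Collection<Long>> recentInvalidateSets = new TreeMap<>();

    // Returns true if iterating while modifying raised
    // ConcurrentModificationException.
    static boolean iterateWhileModifying() {
        recentInvalidateSets.clear();
        for (int i = 0; i < 3; i++) {
            Collection<Long> blocks = new ArrayList<>();
            blocks.add((long) i);
            recentInvalidateSets.put("datanode-" + i, blocks);
        }
        try {
            for (String storageId : recentInvalidateSets.keySet()) {
                // Structural modification while an iterator is live. In the
                // real race this comes from another thread; here it is done
                // inline so the outcome is deterministic.
                recentInvalidateSets.put("datanode-new", new ArrayList<>());
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + iterateWhileModifying());
    }
}
```

In the multi-threaded case the exception is timing dependent, which is consistent with it only showing up under heavy load.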
One thing I tried was to go into BlockManager and put 'synchronized()' around every place that iterates over, adds to, or removes from the recentInvalidateSets TreeMap variable.
I'm not sure what performance (or other unforeseen) ramifications this may have.
However, I was able to eliminate the ConcurrentModificationException by using this fix, at least in my test
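A rough sketch of the kind of change described above follows. It is not the actual patch: the class GuardedInvalidateSets, its methods (add, remove, countPendingBlocks, stress), and the choice of the map itself as the lock object are assumptions made for illustration, and block IDs are simplified to Long. The point is only that every access path, including the iteration, takes the same monitor.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper; every access to the map holds the same lock.
class GuardedInvalidateSets {
    private final Map<String, Collection<Long>> recentInvalidateSets = new TreeMap<>();

    void add(String storageId, long blockId) {
        synchronized (recentInvalidateSets) {
            recentInvalidateSets
                .computeIfAbsent(storageId, k -> new ArrayList<>())
                .add(blockId);
        }
    }

    void remove(String storageId) {
        synchronized (recentInvalidateSets) {
            recentInvalidateSets.remove(storageId);
        }
    }

    // The whole iteration runs under the lock, so no writer can
    // modify the map between two iterator steps.
    int countPendingBlocks() {
        synchronized (recentInvalidateSets) {
            int total = 0;
            for (Collection<Long> blocks : recentInvalidateSets.values()) {
                total += blocks.size();
            }
            return total;
        }
    }

    // Runs one writer and one iterating thread concurrently; returns true
    // if neither thread saw a runtime exception.
    static boolean stress(int rounds) {
        GuardedInvalidateSets sets = new GuardedInvalidateSets();
        AtomicBoolean failed = new AtomicBoolean(false);
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    sets.add("datanode-" + (i % 50), i);
                    if (i % 3 == 0) {
                        sets.remove("datanode-" + (i % 50));
                    }
                }
            } catch (RuntimeException e) {
                failed.set(true);
            }
        });
        Thread monitor = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    sets.countPendingBlocks();
                }
            } catch (RuntimeException e) {
                failed.set(true);
            }
        });
        writer.start();
        monitor.start();
        try {
            writer.join();
            monitor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !failed.get();
    }
}
```

The trade-off is the one raised above: holding the lock for the full iteration blocks writers for as long as the scan takes, so the cost grows with the size of the map.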