HBase / HBASE-6089

SSH and AM.joinCluster causes Concurrent Modification exception.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1, 0.94.0
    • Fix Version/s: 0.94.1, 0.95.0
    • Component/s: None
    • Labels:
      None

      Description

      The AM.regions map (AssignmentManager) is accessed concurrently by SSH (ServerShutdownHandler) and by master initialization (AM.joinCluster), which leads to a ConcurrentModificationException.
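A minimal sketch of the failure mode, for illustration only. The stack trace in the master log later in this thread fails in TreeMap$PrivateEntryIterator.nextEntry, i.e. the fail-fast iterator of java.util.TreeMap. The hypothetical class CmeDemo below assumes a plain TreeMap keyed by made-up region names: the first method removes an entry mid-iteration on a single thread (standing in for SSH), and the second shows the general approach the patch takes, where the iterating thread and the writer both synchronize on the same regions monitor.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class (not HBase code). Region and server names are made up.
class CmeDemo {

    // Unsynchronized case: the loop body removes a key while the fail-fast
    // TreeMap iterator is live, standing in for SSH modifying AM.regions
    // while joinCluster walks it. Returns true if the CME was thrown.
    static boolean iterateWhileModifying() {
        Map<String, String> regions = new TreeMap<>();
        regions.put("region-a", "server-1");
        regions.put("region-b", "server-1");
        regions.put("region-c", "server-2");
        try {
            for (Map.Entry<String, String> e : regions.entrySet()) {
                if (e.getKey().equals("region-a")) {
                    regions.remove("region-c"); // structural modification mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException cme) {
            return true;
        }
    }

    // Synchronized case: the writer thread and the iterating thread both lock
    // the regions monitor, so each remove happens either before or after the
    // whole loop, never during it. Returns how many entries the loop visited.
    static int iterateWithLockedWriter() {
        final Map<String, String> regions = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            regions.put("region-" + i, "server-" + (i % 3));
        }
        Thread ssh = new Thread(() -> {
            for (int i = 0; i < 1000; i += 2) {
                synchronized (regions) {
                    regions.remove("region-" + i);
                }
            }
        });
        ssh.start();
        int seen = 0;
        synchronized (regions) {
            for (Map.Entry<String, String> e : regions.entrySet()) {
                seen++;
            }
        }
        try {
            ssh.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

The count returned by the second method varies from run to run (anywhere from 500 to 1000), depending on how many removes ran before the loop took the lock, but it never throws.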

      1. HBASE-6089_92.patch
        3 kB
        rajeshbabu
      2. HBASE-6089_94.patch
        4 kB
        rajeshbabu
      3. HBASE-6089_trunk.patch
        4 kB
        rajeshbabu

        Activity

        Jonathan Hsieh added a comment - edited

        Was not committed to 0.90

        Hudson added a comment -

        Integrated in HBase-0.92-security #109 (See https://builds.apache.org/job/HBase-0.92-security/109/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344803)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #33 (See https://builds.apache.org/job/HBase-0.94-security/33/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344805)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #35 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/35/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344816)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
        Hudson added a comment -

        Integrated in HBase-0.92 #439 (See https://builds.apache.org/job/HBase-0.92/439/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344803)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-TRUNK #2967 (See https://builds.apache.org/job/HBase-TRUNK/2967/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344816)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
        Hudson added a comment -

        Integrated in HBase-0.94 #240 (See https://builds.apache.org/job/HBase-0.94/240/)
        HBASE-6089 SSH and AM.joinCluster causes Concurrent Modification exception. (Rajesh) (Revision 1344805)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
        ramkrishna.s.vasudevan added a comment -

        Thanks for the patch, Rajesh.
        Thanks for the review, Ted, Stack and Anoop.
        Committed to 0.92, 0.94 and trunk.

        stack added a comment -

        "What do you think? If it is OK with you, I can commit this."

        Yes. It's no good adding a CSLM (ConcurrentSkipListMap) as I suggested, because the synchronized block is needed to cover updates to both regions and servers (as Anoop says).

        ramkrishna.s.vasudevan added a comment - edited

        I was not clear in my previous comment. Thinking about this again, there are some places where we wait for this.regions to be populated:

            synchronized (regions) {
              while (!regions.containsKey(regionInfo)) {
                // We should receive a notification, but it's
                // better to have a timeout to recheck the condition here:
                // it lowers the impact of a race condition if any
                regions.wait(100);
              }
            }

        Since this.servers has to be updated together with this.regions, we need a synchronized block anyway. I agree with Anoop here.
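The wait loop above only works if writers also lock regions and call notify on it after inserting; a ConcurrentSkipListMap by itself offers neither the monitor nor the notification. A minimal sketch of both sides, assuming the writer calls notifyAll() after the put (the notify side is not shown in this thread). RegionWaitDemo, regionOnline and waitForRegion are hypothetical names, and a String key stands in for HRegionInfo.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not HBase code: String keys stand in for HRegionInfo.
class RegionWaitDemo {
    private final Map<String, String> regions = new TreeMap<>(); // region -> server

    // Writer side: must hold the regions monitor both to mutate the TreeMap
    // safely and to call notifyAll() on it.
    void regionOnline(String region, String server) {
        synchronized (regions) {
            regions.put(region, server);
            regions.notifyAll();
        }
    }

    // Waiter side: mirrors the loop above, rechecking the condition at least
    // every 100 ms; gives up and returns false once timeoutMs has elapsed.
    boolean waitForRegion(String region, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (regions) {
            while (!regions.containsKey(region)) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                // wait() releases the monitor, letting regionOnline() acquire it.
                regions.wait(Math.min(100L, remaining));
            }
            return true;
        }
    }
}
```

The bounded wait plays the same role as the 100 ms timeout in the original loop: a missed notification only delays the waiter instead of blocking it forever.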

        ramkrishna.s.vasudevan added a comment -

        Yes. From the javadoc of ConcurrentSkipListMap.entrySet():

             * <p>The view's <tt>iterator</tt> is a "weakly consistent" iterator
             * that will never throw {@link ConcurrentModificationException},
             * and guarantees to traverse elements as they existed upon
             * construction of the iterator, and may (but is not guaranteed to)
             * reflect any modifications subsequent to construction.
        

        So for this JIRA: when we iterate this.regions in joinCluster, we need to see the actual regions at that point. With a weakly consistent iterator, if SSH modifies the map concurrently, we may still iterate over regions that SSH is in the middle of removing.
        So changing it to a ConcurrentSkipListMap will not help here, and we still need the synchronized block.
        So I think the current patch should be fine.
        @Stack
        What do you think? If it is OK with you, I can commit this.
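A small illustration of the javadoc quoted above, using a hypothetical WeakIteratorDemo class and made-up region names: iterating a ConcurrentSkipListMap while entries are removed does not throw, but the loop has already processed region-a even though the map no longer contains it afterwards. Whether a concurrently removed entry such as region-c is still visited is left open by the javadoc ("may ... reflect"), so nothing here depends on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo, not HBase code.
class WeakIteratorDemo {

    // Iterates the map and, partway through, removes entries the way SSH
    // might. Returns the keys the loop actually processed.
    static List<String> iterateWhileRemoving(ConcurrentSkipListMap<String, String> regions) {
        List<String> seen = new ArrayList<>();
        for (Map.Entry<String, String> e : regions.entrySet()) {
            seen.add(e.getKey());
            if (e.getKey().equals("region-a")) {
                // Simulated SSH: remove region-a (already handed to this loop)
                // and region-c (not yet reached). No exception is thrown.
                regions.remove("region-a");
                regions.remove("region-c");
            }
        }
        return seen;
    }
}
```

The caller gets no error, only a possibly stale view, which is why a weakly consistent iterator does not give joinCluster the guarantee it needs.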

        Anoop Sam John added a comment -

        "So you suggest changing both data structures, this.servers and this.regions, to ConcurrentSkipListMap"

        As the code is currently written, I don't think we can change it, Ram.
        We need a synchronized block that covers both the regions and servers data structures. There are also other blocks where more than one operation on regions is done inside a synchronized block. Changing the data structure to ConcurrentSkipListMap will not give us this behaviour, right?
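Anoop's point is that the two maps must change together. A minimal sketch, assuming simplified types (region name to server name, and server name to a set of region names) rather than HBase's real ones; RegionServerIndex and its methods are hypothetical. Each map could individually be a ConcurrentSkipListMap, but only one lock around both updates keeps a reader from observing a region in one map and not the other.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: field names mirror the thread (regions, servers),
// but the types are simplified assumptions.
class RegionServerIndex {
    private final Map<String, String> regions = new TreeMap<>();      // region -> server
    private final Map<String, Set<String>> servers = new TreeMap<>(); // server -> regions

    // Moves or adds a region; both maps change under the same lock.
    void regionOnline(String region, String server) {
        synchronized (regions) {
            String previous = regions.put(region, server);
            if (previous != null) {
                Set<String> old = servers.get(previous);
                if (old != null) {
                    old.remove(region);
                }
            }
            servers.computeIfAbsent(server, s -> new HashSet<>()).add(region);
        }
    }

    // Drops a dead server and all of its regions as one atomic step.
    Set<String> serverDead(String server) {
        synchronized (regions) {
            Set<String> hosted = servers.remove(server);
            if (hosted == null) {
                return new HashSet<>();
            }
            for (String r : hosted) {
                regions.remove(r);
            }
            return hosted;
        }
    }

    // Checks that the two maps agree in both directions.
    boolean isConsistent() {
        synchronized (regions) {
            for (Map.Entry<String, String> e : regions.entrySet()) {
                Set<String> hosted = servers.get(e.getValue());
                if (hosted == null || !hosted.contains(e.getKey())) {
                    return false;
                }
            }
            for (Map.Entry<String, Set<String>> e : servers.entrySet()) {
                for (String r : e.getValue()) {
                    if (!e.getKey().equals(regions.get(r))) {
                        return false;
                    }
                }
            }
            return true;
        }
    }
}
```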

        ramkrishna.s.vasudevan added a comment -

        "What about its tie to this.servers? Is that still respected by this patch?"

        We have checked this, and the patch still holds good with respect to the tie-up with this.servers.

        "Rather than synchronize, why not use a ConcurrentSkipListMap?"

        So you suggest changing both data structures, this.servers and this.regions, to ConcurrentSkipListMap throughout the code and removing the synchronized blocks?

        stack added a comment -

        Patch looks good to me.

        Rather than synchronize, why not use a ConcurrentSkipListMap? Also, are we sure we have synchronized all accesses to this.regions? What about its tie to this.servers? Is that still respected by this patch?

        Ted Yu added a comment - edited

        Patch for trunk looks good.

        rajeshbabu added a comment -

        In the 0.94 and trunk patches, along with the fix, I removed the params in the javadoc of the methods modified as part of HBASE-5916, and removed dead code in AssignmentManager:

        void unassignCatalogRegions(){
            this.servers.entrySet();
        
        ramkrishna.s.vasudevan added a comment -
        2012-05-24 19:26:02,493 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to linux146,60020,1337867810895
        2012-05-24 19:26:02,552 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1337867810895, region=2be5ef20db58b775953cc1107eb51d2d
        2012-05-24 19:26:02,592 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1337867810895, region=191b0c97f2d2a8262bf790093fdce2ab
        2012-05-24 19:26:02,595 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1337867810895, region=99d462b47ea5e301175d025204eff014
        2012-05-24 19:26:03,957 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done for linux146,60020,1337867810895
        2012-05-24 19:26:14,640 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1337867810895, region=2be5ef20db58b775953cc1107eb51d2d
        2012-05-24 19:26:14,781 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=linux146,60020,1337867810895, region=2be5ef20db58b775953cc1107eb51d2d
        2012-05-24 19:26:14,785 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for et1,,1337864575331.2be5ef20db58b775953cc1107eb51d2d. from linux146,60020,1337867810895; deleting unassigned node
        2012-05-24 19:26:14,786 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x1377ea1a1fe002d Deleting existing unassigned node for 2be5ef20db58b775953cc1107eb51d2d that is in expected state RS_ZK_REGION_OPENED
        2012-05-24 19:26:14,792 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x1377ea1a1fe002d Successfully deleted unassigned node for region 2be5ef20db58b775953cc1107eb51d2d in expected state RS_ZK_REGION_OPENED
        2012-05-24 19:26:14,792 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1337867810895, region=5a84a4f4eaf2519e36a8ccc2e9c83b04
        2012-05-24 19:26:14,792 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region et1,,1337864575331.2be5ef20db58b775953cc1107eb51d2d. has been deleted.
        2012-05-24 19:26:23,862 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of linux146,60020,1337866620614
        2012-05-24 19:26:51,927 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
        2012-05-24 19:26:51,931 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
        java.util.ConcurrentModificationException
        	at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
        	at java.util.TreeMap$EntryIterator.next(TreeMap.java:1136)
        	at java.util.TreeMap$EntryIterator.next(TreeMap.java:1131)
        	at org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:409)
        	at org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:363)
        	at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:607)
        	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:374)
        	at java.lang.Thread.run(Thread.java:662)
        

          People

          • Assignee:
            rajeshbabu
            Reporter:
            ramkrishna.s.vasudevan
          • Votes:
            0
            Watchers:
            10
