HBASE-7247

Assignment performances decreased by 50% because of regionserver.OpenRegionHandler#tickleOpening

    Details

    • Hadoop Flags:
      Reviewed

      Description

      The regionserver.OpenRegionHandler#tickleOpening updates the region znode, the stated reason being: "Do this so master doesn't timeout this region-in-transition."
      However, in the usual test, this makes the assignment time of 1500 regions go from 70s to 100s; that is, we're 50% slower because of this.
      More generally, ZooKeeper commits every data update to disk, and this takes time. Using it to provide a keep-alive seems overkill. At the very least, it could be made asynchronous.

      I'm not sure how necessary these updates are (I need to go deeper into the internals; feedback welcome), but it seems very important to optimize this... The trivial fix would be to make this optional.
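To make the cost concrete, here is a minimal sketch of the round trip a tickle amounts to, using an in-memory stand-in for ZooKeeper (`FakeZnode` and the method names are illustrative, not HBase code): every call pays a sync, a read, and a disk-committed write whose only purpose is to refresh the znode's modification time.

```java
// Hypothetical sketch of the znode interaction behind tickleOpening.
// FakeZnode is an in-memory stand-in, not the real ZooKeeper client.
public class TickleSketch {
    static final String RS_ZK_REGION_OPENING = "RS_ZK_REGION_OPENING";

    /** Minimal model of a znode: data plus a version that ZooKeeper
     *  bumps on every successful setData. */
    static class FakeZnode {
        String data;
        int version;
        FakeZnode(String data, int version) { this.data = data; this.version = version; }
        // In real ZooKeeper, sync() brings this client's view up to date and
        // setData() is committed to disk by a quorum -- the expensive part.
        void sync() { /* no-op in the in-memory model */ }
        boolean setData(String newData, int expectedVersion) {
            if (expectedVersion != version) return false; // concurrent writer detected
            data = newData;
            version++;
            return true;
        }
    }

    /** The "tickle": rewrite the znode with the same OPENING state purely to
     *  refresh its mtime, so the master does not time out the
     *  region-in-transition. */
    static boolean tickleOpening(FakeZnode node) {
        node.sync();                                // 1. force a fresh view
        String current = node.data;                 // 2. read the state
        if (!RS_ZK_REGION_OPENING.equals(current)) {
            return false;                           // region moved on; we lost ownership
        }
        return node.setData(current, node.version); // 3. write the same state back
    }
}
```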

      1. 7247.v1.patch
        10 kB
        Nicolas Liochon
      2. 7247.v2.patch
        8 kB
        Nicolas Liochon


          Activity

          stack added a comment -

          Thanks for explanation. +1 on commit to trunk and 0.95.

          Nicolas Liochon added a comment -

          This sync happens every time? Ain't it really expensive? More expensive that what used to be there (a write?). Why we need it?

          We already had the sync previously. We're just saving the final write.
          The sync can be expensive; I don't know whether it's always expensive or sometimes optimized by ZK. My test hides its cost because it's a single ZK node and the master is on the same node. A real-life system with a 5-node ZK ensemble would be more realistic. In any case, saving the write is good.

          Does this clear any existing watch?

          No, it does not set a watcher but should not impact existing ones.

          Does the content of retransitionNodeOpening deserve to be generalized and moved back into transitionNode but with a flag for whether to write the new state or not?

          I don't know. Do we have a lot of cases like this one?
          Globally, the performance problem is that we check the state before writing, hence we sync and we read. The next optimization I'm thinking about is knowing that we were the previous writer, or that we already read the previous version, so we can send the write without syncing/reading, relying only on ZooKeeper versions. But it's easy to add bugs when doing this.
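That optimization can be sketched as follows, again with an in-memory stand-in for ZooKeeper (illustrative names, not HBase code): if we cache the version from our own previous write, we can attempt `setData` with that expected version directly and let the version check at write time reject any concurrent writer, skipping the sync + read round trip.

```java
// Hypothetical sketch of the version-reliant write Nicolas describes.
// FakeZnode is an in-memory stand-in, not the real ZooKeeper client.
public class VersionedWriteSketch {
    static class FakeZnode {
        String data;
        int version;
        FakeZnode(String d, int v) { data = d; version = v; }
        boolean setData(String newData, int expectedVersion) {
            if (expectedVersion != version) return false; // someone wrote in between
            data = newData;
            version++;
            return true;
        }
    }

    /** Transition relying only on the version cached from our last write.
     *  No sync, no read: the version check at write time detects any
     *  concurrent writer, which is exactly what sync + read was for.
     *  Returns the new version on success, or -1 on a stale cache. */
    static int transitionWithCachedVersion(FakeZnode node, String newState, int cachedVersion) {
        if (!node.setData(newState, cachedVersion)) return -1;
        return node.version;
    }
}
```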

          stack added a comment -

          Nicolas Liochon This sync happens every time? Ain't it really expensive? More expensive that what used to be there (a write?). Why we need it?

          + String node = getNodeName(zkw, encoded);
          + zkw.sync(node);

          Does this clear any existing watch?

          + byte [] existingBytes = ZKUtil.getDataNoWatch(zkw, node, stat);

          Does the content of retransitionNodeOpening deserve to be generalized and moved back into transitionNode but with a flag for whether to write the new state or not?

          Otherwise patch looks good.

          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #489 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/489/)
          HBASE-7247 Assignment performances decreased by 50% because of regionserver.OpenRegionHandler#tickleOpening (Revision 1465914)

          Result = FAILURE
          nkeywal :
          Files :

          • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #62 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/62/)
          HBASE-7247 Assignment performances decreased by 50% because of regionserver.OpenRegionHandler#tickleOpening (Revision 1465915)

          Result = FAILURE
          nkeywal :
          Files :

          • /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #4045 (See https://builds.apache.org/job/HBase-TRUNK/4045/)
          HBASE-7247 Assignment performances decreased by 50% because of regionserver.OpenRegionHandler#tickleOpening (Revision 1465914)

          Result = SUCCESS
          nkeywal :
          Files :

          • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          Jimmy Xiang added a comment -

          v2 Looks good to me.

          Nicolas Liochon added a comment -

          Waiting for a review on this one. Works locally; TestHTableMultiplexer is well known to be flaky.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12577192/7247.v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestHTableMultiplexer

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5148//console

          This message is automatically generated.

          Nicolas Liochon added a comment -

          v2. Results are consistent: always 85s vs. 120s. I've done YourKit profiling on it: all the time is spent in ZK in my test scenario.

          Nicolas Liochon added a comment -

          I'm doing some tests on trunk on recovery.

          Test is:

          • 1000 empty regions on one RS
          • on another computer, all services are running: master, ZK, second RS.
          • Stop (cleanly) the first RS.

          Recovery (in this case assignment only) takes 2 minutes.
          All the time seems to be spent in writing/reading to ZK.

          We have ~3000 calls to ZK#transitionNode, including 1000 for tickleOpening.

          Each transition is:
          sync
          readData
          setData

          Commenting out the sync makes no difference (it could be an effect of the test env).
          Removing tickleOpening brings the result to 80 seconds (vs. 120s).
          Checking but not writing in tickleOpening puts us around 85s. I will do a proper patch with this.
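The "checking but not writing" variant measured above can be sketched like this, with the same kind of in-memory stand-in (illustrative names, not HBase code): verify that we still own the OPENING znode, but skip the final `setData`, saving one disk-committed write per region.

```java
// Hypothetical sketch of a check-only tickle: sync + read + state check,
// but no write. FakeZnode is an in-memory stand-in for ZooKeeper.
public class CheckOnlyTickleSketch {
    static final String OPENING = "RS_ZK_REGION_OPENING";

    static class FakeZnode {
        String data;
        int version;
        FakeZnode(String d, int v) { data = d; version = v; }
        void sync() { /* no-op in this in-memory model */ }
        String getDataNoWatch() { return data; } // read without setting a watcher
    }

    /** Returns true if we still own the region. The znode's version (and
     *  mtime) stays untouched, so no disk commit happens on the ZK side. */
    static boolean checkStillOpening(FakeZnode node) {
        node.sync();
        return OPENING.equals(node.getDataNoWatch());
    }
}
```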

          Nicolas Liochon added a comment -

          TimeOutManagement is now optional and deactivated by default. I will redo the measurements.

          Nicolas Liochon added a comment -

          I'm currently looking at this, now that the blocking issues for it have been fixed. Doable for 0.96 imho.

          stack added a comment -

          Marking major rather than critical. This would be really sweet to have. It might actually be harder getting it in if it doesn't make it into 0.96. It is an improvement though, so it shouldn't hold up 0.96. That's why I'm making it 'major'.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12556092/7247.v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 104 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 23 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3508//console

          This message is automatically generated.

          Nicolas Liochon added a comment -

          From the comments on the mailing list and here:

          • it makes sense
          • Jimmy Xiang will list the exceptional cases to make sure they are covered.

          I will try to have things done in two or three steps:

          • remove the code from the master
          • remove the code from the region server

          I will use the review board for this, in a distinct jira...

          Andrew Purtell added a comment -

          Stepping back (after looking at code), could we drop the notion that a master can intercede and assign a region elsewhere because it is proceeding too slow on a particular region in the name of simplifying the region open handling interaction? There would be less noise in the logs and less states to deal with.

          It would be a huge simplification imho. It's worth trying, I would say. It actually makes sense to do it now, because once the current trunk code will be production proven, touching it will be scarier.

          +1

          stack added a comment -

          Implicitly, it means we still have a race condition here, just that the probability is quite low.

          Yeah. By-product of our keeping state across multiple systems (up in zk and then some state in .meta.). We could change this to a checkAndPut. Read .META. at start of the opening or have master pass over the .META. timestamp or something key to .META. and we'd use it doing checkAndSet into .META. table... would be more strict than this updating zk.

          It would be a huge simplification imho. It's worth trying, I would say. It actually makes sense to do it now, because once the current trunk code will be production proven, touching it will be scarier.

          I'd go along. We should discuss out on dev first. I have short-term memory. I'm currently of the opinion that this expensive facility of master failing an open because it has been taking too long on a particular regionserver has been of no use – worse, it has only caused headache – but I may be just not remembering and others out on dev list will have better recall than I.
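The checkAndPut idea above can be sketched with an atomic compare-and-set: instead of consulting ZooKeeper just before editing .META., make the .META. update itself conditional on the value read at the start of the open, so a stale opener's write is rejected atomically. Here `ConcurrentHashMap.replace` stands in for HBase's checkAndPut; the table and keys are illustrative, not HBase code.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a conditional .META. update. The map stands in
// for the .META. table; region and server names are made up.
public class CheckAndPutSketch {
    static final ConcurrentHashMap<String, String> meta = new ConcurrentHashMap<>();

    /** Conditionally record the new server for a region: succeeds only if
     *  .META. still holds the value observed when the open began, exactly
     *  like checkAndPut would. A stale opener fails here instead of being
     *  fenced off by a ZK read just before the edit. */
    static boolean updateLocationIfUnchanged(String region, String expected, String newServer) {
        return meta.replace(region, expected, newServer);
    }
}
```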

          Nicolas Liochon added a comment -

          On the 'post_region_open', IIRC, a bunch of these tickleOpenings were added because we saw issues... in this case, an update of .META. that went in though we'd lost ownership of the region.

          Implicitly, it means we still have a race condition here, just that the probability is quite low.

          Stepping back (after looking at code), could we drop the notion that a master can intercede and assign a region elsewhere because it is proceeding too slow on a particular region in the name of simplifying the region open handling interaction? There would be less noise in the logs and less states to deal with.

          It would be a huge simplification imho. It's worth trying, I would say. It actually makes sense to do it now, because once the current trunk code will be production proven, touching it will be scarier.

          Do we fail the open if the following break happens?

          Yes, because updateMeta will return false.

          We have to call the + zkw.sync(node); ? We always did that? We are doing the sync just to read the old znode value? Do we have to? Could we operate w/ stale read?

          Well, we want to be sure that no one else wrote anything. It's often overkill, because we're going to write the znode immediately after, so the sync will occur anyway during the write, as we check the versions during the write. And it's expensive. So there is likely some room for improvement here as well actually.

          HRegion region = this.rsServices.getFromOnlineRegions(encodedName);

          We don't do anything here actually. We do a get, if the region is in the list, 'region' will not be null and that's it. This variable is set later but not read in between. I can raise an error if it's in the list? Hopefully it won't break anything (famous last words)

          stack added a comment -

          Your approach sounds good nkeywal. No harm writing less but adding the read in case we indeed have lost ownership. It might mean it takes a bit longer to realize we've lost ownership but that should be fine.

          On the 'post_region_open', IIRC, a bunch of these tickleOpenings were added because we saw issues... in this case, an update of .META. that went in though we'd lost ownership of the region.

          Stepping back (after looking at code), could we drop the notion that a master can intercede and assign a region elsewhere because it is proceeding too slow on a particular region in the name of simplifying the region open handling interaction? There would be less noise in the logs and less states to deal with.

          If we did this, I'd think that we'd want the regionserver to do the initial move of the znode from OFFLINE to OPENING to establish ownership (elsewhere I have petitioned that the regionserver should set the OPENING state, and not the master – master should set the state to PENDING_OPEN in the znode rather than just in master memory – as a means of cleanly denoting the regionservers' assumption of region ownership). Then the next transition would be from OPENING to OPEN or to FAILED_OPEN. And that would be it. Master would just presume that the only reason to intercede is when the regionserver loses its lease in zk. We'd drop tickling OPENING so master knows we are progressing on a region open – it would just presume regionserver is making progress and that it will kill itself if it can't get to HDFS, etc. We'd also be dropping regionserver-side checks that it still owns a region just before it goes to update meta (Could change the meta operation to be a check and put so we didn't have to go to zk just before meta edit – the bit of code you quote above nkeywal?).

          Just putting it out there.
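The simplified scheme proposed above boils down to a very small state machine: the regionserver claims the region by moving the znode to OPENING, and the only subsequent transitions are to OPEN or FAILED_OPEN, with no periodic tickling of OPENING. State names follow the discussion; the transition table itself is an illustrative sketch, not HBase code.

```java
import java.util.EnumMap;
import java.util.EnumSet;

// Hypothetical sketch of the simplified region-open state machine.
public class SimplifiedOpenStates {
    enum State { PENDING_OPEN, OPENING, OPEN, FAILED_OPEN }

    static final EnumMap<State, EnumSet<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        // Master sets PENDING_OPEN; the RS claims ownership by moving to OPENING.
        ALLOWED.put(State.PENDING_OPEN, EnumSet.of(State.OPENING));
        // From OPENING the only moves are the terminal outcomes -- notably,
        // there is no OPENING -> OPENING "tickle" transition.
        ALLOWED.put(State.OPENING, EnumSet.of(State.OPEN, State.FAILED_OPEN));
        ALLOWED.put(State.OPEN, EnumSet.noneOf(State.class));
        ALLOWED.put(State.FAILED_OPEN, EnumSet.noneOf(State.class));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }
}
```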

          On patch:

          Do we fail the open if the following break happens?

                   tickleOpening = tickleOpening("post_open_deploy");
          +        if (!tickleOpening) {
          +          break;
          +        }
          

          We have to call the + zkw.sync(node); ? We always did that? We are doing the sync just to read the old znode value? Do we have to? Could we operate w/ stale read?

          Are you removing this:

          • HRegion region = this.rsServices.getFromOnlineRegions(encodedName);

          If so, why or have you moved the check later?

          Nicolas Liochon added a comment -

          The patch is getting thinner and thinner...
          Now I just check that the znode was not updated recently before writing it again. On the test, this brings half of what I get if I remove tickleOpening entirely.
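The throttling described above can be sketched as follows: only write the OPENING znode again if it has not been updated recently. Class and method names are illustrative, not the actual patch code:

```java
// Skip the znode write (and its sync) if the znode was updated recently.
public class TickleThrottle {
  private final long minIntervalMs;
  private long lastTickleMs;

  public TickleThrottle(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
    this.lastTickleMs = -minIntervalMs; // so the first tickle always writes
  }

  /** Returns true (and records the time) only when the interval elapsed. */
  public synchronized boolean shouldTickle(long nowMs) {
    if (nowMs - lastTickleMs < minIntervalMs) {
      return false; // znode written recently: skip the update and its sync
    }
    lastTickleMs = nowMs;
    return true;
  }
}
```

The master still sees periodic znode updates for a slow open, but a fast open no longer pays for a ZK commit at every step.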

          There is something there as well, in OpenRegionHandler#process()

                boolean failed = true;
                if (tickleOpening("post_region_open")) {
                  if (updateMeta(region)) {
                    failed = false;
                  }
                }
                if (failed || this.server.isStopped() ||
          

          This 'tickleOpening' is called regardless of the time spent previously, before updating meta. If we remove it completely, we save a sync.

          Nicolas Liochon added a comment -

          The issue comes from mixing asynchronous writes with synchronous writes & reads. My patch is now working but adds some complexity because of the added resynchronisation. I'm gonna do something simpler. The ideal solution would be to be fully async for the region opening (and I'm not gonna do that here)

          Nicolas Liochon added a comment -

          I will fix it with the other stuff, then. Currently my patch does not work locally, but I'm making progress.

          stack added a comment -

          In theory, we should break the loop when tickleOpening becomes false.

          Yes. That is a bug.

          Nicolas Liochon added a comment -

          I'm a little bit scared with the watchers: I would not like to have race conditions (if the region is opened twice for example). I'm trying something with only the async part...

          ramkrishna.s.vasudevan added a comment -

          Yes N.. You are right ...

          Nicolas Liochon added a comment - - edited

          Do we not have this currently? If someone else changes the znode, we don't notice? We only notice when we go to update the znode?

          I haven't found where we do it. The result of tickleOpening is actually often ignored as well, or at least ignored for a long time.
          for example, in OpenRegionHandler#updateMeta(final HRegion r), we have

          tickleOpening = true;
          while (/* condition, but not on tickleOpening */) {
            // do something
            tickleOpening = tickleOpening();
          }

          return something && tickleOpening;
          

          In theory, we should break the loop when tickleOpening becomes false.

          Whereas, the way it's written, it seems that we can have a failure once, then a success.
          Basically, it seems that tickleOpening is not always used as a check.
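The fixed shape of the loop (break as soon as tickleOpening fails) can be sketched with stand-in suppliers; none of this is the actual OpenRegionHandler code:

```java
import java.util.function.BooleanSupplier;

public final class TickleLoop {
  /**
   * Same shape as the pseudocode above, but with the missing check: stop
   * as soon as a tickle fails, since a failed tickle can mean the region
   * server no longer owns the OPENING znode.
   */
  public static boolean runLoop(BooleanSupplier condition,
                                Runnable work,
                                BooleanSupplier tickleOpening) {
    boolean tickled = true;
    while (condition.getAsBoolean()) {
      work.run();                            // "do something"
      tickled = tickleOpening.getAsBoolean();
      if (!tickled) {
        break; // the fix: a single failure ends the loop for good
      }
    }
    return tickled; // caller would combine: something && tickled
  }
}
```

Without the break, a tickle failure in one iteration can be overwritten by a success in the next, which is exactly the "failure once, then a success" case described above.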

          stack added a comment -

          use a watcher to see if someone else is updating the znode

          Do we not have this currently? If someone else changes the znode, we don't notice? We only notice when we go to update the znode?

          Should we list how many znode updates happen for a region open?

          Async'ing the OPENINGs so master gets a tickle that the OPEN is progressing sounds fine... We don't actually move any files around on OPEN so its not a 'problem' if not the owner, really... Its only later after compaction, etc., that the damage is done... that is what we should prevent happening for sure.

          Nicolas Liochon added a comment -

          A possible solution could be:

          • use a watcher to see if someone else is updating the znode
          • use an asynchronous write to update the znode when we want to say to the master we're still there.

          It's mostly as today, except that we suppress the synchronous write on the regionserver path.
          Pros: should be a small enough change.
          Cons: we're still loading ZK (and the master as a side effect).

          What do you think?

          stack added a comment -

          If one region server is opening a lot of regions, we just need one handler to tickle the opening.

          This would be a significant change for I'm-not-sure-what-benefit. The zk transactions are region level/scoped and their handling is done at this level in open/close exec handlers. Making it so RS does tracking and updating state for master to read regards CLOSING/OPENING would alter a bunch of code.

          I'm all for a reexamination of base operations. Stuff is this way because we would have issues where an open would stall for whatever reason... Master would intercept the open, take over the region and give it to someone else to open. More often than not, we'd just fail again for same reason on the new location but the odd time the new re-attempt would succeed.

          We could give up reattempt and just let everything hinge on whether a region server has a zk lease or not and let ServerShutdownHandler do it all. It'd be a pretty radical difference. Simplify code but also, in an odd case, it might mean we'd fail to recover a region (I don't have stats on this).

          We used to have the 'ownership' issue as Stack mentioned. Now, I think we are fine since AM should have a consistent view of region states.

          This is a similar type of leap-in-the-dark (smile). I love the notion that AM is now rock solid. It may be given the work expended (it certainly is a million times better) but I'd like us to run w/ the new AM a while in a few productions before making this ruling.

          Again, if we could undo the OPENING/CLOSING, etc., stuff would be cleaner/simpler (logs would be way less noisy)

          Jimmy Xiang added a comment -

          If one region server is opening a lot of regions, we just need one handler to tickle the opening. Master just needs to know the region server is still running so that it doesn't time out the assignment.

          To me, this is overkill. We can combine this logic and something else, for example, to detect if a regionserver is dead.

          We used to have the 'ownership' issue as Stack mentioned. Now, I think we are fine since AM should have a consistent view of region states.

          stack added a comment -

          ...the callback would write a variable that would be tested by tickleOpening, and we're done. And as the test cost is cheap, we can test more often.

          You mean you'd keep writing the znode, just async?

          May be the master could ask to the region server if it's ok?

          So master would ping after every region that is opening? Would this be better/less costly than what we have now?

          In past, before we moved assign to all-zk, on the region server heartbeat to the master, the RS would include the list of OPENING regions.

          Nicolas Liochon added a comment -

          For the regionserver side, the callback won't be very difficult to do I think: the callback would write a variable that would be tested by tickleOpening, and we're done. And as the test cost is cheap, we can test more often.

          For master, hum... there are different options. May be the master could ask to the region server if it's ok? This would be done if there's no feedback after a while and if the regionserver is still alive from a zookeeper point of view? This would allow testing multiple regions simultaneously (all the regions of this regionserver).
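The regionserver-side callback idea above can be sketched as follows: the asynchronous znode-update completion callback only records success or failure in a flag, and tickleOpening just reads that flag, so checking is cheap and can be done often. Names are illustrative and no real ZooKeeper call is made here:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Decouple the expensive znode write (done asynchronously) from the
// cheap ownership check done on the region-open path.
public class AsyncTickle {
  private final AtomicBoolean lastUpdateOk = new AtomicBoolean(true);

  /** What the async znode-update completion callback would invoke. */
  public void onZnodeUpdateComplete(boolean ok) {
    lastUpdateOk.set(ok);
  }

  /** Cheap check callable from the region-open path as often as needed. */
  public boolean tickleOpening() {
    return lastUpdateOk.get();
  }
}
```

The trade-off is that a lost-ownership failure is only observed at the next flag check after the callback fires, not at the moment of the write.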

          stack added a comment -

          Moving znode from OPENING to OPENING or CLOSING to CLOSING ensuring the sequence numbers are as we expect is how we check we still have 'ownership' of the region before we go about making alterations in the filesystem. The callback the master gets when we 'update' the znode is used to flag the master that the region server is still 'alive' and working on the opening so the master updates its opening timer when it gets the callback.

          It is imperfect in that we could lose the lease between the check and the fs operation.

          An alternative would be to remove this mechanism and instead just rely on our getting a callback if the znode is taken from us? This would happen in another thread. We would have to be watching. This seems coarser than what we currently have, widening the window during which we could do fs operations on a region though we've lost ownership. Master would get no notification that a region server is opening a region, only that it's taking longer than usual.

          Any other suggestions?


            People

            • Assignee:
              Nicolas Liochon
              Reporter:
              Nicolas Liochon