HBase
  1. HBase
  2. HBASE-5877

When a query fails because the region has moved, let the regionserver return the new address to the client

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.95.0
    • Component/s: Client, master, regionserver
    • Labels:
      None

      Description

      This is mainly useful when we do a rolling restart. This will decrease the load on the master and the network load.

      Note that a region is not immediately opened after a close. So:

      • it seems preferable to wait before retrying on the other server. An optimisation would be to have an heuristic depending on when the region was closed.
      • during a rolling restart, the server moves the regions then stops. So we may have failures when the server is stopped, and this patch won't help.

      The implementation in the first patch does:

      • on the region move, there is an added parameter on the regionserver#close to say where we are sending the region
      • the regionserver keeps a list of what was moved. Each entry is kept 100 seconds.
      • the regionserver sends a specific exception when it receives a query on a moved region. This exception contains the new address.
      • the client analyses the exeptions and update its cache accordingly...
      1. 5877.v1.patch
        35 kB
        Nicolas Liochon
      2. 5877.v12.patch
        61 kB
        Nicolas Liochon
      3. 5877.v15.patch
        65 kB
        Nicolas Liochon
      4. 5877.v18.patch
        65 kB
        Nicolas Liochon
      5. 5877.v18.patch
        65 kB
        Nicolas Liochon
      6. 5877.v18.patch
        65 kB
        Nicolas Liochon
      7. 5877.v6.patch
        56 kB
        Nicolas Liochon
      8. 5877-v16.txt
        62 kB
        Ted Yu
      9. 5877-v17.txt
        62 kB
        stack
      10. 5877-v17.txt
        62 kB
        Ted Yu

        Issue Links

          Activity

          stack made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          stack made changes -
          Fix Version/s 0.98.0 [ 12323143 ]
          stack made changes -
          Fix Version/s 0.98.0 [ 12323143 ]
          stack made changes -
          Fix Version/s 0.95.0 [ 12324094 ]
          Fix Version/s 0.96.0 [ 12320040 ]
          Hide
          Nicolas Liochon added a comment -

          @jd Yep, you're right.

          Show
          Nicolas Liochon added a comment - @jd Yep, you're right.
          Hide
          Jean-Daniel Cryans added a comment -

          Nicolas,

          Do you think this log message could be removed?

          12/11/29 15:17:36 INFO client.HConnectionManager$HConnectionImplementation: Region TestTable,0001966229,1354231005211.1bbba78dda968874d2981c322ed3319f. moved from 572ba57e-1cab-4f9c-a071-782e5a1a7184.cs1cloud.internal:60020, updating client location cache. New server: 20590793-0e19-4eb4-b2f6-05de8244f716.cs1cloud.internal:60020
          

          Right now I'm running some loading tests and I'm getting walls of text every time a split happens and it's basically the same message repeated hundreds of times. We used to have a similar message before but we removed it since it's pretty spammy (or we set it to DEBUG, can't remember).

          Show
          Jean-Daniel Cryans added a comment - Nicolas, Do you think this log message could be removed? 12/11/29 15:17:36 INFO client.HConnectionManager$HConnectionImplementation: Region TestTable,0001966229,1354231005211.1bbba78dda968874d2981c322ed3319f. moved from 572ba57e-1cab-4f9c-a071-782e5a1a7184.cs1cloud.internal:60020, updating client location cache. New server: 20590793-0e19-4eb4-b2f6-05de8244f716.cs1cloud.internal:60020 Right now I'm running some loading tests and I'm getting walls of text every time a split happens and it's basically the same message repeated hundreds of times. We used to have a similar message before but we removed it since it's pretty spammy (or we set it to DEBUG, can't remember).
          Nicolas Liochon made changes -
          Link This issue is related to HBASE-5992 [ HBASE-5992 ]
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          ramkrishna.s.vasudevan added a comment -

          The code is committed. If the test failures are not related to this, you can resolve this issue N.

          Show
          ramkrishna.s.vasudevan added a comment - The code is committed. If the test failures are not related to this, you can resolve this issue N.
          Hide
          Nicolas Liochon added a comment -

          Strange, the status is "failure", but actually the code is available in the trunk (which is good, it saves another merge)

          Show
          Nicolas Liochon added a comment - Strange, the status is "failure", but actually the code is available in the trunk (which is good, it saves another merge)
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526501/5877.v18.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1845//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526501/5877.v18.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1845//console This message is automatically generated.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Nicolas Liochon made changes -
          Attachment 5877.v18.patch [ 12526501 ]
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #197 (See https://builds.apache.org/job/HBase-TRUNK-security/197/)
          HBASE-5877 When a query fails because the region has moved, let the regionserver return the new address to the client (N Keywal) (Revision 1336301)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/RegionMovedException.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/OnlineRegions.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
          • /hbase/trunk/src/main/protobuf/Admin.proto
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-security #197 (See https://builds.apache.org/job/HBase-TRUNK-security/197/ ) HBASE-5877 When a query fails because the region has moved, let the regionserver return the new address to the client (N Keywal) (Revision 1336301) Result = FAILURE tedyu : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/RegionMovedException.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/OnlineRegions.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java /hbase/trunk/src/main/protobuf/Admin.proto /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2858 (See https://builds.apache.org/job/HBase-TRUNK/2858/)
          HBASE-5877 When a query fails because the region has moved, let the regionserver return the new address to the client (N Keywal) (Revision 1336301)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/RegionMovedException.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/OnlineRegions.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
          • /hbase/trunk/src/main/protobuf/Admin.proto
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2858 (See https://builds.apache.org/job/HBase-TRUNK/2858/ ) HBASE-5877 When a query fails because the region has moved, let the regionserver return the new address to the client (N Keywal) (Revision 1336301) Result = FAILURE tedyu : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/RegionMovedException.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/OnlineRegions.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java /hbase/trunk/src/main/protobuf/Admin.proto /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestHCM.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Hide
          Nicolas Liochon added a comment -

          Now how does the updateCachelocation help here. If for some reason the opening of the region is not yet done and if the client gets RegionMovedException the client will try to contact the RS thinking the region got moved to it.

          Yes, exactly. That's why I kept the sleep in the client code even for this RegionMoved. We could optimize this by adding a timestamps, with an heuristic like: "we give two seconds for the region to move after it's closed on the origin server". Sharing the region state in ZK would be a simpler option, as we would know if the region has moved or not.

          Show
          Nicolas Liochon added a comment - Now how does the updateCachelocation help here. If for some reason the opening of the region is not yet done and if the client gets RegionMovedException the client will try to contact the RS thinking the region got moved to it. Yes, exactly. That's why I kept the sleep in the client code even for this RegionMoved. We could optimize this by adding a timestamps, with an heuristic like: "we give two seconds for the region to move after it's closed on the origin server". Sharing the region state in ZK would be a simpler option, as we would know if the region has moved or not.
          Hide
          ramkrishna.s.vasudevan added a comment -

          Just saw the intent behind this JIRA. It is very important and a good feature. You have done a lot in this but the JIRA says it is Minor .

          It doesn't know if the region is opened yet. We could have this by adding the info in zk, but it would increase the zk load...

          Now how does the updateCachelocation help here. If for some reason the opening of the region is not yet done and if the client gets RegionMovedException the client will try to contact the RS thinking the region got moved to it.
          Thanks N.

          Show
          ramkrishna.s.vasudevan added a comment - Just saw the intent behind this JIRA. It is very important and a good feature. You have done a lot in this but the JIRA says it is Minor . It doesn't know if the region is opened yet. We could have this by adding the info in zk, but it would increase the zk load... Now how does the updateCachelocation help here. If for some reason the opening of the region is not yet done and if the client gets RegionMovedException the client will try to contact the RS thinking the region got moved to it. Thanks N.
          Hide
          Ted Yu added a comment -

          Integrated to trunk.

          Thanks for the patch, N.

          Thanks for the review, Stack.

          Show
          Ted Yu added a comment - Integrated to trunk. Thanks for the patch, N. Thanks for the review, Stack.
          Hide
          Nicolas Liochon added a comment -

          Seems to be ready for a commit :-]

          Show
          Nicolas Liochon added a comment - Seems to be ready for a commit :-]
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526123/5877.v18.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526123/5877.v18.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1812//console This message is automatically generated.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Nicolas Liochon made changes -
          Attachment 5877.v18.patch [ 12526123 ]
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526047/5877.v18.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
          org.apache.hadoop.hbase.replication.TestReplication

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526047/5877.v18.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster org.apache.hadoop.hbase.replication.TestReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1801//console This message is automatically generated.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Nicolas Liochon made changes -
          Attachment 5877.v18.patch [ 12526047 ]
          Hide
          Nicolas Liochon added a comment -

          Hopefully I didn't break anything while doing the merge. Local tests are ok. I included Ted's fix.

          Show
          Nicolas Liochon added a comment - Hopefully I didn't break anything while doing the merge. Local tests are ok. I included Ted's fix.
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Nicolas Liochon added a comment -

          Ok... I will do the merge and provide another patch today...

          Show
          Nicolas Liochon added a comment - Ok... I will do the merge and provide another patch today...
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525963/5877-v17.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1793//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525963/5877-v17.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1793//console This message is automatically generated.
          stack made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          stack made changes -
          Attachment 5877-v17.txt [ 12525963 ]
          Hide
          stack added a comment -

          Retry

          Show
          stack added a comment - Retry
          stack made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Nicolas Liochon added a comment -

          I didn't find the test in error. Could be committed imho.

          Show
          Nicolas Liochon added a comment - I didn't find the test in error. Could be committed imho.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525684/5877-v17.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525684/5877-v17.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1775//console This message is automatically generated.
          Ted Yu made changes -
          Attachment 5877-v17.txt [ 12525684 ]
          Hide
          Ted Yu added a comment -

          I forgot to add the check in one place.

          Trying v17.

          Show
          Ted Yu added a comment - I forgot to add the check in one place. Trying v17.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525676/5877-v16.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525676/5877-v16.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestMultiParallel Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1773//console This message is automatically generated.
          Hide
          Nicolas Liochon added a comment -

          thank you Ted. It's strange, it though it worked locally. May be this test
          was hung and was not reported as an error. Thank you anyway.

          Show
          Nicolas Liochon added a comment - thank you Ted. It's strange, it though it worked locally. May be this test was hung and was not reported as an error. Thank you anyway.
          Ted Yu made changes -
          Attachment 5877-v16.txt [ 12525676 ]
          Show
          Ted Yu added a comment - Patch v16 is able to pass TestMultiParallel. See the following link for the reason of infinite recursion: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/ipc/RemoteException.java#RemoteException.unwrapRemoteException%28%29
          Hide
          Ted Yu added a comment -

          I can reproduce the TestMultiParallel failure:

          testFlushCommitsNoAbort(org.apache.hadoop.hbase.client.TestMultiParallel)  Time elapsed: 1.36 sec  <<< ERROR!
          java.lang.StackOverflowError
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:91)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
            at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
          
          Show
          Ted Yu added a comment - I can reproduce the TestMultiParallel failure: testFlushCommitsNoAbort(org.apache.hadoop.hbase.client.TestMultiParallel) Time elapsed: 1.36 sec <<< ERROR! java.lang.StackOverflowError at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:91) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108) at org.apache.hadoop.hbase.RegionMovedException.find(RegionMovedException.java:108)
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525660/5877.v15.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525660/5877.v15.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestMultiParallel Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1770//console This message is automatically generated.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Nicolas Liochon made changes -
          Attachment 5877.v15.patch [ 12525660 ]
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Nicolas Liochon added a comment -

          @stack

          You don't want to have RegionMovedException carry a ServerName#toString instead of host and port?

          I think it's safer this way, as I have to parse the string afterward. Otherwise, if someone modifies ServerName#toString he will break the parsing in RegionMovedException, a class he may never have heard of (yes, it will break the unit test )

          Is this a bug fix?

          Unfortunately, it's a feature. The error management is duplicated, and I have to manage both cases, because we don't have the exception when we come back to the result later in the code.

          Put the history of moved regions out into its own class?

          You're right, it would be better. Wil do.

          Don't presize this I'd say: private static final long TIMEOUT_REGION_MOVED = (2L * 60L * 1000L);

          You would prefer a configurable value?

          Stuff is lazily cleared from movedRegions? Should we have a cleaner come visit occasionally?

          Aggreed, it would be better with a cleaner. Will do as well.

          Why you say the above? When we protobuf it, it'll just be an option so it shouldn't be too bad?

          Yeah, if it was too bad I would not have proposed it . It's an imperfection to accept I think. We would not have it if we share the regions state within the cluster with ZK.

          @ted

          Under what condition would newHrl be null above ?

          Oops. Refactoring error. Removed.

          Please remove the space between newHrl and ')' below:

          Done.

          Would the above code result in NPE since I see the following in javadoc:

          It should not happen because we test hrl value before. But I added a check on the arguments to make it safer.

          Since updateCachedLocations() is used to handle exception, the presizing above may not be needed.

          Since updateCachedLocations() is used to handle exception, the presizing above may not be needed.

          Yeah, I sized it thinking: if we're doing a rolling restart we may have 100 regions with a wrong location if we're really unlucky. As it small, any solution would work here, but I prefer to have the size explicitly set, as it says "I though about it, that's a reasonable size". I added a comment however.

          The indentation of CloseRegionHandler() above is off.

          Fixed.

          'will contains' -> 'will contain'. 'keep a too old' -> 'keep too old'.

          Fixed.

          Show
          Nicolas Liochon added a comment - @stack You don't want to have RegionMovedException carry a ServerName#toString instead of host and port? I think it's safer this way, as I have to parse the string afterward. Otherwise, if someone modifies ServerName#toString he will break the parsing in RegionMovedException, a class he may never have heard of (yes, it will break the unit test ) Is this a bug fix? Unfortunately, it's a feature. The error management is duplicated, and I have to manage both cases, because we don't have the exception when we come back to the result later in the code. Put the history of moved regions out into its own class? You're right, it would be better. Wil do. Don't presize this I'd say: private static final long TIMEOUT_REGION_MOVED = (2L * 60L * 1000L); You would prefer a configurable value? Stuff is lazily cleared from movedRegions? Should we have a cleaner come visit occasionally? Aggreed, it would be better with a cleaner. Will do as well. Why you say the above? When we protobuf it, it'll just be an option so it shouldn't be too bad? Yeah, if it was too bad I would not have proposed it . It's an imperfection to accept I think. We would not have it if we share the regions state within the cluster with ZK. @ted Under what condition would newHrl be null above ? Oops. Refactoring error. Removed. Please remove the space between newHrl and ')' below: Done. Would the above code result in NPE since I see the following in javadoc: It should not happen because we test hrl value before. But I added a check on the arguments to make it safer. Since updateCachedLocations() is used to handle exception, the presizing above may not be needed. Since updateCachedLocations() is used to handle exception, the presizing above may not be needed. Yeah, I sized it thinking: if we're doing a rolling restart we may have 100 regions with a wrong location if we're really unlucky. As it small, any solution would work here, but I prefer to have the size explicitly set, as it says "I though about it, that's a reasonable size". I added a comment however. The indentation of CloseRegionHandler() above is off. Fixed. 'will contains' -> 'will contain'. 'keep a too old' -> 'keep too old'. Fixed.
          Hide
          Ted Yu added a comment -
          +        final Set<String> updateHistory = new HashSet<String>(100);
          

          Since updateCachedLocations() is used to handle exception, the presizing above may not be needed.

          +    this(server, rsServices, regionInfo, abort, zk, versionOfClosingNode, eventType, null);
          +  }
          +
          +    protected CloseRegionHandler(final Server server,
          

          The indentation of CloseRegionHandler() above is off.

          +  // This map will contains all the regions that we closed for a move.
          +  //  We add the time it was moved as we don't want to keep a too old information
          

          'will contains' -> 'will contain'. 'keep a too old' -> 'keep too old'.

          This is a very useful feature. Good work, N.

          Show
          Ted Yu added a comment - + final Set< String > updateHistory = new HashSet< String >(100); Since updateCachedLocations() is used to handle exception, the presizing above may not be needed. + this (server, rsServices, regionInfo, abort, zk, versionOfClosingNode, eventType, null ); + } + + protected CloseRegionHandler( final Server server, The indentation of CloseRegionHandler() above is off. + // This map will contains all the regions that we closed for a move. + // We add the time it was moved as we don't want to keep a too old information 'will contains' -> 'will contain'. 'keep a too old' -> 'keep too old'. This is a very useful feature. Good work, N.
          Hide
          Ted Yu added a comment -
          +        HRegionLocation newHrl = new HRegionLocation(hrl.getRegionInfo(), hostname, port);
          +        if (newHrl == null ){
          +          // Should not happened. Just in case.
          

          Under what condition would newHrl be null above ?
          Please remove the space between newHrl and ')' below:

          +        cacheLocation(hrl.getRegionInfo().getTableName(), newHrl  );
          

          For updateCachedLocations():

          +        hrl : getCachedLocation(tableName, row.getRow()));
          

          Would the above code result in NPE since I see the following in javadoc:

          +     * @param row  - can be null if hrl is not null.
          

          More review comments to follow.

          Show
          Ted Yu added a comment - + HRegionLocation newHrl = new HRegionLocation(hrl.getRegionInfo(), hostname, port); + if (newHrl == null ){ + // Should not happened. Just in case . Under what condition would newHrl be null above ? Please remove the space between newHrl and ')' below: + cacheLocation(hrl.getRegionInfo().getTableName(), newHrl ); For updateCachedLocations(): + hrl : getCachedLocation(tableName, row.getRow())); Would the above code result in NPE since I see the following in javadoc: + * @param row - can be null if hrl is not null . More review comments to follow.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525508/5877.v12.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525508/5877.v12.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestMultiParallel Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1749//console This message is automatically generated.
          Hide
          stack added a comment -

          You don't want to have RegionMovedException carry a ServerName#toString instead of host and port? Or it doesn't make sense when our cached region exceptions are keyed by hostname+port only?

          Is this a bug fix?

          @@ -1910,6 +1989,7 @@ public class HConnectionManager {
                       }
                     } catch (ExecutionException e) {
                       LOG.debug("Failed all from " + loc, e);
          +            updateCachedLocations(updateHistory, loc, e);
          

          Put the history of moved regions out into its own class?

          Don't presize this I'd say:

          + private static final long TIMEOUT_REGION_MOVED = (2L * 60L * 1000L);

          Stuff is lazily cleared from movedRegions? Should we have a cleaner come visit occasionally?

          Patch looks fine to me. Nice fat test.

          5) The destination is the closeRegion interface is a kind of interface hijacking. Other options would be:

          Why you say the above? When we protobuf it, it'll just be an option so it shouldn't be too bad?

          The HCM stuff is ugly but thats not your fault.

          Show
          stack added a comment - You don't want to have RegionMovedException carry a ServerName#toString instead of host and port? Or it doesn't make sense when our cached region exceptions are keyed by hostname+port only? Is this a bug fix? @@ -1910,6 +1989,7 @@ public class HConnectionManager { } } catch (ExecutionException e) { LOG.debug( "Failed all from " + loc, e); + updateCachedLocations(updateHistory, loc, e); Put the history of moved regions out into its own class? Don't presize this I'd say: + private static final long TIMEOUT_REGION_MOVED = (2L * 60L * 1000L); Stuff is lazily cleared from movedRegions? Should we have a cleaner come visit occasionally? Patch looks fine to me. Nice fat test. 5) The destination is the closeRegion interface is a kind of interface hijacking. Other options would be: Why you say the above? When we protobuf it, it'll just be an option so it shouldn't be too bad? The HCM stuff is ugly but thats not your fault.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Nicolas Liochon made changes -
          Attachment 5877.v12.patch [ 12525508 ]
          Nicolas Liochon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Nicolas Liochon added a comment -

          v12, should be final.

          1) ServerName is used everywhere in the interface, thanks to protobuf
          2) hadoop.ipc serialization of exception is based on the #getMessage. So we have to parse it internally. It's not visisble to the exception user.
          3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new "historyList" to make it works. There is another JIRA for other improvement, in which I could get rid of this (HBASE-5924)
          4) Generated with protobuf 2.4.1
          5) The destination is the closeRegion interface is a kind of interface hijacking. Other options would be:

          • sharing the region state in zookeeper
          • letting the regionserver calls the master to get the new server. On paper this would be more efficient than a client -> master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split regin; ...). That would increase global reliability. That's for another discussion as well I think.
            6) RegionServerServices has been modified to set a destination when removing a region from the online regions.
            7) In another JIRA I will manage the case when the destination is not specified when calling the move function.
          Show
          Nicolas Liochon added a comment - v12, should be final. 1) ServerName is used everywhere in the interface, thanks to protobuf 2) hadoop.ipc serialization of exception is based on the #getMessage. So we have to parse it internally. It's not visisble to the exception user. 3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new "historyList" to make it works. There is another JIRA for other improvement, in which I could get rid of this ( HBASE-5924 ) 4) Generated with protobuf 2.4.1 5) The destination is the closeRegion interface is a kind of interface hijacking. Other options would be: sharing the region state in zookeeper letting the regionserver calls the master to get the new server. On paper this would be more efficient than a client -> master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split regin; ...). That would increase global reliability. That's for another discussion as well I think. 6) RegionServerServices has been modified to set a destination when removing a region from the online regions. 7) In another JIRA I will manage the case when the destination is not specified when calling the move function.
          Nicolas Liochon made changes -
          Link This issue relates to HBASE-5924 [ HBASE-5924 ]
          Hide
          Nicolas Liochon added a comment -

          I have the same locally, so it's likely my patch...

          Show
          Nicolas Liochon added a comment - I have the same locally, so it's likely my patch...
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525304/5877.v6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.util.TestMergeTable
          org.apache.hadoop.hbase.io.encoding.TestLoadAndSwitchEncodeOnDisk
          org.apache.hadoop.hbase.client.TestMultiParallel
          org.apache.hadoop.hbase.master.TestMasterFailover
          org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
          org.apache.hadoop.hbase.master.TestZKBasedOpenCloseRegion
          org.apache.hadoop.hbase.coprocessor.TestRowProcessorEndpoint
          org.apache.hadoop.hbase.client.TestFromClientSide3
          org.apache.hadoop.hbase.master.TestRollingRestart
          org.apache.hadoop.hbase.util.TestMiniClusterLoadEncoded
          org.apache.hadoop.hbase.avro.TestAvroServer
          org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525304/5877.v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.util.TestMergeTable org.apache.hadoop.hbase.io.encoding.TestLoadAndSwitchEncodeOnDisk org.apache.hadoop.hbase.client.TestMultiParallel org.apache.hadoop.hbase.master.TestMasterFailover org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster org.apache.hadoop.hbase.master.TestZKBasedOpenCloseRegion org.apache.hadoop.hbase.coprocessor.TestRowProcessorEndpoint org.apache.hadoop.hbase.client.TestFromClientSide3 org.apache.hadoop.hbase.master.TestRollingRestart org.apache.hadoop.hbase.util.TestMiniClusterLoadEncoded org.apache.hadoop.hbase.avro.TestAvroServer org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1724//console This message is automatically generated.
          Nicolas Liochon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Nicolas Liochon added a comment -

          Generated with protobuf 2.4.1

          Here are the things I'm not a big fan, but for which I don't have a better solution:

          • the move management in the client code: I think it's possible to change the way we manage error (don't wait for all results before retrying), but that would be for another JIRA
          • the destination is the closeRegion interface is a kind of interface hijacking. Other options would be:
          • sharing the region state in zookeeper
          • letting the regionserver calls the master to get the new server. On paper this would be more efficient than a client -> master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split regin; ...). That would increase global reliability. That's for another discussion as well I think.

          Here is what I plan to do in the final version

          • move the handler functional code into a function in HRegionServer: this would allow to have the function addToMovedRegion as private instead of public.
          • Change all the CloseRegionHandler to take a RegionServer instead of a server? I'm not really keen on adding a class RegionServerServices, but may be I should?
          • Manage the case when the destination is not specified at the beginning of the move (may be in a different Jira if it's not simple)...

          All the previous comment should have been taken into account.

          Show
          Nicolas Liochon added a comment - Generated with protobuf 2.4.1 Here are the things I'm not a big fan, but for which I don't have a better solution: the move management in the client code: I think it's possible to change the way we manage error (don't wait for all results before retrying), but that would be for another JIRA the destination is the closeRegion interface is a kind of interface hijacking. Other options would be: sharing the region state in zookeeper letting the regionserver calls the master to get the new server. On paper this would be more efficient than a client -> master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split regin; ...). That would increase global reliability. That's for another discussion as well I think. Here is what I plan to do in the final version move the handler functional code into a function in HRegionServer: this would allow to have the function addToMovedRegion as private instead of public. Change all the CloseRegionHandler to take a RegionServer instead of a server? I'm not really keen on adding a class RegionServerServices, but may be I should? Manage the case when the destination is not specified at the beginning of the move (may be in a different Jira if it's not simple)... All the previous comment should have been taken into account.
          Nicolas Liochon made changes -
          Attachment 5877.v6.patch [ 12525304 ]
          Hide
          Nicolas Liochon added a comment -

          This patch will benefit any move, not just rolling restart, right?

          Yes, but as there is a wait time between two tries, I think the benefit will be minimal vs. the wait time for a single client. I could add an heuristic like if region was closed more than 2 seconds ago, consider that it's now available on the new server and don't sleep before the next retry. That could lead of having more network messages if the rule is wrong (and the rule will be wrong when the system is overloaded), and it will add some complexity to the client code. Having the real status of the region would solve this.

          Anyway, with the dev already done to cut the link between master & clients, it can help to save a reconnect to master. And during a rolling restart with regions moving everywhere, I think it will make a real difference.

          I don't see changes to make use of this new functionality? I'd expect the balancer in master to make use of it?

          Yes, it's the changes in AssignmentManager: the changes are in the patch, but are quite small at the end: basically:

          -    unassign(plan.getRegionInfo());
          +    unassign(plan.getRegionInfo(), false, plan.getDestination());
          

          I still need to manage the case when the destination is not specified at the beginning.

          Show
          Nicolas Liochon added a comment - This patch will benefit any move, not just rolling restart, right? Yes, but as there is a wait time between two tries, I think the benefit will be minimal vs. the wait time for a single client. I could add an heuristic like if region was closed more than 2 seconds ago, consider that it's now available on the new server and don't sleep before the next retry. That could lead of having more network messages if the rule is wrong (and the rule will be wrong when the system is overloaded), and it will add some complexity to the client code. Having the real status of the region would solve this. Anyway, with the dev already done to cut the link between master & clients, it can help to save a reconnect to master. And during a rolling restart with regions moving everywhere, I think it will make a real difference. I don't see changes to make use of this new functionality? I'd expect the balancer in master to make use of it? Yes, it's the changes in AssignmentManager: the changes are in the patch, but are quite small at the end: basically: - unassign(plan.getRegionInfo()); + unassign(plan.getRegionInfo(), false, plan.getDestination()); I still need to manage the case when the destination is not specified at the beginning.
          Hide
          stack added a comment -

          @N This patch will benefit any move, not just rolling restart, right?

          Here are a quick couple of comments on the patch.

          I like addition of RegionMovedException.

          Convention is to capitalize static defines: '+ private static final String hostField = "hostname=";' so it should be HOSTFIELD.

          Its super ugly that you have to parse exception message to get exception data member fields... but thats not your fault.

          Please keep the style of the surrounding code. This kinda thing is unconventional:

          +    private void updateCachedLocations(
          +      Set<String> updateHistory,
          +      HRegionLocation hrl,
          +      Object t) {
          

          Ugh on how ugly it is updating cache. Again, not your fault.

          Ted suggests updating interface version. Maybe don't. If you do, you can't get this into a 0.94.1, etc.

          I don't see changes to make use of this new functionality? I'd expect the balancer in master to make use of it?

          Show
          stack added a comment - @N This patch will benefit any move, not just rolling restart, right? Here are a quick couple of comments on the patch. I like addition of RegionMovedException. Convention is to capitalize static defines: '+ private static final String hostField = "hostname=";' so it should be HOSTFIELD. Its super ugly that you have to parse exception message to get exception data member fields... but thats not your fault. Please keep the style of the surrounding code. This kinda thing is unconventional: + private void updateCachedLocations( + Set< String > updateHistory, + HRegionLocation hrl, + Object t) { Ugh on how ugly it is updating cache. Again, not your fault. Ted suggests updating interface version. Maybe don't. If you do, you can't get this into a 0.94.1, etc. I don't see changes to make use of this new functionality? I'd expect the balancer in master to make use of it?
          Hide
          Ted Yu added a comment -

          @N:
          If the testing result is favorable, I think Lars may want it in 0.94 as well.
          I think making this feature functional in 0.94 cluster would be a good start.

          We could have this by adding the info in zk

          A separate discussion should be started w.r.t. the above. This would shift load imposed by clients from master to zk quorum.

          Show
          Ted Yu added a comment - @N: If the testing result is favorable, I think Lars may want it in 0.94 as well. I think making this feature functional in 0.94 cluster would be a good start. We could have this by adding the info in zk A separate discussion should be started w.r.t. the above. This would shift load imposed by clients from master to zk quorum.
          Hide
          Nicolas Liochon added a comment -

          Note that I'm currently rewriting the patch, as it conflicts with the protobuf stuff that was committed recently... But the logic hasn't changed.

          @ted What we're saving in the current implementation is a call to the master. It can be interesting in itself if the region moves is used by a lot of clients. We could do better by letting the client know that the region is now fully available somewhere else and that there is no need to wait before retrying. But right now the region server only knows that the region is closed and moved to another server. It doesn't know if the region is opened yet. We could have this by adding the info in zk, but it would increase the zk load...

          Show
          Nicolas Liochon added a comment - Note that I'm currently rewriting the patch, as it conflicts with the protobuf stuff that was committed recently... But the logic hasn't changed. @ted What we're saving in the current implementation is a call to the master. It can be interesting in itself if the region moves is used by a lot of clients. We could do better by letting the client know that the region is now fully available somewhere else and that there is no need to wait before retrying. But right now the region server only knows that the region is closed and moved to another server. It doesn't know if the region is opened yet. We could have this by adding the info in zk, but it would increase the zk load...
          Hide
          Ted Yu added a comment -

          @N:
          I think the following validation in real cluster would illustrate the benefit of this feature:
          For given table, select a region server and note the row key ranges hosted by one region on the region server. Direct client load to this region.
          Issue the following command in shell:

            hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
          

          at time T.

          Difference in client response to region migration around time T with and without the patch would be interesting.

          Show
          Ted Yu added a comment - @N: I think the following validation in real cluster would illustrate the benefit of this feature: For given table, select a region server and note the row key ranges hosted by one region on the region server. Direct client load to this region. Issue the following command in shell: hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME' at time T. Difference in client response to region migration around time T with and without the patch would be interesting.
          Ted Yu made changes -
          Comment [ @N:
          I think the following validation in real cluster would illustrate the benefit of this feature:
          For given table, select a region server and note the row key ranges hosted by the region server. Direct client load to this server.
          Kill the server at time T.

          Difference in client response to region migration around time T with and without the patch would be interesting. ]
          Hide
          Nicolas Liochon added a comment -

          Can we mark the failure and make this RegionMovedException behave the same as NotServingRegionException ?

          Done.

          For updateCachedLocations(), please put explanation for parameter on the same line as the parameter:

          Done.

          'Failed all' -> 'Failed call'

          It's an existing comment that we can find again later in the code. It really means "failed all": all the queries on this server failed. I don't mind changing it to something better, but I think we should keep the "all".

          'which the server' -> 'which the region'

          Done.

          Please increase the VERSION of HRegionInterface

          Done.

          How is the server removed from cache since I see 'continue' above ?

          That's what makes this code complex and difficult to change: the error is actually managed later, when we don't have the real exception anymore.

          For ServerManager.sendRegionClose(), please add javadoc for destServerName param.

          Done.

          Is it possible that destServerName is null ?

          Safety checks added.

          Please change the above to debug log. && Why is the above fatal (regionResult != null) ? Step 4 appears in a comment below the above code. Should the above say step 3 ?

          Bad logs fixed.

          Show
          Nicolas Liochon added a comment - Can we mark the failure and make this RegionMovedException behave the same as NotServingRegionException ? Done. For updateCachedLocations(), please put explanation for parameter on the same line as the parameter: Done. 'Failed all' -> 'Failed call' It's an existing comment that we can find again later in the code. It really means "failed all": all the queries on this server failed. I don't mind changing it to something better, but I think we should keep the "all". 'which the server' -> 'which the region' Done. Please increase the VERSION of HRegionInterface Done. How is the server removed from cache since I see 'continue' above ? That's what makes this code complex and difficult to change: the error is actually managed later, when we don't have the real exception anymore. For ServerManager.sendRegionClose(), please add javadoc for destServerName param. Done. Is it possible that destServerName is null ? Safety checks added. Please change the above to debug log. && Why is the above fatal (regionResult != null) ? Step 4 appears in a comment below the above code. Should the above say step 3 ? Bad logs fixed.
          Hide
          Ted Yu added a comment -

          For RegionMovedException.java:

          +    String tmpHostname = "nohostname";
          

          I think the above could potentially be a host name

          +    } catch (Exception ignored) {
          +      LOG.warn("Can't parse the hostname and the port from this string: " + s + ", "+
          +        "Continuing");
          +    }
          

          Can we mark the failure and make this RegionMovedException behave the same as NotServingRegionException ?
          For updateCachedLocations(), please put explanation for parameter on the same line as the parameter:

          +     * @param row  - row and tableName can be null id hrl is not null.
          
          +            LOG.warn("Failed all from " + loc, e);
          

          'Failed all' -> 'Failed call'

          +          if (resp == null) {
          +            // Entire server failed
          +            LOG.fatal("Failed all for server: " + loc.getHostnamePort() +
          +              ", removing from cache");
          +            continue;
          +          }
          

          How is the server removed from cache since I see 'continue' above ?

          +              } else {
          +                if (numRetries == 1)
          +                  LOG.fatal("step 4 got result " + regionResult.getFirst() + " " + regionResult.getSecond());
          

          Why is the above fatal (regionResult != null) ? Step 4 appears in a comment below the above code. Should the above say step 3 ?

          Please increase the VERSION of HRegionInterface

          +   * @param destServerName: server name on which the server will be moved
          

          'which the server' -> 'which the region'

          For ServerManager.sendRegionClose(), please add javadoc for destServerName param.
          For HRegionServer.java:

          +    LOG.info("Closing region "+region.getRegionName()+", moving to "+sn.getServerName() );
          

          Is it possible that destServerName is null ?

          +  private ServerName getMovedRegion(String encodedRegionName) {
          +    LOG.fatal("Called getMovedRegion for "+encodedRegionName+" "+ movedRegions.size()+ " "+movedRegions);
          

          Please change the above to debug log.

          Show
          Ted Yu added a comment - For RegionMovedException.java: + String tmpHostname = "nohostname" ; I think the above could potentially be a host name + } catch (Exception ignored) { + LOG.warn( "Can't parse the hostname and the port from this string: " + s + ", " + + "Continuing" ); + } Can we mark the failure and make this RegionMovedException behave the same as NotServingRegionException ? For updateCachedLocations(), please put explanation for parameter on the same line as the parameter: + * @param row - row and tableName can be null id hrl is not null . + LOG.warn( "Failed all from " + loc, e); 'Failed all' -> 'Failed call' + if (resp == null ) { + // Entire server failed + LOG.fatal( "Failed all for server: " + loc.getHostnamePort() + + ", removing from cache" ); + continue ; + } How is the server removed from cache since I see 'continue' above ? + } else { + if (numRetries == 1) + LOG.fatal( "step 4 got result " + regionResult.getFirst() + " " + regionResult.getSecond()); Why is the above fatal (regionResult != null) ? Step 4 appears in a comment below the above code. Should the above say step 3 ? Please increase the VERSION of HRegionInterface + * @param destServerName: server name on which the server will be moved 'which the server' -> 'which the region' For ServerManager.sendRegionClose(), please add javadoc for destServerName param. For HRegionServer.java: + LOG.info( "Closing region " +region.getRegionName()+ ", moving to " +sn.getServerName() ); Is it possible that destServerName is null ? + private ServerName getMovedRegion( String encodedRegionName) { + LOG.fatal( "Called getMovedRegion for " +encodedRegionName+ " " + movedRegions.size()+ " " +movedRegions); Please change the above to debug log.
          Ted Yu made changes -
          Summary When a query fails because the region has moved, let the regionserver returns the new address to the client When a query fails because the region has moved, let the regionserver return the new address to the client
          Fix Version/s 0.96.0 [ 12320040 ]
          Hide
          Nicolas Liochon added a comment -

          v1. On an old trunk, so it's just to give an overview. Includes some bits of HBASE-5844 as well.

          There are 3 workarounds in the implementation:
          1) As a ServerName is not serializable we use the String dedicated to this kind of issue. Acceptable I think.
          2) hadoop.ipc serialization of exception is based on the #getMessage. So we have to parse it internally. It's not visisble to the exception user. Still acceptable (?
          3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new "updateList" to make it works, I'm looking for another solution here...

          Show
          Nicolas Liochon added a comment - v1. On an old trunk, so it's just to give an overview. Includes some bits of HBASE-5844 as well. There are 3 workarounds in the implementation: 1) As a ServerName is not serializable we use the String dedicated to this kind of issue. Acceptable I think. 2) hadoop.ipc serialization of exception is based on the #getMessage. So we have to parse it internally. It's not visisble to the exception user. Still acceptable (? 3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new "updateList" to make it works, I'm looking for another solution here...
          Nicolas Liochon made changes -
          Attachment 5877.v1.patch [ 12524327 ]
          Nicolas Liochon made changes -
          Field Original Value New Value
          Link This issue is required by HBASE-5843 [ HBASE-5843 ]
          Nicolas Liochon created issue -

            People

            • Assignee:
              Nicolas Liochon
              Reporter:
              Nicolas Liochon
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development