HBase
  1. HBase
  2. HBASE-5974

Scanner retry behavior with RPC timeout on next() seems incorrect

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.7, 0.92.1, 0.94.0, 0.95.2
    • Fix Version/s: 0.95.0
    • Component/s: Client, regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Adds a sequence id that is incremented on each Scanner#next invocation so we can distingush calls that arrive out of order. Throws OutOfOrderScannerNextException when we encounter the latter case.

      Description

      I'm seeing the following behavior:

      • set RPC timeout to a short value
      • call next() for some batch of rows, big enough so the client times out before the result is returned
      • the HConnectionManager stuff will retry the next() call to the same server. At this point, one of two things can happen: 1) the previous next() call will still be processing, in which case you get a LeaseException, because it was removed from the map during the processing, or 2) the next() call will succeed but skip the prior batch of rows.
      1. 5974_trunk-V3.patch
        32 kB
        Anoop Sam John
      2. 5974_trunk-V2.patch
        30 kB
        Anoop Sam John
      3. 5974_trunk.patch
        30 kB
        Anoop Sam John
      4. 5974_94-V4.patch
        18 kB
        Ted Yu
      5. HBASE-5974_94-V3.patch
        17 kB
        Anoop Sam John
      6. HBASE-5974_94-V2.patch
        18 kB
        Anoop Sam John
      7. HBASE-5974_0.94.patch
        16 kB
        Anoop Sam John

        Activity

        Hide
        Anoop Sam John added a comment -

        So we might need a proper sequencing of the next() calls from the client to RS... Might be like the RegionScanner maintain the key of the last KV it passed to client and client in the next next() call pass back this key which is the last KV's key it received?

        In the scenario which is mentioned above which might cause KVs to be skipped, these 2 keys won't match and we can close the scan at RS side and give exception which in turn can cause a new scanner creation from that point of last KV from the client side... [Some thing like how currently ClientScanner react to NSRE etc]

        When the client is not passing any last key(null), there wont be any check in the RegionScanner level and scan will continue as normal like the current way. One down side is that with every scan() call to the RS, need to pass some extra data.. May be if we can distinguish the 1st attempt of call() in the ScannerCallable and the later attempts, we can avoid passing the key for the 1st RS call. Only for the later trials we pass the key...

        These are some thoughts came to me after going through the code.. Your thoughts...

        Show
        Anoop Sam John added a comment - So we might need a proper sequencing of the next() calls from the client to RS... Might be like the RegionScanner maintain the key of the last KV it passed to client and client in the next next() call pass back this key which is the last KV's key it received? In the scenario which is mentioned above which might cause KVs to be skipped, these 2 keys won't match and we can close the scan at RS side and give exception which in turn can cause a new scanner creation from that point of last KV from the client side... [Some thing like how currently ClientScanner react to NSRE etc] When the client is not passing any last key(null), there wont be any check in the RegionScanner level and scan will continue as normal like the current way. One down side is that with every scan() call to the RS, need to pass some extra data.. May be if we can distinguish the 1st attempt of call() in the ScannerCallable and the later attempts, we can avoid passing the key for the 1st RS call. Only for the later trials we pass the key... These are some thoughts came to me after going through the code.. Your thoughts...
        Hide
        Todd Lipcon added a comment -

        Hi Anoop. I was thinking something similar to that – but just using a long "cookie" instead of passing the actual keys around. So the conversation between client and server becomes:

        C->S: openScanner(Scan)
        S->C: return scanner ID, cookie = 0
        C->S: next(10, cookie=0)
        S->C: results: ... newCookie=1
        C->S: next(10, cookie=1) [ but times out ]
        S->C: results: ... newCookie = 2 [ but never gets received ]
        C->S: next(10, cookie=1) (retry)
        S->C: NACK: Bad cookie!

        Show
        Todd Lipcon added a comment - Hi Anoop. I was thinking something similar to that – but just using a long "cookie" instead of passing the actual keys around. So the conversation between client and server becomes: C->S: openScanner(Scan) S->C: return scanner ID, cookie = 0 C->S: next(10, cookie=0) S->C: results: ... newCookie=1 C->S: next(10, cookie=1) [ but times out ] S->C: results: ... newCookie = 2 [ but never gets received ] C->S: next(10, cookie=1) (retry) S->C: NACK: Bad cookie!
        Hide
        Anoop Sam John added a comment -

        mm sounds a long cookie is enough.. There might not be a case of more number of request response, btw Client and one region (for one scan), than the long max value..

        Good catch of a long living bug Todd..

        Show
        Anoop Sam John added a comment - mm sounds a long cookie is enough.. There might not be a case of more number of request response, btw Client and one region (for one scan), than the long max value.. Good catch of a long living bug Todd..
        Hide
        Todd Lipcon added a comment -

        Yea, actually an int would probably do - we shouldn't ever have 4B rows in a region. And even if it wrapped, the time it would take to do so would prevent any possible case where an old request got incorrectly allowed.

        Anoop: want to work on this?

        Show
        Todd Lipcon added a comment - Yea, actually an int would probably do - we shouldn't ever have 4B rows in a region. And even if it wrapped, the time it would take to do so would prevent any possible case where an old request got incorrectly allowed. Anoop: want to work on this?
        Hide
        Andrew Purtell added a comment -

        +1 on the cookie approach.

        We've seen the LeaseExceptions. I didn't realize that in some cases the next next() would skip a batch of rows, that's quite bad.

        Show
        Andrew Purtell added a comment - +1 on the cookie approach. We've seen the LeaseExceptions. I didn't realize that in some cases the next next() would skip a batch of rows, that's quite bad.
        Hide
        Todd Lipcon added a comment -

        We've seen the LeaseExceptions. I didn't realize that in some cases the next next() would skip a batch of rows, that's quite bad.

        On a LeaseException I think we do the right thing - the client detects the error at the scanner level and can restart the scanner from the right spot. That's because the lease exception is a DoNotRetryIOException. But other exceptions like RPC timeouts or IO errors would get retried incorrectly with the bug described here.

        Show
        Todd Lipcon added a comment - We've seen the LeaseExceptions. I didn't realize that in some cases the next next() would skip a batch of rows, that's quite bad. On a LeaseException I think we do the right thing - the client detects the error at the scanner level and can restart the scanner from the right spot. That's because the lease exception is a DoNotRetryIOException. But other exceptions like RPC timeouts or IO errors would get retried incorrectly with the bug described here.
        Hide
        Anoop Sam John added a comment -

        Ya I can work on this and try to provide a patch for 94 first..

        Show
        Anoop Sam John added a comment - Ya I can work on this and try to provide a patch for 94 first..
        Hide
        Ted Yu added a comment -

        In the scenario Todd described, can the server keep the most recent batch of results so that when server sees cookie from client carrying 1 less than expected value, the server can resend the stored batch ?

        Show
        Ted Yu added a comment - In the scenario Todd described, can the server keep the most recent batch of results so that when server sees cookie from client carrying 1 less than expected value, the server can resend the stored batch ?
        Hide
        Todd Lipcon added a comment -

        In the scenario Todd described, can the server keep the most recent batch of results so that when server sees cookie from client carrying 1 less than expected value, the server can resend the stored batch ?

        I think it's too expensive in terms of memory consumption on the server. My assertion is that this issue is very rare (or else people probably would have noticed the missing rows a lot more often). So, keeping the last response batch cached server side seems a poor use of RAM.

        Show
        Todd Lipcon added a comment - In the scenario Todd described, can the server keep the most recent batch of results so that when server sees cookie from client carrying 1 less than expected value, the server can resend the stored batch ? I think it's too expensive in terms of memory consumption on the server. My assertion is that this issue is very rare (or else people probably would have noticed the missing rows a lot more often). So, keeping the last response batch cached server side seems a poor use of RAM.
        Hide
        Andrew Purtell added a comment - - edited

        My assertion is that this issue is very rare

        We saw the LeaseExceptions frequently, so perhaps not. Critical priority is correct here I think.

        Show
        Andrew Purtell added a comment - - edited My assertion is that this issue is very rare We saw the LeaseExceptions frequently, so perhaps not. Critical priority is correct here I think.
        Hide
        Todd Lipcon added a comment -

        LeaseExceptions don't cause this issue, though. They are DoNotRetryIOExceptions. It's only IOEs that cause the issue.

        Show
        Todd Lipcon added a comment - LeaseExceptions don't cause this issue, though. They are DoNotRetryIOExceptions. It's only IOEs that cause the issue.
        Hide
        Andrew Purtell added a comment -

        LeaseExceptions don't cause this issue, though.

        Our environment is pretty good for causing IPC issues occasionally.

        Unfortunately I can't comment on the frequency that IOEs caused missing rows, but I can testify to getting complaints about frequent LeaseExceptions for a time.

        Show
        Andrew Purtell added a comment - LeaseExceptions don't cause this issue, though. Our environment is pretty good for causing IPC issues occasionally. Unfortunately I can't comment on the frequency that IOEs caused missing rows, but I can testify to getting complaints about frequent LeaseExceptions for a time.
        Hide
        ramkrishna.s.vasudevan added a comment -

        +1 on cookie based approach.

        Show
        ramkrishna.s.vasudevan added a comment - +1 on cookie based approach.
        Hide
        Ted Yu added a comment -

        Suppose client gets 'Bad cookie' notification from server in the middle of a long scan.
        If the client wants to continue where it left off in a new scanner (I assume this is the default / desriable behavior), the client needs to remember the key of the last KV it previously received and .
        Using cookie saves bandwidth. But in faulty condition, I think the key of last KV would still be needed.

        Show
        Ted Yu added a comment - Suppose client gets 'Bad cookie' notification from server in the middle of a long scan. If the client wants to continue where it left off in a new scanner (I assume this is the default / desriable behavior), the client needs to remember the key of the last KV it previously received and . Using cookie saves bandwidth. But in faulty condition, I think the key of last KV would still be needed.
        Hide
        Anoop Sam John added a comment -

        Suppose client gets 'Bad cookie' notification from server in the middle of a long scan.
        If the client wants to continue where it left off in a new scanner (I assume this is the default / desriable behavior), the client needs to remember the key of the last KV it previously received and .

        Yes. The client knows the last row and based on this it can create a new scanner so as to continue the scan from that point. This is already in place and this what will happen when client gets a NSRE in the middle of the scan from RS.

        But in faulty condition, I think the key of last KV would still be needed.

        Any specific reason in your mind Ted? As such I think a seq numbering is enough...

        Show
        Anoop Sam John added a comment - Suppose client gets 'Bad cookie' notification from server in the middle of a long scan. If the client wants to continue where it left off in a new scanner (I assume this is the default / desriable behavior), the client needs to remember the key of the last KV it previously received and . Yes. The client knows the last row and based on this it can create a new scanner so as to continue the scan from that point. This is already in place and this what will happen when client gets a NSRE in the middle of the scan from RS. But in faulty condition, I think the key of last KV would still be needed. Any specific reason in your mind Ted? As such I think a seq numbering is enough...
        Hide
        Anoop Sam John added a comment -

        I have prepared a patch based on 94. Will upload tonight.

        Show
        Anoop Sam John added a comment - I have prepared a patch based on 94. Will upload tonight.
        Hide
        Anoop Sam John added a comment -

        Trying to attach patch to JIRA but seems some issue with JIRA. I am getting some error dialog box instead of the attach dialog box !!

        Show
        Anoop Sam John added a comment - Trying to attach patch to JIRA but seems some issue with JIRA. I am getting some error dialog box instead of the attach dialog box !!
        Hide
        Anoop Sam John added a comment -

        One point came up when I and Ram discussed
        In ScannerCallable.call()
        When a RemoteException comes, can we mark that as DoNotRetryIOException?

        Any way the call reached the RS and it might have done atleast partial scan.[Moved the currentRow pointer]

        Show
        Anoop Sam John added a comment - One point came up when I and Ram discussed In ScannerCallable.call() When a RemoteException comes, can we mark that as DoNotRetryIOException? Any way the call reached the RS and it might have done atleast partial scan. [Moved the currentRow pointer]
        Hide
        Anoop Sam John added a comment -

        Patch for 94. 1st version.. there are some points to discuss which will be done in following comments

        Show
        Anoop Sam John added a comment - Patch for 94. 1st version.. there are some points to discuss which will be done in following comments
        Hide
        Anoop Sam John added a comment -
        +    if (-1 != cookie && cookie != rswc.cookie) {
        +      this.scanners.remove(scannerName);	
        +      this.leases.cancelLease(scannerName);
        +      throw new ScannerCookieOutOfOrderException("Cookie from client and cookie at server do not match");
        +    }
        

        When the cookie is out of order, we remove the scanner and cancel Lease.[Not waiting for the lease timeout] This is in same line as what happens with NSRE. One question which arise in my mind is do we need to contact the CP hooks of pre and post scanner close here?
        Now in normal close() of scanner from client side and lease timeout based close, we call CP hooks. But what abt the scanner getting removed from RS when the NSRE happens. [Same applicable to ScannerCookieOutOfOrderException case too]

        Show
        Anoop Sam John added a comment - + if (-1 != cookie && cookie != rswc.cookie) { + this .scanners.remove(scannerName); + this .leases.cancelLease(scannerName); + throw new ScannerCookieOutOfOrderException( "Cookie from client and cookie at server do not match" ); + } When the cookie is out of order, we remove the scanner and cancel Lease. [Not waiting for the lease timeout] This is in same line as what happens with NSRE. One question which arise in my mind is do we need to contact the CP hooks of pre and post scanner close here? Now in normal close() of scanner from client side and lease timeout based close, we call CP hooks. But what abt the scanner getting removed from RS when the NSRE happens. [Same applicable to ScannerCookieOutOfOrderException case too]
        Hide
        Jieshan Bean added a comment -

        When a RemoteException comes, can we mark that as DoNotRetryIOException?

        I think it should be marked as DNRI.

        Now in normal close() of scanner from client side and lease timeout based close, we call CP hooks. But what abt the scanner getting removed from RS when the NSRE happens.

        So what's your suggestion, Anoop? call CP hooks in the finally section?

        The patch looks nice. A minor comment:

        RegionScanner scanner = scanners.get(scannerIdString).s;
        

        Is there one chance of "scanners.get(scannerIdString)" is null? any way, the direct access to that private variable is not good.

        Show
        Jieshan Bean added a comment - When a RemoteException comes, can we mark that as DoNotRetryIOException? I think it should be marked as DNRI. Now in normal close() of scanner from client side and lease timeout based close, we call CP hooks. But what abt the scanner getting removed from RS when the NSRE happens. So what's your suggestion, Anoop? call CP hooks in the finally section? The patch looks nice. A minor comment: RegionScanner scanner = scanners.get(scannerIdString).s; Is there one chance of "scanners.get(scannerIdString)" is null? any way, the direct access to that private variable is not good.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Todd/Stack
        Can you review this patch? Seems to be pretty important fix.

        Show
        ramkrishna.s.vasudevan added a comment - @Todd/Stack Can you review this patch? Seems to be pretty important fix.
        Hide
        Todd Lipcon added a comment -
        • style: use 2-space indentation, not tabs
        • why do we need the new RegionScannerWithCookie class? why not add the cookie to RegionScanner itself? I'd also move the cookie check into a function "scanner.checkAndIncrementCookie(cookie);"
        • let's rename "cookie" to "callSequenceNumber" or something – since in this implementation it is a simple increasing sequence.
        • this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server. In order to make it compatible, we'd need to catch the MethodNotExists exception and set a flag in the client which disables cookie usage for that cluster.
        • In the test, I think you should use HRegionInterface directly, so you don't have to actually generate an RPC timeout. As is, I think it's also not guaranteed to trigger the issue unless you set scanner caching to 1, right? I guess that's the default, but if we ever change the default, it would invalidate this test.
        Show
        Todd Lipcon added a comment - style: use 2-space indentation, not tabs why do we need the new RegionScannerWithCookie class? why not add the cookie to RegionScanner itself? I'd also move the cookie check into a function "scanner.checkAndIncrementCookie(cookie);" let's rename "cookie" to "callSequenceNumber" or something – since in this implementation it is a simple increasing sequence. this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server. In order to make it compatible, we'd need to catch the MethodNotExists exception and set a flag in the client which disables cookie usage for that cluster. In the test, I think you should use HRegionInterface directly, so you don't have to actually generate an RPC timeout. As is, I think it's also not guaranteed to trigger the issue unless you set scanner caching to 1, right? I guess that's the default, but if we ever change the default, it would invalidate this test.
        Hide
        Anoop Sam John added a comment -

        Thanks for the review Todd

        why do we need the new RegionScannerWithCookie class? why not add the cookie to RegionScanner itself?

        I was also thinking initially in this way. There are 2 reasons why I have avoided to do the seqNo work within the RegionScanner
        1. In case of the caching>1 there will be more than one call to the RegionScanner.next(). U mean passing the client sent seqNo ( I am avoiding cookie as I agree with you to rename this ) to the RegionScanner which will change the interface. This is exposed
        2. This is the main reason. With the CP usage we have exposed the RegionScanner and using the preScannerOpen() and postScannerOpen() impls user can now return his own RegionScanner impl. If we do this seqNo maintain and check logics in RegionScanner this will make the user to worry abt these? I feel this should be handled by HBase core code. What do u say?

        this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server.

        Agree.. I can fix this

        let's rename "cookie" to "callSequenceNumber"

        Already agreed..

        In the test, I think you should use HRegionInterface directly, so you don't have to actually generate an RPC timeout.

        I thought of an E2E FT case.. Yes as u said the other one also I can write. So what is your recommendation? Should I change?

        As is, I think it's also not guaranteed to trigger the issue unless you set scanner caching to 1, right?

        May be in that case I can explicitly set the caching=1 for this test case. I can do that

        Show
        Anoop Sam John added a comment - Thanks for the review Todd why do we need the new RegionScannerWithCookie class? why not add the cookie to RegionScanner itself? I was also thinking initially in this way. There are 2 reasons why I have avoided to do the seqNo work within the RegionScanner 1. In case of the caching>1 there will be more than one call to the RegionScanner.next(). U mean passing the client sent seqNo ( I am avoiding cookie as I agree with you to rename this ) to the RegionScanner which will change the interface. This is exposed 2. This is the main reason. With the CP usage we have exposed the RegionScanner and using the preScannerOpen() and postScannerOpen() impls user can now return his own RegionScanner impl. If we do this seqNo maintain and check logics in RegionScanner this will make the user to worry abt these? I feel this should be handled by HBase core code. What do u say? this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server. Agree.. I can fix this let's rename "cookie" to "callSequenceNumber" Already agreed.. In the test, I think you should use HRegionInterface directly, so you don't have to actually generate an RPC timeout. I thought of an E2E FT case.. Yes as u said the other one also I can write. So what is your recommendation? Should I change? As is, I think it's also not guaranteed to trigger the issue unless you set scanner caching to 1, right? May be in that case I can explicitly set the caching=1 for this test case. I can do that
        Hide
        Anoop Sam John added a comment -

        Thanks for the review Jieshan

        So what's your suggestion, Anoop? call CP hooks in the finally section?

        I mean in whatever case when we close the scanner we need to call the CP hooks. Currently before this patch we were not doing this when getting a NSRE

        catch (Throwable t) {
          if (t instanceof NotServingRegionException) {
        	this.scanners.remove(scannerName);
          }
          throw convertThrowableToIOE(cleanup(t));
        }
        

        Here we can see it is not calling the CP hooks. As of now in case of the cookie out of order also I am not contacting the CP hooks.

        RegionScanner scanner = scanners.get(scannerIdString).s;

        Oh yes. Thanks for pointing it out. I will fix.. This was not in that direct next() call flow.. That is why I missed..

        Show
        Anoop Sam John added a comment - Thanks for the review Jieshan So what's your suggestion, Anoop? call CP hooks in the finally section? I mean in whatever case when we close the scanner we need to call the CP hooks. Currently before this patch we were not doing this when getting a NSRE catch (Throwable t) { if (t instanceof NotServingRegionException) { this .scanners.remove(scannerName); } throw convertThrowableToIOE(cleanup(t)); } Here we can see it is not calling the CP hooks. As of now in case of the cookie out of order also I am not contacting the CP hooks. RegionScanner scanner = scanners.get(scannerIdString).s; Oh yes. Thanks for pointing it out. I will fix.. This was not in that direct next() call flow.. That is why I missed..
        Hide
        Anoop Sam John added a comment -

        @Todd

        this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server.

        Do we need to keep this way of compatibility also? I thought we need to have like old client and new server. Just wanted to confirm this.

        Show
        Anoop Sam John added a comment - @Todd this isn't currently compatible with 0.94, since a new client wouldn't be able to scan an old server. Do we need to keep this way of compatibility also? I thought we need to have like old client and new server. Just wanted to confirm this.
        Hide
        Anoop Sam John added a comment -

        @Todd
        Regarding the solution for handling the compatibility with client,
        When NoSuchMethodException happens at RS side, the IPC layer throws back the RemoteException in which the actual Exception is stringfied and set as the message. So if we need to handle, we need to do string matching on the Exception message at the client side. Pls give your opinion on this. Or can we mark this issue fix such that when the server side is upgraded the client side also need upgrade..
        Other devs also pls give you suggestion

        I have prepared a patch addressing the other comments and if we decide on the above I can submit the new patch

        Show
        Anoop Sam John added a comment - @Todd Regarding the solution for handling the compatibility with client, When NoSuchMethodException happens at RS side, the IPC layer throws back the RemoteException in which the actual Exception is stringfied and set as the message. So if we need to handle, we need to do string matching on the Exception message at the client side. Pls give your opinion on this. Or can we mark this issue fix such that when the server side is upgraded the client side also need upgrade.. Other devs also pls give you suggestion I have prepared a patch addressing the other comments and if we decide on the above I can submit the new patch
        Show
        ramkrishna.s.vasudevan added a comment - @Todd Also regarding the comment https://issues.apache.org/jira/browse/HBASE-5974?focusedCommentId=13275279&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13275279 What do you feel?
        Hide
        Anoop Sam John added a comment -

        When SocketTimeoutException happens it wont be a RemoteException.

        Show
        Anoop Sam John added a comment - When SocketTimeoutException happens it wont be a RemoteException.
        Hide
        Anoop Sam John added a comment -

        Addressed the comments
        The comment regarding the compatibility at client side, I have fixed with the string check on the RemoteException stacktrace. I am not happy with this way but there is no other way as of now I can find. Pls give your comments

        Show
        Anoop Sam John added a comment - Addressed the comments The comment regarding the compatibility at client side, I have fixed with the string check on the RemoteException stacktrace. I am not happy with this way but there is no other way as of now I can find. Pls give your comments
        Hide
        Ted Yu added a comment -

        Should CallSequenceOutOfOrderException be extending DoNotRetryIOException ?
        That way you don't need to create a new exception below:

        +        } else if (ioe instanceof CallSequenceOutOfOrderException) {
        ...
        +          throw new DoNotRetryIOException("Reset scanner", ioe);
        

        I think users haven't experienced this bug.
        In solving the bug, some kludge is introduced.
        We should think twice before integration.

        Show
        Ted Yu added a comment - Should CallSequenceOutOfOrderException be extending DoNotRetryIOException ? That way you don't need to create a new exception below: + } else if (ioe instanceof CallSequenceOutOfOrderException) { ... + throw new DoNotRetryIOException( "Reset scanner" , ioe); I think users haven't experienced this bug. In solving the bug, some kludge is introduced. We should think twice before integration.
        Hide
        Lars Hofhansl added a comment -

        Last patch looks good to me. The string check on the RemoteException is not ideal, but I cannot think of anything better; somebody with more knowledge about our RPC should chime in.
        Is the RegionScannerHolder needed? Why can't RegionScannerImpl not hold the sequence number and RegionScanner get a get/setSeq method?

        Show
        Lars Hofhansl added a comment - Last patch looks good to me. The string check on the RemoteException is not ideal, but I cannot think of anything better; somebody with more knowledge about our RPC should chime in. Is the RegionScannerHolder needed? Why can't RegionScannerImpl not hold the sequence number and RegionScanner get a get/setSeq method?
        Hide
        Anoop Sam John added a comment -

        Thanks for the review Lars

        Is the RegionScannerHolder needed? Why can't RegionScannerImpl not hold the sequence number and RegionScanner get a get/setSeq method?

        Why I was not doing that because through the CP we have exposed the RegionScanner interface. The pre and post scannerOpen methods in the CP can return an impl for the RegionScanner now. So in that case customer need to worry abt maintain this seqNo? I felt that wont be good. Am I clear to you Lars?

        Yes the string based check, I am not getting any way. When the NoSuchMethodException happens at the RS side, we are creating RemoteException not wrapping this NoSuchMethodException but with a message=exception trace...

        Show
        Anoop Sam John added a comment - Thanks for the review Lars Is the RegionScannerHolder needed? Why can't RegionScannerImpl not hold the sequence number and RegionScanner get a get/setSeq method? Why I was not doing that because through the CP we have exposed the RegionScanner interface. The pre and post scannerOpen methods in the CP can return an impl for the RegionScanner now. So in that case customer need to worry abt maintain this seqNo? I felt that wont be good. Am I clear to you Lars? Yes the string based check, I am not getting any way. When the NoSuchMethodException happens at the RS side, we are creating RemoteException not wrapping this NoSuchMethodException but with a message=exception trace...
        Hide
        Lars Hofhansl added a comment -

        W.r.t. to RegionScannerHolder. I see that makes sense. As long as the sequence numbering also works when the CP intercepts a scanner and provides its own implementation. I don't think there'd be a case where a CP would want to interfere with the sequence numbers.

        Re: the NoSuchMethodException check. I was wondering if there is anything with RPC version numbering to deal with this instead.

        Show
        Lars Hofhansl added a comment - W.r.t. to RegionScannerHolder. I see that makes sense. As long as the sequence numbering also works when the CP intercepts a scanner and provides its own implementation. I don't think there'd be a case where a CP would want to interfere with the sequence numbers. Re: the NoSuchMethodException check. I was wondering if there is anything with RPC version numbering to deal with this instead.
        Hide
        Anoop Sam John added a comment -

        @Lars
        Or is it this kind of a compatibility is not needed? Only upgradation at client side and not at server. I am not sure what kind of compatibility we need to maintain across new minor versions (from 94.0 to 94.1)

        Also see the comments for the issue HBASE-6124. [Similar to our case]
        https://issues.apache.org/jira/browse/HBASE-6124?focusedCommentId=13285008&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13285008

        https://issues.apache.org/jira/browse/HBASE-6124?focusedCommentId=13285016&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13285016

        Show
        Anoop Sam John added a comment - @Lars Or is it this kind of a compatibility is not needed? Only upgradation at client side and not at server. I am not sure what kind of compatibility we need to maintain across new minor versions (from 94.0 to 94.1) Also see the comments for the issue HBASE-6124 . [Similar to our case] https://issues.apache.org/jira/browse/HBASE-6124?focusedCommentId=13285008&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13285008 https://issues.apache.org/jira/browse/HBASE-6124?focusedCommentId=13285016&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13285016
        Hide
        Lars Hofhansl added a comment -

        @Anoop. Compatibility is definitely needed across a minor version change.

        Show
        Lars Hofhansl added a comment - @Anoop. Compatibility is definitely needed across a minor version change.
        Hide
        Anoop Sam John added a comment -

        If the solution is fine with every one I can make patch for other versions also.

        @Ted
        Regarding the new Exception extending DoNotRetryIOException, I was following NSRE. I can make this change. I think it should be ok.

        Show
        Anoop Sam John added a comment - If the solution is fine with every one I can make patch for other versions also. @Ted Regarding the new Exception extending DoNotRetryIOException, I was following NSRE. I can make this change. I think it should be ok.
        Hide
        Lars Hofhansl added a comment -

        +1 for V2

        Show
        Lars Hofhansl added a comment - +1 for V2
        Hide
        ramkrishna.s.vasudevan added a comment -

        +1 from me too.

        Show
        ramkrishna.s.vasudevan added a comment - +1 from me too.
        Hide
        Ted Yu added a comment -

        HRegionInterface.java doesn't exist in trunk so patch v2 wouldn't apply to trunk.
        I would suggest creating patch for trunk and run through hadoop QA.

        +              LOG.info("Seq number based scan API not present at RS side! Trying with API: "
        

        I think the above log should be at warn level.

        +        } else if (ioe instanceof CallSequenceOutOfOrderException) {
        +          // The callSeq from the client not matched with the one expected at the RS side
        +          // This means the RS might have done extra scanning of data which is not received by the
        +          // client.Throw a DNRE so that we close the current scanner and opens a new one with RS.
        +          throw new DoNotRetryIOException("Reset scanner", ioe);
        

        Should we disclose a little more detail in the message of DNRIOE ? The above is the same as response to NotServingRegionException and RegionServerStoppedException.
        'not matched with' -> 'does not match'
        'is not received' -> 'has not been received'
        'opens a new' -> 'open a new'

        +    // if callSeq do not match throw Exception straight away. This needs to be performed even
        

        'do not match' -> 'does not match'

        +public class TestClientScannerRPCTimesout {^M
        

        Please add short javadoc for the test class. I think it should be called TestClientScannerRPCTimeout.
        Please use utility such as dos2unix to remove the trailing ^M from the patch file.

        +  public static class RegionServerWithScanTimesout extends MiniHBaseClusterRegionServer {^M
        

        The above class can be made private. It should be named RegionServerWithScanTimeout.

        + * Thrown by a region server while scan related next() calls. Both client and server maintain a^M
        + * callSequence and if the both do not match, RS will throw this exception.^M
        + */^M
        +public class CallSequenceOutOfOrderException extends IOException {^M
        

        CallSequenceOutOfOrderException should extend DoNotRetryIOException so that we don't need to create DoNotRetryIOException instance (shown above).
        'while scan related next()' -> 'while doing scan related next()'
        'the both do not' -> 'they do not'

        It would be nice for Todd to take a look at the patch.

        Show
        Ted Yu added a comment - HRegionInterface.java doesn't exist in trunk so patch v2 wouldn't apply to trunk. I would suggest creating patch for trunk and run through hadoop QA. + LOG.info( "Seq number based scan API not present at RS side! Trying with API: " I think the above log should be at warn level. + } else if (ioe instanceof CallSequenceOutOfOrderException) { + // The callSeq from the client not matched with the one expected at the RS side + // This means the RS might have done extra scanning of data which is not received by the + // client.Throw a DNRE so that we close the current scanner and opens a new one with RS. + throw new DoNotRetryIOException( "Reset scanner" , ioe); Should we disclose a little more detail in the message of DNRIOE ? The above is the same as response to NotServingRegionException and RegionServerStoppedException. 'not matched with' -> 'does not match' 'is not received' -> 'has not been received' 'opens a new' -> 'open a new' + // if callSeq do not match throw Exception straight away. This needs to be performed even 'do not match' -> 'does not match' + public class TestClientScannerRPCTimesout {^M Please add short javadoc for the test class. I think it should be called TestClientScannerRPCTimeout. Please use utility such as dos2unix to remove the trailing ^M from the patch file. + public static class RegionServerWithScanTimesout extends MiniHBaseClusterRegionServer {^M The above class can be made private. It should be named RegionServerWithScanTimeout. + * Thrown by a region server while scan related next() calls. Both client and server maintain a^M + * callSequence and if the both do not match, RS will throw this exception.^M + */^M + public class CallSequenceOutOfOrderException extends IOException {^M CallSequenceOutOfOrderException should extend DoNotRetryIOException so that we don't need to create DoNotRetryIOException instance (shown above). 'while scan related next()' -> 'while doing scan related next()' 'the both do not' -> 'they do not' It would be nice for Todd to take a look at the patch.
        Hide
        Ted Yu added a comment -

        w.r.t. keeping RegionScannerHolder, I posted a poll to dev@hbase for use case of letting [pre,post]ScannerOpen() return a custom RegionScanner implementation.

        Show
        Ted Yu added a comment - w.r.t. keeping RegionScannerHolder, I posted a poll to dev@hbase for use case of letting [pre,post] ScannerOpen() return a custom RegionScanner implementation.
        Hide
        Anoop Sam John added a comment -

        @Ted
        Will look into your comments and need a rebase for 94 patch too..
        I will make seperate patch fro trunk also.

        wrt pre and post CP hooks, it is not only creating a new custom RegionScanner, but may be creating a wrapper for the actual RegionScanner. In one of our impl, we use this approach. [Just a wrapper which delegates the calls with some extra steps for next() calls. ] Here also if we add new methods to RegionScanner interface which deals with the check and incerement for this seqNo, will get exposed to user. I felt this might look odd for them.

        Show
        Anoop Sam John added a comment - @Ted Will look into your comments and need a rebase for 94 patch too.. I will make seperate patch fro trunk also. wrt pre and post CP hooks, it is not only creating a new custom RegionScanner, but may be creating a wrapper for the actual RegionScanner. In one of our impl, we use this approach. [Just a wrapper which delegates the calls with some extra steps for next() calls. ] Here also if we add new methods to RegionScanner interface which deals with the check and incerement for this seqNo, will get exposed to user. I felt this might look odd for them.
        Hide
        Anoop Sam John added a comment -

        Patch addressing Ted's comments

        The above class can be made private. It should be named RegionServerWithScanTimeout.

        The class name is changed. But we can not make this private. If so RS impl class can not get instantiated.

        Show
        Anoop Sam John added a comment - Patch addressing Ted's comments The above class can be made private. It should be named RegionServerWithScanTimeout. The class name is changed. But we can not make this private. If so RS impl class can not get instantiated.
        Hide
        Anoop Sam John added a comment -

        I will give patch fro trunk tomorrow.

        Show
        Anoop Sam John added a comment - I will give patch fro trunk tomorrow.
        Hide
        Ted Yu added a comment -

        Patch v4 makes a small change to JVMClusterUtil.java so that RegionServerWithScanTimeout can be made private.

        TestClientScannerRPCTimeout passes.

        Show
        Ted Yu added a comment - Patch v4 makes a small change to JVMClusterUtil.java so that RegionServerWithScanTimeout can be made private. TestClientScannerRPCTimeout passes.
        Hide
        Ted Yu added a comment - - edited

        w.r.t. potential change to RegionScanner, if users create wrapper(s), the maintenance of seqNo would still be completed by core implementation.
        See the following in patch:

        +  public Result[] next(final long scannerId, int nbRows) throws IOException {
        +    return next(scannerId, nbRows, -1);
        +  }
        

        and the following in current code base:

        @InterfaceAudience.Private
        public interface RegionScanner extends InternalScanner {
        
        Show
        Ted Yu added a comment - - edited w.r.t. potential change to RegionScanner, if users create wrapper(s), the maintenance of seqNo would still be completed by core implementation. See the following in patch: + public Result[] next( final long scannerId, int nbRows) throws IOException { + return next(scannerId, nbRows, -1); + } and the following in current code base: @InterfaceAudience.Private public interface RegionScanner extends InternalScanner {
        Hide
        Anoop Sam John added a comment -

        @Ted
        Pls see the below comment from Andrew Purtell
        https://issues.apache.org/jira/browse/HBASE-5517?focusedCommentId=13225624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13225624

        I think we need to change the RegionScanner from private to public then.

        Show
        Anoop Sam John added a comment - @Ted Pls see the below comment from Andrew Purtell https://issues.apache.org/jira/browse/HBASE-5517?focusedCommentId=13225624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13225624 I think we need to change the RegionScanner from private to public then.
        Hide
        Anoop Sam John added a comment -

        Patch for trunk.

        Show
        Anoop Sam John added a comment - Patch for trunk.
        Hide
        Anoop Sam John added a comment -

        In Trunk InterfaceAudience for RegionScanner to be public I feel ( as per the use case that we are having. We wants to impl this interface wrapping the actual RegionScanner object created by the Region ).. Not made this change in the attached patch. Pls provide suggestions

        Show
        Anoop Sam John added a comment - In Trunk InterfaceAudience for RegionScanner to be public I feel ( as per the use case that we are having. We wants to impl this interface wrapping the actual RegionScanner object created by the Region ).. Not made this change in the attached patch. Pls provide suggestions
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12530781/5974_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestAtomicOperation

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12530781/5974_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestAtomicOperation Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2097//console This message is automatically generated.
        Hide
        Ted Yu added a comment -

        I would listen to Todd and Andy's opinion.

        Show
        Ted Yu added a comment - I would listen to Todd and Andy's opinion.
        Hide
        Anoop Sam John added a comment -

        @Andrew and @Todd
        Can you have a look at this pls? Also regarding the point of RegionScanner exposing through CP and its use cases, can you give your opinion?

        Show
        Anoop Sam John added a comment - @Andrew and @Todd Can you have a look at this pls? Also regarding the point of RegionScanner exposing through CP and its use cases, can you give your opinion?
        Hide
        Todd Lipcon added a comment -

        I don't think we should be trying to lock down internal APIs like RegionScanner. Coprocessors are an advanced interface and I think we should not endeavour to provide full compatibility between versions for them. We'll end up getting bogged down way too fast trying to keep this compatibility everywhere.

        So, I'm still in favor of just adding a new field/method to the existing RegionScanner implementation.

        As for the compatibility code: I think the useCallSeq flag has to be at a wider scope than this. Having it in ScannerCallable means that each individual scan will print a warning (and have to do an extra round trip), which is too much.

        Show
        Todd Lipcon added a comment - I don't think we should be trying to lock down internal APIs like RegionScanner. Coprocessors are an advanced interface and I think we should not endeavour to provide full compatibility between versions for them. We'll end up getting bogged down way too fast trying to keep this compatibility everywhere. So, I'm still in favor of just adding a new field/method to the existing RegionScanner implementation. As for the compatibility code: I think the useCallSeq flag has to be at a wider scope than this. Having it in ScannerCallable means that each individual scan will print a warning (and have to do an extra round trip), which is too much.
        Hide
        Anoop Sam John added a comment -

        As for the compatibility code: I think the useCallSeq flag has to be at a wider scope than this. Having it in ScannerCallable means that each individual scan will print a warning (and have to do an extra round trip), which is too much.

        I totally agree with you Todd. I thought about this while making the patch. I couldn't find an apt place for saving this info and making use. This is scan specific so should be some where in the scan code area I felt. Also my worry was can it be like in one cluster itself one RS is with new code and another with old?
        Any Suggestions Todd.. As you said, extra round trip is a worry..

        I don't think we should be trying to lock down internal APIs like RegionScanner. Coprocessors are an advanced interface and I think we should not endeavour to provide full compatibility between versions for them.

        Rather than the compatability at interface level my worry is RegionScanner handling this seq maintain and checking job.
        In our scenario we wanted custom behaviour in RegionScanner. So what we have done is created a new RegionScanner impl wrapping the actual RegionScanner object created by the HRegion. Our custom impl class implements all the methods and for the actual work we delegate the call. The extra steps we wanted to do, we are doing before this delegation.
        Now we need to add a new method in the RegionScanner [like checkAndIncrement(seqNo)] which should be called by HRS.. Now in our usecase, we need to implement this and may be delegate the call. Making this very much internal item exposed to user was my worry. Hope I make it clear. We dont have any usecase like making full custom RegionScanner impl (without even the wrapped object) now. I dont know someone might really need some thing like that.[But at least the CP allows to do so now] In such a thing this impl of seqNo check need to be handled by that code itself.

        Show
        Anoop Sam John added a comment - As for the compatibility code: I think the useCallSeq flag has to be at a wider scope than this. Having it in ScannerCallable means that each individual scan will print a warning (and have to do an extra round trip), which is too much. I totally agree with you Todd. I thought about this while making the patch. I couldn't find an apt place for saving this info and making use. This is scan specific so should be some where in the scan code area I felt. Also my worry was can it be like in one cluster itself one RS is with new code and another with old? Any Suggestions Todd.. As you said, extra round trip is a worry.. I don't think we should be trying to lock down internal APIs like RegionScanner. Coprocessors are an advanced interface and I think we should not endeavour to provide full compatibility between versions for them. Rather than the compatability at interface level my worry is RegionScanner handling this seq maintain and checking job. In our scenario we wanted custom behaviour in RegionScanner. So what we have done is created a new RegionScanner impl wrapping the actual RegionScanner object created by the HRegion. Our custom impl class implements all the methods and for the actual work we delegate the call. The extra steps we wanted to do, we are doing before this delegation. Now we need to add a new method in the RegionScanner [like checkAndIncrement(seqNo)] which should be called by HRS.. Now in our usecase, we need to implement this and may be delegate the call. Making this very much internal item exposed to user was my worry. Hope I make it clear. We dont have any usecase like making full custom RegionScanner impl (without even the wrapped object) now. I dont know someone might really need some thing like that. [But at least the CP allows to do so now] In such a thing this impl of seqNo check need to be handled by that code itself.
        Hide
        Anoop Sam John added a comment -

        In Trunk patch
        HRegionServer

        
        +          // if callSeq does not match throw Exception straight away. This needs to be performed
        +          // even before checking of Lease.
        +          if (rsh == null) {
        +            rsh = scanners.get(scannerName);
        +          }
        +          if (rsh != null) {
        +            if (request.getCallSeq() != rsh.callSeq) {
        +              throw new CallSequenceOutOfOrderException("Expected seq: " + rsh.callSeq
        +                  + " But the seq got from client: " + request.getCallSeq());
        +            }
        +            // Increment the callSeq value which is the next expected from client.
        +            rsh.callSeq++;
        +          }
        

        Here we need to check whether the callSeq is present in the ScanRequest or not. If it is present then only we can do the comparison. I have noticed this and will a patch V2 on Trunk with this change soon

        Show
        Anoop Sam John added a comment - In Trunk patch HRegionServer + // if callSeq does not match throw Exception straight away. This needs to be performed + // even before checking of Lease. + if (rsh == null ) { + rsh = scanners.get(scannerName); + } + if (rsh != null ) { + if (request.getCallSeq() != rsh.callSeq) { + throw new CallSequenceOutOfOrderException( "Expected seq: " + rsh.callSeq + + " But the seq got from client: " + request.getCallSeq()); + } + // Increment the callSeq value which is the next expected from client. + rsh.callSeq++; + } Here we need to check whether the callSeq is present in the ScanRequest or not. If it is present then only we can do the comparison. I have noticed this and will a patch V2 on Trunk with this change soon
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12531221/5974_trunk-V2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestAtomicOperation
        org.apache.hadoop.hbase.master.TestMasterNoCluster

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12531221/5974_trunk-V2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestAtomicOperation org.apache.hadoop.hbase.master.TestMasterNoCluster Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2120//console This message is automatically generated.
        Hide
        Anoop Sam John added a comment -

        2 test classes need change
        TestAssignmentManager#processServerShutdownHandler()
        TestMetaReaderEditorNoCluster#testRideOverServerNotRunning ()

        I will give a new patch with these test case changes needed as per the patch soon.

        Show
        Anoop Sam John added a comment - 2 test classes need change TestAssignmentManager#processServerShutdownHandler() TestMetaReaderEditorNoCluster#testRideOverServerNotRunning () I will give a new patch with these test case changes needed as per the patch soon.
        Hide
        Lars Hofhansl added a comment -

        Where are we with this? Should I hold off 0.94.1 for this?

        Show
        Lars Hofhansl added a comment - Where are we with this? Should I hold off 0.94.1 for this?
        Hide
        Anoop Sam John added a comment -

        There were some concerns over where to handle the seq no. [at HRS or at RegionScannerImpl]
        I had added my reasoning ... Any thoughts?
        Request opinion from Andrew also.

        Show
        Anoop Sam John added a comment - There were some concerns over where to handle the seq no. [at HRS or at RegionScannerImpl] I had added my reasoning ... Any thoughts? Request opinion from Andrew also.
        Hide
        Lars Hofhansl added a comment -

        Let's move this to 0.94.2. We had this behaviour since the beginning.

        Show
        Lars Hofhansl added a comment - Let's move this to 0.94.2. We had this behaviour since the beginning.
        Hide
        Lars Hofhansl added a comment -

        Moving out to 0.94.3 (unless there is some movement soon)

        Show
        Lars Hofhansl added a comment - Moving out to 0.94.3 (unless there is some movement soon)
        Hide
        Lars Hofhansl added a comment -

        Let's do this correctly in 0.96 (where it is OK to break wire compatibility).
        Removing this from 0.94.

        This means we can pass the seqno as a proper field in the Scan object.

        Show
        Lars Hofhansl added a comment - Let's do this correctly in 0.96 (where it is OK to break wire compatibility). Removing this from 0.94. This means we can pass the seqno as a proper field in the Scan object.
        Hide
        stack added a comment -

        Thanks for reminding me about this important issue Anoop. Lets get it into trunk at least (Should be a blocker on 0.96).

        Minor: Do you foresee the sequencing being used other than in scanners (Its implemented in ScannerCallable only at moment)? If NOT, would OutOfOrderNextException or OutOfOrderScannerNextException be a better name than CallSequenceOutOfOrderException? If you do this, would suggest changing param names from callSeq to nextSeq? The former is generic.

        Is CallSequenceOutOfOrderException an exception that will bubble up into the application or is the client its intended audience? If so, should we annotate it @InterfaceAudience.Private as Abortable is for instance?

        +              if ((cause == null || (!(cause instanceof NotServingRegionException)
        +                  && !(cause instanceof RegionServerStoppedException)))
        +                  && !(e instanceof CallSequenceOutOfOrderException)) {
        

        Would it make sense testing ! DoNotRetryIOException rather than calling out the two exceptions above? Would that be too broad? Otherwise, wondering if these two exceptions should inherit from a common base class given they are getting this special treatment. Not important. Just a thought.

        Check your comments. You seem to be saying 'scan' when you mean 'next'.

        For example, this comment:

        +            // increment the callSeq which will be getting used for the next time scan() call to
        +            // the RS.In case of a timeout this increment should not happen so that the next
        +            // trial also will be done with the same callSeq.
        

        Regards the above comment, is there a sentence missing on what happens if we go back w/ the same callseq? Maybe have another go at this whole comment... it could be a bit clearer. Thanks.

        Committing this set of comments.... more to come.

        Show
        stack added a comment - Thanks for reminding me about this important issue Anoop. Lets get it into trunk at least (Should be a blocker on 0.96). Minor: Do you foresee the sequencing being used other than in scanners (Its implemented in ScannerCallable only at moment)? If NOT, would OutOfOrderNextException or OutOfOrderScannerNextException be a better name than CallSequenceOutOfOrderException? If you do this, would suggest changing param names from callSeq to nextSeq? The former is generic. Is CallSequenceOutOfOrderException an exception that will bubble up into the application or is the client its intended audience? If so, should we annotate it @InterfaceAudience.Private as Abortable is for instance? + if ((cause == null || (!(cause instanceof NotServingRegionException) + && !(cause instanceof RegionServerStoppedException))) + && !(e instanceof CallSequenceOutOfOrderException)) { Would it make sense testing ! DoNotRetryIOException rather than calling out the two exceptions above? Would that be too broad? Otherwise, wondering if these two exceptions should inherit from a common base class given they are getting this special treatment. Not important. Just a thought. Check your comments. You seem to be saying 'scan' when you mean 'next'. For example, this comment: + // increment the callSeq which will be getting used for the next time scan() call to + // the RS.In case of a timeout this increment should not happen so that the next + // trial also will be done with the same callSeq. Regards the above comment, is there a sentence missing on what happens if we go back w/ the same callseq? Maybe have another go at this whole comment... it could be a bit clearer. Thanks. Committing this set of comments.... more to come.
        Hide
        stack added a comment -

        Like Lars and Todd above I almost said why RegionScannerHolder? Is it because if you add this class, you can do the seqid management in the regionserver and CPs do not have to be concerned with its management? That is fair enough and this is pretty elegant soln to the issue.

        Your worry is that a CP might return its own RegionScanner implementation. Can you not add to RegionScanner to methods, getNextSeq, and incrementNextSeq? Then the regionserver would have the means for keeping up the seqid in whatever the implementation rather than introduce this wrapper class? RegionScanner is marked as private. For the trunk patch, we are requiring a restart and allowing that CPs may need modification to make the leap from 0.94 to 0.96. Does that make it easier on you modifying RegionScanner to support nextSeq?

        Yeah, I think that rather than:

        +  optional uint64 callSeq = 6;
        

        it should be called nextSeq or nextCallSeq (the latter might be better because nextSeq might make you think the next sequence id rather than the sequence id for the 'next' call). All references should be changed if you change this I"d say Anoop.

        This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class? Do we not have a scanner test already that has a running minicluster? I can understand though if you want to keep this separate because its hard to integrate into a current test suite. If you are going to have a class for this single test and are going to go to the bother of spinning up a whole minicluster, I'd say do a bit more testing. Are there other scenarios you can manufacture? Can you add assert that you actually slept?

        Good stuff Anoop. Lets get this important patch in.

        Show
        stack added a comment - Like Lars and Todd above I almost said why RegionScannerHolder? Is it because if you add this class, you can do the seqid management in the regionserver and CPs do not have to be concerned with its management? That is fair enough and this is pretty elegant soln to the issue. Your worry is that a CP might return its own RegionScanner implementation. Can you not add to RegionScanner to methods, getNextSeq, and incrementNextSeq? Then the regionserver would have the means for keeping up the seqid in whatever the implementation rather than introduce this wrapper class? RegionScanner is marked as private. For the trunk patch, we are requiring a restart and allowing that CPs may need modification to make the leap from 0.94 to 0.96. Does that make it easier on you modifying RegionScanner to support nextSeq? Yeah, I think that rather than: + optional uint64 callSeq = 6; it should be called nextSeq or nextCallSeq (the latter might be better because nextSeq might make you think the next sequence id rather than the sequence id for the 'next' call). All references should be changed if you change this I"d say Anoop. This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class? Do we not have a scanner test already that has a running minicluster? I can understand though if you want to keep this separate because its hard to integrate into a current test suite. If you are going to have a class for this single test and are going to go to the bother of spinning up a whole minicluster, I'd say do a bit more testing. Are there other scenarios you can manufacture? Can you add assert that you actually slept? Good stuff Anoop. Lets get this important patch in.
        Hide
        Anoop Sam John added a comment -

        Stack
        Thanks for your review.. I will work on this today
        It is been some time that I have not seen this patch Let me get a look into this

        Show
        Anoop Sam John added a comment - Stack Thanks for your review.. I will work on this today It is been some time that I have not seen this patch Let me get a look into this
        Hide
        Anoop Sam John added a comment -

        Do you foresee the sequencing being used other than in scanners (Its implemented in ScannerCallable only at moment)? If NOT, would OutOfOrderNextException or OutOfOrderScannerNextException be a better name than CallSequenceOutOfOrderException? If you do this, would suggest changing param names from callSeq to nextSeq? The former is generic.

        Issues in which a retry from client as a result of timeout can come in other client API like increment() also I guess. It can increment 2 times or may be more.. Here in this particular case the names you suggested better than mine.. Sorry I am not good at naming

        Is CallSequenceOutOfOrderException an exception that will bubble up into the application or is the client its intended audience? If so, should we annotate it @InterfaceAudience.Private as Abortable is for instance?

        Yes you are correct. I will change

        Would it make sense testing ! DoNotRetryIOException rather than calling out the two exceptions above? Would that be too broad? Otherwise, wondering if these two exceptions should inherit from a common base class given they are getting this special treatment. Not important. Just a thought.

        Here 3 exceptions getting the special treatment. !DoNotRetryIOException I can try.. Didnt want to change this key area..But at the time of this patch preperation I felt like why this is been written this way. I had to add my new Exception out here to make thing work.. May be will keep the code the current way as in the patch. Will open up a new jira where we can discuss about making a common base class or other changes. I think your suggestion makes sense for maintainability

        Check your comments. You seem to be saying 'scan' when you mean 'next'.

        Yes I need do changes here. Will do

        Will continue in next comment....

        Show
        Anoop Sam John added a comment - Do you foresee the sequencing being used other than in scanners (Its implemented in ScannerCallable only at moment)? If NOT, would OutOfOrderNextException or OutOfOrderScannerNextException be a better name than CallSequenceOutOfOrderException? If you do this, would suggest changing param names from callSeq to nextSeq? The former is generic. Issues in which a retry from client as a result of timeout can come in other client API like increment() also I guess. It can increment 2 times or may be more.. Here in this particular case the names you suggested better than mine.. Sorry I am not good at naming Is CallSequenceOutOfOrderException an exception that will bubble up into the application or is the client its intended audience? If so, should we annotate it @InterfaceAudience.Private as Abortable is for instance? Yes you are correct. I will change Would it make sense testing ! DoNotRetryIOException rather than calling out the two exceptions above? Would that be too broad? Otherwise, wondering if these two exceptions should inherit from a common base class given they are getting this special treatment. Not important. Just a thought. Here 3 exceptions getting the special treatment. !DoNotRetryIOException I can try.. Didnt want to change this key area..But at the time of this patch preperation I felt like why this is been written this way. I had to add my new Exception out here to make thing work.. May be will keep the code the current way as in the patch. Will open up a new jira where we can discuss about making a common base class or other changes. I think your suggestion makes sense for maintainability Check your comments. You seem to be saying 'scan' when you mean 'next'. Yes I need do changes here. Will do Will continue in next comment....
        Hide
        Anoop Sam John added a comment -

        Regarding RegionScannerHolder
        Yes Stack my worry was wrt CP.. In our usage also we are returning a RegionScanner from the CP. This is a wrapper over the actual RegionScannerImpl. Yes by adding the methods we can make sure that the logic is being executed by the HRS but the seqno# maintain should be within the RegionScanner. In our case we can just delegate these 2 calls to the RegionScannerImpl created by kernel. My worry was this exposes this seq stuff to the user. Ideally the seq maintain etc is our problem and to solve we introduce this stuff.. That is why I was/am in 2 minds.. If you all say that is fine, I can change.. We need to clearly document these 2 methods with its importance and what to do wrt this etc.. Considering all these my mind is more hanging towards the current way..

        optional uint64 callSeq = 6;

        it should be called nextSeq or nextCallSeq.... All references should be changed if you change this I"d say Anoop

        Sure I will change.

        This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class?

        Let me see where I can fit this test case. Yes agree with you regarding the overhead of minicluster startup.

        Also this fix is already running in our cluster.

        Show
        Anoop Sam John added a comment - Regarding RegionScannerHolder Yes Stack my worry was wrt CP.. In our usage also we are returning a RegionScanner from the CP. This is a wrapper over the actual RegionScannerImpl. Yes by adding the methods we can make sure that the logic is being executed by the HRS but the seqno# maintain should be within the RegionScanner. In our case we can just delegate these 2 calls to the RegionScannerImpl created by kernel. My worry was this exposes this seq stuff to the user. Ideally the seq maintain etc is our problem and to solve we introduce this stuff.. That is why I was/am in 2 minds.. If you all say that is fine, I can change.. We need to clearly document these 2 methods with its importance and what to do wrt this etc.. Considering all these my mind is more hanging towards the current way.. optional uint64 callSeq = 6; it should be called nextSeq or nextCallSeq.... All references should be changed if you change this I"d say Anoop Sure I will change. This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class? Let me see where I can fit this test case. Yes agree with you regarding the overhead of minicluster startup. Also this fix is already running in our cluster.
        Hide
        stack added a comment -

        If you all say that is fine, I can change.. We need to clearly document these 2 methods with its importance and what to do wrt this etc.

        The class is marked private and used serverside only so its ok adding methods to the Interface?

        But if you feel strongly about it, keep it as is. it looks a bit ugly but at least the uglyness is contained inside HRegionServer.

        Make a new patch and lets get it committed. Thanks Anoop.

        Show
        stack added a comment - If you all say that is fine, I can change.. We need to clearly document these 2 methods with its importance and what to do wrt this etc. The class is marked private and used serverside only so its ok adding methods to the Interface? But if you feel strongly about it, keep it as is. it looks a bit ugly but at least the uglyness is contained inside HRegionServer. Make a new patch and lets get it committed. Thanks Anoop.
        Hide
        Anoop Sam John added a comment -

        This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class?

        Stack the reason why I had done it is we need to have an impl of the HRS where I can introduce some sleep so as to make the socket timeouts at the client side.
        conf.setStrings(HConstants.REGION_SERVER_IMPL, RegionServerWithScanTimeout.class.getName());
        Didn't want to make this change in any of the existing test cases. So is it reasonable to be have the new test class?

        Show
        Anoop Sam John added a comment - This test, TestClientScannerRPCTimeout, looks good. Does it have to be in a separate class? Stack the reason why I had done it is we need to have an impl of the HRS where I can introduce some sleep so as to make the socket timeouts at the client side. conf.setStrings(HConstants.REGION_SERVER_IMPL, RegionServerWithScanTimeout.class.getName()); Didn't want to make this change in any of the existing test cases. So is it reasonable to be have the new test class?
        Hide
        stack added a comment -

        Anoop Sam John You going to make a final patch Anoop? Lets get this in (Thanks).

        Show
        stack added a comment - Anoop Sam John You going to make a final patch Anoop? Lets get this in (Thanks).
        Hide
        Anoop Sam John added a comment -

        Yes Stack... I am working on this now.. Will give it today..
        Sorry for the delay..

        Show
        Anoop Sam John added a comment - Yes Stack... I am working on this now.. Will give it today.. Sorry for the delay..
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12550776/5974_trunk-V3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        -1 javadoc. The javadoc tool appears to have generated 82 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12550776/5974_trunk-V3.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 3 new or modified tests. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. -1 javadoc . The javadoc tool appears to have generated 82 warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. -1 findbugs . The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3146//console This message is automatically generated.
        Hide
        stack added a comment -

        Committed to trunk. Thanks for keeping at it Anoop.

        Show
        stack added a comment - Committed to trunk. Thanks for keeping at it Anoop.
        Hide
        Lars Hofhansl added a comment -

        This is not wire compatible, right? (if it would be I'd pull it back into 0.94)

        Show
        Lars Hofhansl added a comment - This is not wire compatible, right? (if it would be I'd pull it back into 0.94)
        Hide
        stack added a comment -

        Lars Hofhansl Correct. Adds a sequenceid on each next invocation via pb.

        Show
        stack added a comment - Lars Hofhansl Correct. Adds a sequenceid on each next invocation via pb.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #3486 (See https://builds.apache.org/job/HBase-TRUNK/3486/)
        HBASE-5974 Scanner retry behavior with RPC timeout on next() seems incorrect (Revision 1402214)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/OutOfOrderScannerNextException.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        • /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #3486 (See https://builds.apache.org/job/HBase-TRUNK/3486/ ) HBASE-5974 Scanner retry behavior with RPC timeout on next() seems incorrect (Revision 1402214) Result = FAILURE stack : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/OutOfOrderScannerNextException.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java /hbase/trunk/hbase-server/src/main/protobuf/Client.proto /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #236 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/236/)
        HBASE-5974 Scanner retry behavior with RPC timeout on next() seems incorrect (Revision 1402214)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/OutOfOrderScannerNextException.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        • /hbase/trunk/hbase-server/src/main/protobuf/Client.proto
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #236 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/236/ ) HBASE-5974 Scanner retry behavior with RPC timeout on next() seems incorrect (Revision 1402214) Result = FAILURE stack : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/OutOfOrderScannerNextException.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ScannerCallable.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java /hbase/trunk/hbase-server/src/main/protobuf/Client.proto /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Hide
        chunhui shen added a comment -

        After the patch, if a next request is always timeout because too large, we will retry forever now, is it so?

        Without the patch, we do the retry in ServerCallable#ServerCallable, but now we do the retry in ClientScanner#next which is no retry number limit

        Show
        chunhui shen added a comment - After the patch, if a next request is always timeout because too large, we will retry forever now, is it so? Without the patch, we do the retry in ServerCallable#ServerCallable, but now we do the retry in ClientScanner#next which is no retry number limit
        Hide
        Anoop Sam John added a comment -

        Chunhui
        Not really.. I tested once more with change in test code so as to sleep all the times.
        Got the exception at client side

        org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:

        Show
        Anoop Sam John added a comment - Chunhui Not really.. I tested once more with change in test code so as to sleep all the times. Got the exception at client side org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
        Hide
        chunhui shen added a comment -

        I tested once more with change in test code so as to sleep all the times.

        Anoop, if change as so, you can't produce my mentioned problem.

        Do the following change in the test case TestClientScannerRPCTimeout, you could find we will retry forever

        @Override
            public Result[] next(long scannerId, int nbRows, long callSeq) throws IOException {
              if (!slept && this.tableScannerId == scannerId && seqNoToSleepOn == callSeq) {
                throw new CallSequenceOutOfOrderException("Test");
                // try {
                // Thread.sleep(rpcTimeout + 500);
                // } catch (InterruptedException e) {
                // }
                // slept = true;
              }
              return super.next(scannerId, nbRows, callSeq);
            }
        

        Wait for your answer, thanks

        Show
        chunhui shen added a comment - I tested once more with change in test code so as to sleep all the times. Anoop, if change as so, you can't produce my mentioned problem. Do the following change in the test case TestClientScannerRPCTimeout, you could find we will retry forever @Override public Result[] next( long scannerId, int nbRows, long callSeq) throws IOException { if (!slept && this .tableScannerId == scannerId && seqNoToSleepOn == callSeq) { throw new CallSequenceOutOfOrderException( "Test" ); // try { // Thread .sleep(rpcTimeout + 500); // } catch (InterruptedException e) { // } // slept = true ; } return super .next(scannerId, nbRows, callSeq); } Wait for your answer, thanks
        Hide
        chunhui shen added a comment -

        The process is as the following:
        1.A next request is very large, so first time it is failed because of timeout
        2.The second time it is failed again because of CallSequenceOutOfOrderException
        3.We will retry this next request again from the init startrow, so it is also very large and failed again
        4.Repeated the above forever...

        Show
        chunhui shen added a comment - The process is as the following: 1.A next request is very large, so first time it is failed because of timeout 2.The second time it is failed again because of CallSequenceOutOfOrderException 3.We will retry this next request again from the init startrow, so it is also very large and failed again 4.Repeated the above forever...
        Hide
        Anoop Sam John added a comment -

        Chunhui
        Your above mentioned steps for that the above test case change is correct?
        I think for that teh change to be

        public Result[] next(long scannerId, int nbRows, long callSeq) throws IOException {
              if ( this.tableScannerId == scannerId ) {
                try {
                 Thread.sleep(rpcTimeout + 500);
                 } catch (InterruptedException e) {
                 }
                
              }
              return super.next(scannerId, nbRows, callSeq);
            }
        
        Show
        Anoop Sam John added a comment - Chunhui Your above mentioned steps for that the above test case change is correct? I think for that teh change to be public Result[] next( long scannerId, int nbRows, long callSeq) throws IOException { if ( this .tableScannerId == scannerId ) { try { Thread .sleep(rpcTimeout + 500); } catch (InterruptedException e) { } } return super .next(scannerId, nbRows, callSeq); }
        Hide
        Anoop Sam John added a comment -

        Your change was always throwing CallSequenceOutOfOrderException so that at client never the timeout happens. This will happen ideally??
        As in your steps, #1 and #2 is correct. Then again the new call on the new Scanner it wont throw CallSequenceOutOfOrderException any way. But it can either be successfull or timed out at server. Why my change is doing is make it time out at client side always...

        Show
        Anoop Sam John added a comment - Your change was always throwing CallSequenceOutOfOrderException so that at client never the timeout happens. This will happen ideally?? As in your steps, #1 and #2 is correct. Then again the new call on the new Scanner it wont throw CallSequenceOutOfOrderException any way. But it can either be successfull or timed out at server. Why my change is doing is make it time out at client side always...
        Hide
        chunhui shen added a comment -

        Then again the new call on the new Scanner it wont throw CallSequenceOutOfOrderException any way, But it can either be successfull or timed out at server

        Yes,if it timed out again because this request is too large, in the next retry we will throw CallSequenceOutOfOrderException , is it so?

        Show
        chunhui shen added a comment - Then again the new call on the new Scanner it wont throw CallSequenceOutOfOrderException any way, But it can either be successfull or timed out at server Yes,if it timed out again because this request is too large, in the next retry we will throw CallSequenceOutOfOrderException , is it so?
        Hide
        Anoop Sam John added a comment -

        I think I have done wrong in the test code change.. Yes your point is valid.. (checked with functional way ) It behaves like NSRE which is also like infinite retry...
        Will open up a new JIRA and will solve the issue?

        Show
        Anoop Sam John added a comment - I think I have done wrong in the test code change.. Yes your point is valid.. (checked with functional way ) It behaves like NSRE which is also like infinite retry... Will open up a new JIRA and will solve the issue?
        Hide
        chunhui shen added a comment -

        @Annop

        I open a new issue HBASE-7070 and upload a simple patch, could you tak a view?

        Show
        chunhui shen added a comment - @Annop I open a new issue HBASE-7070 and upload a simple patch, could you tak a view?
        Hide
        Ted Yu added a comment -

        On 01/Jul/12 08:25, Anoop said:

        I had added my reasoning ... Any thoughts?

        There was a long period of inactivity till 18/Oct/12 when Stack picked this up.

        I don't see a second +1 since 18/Oct/12.

        Maybe I missed something.

        Show
        Ted Yu added a comment - On 01/Jul/12 08:25, Anoop said: I had added my reasoning ... Any thoughts? There was a long period of inactivity till 18/Oct/12 when Stack picked this up. I don't see a second +1 since 18/Oct/12. Maybe I missed something.
        Hide
        stack added a comment -

        Ted Yu The above makes no sense. I think you are trying to find a violation of commit protocol but there is none. Go reread what we all agreed.

        Show
        stack added a comment - Ted Yu The above makes no sense. I think you are trying to find a violation of commit protocol but there is none. Go reread what we all agreed.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #248 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/248/)
        HBASE-7070 Scanner may retry forever after HBASE-5974 (Revision 1405107)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #248 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/248/ ) HBASE-7070 Scanner may retry forever after HBASE-5974 (Revision 1405107) Result = FAILURE stack : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientScannerRPCTimeout.java
        Hide
        stack added a comment -

        Marking closed.

        Show
        stack added a comment - Marking closed.

          People

          • Assignee:
            Anoop Sam John
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            21 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development