HBase / HBASE-2481

Client is not getting UnknownScannerExceptions; they are being eaten

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.4
    • Fix Version/s: 0.20.4, 0.90.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This was reported by mudphone on IRC and confirmed by myself in a quick test. If the client takes too long going back to the RS, the RS will throw an UnknownScannerException but it doesn't get back to the client. Instead, the client scan silently ends. Marking this a blocker. It's actually in 0.20.4; that's what I was testing. Mayhaps an RC sinker?
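
      To make the symptom concrete, here is a minimal client loop of the kind affected. This is only a sketch, assuming the 0.20 client API used in the repro script further down; the class name, table name, and sleep length are illustrative, not taken from the report.

      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class SlowScan {
        public static void main(String[] args) throws Exception {
          HTable table = new HTable(new HBaseConfiguration(), "TestTable");
          ResultScanner scanner = table.getScanner(new Scan());
          int count = 0;
          Result result;
          while ((result = scanner.next()) != null) {
            System.out.println(Bytes.toString(result.getRow()));
            count++;
            if (count == 100) {
              // Pause past the 60 second scanner lease so the region server
              // expires the scanner and throws UnknownScannerException on
              // the next call.
              Thread.sleep(70 * 1000);
            }
          }
          scanner.close();
          // With the bug: the loop exits quietly here after ~100 rows even
          // though the table holds many more. Expected: scanner.next() throws
          // a ScannerTimeoutException caused by the UnknownScannerException.
        }
      }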

      Attachments

      1. HBASE-2481.patch
        5 kB
        Jean-Daniel Cryans
      2. scannertimeouttest.txt
        3 kB
        stack

        Activity

        Hudson added a comment -

        Integrated in HBase-TRUNK #1721 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1721/)
        HBASE-2481 max seq id in flushed file can be larger than its correct value causing data loss during recovery

        stack added a comment -

        Here is the unit test done for 0.20. It passes. It also adds a check of iterating a Scan.

        stack added a comment -

        Committed to two branches and trunk.

        stack added a comment -

        I tried the patch and now we seem to do the right thing:

        0000000097
        0000000098
        0000000099

        2010-04-24T09:58:57.642-0700: 62.527: [GC 62.527: [ParNew: 17024K->1870K(19136K), 0.0045580 secs] 21509K->6356K(83008K), 0.0045980 secs] [Times: user=0.03 sys=0.00, real=0.00 sec
        org/apache/hadoop/hbase/client/HTable.java:1991:in `next': org.apache.hadoop.hbase.client.ScannerTimeoutException: 60007ms passed since the last invocation, timeout is currently )
                from sun.reflect.GeneratedMethodAccessor2:-1:in `invoke'
                from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
                from java/lang/reflect/Method.java:597:in `invoke'
                from org/jruby/javasupport/JavaMethod.java:298:in `invokeWithExceptionHandling'
                from org/jruby/javasupport/JavaMethod.java:259:in `invoke'
                from org/jruby/java/invokers/InstanceMethodInvoker.java:36:in `call'
                from org/jruby/runtime/callsite/CachingCallSite.java:70:in `call'
                from bin/slow_scan.rb:27:in `__file__'
                from bin/slow_scan.rb:-1:in `load'
                from org/jruby/Ruby.java:577:in `runScript'
                from org/jruby/Ruby.java:480:in `runNormally'
                from org/jruby/Ruby.java:354:in `runFromMain'
                from org/jruby/Main.java:229:in `run'
                from org/jruby/Main.java:110:in `run'
                from org/jruby/Main.java:94:in `main'
        Complete Java stack trace:
        org.apache.hadoop.hbase.client.ScannerTimeoutException: 60007ms passed since the last invocation, timeout is currently set to 60000
                at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1991)
                at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:298)
                at org.jruby.javasupport.JavaMethod.invoke(JavaMethod.java:259)
                at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:36)
                at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:70)
                at bin.slow_scan.__file__(slow_scan.rb:27)
                at bin.slow_scan.load(slow_scan.rb)
                at org.jruby.Ruby.runScript(Ruby.java:577)
                at org.jruby.Ruby.runNormally(Ruby.java:480)
                at org.jruby.Ruby.runFromMain(Ruby.java:354)
                at org.jruby.Main.run(Main.java:229)
                at org.jruby.Main.run(Main.java:110)
                at org.jruby.Main.main(Main.java:94)
        Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 3092528963968774042
                at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1889)
                at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
        ...
        
        stack added a comment -

        Patch looks good enough for 0.20.4. Let me see if I can get mudphone to test.

        Jean-Daniel Cryans added a comment -

        Patch that adds a test, does the right catching, and adds a message to the scanner timeout exception. I'm still not satisfied by how we handle exceptions around that part of the code.
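
        For orientation only, here is a sketch of what "the right catching" in HTable.ClientScanner.next could look like. It is inferred from the analysis in the next comment below and from the exception text in the test output in the comment above; it is not copied from the attached patch, and the String constructor on ScannerTimeoutException is an assumption.

        } catch (DoNotRetryIOException e) {
          // Handle the timed-out UnknownScannerException here, before the
          // generic DoNotRetryIOException logic, because UnknownScannerException
          // extends DoNotRetryIOException and would never reach a later
          // catch (IOException) clause.
          if (e instanceof UnknownScannerException &&
              lastNext + scannerTimeout < System.currentTimeMillis()) {
            long elapsed = System.currentTimeMillis() - lastNext;
            // Message wording assumed from the ScannerTimeoutException seen
            // in the test output above.
            ScannerTimeoutException ex = new ScannerTimeoutException(elapsed +
                "ms passed since the last invocation, " +
                "timeout is currently set to " + scannerTimeout);
            ex.initCause(e);
            throw ex;
          }
          // ... the existing NotServingRegionException "reset scanner"
          // handling, quoted in the next comment, continues here unchanged.
        }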

        Jean-Daniel Cryans added a comment -

        This was caused by HBASE-1671, which made this change in ScannerCallable:

         
           public Result [] call() throws IOException {
             if (scannerId != -1L && closed) {
        -      server.close(scannerId);
        -      scannerId = -1L;
        +      close();
             } else if (scannerId == -1L && !closed) {
        -      // open the scanner
        -      scannerId = openScanner();
        +      this.scannerId = openScanner();
             } else {
        -      Result [] rrs = server.next(scannerId, caching);
        +      Result [] rrs = null;
        +      try {
        +        rrs = server.next(scannerId, caching);
        +      } catch (IOException e) {
        +        IOException ioe = null;
        +        if (e instanceof RemoteException) {
        +          ioe = RemoteExceptionHandler.decodeRemoteException((RemoteException)e);
        +        }
        +        if (ioe != null && ioe instanceof NotServingRegionException) {
        +          // Throw a DNRE so that we break out of cycle of calling NSRE
        +          // when what we need is to open scanner against new location.
        +          // Attach NSRE to signal client that it needs to resetup scanner.
        +          throw new DoNotRetryIOException("Reset scanner", ioe);
        +        }
        +      }
               return rrs == null || rrs.length == 0? null: rrs;
             }
              
        

        We now eat the exception if it's not an NSRE. Throwing a DoNotRetryIOException when it is an NSRE is the right thing to do, but the client code is still broken. In HTable.ClientScanner.next:

         
        try {
          // Server returns a null values if scanning is to stop. Else,
          // returns an empty array if scanning is to go on and we've just
          // exhausted current region.
          values = getConnection().getRegionServerWithRetries(callable);
          if (skipFirst) {
            skipFirst = false;
            // Reget.
            values = getConnection().getRegionServerWithRetries(callable);
          }
        } catch (DoNotRetryIOException e) {
          Throwable cause = e.getCause();
          if (cause == null || !(cause instanceof NotServingRegionException)) {
            throw e;
          }
          // Else, its signal from depths of ScannerCallable that we got an
          // NSRE on a next and that we need to reset the scanner.
          if (this.lastResult != null) {
            this.scan.setStartRow(this.lastResult.getRow());
            // Skip first row returned. We already let it out on previous
            // invocation.
            skipFirst = true;
          }
          // Clear region
          this.currentRegion = null;
          continue;
        } catch (IOException e) {
          if (e instanceof UnknownScannerException &&
              lastNext + scannerTimeout < System.currentTimeMillis()) {
            ScannerTimeoutException ex = new ScannerTimeoutException();
            ex.initCause(e);
            throw ex;
          }
          throw e;
        }
        

        We catch the DoNotRetryIOException first and in the other catch clause we check for UnknownScannerException, which extends DoNotRetryIOException... so ScannerTimeoutException is never used! Easy fix.
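
        The root cause is plain Java catch-clause ordering. Here is a minimal, self-contained illustration with stand-in exception classes (not the real HBase ones): the subclass is matched by the earlier superclass clause, so the later, more specific handling never runs.

        import java.io.IOException;

        public class CatchOrderDemo {
          // Stand-ins mirroring the real hierarchy:
          // UnknownScannerException extends DoNotRetryIOException extends IOException.
          static class DoNotRetryIOException extends IOException {
            DoNotRetryIOException(String msg) { super(msg); }
          }
          static class UnknownScannerException extends DoNotRetryIOException {
            UnknownScannerException(String msg) { super(msg); }
          }

          public static void main(String[] args) {
            try {
              throw new UnknownScannerException("Name: 3092528963968774042");
            } catch (DoNotRetryIOException e) {
              // This clause wins: an UnknownScannerException is a
              // DoNotRetryIOException, so it never reaches the clause below,
              // which is where the ScannerTimeoutException would be raised.
              System.out.println("Swallowed by DoNotRetryIOException handler: " + e);
            } catch (IOException e) {
              System.out.println("Never reached for UnknownScannerException");
            }
          }
        }

        With the clauses in this order, the timeout path in the second clause is dead code for UnknownScannerException, which matches the "scan silently ends" behaviour in the description.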

        stack added a comment -

        Here is a script to demo the problem. It scans 99 rows in a PE table, then pauses for 60 seconds. When it resumes, there is nothing else to fetch (though the table has millions of rows) and over on the server side there is a USE (UnknownScannerException):

        $ more bin/slow_scan.rb 
        # To run: ./bin/hbase org.jruby.Main bin/slow_scan.rb
        include Java
        import org.apache.hadoop.hbase.util.Bytes
        import org.apache.hadoop.hbase.HConstants
        import org.apache.hadoop.hbase.client.HTable
        import org.apache.hadoop.hbase.client.Scan
        import org.apache.hadoop.hbase.HBaseConfiguration
        import org.apache.hadoop.hbase.util.FSUtils
        import org.apache.hadoop.fs.FileSystem
        import java.lang.Thread
        
        # Get configuration to use.
        c = HBaseConfiguration.new()
        
        # Set hadoop filesystem configuration using the hbase.rootdir.
        # Otherwise, we'll always use localhost though the hbase.rootdir
        # might be pointing at hdfs location.
        c.set("fs.default.name", c.get(HConstants::HBASE_DIR))
        fs = FileSystem.get(c)
        
        # Open the TestTable table and scan it row by row.
        t = HTable.new(c, "TestTable")
        scan = Scan.new()
        scanner = t.getScanner(scan)
        count = 0
        while (result = scanner.next())
          rowid = Bytes.toString(result.getRow())
          puts rowid
          count = count + 1 
          # If count == 100, pause for 60 seconds.
          if count == 100 
            Thread.sleep(1000 * 60)
          end
        end
        scanner.close()
        
        Andrew Purtell added a comment -

        +1 to sink release

        Good catch.

        Todd Lipcon added a comment -

        Yes, I think this should block release.


          People

          • Assignee: Jean-Daniel Cryans
          • Reporter: stack
          • Votes: 0
          • Watchers: 0
