HBASE-11325

Malformed RPC calls can corrupt stores

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.94.20
    • Fix Version/s: None
    • Component/s: Client, regionserver
    • Labels: None

      Description

      We noticed in a cluster a Region Server that aborted with a DroppedSnapshotException due to an IOException in ScanWildcardColumnTracker when the RS tried to flush the memstore. After further research we found that a client was sending corrupt RPC requests to the RS, and those corrupt requests ended up in the stores, corrupting the memstore itself and, in some cases, the HFiles. More details to follow.

      Attachments

      • HBASE-11325.patch (2 kB) by Esteban Gutierrez

        Activity

        Matteo Bertozzi added a comment -

        The validation assumes that the buffer is of the correct length; if it is not, you will get an index error or something like that.

        Also, instead of the 2 * KEY_LENGTH_SIZE + 2 magic number, there are constants like KEY_INFRASTRUCTURE_SIZE, ROW_LENGTH_SIZE, FAMILY_LENGTH_SIZE, TIMESTAMP_SIZE, ...
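        For illustration, the magic number could be spelled out with named constants. The names below mirror the ones Matteo lists, but the class, its name, and the exact values are this sketch's assumptions about the 0.94-era KeyValue serialization, not copied from HBase source:

        ```java
        // Hypothetical layout constants for the classic KeyValue wire format:
        // [4B key length][4B value length][2B row length][row][1B family length]
        // [family][qualifier][8B timestamp][1B type][value]
        public class KvLayout {
            public static final int KEY_LENGTH_SIZE = 4;    // int: length of the key
            public static final int VALUE_LENGTH_SIZE = 4;  // int: length of the value
            public static final int ROW_LENGTH_SIZE = 2;    // short: length of the row
            public static final int FAMILY_LENGTH_SIZE = 1; // byte: length of the family
            public static final int TIMESTAMP_SIZE = 8;     // long: timestamp
            public static final int TYPE_SIZE = 1;          // byte: Type code

            // "2 * KEY_LENGTH_SIZE + 2" is the offset of the row bytes:
            // the key-length int, the value-length int, and the row-length short.
            public static final int ROW_OFFSET =
                KEY_LENGTH_SIZE + VALUE_LENGTH_SIZE + ROW_LENGTH_SIZE;
        }
        ```

        Written this way, each term in the offset arithmetic names the field it skips, which is the point of the suggestion.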

        Matteo Bertozzi added a comment -

        What about adding an isValid() to the KeyValue, instead of having the KV internals exposed there?

        Also, you can add a check that the input read matches the expected one:

        +      if (offset != totalLen) {
        +        throw new IOException("Keys sum does not match the header - expected=" +
        +                              totalLen + " got=" + offset);
        +      }
               this.familyMap.put(family, keys);
        
        Esteban Gutierrez added a comment -

        This is a prototype that matches the layout of a KV (according to KeyValue.createEmptyByteArray); if some values are out of range or the position of some markers like KeyValue.Type.Put doesn't match, we throw an IOE back to the client. I couldn't find any issue in the unit tests, and I wasn't able to corrupt the store after this patch.
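        For illustration, here is a standalone sketch of the kind of structural check such a prototype could perform. It uses no HBase classes; the field sizes, offsets, and Type codes are assumptions about the 0.94 KeyValue wire layout, and every name here is hypothetical, not taken from the actual patch:

        ```java
        import java.nio.ByteBuffer;

        // Sketch: validate that a byte range looks like a well-formed KeyValue:
        // [4B key len][4B value len][2B row len][row][1B family len][family]
        // [qualifier][8B timestamp][1B type][value]
        public class KeyValueSanityCheck {
            static final int INFRASTRUCTURE_SIZE = 8; // key-length int + value-length int
            static final int ROW_LENGTH_SIZE = 2;
            static final int FAMILY_LENGTH_SIZE = 1;
            static final int TIMESTAMP_SIZE = 8;
            static final int TYPE_SIZE = 1;

            // Type codes assumed from the 0.94 KeyValue.Type enum
            // (Minimum, Put, Delete, DeleteColumn, DeleteFamily, Maximum).
            static boolean isValidType(int code) {
                return code == 0 || code == 4 || code == 8
                    || code == 12 || code == 14 || code == 255;
            }

            /** Returns true if buf[offset, offset+length) parses as a KV. */
            static boolean isValid(byte[] buf, int offset, int length) {
                if (length < INFRASTRUCTURE_SIZE) return false;
                ByteBuffer bb = ByteBuffer.wrap(buf, offset, length);
                long keyLen = bb.getInt();
                long valLen = bb.getInt();
                if (keyLen < 0 || valLen < 0) return false;
                // The total length must match the header exactly.
                if (INFRASTRUCTURE_SIZE + keyLen + valLen != length) return false;
                // The key must at least hold its fixed-size fields.
                long fixed = ROW_LENGTH_SIZE + FAMILY_LENGTH_SIZE
                           + TIMESTAMP_SIZE + TYPE_SIZE;
                if (keyLen < fixed) return false;
                int rowLen = bb.getShort() & 0xFFFF;
                int famLenPos = ROW_LENGTH_SIZE + rowLen;
                if (famLenPos + FAMILY_LENGTH_SIZE + TIMESTAMP_SIZE + TYPE_SIZE > keyLen)
                    return false;
                int famLen = buf[offset + INFRASTRUCTURE_SIZE + famLenPos] & 0xFF;
                if (famLenPos + FAMILY_LENGTH_SIZE + famLen
                        + TIMESTAMP_SIZE + TYPE_SIZE > keyLen) return false;
                // The type code is the last byte of the key; reject unknown markers.
                int type = buf[(int) (offset + INFRASTRUCTURE_SIZE + keyLen - 1)] & 0xFF;
                return isValidType(type);
            }

            /** Builds a minimal well-formed KV: row "r", family "f",
             *  qualifier "q", value "v", type Put. */
            static byte[] buildExample() {
                ByteBuffer bb = ByteBuffer.allocate(24);
                bb.putInt(15).putInt(1);                 // key length, value length
                bb.putShort((short) 1).put((byte) 'r');  // row
                bb.put((byte) 1).put((byte) 'f');        // family
                bb.put((byte) 'q');                      // qualifier
                bb.putLong(1402018185865L);              // timestamp
                bb.put((byte) 4);                        // type = Put
                bb.put((byte) 'v');                      // value
                return bb.array();
            }
        }
        ```

        A check along these lines would reject both a buffer whose header lengths disagree with its actual size and a KV whose type marker is not a known code, which are the two symptoms described in this issue.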

        Esteban Gutierrez added a comment -

        Forgot to mention that I had to modify the HFile tool to print the row, since the Type in some cases was not valid (see /4/ instead of /Put/ in the previous comment).

        Esteban Gutierrez added a comment -

        This is how the RS aborted due to this corrupt entry in the memstore:

        14/06/05 18:41:44 FATAL regionserver.HRegionServer: ABORTING region server 172.16.0.101,60020,1402018185865: Unrecoverable exception while closing region t0,,1402015274138.a9b83f7801ce96574aeeb2be048690b8., still finishing close
        org.apache.hadoop.hbase.DroppedSnapshotException: region: t0,,1402015274138.a9b83f7801ce96574aeeb2be048690b8.
        	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1606)
        	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1480)
        	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1009)
        	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:957)
        	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:119)
        	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        	at java.lang.Thread.run(Thread.java:724)
        Caused by: java.io.IOException: ScanWildcardColumnTracker.checkColumn ran into a column actually smaller than the previous column:
        	at org.apache.hadoop.hbase.regionserver.ScanWildcardColumnTracker.checkColumn(ScanWildcardColumnTracker.java:104)
        	at org.apache.hadoop.hbase.regionserver.ScanQueryMatcher.match(ScanQueryMatcher.java:357)
        	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:365)
        	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:311)
        	at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:812)
        	at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:746)
        	at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:2348)
        	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1581)
        

        If the malformed RPC Put didn't crash the RS, sometimes it was possible to end with a corrupt HFile:

        14/06/05 19:24:06 ERROR compactions.CompactionRequest: Compaction failed regionName=t0,,1402020343626.25a1ee35a486a512b5b3c18e1c56ba39., storeName=f, fileCount=10, fileSize=6.8k (875.0, 678.0, 678.0, 678.0, 678.0, 712.0, 678.0, 678.0, 678.0, 678.0), priority=-7, time=1402021446164920000
        java.lang.ArrayIndexOutOfBoundsException: 274
        	at org.apache.hadoop.hbase.regionserver.ScanQueryMatcher.match(ScanQueryMatcher.java:251)
        	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:365)
        	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:311)
        	at org.apache.hadoop.hbase.regionserver.Compactor.compact(Compactor.java:184)
        	at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:1081)
        	at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1336)
        	at org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest.run(CompactionRequest.java:303)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        	at java.lang.Thread.run(Thread.java:724)

        14/06/05 19:24:06 DEBUG master.AssignmentManager: The znode of region t0,,1402020343626.25a1ee35a486a512b5b3c18e1c56ba39. has been deleted.
        

        Inspecting the file was not possible after some point:

        K: 2\x01fc\x00\x00\x01F\x86\xE1\xC5\xC9/two\x00:/1402422281673/4/vlen=3/ts=0 V: two
        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 264
        	at org.apache.hadoop.hbase.util.Bytes.toStringBinary(Bytes.java:387)
        	at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:775)
        	at org.apache.hadoop.hbase.KeyValue.toString(KeyValue.java:731)
        	at java.lang.String.valueOf(String.java:2826)
        	at java.lang.StringBuilder.append(StringBuilder.java:115)
        	at org.apache.hadoop.hbase.io.hfile.HFilePrettyPrinter.scanKeysValues(HFilePrettyPrinter.java:269)
        	at org.apache.hadoop.hbase.io.hfile.HFilePrettyPrinter.processFile(HFilePrettyPrinter.java:229)
        	at org.apache.hadoop.hbase.io.hfile.HFilePrettyPrinter.run(HFilePrettyPrinter.java:189)
        	at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:750)
        

          People

          • Assignee: Unassigned
          • Reporter: Esteban Gutierrez
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:
