HBase / HBASE-5665

Repeated split causes HRegionServer failures and breaks table

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.92.0, 0.92.1
    • Fix Version/s: 0.92.2, 0.94.0
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Repeated splits on large tables (two consecutive splits would suffice) will essentially "break" the table (and the cluster) beyond recovery.
      The regionserver doing the split dies, and the master gets into an infinite loop trying to assign regions whose files appear to be missing from HDFS.

      The table can be disabled once; upon trying to re-enable it, it will remain in an intermediate state forever.

      I was able to reproduce this consistently on a smaller table:

      hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
      hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
      

      Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}", ...) will reproduce the issue almost instantly and consistently.

      2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
      2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
      2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
      java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
              at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
              at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
              at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
              at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
              at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
              at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
              at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
              at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
              at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
              at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
              at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
              at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
              at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
              at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
              at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
              at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
              at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
              at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
              at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
              at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
              ... 1 more
      2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
      

      http://hastebin.com/diqinibajo.avrasm

      later edit:

      (I'm using the last 4 characters of each name.)
      Region 94e3 has storefile 7237.
      Region 94e3 gets split into daughters a: ffa1 and b: eee1.
      Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
      ffa1 has a reference, 7237.94e3, for its store file.
      When ffa1 gets split, it will create another reference: 7237.94e3.ffa1.
      When SplitTransaction runs execute(), it will try to open that reference (openDaughters above), matching the name from left to right as [storefile].[region]:

      "^([0-9a-f]+)(?:\\.(.+))?$"
      

      and will attempt to go to /hbase/t1/[region], which resolves to
      /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails.
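      A standalone demo of that left-to-right mis-parse, using the pattern quoted above and the shortened names from this walkthrough (the class name is mine, for illustration only):

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Shows how the reference-name pattern splits a doubled reference
      // at the FIRST dot, producing a parent path that doesn't exist.
      public class ReferenceParseDemo {
        // [storefile].[region], as quoted above
        private static final Pattern REF_NAME =
            Pattern.compile("^([0-9a-f]+)(?:\\.(.+))?$");

        public static void main(String[] args) {
          // The reference created when daughter ffa1 is itself split:
          String name = "7237.94e3.ffa1";
          Matcher m = REF_NAME.matcher(name);
          if (m.matches()) {
            System.out.println("storefile = " + m.group(1)); // 7237
            System.out.println("region    = " + m.group(2)); // 94e3.ffa1
            // Resolved path: /hbase/t1/94e3.ffa1/f1/7237 -- bogus,
            // because "94e3.ffa1" is not a region directory.
            System.out.println("/hbase/t1/" + m.group(2) + "/f1/" + m.group(1));
          }
        }
      }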

      This seems like a design problem: we should either stop a region from splitting while any of its store file paths is still a reference, or be able to recursively resolve reference paths (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237), as sketched below.
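      A rough sketch of the second option (the class and helper are hypothetical, not the committed patch, which may take a different approach entirely):

      import org.apache.hadoop.fs.Path;

      // Resolves a possibly-chained reference name right to left:
      //   7237.94e3.ffa1 -> [7237.94e3].ffa1 -> look under region ffa1
      //                  -> [7237].94e3      -> real file under region 94e3
      public class RefResolver {
        static Path resolveToStoreFile(Path tableDir, String family, String refName) {
          int lastDot = refName.lastIndexOf('.');
          if (lastDot < 0) {
            throw new IllegalArgumentException("not a reference: " + refName);
          }
          // The RIGHTMOST suffix names the region holding the remainder.
          String parentRegion = refName.substring(lastDot + 1);
          String rest = refName.substring(0, lastDot);
          if (rest.indexOf('.') < 0) {
            // The remainder is a plain storefile id: the real file.
            return new Path(new Path(new Path(tableDir, parentRegion), family), rest);
          }
          // The remainder is itself a reference (e.g. "7237.94e3"): recurse.
          return resolveToStoreFile(tableDir, family, rest);
        }

        public static void main(String[] args) {
          System.out.println(resolveToStoreFile(
              new Path("/hbase/t1"), "f1", "7237.94e3.ffa1"));
          // prints /hbase/t1/94e3/f1/7237
        }
      }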

      Attachments

      1. HBASE-5665-trunk.patch
        4 kB
        Matteo Bertozzi
      2. HBASE-5665-0.92.patch
        3 kB
        Cosmin Lehene
      3. 5665trunk.v2.patch
        4 kB
        stack

        Activity

        Cosmin Lehene created issue -
        Cosmin Lehene made changes -
        Field Original Value New Value
        Description [description text as shown above; the edit added the "later edit" analysis]

        Cosmin Lehene made changes -
        Assignee Cosmin Lehene [ clehene ]
        Cosmin Lehene made changes -
        Attachment HBASE-5665-0.92.patch [ 12520458 ]
        Cosmin Lehene made changes -
        Affects Version/s 0.94.0 [ 12316419 ]
        Affects Version/s 0.96.0 [ 12320040 ]
        Affects Version/s 0.94.1 [ 12320257 ]
        Cosmin Lehene made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.94.0 [ 12316419 ]
        Affects Version/s 0.96.0 [ 12320040 ]
        Affects Version/s 0.94.1 [ 12320257 ]
        Matteo Bertozzi made changes -
        Attachment HBASE-5665-trunk.patch [ 12520847 ]
        stack made changes -
        Attachment 5665trunk.v2.patch [ 12521034 ]
        stack made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 0.92.2 [ 12319888 ]
        Fix Version/s 0.94.0 [ 12316419 ]
        Resolution Fixed [ 1 ]
        Lars Hofhansl made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Cosmin Lehene
          • Reporter: Cosmin Lehene
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:
              Resolved:
