HBase / HBASE-6294

Detect leftover data in ZK after a user deletes all their HBase data

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.95.1
    • Component/s: None
    • Labels: None

      Description

      It seems we have a new failure mode when a user deletes the HBase root dir (hbase.rootdir) but doesn't delete the ZK data. For example, a user on IRC came with this log:

      2012-06-30 09:07:48,017 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: kw,,1340981821308.2e8a318837602c9c9961e9d690b7fd02.
      2012-06-30 09:07:48,017 WARN org.apache.hadoop.hbase.util.FSTableDescriptors: The following folder is in HBase's root directory and doesn't contain a table descriptor, do consider deleting it: kw
      2012-06-30 09:07:48,018 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:34193-0x1383bfe01b70001 Attempting to transition node 2e8a318837602c9c9961e9d690b7fd02 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
      2012-06-30 09:07:48,018 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=localhost,50890,1341036299694, region=2e8a318837602c9c9961e9d690b7fd02
      2012-06-30 09:07:48,020 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=localhost,34193,1341036300138, region=b254af24c9127b8bb22cb6d24e523dad
      2012-06-30 09:07:48,020 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for b254af24c9127b8bb22cb6d24e523dad
      2012-06-30 09:07:48,020 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=kw_r,,1340981822374.b254af24c9127b8bb22cb6d24e523dad. state=CLOSED, ts=1341036467998, server=localhost,34193,1341036300138
      2012-06-30 09:07:48,020 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:50890-0x1383bfe01b70000 Creating (or updating) unassigned node for b254af24c9127b8bb22cb6d24e523dad with OFFLINE state
      2012-06-30 09:07:48,028 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:34193-0x1383bfe01b70001 Successfully transitioned node 2e8a318837602c9c9961e9d690b7fd02 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
      2012-06-30 09:07:48,028 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'kw,,1340981821308.2e8a318837602c9c9961e9d690b7fd02.', STARTKEY => '', ENDKEY => '', ENCODED => 2e8a318837602c9c9961e9d690b7fd02,}
      2012-06-30 09:07:48,029 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=kw,,1340981821308.2e8a318837602c9c9961e9d690b7fd02., starting to roll back the global memstore size.
      java.lang.IllegalStateException: Could not instantiate a region instance.
      	at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3490)
      	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3628)
      	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
      	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
      	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      	at java.lang.Thread.run(Thread.java:679)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
      	at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3487)
      	... 7 more
      Caused by: java.lang.NullPointerException
      	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:133)
      	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:125)
      	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:411)
      	... 11 more
      2012-06-30 09:07:48,031 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'kw,,1340981821308.2e8a318837602c9c9961e9d690b7fd02.', STARTKEY => '', ENDKEY => '', ENCODED => 2e8a318837602c9c9961e9d690b7fd02,} failed, marking as FAILED_OPEN in ZK
      2012-06-30 09:07:48,032 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:34193-0x1383bfe01b70001 Attempting to transition node 2e8a318837602c9c9961e9d690b7fd02 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
      2012-06-30 09:07:48,031 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=localhost,34193,1341036300138, region=2e8a318837602c9c9961e9d690b7fd02
      2012-06-30 09:07:48,043 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=localhost,50890,1341036299694, region=b254af24c9127b8bb22cb6d24e523dad
      

      The exception itself is not very useful, nor is the NPE deep in the coproc stack. What was really useful was this:

      2012-06-30 09:07:48,017 WARN org.apache.hadoop.hbase.util.FSTableDescriptors: The following folder is in HBase's root directory and doesn't contain a table descriptor, do consider deleting it: kw
      

      So HBase wants to assign a region from a table that doesn't exist, and we fail in an obscure way. I told the user to shut down HBase, nuke /tmp/hbase-user as it will contain both the HBase data and the ZK data, and restart. It worked.

      This situation is new in 0.94. We need to detect it so our users have a better experience getting started with HBase.
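
The detection this issue asks for can be sketched as a set difference: any table name still recorded in ZK that has no directory under the root dir is leftover state. This is a minimal illustration, not HBase code. `LeftoverZkDetector`, `findLeftovers`, and the assumption that both sets of names have already been read from ZK and from the filesystem are hypothetical choices made for the example. The table names `kw` and `kw_r` are taken from the log above.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; it is not part of HBase.
final class LeftoverZkDetector {

    private LeftoverZkDetector() {}

    /**
     * Returns the table names that ZK still records but that have no
     * directory under the HBase root dir. Both inputs are assumed to have
     * been collected by the caller.
     */
    static Set<String> findLeftovers(Set<String> zkTableNames, Set<String> fsTableDirs) {
        Set<String> leftovers = new TreeSet<>(zkTableNames);
        leftovers.removeAll(fsTableDirs);
        return leftovers;
    }

    public static void main(String[] args) {
        // ZK still knows both tables, but the root dir was wiped.
        Set<String> inZk = Set.of("kw", "kw_r");
        Set<String> onDisk = Set.of();
        for (String table : findLeftovers(inZk, onDisk)) {
            System.out.println("WARN table '" + table
                + "' is recorded in ZK but has no data under the root dir");
        }
    }
}
```

A check like this could run at master startup and either warn or clean up, instead of letting the assignment fail later with an NPE.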

        Activity

        Jean-Daniel Cryans created issue -
        Lars Hofhansl added a comment -

        In a similar vein, I found it impossible to switch a cluster from HBase <= 0.94 to HBase 0.96 (protobufs) without wiping the ZK state.
        We typically say that the ZK is not important for operating HBase, but that is not strictly true. For example, we need the ZK state for replication.

        stack added a comment -

        That's a bug, Lars. The aim is to just stop/start to move from 0.94 to 0.96. Most of the znodes are automigrated; I must have missed some. I made HBASE-6316 to make sure this is not necessary when going to 0.96.

        Lars Hofhansl added a comment -

        Hmm... Upon starting a new cluster AssignmentManager already calls cleanoutUnassigned().

        Jean-Daniel Cryans added a comment -

        It won't clear the table state tho.

        Lars Hofhansl added a comment -

        It's not entirely clear how to fix this quickly.
        Let's move this to 0.94.2.

        Please pull back if you disagree.

        Lars Hofhansl made changes -
        Fix Version/s: 0.94.1 [ 12320257 ] → 0.94.2 [ 12321884 ]
        Lars George added a comment -

        Another issue, reported by someone on IM, is that an entry in /hbase/tables is causing a problem where you cannot create a table with a previously known name. For some reason the table was first disabled, then HDFS was wiped clean, yet the entry in ZK remains and causes some check to fail when you try to create the table.

        Devaraj Das added a comment -

        I just gave a shot at trying to address the issue, and tried to reproduce the problem. I couldn't reproduce this problem when I removed all the directory contents (/tmp/hbase-ddas/hbase that is). But when I removed one table directory (/tmp/hbase-ddas/hbase/<table>), hbase failed to start up. I could fix that up by running "bin/hbase hbck -fixMeta", and then hbase started up fine.

        stack added a comment -

        @Devaraj So that seems like a decent workaround.

        Going back to J-D's original comment, it sounds like we shouldn't be assigning regions for tables that don't exist. Or, if a regionserver gets a region to open that is for a non-existent table, it should just eat it up with a nice log message.

        @LarsG Should we make a new issue for that? Seems like, again, we should eat up the ZK data if there is no corresponding table in HDFS/.META. and proceed?

        @J-D Lars says "We typically say that the ZK is not important for operating HBase, but that is not strictly true. For example, we need the ZK state for replication."

        Can we fix that? It'd be cool if we could keep the axiom that zk state is transient. Or maybe, for the likes of data that needs to prevail across restarts and upgrades, it should be recorded elsewhere in zk, outside of the per-cluster location?
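
The second option above, having the regionserver refuse the open with a clear message rather than fail with an NPE in the coprocessor host, might look like the guard below. It is a sketch under stated assumptions: `OpenRegionGuard`, `rejectionReason`, and the `descriptorLookup` function are hypothetical names, not HBase APIs, and the table name is taken to be the part of the region name before the first comma, as in the region names in the log.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical guard for illustration only; it is not part of HBase.
final class OpenRegionGuard {

    private OpenRegionGuard() {}

    /** Extracts the table name, assumed to be the text before the first comma. */
    static String tableNameOf(String regionName) {
        int comma = regionName.indexOf(',');
        return comma < 0 ? regionName : regionName.substring(0, comma);
    }

    /**
     * Returns a log-friendly reason when the region must not be opened
     * because its table has no descriptor, or an empty Optional when the
     * open may proceed. descriptorLookup stands in for whatever the server
     * uses to find a table descriptor and returns null when there is none.
     */
    static Optional<String> rejectionReason(String regionName,
                                            Function<String, Object> descriptorLookup) {
        String table = tableNameOf(regionName);
        if (descriptorLookup.apply(table) == null) {
            return Optional.of("Not opening region " + regionName + ": table '" + table
                + "' has no table descriptor; ZK may hold leftover state for a deleted table");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String region = "kw,,1340981821308.2e8a318837602c9c9961e9d690b7fd02.";
        // Simulate a root dir with no descriptor for any table.
        rejectionReason(region, table -> null)
            .ifPresent(msg -> System.out.println("WARN " + msg));
    }
}
```

Checking before the region object is constructed keeps the failure at the point where the cause is known, so the log names the missing table instead of surfacing a nested exception.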

        Lars Hofhansl added a comment -

        I think this can be moved to 0.94.3 (unless somebody has a concrete plan about what to do here).

        Lars Hofhansl made changes -
        Fix Version/s: 0.94.2 [ 12321884 ] → 0.94.3 [ 12323144 ]
        Lars Hofhansl added a comment -

        Still no patch... Moving to 0.94.4

        Lars Hofhansl made changes -
        Fix Version/s: 0.94.3 [ 12323144 ] → 0.94.4 [ 12323367 ]
        Lars Hofhansl added a comment -

        No fix... Does not seem to be critical to anyone

        Lars Hofhansl made changes -
        Fix Version/s: 0.94.4 [ 12323367 ] → 0.94.5 [ 12323874 ]
        Priority: Critical [ 2 ] → Major [ 3 ]
        Lars Hofhansl added a comment -

        Unscheduling from 0.94.

        Lars Hofhansl made changes -
        Fix Version/s 0.94.5 [ 12323874 ]
        stack made changes -
        Fix Version/s 0.95.0 [ 12324094 ]
        Fix Version/s 0.96.0 [ 12320040 ]
        Lars Hofhansl made changes -
        Fix Version/s 0.94.5 [ 12323874 ]
        Lars Hofhansl made changes -
        Fix Version/s 0.94.5 [ 12323874 ]
        stack made changes -
        Fix Version/s: 0.95.0 [ 12324094 ] → 0.95.1 [ 12324288 ]
        Jean-Daniel Cryans added a comment -

        At least on the tip of 0.94 in standalone mode, I can wipe out HBase's root dir and restart without problems; ZK uses a new data folder when it's restarted. Not sure when this was introduced. I'm fine closing this unless Lars Hofhansl, Devaraj Das or Lars George have something against it.

        Devaraj Das added a comment -

        I am fine resolving this.

        stack added a comment -

        Let's resolve and open a new issue when we run into a real obstacle?

        Lars George added a comment -

        I am fine too, but have that strange feeling that this is still going to rear its ugly head somewhere. But I agree, we should create a better JIRA then.

        Jean-Daniel Cryans added a comment -

        Resolving as Incomplete.

        Jean-Daniel Cryans made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Incomplete [ 4 ]
        Nicolas PHUNG added a comment -

        I don't know if it is related to http://arnon.me/2013/01/killing-hbase-zombie-table/. We got a zombie HBase table and "nuke /tmp/hbase-user as it will contain both the HBase data and the ZK data, and restart" helps us get rid of it.

        stack made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]

          People

          • Assignee: Unassigned
          • Reporter: Jean-Daniel Cryans
          • Votes: 0
          • Watchers: 12
