HBASE-8502: Eternally stuck Region after split

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.92.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Exact HBase version: 0.92.1-cdh4.1.2

      A couple of days ago I encountered a RIT problem with a single region.
      After an hbck run it started trying to assign a region, which then kept
      bouncing between OFFLINE/PENDING_OPEN/OPENING for the next two days.

      This was due to a split gone wrong in some way, which led to several
      reference files being left in the region directory despite the two relevant HFiles having been copied successfully to the daughter.

      I will try to give as many details as possible, but unfortunately I was
      unable to find any information about the split itself.

      Short thread about this issue on the users-ML: http://mail-archives.apache.org/mod_mbox/hbase-user/201305.mbox/%3C5182758B.1060306@neofonie.de%3E

      ===

      Parent region: 5b9c16898a371de58f31f0bdf86b1f8b
      Daughter region in question: 79c619508659018ff3ef0887611eb8f7

      Rough sequence from the logs seems to be the following:

      ===

      • Received request to open region:
        documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.
      • Setting up tabledescriptor config now ...
      • Opening of region {NAME => 'documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.', STARTKEY => '7128586022887322720', ENDKEY => '7130716361635801616', ENCODED => 79c619508659018ff3ef0887611eb8f7,} failed, marking as FAILED_OPEN in ZK
      • File does not exist:
        /hbase/documents/5b9c16898a371de58f31f0bdf86b1f8b/d/0707b1ec4c6b41cf9174e0d2a1785fe9
        [...]
      ===

      What happened was that somehow (and that is the question here) the daughter's
      region folder contained some left-over reference files, which caused the
      RegionServer to look up the parent region, which had already been deleted.

      original contents of /hbase/documents/79c619508659018ff3ef0887611eb8f7/d:
      ==
      0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
      47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
      4f01ecd052ce464d81e79a62ea227d6b
      4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
      eb7dbb09701d4353be24ca82481c4a7e
      ==
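
      For illustration, a minimal sketch of spotting such left-over reference files, assuming only the standard Hadoop FileSystem API (the NameNode address below is a made-up placeholder; the directory and the parent encoded name are the ones above). Reference files are recognisable by the '.<parent encoded name>' suffix:

      ==
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Illustrative sketch: list the daughter's column-family directory and flag
      // files whose names end in ".<parent encoded name>", i.e. reference files
      // still pointing at the (now deleted) parent region.
      public class FindLeftoverReferences {
        public static void main(String[] args) throws Exception {
          String namenode = "hdfs://namenode:8020"; // placeholder NameNode address
          String daughterDir = "/hbase/documents/79c619508659018ff3ef0887611eb8f7/d";
          String parentEncoded = "5b9c16898a371de58f31f0bdf86b1f8b";

          FileSystem fs = FileSystem.get(URI.create(namenode), new Configuration());
          for (FileStatus file : fs.listStatus(new Path(daughterDir))) {
            String name = file.getPath().getName();
            boolean refToParent = name.endsWith("." + parentEncoded);
            System.out.println((refToParent ? "reference -> parent: " : "hfile: ") + name);
          }
        }
      }
      ==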

      I attached the full FileNotFoundException.

      Please let me know if I can provide more information or help otherwise.

      1. hbase_lost_parent.txt
        50 kB
        Dimitri Goldin
      2. hbase_run.log
        12 kB
        Dimitri Goldin
      3. stuck_region_exception.txt
        9 kB
        Dimitri Goldin

          Activity

          Jimmy Xiang added a comment -

          Closed this one for now. I think the region got stuck in transition because the hbck repair could have done something wrong. We can re-open it if we see the issue again without running hbck repair.

          Jean-Daniel Cryans added a comment -

          > Added logs of my two recent HBCK runs

          Could it be that an HBCK was run before that? Unfortunately, I can't find anything interesting in those logs...

          To continue I would need:

          • In the master log, grep for 5b9c16898a371de58f31f0bdf86b1f8b and get us the line where it splits and lists the two daughters
          • Using that line, you can tell which region server did the split, so get everything from that region server that relates to 5b9c16898a371de58f31f0bdf86b1f8b around the time it split
          • In the NN log, grep for 4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
          Dimitri Goldin added a comment -

          Added logs of my two recent HBCK runs (please see hbck_run.log). As to whether hbck created the 'daughter' or not, I can't tell from the split logs, since they don't mention the daughters in this case.

          Jean-Daniel Cryans added a comment -

          This sounds like something someone else encountered that I helped fix. In their case, they ran HBCK after a split happened (not sure why they did) and it merged the parent and the daughters into a new region. The problem is that the reference files were still there, and they ended up alongside the files they reference. Since the parent also got moved, the referenced files moved too and were no longer where the references pointed. This is why, in that case, the region wasn't able to open (it got a FNFE), and it kept getting reassigned by the master.

          The fix is to delete the reference files, or at least move them away, since the original file is right there.

          Dimitri Goldin, can you verify that 79c619508659018ff3ef0887611eb8f7 is really a daughter of 5b9c16898a371de58f31f0bdf86b1f8b? It should say so in the RS log when it splits, or in the master log when the split is reported. If it's not a daughter, then this could definitely be the same issue, as 79c619508659018ff3ef0887611eb8f7 would be the region created by HBCK.
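
          For illustration, a minimal sketch of that cleanup, assuming only the standard Hadoop FileSystem API; the NameNode address and the sideline directory are made-up placeholders, and moving rather than deleting keeps the files around in case they are needed later:

          ==
          import java.net.URI;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Illustrative sketch: move dangling reference files ("<hfile>.<parent encoded name>")
          // out of the daughter's column-family directory into a sideline directory.
          public class SidelineLeftoverReferences {
            public static void main(String[] args) throws Exception {
              String namenode = "hdfs://namenode:8020"; // placeholder
              String daughterDir = "/hbase/documents/79c619508659018ff3ef0887611eb8f7/d";
              String parentEncoded = "5b9c16898a371de58f31f0bdf86b1f8b";
              Path sideline = new Path("/hbase/.sideline/79c619508659018ff3ef0887611eb8f7"); // placeholder

              FileSystem fs = FileSystem.get(URI.create(namenode), new Configuration());
              fs.mkdirs(sideline);
              for (FileStatus file : fs.listStatus(new Path(daughterDir))) {
                String name = file.getPath().getName();
                if (name.endsWith("." + parentEncoded)) {
                  // Move the dangling reference file aside instead of deleting it.
                  boolean ok = fs.rename(file.getPath(), new Path(sideline, name));
                  System.out.println((ok ? "moved " : "FAILED to move ") + name);
                }
              }
            }
          }
          ==

          Presumably nothing should be trying to open the region while the files are moved (for example, by disabling the table first).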

          Dimitri Goldin added a comment -

          Unfortunately not from March, and the only interesting lines that caught my eye during the recent runs are included in the appended log extracts. If you'd like to see the full hbck run, let me know and I'll try to get it.

          Jimmy Xiang added a comment -

          Do you have the hbck log? If so, I'd like to know if the parent region was still there at that moment.

          Dimitri Goldin added a comment -

          > I think it is most likely because the parent is still online so that we don't need the daughter. The parent may have gone through some compaction so the file is gone.

          Just want to add a small detail:

          Of course I cannot be sure how the region folders looked back then, but when I noticed the issue I immediately checked for the parent region's folder and it was completely gone, not just a single HFile. There was no mention in .META. either, and no splitA/splitB columns either. That's why, until now, I assumed that the whole parent was removed after one of the split attempts or by the regionserver cleanup code (it crashed after the point of no return, as mentioned in the log).
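
          For illustration, a minimal sketch of such a .META. check, assuming the 0.92-era client API; the scan start row and the printed output are illustrative, while the table name and parent encoded name are the ones from this issue:

          ==
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;
          import org.apache.hadoop.hbase.util.Bytes;

          // Illustrative sketch: scan .META. for rows of table 'documents' and report
          // whether the parent's row is present and still carries info:splitA/info:splitB.
          public class CheckMetaForParent {
            public static void main(String[] args) throws Exception {
              String parentEncoded = "5b9c16898a371de58f31f0bdf86b1f8b"; // from this report
              Configuration conf = HBaseConfiguration.create();
              HTable meta = new HTable(conf, ".META.");
              try {
                ResultScanner scanner = meta.getScanner(new Scan(Bytes.toBytes("documents,")));
                for (Result row : scanner) {
                  String name = Bytes.toString(row.getRow());
                  if (!name.startsWith("documents,")) break; // past the table's rows
                  if (!name.contains(parentEncoded)) continue; // not the parent's row
                  byte[] splitA = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("splitA"));
                  byte[] splitB = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("splitB"));
                  System.out.println("parent row found: " + name
                      + " splitA=" + (splitA != null) + " splitB=" + (splitB != null));
                }
                scanner.close();
              } finally {
                meta.close();
              }
            }
          }
          ==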

          Jimmy Xiang added a comment -

          > The daughter region was able to stay offline without any problems

          I think it is most likely because the parent is still online, so we don't need the daughter. The parent may have gone through some compaction, so the file is gone. When the parent was split again, the same daughter region happened to be created; it had not been cleaned up properly during the previous split attempt and had a leftover reference file pointing to a file that is already gone. That's my guess. I will take a look at the code to make sure rolling back cleans up the daughter regions.

          As to moving forward, other than reviewing the rollback code, I think we can set up some continuous testing that does region splits all the time. We are sure to be able to reproduce it, if it is a code issue.
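
          For illustration, a rough sketch of what such continuous split testing could look like with the 0.92-era client API; the table name 'split_test', the column family 'd', the batch size and the sleep interval are all made-up placeholders:

          ==
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HBaseAdmin;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;
          import org.apache.hadoop.hbase.util.Bytes;

          // Illustrative sketch: keep loading a test table, ask for splits, and then
          // verify the table is still fully scannable (a stuck region would hang or throw).
          public class ContinuousSplitTest {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);
              HTable table = new HTable(conf, "split_test"); // pre-created test table with family 'd'
              byte[] family = Bytes.toBytes("d");

              for (long i = 0; ; i++) {
                // Write a batch of rows so there is something to split.
                for (int j = 0; j < 10000; j++) {
                  Put put = new Put(Bytes.toBytes(String.format("%020d", i * 10000 + j)));
                  put.add(family, Bytes.toBytes("q"), Bytes.toBytes("v"));
                  table.put(put);
                }
                table.flushCommits();

                admin.split("split_test");   // request a split of the test table
                Thread.sleep(30 * 1000);     // give the split some time to run

                // Full scan: every region must be online and readable for this to finish.
                long rows = 0;
                ResultScanner scanner = table.getScanner(new Scan());
                while (scanner.next() != null) {
                  rows++;
                }
                scanner.close();
                System.out.println("iteration " + i + ": scanned " + rows + " rows");
              }
            }
          }
          ==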

          Dimitri Goldin added a comment -

          Yes, during that time some systems were overloaded, which caused some stability issues. So the split probably failed because of a timeout.

          I checked the DN and NN logs, but they do not contain the period in question (they start at 2013-03-19 11:56) and are pretty useless. They mostly contain the already known FileNotFoundException and some renaming issues quoted below.

          The contents of the 79c619508659018ff3ef0887611eb8f7 daughter suggest that some split either succeeded despite a previously failed and improperly rolled-back attempt, or that the failed split simply failed to clean up. So I do believe that HBase deleted the parent region itself.

          My suspicion is that either there might be some flaw in the rollback logic under strange circumstances, or that re-tried splits don't check for left-over reference files and such. Though it is strange that one of the successfully copied HFiles still has its ref-file.

          I'm sorry that I cannot provide full logs for the period, as this has been lurking silently for quite a while.
          Any ideas as to why the daughter region was able to stay offline for so long without any problems? I think this might almost be a separate issue too.

          On the mailing list it seemed like Kevin Odell had encountered this issue before; maybe he can help us overcome the incomplete information from the logs.

          How can we go about this with what we have? Any ideas how to reproduce and verify the behaviour?

          hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,937 mia-node08.miacluster.priv WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b because destination exists
          hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,938 mia-node08.miacluster.priv WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b because destination exists
          hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,938 mia-node08.miacluster.priv WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b because destination exists

          Jimmy Xiang added a comment -

          Dimitri Goldin, thanks a lot for the log. We can tell the region was split twice. The first time was around 2013-03-18 18:53:10,164, but it failed and was rolled back at 2013-03-18 18:53:59,108. The second time was at 2013-03-18 19:44:59,963, which failed because an HFile was missing. The same HFile was the reason the region got stuck in transition.

          The first split took quite some time. My guess is that it had some problem accessing the HFile in question. Can you check your HDFS NN and DN logs for this file?

          File does not exist: /hbase/documents/5b9c16898a371de58f31f0bdf86b1f8b/d/0707b1ec4c6b41cf9174e0d2a1785fe9

          Do we know what happened to it?

          Dimitri Goldin added a comment -

          I hadn't noticed the region until I ran hbck, and according to the hbck run itself it just wasn't assigned anywhere anymore (I don't know why yet).

          It seems the actual problem happened sometime in March, during a time I was not in the office, but at least one of the daughters seems to have been stuck for a while. I think somebody restarted the cluster and ran hbck -repair back then, attempting to fix the issue. Somehow it went away for a while, meaning the regions just were not onlined.

          I didn't discover it until a couple of days ago, since the table in question is only updated every couple of weeks.

          Fortunately I was able to find some old logs in a backup from mid-March mentioning failed splits and rollbacks of the parent region, including a different daughter region (939c1e9d10cc4e97d7284025f20298fb), which seemed to have the same problem. I presume both were created on 2013-03-18 during the same failed split. Unfortunately they are not explicitly mentioned as daughters in the logs, since the split failed. Sadly I do not have any logs left between ~2013-03-20 and 2013-05-01, since most have been rotated by now.

          Unfortunately I was also unable to find any mention of the 79c619508659018ff3ef0887611eb8f7 region in the master logs from that time.

          As to what happened to the parent region after the split, I'm really not sure at the moment. It is obvious that it was removed at some point, making it impossible to online either daughter, even though parts of the logs state that the splits were rolled back. The last mention of the region I'm able to find is from 2013-03-18 19:45:00,014 (MASTER, point-of-no-return error).

          There is also another 'new' question: why did the attempts to online both daughters stop at some point, until hbck touched one of them? It's also unclear what happened to the second daughter (939...8fb).

          Please see the attached file, in which I tried to collect the relevant sections of the hbck, master and regionserver logs. I hope this helps. I will try to find more and update.

          Jimmy Xiang added a comment -

          Was this region already stuck in transition before you ran hbck? I think it would be more useful to have the split-related logs (both from the region server where the split happened and from the master). Do you know what happened to the parent region after it was split?


            People

            • Assignee: Unassigned
            • Reporter: Dimitri Goldin
            • Votes: 1
            • Watchers: 3
