HBase / HBASE-4094

Improve the hbck tool to fix more HBase problems

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.90.3
    • Fix Version/s: None
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The hbck tool (org.apache.hadoop.hbase.util.HBaseFsck) can check and repair consistency problems.
      Some errors are only detected, with no way offered to repair them. I plan to fix them with other tools (close_region...) or with new methods.
      First, let me list them and discuss whether the approach is right.

      Part A: check meta info
      1. errors.reportError(ERROR_CODE.NULL_ROOT_REGION, "Root Region or some of its attributes are null.");
      ----> After deleting the root table, I ran the hbck tool to check, but the tool itself failed with an error. How can this error be reproduced?

      2. errors.reportError(ERROR_CODE.NO_META_REGION, ".META. is not found on any region.");
      ----> After deleting the meta table, I ran the hbck tool to check, but the tool itself failed with an error. How can this error be reproduced?

      3. errors.reportError(ERROR_CODE.MULTI_META_REGION, ".META. is found on more than one region.");
      ----> The logic: scan the root table to get the META table's regioninfo; if the META table has more than one region, report the error.
      Does HBase allow the META table to have more than one region?

      Part B: check consistency
      4. ERROR_CODE.NOT_IN_META_HDFS ----> close it on the regionserver.

      5. ERROR_CODE.NOT_IN_META_OR_DEPLOYED ----> do nothing; it may be used to fix the chain hole in Part C.

      6. ERROR_CODE.NOT_IN_META ----> close it on the regionserver.

      7. ERROR_CODE.NOT_IN_HDFS_OR_DEPLOYED ----> delete it from the META table. This creates a chain hole, which is fixed during the chain integrity check (Part C).

      8. ERROR_CODE.NOT_IN_HDFS ----> delete it from the META table and close it on the regionserver; the resulting hole is fixed during the chain integrity check (Part C).

      9. ERROR_CODE.NOT_DEPLOYED ----> assign it.

      10. ERROR_CODE.SHOULD_NOT_BE_DEPLOYED ----> delete it from the META table and close it on the regionserver.

      11. ERROR_CODE.MULTI_DEPLOYED ----> close it on all regionservers, then reassign it.

      12. ERROR_CODE.SERVER_DOES_NOT_MATCH_META ----> close it on all regionservers, then reassign it.
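
The Part B proposals can be read as a lookup table from error code to repair steps. The sketch below is only an illustration of that table: the class ConsistencyRepairPlan, the Action enum, and actionsFor are hypothetical names and are not part of HBaseFsck. Only the error code names come from the list above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the Part B proposals; not HBaseFsck code.
final class ConsistencyRepairPlan {

    // Error codes 4 to 12 from the list above.
    enum ErrorCode {
        NOT_IN_META_HDFS, NOT_IN_META_OR_DEPLOYED, NOT_IN_META,
        NOT_IN_HDFS_OR_DEPLOYED, NOT_IN_HDFS, NOT_DEPLOYED,
        SHOULD_NOT_BE_DEPLOYED, MULTI_DEPLOYED, SERVER_DOES_NOT_MATCH_META
    }

    // Assumed names for the three kinds of repair step mentioned in the proposal.
    enum Action { CLOSE_ON_REGIONSERVER, DELETE_FROM_META, ASSIGN }

    private static final Map<ErrorCode, List<Action>> PLAN = new EnumMap<>(ErrorCode.class);

    static {
        PLAN.put(ErrorCode.NOT_IN_META_HDFS, Arrays.asList(Action.CLOSE_ON_REGIONSERVER));
        // Item 5: do nothing here; the region may be reused for a hole in Part C.
        PLAN.put(ErrorCode.NOT_IN_META_OR_DEPLOYED, Collections.<Action>emptyList());
        PLAN.put(ErrorCode.NOT_IN_META, Arrays.asList(Action.CLOSE_ON_REGIONSERVER));
        PLAN.put(ErrorCode.NOT_IN_HDFS_OR_DEPLOYED, Arrays.asList(Action.DELETE_FROM_META));
        PLAN.put(ErrorCode.NOT_IN_HDFS,
                Arrays.asList(Action.DELETE_FROM_META, Action.CLOSE_ON_REGIONSERVER));
        PLAN.put(ErrorCode.NOT_DEPLOYED, Arrays.asList(Action.ASSIGN));
        PLAN.put(ErrorCode.SHOULD_NOT_BE_DEPLOYED,
                Arrays.asList(Action.DELETE_FROM_META, Action.CLOSE_ON_REGIONSERVER));
        // Items 11 and 12: close on every regionserver, then reassign.
        PLAN.put(ErrorCode.MULTI_DEPLOYED,
                Arrays.asList(Action.CLOSE_ON_REGIONSERVER, Action.ASSIGN));
        PLAN.put(ErrorCode.SERVER_DOES_NOT_MATCH_META,
                Arrays.asList(Action.CLOSE_ON_REGIONSERVER, Action.ASSIGN));
    }

    // Returns the proposed repair steps, in order, for one error code.
    static List<Action> actionsFor(ErrorCode code) {
        return PLAN.get(code);
    }
}
```

Keeping the plan in one table makes it easy to review each proposal against the list before any repair code is written.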

      Part C: check chain integrity
      13. ERROR_CODE.FIRST_REGION_STARTKEY_NOT_EMPTY ----> treat it as a hole problem (ERROR_CODE.HOLE_IN_REGION_CHAIN).

      14. ERROR_CODE.LAST_REGION_ENDKEY_NOT_EMPTY (newly added) ----> treat it as a hole problem (ERROR_CODE.HOLE_IN_REGION_CHAIN).

      15. ERROR_CODE.REGION_CYCLE ----> shut down the cluster and merge the two regions with the merge tool (org.apache.hadoop.hbase.util.Merge).

      16. ERROR_CODE.DUPE_STARTKEYS ----> shut down the cluster and merge the two regions with the merge tool (org.apache.hadoop.hbase.util.Merge).

      17. ERROR_CODE.OVERLAP_IN_REGION_CHAIN ----> shut down the cluster and merge the two regions with the merge tool (org.apache.hadoop.hbase.util.Merge).

      18. ERROR_CODE.HOLE_IN_REGION_CHAIN ----> write a new method to fix it. The logic: to recover the data, collect the regioninfo from the regionservers and HDFS. If a region's key range overlaps the hole range, put it in the META table and assign it. This may create an overlap problem, which we can fix with the merge tool. If no region is collected, create a new region covering the hole's key range.
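
The selection step proposed in item 18 might look roughly like the sketch below. It is not taken from the attached patches: HoleFixSketch, KeyRange, and regionsToFillHole are hypothetical names, and keys are modelled as Java strings (HBase uses byte arrays), with an empty end key standing for "end of table".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the item 18 proposal; not HBaseFsck code.
final class HoleFixSketch {

    // Half-open key range [start, end). An empty end key means "up to the end of the table".
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    // True if 'start' sorts strictly before 'end', treating an empty end as unbounded.
    static boolean startsBefore(String start, String end) {
        return end.isEmpty() || start.compareTo(end) < 0;
    }

    // Two half-open ranges overlap when each one starts before the other ends.
    static boolean overlaps(KeyRange a, KeyRange b) {
        return startsBefore(a.start, b.end) && startsBefore(b.start, a.end);
    }

    // Picks the orphan regions (found on regionservers or HDFS but absent from META)
    // that overlap the hole. If none overlap, falls back to one new empty region
    // that spans exactly the hole.
    static List<KeyRange> regionsToFillHole(KeyRange hole, List<KeyRange> orphans) {
        List<KeyRange> picked = new ArrayList<>();
        for (KeyRange region : orphans) {
            if (overlaps(region, hole)) {
                picked.add(region);
            }
        }
        if (picked.isEmpty()) {
            picked.add(new KeyRange(hole.start, hole.end));
        }
        return picked;
    }
}
```

A picked region may extend past the hole on either side, which is the overlap case the proposal expects to clean up afterwards with the merge tool.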

        Issue Links

          Activity

          feng xu created issue -
          feng xu made changes -
          Field Original Value New Value
          Status Open [ 1 ] Patch Available [ 10002 ]
          feng xu made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          feng xu added a comment -

          The table key chain is checked based only on the META info. If a region is not deployed on any regionserver, we can delete it from META with the hbase shell. That makes a hole in the chain, which we can fix by reading the regioninfo from HDFS or by creating a new region.
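
A chain check over META rows could be sketched as follows. This is an assumption-laden illustration, not the logic of HbaseFsck_TableChain.patch: ChainHoleCheck and findHoles are hypothetical names, each region is a two-element array {startKey, endKey}, and keys are compared as Java strings rather than as HBase byte arrays. An empty start key means the start of the table and an empty end key means the end of the table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a META-only chain check; not HBaseFsck code.
final class ChainHoleCheck {

    // Returns every uncovered key range as {holeStart, holeEnd}.
    // A missing first region shows up as a hole starting at "", and a missing
    // last region as a hole ending at "", matching items 13 and 14 of the description.
    static List<String[]> findHoles(List<String[]> regions) {
        List<String[]> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparing((String[] r) -> r[0]));

        List<String[]> holes = new ArrayList<>();
        String covered = "";          // every key below this one is covered so far
        boolean coveredToEnd = false; // set once a region with an empty end key is seen

        for (String[] region : sorted) {
            if (coveredToEnd) {
                break;
            }
            if (region[0].compareTo(covered) > 0) {
                holes.add(new String[] {covered, region[0]});
            }
            if (region[1].isEmpty()) {
                coveredToEnd = true;
            } else if (region[1].compareTo(covered) > 0) {
                covered = region[1];
            }
        }
        if (!coveredToEnd) {
            holes.add(new String[] {covered, ""});
        }
        return holes;
    }
}
```

Overlapping regions are tolerated here and simply extend the covered range; reporting them is a separate check.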

          feng xu made changes -
          Attachment HBaseFsck.patch [ 12486983 ]
          feng xu made changes -
          Attachment HBaseFsck.patch [ 12486983 ]
          feng xu made changes -
          Attachment HbaseFsck_TableChain.patch [ 12486984 ]
          Jieshan Bean added a comment -

          There are so many failure scenarios listed above. Most of them make sense to me.
          I have one question about the patch: does it fix just one of those scenarios?

          feng xu added a comment -

          Some failure scenarios (ERROR_CODE.MULTI_DEPLOYED, ERROR_CODE.NOT_DEPLOYED, ...) are already fixed by the hbck tool,
          but for many failure scenarios hbck does not supply a way to fix them, such as the hole problem.

          This patch (HbaseFsck_TableChain.patch) checks for table chain holes in META; next I plan to fix the hole problem.

          >>18. ERROR_CODE.HOLE_IN_REGION_CHAIN ----> write a new method to fix it. The logic: to recover the data, collect the regioninfo from the regionservers and HDFS. If a region's key range overlaps the hole range, put it in the META table and assign it. This may create an overlap problem, which we can fix with the merge tool. If no region is collected, create a new region covering the hole's key range.

          stack added a comment -

          @Feng Thank you for digging in on this important issue. As Jieshan asks, are you going to fix all the issues not already addressed by --fix listed above? If so, do you think it best to do it one per issue (if you'd rather piecemeal it) or do you want to make a monster patch to do them all in this issue?

          I looked at your patch and it seems to comment out a line of code and add a comment.

          Regarding 18 above, when you say collect the regioninfo from regionserver and hdfs, what do you mean? What if the region is not on a regionserver? If it's in the filesystem, how will you find it? You only have the gap in the table, and from there you need to get to the encoded name of the region in the filesystem. Are you thinking of looking at all the regions in the filesystem, getting all of their .regioninfos, and then checking which has a start and stop key that matches the hole?

          If this fails, yes, create a new region to bridge the hole.

          Do you think this issue has overlap with HBASE-4058 Feng?

          Thanks.

          feng xu added a comment -

          >>Regards 18 above, when you say collect the regioninfo from regionserver and hdfs, what do you mean?
          The regions in the hole may be deployed on a regionserver; if not, they may be in HDFS. We just need to find them to fix the hole, which recovers the hole's data.

          >>What if the region is not on a regionserver? If its in the filesystem, how will you find it?
          The hbck tool checks the regionservers (WorkItemRegion()) and HDFS (WorkItemHdfsDir()). If a region comes from a regionserver or HDFS, it is stored and
          marked with where it came from using the enum INFO_FROM. When checking the table chain, we can refer to it to fix the hole.

          >>Are you thinking of looking at all the regions in the filesystem and getting all of their .regioninfos
          Yes, but only the regions from the regionservers or the filesystem that are not recorded in the META table.

          >>then checking for which has a start and stop key that matches the hole?
          The hole may need several regions to fix. In my patch, if a region from a regionserver or the filesystem has a key range that overlaps the hole,
          I will use it to fix the hole. I know this will create an overlap problem in the META table, but it recovers the hole's data, and we can fix the overlap problem
          with the merge tool. Is that right?

          >>Do you think this issue has overlap with HBASE-4058 Feng?
          Yes, this issue is also related to using the hbck tool to fix cluster problems.

          I have filed another issue, HBASE-4122, about how to fix the chain hole problem, and submitted a patch.

          stack made changes -
          Fix Version/s 0.90.6 [ 12319200 ]
          Fix Version/s 0.90.5 [ 12317145 ]
          ramkrishna.s.vasudevan added a comment -

          Moving to 0.90.7. HBASE-5128 also is related to improving hbck tool.

          ramkrishna.s.vasudevan made changes -
          Fix Version/s 0.90.7 [ 12319481 ]
          Fix Version/s 0.90.6 [ 12319200 ]
          Anoop Sam John added a comment -

          This may be closed as a duplicate, since HBASE-5128 handles these valid scenarios now.

          14.ERROR_CODE.LAST_REGION_ENDKEY_NOT_EMPTY(new add)--->treat it as a hole problem(ERROR_CODE.HOLE_IN_REGION_CHAIN).

          This check is not there now. But there is another issue HBASE-4379 on this.

          stack added a comment -

          Resolving at Anoop's suggestion as dup of hbase-5128

          stack made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jonathan Hsieh made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Jonathan Hsieh added a comment -

          Changed to duplicate.

          Jonathan Hsieh made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Jonathan Hsieh added a comment -

          Cleaned up jira to follow convention. Marked as duplicate of HBASE-5128

          Jonathan Hsieh made changes -
          Fix Version/s 0.90.7 [ 12319481 ]
          Jonathan Hsieh made changes -
          Link This issue is duplicated by HBASE-5128 [ HBASE-5128 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              feng xu
            • Votes:
              1
              Watchers:
              3


                Time Tracking

                Estimated: 12h
                Remaining: 12h
                Logged: Not Specified
