HBASE-6752

On region server failure, serve writes and timeranged reads during the log split

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.95.2
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels:
      None

      Description

      Opening for write on failure would mean:

      • Assign the region to a new regionserver. It marks the region as recovering
        • a specific exception is returned to the client when we cannot serve
        • allows clients to know where they stand. The exception can include some time information (failure started on: ...)
        • allows clients to go immediately to the right regionserver, instead of retrying or calling the region holding meta to get the new address
          => saves network calls, lowers the load on meta.
      • Do the split as today. Priority is given to the region servers holding the new regions
        • helps share the load-balancing code: the split is done by region servers considered available for new regions
        • helps locality (the recovered edits are available on the region server) => lowers network usage
      • When the split is finished, we're done, as today
      • While the split is in progress, the region server can
        • serve writes
          • useful for all applications that need to write but not read immediately:
          • anything that logs events to analyze them later
          • OpenTSDB is a perfect example.
        • serve reads if they have a compatible time range. For heavily used tables, this could help, because:
          • only the last few minutes of data should be unavailable (while it's being loaded)
          • the heaviest queries often accept a delay of a few minutes or more.

      Some "What if":
      1) The split fails
      => Retry until it works, as today. The difference is that we serve writes. We need to know (as today) that the region has not recovered if we fail again.
      2) The regionserver fails during the split
      => As 1, and as today.
      3) The regionserver fails after the split but before the state changes to fully available
      => New assignment. More logs to split (the ones already done and the new ones).
      4) The assignment fails
      => Retry until it works, as today.
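
      The client-side behaviour described above could look roughly like the sketch below. This is an illustration only: the exception class, its fields, and the client interface are all hypothetical, not the actual HBase client API.

```python
import time


class RegionRecoveringException(Exception):
    """Hypothetical exception: the region is already assigned to new_server
    but is still replaying logs; the client should retry there after a delay."""
    def __init__(self, region, new_server, retry_after_s):
        super().__init__(f"region {region} is recovering on {new_server}")
        self.new_server = new_server
        self.retry_after_s = retry_after_s


def get_with_recovery(client, region, row, max_attempts=5):
    """Go directly to the server named in the exception instead of
    re-asking the region holding meta, and honour the hinted delay."""
    server = None  # None: let the client resolve the location itself
    for _ in range(max_attempts):
        try:
            return client.get(region, row, server=server)
        except RegionRecoveringException as e:
            server = e.new_server        # skip the meta lookup next time
            time.sleep(e.retry_after_s)  # wait as hinted by the server
    raise TimeoutError(f"region {region} still recovering after {max_attempts} attempts")
```

      This is where the "save network calls, lower the load on meta" benefit comes from: the redirect travels inside the exception, so the client never goes back to meta.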

        Issue Links

          Activity

          Nicolas Liochon created issue -
          Ted Yu made changes -
          Summary changed: "On region server failure, serves writes and timeranged reads during the log split." → "On region server failure, serve writes and timeranged reads during the log split"
          stack added a comment -

          specific exception returned to the client when we cannot serve.

          Who would return this? Not the server that just failed?

          Or is it during recovery? The region will be assigned this new location and meta gets updated with the new location, only the region is not fully online because it's still recovering?

          Or is this when region is moved?

          Priority is given to region server holding the new regions

          What does this mean? What kind of priority?

          I like being able to take writes sooner.

          Nicolas Liochon added a comment -

          Who would return this? Not the server that just failed?

          If we reassign immediately, the client will go to the new regionserver. So the region server will be able to tell it a real status (for example, on reads, we can estimate the recovery time left and the regionserver can say: come back in 20 seconds for this region).

          What does this mean? What kinda of priority?

          Today, the split is performed by any available RS. If we preassign the regions, the split can be done by the regionserver which owns some of the data we're expecting to find in the hlog file...

          stack added a comment -

          Makes sense. Sounds great. How do we know which regionserver to give a log split to when the log has edits for all the regions that were on a regionserver? Are you thinking we could give all regions from the crashed regionserver to a particular regionserver?

          Kannan Muthukkaruppan added a comment -

          There might be a bunch of nitty-gritty details to be ironed out, but being able to take writes nearly all the time would be a very nice win. So a big +1 for exploring this effort. A few things that come to mind:

          • We do want the old edits to come in the correct order of sequence ids (i.e. be considered older than the newer puts that arrive while the region is in recovery mode), correct? So we somehow need to cheaply find the correct sequence id to use for the new puts. It needs to be bigger than the sequence ids of all the edits for that region in the log files. So maybe all that's needed here is to open and recover the latest log file, and scan it to find the last sequence id?
          • Picking a winner among duplicates in two files relies on using the sequence id of the HFile as a tie-break. Therefore, today, compactions always pick a dense subrange of files ordered by sequence ids. That is, if we have HFiles a, b, c, d, e sorted by sequence id, we might compact a,b,c or c,d,e but never, say, a,d,e. With this new scheme, we should take care that we don't violate this property. The old data should correctly be recovered into HFiles with the correct sequence ids, and even if newer data has been flushed before the recovery is complete, we shouldn't compact those newer files with older HFiles, given that some new files are supposed to come in between (after recovery).
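
          The "dense subrange" property can be stated as a small predicate. This is just an illustration of the invariant, not HBase's actual compaction-selection code:

```python
def is_dense_subrange(candidate_seq_ids, store_seq_ids):
    """True if the candidate HFiles form a contiguous run within the
    store's HFiles when both are ordered by sequence id."""
    ordered = sorted(store_seq_ids)
    chosen = sorted(candidate_seq_ids)
    n = len(chosen)
    return any(ordered[i:i + n] == chosen
               for i in range(len(ordered) - n + 1))

# With files a..e carrying sequence ids 1..5: compacting a,b,c (1,2,3)
# or c,d,e (3,4,5) is allowed, but a,d,e (1,4,5) skips b and c and
# would have to be rejected.
```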
          Gregory Chanan added a comment -

          On timeranged reads:

          if the user specified his own timestamps, couldn't the correct value to return be only in the WAL?

          Gregory Chanan made changes -
          Assignee Gregory Chanan [ gchanan ]
          Gregory Chanan added a comment -

          Assigned to myself. I'm definitely up for the serving writes part; I need to think some more about the timeranged reads. May file separate JIRAs.

          Nicolas Liochon added a comment -

          Seems reasonable; there are still some dark areas around timerange. Let's do things smoothly. But I think your comment is right.

          Some various points I had in mind:
          There is another use case mentioned in HBASE-3745: "In some applications, a common access pattern is to frequently scan tables with a time range predicate restricted to a fairly recent time window. For example, you may want to do an incremental aggregation or indexing step only on rows that have changed in the last hour. We do this efficiently by tracking min and max timestamp on an HFile level, so that old HFiles don't have to be read."

          We do want the old edits to come in the correct order of sequence ids

          IMHO yes, we should not relax any point of HBase's consistency.

          So, we somehow need to cheaply find the correct sequence id to use for the new puts. It needs to be bigger than sequence ids for all the edits for that region in the log files. So maybe all that's needed here is to open recover the latest log file, and scan it to find the last sequence id?

          I would like HBase to be resilient to log file issues (no replica, corrupted files, overloaded datanodes, bad luck when choosing the datanode to read from...) by not opening them at all during this process. Would a rough estimate be OK? Counting the number of files/blocks to calculate the maximum possible id?

          Picking a winner among duplicates in two files relies on using sequence id of the HFile as a tie-break. And therefore, today, compactions always pick a dense subrange of files order by sequence ids.

          I wonder if we need major compactions; I was thinking they could be skipped. But we need to be able to manage small compactions for sure. I imagine we can have some critical cases where we stay in the intermediate state for a few days: (weekend + trying to fix the broken hlog on a test cluster + waiting for a non-critical moment to fix the production env)...
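
          The "rough estimate" idea of bounding the sequence id from file sizes alone, without opening any log, could look like the sketch below. All parameter names are hypothetical, and a real bound would have to account for actual HLog entry framing:

```python
def estimate_seq_id_upper_bound(last_flushed_seq_id,
                                total_log_bytes,
                                min_entry_bytes,
                                safety_margin=1.5):
    """Pessimistic upper bound on the last sequence id in the dead
    server's logs, computed from file sizes only (no log is opened):
    assume every byte could belong to a minimal-size edit, then pad."""
    max_possible_edits = total_log_bytes // min_entry_bytes
    return last_flushed_seq_id + int(max_possible_edits * safety_margin) + 1
```

          New puts during recovery would then start above this bound, guaranteeing they sort after anything still sitting in the unread logs.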

          Nicolas Liochon made changes -
          Link This issue is required by HBASE-5843 [ HBASE-5843 ]
          Gregory Chanan made changes -
          Assignee Gregory Chanan [ gchanan ]
          Nicolas Liochon added a comment -

          @Gregory: As you have unassigned the jira, I will have a look in the coming weeks. Have you studied some options in more detail and rejected them?

          Gregory Chanan added a comment -

          @nkeywal: didn't study anything in too much depth.

          For the read part, my thought was to implement a config (in HTableDescriptor?) that would reject user-set timestamps on writes, so we know for sure there can't be any writes in the timestamp range that need to be replayed from the WAL. I suspect there are other optimizations we could do with that information, but haven't thought it through.

          For writes, do you create a new WAL for the new writes that are happening while the log is still replaying? If so, management could be complicated and it might make sense to have support for multiple WALs already before tackling that. If not (you write to the same WAL), would that even work? I guess you would want to avoid replaying the new writes (might be okay if all WAL updates are idempotent, but could be an issue if a lot of writes go in during the replay time).
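
          The table-level flag suggested for the read part could behave roughly as below. The flag itself does not exist in HBase; the sentinel is illustrative (in the real client, a Put without an explicit timestamp carries `HConstants.LATEST_TIMESTAMP` and the server assigns the time):

```python
import time

LATEST_TIMESTAMP = 2**63 - 1  # sentinel: "let the server assign the timestamp"


def assign_timestamp(requested_ts, reject_user_timestamps):
    """If the (hypothetical) table flag is set, refuse client-supplied
    timestamps. Then every cell's timestamp is server-assigned, so a read
    restricted to [recovery_start, now] cannot have matching cells hidden
    in the still-unreplayed WAL, which only holds pre-crash writes."""
    if requested_ts == LATEST_TIMESTAMP:
        return int(time.time() * 1000)  # server-side wall clock, in ms
    if reject_user_timestamps:
        raise ValueError("user-set timestamps are disabled for this table")
    return requested_ts
```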

          Nicolas Liochon made changes -
          Priority changed: Minor [ 4 ] → Major [ 3 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Nicolas Liochon
            • Votes:
              0
              Watchers:
              8
