HBase
  1. HBase
  2. HBASE-10241

implement mvcc-consistent scanners (across recovery)

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.99.0
    • Fix Version/s: None
    • Component/s: HFile, regionserver, Scanners
    • Labels:
      None

      Description

      Scanners currently use mvcc for consistency. However, mvcc is lost on server restart, or even a region move. This JIRA is to enable the scanners to transfer mvcc (or seqId, or some other number, see HBASE-8763) between servers. First, client scanner needs to get and store the readpoint. Second, mvcc needs to be preserved in WAL. Third, the mvcc needs to be stored in store files per KV and discarded when not needed.

      1. Consistent scanners.pdf
        75 kB
        Sergey Shelukhin

        Issue Links

          Activity

          Hide
          Sergey Shelukhin added a comment -
          Show
          Sergey Shelukhin added a comment - HBASE-10288
          Hide
          Sergey Shelukhin added a comment - - edited

          I will make a separate unrelated JIRA for that

          Show
          Sergey Shelukhin added a comment - - edited I will make a separate unrelated JIRA for that
          Hide
          Sergey Shelukhin added a comment -

          That would be PITA from backward compat perspective - we'd both add a field, requiring HFileFormat v4 (don't really want tag overhead for this), and presumably (tags or not) remove the old magic mechanism

          Show
          Sergey Shelukhin added a comment - That would be PITA from backward compat perspective - we'd both add a field, requiring HFileFormat v4 (don't really want tag overhead for this), and presumably (tags or not) remove the old magic mechanism
          Hide
          Enis Soztutar added a comment -

          Mvcc can already be serialized with KV in HFile. Comment in KeyValue.java is a lie

          Sorry, I am not talking about mvcc serialization in hfile. I was talking about making mvcc number a part of the byte[] in KV.

          Show
          Enis Soztutar added a comment - Mvcc can already be serialized with KV in HFile. Comment in KeyValue.java is a lie Sorry, I am not talking about mvcc serialization in hfile. I was talking about making mvcc number a part of the byte[] in KV.
          Hide
          Sergey Shelukhin added a comment -

          There's another issue, HBASE-10227 for the WAL stuff.
          Mvcc can already be serialized with KV in HFile. Comment in KeyValue.java is a lie

          Show
          Sergey Shelukhin added a comment - There's another issue, HBASE-10227 for the WAL stuff. Mvcc can already be serialized with KV in HFile. Comment in KeyValue.java is a lie
          Hide
          Enis Soztutar added a comment -

          but the plan was that I will do 1 and 3, and then take 2 if the other JIRA that does 2 is not done by then.

          Sounds good. I thought HBASE-8721 is won't fix.

          HBASE-8763 does not need to block this, it's probably bigger than this entire JIRA

          Indeed. But it will be a shame if we add mvcc's to WAL only to remove them again after HBASE-8763.

          BTW, I think we also have to handle mvcc / seqId as a part of the serialization in the KV byte array. Do we have any open issues for that?

          Show
          Enis Soztutar added a comment - but the plan was that I will do 1 and 3, and then take 2 if the other JIRA that does 2 is not done by then. Sounds good. I thought HBASE-8721 is won't fix. HBASE-8763 does not need to block this, it's probably bigger than this entire JIRA Indeed. But it will be a shame if we add mvcc's to WAL only to remove them again after HBASE-8763 . BTW, I think we also have to handle mvcc / seqId as a part of the serialization in the KV byte array. Do we have any open issues for that?
          Hide
          Sergey Shelukhin added a comment -

          Thanks for HBASE-9797 reference, yes, it is good for that!

          Show
          Sergey Shelukhin added a comment - Thanks for HBASE-9797 reference, yes, it is good for that!
          Hide
          Sergey Shelukhin added a comment -

          Subtask 2 is trivial; the reason it is not done is that it is being done elsewhere (see how it's resolved as dup), so since it's not blocking us here and now it doesn't make sense to do double work. I am not working on this jira right now (will get back to it hopefully and there's a patch out in client subtask), but the plan was that I will do 1 and 3, and then take 2 if the other JIRA that does 2 is not done by then.
          HBASE-8763 does not need to block this, it's probably bigger than this entire JIRA. If it's done before this due to delays, good, if not, also good

          Show
          Sergey Shelukhin added a comment - Subtask 2 is trivial; the reason it is not done is that it is being done elsewhere (see how it's resolved as dup), so since it's not blocking us here and now it doesn't make sense to do double work. I am not working on this jira right now (will get back to it hopefully and there's a patch out in client subtask), but the plan was that I will do 1 and 3, and then take 2 if the other JIRA that does 2 is not done by then. HBASE-8763 does not need to block this, it's probably bigger than this entire JIRA. If it's done before this due to delays, good, if not, also good
          Hide
          Enis Soztutar added a comment -

          We need to fix this for a couple of different reasons:

          • Fixing scanner consistency with multi-row transactions (see HBASE-9797)
          • Adding cell-based scanners, and streaming scans
          • Adding single-row scanners.
          • Consistent scanners with region replicas in case replicas are mostly up to date (HBASE-10070)

          What is the plan here? I think we should do subtasks 1 and 3 regardless of HBASE-8763. But it seems that if we do HBASE-8763 first, it will be much cleaner and we won't need subtask 2 at all.

          Show
          Enis Soztutar added a comment - We need to fix this for a couple of different reasons: Fixing scanner consistency with multi-row transactions (see HBASE-9797 ) Adding cell-based scanners, and streaming scans Adding single-row scanners. Consistent scanners with region replicas in case replicas are mostly up to date ( HBASE-10070 ) What is the plan here? I think we should do subtasks 1 and 3 regardless of HBASE-8763 . But it seems that if we do HBASE-8763 first, it will be much cleaner and we won't need subtask 2 at all.
          Hide
          Sergey Shelukhin added a comment -

          final client-side patch (that works across region re-open, split and merge as long as you trick existing server into preserving MVCC) is attached to HBASE-10242

          Show
          Sergey Shelukhin added a comment - final client-side patch (that works across region re-open, split and merge as long as you trick existing server into preserving MVCC) is attached to HBASE-10242
          Hide
          Honghua Feng added a comment -

          I also encountered this issue when implementing JIRA-8721, the fix for "deletes can mask puts that happen after the delete", and already persisted mvcc in WAL, and then they can be used to recover region's correct mvcc during re-opening. Anyone who has interest can refer it

          Seems setting mvcc (per hfile) to zero for (minor, arguably) performance benefit can't offset the correctness penalty it brings. Persisting mvcc and survive them across regionservers is a matter of semantic correctness, seems most related issues can be resolved by making this correct, combining mvcc and seqid is not as critical as this correcting.

          Show
          Honghua Feng added a comment - I also encountered this issue when implementing JIRA-8721, the fix for "deletes can mask puts that happen after the delete", and already persisted mvcc in WAL, and then they can be used to recover region's correct mvcc during re-opening. Anyone who has interest can refer it Seems setting mvcc (per hfile) to zero for (minor, arguably) performance benefit can't offset the correctness penalty it brings. Persisting mvcc and survive them across regionservers is a matter of semantic correctness, seems most related issues can be resolved by making this correct, combining mvcc and seqid is not as critical as this correcting.
          Hide
          Sergey Shelukhin added a comment -

          WIP patch attached to child jira

          Show
          Sergey Shelukhin added a comment - WIP patch attached to child jira
          Hide
          Sergey Shelukhin added a comment -

          Region-level inconsistency, essentially.

          Show
          Sergey Shelukhin added a comment - Region-level inconsistency, essentially.
          Hide
          Lars Hofhansl added a comment -

          Nice writeup. It would be better even if it would state the failure scenario we're guarding against.

          I think we have to careful not to add any "half guarantees". Currently HBase does MVCC based SI for rows (and serializable views for increment/append). If I understand this correctly, this is not needed for row consistency, right?
          While recovery is in progress we cannot scan, so we'd never see partial rows, right?

          Then what exactly are we guarding against? (As usually, I might just miss an important point )

          Show
          Lars Hofhansl added a comment - Nice writeup. It would be better even if it would state the failure scenario we're guarding against. I think we have to careful not to add any "half guarantees". Currently HBase does MVCC based SI for rows (and serializable views for increment/append). If I understand this correctly, this is not needed for row consistency, right? While recovery is in progress we cannot scan, so we'd never see partial rows, right? Then what exactly are we guarding against? (As usually, I might just miss an important point )
          Hide
          Sergey Shelukhin added a comment -

          On mvcc giving consistent view on region, that is unnecessary, right – when would we ever care about a consistent view across a region rather than just across a row (other than the fact that row boundaries are only known after the fact, after you have passed them out)

          It can actually be pretty important... if recovery takes a while and scanner bounces the data read can be several minutes apart. For certain use cases it's much better to have consistent data for close rows (esp. if some sharded data is stored). Also, if secondary reads are implemented the divergence between scanners can be even greater, so the negative effects of scanner "jumping" will be even more visible. Then, as suggested above, by querying mvcc from all requisite regions before scanner runs we can make it even more reasonable.
          Then it becomes as close as you can get to consistent view of the data without implementing something like Percolator, with external timestamps. Which is pretty neat

          Show
          Sergey Shelukhin added a comment - On mvcc giving consistent view on region, that is unnecessary, right – when would we ever care about a consistent view across a region rather than just across a row (other than the fact that row boundaries are only known after the fact, after you have passed them out) It can actually be pretty important... if recovery takes a while and scanner bounces the data read can be several minutes apart. For certain use cases it's much better to have consistent data for close rows (esp. if some sharded data is stored). Also, if secondary reads are implemented the divergence between scanners can be even greater, so the negative effects of scanner "jumping" will be even more visible. Then, as suggested above, by querying mvcc from all requisite regions before scanner runs we can make it even more reasonable. Then it becomes as close as you can get to consistent view of the data without implementing something like Percolator, with external timestamps. Which is pretty neat
          Hide
          stack added a comment -

          +1 on it being on by default and config is only to turn it off if problematic.

          On mvcc giving consistent view on region, that is unnecessary, right – when would we ever care about a consistent view across a region rather than just across a row (other than the fact that row boundaries are only known after the fact, after you have passed them out) – so if that the case, we should make do w/ the lesser scope if we can.

          When I say cost, I mean the amount of extra checks and work the server will have to do to enforce this consistency (the fancy dancing necessary enforcing the period during which a scanner may come in post recover of a server – will we suspend compactions during this time or will compactions have a special flag which says do not drop mvcc .... so no full compactions during this period – and so on).

          Show
          stack added a comment - +1 on it being on by default and config is only to turn it off if problematic. On mvcc giving consistent view on region, that is unnecessary, right – when would we ever care about a consistent view across a region rather than just across a row (other than the fact that row boundaries are only known after the fact, after you have passed them out) – so if that the case, we should make do w/ the lesser scope if we can. When I say cost, I mean the amount of extra checks and work the server will have to do to enforce this consistency (the fancy dancing necessary enforcing the period during which a scanner may come in post recover of a server – will we suspend compactions during this time or will compactions have a special flag which says do not drop mvcc .... so no full compactions during this period – and so on).
          Hide
          Sergey Shelukhin added a comment -

          sequenceid is not scoped to the region. Ergo mvcc should be too?

          That is not necessary, strictly speaking.

          IMO, just say no to more configs. When would someone want 'inconsistent scanners'? (In my experience, users do not play with configs in 98.1745% of cases – smile).

          That is ok, but in 1.8265% cases when you do need config you normally really need it
          It can be on by default, but at least for the first version the new functionality should be easy to disable if needed.

          ....unless mvcc and seqId are merged)

          Would this issue be better if above work is done first?

          Not much. After examining stuff I'd say it almost doesn't matter. Except maybe some more work adding mvcc to WAL, but that's not much.

          ...Recovery will have to make use of mvcc when replaying the edit to new server.

          This work is ongoing over in @jeffrey zhong replay wal effort?

          It will have to go into both that and old replay I guess, similar to nonces. There's also HBASE-10227, which is related.

          mvcc is about giving you a consistent view on a row only. This work is to deal with the case where you have a wide row and you have already passed the client the first half of a row, a crash happens, and you need to return to the client the second half of the row?

          Actually, mvcc currently gives you a consistent view of the region, in theory (unless there's some glitch in code). All updates in the region are mvcc-ordered.

          A section in doc. on implications of not having this change fixed and then the 'cost' of this fix going in would help.

          Will add. What do you mean by cost?

          Show
          Sergey Shelukhin added a comment - sequenceid is not scoped to the region. Ergo mvcc should be too? That is not necessary, strictly speaking. IMO, just say no to more configs. When would someone want 'inconsistent scanners'? (In my experience, users do not play with configs in 98.1745% of cases – smile). That is ok, but in 1.8265% cases when you do need config you normally really need it It can be on by default, but at least for the first version the new functionality should be easy to disable if needed. ....unless mvcc and seqId are merged) Would this issue be better if above work is done first? Not much. After examining stuff I'd say it almost doesn't matter. Except maybe some more work adding mvcc to WAL, but that's not much. ...Recovery will have to make use of mvcc when replaying the edit to new server. This work is ongoing over in @jeffrey zhong replay wal effort? It will have to go into both that and old replay I guess, similar to nonces. There's also HBASE-10227 , which is related. mvcc is about giving you a consistent view on a row only. This work is to deal with the case where you have a wide row and you have already passed the client the first half of a row, a crash happens, and you need to return to the client the second half of the row? Actually, mvcc currently gives you a consistent view of the region, in theory (unless there's some glitch in code). All updates in the region are mvcc-ordered. A section in doc. on implications of not having this change fixed and then the 'cost' of this fix going in would help. Will add. What do you mean by cost?
          Hide
          stack added a comment -

          Sergey Shelukhin Nice writeup.

          sequenceid is not scoped to the region. Ergo mvcc should be too?

          Client will have a configurable setting to enable consistent scanners.

          IMO, just say no to more configs. When would someone want 'inconsistent scanners'? (In my experience, users do not play with configs in 98.1745% of cases – smile).

          ....unless mvcc and seqId are merged)

          Would this issue be better if above work is done first?

          ...Recovery will have to make use of mvcc when replaying the edit to new server.

          This work is ongoing over in @jeffrey zhong replay wal effort?

          mvcc is about giving you a consistent view on a row only. This work is to deal with the case where you have a wide row and you have already passed the client the first half of a row, a crash happens, and you need to return to the client the second half of the row?

          A section in doc. on implications of not having this change fixed and then the 'cost' of this fix going in would help.

          Thanks for working on this tough one Sergey Shelukhin

          Lars Hofhansl One for you to peruse boss...

          Show
          stack added a comment - Sergey Shelukhin Nice writeup. sequenceid is not scoped to the region. Ergo mvcc should be too? Client will have a configurable setting to enable consistent scanners. IMO, just say no to more configs. When would someone want 'inconsistent scanners'? (In my experience, users do not play with configs in 98.1745% of cases – smile). ....unless mvcc and seqId are merged) Would this issue be better if above work is done first? ...Recovery will have to make use of mvcc when replaying the edit to new server. This work is ongoing over in @jeffrey zhong replay wal effort? mvcc is about giving you a consistent view on a row only. This work is to deal with the case where you have a wide row and you have already passed the client the first half of a row, a crash happens, and you need to return to the client the second half of the row? A section in doc. on implications of not having this change fixed and then the 'cost' of this fix going in would help. Thanks for working on this tough one Sergey Shelukhin Lars Hofhansl One for you to peruse boss...
          Hide
          Sergey Shelukhin added a comment -

          one-pager doesn't cover merges and splits. Looks like mvccs on client will have to be tracked per range, rather than region.
          For even more consistent scanner, mvccs might even optionally be pre-fetched from all regions, but that is for later

          Show
          Sergey Shelukhin added a comment - one-pager doesn't cover merges and splits. Looks like mvccs on client will have to be tracked per range, rather than region. For even more consistent scanner, mvccs might even optionally be pre-fetched from all regions, but that is for later
          Hide
          Sergey Shelukhin added a comment -

          One-pager

          Show
          Sergey Shelukhin added a comment - One-pager

            People

            • Assignee:
              Sergey Shelukhin
              Reporter:
              Sergey Shelukhin
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:

                Development