Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Deletes
    • Labels:

      Description

      While column families can be configured to keep deleted cells (allowing time range queries to still retrieve those cells), deletes are still somewhat unique in that they are irreversible operations. Once a delete has been issued on a cell, the only way to "undelete" it is to rewrite the data with a timestamp newer than the delete.

      The idea here is to add an "undelete" operation, that would make it possible to cancel a previous delete. An undelete operation will be similar to a delete, in that it will be written as a marker ("tombstone" doesn't seem like the right word). The undelete marker, however, will sort prior to a delete marker, canceling the effect of any following delete.

      In the absence of a column family configured with KEEP_DELETED_CELLS, we can't be sure whether a prior delete marker and the affected cells have already been garbage collected. In that case (column family not configured with KEEP_DELETED_CELLS) it may be necessary for the server to reject undelete operations, to avoid creating the appearance of a client contract for undeletes that can't reliably be honored.

      I think there are additional subtleties of the implementation to be worked out, but I'm also interested in a broader discussion of interest in this capability.

        Issue Links

          Activity

          James Taylor added a comment -

          +1 for this new operation as it will help in the implementation of transactions. Would every type of Delete need a corresponding Undelete?

          When KEEP_DELETED_CELLS is on, assume:

          • Put is done at t1
          • Delete is done at t5
          • Undelete is done at t7

          If a client does a read at t6, the Delete would be in effect, while if a client does a read at t8, the Delete would not be in effect. Is that how it'd work?
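The visibility semantics in this scenario can be sketched as a small simulation. This is a hypothetical Python model, not HBase code; the function name and the marker representation are illustrative only:

```python
# Hypothetical model of timestamp-based visibility with delete/undelete
# markers. markers: iterable of (ts, kind), kind in {"delete", "undelete"}.
def visible(read_ts, put_ts, markers):
    """Return True if a put at put_ts is visible to a read at read_ts."""
    if put_ts > read_ts:
        return False
    for d_ts, kind in markers:
        if kind == "delete" and put_ts <= d_ts <= read_ts:
            # The delete shadows the put unless an undelete at or after
            # the delete's timestamp is also visible to this read.
            if not any(k == "undelete" and d_ts <= u_ts <= read_ts
                       for u_ts, k in markers):
                return False
    return True

markers = [(5, "delete"), (7, "undelete")]
visible(6, 1, markers)  # False: the Delete at t5 is still in effect at t6
visible(8, 1, markers)  # True: the Undelete at t7 has canceled the Delete
```

Under this model the answer to the question above would be yes: the read at t6 sees the delete in effect, and the read at t8 does not.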

          Lars Hofhansl added a comment -

          I fear that the implications are tricky.

          1. We have delete bloom filters to avoid seeks to the beginning of the row to check for family delete markers. Those would no longer work; in other words, we'd need bloom filters for the undelete markers.
          2. Will we ever need to undo an undelete?
          3. How does this fit into the discussion about using sequence numbers to order operations?

          Still +1 on doing this if the ts semantics are like the ones I introduced for KEEP_DELETED_CELLS (as James mentions).

          Gary Helmling added a comment -

          Would every type of Delete need a corresponding Undelete?

          Yes, I think it would be necessary to have complete parity between the two, since the goal is to cancel any type of delete operation.

          When KEEP_DELETED_CELLS is on, assume:

          • Put is done at t1
          • Delete is done at t5
          • Undelete is done at t7
            If a client does a read at t6, the Delete would be in effect, while if a client does a read at t8, the Delete would not be in effect. Is that how it'd work?

          Yes, I think this would work exactly as you describe. In the case of supporting timestamp-overriding transactions, though, the Delete and Undelete would occur at the same timestamp (t5), since the Undelete would be rolling back the Delete that was part of the transaction.

          Gary Helmling added a comment -

          we have delete bloom filters to avoid seeks to the beginning on the row to check for family delete markers. Those would no longer work, or in other words we'd need boom filters for the undelete markers.

          Hmm, I wasn't familiar with these, but, yes, sounds like we would unfortunately need another set of bloom filters for the undeletes. There would also need to be some change to the use of the delete bloom filters if operations were ordered by seqid, wouldn't there?

          will we every need so to undo an undelete?

          Yes, I thought of this as well. What undoes the undeletes? In a sense this shifts the irreversible operation to the undelete. I think this is an inherent problem with the approach of sorting operations within the same timestamp by operation type. There is always going to be something that sorts first, which effectively becomes irreversible.

          how does this fit into the discussion about using sequence numbers to order operations?

          I think the discussion to order operations by seqid is simply an alternate approach to some of the same underlying problems. For the issue mentioned above, sorting by seqid is conceptually simpler – there is no need for a separate undelete operation, and undoing a delete is possible by re-issuing a new put for the previous value at the same timestamp as the previous delete. Of course that means that the final outcome depends on the server-observed ordering of the operations. But at the same time, there is no "irreversible" operation.

          For the use case I'm considering (a single client rolling back its previously persisted changes at a given timestamp), ordering by seqid would also work. Since it's a single client rolling back its own operations, the operations can be ordered by the client, so there's no lack of determinism in server-side ordering. Rolling back a delete with seqid ordering would be slightly more complicated, though, since the client would have to perform a read to find the previous value prior to the delete, then issue a new put, instead of simply issuing an undelete with the same parameters as the prior delete.

          Or you could combine both approaches and order operations by seqid, but add an undelete operation as well. The undelete would then sort prior to the delete by virtue of being issued after it. It would still be a no-read operation. And the undelete could be undone by issuing a new delete. Maybe this combination ultimately winds up being best.
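The combined approach described above can be sketched as a small resolution function. This is a hypothetical Python model, not an HBase implementation; the operation encoding and function name are illustrative only:

```python
# Hypothetical model: operations at one (row, col, ts) coordinate are
# ordered by server sequence id, and an undelete issued after a delete
# cancels it without the client performing any read.
def resolve(ops):
    """ops: list of (seqid, kind, value) tuples at one coordinate.
    Returns the effective value, or None if the cell is deleted."""
    value, deleted = None, False
    for seqid, kind, val in sorted(ops):   # apply in seqid order
        if kind == "put":
            value, deleted = val, False
        elif kind == "delete":
            deleted = True
        elif kind == "undelete":
            deleted = False                # cancels the preceding delete

    return None if deleted else value

ops = [(1, "put", "v1"), (2, "delete", None), (3, "undelete", None)]
resolve(ops)  # → "v1": the undelete restores the shadowed put
```

Note that in this model the undelete is still a no-read operation, and issuing a further delete (with a higher seqid) would undo the undelete.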

          Thanks for the comments. I'm open to however we can most efficiently solve this with the least amount of added complexity.

          stack added a comment -

          There is always going to be something that sorts first, which effectively becomes irreversible.

          Our sorting on type is a mistake (as Sergey Shelukhin has noted a few times). Would fixing this help here?

          James Taylor added a comment -

          the Delete and Undelete would occur at the same timestamp (t5), since the Undelete would be rolling back the Delete that was part of the transaction

          I think this is fine. I can't think of a use case where you'd need to see the Delete at a given timestamp.

          Jeffrey Zhong added a comment -

          since the client would have to perform a read to find the previous value prior to the delete, then issue a new put, instead of simply issuing an undelete with the same parameters as the prior delete.

          I think there is no difference in terms of the read, because you need to read the previous "delete" put as well to construct the "undelete" put, while the other approach needs to read and write more data.

          Lars Hofhansl added a comment -

          Can we turn this around and implement undelete by redoing the corresponding put and handling the ambiguities with seq ids?
          I've been arguing against this, since it would break the idempotency of HBase actions, but in light of this I might relent.
          I guess it would require knowing what exactly the delete affected, which may or may not be possible.

          Yet another way to look at it is: why undelete at all? Ignoring failed transactions needs to be implemented anyway, so instead of undoing, one could just continue to ignore the transaction.

          James Taylor added a comment -

          I think there is no difference in terms of the read because you need to read the previous "delete" put as well to construct the "undelete" put while the other way need to read&write more data.

          Jeffrey Zhong, the Undelete is issued by the client, the same client that issued the Delete. It's done to undo the effects of a failed transaction (i.e. like a compensating transaction). You don't need to look up the prior state of the row. You just need to issue an Undelete with the same rowkey and timestamp as the Delete.

          Yet another way to look at it is: why undelete at all? Ignoring failed transactions needs to be implemented anyway, so instead of undo one could just continue to ignore the transaction.

          Lars Hofhansl, I agree, this is a fallback. It puts more pressure on being able to "prune" this invalid list. If that can be figured out and done efficiently, then the need for an Undelete decreases. Maybe that should be discussed in a separate JIRA?

          Assuming that an Undelete may only be done for a Delete (i.e. an Undelete cannot be undeleted), is it feasible then? Besides requiring another set of Bloom filters, are there more gotchas? Is it too heavy a burden to have another set of Bloom filters? Any other options?

          Gary Helmling added a comment -

          Yet another way to look at it is: why undelete at all? Ignoring failed transactions needs to be implemented anyway, so instead of undo one could just continue to ignore the transaction.

          Yes, for the transaction case specifically, that is a fair approach to take. We can (and do) ignore writes from the failed transaction, but we also need a way to clear transaction IDs from the invalid set so that it doesn't continue to grow. This is difficult, but it does need to be solved, regardless of whether or not we have undeletes. If we can do this efficiently enough, then there's really no further need to attempt to roll back failed transactions at all.

          However, from the HBase perspective, it has been a long-standing issue (or constraint at least) that deletes cannot be undone. Just ordering operations by seqid so that you can redo any deleted puts is not a symmetric solution, for the reasons described above (in the case of a single row delete marker, you have to rewrite all of the columns in the row).

          Lars Hofhansl added a comment -

          Let's revive this. What I have been proposing in a meeting we had was an "UNDO" cell type. Undo would sort before deletes and simply undo the next cell that immediately follows in sort order. Another UNDO with the same TS would undo the UNDO, and so on.

          I do not think we want to undo Puts; we have deletes for that.
          So we would need two new Cell types: undo delete (for version and column delete markers) and undo delete family (for family delete markers).

          I think my comments on family delete bloom filters above are misguided. At worst we'd see a performance degradation for seeking to the beginning of the row when that is not necessary when the delete is undone.

          The only tricky part here is that all seeking before a row or column, or to the end of a row or column, needs to seek correctly with UNDO cell types.
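The UNDO cell type's "cancel the next cell in sort order" rule can be sketched as a small simulation. This is a hypothetical Python model; the type names and sort codes are illustrative, not HBase's actual ones:

```python
# Hypothetical model: within one (row, col, ts) coordinate, cells sort
# by a type code with UNDO first, and each UNDO cancels the single cell
# that immediately follows it in sort order.
SORT_ORDER = {"undo": 0, "delete_family": 1, "delete": 2, "put": 3}

def apply_undos(cells):
    """cells: list of cell type names at one coordinate.
    Returns the cells remaining after each UNDO removes itself
    and its immediate successor in sort order."""
    ordered = sorted(cells, key=SORT_ORDER.__getitem__)
    out, i = [], 0
    while i < len(ordered):
        if ordered[i] == "undo" and i + 1 < len(ordered):
            i += 2          # UNDO cancels the next cell in sort order
        else:
            out.append(ordered[i])
            i += 1
    return out

apply_undos(["delete", "undo", "put"])          # → ["put"]: the delete is undone
apply_undos(["undo", "undo", "delete", "put"])  # → ["delete", "put"]: the
                                                #   second UNDO is itself undone
```

In this model a second UNDO at the same coordinate undoes the first, matching the "UNDO undoes the UNDO" semantics described above.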

          Lars Hofhansl added a comment -

          The other tricky part is to avoid removing identical cells (in terms of row, col, ts) if their type is UNDO.
          Also, do we need REDO? What if we "undid" a delete and then want to delete at the exact coordinates (row, col, ts) again? That delete would continue to be shadowed until the UNDO marker is gone, or until we undo the UNDO.

          Would need to carefully think through the compaction implications. E.g. when can I safely compact away the UNDO markers? We need to be able to guarantee that every affected Cell is also deleted. That is not true for all compactions (not even major ones).

          James Taylor added a comment -

          If it simplifies things, for Tephra transactions in Phoenix, we only need the ability to 1) undo a column delete marker and 2) undo a column family delete marker. We don't need to be able to undo an undo or to redo.

          Lars Hofhansl added a comment - edited

          I will think about this in earnest during the next few days.

          Related: a scan coprocessor currently cannot ignore delete markers (without reimplementing the entire delete marker logic). I am not sure about a good interface for this (it seems like we need to pass down a list of excluded time ranges). Gary Helmling, any ideas?

          James Taylor added a comment -

          Lars Hofhansl - that's a good point. What about even ignoring a Put? Does the Tephra TransactionVisibilityFilter[1] handle ignoring a Put such that an earlier Put would be seen? Do all scans have to be raw scans, Gary Helmling?

          [1] https://github.com/caskdata/tephra/blob/develop/tephra-hbase-compat-0.98/src/main/java/co/cask/tephra/hbase98/coprocessor/TransactionVisibilityFilter.java

          Gary Helmling added a comment -

          Lars Hofhansl some ability to apply a filter to the delete marker would be ideal (at least in this case having a filter indicate a SKIP), but how that changes ScanQueryMatcher is of course tricky. Need to look at this a bit and think about it.

          James Taylor yes we ignore puts from in progress or deleted transactions in the filter. This happens with normal (not raw) scans currently.


            People

            • Assignee: Unassigned
            • Reporter: Gary Helmling
            • Votes: 1
            • Watchers: 13
