Type: New Feature
Affects Version/s: None
Fix Version/s: None
We want to track the source of mutations (especially Deletes) issued via Phoenix. We have multiple use cases that issue deletes, namely: customers deleting their data, internal processes such as GDPR compliance, and Phoenix TTL MR jobs. For every mutation, we want to track the source of the operation that initiated it.
At my day job, we have a custom Backup/Restore tool.
For example: during a GDPR compliance cleanup (say at time t0), we mistakenly deleted some customer data, and it is possible that the customer also deleted some data on their side (at time t1). To recover the mistakenly deleted data, we restore from the backup taken at time (t0 - 1). By doing this, we also recover the data that the customer intentionally deleted.
We need a way for the Restore tool to recover data selectively.
Let me explain with an example.
Let's say two different systems (call them accidental-delete and customer-delete) delete data from the same table at almost the same time. As the names suggest, customer-delete issues intentional deletes and accidental-delete issues deletes made by mistake. We have a restore tool that restores all the data between a start time and an end time (start-ts and end-ts). We want to restore the data deleted by the accidental-delete system, but not the data deleted by the customer-delete system. By adding a cell tag to the delete markers, the restore tool can tell the two apart and skip the data deleted by the customer-delete system.
In my proposal, I want to add cell tags to the tombstone delete markers so that the tags are present in the backups. In case we have to restore data, we can restore specific rows depending on the tag present in the cell.
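To make the intended restore behavior concrete, here is a minimal sketch of the selection step. It is not based on the HBase or Phoenix APIs: `DeleteMarker`, `rowsToRestore`, and the string source tags are hypothetical stand-ins for whatever the restore tool would read from the backed-up delete markers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SelectiveRestore {

    /** Simplified stand-in for a delete marker read from a backup (not an HBase type). */
    public static final class DeleteMarker {
        final String rowKey;
        final long timestamp;
        final String sourceTag; // null when the marker carries no source tag

        public DeleteMarker(String rowKey, long timestamp, String sourceTag) {
            this.rowKey = rowKey;
            this.timestamp = timestamp;
            this.sourceTag = sourceTag;
        }
    }

    /**
     * Returns the row keys whose deletes should be undone: markers with a
     * timestamp inside [startTs, endTs] whose source tag is one of the
     * sources to revert. Untagged markers are left alone.
     */
    public static List<String> rowsToRestore(List<DeleteMarker> markers,
                                             long startTs, long endTs,
                                             Set<String> sourcesToRevert) {
        List<String> rows = new ArrayList<>();
        for (DeleteMarker m : markers) {
            boolean inWindow = m.timestamp >= startTs && m.timestamp <= endTs;
            boolean revert = m.sourceTag != null && sourcesToRevert.contains(m.sourceTag);
            if (inWindow && revert) {
                rows.add(m.rowKey);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<DeleteMarker> markers = List.of(
            new DeleteMarker("row-1", 100L, "accidental-delete"),
            new DeleteMarker("row-2", 101L, "customer-delete"),
            new DeleteMarker("row-3", 500L, "accidental-delete"));
        // Only row-1 is both inside the window and tagged accidental-delete.
        System.out.println(rowsToRestore(markers, 90L, 200L, Set.of("accidental-delete")));
        // prints [row-1]
    }
}
```

Without a tag on the delete marker, the only selection criterion available is the timestamp window, which is why row-2 would otherwise be restored as well.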
We want to leverage the Cell Tag feature on Delete mutations to store this metadata. Currently the Delete object doesn't support tags.
We also want a solution that can easily be extended to other mutations, such as Put.
Some of the use cases I can think of where we can use tags for Put mutations are:
1. Identifying whether the put came from the primary cluster or the replicated cluster, so that we can make the backup tool smarter and avoid backing up the same put twice, once in the source cluster and once in the replicated cluster.
2. We have a multi-tenancy concept in Phoenix. We want to track whether the upsert (a put operation in HBase) came from a Global or a Tenant connection.
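As a rough illustration of the first use case, the sketch below encodes a source label into a tag-like byte array and lets a backup tool skip puts that arrived via replication. Everything here is an assumption made for illustration: the `SourceTag` class, the type code `100`, the one-byte-type-plus-value layout, and the `"primary"`/`"replicated"` labels are hypothetical and are not the HBase tag wire format or an existing Phoenix API.

```java
import java.nio.charset.StandardCharsets;

public class SourceTag {

    // Hypothetical type code; a real one must not collide with codes HBase uses internally.
    static final byte SOURCE_TAG_TYPE = (byte) 100;

    /** Encodes a source label as [type byte][UTF-8 value bytes]. */
    public static byte[] encode(String source) {
        byte[] value = source.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[1 + value.length];
        out[0] = SOURCE_TAG_TYPE;
        System.arraycopy(value, 0, out, 1, value.length);
        return out;
    }

    /** Returns the source label, or null if the bytes are not a source tag. */
    public static String decode(byte[] tag) {
        if (tag == null || tag.length < 1 || tag[0] != SOURCE_TAG_TYPE) {
            return null;
        }
        return new String(tag, 1, tag.length - 1, StandardCharsets.UTF_8);
    }

    /** A put is backed up unless it is explicitly tagged as replicated. */
    public static boolean shouldBackUp(byte[] tag) {
        return !"replicated".equals(decode(tag));
    }
}
```

With this sketch, `SourceTag.shouldBackUp(SourceTag.encode("replicated"))` is false, while untagged puts and puts tagged `"primary"` are still backed up. The Global versus Tenant connection in the second use case could be recorded the same way with a different label.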