HBase / HBASE-27649

WALPlayer does not properly dedupe overridden cell versions

Details

    Description

      If you issue two Puts to the same cell with different values but the same timestamp, the later one wins. This is because the memstore uses the sequenceId as a tie-breaker for duplicate timestamps. When the data is flushed to a StoreFile, the duplicates are resolved, and eventually the sequenceId is dropped.
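
      To make the tie-break concrete, here is a minimal sketch of the ordering described above. It is a plain-Java model, not HBase code: SimpleCell, NEWEST_FIRST and winner are hypothetical names, and the real comparator also looks at row, family, qualifier and type.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SeqIdTieBreakSketch {
    /** Hypothetical stand-in for one version of a single cell (same row/column assumed). */
    static final class SimpleCell {
        final long timestamp;
        final long sequenceId;
        final String value;

        SimpleCell(long timestamp, long sequenceId, String value) {
            this.timestamp = timestamp;
            this.sequenceId = sequenceId;
            this.value = value;
        }
    }

    /** Newer timestamp sorts first; for equal timestamps the higher sequenceId sorts first. */
    static final Comparator<SimpleCell> NEWEST_FIRST = (a, b) -> {
        int byTimestamp = Long.compare(b.timestamp, a.timestamp);
        if (byTimestamp != 0) {
            return byTimestamp;
        }
        return Long.compare(b.sequenceId, a.sequenceId);
    };

    /** Returns the version that should survive deduplication, or null for an empty list. */
    static SimpleCell winner(List<SimpleCell> versions) {
        SimpleCell best = null;
        for (SimpleCell c : versions) {
            if (best == null || NEWEST_FIRST.compare(c, best) < 0) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two Puts with the same timestamp; the second Put has the higher sequenceId.
        List<SimpleCell> versions = Arrays.asList(
            new SimpleCell(100L, 2L, "second"),
            new SimpleCell(100L, 1L, "first"));
        System.out.println(winner(versions).value);
    }
}
```

      Running main prints "second" regardless of the order in which the two versions are listed. Without the sequenceId comparison, the two versions would compare as equal and the result would depend on input order.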

      Those two Puts would both have been written to the WAL. If you use WALPlayer to replay those WALs (as anyone could, and as backup/restore does for incremental restores), the replay does not apply the same tie-break. It's unclear which of the duplicate cells you will get, when you should always get the latest.

      Our WAL encoder doesn't include the sequenceIds in the WALEntry cells. Instead the WALKey has a getSequenceId() which contains the same sequenceId the cells used to have. In WALCellMapper we don't pass those along, nor in CellSerialization, and thus CellSortReducer is not able to use the sequenceId to dedupe.

      I think we just need to translate the WALKey.getSequenceId() into the output Cells in WALCellMapper, then update CellSerialization to include them as well. At that point CellSortReducer should work as expected, and we should get the correct cell values in the hfiles.
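
      A rough sketch of the two steps proposed above, again as a plain-Java model rather than a patch: stamp copies the entry-level sequenceId onto each cell (the mapper step), and serialize/deserialize carry it through the shuffle (the serialization step). All names here (SimpleCell, stamp, serialize, deserialize) and the byte layout are assumptions for illustration; the real WALCellMapper and CellSerialization classes have different signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SeqIdPropagationSketch {
    /** Hypothetical stand-in for a cell; in this sketch sequenceId 0 means "not set". */
    static final class SimpleCell {
        final String row;
        final long timestamp;
        final String value;
        final long sequenceId;

        SimpleCell(String row, long timestamp, String value, long sequenceId) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
            this.sequenceId = sequenceId;
        }
    }

    /** Mapper step: copy the sequenceId taken from the WAL key onto every cell of the entry. */
    static List<SimpleCell> stamp(long walKeySequenceId, List<SimpleCell> entryCells) {
        List<SimpleCell> stamped = new ArrayList<>();
        for (SimpleCell c : entryCells) {
            stamped.add(new SimpleCell(c.row, c.timestamp, c.value, walKeySequenceId));
        }
        return stamped;
    }

    /** Serialization step: write the sequenceId next to the other fields so it is not lost. */
    static byte[] serialize(SimpleCell c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(c.row);
            out.writeLong(c.timestamp);
            out.writeUTF(c.value);
            out.writeLong(c.sequenceId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order they were written, including the sequenceId. */
    static SimpleCell deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String row = in.readUTF();
            long timestamp = in.readLong();
            String value = in.readUTF();
            long sequenceId = in.readLong();
            return new SimpleCell(row, timestamp, value, sequenceId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      With both steps in place, the reducer side sees cells that still carry their sequenceId and can break timestamp ties the same way the memstore does.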

      One open question is whether we should clear out the sequenceId before flushing to the hfile. I don't think so?

          People

            bbeaudreault Bryan Beaudreault