[HDDS-9146] Potential data loss with HSync due to deletedTable entry having the same block as keyTable entry's - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: None
Labels:
- pull-request-available

Description

When hsync() is called on a key stream and is then followed by close(), two OMKeyCommitRequests are triggered: the first with isHSync = true and the second with isHSync = false. After this sequence, deletedTable can contain an entry with exactly the same block conID (container ID) and locID (local ID) as the committed key in keyTable. This can cause OM's KeyDeletingService to call SCM to remove the committed block by mistake.

The catch is that actual data loss won't happen until the container is closed; only then does block deletion actually take place on the DNs (datanodes). Correct me if I'm wrong, erose.

Repro integration test branch (based on erose's integration test, which was in turn based on my initial draft):

https://github.com/smengcl/hadoop-ozone/tree/HDDS-9146-repro

To reproduce, run the integration test TestMiniOzoneCluster#testKeyRenameDirDelete.

Test log below. Note that the entries in keyTable and deletedTable share the same block, conID: 1 and locID: 111677748019200001 (only bcsId differs, 2 vs. 0):

2023-08-09 14:31:54,859 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(159)) - keyTable:     ----- START -----
2023-08-09 14:31:54,860 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(168)) - keyTable:     key = /testozonevol/testozonebucket/inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001, val = OmKeyInfo{volumeName='testozonevol', bucketName='testozonebucket', keyName='inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001', dataSize=11, keyLocationVersions=[OmKeyLocationInfoGroup{version=0, locationVersionMap={0=[{blockID={conID: 1 locID: 111677748019200001 bcsId: 2}, length=11, offset=0, token=null, pipeline=null, createVersion=0, partNumber=0}]}, isMultipartKey=false}], creationTime=1691616714661, modificationTime=1691616714848, replicationConfig=RATIS/THREE, encInfo=null, fileChecksum=null, isFile=true, fileName='part-m-00001'}
2023-08-09 14:31:54,860 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(171)) - keyTable:     -----  END  -----
2023-08-09 14:31:54,860 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(173)) - deletedTable: ----- START -----
2023-08-09 14:31:54,861 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(181)) - deletedTable: key = /testozonevol/testozonebucket/inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001/-9223372036854774528, val = RepeatedOmKeyInfo{omKeyInfoList=[OmKeyInfo{volumeName='testozonevol', bucketName='testozonebucket', keyName='inputTera/_temporary/1/_temporary/attempt_1691047336995_0006_m_000001_0/part-m-00001', dataSize=11, keyLocationVersions=[OmKeyLocationInfoGroup{version=0, locationVersionMap={0=[{blockID={conID: 1 locID: 111677748019200001 bcsId: 0}, length=11, offset=0, token=null, pipeline=null, createVersion=0, partNumber=0}]}, isMultipartKey=false}], creationTime=1691616714661, modificationTime=1691616714834, replicationConfig=RATIS/THREE, encInfo=null, fileChecksum=null, isFile=true, fileName='part-m-00001'}]}
2023-08-09 14:31:54,861 [main] WARN  ozone.TestMiniOzoneCluster (TestMiniOzoneCluster.java:testKeyRenameDirDelete(184)) - deletedTable: -----  END  -----
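The overlap in the log above can also be checked mechanically. The following is a minimal sketch, not part of the repro branch: it assumes the table dumps are available as plain strings in the `conID: <n> locID: <n>` format printed above, and the class and method names (`SharedBlockCheck`, `blockIds`, `sharedBlocks`) are hypothetical. It deliberately ignores bcsId, since the two entries differ only in that field.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedBlockCheck {
    // Matches the "conID: <n> locID: <n>" fragment as it appears in the log
    // above. bcsId is intentionally not captured.
    private static final Pattern BLOCK_ID =
        Pattern.compile("conID: (\\d+) locID: (\\d+)");

    /** Collects every block found in a table dump as "conID:locID". */
    static Set<String> blockIds(String tableDump) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = BLOCK_ID.matcher(tableDump);
        while (m.find()) {
            ids.add(m.group(1) + ":" + m.group(2));
        }
        return ids;
    }

    /** Returns the blocks that appear in both dumps. */
    static Set<String> sharedBlocks(String keyTableDump, String deletedTableDump) {
        Set<String> shared = blockIds(keyTableDump);
        shared.retainAll(blockIds(deletedTableDump));
        return shared;
    }

    public static void main(String[] args) {
        // Fragments copied from the keyTable and deletedTable log lines.
        String keyTable = "blockID={conID: 1 locID: 111677748019200001 bcsId: 2}";
        String deletedTable = "blockID={conID: 1 locID: 111677748019200001 bcsId: 0}";
        System.out.println(sharedBlocks(keyTable, deletedTable));
    }
}
```

A non-empty result means a block referenced by a committed key is also queued for deletion, which is the condition this issue describes.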

It sounds to me like the fix should be to filter out any block that shares the same containerId and locId as the keyTable/fileTable entry when adding to deletedTable inside OMKeyCommitRequest / OMKeyCommitRequestWithFSO. But I'm no expert in HSync, so please advise. cc weichiu szetszwo
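To make the proposed filter concrete, here is a rough standalone sketch. It is not the actual patch and does not use Ozone's real classes: `BlockKey` and `blocksSafeToDelete` are hypothetical stand-ins, and the second candidate block ID in `main` is made up for illustration. The only idea taken from the proposal is that a block is compared by container ID and local ID alone.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class BlockFilterSketch {
    /** Hypothetical block identity: container ID and local ID only (no bcsId). */
    static final class BlockKey {
        final long containerId;
        final long localId;

        BlockKey(long containerId, long localId) {
            this.containerId = containerId;
            this.localId = localId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BlockKey)) {
                return false;
            }
            BlockKey other = (BlockKey) o;
            return containerId == other.containerId && localId == other.localId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(containerId, localId);
        }
    }

    /**
     * Returns the candidate blocks that are NOT referenced by the committed
     * key, i.e. the only ones that may go into the deleted table.
     */
    static List<BlockKey> blocksSafeToDelete(List<BlockKey> committed,
                                             List<BlockKey> candidates) {
        Set<BlockKey> live = new HashSet<>(committed);
        List<BlockKey> safe = new ArrayList<>();
        for (BlockKey block : candidates) {
            if (!live.contains(block)) {
                safe.add(block);
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        // The committed block from the log, plus one made-up stale block.
        List<BlockKey> committed = List.of(new BlockKey(1L, 111677748019200001L));
        List<BlockKey> candidates = List.of(
            new BlockKey(1L, 111677748019200001L),
            new BlockKey(1L, 111677748019200002L));
        for (BlockKey b : blocksSafeToDelete(committed, candidates)) {
            System.out.println("conID: " + b.containerId + " locID: " + b.localId);
        }
    }
}
```

With this kind of filter, the block still referenced by the committed key never reaches the deleted table, while blocks that are no longer referenced still do.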

Issue Links

is caused by
- HDDS-7593 Supporting HSync and lease recovery (Resolved)

is related to
- HDDS-9164 [Hsync] moves blocks to deleted table on final commit (Resolved)

relates to
- HDDS-9148 OM to reject hsync if ozone.fs.hsync.enabled is false (Resolved)
- HDDS-9291 Deny block read requests when block is marked as deleted on Datanodes (Open)

links to
- GitHub Pull Request #5167

People

Assignee: Siyao Meng
Reporter: Siyao Meng
Votes: 0
Watchers: 3

Dates

Created: 09/Aug/23 21:48
Updated: 15/Sep/23 01:40
Resolved: 11/Aug/23 07:49