[KUDU-1793] When out of disk space, LBM can corrupt data files - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.2.0
Component/s: None
Labels: None
Target Version/s: 1.2.0
Code Review: http://gerrit.cloudera.org:8080/5399

Description

The log block manager can corrupt a container data file when the following conditions are met:

A data directory runs out of disk space.
The operation that hits the out-of-space error is a merge compaction (that is, the operation fails but the server does not crash).
Space in the data directory is eventually freed, allowing the server to recover.

When all of these conditions are met, the changes introduced by commit abea8c6 (released in 1.1.0) may leave the container's bookkeeping inconsistent. Specifically, if the data dir has enough free space for the container to append some data belonging to a new block, but not enough to finalize that block, an unexpected "hole" may be left in the container.

When the server is restarted, the container's bookkeeping doesn't account for this hole, leading to data being overwritten when a new block is appended to the container. Moreover, commit 4aacaf6 (not yet released) exacerbates the issue by causing the LBM to explicitly truncate the container at the wrong place during startup, yielding immediate data loss.
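
The failure mode can be illustrated with a toy model. This is only a sketch and not Kudu's implementation: the class `ToyContainer`, its methods, and the `fail_after` and `truncate` parameters are all hypothetical. It also assumes that, on restart, the next write offset is rebuilt by summing the lengths of the finalized blocks, which is one way bookkeeping could fail to account for a hole.

```python
class ToyContainer:
    """Toy append-only container: a data file plus metadata for finalized blocks."""

    def __init__(self):
        self.data = bytearray()   # stands in for the container data file
        self.blocks = []          # finalized blocks as (offset, length)
        self.next_offset = 0      # where the next block will be written

    def _write_at(self, offset, payload):
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

    def append_block(self, payload, fail_after=None):
        """Append a block. If fail_after is set, simulate running out of
        disk space after that many bytes: the data is written but the
        block is never finalized."""
        offset = self.next_offset
        written = payload if fail_after is None else payload[:fail_after]
        self._write_at(offset, written)
        # The offset advances even for an unfinalized block, leaving a hole.
        self.next_offset = offset + len(written)
        if fail_after is not None:
            return None
        block = (offset, len(payload))
        self.blocks.append(block)
        return block

    def restart(self, truncate=False):
        """Rebuild the write offset from metadata only (assumed behavior).
        The hole has no metadata record, so it is not counted."""
        self.next_offset = sum(length for _, length in self.blocks)
        if truncate:
            # Models truncating the file at the (wrong) recomputed offset.
            del self.data[self.next_offset:]

    def read_block(self, block):
        offset, length = block
        return bytes(self.data[offset:offset + length])


def run_scenario(truncate=False):
    c = ToyContainer()
    c.append_block(b"AAAA")                # finalized at offset 0
    c.append_block(b"BBBB", fail_after=2)  # disk full: 2-byte hole at offset 4
    live = c.append_block(b"CCCC")         # space freed: finalized at offset 6
    c.restart(truncate=truncate)           # recomputed offset is 8, not 10
    return c, live


# Without truncation, the next append lands inside the live block.
c, live = run_scenario()
new = c.append_block(b"DDDD")
overwritten = c.read_block(live)

# With truncation at startup, the tail of the live block is lost immediately.
c2, live2 = run_scenario(truncate=True)
truncated = c2.read_block(live2)
```

In this model the block written after the hole starts at offset 6, but the recomputed write offset after restart is 8, so the next append overwrites the last two bytes of that block. Truncating at the recomputed offset loses those same bytes at startup, before any new block is written.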

This case was observed in an internal Cloudera cluster.

People

Assignee: Adar Dembo
Reporter: Adar Dembo
Votes: 0
Watchers: 1

Dates

Created: 07/Dec/16 08:56
Updated: 08/Dec/16 04:07
Resolved: 08/Dec/16 04:07