KAFKA-1712: Excessive storage usage on newly added node


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: log
    • Labels: None

    Description

      When a new node is added to the cluster, data starts replicating to it. The mtime of the newly created segments is set by the last message written to them. Though replication is a prolonged process, let's assume (for simplicity of explanation) that their mtime is very close to the time when the new node was added.

      After the replication is done, new data will start to flow into this new node. After `log.retention.hours` the amount of data will be 2 * daily_amount_of_data_in_kafka_node: the first half is the data replicated from the other nodes when the node was added (let us call this time `t1`), and the second half is the data replicated during normal operation between `t1` and `t1 + log.retention.hours`. None of the initially replicated segments expire during that window, because their mtime is close to `t1` rather than to the creation time of the data they hold. So by that time the node will have twice as much data as the other nodes.
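
      To make the arithmetic concrete, here is a minimal back-of-the-envelope model in Python (the retention window and ingest rate are illustrative assumptions, not our real numbers):

      RETENTION_HOURS = 24 * 7           # log.retention.hours (example value)
      INGEST_GB_PER_HOUR = 10            # assumed per-node write rate

      # At t1 the new node has received a full retention window of replicated
      # data, but every replicated segment got mtime ~= t1.
      replicated_gb = RETENTION_HOURS * INGEST_GB_PER_HOUR

      def disk_usage_gb(hours_since_t1):
          """Disk usage on the new node, some hours after replication finished."""
          # Replicated segments only expire after RETENTION_HOURS, because
          # their mtime is t1, not the original write time of the data.
          old = replicated_gb if hours_since_t1 < RETENTION_HOURS else 0
          # New data accumulates normally, capped at one retention window.
          new = min(hours_since_t1, RETENTION_HOURS) * INGEST_GB_PER_HOUR
          return old + new

      print(disk_usage_gb(0))                    # 1680 GB: one window, same as peers
      print(disk_usage_gb(RETENTION_HOURS - 1))  # 3350 GB: almost two windows
      print(disk_usage_gb(RETENTION_HOURS))      # 1680 GB: back to normal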

      This poses a big problem for us, as our storage is sized for the normal amount of data, not twice that amount.

      In our particular case it poses another problem. We have an emergency segment cleaner which runs when storage is nearly full (>90%). We try to balance the amount of data so that the cleaner does not need to run and we can rely solely on Kafka's internal log deletion, but sometimes the emergency cleaner does run.
      It works this way (see the sketch after the list):

      • it gets all Kafka segments for the volume
      • it filters out the last segment of each partition (to avoid unnecessary recreation of the last small segments)
      • it sorts them by segment mtime
      • it changes the mtime of the first N segments (those with the lowest mtime) to 1, so they become really, really old; Kafka later deletes these segments as part of normal retention. N is chosen to free a specified percentage of the volume (3% in our case).
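
      A minimal sketch of such a cleaner in Python (the directory layout, segment naming, and free-space accounting are simplified assumptions, not our production code; the 3% target is the one described above):

      import glob
      import os
      import shutil

      LOG_DIR = "/var/kafka/data"    # assumed log.dirs location
      TARGET_FREE_FRACTION = 0.03    # free 3% of the volume

      def emergency_clean(log_dir=LOG_DIR):
          # 1. Get all Kafka segment files on the volume.
          segments = glob.glob(os.path.join(log_dir, "*", "*.log"))

          # 2. Filter out the last (active) segment of each partition,
          #    so small tail segments are never recreated needlessly.
          by_partition = {}
          for seg in segments:
              by_partition.setdefault(os.path.dirname(seg), []).append(seg)
          candidates = []
          for segs in by_partition.values():
              segs.sort()                   # zero-padded offset names sort by offset
              candidates.extend(segs[:-1])  # drop the active segment

          # 3. Sort the remaining segments by mtime, oldest first.
          candidates.sort(key=os.path.getmtime)

          # 4. Set the mtime of the first N segments to 1 (epoch), so Kafka's
          #    retention deletes them soon; N is just enough to cover the
          #    target fraction of the volume.
          to_free = TARGET_FREE_FRACTION * shutil.disk_usage(log_dir).total
          freed = 0
          for seg in candidates:
              if freed >= to_free:
                  break
              freed += os.path.getsize(seg)
              os.utime(seg, (1, 1))

      if __name__ == "__main__":
          emergency_clean()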

      The emergency cleaner works very well, except when data has been replicated to a newly added node.
      In that case a segment's mtime is the time the segment was replicated, which does not reflect the real creation time of the data stored in it.
      The emergency cleaner will then delete the segments with the lowest mtime, which may hold data that is much more recent than the data in other segments.
      This is not a big problem until we delete data which hasn't been fully consumed.
      When that happens we lose data, which makes it a big problem.

      Is it possible to retain the original segment mtime during initial replication to a new node?
      This would keep the new node from accumulating twice as much data as the other nodes.

      Or maybe there are other ways to sort segments by data creation time (or something close to it)? For example, if https://issues.apache.org/jira/browse/KAFKA-1403 is implemented, we could take the time of the first message from the .index file. In our case this would let the emergency cleaner delete the truly oldest data.
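
      If such a per-segment timestamp were available, the cleaner sketched above would only need a different sort key. A hypothetical sketch; `first_message_time` is an assumed helper, not an existing Kafka or library API:

      def first_message_time(segment_path):
          """Hypothetical: would return the creation time of the first
          message in the segment (e.g., read from the .index file if
          KAFKA-1403 lands). Not an existing API."""
          raise NotImplementedError

      # In the cleaner sketch above, replace
      #     candidates.sort(key=os.path.getmtime)
      # with
      #     candidates.sort(key=first_message_time)
      # so that the truly oldest data is deleted first, even on a freshly
      # added node where all segment mtimes are close to t1.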

          People

            Assignee: Unassigned
            Reporter: Oleg Golovin (ovgolovin)
            Votes: 5
            Watchers: 9
