[KUDU-797] Undesirable log latency curve on large ingest job - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: M5
Fix Version/s: None
Component/s: log
Labels: None

Description

We've been running large (~5TB, SF=6000) invocations of tpch_real_world on a2412 (client) and a2414 (server). It's an insert-only workload with no compactions, so it approximates an idealized "ingest job".

One of the things we've learned is that if the MRS flush rate is high enough (due to a large number of disks and a high number of mm threads) and there's enough free RAM on the box, log writes may generate no disk I/O at all. The writes dirty pages in the page cache, but by the time the kernel decides to write those pages back, the associated MRS has been flushed and the log segments have been GC'ed, so there is nothing left to write.

While this is pretty neat, it produces undesirable changes in latencies as the tablet server consumes more and more RAM in its own data tracking structures (CFiles, ReadableBlocks, TabletSuperblockPBs, etc.). That is, as the TS (or another process on the system) consumes more RAM, this phenomenon lessens and log writes generate more I/O, sometimes stalling the response to clients.

Todd observed that HDFS ran into a similar issue and "smoothed" the latency curve by issuing sync_file_range(SYNC_FILE_RANGE_WRITE) on the newly written range just after each append. He added that this was done in a separate thread, because there are pathological cases where sync_file_range() blocks despite being configured not to block.
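The approach Todd describes could be sketched roughly as follows. This is not Kudu or HDFS code: AsyncRangeFlusher and AppendAndScheduleFlush are hypothetical names made up for illustration, and the sketch assumes Linux, since sync_file_range() is Linux-specific. The appender writes a log entry and hands the byte range to a background thread, which asks the kernel to start writeback, so a blocking sync_file_range() call cannot stall the write path.

```cpp
#include <fcntl.h>   // open, sync_file_range (Linux-specific; g++ defines _GNU_SOURCE)
#include <unistd.h>  // pwrite, close
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical helper (not part of Kudu): starts writeback of freshly
// appended ranges from a background thread.
class AsyncRangeFlusher {
 public:
  AsyncRangeFlusher() : worker_([this] { Run(); }) {}
  ~AsyncRangeFlusher() { Stop(); }

  // Queue a range for writeback; returns without touching the disk.
  void Enqueue(int fd, int64_t offset, int64_t len) {
    assert(len > 0);
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.push_back(Range{fd, offset, len});
    }
    cv_.notify_one();
  }

  // Processes everything still queued, then joins the worker. Idempotent.
  void Stop() {
    {
      std::lock_guard<std::mutex> l(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    if (worker_.joinable()) worker_.join();
  }

  int64_t completed() const { std::lock_guard<std::mutex> l(mu_); return completed_; }
  int64_t failed() const { std::lock_guard<std::mutex> l(mu_); return failed_; }

 private:
  struct Range { int fd; int64_t offset; int64_t len; };

  void Run() {
    std::unique_lock<std::mutex> l(mu_);
    for (;;) {
      cv_.wait(l, [this] { return stop_ || !queue_.empty(); });
      if (queue_.empty()) return;  // stop requested and nothing left to flush
      Range r = queue_.front();
      queue_.pop_front();
      l.unlock();  // never hold the lock across a call that may block
      int rc = sync_file_range(r.fd, r.offset, r.len, SYNC_FILE_RANGE_WRITE);
      l.lock();
      if (rc == 0) ++completed_; else ++failed_;
    }
  }

  mutable std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Range> queue_;
  bool stop_ = false;
  int64_t completed_ = 0;
  int64_t failed_ = 0;
  std::thread worker_;  // declared last so it starts after the members above exist
};

// Hypothetical append path: write at *offset, then schedule writeback of
// exactly the bytes just written. Short writes are treated as failures here.
bool AppendAndScheduleFlush(int fd, const void* buf, size_t len,
                            int64_t* offset, AsyncRangeFlusher* flusher) {
  ssize_t n = pwrite(fd, buf, len, *offset);
  if (n != static_cast<ssize_t>(len)) return false;
  flusher->Enqueue(fd, *offset, n);
  *offset += n;
  return true;
}
```

In this sketch the caller must keep the file descriptor open until Stop() returns. SYNC_FILE_RANGE_WRITE only initiates writeback and is not a durability guarantee; it is used here purely to spread the I/O out over time rather than letting dirty pages accumulate.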

People

Assignee: Unassigned

Reporter: Adar Dembo

Votes: 0

Watchers: 3

Dates

Created: 29/May/15 18:50

Updated: 04/Jun/15 23:32