[DRILL-5282] Rationalize record batch sizes in all readers and operators - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Component/s: None
Labels: None

Description

Drill uses record batches to process data. A record batch consists of a "bundle" of vectors that, combined, hold the data for some number of records.

The key consideration for a record batch is the memory it consumes. Operators and readers have vastly different ideas of how large a batch should be. The text reader can produce batches of a few hundred KB, while the flatten operator produces batches of half a GB. Other operators fall anywhere in between. Some readers produce batches with no size cap at all; their size is driven by average row width.

Another key consideration is record count. Batches have a hard physical limit of 64K records (the number that a two-byte selection vector can index). Some operators produce batches this large; others produce far fewer records. In one case, we saw a reader that produced 64K+1 records.
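
To illustrate why 64K+1 records is a problem, here is a minimal sketch that models a two-byte selection vector entry with Java's unsigned 16-bit `char`. The class and method names (`SelectionIndexSketch`, `toTwoByteIndex`, `fits`) are hypothetical and are not Drill APIs; the sketch only assumes that each entry is an unsigned two-byte index.

```java
// Hypothetical illustration, not Drill code.
final class SelectionIndexSketch {
    // An unsigned two-byte index covers 0..65535, so at most 65,536 records.
    static final int MAX_RECORDS = Character.MAX_VALUE + 1;

    // Narrow a record index to two bytes, as a two-byte entry would store it.
    // Indexes above 65535 wrap around and collide with earlier records.
    static int toTwoByteIndex(int recordIndex) {
        return (char) recordIndex;
    }

    // True if every record in a batch of this size can be addressed.
    static boolean fits(int recordCount) {
        return recordCount >= 0 && recordCount <= MAX_RECORDS;
    }
}
```

In this model the last addressable record has index 65535. The extra record in a 64K+1 batch would have index 65536, which narrows to 0 and so cannot be distinguished from the first record.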

A final consideration is the size of individual vectors. Drill incurs severe memory fragmentation when vectors grow above 16 MB.
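
The three considerations above (batch memory, record count, and per-vector size) can be combined into a single row-count calculation. The following is a rough sketch only, not the framework this ticket requests: `BatchSizeSketch`, `rowLimit`, the batch memory budget, and the per-column average widths are all assumptions introduced for illustration. Only the 64K record limit and the 16 MB vector threshold come from this description.

```java
// Hypothetical illustration, not Drill code.
final class BatchSizeSketch {
    // Record-count limit from the description: 64K records per batch.
    static final int MAX_ROWS = 1 << 16;
    // Vector-size threshold from the description: 16 MB.
    static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024;

    /**
     * Largest row count that respects all three limits.
     *
     * @param batchBudgetBytes assumed total memory budget for one batch
     * @param columnWidths     assumed average bytes per value, one entry per column
     */
    static int rowLimit(long batchBudgetBytes, int[] columnWidths) {
        if (batchBudgetBytes <= 0 || columnWidths.length == 0) {
            throw new IllegalArgumentException("need a positive budget and at least one column");
        }
        long rowWidth = 0;
        long widest = 0;
        for (int w : columnWidths) {
            if (w <= 0) {
                throw new IllegalArgumentException("column widths must be positive");
            }
            rowWidth += w;
            widest = Math.max(widest, w);
        }
        long byBudget = batchBudgetBytes / rowWidth;   // rows that fit in the batch budget
        long byVector = MAX_VECTOR_BYTES / widest;     // rows before the widest vector passes 16 MB
        long rows = Math.min(MAX_ROWS, Math.min(byBudget, byVector));
        return (int) Math.max(1, rows);                // always allow at least one row
    }
}
```

For example, with columns averaging 4, 8, and 50 bytes, a 1 MB budget allows 16,912 rows, while an 8 MB budget is capped at 65,536 rows by the record-count limit. A single column averaging 1 MB per value is capped at 16 rows by the vector-size threshold, no matter how large the budget is.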

In some cases, operators (such as the Parquet reader) allocate large batches but only partially fill them, which wastes a large amount of space. That waste adds up when the batches must be buffered during a sort.

This ticket asks that we research an optimal batch size, create a framework to build batches of that size, and retrofit all operators that produce batches to use the framework so that batches are uniform.

People

Assignee: Paul Rogers

Reporter: Paul Rogers

Votes: 0

Watchers: 2

Dates

Created: 21/Feb/17 16:57

Updated: 19/Jun/17 18:29