[ARROW-645] [Format] Mitigating the cost of random access in "wide" record batches - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Format
Labels:
None

External issue URL:
https://github.com/apache/arrow/issues/16270

Description

In very large schemas, due of the way we are flattening the field and buffer metadata in the RecordBatch:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L271

The cost to reconstruct / load a single array from a RecordBatch can be arbitrarily high.

As an example, let's consider a schema:

f0: int32
f1: string
...  omitting 999996 duplicate
f999998: int32
f999999: string

Here, a record batch has 1 million fields, and in total 2.5 million buffers. The problem with this is: to select a single field out of a record batch, we have to inspect all types leading up to the field of interest to know how many FieldNode and Buffer metadata values will have occurred in the serialized metadata before that field's metadata appears.

Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the Field table:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L188

So here, we would know that for the f1 field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container.

Let me know if the problem is not clear, and any other ideas about solutions

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Wes McKinney

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 17/Mar/17 04:03

Updated:: 11/Jan/23 07:10