Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
In very large schemas, because of the way we flatten the field and buffer metadata in the RecordBatch:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L271
the cost to reconstruct / load a single array from a RecordBatch grows with the number of fields that precede it, so it can be arbitrarily high.
As an example, let's consider a schema:
f0: int32
f1: string
... (999996 similar fields omitted)
f999998: int32
f999999: string
Here, a record batch has 1 million fields and 2.5 million buffers in total (2 per int32 field, 3 per string field). The problem is that to select a single field out of a record batch, we have to inspect every type leading up to the field of interest to know how many FieldNode and Buffer metadata values occur in the serialized metadata before that field's metadata appears.
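To make the cost concrete, here is a minimal Python sketch of the scan a reader has to perform. It is an illustration under simplifying assumptions, not Arrow code: `BUFFERS_PER_TYPE`, `make_schema` and `locate_field` are hypothetical names, and only flat (non-nested) types are handled, so each field contributes exactly one FieldNode.

```python
# Hypothetical illustration, not an Arrow API.
# Number of buffers each flat type contributes to the RecordBatch.
BUFFERS_PER_TYPE = {
    "int32": 2,   # validity bitmap + data
    "string": 3,  # validity bitmap + offsets + data
}


def make_schema(num_fields):
    """Build the example schema: even fields are int32, odd fields are string."""
    return [
        (f"f{i}", "int32" if i % 2 == 0 else "string")
        for i in range(num_fields)
    ]


def locate_field(schema, target):
    """Return (field_node_index, first_buffer_index, num_buffers) for schema[target].

    Every field before `target` must be visited, so the cost is linear
    in the position of the field within the schema.
    """
    node_index = 0
    buffer_index = 0
    for i in range(target):
        type_name = schema[i][1]
        node_index += 1  # flat types only: one FieldNode per field
        buffer_index += BUFFERS_PER_TYPE[type_name]
    return node_index, buffer_index, BUFFERS_PER_TYPE[schema[target][1]]


schema = make_schema(1_000_000)
total_buffers = sum(BUFFERS_PER_TYPE[type_name] for _, type_name in schema)
last_field = locate_field(schema, len(schema) - 1)
```

Locating the last field walks all 999999 preceding fields, even though only three buffers are needed in the end.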
Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the Field table:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
So here, we would know that for the f1 field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container.
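A rough sketch of how this could look, assuming the positions are computed once when the schema is written and stored on each field. The `Field` class here and its `field_position` / `buffer_position` attributes are hypothetical stand-ins for the proposed additions, not part of the current format, and only flat types are considered.

```python
from dataclasses import dataclass

# Hypothetical illustration, not an Arrow API.
BUFFERS_PER_TYPE = {
    "int32": 2,   # validity bitmap + data
    "string": 3,  # validity bitmap + offsets + data
}


@dataclass(frozen=True)
class Field:
    name: str
    type_name: str
    field_position: int   # assumed new attribute: index of the FieldNode
    buffer_position: int  # assumed new attribute: index of the first buffer


def annotate(fields):
    """Writer side: compute both positions in a single pass over the schema."""
    annotated = []
    buffer_position = 0
    for field_position, (name, type_name) in enumerate(fields):
        annotated.append(Field(name, type_name, field_position, buffer_position))
        buffer_position += BUFFERS_PER_TYPE[type_name]
    return annotated


def buffer_slots(field):
    """Reader side: the buffer slots for one field, with no scan of earlier fields."""
    count = BUFFERS_PER_TYPE[field.type_name]
    return list(range(field.buffer_position, field.buffer_position + count))


fields = annotate([("f0", "int32"), ("f1", "string"), ("f2", "int32")])
f1 = fields[1]
print(f1.field_position, f1.buffer_position, buffer_slots(f1))
```

The linear pass moves to the writer, which already visits every field, and the reader's lookup no longer depends on the field's position in the schema.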
Let me know if the problem is not clear, or if you have other ideas about solutions.