Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-645

[Format] Mitigating the cost of random access in "wide" record batches

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Format
    • None

    Description

      In very large schemas, due of the way we are flattening the field and buffer metadata in the RecordBatch:

      https://github.com/apache/arrow/blob/master/format/Message.fbs#L271

      The cost to reconstruct / load a single array from a RecordBatch can be arbitrarily high.

      As an example, let's consider a schema:

      f0: int32
      f1: string
      ...  omitting 999996 duplicate
      f999998: int32
      f999999: string
      

      Here, a record batch has 1 million fields, and in total 2.5 million buffers. The problem with this is: to select a single field out of a record batch, we have to inspect all types leading up to the field of interest to know how many FieldNode and Buffer metadata values will have occurred in the serialized metadata before that field's metadata appears.

      Solving this is a little bit tricky. One way would be to add optional "field position" and "buffer position" attributes to the Field table:

      https://github.com/apache/arrow/blob/master/format/Message.fbs#L188

      So here, we would know that for the f1 field, the field index is 1 and the buffer index is 2. Because a string has 3 buffers associated with it, we would know to select buffers in slots 2, 3, 4 to reconstruct the vector container.

      Let me know if the problem is not clear, and any other ideas about solutions

      Attachments

        Activity

          People

            Unassigned Unassigned
            wesm Wes McKinney
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: