
IMPALA-10919: Fine-tune row batch handling to support arrays with millions of items


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Backend

    Description

      Currently, during row batch serialization there is an upper limit of int32::max() bytes on the serialized row batch size:
      https://github.com/apache/impala/blob/b67c0906f596ca336d0ea0e8cbc618a20ac0e563/be/src/runtime/row-batch.cc#L348
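
      For context, below is a minimal sketch of the kind of guard involved. This is not the actual Impala code; the Status type and the error message are simplified stand-ins.

        #include <cstdint>
        #include <limits>
        #include <string>
        #include <utility>

        // Simplified stand-in for Impala's Status type (illustrative only).
        struct Status {
          bool ok;
          std::string msg;
          static Status OK() { return {true, {}}; }
          static Status Error(std::string m) { return {false, std::move(m)}; }
        };

        // Sketch of the serialization size guard: the serialized payload length
        // is carried as a signed 32-bit value, so any batch whose total
        // serialized size exceeds int32::max() bytes must be rejected.
        Status CheckSerializedSize(int64_t serialized_size) {
          constexpr int64_t kMaxSize = std::numeric_limits<int32_t>::max();
          if (serialized_size > kMaxSize) {
            return Status::Error("Serialized row batch of " +
                                 std::to_string(serialized_size) +
                                 " bytes exceeds the limit of " +
                                 std::to_string(kMaxSize) + " bytes.");
          }
          return Status::OK();
        }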

      As a result, if we query an array with millions of items in the select list, it won't fit into a serialized row batch even if it stores only integers. With the current implementation, given that the default number of rows in a row batch is 1024, the limit allows approximately 524k integers per row (524,288 integers * 4 bytes * 1024 rows = 2^31 ≈ int32::max()).
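
      To make the arithmetic concrete, a back-of-the-envelope check (not Impala code):

        #include <cstdint>
        #include <iostream>
        #include <limits>

        int main() {
          // With the default 1024 rows per batch, how many 4-byte integers fit
          // in each row before the serialized batch hits the int32 size limit?
          const int64_t max_batch_bytes = std::numeric_limits<int32_t>::max();
          const int64_t rows_per_batch = 1024;  // default BATCH_SIZE
          const int64_t bytes_per_int = 4;
          std::cout << max_batch_bytes / rows_per_batch / bytes_per_int << "\n";
          // Prints 524287, i.e. roughly 524k integers per row.
        }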

      There is a workaround: reduce the number of rows per batch with the BATCH_SIZE query option (e.g. SET BATCH_SIZE=100;). However, that seems like overkill, as it reduces the batch size for every node in the query, even though, once the arrays have been unnested, it is most likely safe to use the default batch size (see the sizing sketch below).
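
      For illustration, a hypothetical helper that estimates how far BATCH_SIZE would have to be lowered for a given array length (the function name and the 4-byte-per-item assumption are made up for this sketch):

        #include <algorithm>
        #include <cstdint>
        #include <iostream>
        #include <limits>

        // Hypothetical helper (not part of Impala): the largest row count per
        // batch that keeps rows carrying items_per_row fixed-size array items
        // under the int32 serialized-size limit.
        int64_t MaxRowsPerBatch(int64_t items_per_row, int64_t bytes_per_item = 4) {
          const int64_t max_batch_bytes = std::numeric_limits<int32_t>::max();
          return std::max<int64_t>(1, max_batch_bytes / (items_per_row * bytes_per_item));
        }

        int main() {
          // Rows holding 5 million integers each: only ~107 rows fit per batch,
          // so BATCH_SIZE would have to be dropped to about 100 for the whole query.
          std::cout << MaxRowsPerBatch(5'000'000) << "\n";  // prints 107
        }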

      Just an idea, but it would be nice to dynamically adjust the row batch size when big arrays are queried so that they fit into a serialized row batch; once the unnesting has happened (and there are no longer arrays in the row batch), the default size could be used.
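
      A rough sketch of what such dynamic adjustment could look like (purely illustrative; none of these names exist in Impala):

        #include <algorithm>
        #include <cstdint>
        #include <limits>

        // Idea: a batch that still contains not-yet-unnested arrays shrinks its
        // row capacity so the serialized size stays under the limit; a batch
        // produced after unnesting falls back to the default capacity.
        int64_t ChooseBatchCapacity(int64_t avg_serialized_row_bytes,
                                    bool batch_contains_arrays,
                                    int64_t default_capacity = 1024) {
          if (!batch_contains_arrays) return default_capacity;
          const int64_t max_batch_bytes = std::numeric_limits<int32_t>::max();
          const int64_t fit =
              max_batch_bytes / std::max<int64_t>(1, avg_serialized_row_bytes);
          return std::min(default_capacity, std::max<int64_t>(1, fit));
        }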


          People

            Assignee: Unassigned
            Reporter: Gabor Kaszab
            Votes: 0
            Watchers: 2
