Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5304

Parquet scanner transfers decompression buffers when not needed

    Details

    • Epic Color:
      ghx-label-4

      Description

      The Parquet scanner always transfers decompression buffers to the scratch batch:

      
      Status BaseScalarColumnReader::ReadDataPage() {
        // We're about to move to the next data page.  The previous data page is
        // now complete, pass along the memory allocated for it.
        parent_->scratch_batch_->mem_pool()->AcquireData(decompressed_data_pool_.get(), false);
      

      These in turn are passed along with the row batch. This is safe but unnecessary in many cases where the batch does not hold pointers into the decompression buffer: if the column has only fixed-length data, or if the data page is dictionary-encoded.

      This can make problems like IMPALA-4923 worse than they would be otherwise because extra data is transferred across threads.

        Issue Links

          Activity

          Hide
          tarmstrong Tim Armstrong added a comment -

          IMPALA-5304: reduce transfer of Parquet decompression buffers

          The buffers contain the Parquet DataPages, which need to be
          attached to the row batch if the rows point to var-len data
          stored directly in the page. Otherwise the buffers can be
          discarded once the values in the page have been materialized.

          This reduces the amount of memory transferred between threads, which is
          a known TCMalloc anti-pattern. It also allows us to free memory
          earlier, which may help reduce memory consumption slightly.

          Also fix a latent bug I noticed where needs_conversion_ is not
          always initialised in the constructor.

          Testing
          Ran exhaustive build. Most of the Parquet tests use compressed Parquet,
          which should exercise this code path.

          Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6
          Reviewed-on: http://gerrit.cloudera.org:8080/6876
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          tarmstrong Tim Armstrong added a comment - IMPALA-5304 : reduce transfer of Parquet decompression buffers The buffers contain the Parquet DataPages, which need to be attached to the row batch if the rows point to var-len data stored directly in the page. Otherwise the buffers can be discarded once the values in the page have been materialized. This reduces the amount of memory transferred between threads, which is a known TCMalloc anti-pattern. It also allows us to free memory earlier, which may help reduce memory consumption slightly. Also fix a latent bug I noticed where needs_conversion_ is not always initialised in the constructor. Testing Ran exhaustive build. Most of the Parquet tests use compressed Parquet, which should exercise this code path. Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6 Reviewed-on: http://gerrit.cloudera.org:8080/6876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins —

            People

            • Assignee:
              tarmstrong Tim Armstrong
              Reporter:
              tarmstrong Tim Armstrong
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development