Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3899

Certain join types fail to spill after they start probing if there are matches stored in the hash table

    Details

      Description

      The fix for IMPALA-1488 disabled spilling for certain hash join types (e.g. RIGHT OUTER JOIN, FULL OUTER JOIN) if there were any matches in the hash table. The underlying cause of this (if I understand correctly) is that we can spill partitions after we've built the hash table and started probing.

      As part of IMPALA-3567 I need to fix the underlying issue so that we don't spill additional build partitions while probing. As a consequence of that work, it should be possible to disable the IMPALA-1488 workaround.

        Activity

        Hide
        tarmstrong Tim Armstrong added a comment -

        IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

        The main outcome of this patch is to split out PhjBuilder from
        PartitionedHashJoinNode, which manages the build partitions for the
        join and implements the DataSink interface.

        A lot of this work is fairly mechanical refactoring: dividing the state
        between the two classes and duplicating it where appropriate. One major
        change is that probe partitions need to be separated from build
        partitions.

        This required some significant algorithmic changes to memory management
        for probe partition buffers: memory management for probe partitions was
        entangled with the build partitions: small buffers were allocated for
        the probe partitions even for non-spilling joins and the join could
        spill additional partitions during probing if the probe partitions needed
        to switch from small to I/O buffers. The changes made were:

        • Probe partitions are only initialized after the build is partitioned, and
          only for spilled build partitions.
        • Probe partitions never use small buffers: once the initial write
          buffer is allocated, appending to the probe partition never fails.
        • All probe partition allocation is done after partitioning the build
          and before processing the probe input during the same phase as hash
          table building. (Aside from NAAJ partitions which are allocated
          upfront).

        The probe partition changes necessitated a change in
        BufferedTupleStream: allocation of write blocks is now explicit via the
        PrepareForWrite() API.

        Testing:
        Ran exhaustive build and local stress test.

        Memory Usage:
        Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
        No regressions on TPC-DS. TPC-H either stayed the same or improved in
        min memory requirement without spilling, but the min memory requirement
        with spilling regressed in some cases. I investigated each of the
        significant regressions on TPC-H and determined that they were all due
        to exec nodes racing for spillable or non-spillable memory. None of them
        were cases where exec nodes got their minimum reservation and failed to
        execute the spilling algorithm correctly.

        Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
        Reviewed-on: http://gerrit.cloudera.org:8080/3873
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Reviewed-by: Dan Hecht <dhecht@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - IMPALA-3567 Part 2, IMPALA-3899 : factor out PHJ builder The main outcome of this patch is to split out PhjBuilder from PartitionedHashJoinNode, which manages the build partitions for the join and implements the DataSink interface. A lot of this work is fairly mechanical refactoring: dividing the state between the two classes and duplicating it where appropriate. One major change is that probe partitions need to be separated from build partitions. This required some significant algorithmic changes to memory management for probe partition buffers: memory management for probe partitions was entangled with the build partitions: small buffers were allocated for the probe partitions even for non-spilling joins and the join could spill additional partitions during probing if the probe partitions needed to switch from small to I/O buffers. The changes made were: Probe partitions are only initialized after the build is partitioned, and only for spilled build partitions. Probe partitions never use small buffers: once the initial write buffer is allocated, appending to the probe partition never fails. All probe partition allocation is done after partitioning the build and before processing the probe input during the same phase as hash table building. (Aside from NAAJ partitions which are allocated upfront). The probe partition changes necessitated a change in BufferedTupleStream: allocation of write blocks is now explicit via the PrepareForWrite() API. Testing: Ran exhaustive build and local stress test. Memory Usage: Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20. No regressions on TPC-DS. TPC-H either stayed the same or improved in min memory requirement without spilling, but the min memory requirement with spilling regressed in some cases. I investigated each of the significant regressions on TPC-H and determined that they were all due to exec nodes racing for spillable or non-spillable memory. None of them were cases where exec nodes got their minimum reservation and failed to execute the spilling algorithm correctly. Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb Reviewed-on: http://gerrit.cloudera.org:8080/3873 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            tarmstrong Tim Armstrong
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development