Details

      Description

      This is a subtask of IMPALA-3200 to add basic support for spilling buffers

      We'll call this complete once we have the buffer pool features required to execute queries that successfully spill to disk. Given the new guaranteed reservation model, these queries should be guaranteed to successfully spill if the initial reservation is granted (excluding cases like rows larger than the default page size). We will not worry about performance or memory usage constraints at this stage.

        Activity

        Hide
        tarmstrong Tim Armstrong added a comment -

        IMPALA-3202: DiskIoMgr improvements for new buffer pool

        The main goal of this patch is to add support to the DiskIoMgr
        to read into client-provided buffers, instead of the DiskIoMgr
        allocating intermediate buffers. This is important for the new
        buffer pool because we want it to entirely manage its own memory,
        rather than delegating part of its memory management to the DiskIoMgr.

        Also do some cleanup:

        • Consolidate some correlated ScanRange parameters into a parameter
          struct.
        • Remove the "untracked" buffers mem tracker, which is no longer
          necessary.
        • Change the buffer type in DiskIoMgr to use uint8_t* to be consistent
          with the rest of Impala.

        Testing:
        Added a unit test. We also get coverage from the BufferedBlockMgr unit
        tests, any spilling tests and the Parquet tests with large footers.

        Change-Id: I913fbb8ae6c242f1585992eb2a622509344dccec
        Reviewed-on: http://gerrit.cloudera.org:8080/4631
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - IMPALA-3202 : DiskIoMgr improvements for new buffer pool The main goal of this patch is to add support to the DiskIoMgr to read into client-provided buffers, instead of the DiskIoMgr allocating intermediate buffers. This is important for the new buffer pool because we want it to entirely manage its own memory, rather than delegating part of its memory management to the DiskIoMgr. Also do some cleanup: Consolidate some correlated ScanRange parameters into a parameter struct. Remove the "untracked" buffers mem tracker, which is no longer necessary. Change the buffer type in DiskIoMgr to use uint8_t* to be consistent with the rest of Impala. Testing: Added a unit test. We also get coverage from the BufferedBlockMgr unit tests, any spilling tests and the Parquet tests with large footers. Change-Id: I913fbb8ae6c242f1585992eb2a622509344dccec Reviewed-on: http://gerrit.cloudera.org:8080/4631 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins
        Hide
        tarmstrong Tim Armstrong added a comment -

        Change subject: IMPALA-3202: refactor scratch file management into TmpFileMgr
        ......................................................................

        IMPALA-3202: refactor scratch file management into TmpFileMgr

        This is a pure refactoring patch that moves all of the logic
        for allocating scratch file ranges into TmpFileMgr in anticipation of
        this logic being used by the new BufferPool.

        There should be no behavioural changes.

        Also remove a bunch of TODOs that we're not going to fix.

        Change-Id: I0c56c195f3f28d520034f8c384494e566635fc62
        Reviewed-on: http://gerrit.cloudera.org:8080/4898
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Internal Jenkins

        M be/src/runtime/buffered-block-mgr-test.cc
        M be/src/runtime/buffered-block-mgr.cc
        M be/src/runtime/buffered-block-mgr.h
        M be/src/runtime/tmp-file-mgr-test.cc
        M be/src/runtime/tmp-file-mgr.cc
        M be/src/runtime/tmp-file-mgr.h
        6 files changed, 441 insertions, 432 deletions

        Approvals:
        Internal Jenkins: Verified
        Tim Armstrong: Looks good to me, approved

        Show
        tarmstrong Tim Armstrong added a comment - Change subject: IMPALA-3202 : refactor scratch file management into TmpFileMgr ...................................................................... IMPALA-3202 : refactor scratch file management into TmpFileMgr This is a pure refactoring patch that moves all of the logic for allocating scratch file ranges into TmpFileMgr in anticipation of this logic being used by the new BufferPool. There should be no behavioural changes. Also remove a bunch of TODOs that we're not going to fix. Change-Id: I0c56c195f3f28d520034f8c384494e566635fc62 Reviewed-on: http://gerrit.cloudera.org:8080/4898 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins — M be/src/runtime/buffered-block-mgr-test.cc M be/src/runtime/buffered-block-mgr.cc M be/src/runtime/buffered-block-mgr.h M be/src/runtime/tmp-file-mgr-test.cc M be/src/runtime/tmp-file-mgr.cc M be/src/runtime/tmp-file-mgr.h 6 files changed, 441 insertions , 432 deletions Approvals: Internal Jenkins: Verified Tim Armstrong: Looks good to me, approved
        Hide
        tarmstrong Tim Armstrong added a comment -

        Another step forward:

        IMPALA-3202,IMPALA-2079: rework scratch file I/O

        Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
        TmpFileMgr, in anticipation of it being shared with BufferPool.
        TmpFileMgr now handles:

        • Scratch space allocation and recycling
        • Read and write I/O

        The interface is also greatly changed so that it is built around Write()
        and Read() calls, abstracting away the details of temporary file
        allocation from clients. This means the TmpFileMgr::File class can
        be hidden from clients.

        Write error recovery:
        Also implement write error recovery in TmpFileMgr.

        If an error occurs while writing to scratch and we have multiple
        scratch directories, we will try one of the other directories
        before cancelling the query. File-level blacklisting is used to
        prevent excessive repeated attempts to resize a scratch file during
        a single query. Device-level blacklisting is not implemented because
        it is problematic to permanently take a scratch directory out of use.

        To reduce the number of error paths, all I/O errors are now handled
        asynchronously. Previously errors creating or extending the file were
        returned synchronously from WriteUnpinnedBlock(). This required
        modifying DiskIoMgr to create the file if not present when opened.

        Also set the default max_errors value in the thrift definition file,
        so that it is in effect for backend tests.

        Future Work:

        • Support for recycling variable-length scratch file ranges. I omitted
          this to avoid making the patch even large.

        Testing:
        Updated BufferedBlockMgr unit test to reflect changes in behaviour:

        • Scratch space is no longer permanently associated with a block, and
          is remapped every time a new block is written to disk .
        • Files are now blacklisted - updated existing tests and enable the
          disable blacklisting test.

        Added some basic testing of recycling of scratch file ranges in
        the TmpFileMgr unit test.

        I also manually tested the code in two ways. First by removing permissions
        for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
        Second, by creating a tiny ramdisk (16M) and running with two scratch
        directories: one on /tmp and one on the tiny ramdisk. When spilling, an
        out of space error is encountered for the tiny ramdisk and impala spills
        the remaining data (72M) to /tmp.

        Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
        Reviewed-on: http://gerrit.cloudera.org:8080/5141
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - Another step forward: IMPALA-3202 , IMPALA-2079 : rework scratch file I/O Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into TmpFileMgr, in anticipation of it being shared with BufferPool. TmpFileMgr now handles: Scratch space allocation and recycling Read and write I/O The interface is also greatly changed so that it is built around Write() and Read() calls, abstracting away the details of temporary file allocation from clients. This means the TmpFileMgr::File class can be hidden from clients. Write error recovery: Also implement write error recovery in TmpFileMgr. If an error occurs while writing to scratch and we have multiple scratch directories, we will try one of the other directories before cancelling the query. File-level blacklisting is used to prevent excessive repeated attempts to resize a scratch file during a single query. Device-level blacklisting is not implemented because it is problematic to permanently take a scratch directory out of use. To reduce the number of error paths, all I/O errors are now handled asynchronously. Previously errors creating or extending the file were returned synchronously from WriteUnpinnedBlock(). This required modifying DiskIoMgr to create the file if not present when opened. Also set the default max_errors value in the thrift definition file, so that it is in effect for backend tests. Future Work: Support for recycling variable-length scratch file ranges. I omitted this to avoid making the patch even large. Testing: Updated BufferedBlockMgr unit test to reflect changes in behaviour: Scratch space is no longer permanently associated with a block, and is remapped every time a new block is written to disk . Files are now blacklisted - updated existing tests and enable the disable blacklisting test. Added some basic testing of recycling of scratch file ranges in the TmpFileMgr unit test. I also manually tested the code in two ways. First by removing permissions for /tmp/impala-scratch and ensuring that a spilling query fails cleanly. Second, by creating a tiny ramdisk (16M) and running with two scratch directories: one on /tmp and one on the tiny ramdisk. When spilling, an out of space error is encountered for the tiny ramdisk and impala spills the remaining data (72M) to /tmp. Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17 Reviewed-on: http://gerrit.cloudera.org:8080/5141 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins
        Hide
        tarmstrong Tim Armstrong added a comment -

        IMPALA-3202: variable-length scratch file ranges

        This uses a simple approach where scratch ranges are managed in
        power-of-two size classes and we don't attempt to coalesce or
        split the ranges to move them into different size classes. Thus
        this does not optimally reuse space if a query spills a variety
        of page sizes, but improving this may not be worth the added
        complexity.

        Testing:
        Extended tmp-file-mgr-test to exercise the variable scratch range
        support. We will get system test coverage once this is used by the
        new buffer pool.

        Change-Id: Ic0ad84493c2c93a5602c404a83c718f25ea25575
        Reviewed-on: http://gerrit.cloudera.org:8080/5597
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - IMPALA-3202 : variable-length scratch file ranges This uses a simple approach where scratch ranges are managed in power-of-two size classes and we don't attempt to coalesce or split the ranges to move them into different size classes. Thus this does not optimally reuse space if a query spills a variety of page sizes, but improving this may not be worth the added complexity. Testing: Extended tmp-file-mgr-test to exercise the variable scratch range support. We will get system test coverage once this is used by the new buffer pool. Change-Id: Ic0ad84493c2c93a5602c404a83c718f25ea25575 Reviewed-on: http://gerrit.cloudera.org:8080/5597 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins
        Hide
        tarmstrong Tim Armstrong added a comment -

        Didn't mean to resolve it.

        Show
        tarmstrong Tim Armstrong added a comment - Didn't mean to resolve it.
        Hide
        tarmstrong Tim Armstrong added a comment -

        I'll close the issue. There is a follow-on which is that DecreaseReservation() can violate the invariant around pinned dirty bytes, which I will file an issue for.

        IMPALA-3202: implement spill-to-disk in new buffer pool

        See https://goo.gl/0zuy97 for a high-level summary of the design.

        Unpinned pages can now be written to disk to free up memory. After
        unpinning, pages enter a "dirty" state. Each client initiates
        asynchronous writes for dirty page to free up memory to allocate
        more buffers. After the write completes, pages are "clean" and can
        be evicted from memory by any client that needs the buffer. This
        is implemented by moving pages between lists in ClientImpl (a new
        internal class that stores the client's state) and BufferPool.

        I/O:


        The mechanics of I/O to scratch files is handled by the TmpFileMgr
        mechanisms introduced in earlier IMPALA-3202 patches.

        The decision to start a write I/O is based on two factors. First,
        each client maintains the invariant that bytes of the dirty
        pages should not exceed the unused reservation. This is to ensure
        that any client's reservation can always be fulfilled by evicting
        a clean page. Second, additional writes are initiated to proactively
        write data to disk, so that clean pages are available when needed.
        The buffer pool tries to keep enough writes in flight to keep all
        disks busy.

        The buffer pool avoids read I/O whenever possible: if an unpinned
        page is pinned again and its buffer is still in memory, no I/O
        is required to read back the data.

        Locking:
        --------
        Concurrency is managed using client, page, and clean page list locks.
        The design uses intrusive doubly-linked lists to track pages. This
        patch adds a LockType parameter to InternalQueue and a FakeLock type
        to support a non-concurrent InternalList type that is more convenient
        for the buffer pool's locking scheme.

        Testing:
        --------
        Added some basic unit tests to BufferPoolTest that exercise the new
        code paths. More test coverage will be added later with system tests
        and porting of tests from BufferedBlockMgrTest - see the draft
        test plan for IMPALA-3200.

        Change-Id: I8c6119a4557da853fefa02e17c49d8be0e9fbe01
        Reviewed-on: http://gerrit.cloudera.org:8080/5584
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - I'll close the issue. There is a follow-on which is that DecreaseReservation() can violate the invariant around pinned dirty bytes, which I will file an issue for. IMPALA-3202 : implement spill-to-disk in new buffer pool See https://goo.gl/0zuy97 for a high-level summary of the design. Unpinned pages can now be written to disk to free up memory. After unpinning, pages enter a "dirty" state. Each client initiates asynchronous writes for dirty page to free up memory to allocate more buffers. After the write completes, pages are "clean" and can be evicted from memory by any client that needs the buffer. This is implemented by moving pages between lists in ClientImpl (a new internal class that stores the client's state) and BufferPool. I/O: The mechanics of I/O to scratch files is handled by the TmpFileMgr mechanisms introduced in earlier IMPALA-3202 patches. The decision to start a write I/O is based on two factors. First, each client maintains the invariant that bytes of the dirty pages should not exceed the unused reservation. This is to ensure that any client's reservation can always be fulfilled by evicting a clean page. Second, additional writes are initiated to proactively write data to disk, so that clean pages are available when needed. The buffer pool tries to keep enough writes in flight to keep all disks busy. The buffer pool avoids read I/O whenever possible: if an unpinned page is pinned again and its buffer is still in memory, no I/O is required to read back the data. Locking: -------- Concurrency is managed using client, page, and clean page list locks. The design uses intrusive doubly-linked lists to track pages. This patch adds a LockType parameter to InternalQueue and a FakeLock type to support a non-concurrent InternalList type that is more convenient for the buffer pool's locking scheme. Testing: -------- Added some basic unit tests to BufferPoolTest that exercise the new code paths. More test coverage will be added later with system tests and porting of tests from BufferedBlockMgrTest - see the draft test plan for IMPALA-3200 . Change-Id: I8c6119a4557da853fefa02e17c49d8be0e9fbe01 Reviewed-on: http://gerrit.cloudera.org:8080/5584 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            tarmstrong Tim Armstrong
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development