Details

      Description

To scale to many cores, BufferPool needs scalable free lists and clean page lists so that throughput continues to grow with the number of cores.

        Issue Links

          Activity

Tim Armstrong added a comment -

          IMPALA-3203: Part 1: Free list implementation

          We will have a single free list per size class. Free buffers
          are stored in a heap, ordered by the memory address of the
          buffer, to help reduce address space fragmentation.
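
As a rough, hedged illustration (the class and method names below are made up, not the actual Impala types), a per-size-class free list backed by an address-ordered heap could look like this:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch, not the actual Impala class: one of these lists would
// exist per buffer size class. Free buffers are kept in a min-heap ordered by
// address so that the lowest-addressed buffer is reused first, which tends to
// keep live allocations packed toward the start of the address space.
class FreeListSketch {
 public:
  // Return a freed buffer of this size class to the list.
  void AddFreeBuffer(uint8_t* buffer) { heap_.push(buffer); }

  // Pop the lowest-addressed free buffer, or return nullptr if the list is empty.
  uint8_t* PopFreeBuffer() {
    if (heap_.empty()) return nullptr;
    uint8_t* buffer = heap_.top();
    heap_.pop();
    return buffer;
  }

  size_t Size() const { return heap_.size(); }

 private:
  // std::greater<> turns the default max-heap into a min-heap on the address.
  std::priority_queue<uint8_t*, std::vector<uint8_t*>, std::greater<uint8_t*>>
      heap_;
};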

          A follow-up patch will use the free lists in BufferPool.

TCMalloc currently provides thread-local caches that serve a somewhat
similar purpose. However, they are not suitable here for several
reasons:

          • They are designed for caching small allocations - large allocations
            like most buffers are served from a global page heap protected
            by a global lock.
          • We intend to move away from using TCMalloc for buffers: IMPALA-5073
          • Thread caches are ineffective for the producer-consumer pattern where
            one thread allocates memory and another thread frees it.
          • TCMalloc gives us limited control over how and when memory is
            actually released to the OS.

          Testing:
          Added unit tests for sanity checks and verification of behaviour
          that is trickier to check in integration or system tests.
The code will be exercised more thoroughly via BufferPool
in Part 2.

          Performance:
          Includes a benchmark that demonstrates the scalability of
          the free lists under concurrency. When measuring pure throughput
          of free list operations, having a free list per core is
          significantly faster than a shared free list, or allocating
directly from TCMalloc. On 8 cores, if the allocated memory is
actually touched, then for 64KB+ buffers memory access, rather than
lock contention, is the bottleneck.

          The benchmark also showed that non-inlined constructors and move
          operators of BufferHandle were taking significant CPU cycles, so
          I inlined those.

          This suggests that having a free list per core is more than sufficient
          (however, we will need to validate this with system concurrency
          benchmarks once we switch to using this during query execution).

          Change-Id: Ia89acfa4efdecb96d3678443b4748932b4133b9b
          Reviewed-on: http://gerrit.cloudera.org:8080/6410
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Impala Public Jenkins

Tim Armstrong added a comment -

          IMPALA-3203: Part 2: per-core free lists in buffer pool

          Add per-core lists of clean pages and free pages to enable allocation
          of buffers without contention on shared locks in the common case.

          This is implemented with an additional layer of abstraction in
          "BufferAllocator", which tracks all memory (free buffers and clean
          pages) that is not in use but has not been released to the OS.
          The old BufferAllocator is renamed to SystemAllocator.
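
As a loose sketch of that layering (PerCoreArena, PerCoreAllocatorSketch and the use of sched_getcpu() are illustrative assumptions, not the patch's actual API), the allocator keeps one arena per core and, in the common case, only takes the lock of the arena for the core the calling thread is currently running on:

#include <sched.h>  // sched_getcpu(); Linux-specific

#include <memory>
#include <mutex>
#include <vector>

// Hypothetical per-core arena: holds the free buffer lists and clean page
// list that belong to one core, protected by that core's own lock.
struct PerCoreArena {
  std::mutex lock;
  // The real arena would hold per-size-class free lists and a clean page
  // list here; elided in this sketch.
};

class PerCoreAllocatorSketch {
 public:
  explicit PerCoreAllocatorSketch(int num_cores) : arenas_(num_cores) {
    for (auto& arena : arenas_) arena = std::make_unique<PerCoreArena>();
  }

  // In the common case a thread only touches the arena of the core it is
  // running on, so threads on different cores do not contend on one lock.
  PerCoreArena* CurrentArena() {
    int core = sched_getcpu();   // core this thread is currently on
    if (core < 0) core = 0;      // fall back if the call fails
    return arenas_[core % arenas_.size()].get();
  }

 private:
  std::vector<std::unique_ptr<PerCoreArena>> arenas_;
};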

          See "Spilled Page Mgmt" and "MMap Allocator & Scalable Free Lists" in
          https://goo.gl/0zuy97 for a high-level summary of how this fits into
          the buffer pool design.

          The guts of the new code is BufferAllocator::AllocateInternal(),
          which progresses through several strategies for allocating memory.
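
The patch itself is the authority on the exact strategies and their order; purely as a hedged outline, the cascade runs along these lines (the Try* helpers are invented stubs standing in for the real code paths):

#include <cstdint>

// Invented stubs standing in for the real code paths; each returns nullptr
// when it cannot satisfy the request.
uint8_t* TryCurrentCoreFreeList(int64_t len);    // free buffer on this core
uint8_t* TryCurrentCoreCleanPages(int64_t len);  // reclaim a local clean page
uint8_t* TrySystemAllocation(int64_t len);       // fresh memory from the OS
uint8_t* TryOtherCores(int64_t len);             // scavenge other cores' lists

// Rough shape of the cascade in BufferAllocator::AllocateInternal():
// cheaper, contention-free strategies first, shared/locked ones last.
uint8_t* AllocateInternalSketch(int64_t len) {
  if (uint8_t* buf = TryCurrentCoreFreeList(len)) return buf;
  if (uint8_t* buf = TryCurrentCoreCleanPages(len)) return buf;
  if (uint8_t* buf = TrySystemAllocation(len)) return buf;
  return TryOtherCores(len);  // last resort: take other cores' locks
}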

          Misc changes:

          • Enforce upper limit on buffer size to reduce the number of free lists
            required.
          • Add additional allocation counters.
• Slightly reorganise the MemTracker GC functions to use lambdas and
  clarify the order in which they should be called. Also add a target
  memory value so that they don't need to free all of the memory in
  the system (a rough sketch follows this list).
          • Fix an accounting bug in the buffer pool where it didn't
            evict dirty pages before reclaiming a clean page.
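
As a simplified stand-in for that MemTracker change (the names and signatures below are not the exact MemTracker API), GC functions registered as lambdas can take a target number of bytes so the caller stops once enough has been reclaimed:

#include <cstdint>
#include <functional>
#include <vector>

// Simplified stand-in for the MemTracker GC machinery. Each GC function takes
// a target number of bytes to free rather than releasing everything it can.
class GcFunctionsSketch {
 public:
  using GcFunction = std::function<int64_t(int64_t bytes_to_free)>;

  // Functions are added in the order in which they should be called.
  void AddGcFunction(GcFunction fn) { gc_functions_.push_back(std::move(fn)); }

  // Call GC functions in order until roughly 'bytes_to_free' has been
  // reclaimed, so we do not free all of the memory in the system.
  int64_t GcMemory(int64_t bytes_to_free) {
    int64_t freed = 0;
    for (auto& fn : gc_functions_) {
      if (freed >= bytes_to_free) break;
      freed += fn(bytes_to_free - freed);  // each lambda reports what it freed
    }
    return freed;
  }

 private:
  std::vector<GcFunction> gc_functions_;
};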

          Performance:
          We will need to validate the performance of the system under high query
          concurrency before this is used as part of query execution. The benchmark
          in Part 1 provided some evidence that this approach of a list per core
          should scale well to many cores.

          Testing:
          Added buffer-allocator-test to test the free list resizing algorithm
          directly.

          Added a test to buffer-pool-test to exercise the various new memory
reclamation code paths that are now possible. Also ran buffer-pool-test
under two different faked-out NUMA setups - one with no NUMA and another
with three NUMA nodes.

          buffer-pool-test, suballocator-test, and buffered-tuple-stream-v2-test
          provide some further basic coverage. Future system and unit tests will
          validate this further before it is used for query execution (see
          IMPALA-3200).

          Ran an initial version of IMPALA-4114, the ported BufferedBlockMgr
          tests, against this. The randomised stress test revealed some accounting
          bugs which are fixed. I'll post those tests as a follow-on patch.

          Change-Id: I612bd1cd0f0e87f7d8186e5bedd53a22f2d80832
          Reviewed-on: http://gerrit.cloudera.org:8080/6414
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Impala Public Jenkins


            People

• Assignee: Tim Armstrong
• Reporter: Tim Armstrong