Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4787

Optimize APPX_MEDIAN() mem usage in case of many grouping keys

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:

      Description

      APPX_MEDIAN uses a lot of memory per grouping key. It allocates space for 20,000 samples per grouping key to estimate the median. The current implementation targeted towards non-grouping aggregations or aggregations with relatively few distinct grouping keys.

        Issue Links

          Activity

          Hide
          tarasbob Taras Bobrovytsky added a comment -

          The commit that fixes this JIRA introduced this bug: https://issues.apache.org/jira/browse/IMPALA-5088. When backporting this, be sure to also backport the fix for the bug.

          Show
          tarasbob Taras Bobrovytsky added a comment - The commit that fixes this JIRA introduced this bug: https://issues.apache.org/jira/browse/IMPALA-5088 . When backporting this, be sure to also backport the fix for the bug.
          Hide
          tarmstrong Tim Armstrong added a comment -

          I think this caused IMPALA-5088

          Show
          tarmstrong Tim Armstrong added a comment - I think this caused IMPALA-5088
          Hide
          tarasbob Taras Bobrovytsky added a comment - - edited
          commit 529a5f99b959079faead34a977fba1125d01840e
          Author: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
          Date:   Mon Feb 13 18:14:56 2017 -0800
          
              IMPALA-4787: Optimize APPX_MEDIAN() memory usage
              
              Before this change, ReservoirSample functions (such as APPX_MEDIAN())
              allocated memory for 20,000 elements up front per grouping key. This
              caused inefficient memory usage for aggregations with many grouping
              keys.
              
              This patch fixes this by initially allocating memory for 16 elements.
              Once the buffer becomes full, we reallocate a new buffer with double
              capacity and copy the original buffer into the new one. We continue
              doubling the buffer size until the buffer has room for 20,000 elements
              as before.
              
              Testing:
              Added some EE APPX_MEDIAN() tests on larger datasets that exercise the
              resize code path.
          
          Show
          tarasbob Taras Bobrovytsky added a comment - - edited commit 529a5f99b959079faead34a977fba1125d01840e Author: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Date: Mon Feb 13 18:14:56 2017 -0800 IMPALA-4787: Optimize APPX_MEDIAN() memory usage Before this change, ReservoirSample functions (such as APPX_MEDIAN()) allocated memory for 20,000 elements up front per grouping key. This caused inefficient memory usage for aggregations with many grouping keys. This patch fixes this by initially allocating memory for 16 elements. Once the buffer becomes full, we reallocate a new buffer with double capacity and copy the original buffer into the new one. We continue doubling the buffer size until the buffer has room for 20,000 elements as before. Testing: Added some EE APPX_MEDIAN() tests on larger datasets that exercise the resize code path.
          Show
          jbapple Jim Apple added a comment - Or https://datasketches.github.io/docs/Quantiles/QuantilesOverview.html which implements section 3.2 of http://dimacs.rutgers.edu/~graham/pubs/papers/mergeabletods.pdf
          Hide
          jbapple Jim Apple added a comment -

          There is now a much smaller data structure for finding the approximate median:

          [edit: remove github link with "no commercial use" comment in python file]
          https://arxiv.org/abs/1603.05346
          http://ieee-focs.org/FOCS-2016-Papers/3933a071.pdf

          Show
          jbapple Jim Apple added a comment - There is now a much smaller data structure for finding the approximate median: [edit: remove github link with "no commercial use" comment in python file] https://arxiv.org/abs/1603.05346 http://ieee-focs.org/FOCS-2016-Papers/3933a071.pdf
          Hide
          tarmstrong Tim Armstrong added a comment -

          The problem here is that the intermediate state for appx_median is fixed-size and quite large:

          const static int NUM_BUCKETS = 100;
          const static int NUM_SAMPLES_PER_BUCKET = 200;
          const static int NUM_SAMPLES = NUM_BUCKETS * NUM_SAMPLES_PER_BUCKET;
          ....
          template <typename T>
          struct ReservoirSample {
            T val;
            double key;
          };
          ...
          template <typename T>
          struct ReservoirSampleState {
            ReservoirSample<T> samples[NUM_SAMPLES];
            int num_samples;
            int64_t source_size;
            ranlux64_3 rng;
          ...
          };
          

          We could fix this by rewriting the aggregate function to dynamically size the samples array, which would reduce the memory consumption a lot when the # of samples is << NUM_SAMPLES.

          Show
          tarmstrong Tim Armstrong added a comment - The problem here is that the intermediate state for appx_median is fixed-size and quite large: const static int NUM_BUCKETS = 100; const static int NUM_SAMPLES_PER_BUCKET = 200; const static int NUM_SAMPLES = NUM_BUCKETS * NUM_SAMPLES_PER_BUCKET; .... template <typename T> struct ReservoirSample { T val; double key; }; ... template <typename T> struct ReservoirSampleState { ReservoirSample<T> samples[NUM_SAMPLES]; int num_samples; int64_t source_size; ranlux64_3 rng; ... }; We could fix this by rewriting the aggregate function to dynamically size the samples array, which would reduce the memory consumption a lot when the # of samples is << NUM_SAMPLES.
          Hide
          szama_impala_6295 Marcell Szabo added a comment -

          a possible workaround:
          the below two both give an approximate median:

          SELECT appx_median(salary) from employees;
          SELECT max(salary) from (SELECT salary, ntile(2) over (order by salary) half FROM employees) a where half = 1;

          the second uses significantly less memory in some situations (e.g. group by with many groups)

          Show
          szama_impala_6295 Marcell Szabo added a comment - a possible workaround: the below two both give an approximate median: SELECT appx_median(salary) from employees; SELECT max(salary) from (SELECT salary, ntile(2) over (order by salary) half FROM employees) a where half = 1; the second uses significantly less memory in some situations (e.g. group by with many groups)

            People

            • Assignee:
              tarasbob Taras Bobrovytsky
              Reporter:
              szama_impala_6295 Marcell Szabo
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development