Hive / HIVE-12491

Improve ndv heuristic for functions


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Statistics
    • Labels: None

    Description

      The eased-out denominator has to detect duplicate row statistics that come from different attributes, for example two join-key expressions derived from the same column.

      select account_id from customers c,  customer_activation ca
        where c.customer_id = ca.customer_id
        and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
        and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
      
        private Long getEasedOutDenominator(List<Long> distinctVals) {
          // Exponential back-off for NDVs.
          // 1) Descending order sort of NDVs
          // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ....
          Collections.sort(distinctVals, Collections.reverseOrder());

          long denom = distinctVals.get(0);
          for (int i = 1; i < distinctVals.size(); i++) {
            denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
          }

          return denom;
        }
      

      For the query above, this method receives [8007986, 821974390, 821974390]: the NDVs of three key columns, two of which are derived from the same column.
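      To make the arithmetic concrete, here is a small standalone sketch that applies the same back-off to the three NDVs quoted above and prints the factor each entry contributes. The class name and the termFactor helper are illustrative and not part of Hive. After the descending sort, the duplicated 821974390 lands in second position and so contributes its square root, about 28,670, which is plausibly the "factor of 30,000 or so" mentioned in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: a standalone copy of the back-off formula, not Hive code.
public class EasedOutDenominatorDemo {

  // Sort descending, then denominator = NDV1 * NDV2^(1/2) * NDV3^(1/4) * ...
  static long easedOutDenominator(List<Long> ndvs) {
    List<Long> sorted = new ArrayList<>(ndvs); // copy so the caller's list is not reordered
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  // Factor contributed by an NDV that ends up at 0-based position i after sorting.
  static double termFactor(long ndv, int i) {
    return Math.pow(ndv, 1.0 / (1 << i));
  }

  public static void main(String[] args) {
    List<Long> ndvs = Arrays.asList(8007986L, 821974390L, 821974390L);
    System.out.println("denominator       = " + easedOutDenominator(ndvs));
    System.out.println("factor of 2nd NDV = " + termFactor(821974390L, 1));
    System.out.println("factor of 3rd NDV = " + termFactor(8007986L, 2));
  }
}
```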

              Reduce Output Operator (RS_12)
                key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
                sort order: +++
                Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
                value expressions: _col1 (type: bigint)
                Join Operator (JOIN_13)
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
                    1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
                  outputColumnNames: _col3
      

      As a result, the eased-out denominator is off by a factor of roughly 30,000, which causes OOMs in map-joins.
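      One way to picture "detecting duplicates" is to keep a single NDV per source column before applying the back-off. The sketch below is only an illustration of that idea and is not taken from the attached patches, whose contents are not shown on this page. The KeyNdv type, the sourceColumn tag, and dedupBySourceColumn are all hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse NDVs of key expressions that share a source column.
public class DedupNdvSketch {

  // Hypothetical pairing of a key expression's NDV with the column it derives from.
  static final class KeyNdv {
    final String sourceColumn;
    final long ndv;

    KeyNdv(String sourceColumn, long ndv) {
      this.sourceColumn = sourceColumn;
      this.ndv = ndv;
    }
  }

  // Keep one NDV (the largest) per source column, preserving first-seen order.
  static List<Long> dedupBySourceColumn(List<KeyNdv> keys) {
    Map<String, Long> perColumn = new LinkedHashMap<>();
    for (KeyNdv k : keys) {
      perColumn.merge(k.sourceColumn, k.ndv, Long::max);
    }
    return new ArrayList<>(perColumn.values());
  }

  // Same exponential back-off as in the description.
  static long easedOutDenominator(List<Long> ndvs) {
    List<Long> sorted = new ArrayList<>(ndvs);
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  public static void main(String[] args) {
    // Keys from the plan above: _col0, year(_col2), month(_col2).
    List<KeyNdv> keys = Arrays.asList(
        new KeyNdv("_col0", 8007986L),
        new KeyNdv("_col2", 821974390L),
        new KeyNdv("_col2", 821974390L));
    List<Long> deduped = dedupBySourceColumn(keys);
    System.out.println("NDVs after dedup = " + deduped);
    System.out.println("denominator      = " + easedOutDenominator(deduped));
  }
}
```

      With the duplicate removed, only two NDVs enter the formula, so the denominator is much smaller than the one computed from all three entries. Whether taking the maximum per source column is the right estimate for function-derived keys such as year() and month() is a separate question that this sketch does not settle.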

      Attachments

        1. HIVE-12491.WIP.patch
          6 kB
          Gopal Vijayaraghavan
        2. HIVE-12491.patch
          3 kB
          Ashutosh Chauhan
        3. HIVE-12491.5.patch
          57 kB
          Ashutosh Chauhan
        4. HIVE-12491.4.patch
          57 kB
          Ashutosh Chauhan
        5. HIVE-12491.3.patch
          36 kB
          Ashutosh Chauhan
        6. HIVE-12491.2.patch
          28 kB
          Ashutosh Chauhan

            People

              Assignee: Ashutosh Chauhan (ashutoshc)
              Reporter: Gopal Vijayaraghavan (gopalv)
              Votes: 0
              Watchers: 4