Hive / HIVE-6247

select count(distinct) should be MRR in Tez

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.13.0
    • Fix Version/s: None
    • Component/s: Tez
    • Labels:
      None

      Description

      The MR query plan for "select count(distinct)" launches multiple reducers, with a local work task performing the final aggregation.

      The Tez plan launches exactly one reducer for the entire data set, and that reducer either fails or slows the query down massively.

      To reproduce on a TPC-DS database (the query itself is meaningless):

      select count(distinct ss_net_profit) from store_sales ss join store s on ss.ss_store_sk = s.s_store_sk;
      

      This spins up Map 1 and Map 2 (for the dimension table and the fact table) and Reducer 1, which always shows "0/1", i.e. a single reducer task.

        Activity

        Gunther Hagleitner added a comment -

        Dug into this a little bit. I think the idea makes good sense, but the description of the MR behavior is not correct. At least, I wasn't able to make MR use more than a single reducer for the query cited. You can, however, rewrite the query with a subquery to get the result you want.

        There are two more flags to consider (when rewriting):

        a) hive.optimize.reducededuplication.min.reducer

        If this is set to 1 you will have a single reducer regardless of rewrite.

        b) hive.fetch.task.aggr

        If this one is true, the final count will happen on the client. This matters more in MR than in Tez, because in MR it would otherwise start a new job, whereas in Tez it is just another stage in the DAG.
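
        The subquery rewrite is not spelled out in the comment. A minimal sketch of what it might look like for the reproduction query follows. The alias t and the specific setting values are assumptions chosen for illustration, not values taken from this issue.

```sql
-- Illustrative values only (assumptions): per the comment, a value of 1 for
-- min.reducer would force a single reducer regardless of the rewrite.
set hive.optimize.reducededuplication.min.reducer=4;
-- Per the comment, true moves the final count to the client.
set hive.fetch.task.aggr=true;

-- Deduplicate in a subquery, then count the surviving rows.
-- count(ss_net_profit) skips NULL, matching count(distinct ss_net_profit).
select count(ss_net_profit)
from (
  select distinct ss.ss_net_profit
  from store_sales ss
  join store s on ss.ss_store_sk = s.s_store_sk
) t;  -- "t" is a hypothetical alias for the subquery
```

        The idea is that the inner select distinct can be spread across several reducers, leaving only a cheap final count over already-deduplicated values. Whether the planner actually does this depends on the two settings above.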


          People

          • Assignee:
            Gunther Hagleitner
            Reporter:
            Gopal V
          • Votes:
            0
            Watchers:
            2
