Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      HIVE-61. Implement Group-By. (Namit Jain via zshao)

      Description

      ORDER BY is in the query language reference but currently is a no-op. We should make it an op.

      1. hive.61.1.patch
        16 kB
        Namit Jain
      2. hove.61.2.patch
        16 kB
        Namit Jain

        Issue Links

          Activity

          Hide
          Zheng Shao added a comment -

          We recently added the "SORT BY" clause which sorts the data in each reducer. An example query is:
          insert overwrite table table2 select city, state where city = 'Chicago' from table sort by state;

          If you set number of reducers to 1, then "sort by" will have the same result as "order by" (Do trim down the data size first - otherwise it will be very slow).

          "ORDER BY" is not supported yet but we have a plan to support it shortly. The implementation of order by in our mind will be based on sort by: we run the query with sort by, and then mark the table as sorted with these columns in the table meta data.
          Then we will be able to "merge" the sorted files from each reducer and produce a total order.

          Show
          Zheng Shao added a comment - We recently added the "SORT BY" clause which sorts the data in each reducer. An example query is: insert overwrite table table2 select city, state where city = 'Chicago' from table sort by state; If you set number of reducers to 1, then "sort by" will have the same result as "order by" (Do trim down the data size first - otherwise it will be very slow). "ORDER BY" is not supported yet but we have a plan to support it shortly. The implementation of order by in our mind will be based on sort by: we run the query with sort by, and then mark the table as sorted with these columns in the table meta data. Then we will be able to "merge" the sorted files from each reducer and produce a total order.
          Hide
          Zheng Shao added a comment -

          Some more details about the implementation of ORDER BY:

          We will store on what columns a table (or a partition) is partitioned, and sorted (in asc/desc order), right after we insert data into that table. We will also store whether the table is ordered or not.

          If a table/partition is ordered and if we do a select (that outputs the data to console) from that table, then we will need to merge-sort all files in that table/partition based on the sort order.

          Show
          Zheng Shao added a comment - Some more details about the implementation of ORDER BY: We will store on what columns a table (or a partition) is partitioned, and sorted (in asc/desc order), right after we insert data into that table. We will also store whether the table is ordered or not. If a table/partition is ordered and if we do a select (that outputs the data to console) from that table, then we will need to merge-sort all files in that table/partition based on the sort order.
          Hide
          Zheng Shao added a comment -

          Most of the use cases with total ordering is to get the top 10.

          For getting the top 10, the current work-around is:

          First store the top 10 from each partition to some temp table:
          INSERT OVERWRITE tableB
          REDUCE a.*
          USING 'head -n 10'
          AS (col1, col2, col3, col4, ...)
          FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a

          Second, set the #reducer to 1 and get the top 10 globally.
          set mapred.reduce.tasks=1;
          SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10

          Show
          Zheng Shao added a comment - Most of the use cases with total ordering is to get the top 10. For getting the top 10, the current work-around is: First store the top 10 from each partition to some temp table: INSERT OVERWRITE tableB REDUCE a.* USING 'head -n 10' AS (col1, col2, col3, col4, ...) FROM (SELECT * FROM tableA SORT BY col3 DESC, col4 ASC) a Second, set the #reducer to 1 and get the top 10 globally. set mapred.reduce.tasks=1; SELECT * FROM tableB SORT BY col3 DESC, col4 ASC LIMIT 10
          Hide
          Adam Kramer added a comment -

          While this issue is open, it would be lovely to have Hive throw a syntax error when a user asks it to ORDER BY...lots of people are using it and being unhappy/confused when it fails.

          Show
          Adam Kramer added a comment - While this issue is open, it would be lovely to have Hive throw a syntax error when a user asks it to ORDER BY...lots of people are using it and being unhappy/confused when it fails.
          Hide
          Jeff Hammerbacher added a comment -

          +1 to a verbose comment when the user attempts an ORDER BY.

          Show
          Jeff Hammerbacher added a comment - +1 to a verbose comment when the user attempts an ORDER BY.
          Hide
          Zheng Shao added a comment -

          Nit: Can you move the definition of numReducers inside the "if (qbp.getClusterByForClause(dest) != null ... "? In case the "if" is not executed, the variables don't get used, so they should not be exposed outside the "if"s.

          There are 2 places that you assign values to extraMRStep - I think it's cleaner to split them into 2 different variables, each defined and assigned in the "then" and "else" clause of "if (qbp.getIsSubQ())". That makes the logic cleaner. What do you think?

          @@ -2944,21 +2958,36 @@
                 curr = genSelectPlan(dest, qb, curr);
                 Integer limit = qbp.getDestLimit(dest);
           
          +      boolean extraMRStep = true;
          +      int numReducers = -1;
          +      if (qbp.getOrderByForClause(dest) != null) {
          +        numReducers = 1;
          +        extraMRStep = false;
          +      }
          +
                 if (qbp.getClusterByForClause(dest) != null
                     || qbp.getDistributeByForClause(dest) != null
          +          || qbp.getOrderByForClause(dest) != null
                     || qbp.getSortByForClause(dest) != null) {
          -        curr = genReduceSinkPlan(dest, qb, curr, -1);
          +        curr = genReduceSinkPlan(dest, qb, curr, numReducers);
                 }
           
                 if (qbp.getIsSubQ()) {
                   if (limit != null) {
          -          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), false);
          +          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep);
                   }
                 } else {
                   curr = genConversionOps(dest, qb, curr);
                   // exact limit can be taken care of by the fetch operator
                   if (limit != null) {
          -          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), qb.getIsQuery());
          +          if (qb.getIsQuery() &&
          +              qbp.getClusterByForClause(dest) == null &&
          +              qbp.getSortByForClause(dest) == null)
          +            extraMRStep = false;
          +          else
          +            extraMRStep = true;
          +
          +          curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep);
                     qb.getParseInfo().setOuterQueryLimit(limit.intValue());
                   }
                   curr = genFileSinkPlan(dest, qb, curr);
          
          Show
          Zheng Shao added a comment - Nit: Can you move the definition of numReducers inside the "if (qbp.getClusterByForClause(dest) != null ... "? In case the "if" is not executed, the variables don't get used, so they should not be exposed outside the "if"s. There are 2 places that you assign values to extraMRStep - I think it's cleaner to split them into 2 different variables, each defined and assigned in the "then" and "else" clause of "if (qbp.getIsSubQ())". That makes the logic cleaner. What do you think? @@ -2944,21 +2958,36 @@ curr = genSelectPlan(dest, qb, curr); Integer limit = qbp.getDestLimit(dest); + boolean extraMRStep = true ; + int numReducers = -1; + if (qbp.getOrderByForClause(dest) != null ) { + numReducers = 1; + extraMRStep = false ; + } + if (qbp.getClusterByForClause(dest) != null || qbp.getDistributeByForClause(dest) != null + || qbp.getOrderByForClause(dest) != null || qbp.getSortByForClause(dest) != null ) { - curr = genReduceSinkPlan(dest, qb, curr, -1); + curr = genReduceSinkPlan(dest, qb, curr, numReducers); } if (qbp.getIsSubQ()) { if (limit != null ) { - curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), false ); + curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep); } } else { curr = genConversionOps(dest, qb, curr); // exact limit can be taken care of by the fetch operator if (limit != null ) { - curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), qb.getIsQuery()); + if (qb.getIsQuery() && + qbp.getClusterByForClause(dest) == null && + qbp.getSortByForClause(dest) == null ) + extraMRStep = false ; + else + extraMRStep = true ; + + curr = genLimitMapRedPlan(dest, qb, curr, limit.intValue(), extraMRStep); qb.getParseInfo().setOuterQueryLimit(limit.intValue()); } curr = genFileSinkPlan(dest, qb, curr);
          Hide
          Namit Jain added a comment -

          incorporated zheng's comments

          Show
          Namit Jain added a comment - incorporated zheng's comments
          Hide
          Zheng Shao added a comment -

          Committed to both branch-0.3 and trunk. Thanks Namit!

          Show
          Zheng Shao added a comment - Committed to both branch-0.3 and trunk. Thanks Namit!

            People

            • Assignee:
              Zheng Shao
              Reporter:
              Jeff Hammerbacher
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development