Hive
  1. Hive
  2. HIVE-1694

Accelerate GROUP BY execution using indexes

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: Indexing, Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The index building patch (Hive-417) is checked into trunk, this JIRA issue tracks supporting indexes in Hive compiler & execution engine for SELECT queries.

      This is in ref. to John's comment at

      https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869

      on creating separate JIRA issue for tracking index usage in optimizer & query execution.

      The aim of this effort is to use indexes to accelerate query execution (for certain class of queries). E.g.

      • Filters and range scans (already being worked on by He Yongqiang as part of HIVE-417?)
      • Joins (index based joins)
      • Group By, Order By and other misc cases

      The proposal is multi-step:
      1. Building index based operators, compiler and execution engine changes
      2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose between index scans, full table scans etc.)

      This JIRA initially focuses on the first step. This JIRA is expected to hold the information about index based plans & operator implementations for above mentioned cases.

      1. HIVE-1694.7.patch
        369 kB
        Prajakta Kalmegh
      2. HIVE-1694.7.patch
        369 kB
        Prajakta Kalmegh
      3. HIVE-1694.6.patch
        363 kB
        Prajakta Kalmegh
      4. HIVE-1694.5.patch
        268 kB
        Prajakta Kalmegh
      5. HIVE-1694.4.patch
        114 kB
        Prajakta Kalmegh
      6. HIVE-1694.3.patch.txt
        281 kB
        Prajakta Kalmegh
      7. HIVE-1694.2.patch.txt
        255 kB
        Prajakta Kalmegh
      8. HIVE-1694.1.patch.txt
        259 kB
        Prajakta Kalmegh
      9. demo_q2.hql
        2 kB
        Nikhil Deshpande
      10. demo_q1.hql
        0.5 kB
        Nikhil Deshpande
      11. HIVE-1694_2010-10-28.diff
        184 kB
        Nikhil Deshpande

        Issue Links

          Activity

          Hide
          John Sichi added a comment -

          Note that we have a team of students from Harvey Mudd planning to work on this one.

          Show
          John Sichi added a comment - Note that we have a team of students from Harvey Mudd planning to work on this one.
          Hide
          Nikhil Deshpande added a comment -

          This is a patch to demonstrate query performance gains using indexes
          (added in HIVE-417). The patch is over latest hive trunk.

          ChangeLog for the patch:

          • Implements a new rewrite for a certain set of queries with GROUP BY to speed those queries by running them on index data instead of base table.
          • Implements a skeleton generic rewrite engine.
          • Implements the rewrite rule for a GroupBy queries set (mentioned above). More details in the class comment GbToCompactSumIdxRewrite.
          • Rewrite needs to be currently explicitly enabled with a flag hive.ql.rw.gb_to_idx.
          • Modifies metastore & metadata API for getting some index info.
          • Modifies QB metadata & parseblock code to add some rewrite assist methods.
          • Inserts a rewrite hook into Semantic Analyzer.
          • Fixes a bug in ql QTestUtil to clean-up indexed tables properly
          • Contains new test for Group By rewrite using indexes: ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q

          Quick performance test results on a very small Hadoop cluster:

          2 queries (chosen to demonstrate perf gains) run on TPC-H benchmark data lineitem table.

          Timings in seconds, data set size (1M, 1G etc.) is TPC-H scale factor.

          -----------------------------------------------
                         1M      1G       10G      30G 
          -----------------------------------------------
            q1_no_idx  24.161   76.790  506.005  1551.555
          q1_with_idx  21.268   27.292   35.502    86.133
          -----------------------------------------------
            q1_no_idx  73.660  130.587  764.619  2146.423
          q2_with_idx  69.393   75.493   92.867   190.619
          -----------------------------------------------
          

          Hadoop cluster description used for above perf test:

          • 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
          • 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not partitioned and clustered, Hive tables stored in row-store format, HDFS replication factor: 2
          • Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE:4GB RAM)
          • Queries on TPC-H Data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180Million tuples)

          These changes are being maintained at http://github.com/prafullat/hive

          Show
          Nikhil Deshpande added a comment - This is a patch to demonstrate query performance gains using indexes (added in HIVE-417 ). The patch is over latest hive trunk. ChangeLog for the patch: Implements a new rewrite for a certain set of queries with GROUP BY to speed those queries by running them on index data instead of base table. Implements a skeleton generic rewrite engine. Implements the rewrite rule for a GroupBy queries set (mentioned above). More details in the class comment GbToCompactSumIdxRewrite. Rewrite needs to be currently explicitly enabled with a flag hive.ql.rw.gb_to_idx. Modifies metastore & metadata API for getting some index info. Modifies QB metadata & parseblock code to add some rewrite assist methods. Inserts a rewrite hook into Semantic Analyzer. Fixes a bug in ql QTestUtil to clean-up indexed tables properly Contains new test for Group By rewrite using indexes: ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q Quick performance test results on a very small Hadoop cluster: 2 queries (chosen to demonstrate perf gains) run on TPC-H benchmark data lineitem table. Timings in seconds, data set size (1M, 1G etc.) is TPC-H scale factor. ----------------------------------------------- 1M 1G 10G 30G ----------------------------------------------- q1_no_idx 24.161 76.790 506.005 1551.555 q1_with_idx 21.268 27.292 35.502 86.133 ----------------------------------------------- q1_no_idx 73.660 130.587 764.619 2146.423 q2_with_idx 69.393 75.493 92.867 190.619 ----------------------------------------------- Hadoop cluster description used for above perf test: 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM) 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not partitioned and clustered, Hive tables stored in row-store format, HDFS replication factor: 2 Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE:4GB RAM) Queries on TPC-H Data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180Million tuples) These changes are being maintained at http://github.com/prafullat/hive
          Hide
          Nikhil Deshpande added a comment -

          Patch mentioned for above comment.

          Show
          Nikhil Deshpande added a comment - Patch mentioned for above comment.
          Hide
          Nikhil Deshpande added a comment -

          Example/demo queries used for testing performance gains.

          Show
          Nikhil Deshpande added a comment - Example/demo queries used for testing performance gains.
          Hide
          Prafulla T added a comment -

          Here is one presentation we(Me and Nikhil) had at Persistent Systems Ltd, regarding this changes.
          http://www.slideshare.net/NikhilDeshpande/indexed-hive

          Show
          Prafulla T added a comment - Here is one presentation we(Me and Nikhil) had at Persistent Systems Ltd, regarding this changes. http://www.slideshare.net/NikhilDeshpande/indexed-hive
          Hide
          John Sichi added a comment -

          Hey guys, I haven't gone through all the code yet, but reading through the slides just now, there's one problem I should point out with using the existing compact indexes for aggregate rewrite.

          Namely, we store only the distinct block offsets, not the distinct row offsets. So, if the same key appears more than once within the same block, you'll get the wrong answer for COUNT. One way to address this would be to compute the COUNT per index entry at the time we are building the index, and then SUM that later for aggregation. But currently the compact index does not store that, so we would need to add it as a new index type.

          One smaller item is that for the DISTINCT rewrite (slide 10), you still need to keep a DISTINCT on the rewritten query since the same l_shipdate may be repeated in the index table if it appears in multiple buckets.

          Show
          John Sichi added a comment - Hey guys, I haven't gone through all the code yet, but reading through the slides just now, there's one problem I should point out with using the existing compact indexes for aggregate rewrite. Namely, we store only the distinct block offsets, not the distinct row offsets. So, if the same key appears more than once within the same block, you'll get the wrong answer for COUNT. One way to address this would be to compute the COUNT per index entry at the time we are building the index, and then SUM that later for aggregation. But currently the compact index does not store that, so we would need to add it as a new index type. One smaller item is that for the DISTINCT rewrite (slide 10), you still need to keep a DISTINCT on the rewritten query since the same l_shipdate may be repeated in the index table if it appears in multiple buckets.
          Hide
          Namit Jain added a comment -

          Had an offline discussion with John just now - I think he is also giving similar comments, so I will keep it very brief.

          One concern is that all the optimizations should be at the operator level, I mean - this should be treated as just another optimization.
          As you mentioned in your presentation correctly, Hive does not support cost-based optimizer currently, and it will
          require all the optimizations to be consolidated in one place to help move to that model.

          We are thinking about moving the group by skews also in the optimizer (instead of the current approach in
          SemanticAnalyzer). Once all the optimizations are in a central place, it will be much easier to move to costing.

          The Harvey Mudd folks currently are not looking at Group By optimizations for indexing, so, this will be extremely
          useful for the whole community.

          Show
          Namit Jain added a comment - Had an offline discussion with John just now - I think he is also giving similar comments, so I will keep it very brief. One concern is that all the optimizations should be at the operator level, I mean - this should be treated as just another optimization. As you mentioned in your presentation correctly, Hive does not support cost-based optimizer currently, and it will require all the optimizations to be consolidated in one place to help move to that model. We are thinking about moving the group by skews also in the optimizer (instead of the current approach in SemanticAnalyzer). Once all the optimizations are in a central place, it will be much easier to move to costing. The Harvey Mudd folks currently are not looking at Group By optimizations for indexing, so, this will be extremely useful for the whole community.
          Hide
          John Sichi added a comment - - edited

          +1 to what Namit said. Doing the rewrites at the relational algebra Operator level (similar to the way optimizer transformations such as predicate pushdown already work) will have two big advantages:

          • more general (syntax-independent)
          • much easier to maintain (as you noted in your presentation, the SemanticAnalyzer data structures can be very difficult to analyze and manipulate, whereas the Operator structures are a lot cleaner)

          BTW, thanks for the very clear explanation of the work you've done so far.

          Show
          John Sichi added a comment - - edited +1 to what Namit said. Doing the rewrites at the relational algebra Operator level (similar to the way optimizer transformations such as predicate pushdown already work) will have two big advantages: more general (syntax-independent) much easier to maintain (as you noted in your presentation, the SemanticAnalyzer data structures can be very difficult to analyze and manipulate, whereas the Operator structures are a lot cleaner) BTW, thanks for the very clear explanation of the work you've done so far.
          Hide
          Prajakta Kalmegh added a comment -

          Hi,

          I am Prajakta from Persistent Systems Ltd. and am working on the changes that John and Namit have suggested above along with Nikhil and Prafulla.
          This is a design note about implementation of above review comments.

          We're implementing the following changes as a single transformation in optimizer:
          (1) Table replacement: involves modification of some internal members of TableScanOperator.
          (2) Group by removal: involves removal of some operators (GBY-RS-GBY) where GBY is done at both mapper-reducer side; and re-setting of correct parent and child operators within the DAG.
          (3) Sub-query insertion: involves creation of new DAG for sub-query and attaching it to the original DAG at an appropriate place.
          (4) Projection modification: involves steps similar to (3).

          We have implemented the above changes as a proof of concept. In this implementation, we have invoked this rule as the first transformation in the optimizer code so that lineage information is computed later as part of the Generator transformation. Another reason that we have applied this as the first transformation is that, as of now, the implementation uses the query block (QB) information to decide if the transformation can be applied for the input query (similar to the canApplyThisRule() method in the original rewrite code). Finally, to make the changes (3) and (4), we are modifying the operator DAG. However, we are not modifying the original query block (QB). Hence, this leaves the QB and the operator DAG out of sync.

          The major issues in this implementation approach are:
          1. The changes listed above require either modification of operator DAG (in case of 2) or creation of new operator DAG(in case of 3 and 4). The implementation requires some hacks in the SemanticAnalyzer code if we create a new DAG (as in the case of replaceViewReferenceWithDefinition() method which uses ParseDriver() to do the same). However, the methods are private (like genBodyPlan(...), genSelectPlan(...) etc) making it all the more difficult to implement changes (3) and (4) without access to these methods.
          2. The creation of new DAG will require creating all associated data structures like QB, ASTNode etc as this information is necessary to generate DAG operator plan for the sub-queries. This approach would be very similar to what we are already doing in rewrite i.e creating new QB and ASTNode.
          3. Since we are creating a new DAG and appending it to the enclosing query DAG, we also need to modify the row schema and row resolvers for the operators.

          One of the questions that underlies before finalizing the above approach is whether the cost-based optimizer, which is to be implemented in the future, will work on the query block or on the DAG operator tree. In case it works on the operator DAG, then the implementation changes we listed here are bound to be done. However, if the cost-based optimizer is to work on the query block, then we feel that the initial query rewrite engine code which worked after semantic analysis but before plan generation can be made to work with the cost-based optimizer. It will be a valuable input from your side if you could comment on the cost-based optimizer.

          Show
          Prajakta Kalmegh added a comment - Hi, I am Prajakta from Persistent Systems Ltd. and am working on the changes that John and Namit have suggested above along with Nikhil and Prafulla. This is a design note about implementation of above review comments. We're implementing the following changes as a single transformation in optimizer: (1) Table replacement: involves modification of some internal members of TableScanOperator. (2) Group by removal: involves removal of some operators (GBY-RS-GBY) where GBY is done at both mapper-reducer side; and re-setting of correct parent and child operators within the DAG. (3) Sub-query insertion: involves creation of new DAG for sub-query and attaching it to the original DAG at an appropriate place. (4) Projection modification: involves steps similar to (3). We have implemented the above changes as a proof of concept. In this implementation, we have invoked this rule as the first transformation in the optimizer code so that lineage information is computed later as part of the Generator transformation. Another reason that we have applied this as the first transformation is that, as of now, the implementation uses the query block (QB) information to decide if the transformation can be applied for the input query (similar to the canApplyThisRule() method in the original rewrite code). Finally, to make the changes (3) and (4), we are modifying the operator DAG. However, we are not modifying the original query block (QB). Hence, this leaves the QB and the operator DAG out of sync. The major issues in this implementation approach are: 1. The changes listed above require either modification of operator DAG (in case of 2) or creation of new operator DAG(in case of 3 and 4). The implementation requires some hacks in the SemanticAnalyzer code if we create a new DAG (as in the case of replaceViewReferenceWithDefinition() method which uses ParseDriver() to do the same). However, the methods are private (like genBodyPlan(...), genSelectPlan(...) etc) making it all the more difficult to implement changes (3) and (4) without access to these methods. 2. The creation of new DAG will require creating all associated data structures like QB, ASTNode etc as this information is necessary to generate DAG operator plan for the sub-queries. This approach would be very similar to what we are already doing in rewrite i.e creating new QB and ASTNode. 3. Since we are creating a new DAG and appending it to the enclosing query DAG, we also need to modify the row schema and row resolvers for the operators. One of the questions that underlies before finalizing the above approach is whether the cost-based optimizer, which is to be implemented in the future, will work on the query block or on the DAG operator tree. In case it works on the operator DAG, then the implementation changes we listed here are bound to be done. However, if the cost-based optimizer is to work on the query block, then we feel that the initial query rewrite engine code which worked after semantic analysis but before plan generation can be made to work with the cost-based optimizer. It will be a valuable input from your side if you could comment on the cost-based optimizer.
          Hide
          John Sichi added a comment -

          I talked to Namit, and he thinks there should be no relevant dependencies on the QB once we start on optimization, so letting it get out of sync with the operator DAG may not be an issue. (I scanned the code in optimizer, and it seems like a few dependencies have crept in, but only for special cases like ANALYZE.)

          For issue #1, you are proposing what I'll call the "internal SQL" approach, which is to construct an internal SQL expression (either in string or ASTNode form) and then partially analyze that (via SemanticAnalyzer), producing an operator DAG to be spliced into the main one. For this approach, we would need to figure out how to make the relevant phases of SemanticAnalyzer modularized and invocable.

          Alternately, the "direct construction" approach would be to attempt to construct the new operator subgraph directly via custom code targeted to the specific patterns you generate, and then splice that in.

          I'm not sure which approach is better; Namit, any opinions? The internal SQL approach definitely seems the most appropriate for the WHERE clause work being done by the Harvey Mudd team, since it produces a self-contained job to be run to produce the temp table containing the filtered block list. But for GROUP BY, the direct construction approach may be cleaner.

          For issue #2, it seems like this would happen automatically for the internal SQL approach (but this could also pollute the SemanticAnalyzer state to some extent). The direct construction approach is the opposite: it avoids polluting SemanticAnalyzer, but still might require modularizing some SemanticAnalyzer calls, e.g. for generating and registering the necessary aliases for index tables.

          Regarding issue #3, that's already true for other optimizations such as projection pushdown (ColumnPruner), which modifies operator row schemas/resolvers; see for example ColumnPrunerProcFactory.pruneJoinOperator. So there shouldn't be anything new here.

          Regarding the need to run your transformation first, it would be best to avoid this since a more advanced optimizer may want freedom in reordering transformations. So instead of relying on information from the QB, analyze the relevant operator subgraph to decide whether your transformation is applicable. This is the approach we expect to require for cost-based optimization.

          Also, note that from a lineage perspective, it makes more sense for lineage to be derived prior to index transformation rather than subsequently. Someone examining the lineage associated with an ETL job would typically be more interested in the logical source table from which it pulls (rather than from a physical index).

          Show
          John Sichi added a comment - I talked to Namit, and he thinks there should be no relevant dependencies on the QB once we start on optimization, so letting it get out of sync with the operator DAG may not be an issue. (I scanned the code in optimizer, and it seems like a few dependencies have crept in, but only for special cases like ANALYZE.) For issue #1, you are proposing what I'll call the "internal SQL" approach, which is to construct an internal SQL expression (either in string or ASTNode form) and then partially analyze that (via SemanticAnalyzer), producing an operator DAG to be spliced into the main one. For this approach, we would need to figure out how to make the relevant phases of SemanticAnalyzer modularized and invocable. Alternately, the "direct construction" approach would be to attempt to construct the new operator subgraph directly via custom code targeted to the specific patterns you generate, and then splice that in. I'm not sure which approach is better; Namit, any opinions? The internal SQL approach definitely seems the most appropriate for the WHERE clause work being done by the Harvey Mudd team, since it produces a self-contained job to be run to produce the temp table containing the filtered block list. But for GROUP BY, the direct construction approach may be cleaner. For issue #2, it seems like this would happen automatically for the internal SQL approach (but this could also pollute the SemanticAnalyzer state to some extent). The direct construction approach is the opposite: it avoids polluting SemanticAnalyzer, but still might require modularizing some SemanticAnalyzer calls, e.g. for generating and registering the necessary aliases for index tables. Regarding issue #3, that's already true for other optimizations such as projection pushdown (ColumnPruner), which modifies operator row schemas/resolvers; see for example ColumnPrunerProcFactory.pruneJoinOperator. So there shouldn't be anything new here. Regarding the need to run your transformation first, it would be best to avoid this since a more advanced optimizer may want freedom in reordering transformations. So instead of relying on information from the QB, analyze the relevant operator subgraph to decide whether your transformation is applicable. This is the approach we expect to require for cost-based optimization. Also, note that from a lineage perspective, it makes more sense for lineage to be derived prior to index transformation rather than subsequently. Someone examining the lineage associated with an ETL job would typically be more interested in the logical source table from which it pulls (rather than from a physical index).
          Hide
          Namit Jain added a comment -

          I think having a mechanism which lets is issue "internal" or "recursive" sql is better in the long term.
          That is something we will need anyway for future optimizations.

          We can create a thin API around SemanticAnalyzer (analyze etc.), which is indirectly present in Driver.
          Another implementation of that API can be the internal API, say RecursiveDriver.
          In a recursive context, you are only allowed to invoke RecursiveDriver.
          External Clients (CliDriver, HiveServer etc.) invoke Driver directly.

          As John said, definitely keep your optimizations pluggable. Currently, they are invoked as rule-based,
          but should be flexible enough to be invoked based on some costs in the future.

          Show
          Namit Jain added a comment - I think having a mechanism which lets is issue "internal" or "recursive" sql is better in the long term. That is something we will need anyway for future optimizations. We can create a thin API around SemanticAnalyzer (analyze etc.), which is indirectly present in Driver. Another implementation of that API can be the internal API, say RecursiveDriver. In a recursive context, you are only allowed to invoke RecursiveDriver. External Clients (CliDriver, HiveServer etc.) invoke Driver directly. As John said, definitely keep your optimizations pluggable. Currently, they are invoked as rule-based, but should be flexible enough to be invoked based on some costs in the future.
          Hide
          Prajakta Kalmegh added a comment -

          Thanks to both of you for your comments on our proposed design. Since the last post, we have been working on the code changes as per your comments. The progress has been in the following areas:
          1) Removed the dependency for our optimizer to be the first one. It can now be used as any other optimizer by adding it to "transformations" list.
          2) Implemented changes to re-structure the operator DAG plan for group-by queries.
          3) We have removed the dependency of our optimization to read data from QB(query block) as it used to do earlier to check if the optimization can be applied before proceeding to apply the re-write. (See canApply() method in the original rewrite code.)
          4) Regarding issue #3 (from my original post), as per John's suggestion, the change for modification of operator row schemas/resolvers are done smoothly wherever applicable.
          5) We have completed testing the new implementation for simple group-by cases. Also, the code to append a sub-query to original DAG is implemented separately as of now. This needs to be integrated as part of our optimization.

          The only issue that will be pending post this implementation will be regarding John's post on Nov 1st stating "...we store only the distinct block offsets, not the distinct row offsets.....". We plan to work on this once the current implementation is tested end-to-end. You can expect the update on this in a couple of weeks.

          Show
          Prajakta Kalmegh added a comment - Thanks to both of you for your comments on our proposed design. Since the last post, we have been working on the code changes as per your comments. The progress has been in the following areas: 1) Removed the dependency for our optimizer to be the first one. It can now be used as any other optimizer by adding it to "transformations" list. 2) Implemented changes to re-structure the operator DAG plan for group-by queries. 3) We have removed the dependency of our optimization to read data from QB(query block) as it used to do earlier to check if the optimization can be applied before proceeding to apply the re-write. (See canApply() method in the original rewrite code.) 4) Regarding issue #3 (from my original post), as per John's suggestion, the change for modification of operator row schemas/resolvers are done smoothly wherever applicable. 5) We have completed testing the new implementation for simple group-by cases. Also, the code to append a sub-query to original DAG is implemented separately as of now. This needs to be integrated as part of our optimization. The only issue that will be pending post this implementation will be regarding John's post on Nov 1st stating "...we store only the distinct block offsets, not the distinct row offsets.....". We plan to work on this once the current implementation is tested end-to-end. You can expect the update on this in a couple of weeks.
          Hide
          John Sichi added a comment -

          Thanks Prajakta. Let us know once you have new code ready to review. HIVE-1644 is going to need the internal SQL support too, so I'd like to make sure that as much as possible is reusable there.

          Show
          John Sichi added a comment - Thanks Prajakta. Let us know once you have new code ready to review. HIVE-1644 is going to need the internal SQL support too, so I'd like to make sure that as much as possible is reusable there.
          Hide
          John Sichi added a comment -

          Is there an updated tree for this? I checked github but didn't see it. HIVE-1644 needs support for compiling internally-generated SQL into operators, so if you have that working, I'd like to point the Harvey Mudd folks at it when I talk to them tomorrow.

          Show
          John Sichi added a comment - Is there an updated tree for this? I checked github but didn't see it. HIVE-1644 needs support for compiling internally-generated SQL into operators, so if you have that working, I'd like to point the Harvey Mudd folks at it when I talk to them tomorrow.
          Hide
          Prajakta Kalmegh added a comment -

          Hey John,

          We are maintaining the latest code in github on 'prj' branch. You can access it here:
          <https://github.com/prafullat/hive/tree/prj>

          We have created a 'RewriteParseContextGenerator' class in the org.apache.hadoop.hive.ql.optimizer package to return a ParseContext instance (which contains all the information on the operator tree) when given a string command (query) as input. Since we only need the operator tree for our execution, we have created this basic utility class for our code.

          Rest of the code is being internally reviewed by the team right now. We are expecting it to be completed by end of this week. We will update you once the code is ready for your review.

          Show
          Prajakta Kalmegh added a comment - Hey John, We are maintaining the latest code in github on 'prj' branch. You can access it here: < https://github.com/prafullat/hive/tree/prj > We have created a 'RewriteParseContextGenerator' class in the org.apache.hadoop.hive.ql.optimizer package to return a ParseContext instance (which contains all the information on the operator tree) when given a string command (query) as input. Since we only need the operator tree for our execution, we have created this basic utility class for our code. Rest of the code is being internally reviewed by the team right now. We are expecting it to be completed by end of this week. We will update you once the code is ready for your review.
          Hide
          Prajakta Kalmegh added a comment -

          Hi John,

          We have the code ready for review. You can view it at <https://github.com/prafullat/hive/tree/prj>.
          Please find attached the diff 'HIVE-1694.1.patch' for the changes. We have taken the diff from the github hive repo <https://github.com/apache/hive.git> on 30th Jan 2011. The last commit on github apache/hive before we took a diff was <https://github.com/apache/hive/commit/2cbbccc5fa9fe3bd9b0569021831f745fa1d4a06>.

          Rewrite needs to be enabled explicitly by setting the 'hive.optimize.gbyusingindex' flag to true as done in 'ql_rewrite_gbtoidx.q' test case. We have added the 'ql_rewrite_gbtoidx.q' file in ql/src/test/queries/clientpositive.

          Show
          Prajakta Kalmegh added a comment - Hi John, We have the code ready for review. You can view it at < https://github.com/prafullat/hive/tree/prj >. Please find attached the diff ' HIVE-1694 .1.patch' for the changes. We have taken the diff from the github hive repo < https://github.com/apache/hive.git > on 30th Jan 2011. The last commit on github apache/hive before we took a diff was < https://github.com/apache/hive/commit/2cbbccc5fa9fe3bd9b0569021831f745fa1d4a06 >. Rewrite needs to be enabled explicitly by setting the 'hive.optimize.gbyusingindex' flag to true as done in 'ql_rewrite_gbtoidx.q' test case. We have added the 'ql_rewrite_gbtoidx.q' file in ql/src/test/queries/clientpositive.
          Hide
          Prajakta Kalmegh added a comment -

          Changes on branch prj.

          Show
          Prajakta Kalmegh added a comment - Changes on branch prj.
          Hide
          John Sichi added a comment -

          I've created a review board request but have not added any comments yet:

          https://reviews.apache.org/r/392/

          The patch required -p1 instead of -p0 to apply. For the next patch, please make sure

          patch -p0 < patchfile

          applies cleanly.

          Show
          John Sichi added a comment - I've created a review board request but have not added any comments yet: https://reviews.apache.org/r/392/ The patch required -p1 instead of -p0 to apply. For the next patch, please make sure patch -p0 < patchfile applies cleanly.
          Hide
          Prajakta Kalmegh added a comment -

          Thanks John. We will ensure that henceforth.

          Show
          Prajakta Kalmegh added a comment - Thanks John. We will ensure that henceforth.
          Hide
          John Sichi added a comment -

          Taking a closer look at this one now.

          Show
          John Sichi added a comment - Taking a closer look at this one now.
          Hide
          John Sichi added a comment -

          I'm buffering up a bunch of comments in review board, but won't publish for a while since it'll take me some time to go through all the code. Looks good so far.

          Show
          John Sichi added a comment - I'm buffering up a bunch of comments in review board, but won't publish for a while since it'll take me some time to go through all the code. Looks good so far.
          Hide
          Prajakta Kalmegh added a comment -

          Thanks for the update John.

          Show
          Prajakta Kalmegh added a comment - Thanks for the update John.
          Hide
          John Sichi added a comment -

          Note: I pointed the Harvey Mudd team over to your branch, so they're copying bits and pieces of necessary support into their patch. Once they're a little further along, we can figure out how to reconcile the two before commit.

          Show
          John Sichi added a comment - Note: I pointed the Harvey Mudd team over to your branch, so they're copying bits and pieces of necessary support into their patch. Once they're a little further along, we can figure out how to reconcile the two before commit.
          Hide
          Prajakta Kalmegh added a comment -

          Hi John,

          Can you please let us know the status of the review of this patch?

          Show
          Prajakta Kalmegh added a comment - Hi John, Can you please let us know the status of the review of this patch?
          Hide
          John Sichi added a comment -

          Hi Prajakta,

          I got caught up with some other work; I'll try to publish my comments before next week.

          Show
          John Sichi added a comment - Hi Prajakta, I got caught up with some other work; I'll try to publish my comments before next week.
          Hide
          Prajakta Kalmegh added a comment -

          Thanks John.

          Show
          Prajakta Kalmegh added a comment - Thanks John.
          Hide
          John Sichi added a comment -

          First round of review comments are here:

          https://reviews.apache.org/r/392/

          After those are resolved, and patch is rebased, I'll ask Namit and Yongqiang to take a look to see if they can find ways to simplify any of the rewrite logic.

          Show
          John Sichi added a comment - First round of review comments are here: https://reviews.apache.org/r/392/ After those are resolved, and patch is rebased, I'll ask Namit and Yongqiang to take a look to see if they can find ways to simplify any of the rewrite logic.
          Hide
          Prajakta Kalmegh added a comment -

          Hi John,

          Thanks for the review comments. I have posted my replies on some of your questions. For the others, we will make the required changes in the code and upload a new patch.

          Show
          Prajakta Kalmegh added a comment - Hi John, Thanks for the review comments. I have posted my replies on some of your questions. For the others, we will make the required changes in the code and upload a new patch.
          Hide
          Prajakta Kalmegh added a comment -

          We are currently working on implementing a new index type to get a correct COUNT for group-by queries that are re-written to use index table instead of base table. We have three implementation options as listed below:

          1. Create a new index handler called 'AggregationIndexHandler' which pre-computes the count on indexed columns
          In this appraoch, we plan to implement a new index type that stores the COUNT of the indexed column per index entry as suggested by John in November 1st 2010's comment. The 'AggregationIndexHandler' will override the default implementation for 'analyzeIndexDefinition(...)' and 'generateIndexBuildTaskList(...)' methods. The 'analyzeIndexDefinition(...)' method implementation will add a FieldSchema object for COUNT rather than 'offsets' in the StorageDescriptor for the index table. The 'generateIndexBuildTaskList(...)' method will be the same but there will be a change in the implementation of the 'getIndexBuilderMapRedTask(...)' method which is invoked within this method.

          Currently, for the index rebuild on the following index creation,

          CREATE INDEX lineitem_lshipdate_idx ON TABLE lineitem(l_shipdate) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
          ALTER INDEX lineitem_lshipdate_idx ON lineitem REBUILD;

          the 'getIndexBuilderMapRedTask(...)' method creates a command 'INSERT OVERWRITE TABLE `default_lineitem_lineitem_lshipdate_idx` SELECT `l_shipdate`,INPUTFILENAME, collect_set (BLOCKOFFSETINSIDEFILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUTFILE_NAME' in the CompactIndexHandler. This command is later passed to the driver to compile and rebuild the index which collects data from base table and stores it in index table.

          With the 'AggregationIndexHandler', we plan to create a command like ' INSERT OVERWRITE TABLE `default_lineitem_lineitem_lshipdate_idx` SELECT `l_shipdate`,INPUTFILENAME, COUNT(`l_shipdate`) FROM `lineitem` GROUP BY `l_shipdate`, INPUTFILE_NAME' which will count the number of unique indexed column entries.

          In the current implementation of our optimizer, we are doing a 'size(_offsets)' to replace the COUNT in group-by queries. We need to change this implementation if we use the AggregationIndexHandler. This will still give a comparable performance improvement on the original queries that will, otherwise, scan full tables for a COUNT of the indexed columns.
          Note: We plan to go ahead with creating a new HiveIndexHandler type rather than changing the implementation of the CompactIndexHandler to let in some flexibility in the use of AggregationIndexHandler only for cases where it is required.

          2. Implement a 'CompactRowIndexHandler' similar to 'CompactIndexHandler' but it stores the ROW_OFFSET_INSIDE_FILE instead of BLOCK_OFFSET_INSIDE_FILE
          We create a new index handler called 'CompactRowIndexHandler'. This will give us a ROW_OFFSET_INSIDE_FILE array<bigint> and we can compute the COUNT for group-by queries by doing a s'ize(ROW_OFFSET_INSIDE_FILE)' on it. The command will look like,

          ' INSERT OVERWRITE TABLE `default_lineitem_lineitem_lshipdate_idx` SELECT `l_shipdate`,INPUTFILENAME, collect_set (ROW_OFFSET_INSIDE_FILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUTFILE_NAME' which will give us an array of all row offsets in the input file grouped by indexed column.

          We had a look at the HIVE-1803 patch. If we re-use the implementation changes done in Hive1803 to classes HiveContextAwareRecordReader.java, IOContext.java, VirtualColumn.java and MapOperator.java, we can implement 'CompactRowIndexHandler.java' to support the above index type.

          3. Use BitMapIndexHandler without creating a new index handler
          Another approach is to use the Hive1803 BitMapIndexHandler to get a count of all those bits in the bitmap array for which the row_offset is ON. This will indeed give us a correct count for all the rows that contain the indexed column. However, we will have multiple counts per block entry. For this, we may need to do the SUM again to get a correct count per indexed column entry. Although this is do-able, it will be somewhat complex to implement.

          Please let us know what you think of the proposed solutions. If you have a better implementation approach in mind, please do share it with us.

          Show
          Prajakta Kalmegh added a comment - We are currently working on implementing a new index type to get a correct COUNT for group-by queries that are re-written to use index table instead of base table. We have three implementation options as listed below: 1. Create a new index handler called 'AggregationIndexHandler' which pre-computes the count on indexed columns In this appraoch, we plan to implement a new index type that stores the COUNT of the indexed column per index entry as suggested by John in November 1st 2010's comment. The 'AggregationIndexHandler' will override the default implementation for 'analyzeIndexDefinition(...)' and 'generateIndexBuildTaskList(...)' methods. The 'analyzeIndexDefinition(...)' method implementation will add a FieldSchema object for COUNT rather than 'offsets' in the StorageDescriptor for the index table. The 'generateIndexBuildTaskList(...)' method will be the same but there will be a change in the implementation of the 'getIndexBuilderMapRedTask(...)' method which is invoked within this method. Currently, for the index rebuild on the following index creation, CREATE INDEX lineitem_lshipdate_idx ON TABLE lineitem(l_shipdate) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD; ALTER INDEX lineitem_lshipdate_idx ON lineitem REBUILD; the 'getIndexBuilderMapRedTask(...)' method creates a command 'INSERT OVERWRITE TABLE `default_ lineitem_lineitem_lshipdate_idx ` SELECT `l_shipdate`,INPUT FILE NAME, collect_set (BLOCK OFFSET INSIDE FILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUT FILE _NAME' in the CompactIndexHandler. This command is later passed to the driver to compile and rebuild the index which collects data from base table and stores it in index table. With the 'AggregationIndexHandler', we plan to create a command like ' INSERT OVERWRITE TABLE `default_ lineitem_lineitem_lshipdate_idx ` SELECT `l_shipdate`,INPUT FILE NAME, COUNT(`l_shipdate`) FROM `lineitem` GROUP BY `l_shipdate`, INPUT FILE _NAME' which will count the number of unique indexed column entries. In the current implementation of our optimizer, we are doing a 'size(_offsets)' to replace the COUNT in group-by queries. We need to change this implementation if we use the AggregationIndexHandler. This will still give a comparable performance improvement on the original queries that will, otherwise, scan full tables for a COUNT of the indexed columns. Note: We plan to go ahead with creating a new HiveIndexHandler type rather than changing the implementation of the CompactIndexHandler to let in some flexibility in the use of AggregationIndexHandler only for cases where it is required. 2. Implement a 'CompactRowIndexHandler' similar to 'CompactIndexHandler' but it stores the ROW_OFFSET_INSIDE_FILE instead of BLOCK_OFFSET_INSIDE_FILE We create a new index handler called 'CompactRowIndexHandler'. This will give us a ROW_OFFSET_INSIDE_FILE array<bigint> and we can compute the COUNT for group-by queries by doing a s'ize(ROW_OFFSET_INSIDE_FILE)' on it. The command will look like, ' INSERT OVERWRITE TABLE `default_ lineitem_lineitem_lshipdate_idx ` SELECT `l_shipdate`,INPUT FILE NAME, collect_set (ROW_OFFSET_INSIDE_FILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUT FILE _NAME' which will give us an array of all row offsets in the input file grouped by indexed column. We had a look at the HIVE-1803 patch. If we re-use the implementation changes done in Hive1803 to classes HiveContextAwareRecordReader.java, IOContext.java, VirtualColumn.java and MapOperator.java, we can implement 'CompactRowIndexHandler.java' to support the above index type. 3. Use BitMapIndexHandler without creating a new index handler Another approach is to use the Hive1803 BitMapIndexHandler to get a count of all those bits in the bitmap array for which the row_offset is ON. This will indeed give us a correct count for all the rows that contain the indexed column. However, we will have multiple counts per block entry. For this, we may need to do the SUM again to get a correct count per indexed column entry. Although this is do-able, it will be somewhat complex to implement. Please let us know what you think of the proposed solutions. If you have a better implementation approach in mind, please do share it with us.
          Hide
          John Sichi added a comment -

          I'd like to propose a fourth option instead: create a new handler type which stores both the count and the offsets together, so that it can be used for both aggregation and filtering. The index build can still be done with a single GROUP BY, but now with three aggregate expressions in the SELECT list: collect_set (BLOCKOFFSETINSIDEFILE), COUNT(`l_shipdate`), COUNT. For a column known to be NOT NULL, just COUNT is good enough, but Hive doesn't currently have that metadata. You could also use IDXPROPERTIES to allow for additional expressions (SUM/MAX/MIN, complex expressions, etc), making these start to look more like materialized aggregate views.

          In HIVE-1803, they are working on factoring out some of the generic parts of compact index handler for reuse; we should depend on that for the aggregate index handler to avoid duplicating code.

          Show
          John Sichi added a comment - I'd like to propose a fourth option instead: create a new handler type which stores both the count and the offsets together, so that it can be used for both aggregation and filtering. The index build can still be done with a single GROUP BY, but now with three aggregate expressions in the SELECT list: collect_set (BLOCKOFFSETINSIDEFILE), COUNT(`l_shipdate`), COUNT . For a column known to be NOT NULL, just COUNT is good enough, but Hive doesn't currently have that metadata. You could also use IDXPROPERTIES to allow for additional expressions (SUM/MAX/MIN, complex expressions, etc), making these start to look more like materialized aggregate views. In HIVE-1803 , they are working on factoring out some of the generic parts of compact index handler for reuse; we should depend on that for the aggregate index handler to avoid duplicating code.
          Hide
          Prajakta Kalmegh added a comment -

          Patch version 2 - includes changes for review comments from John.

          Show
          Prajakta Kalmegh added a comment - Patch version 2 - includes changes for review comments from John.
          Hide
          Prajakta Kalmegh added a comment -

          Hi John

          We have made all the changes as suggested by you except for making the code pluggable (so that the rewrite expression changes depending on which index handler is used). We will submit this change along with the patch for new index type.

          We have started working on the new index type creation as per your suggestion and will let you know once that is complete.

          Show
          Prajakta Kalmegh added a comment - Hi John We have made all the changes as suggested by you except for making the code pluggable (so that the rewrite expression changes depending on which index handler is used). We will submit this change along with the patch for new index type. We have started working on the new index type creation as per your suggestion and will let you know once that is complete.
          Hide
          Prajakta Kalmegh added a comment -

          Patch version 2 - includes changes for review comments from John. Re-attaching the appropriate file.

          Show
          Prajakta Kalmegh added a comment - Patch version 2 - includes changes for review comments from John. Re-attaching the appropriate file.
          Hide
          Prajakta Kalmegh added a comment -

          Patch with new index type support and optimizer code that uses the new index type.

          Show
          Prajakta Kalmegh added a comment - Patch with new index type support and optimizer code that uses the new index type.
          Hide
          Prajakta Kalmegh added a comment -

          Hi John

          Please find attached the patch with new index type support. We have also made changes to the our optimizer code to use count of indexed columns from this new index type (instead of computing the size(_offsets)). Can you please upload it for review on ReviewBoard?

          Thanks.

          Show
          Prajakta Kalmegh added a comment - Hi John Please find attached the patch with new index type support. We have also made changes to the our optimizer code to use count of indexed columns from this new index type (instead of computing the size(_offsets)). Can you please upload it for review on ReviewBoard? Thanks.
          Hide
          John Sichi added a comment -

          Hi Prajakta,

          Review Board is self-service...you can create yourself an account and then follow the steps here:

          http://wiki.apache.org/hadoop/Hive/HowToContribute#Review_Process

          Show
          John Sichi added a comment - Hi Prajakta, Review Board is self-service...you can create yourself an account and then follow the steps here: http://wiki.apache.org/hadoop/Hive/HowToContribute#Review_Process
          Hide
          Prajakta Kalmegh added a comment -

          Hi John

          Thanks for the link. We have created a new Review Board entry: <https://reviews.apache.org/r/505/>

          Show
          Prajakta Kalmegh added a comment - Hi John Thanks for the link. We have created a new Review Board entry: < https://reviews.apache.org/r/505/ >
          Hide
          John Sichi added a comment -

          I added a few review board comments; there are a lot of places where the exception handling is still wrong; I didn't comment on all of those but they need to be fixed.

          We still need to reconcile with HIVE-1803, but I'll ask Namit and Yongqiang to take a look now to get their comments on the rewrite implementation.

          Show
          John Sichi added a comment - I added a few review board comments; there are a lot of places where the exception handling is still wrong; I didn't comment on all of those but they need to be fixed. We still need to reconcile with HIVE-1803 , but I'll ask Namit and Yongqiang to take a look now to get their comments on the rewrite implementation.
          Hide
          John Sichi added a comment -

          Need a rebase now that the refactoring from HIVE-1803 has been committed.

          Show
          John Sichi added a comment - Need a rebase now that the refactoring from HIVE-1803 has been committed.
          Hide
          Prajakta Kalmegh added a comment -

          Thanks John. Please let us know how to proceed on this. We are taking a look at the HIVE-1803 changes in the meanwhile.

          Show
          Prajakta Kalmegh added a comment - Thanks John. Please let us know how to proceed on this. We are taking a look at the HIVE-1803 changes in the meanwhile.
          Hide
          John Sichi added a comment -

          For the rebasing, you'll need to make your new handlers work with the refactored base classes. HIVE-1803 copied some of your refactoring and took it further.

          I'm going to ping Yongqiang again.

          Show
          John Sichi added a comment - For the rebasing, you'll need to make your new handlers work with the refactored base classes. HIVE-1803 copied some of your refactoring and took it further. I'm going to ping Yongqiang again.
          Hide
          John Sichi added a comment -

          Scheduled a meeting this Friday to take a look at this with some other FB folks and get you more feedback.

          Show
          John Sichi added a comment - Scheduled a meeting this Friday to take a look at this with some other FB folks and get you more feedback.
          Hide
          Prajakta Kalmegh added a comment -

          That would be great. Thanks.

          Show
          Prajakta Kalmegh added a comment - That would be great. Thanks.
          Hide
          John Sichi added a comment -

          I collected comments from last week's review meeting below.

          • The rewrite needs to check to make sure that the index partitions are available (matching the referenced table partitions). You can take a look at the way the Harvey Mudd team handles this, and maybe reuse their code. This implies that predicate pushdown and partition pruning need to happen BEFORE the rewrite is applied (currently the rewrite happens before them).
          • Isn't it a bug that the GROUP BY is removed in some cases? The index may store multiple rows for the same base table key (since FILENAME is part of the index table key), so it seems like a GROUP BY should always be required for removing those duplicates.
          • Where is _countall used instead of _countkey? Also, what happens if the index is compound (multiple columns in its key)?
          • Add a test case for a query in which a table scan is reused in a directed acyclic graph, e.g. a UNION where one branch of the union does a rewritable GROUP BY on the table and the other branch just reads the table directly. We want to make sure that in this case, the rewrite's replacement of the base table in one branch does not corrupt the other branch in any way.

          After these have been addressed (along with the existing review board comments) and you've had a chance to rebase the patch, we'll do another pass.

          Thanks again!

          Show
          John Sichi added a comment - I collected comments from last week's review meeting below. The rewrite needs to check to make sure that the index partitions are available (matching the referenced table partitions). You can take a look at the way the Harvey Mudd team handles this, and maybe reuse their code. This implies that predicate pushdown and partition pruning need to happen BEFORE the rewrite is applied (currently the rewrite happens before them). Isn't it a bug that the GROUP BY is removed in some cases? The index may store multiple rows for the same base table key (since FILENAME is part of the index table key), so it seems like a GROUP BY should always be required for removing those duplicates. Where is _countall used instead of _countkey? Also, what happens if the index is compound (multiple columns in its key)? Add a test case for a query in which a table scan is reused in a directed acyclic graph, e.g. a UNION where one branch of the union does a rewritable GROUP BY on the table and the other branch just reads the table directly. We want to make sure that in this case, the rewrite's replacement of the base table in one branch does not corrupt the other branch in any way. After these have been addressed (along with the existing review board comments) and you've had a chance to rebase the patch, we'll do another pass. Thanks again!
          Hide
          Prajakta Kalmegh added a comment -

          Thanks John. Our team here is working on it. I will let you know once the new patch is ready.

          Show
          Prajakta Kalmegh added a comment - Thanks John. Our team here is working on it. I will let you know once the new patch is ready.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/
          -----------------------------------------------------------

          Review request for hive and John Sichi.

          Summary
          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.
          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs


          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f
          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff
          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6
          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing
          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6 ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          Prajakta Kalmegh added a comment -

          Review Changes done after last review. Added new functionality (See post for more details)

          Show
          Prajakta Kalmegh added a comment - Review Changes done after last review. Added new functionality (See post for more details)
          Hide
          Prajakta Kalmegh added a comment -

          Hi John

          Please find attached the latest patch (HIVE-1694.4.patch):
          The patch contains:
          1. Support for multiple aggregates in index creation using the AggregateIndexHandler. The column names for the index schema are constructed dynamically depending on the aggregates.
          For 'aggregateFunction(columnName)', the column name in index will be `_aggregateFunction_of_columnName`.
          For example, for count(l_shipdate), the column name will be `_count_of_l_shipdate)`.
          For 'count' function, the column name will be `_count_of_all`.

          2. Fixed the bug for duplicates in Group-by removal cases. We are not removing group-by in any case now. This has made the logic for query rewrites quite simpler than before.
          We removed 4 classes (RewriteIndexSubqueryCtx.java, RewriteIndexSubqueryProcFactory.java, RewriteRemoveGroupbyCtx.java, RewriteRemoveGroupbyProcFactory.java) from the previous patch and added two new simpler classes instead (RewriteQueryUsingAggregateIndex.java, RewriteQueryUsingAggregateIndexCtx.java).

          3. Added a new query (with 'UNION ALL') in the same ql_rewrite_gbtoidx.q file to demonstrate your requirement in last post. Please note that the query is not a valid real-work use case scenario; but still suffices our purpose to see that one branch rewrite does not corrupt the other branch.

          4. Rewrite Optimization now happens after the PredicatePushdown, PartitionPruner and PartitionConditionRemover.

          This patch does not contain:
          1. Optimization for cases with mulitple aggregates in selection
          2. Optimization for any other aggregate function apart from count
          3. Optimization for queries involving multiple tables (even if they are in a different branch). Since we are not optimizing for case of joins, the constraint also filters out queries which have different tables in union queries.
          4. Optimizations for index with multiple columns in its key

          Here is the review board link for the patch <https://reviews.apache.org/r/1194/>.

          Please let me know if you have any questions.

          Show
          Prajakta Kalmegh added a comment - Hi John Please find attached the latest patch ( HIVE-1694 .4.patch): The patch contains: 1. Support for multiple aggregates in index creation using the AggregateIndexHandler. The column names for the index schema are constructed dynamically depending on the aggregates. For 'aggregateFunction(columnName)', the column name in index will be `_aggregateFunction_of_columnName`. For example, for count(l_shipdate), the column name will be `_count_of_l_shipdate)`. For 'count ' function, the column name will be `_count_of_all`. 2. Fixed the bug for duplicates in Group-by removal cases. We are not removing group-by in any case now. This has made the logic for query rewrites quite simpler than before. We removed 4 classes (RewriteIndexSubqueryCtx.java, RewriteIndexSubqueryProcFactory.java, RewriteRemoveGroupbyCtx.java, RewriteRemoveGroupbyProcFactory.java) from the previous patch and added two new simpler classes instead (RewriteQueryUsingAggregateIndex.java, RewriteQueryUsingAggregateIndexCtx.java). 3. Added a new query (with 'UNION ALL') in the same ql_rewrite_gbtoidx.q file to demonstrate your requirement in last post. Please note that the query is not a valid real-work use case scenario; but still suffices our purpose to see that one branch rewrite does not corrupt the other branch. 4. Rewrite Optimization now happens after the PredicatePushdown, PartitionPruner and PartitionConditionRemover. This patch does not contain: 1. Optimization for cases with mulitple aggregates in selection 2. Optimization for any other aggregate function apart from count 3. Optimization for queries involving multiple tables (even if they are in a different branch). Since we are not optimizing for case of joins, the constraint also filters out queries which have different tables in union queries. 4. Optimizations for index with multiple columns in its key Here is the review board link for the patch < https://reviews.apache.org/r/1194/ >. Please let me know if you have any questions.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/#review1212
          -----------------------------------------------------------

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2711>

          Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files.

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2695>

          Don't you need to reuse the compact implementation here so that the index can be used for WHERE (not just GROUP BY)?

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
          <https://reviews.apache.org/r/1194/#comment2696>

          This method is redundant now.

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java
          <https://reviews.apache.org/r/1194/#comment2698>

          I can't think of a case where it would be worse.

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java
          <https://reviews.apache.org/r/1194/#comment2699>

          Actually group-by is now preserved in all cases.

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java
          <https://reviews.apache.org/r/1194/#comment2700>

          Please use HTML bullet syntax for Javadoc (otherwise it all gets run together into one line when rendered).

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java
          <https://reviews.apache.org/r/1194/#comment2701>

          indentation

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java
          <https://reviews.apache.org/r/1194/#comment2703>

          Shouldn't this be BIGINT?

          Also, I think you're supposed to use a TypeInfoFactory for this purpose.

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java
          <https://reviews.apache.org/r/1194/#comment2702>

          indentation

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java
          <https://reviews.apache.org/r/1194/#comment2704>

          typo: Repace

          ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java
          <https://reviews.apache.org/r/1194/#comment2707>

          Not sure why this new constructor is needed...after using it, all you do is get the table out of it.

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          <https://reviews.apache.org/r/1194/#comment2709>

          This should not be using the index, since the index is built on count(l_shipdate), and l_shipdate may contain nulls, whereas the query is referencing count(1), which is insensitive to nulls.

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          <https://reviews.apache.org/r/1194/#comment2710>

          Need additional tests to verify all the cases where the optimization should not be used:

          • when configuration disables it
          • when index partitions do not cover table partitions (I still don't see the code for this case)
          • ... all the other conditions checked for in the code ...
          • John

          On 2011-07-26 14:44:01, Prajakta Kalmegh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1194/

          -----------------------------------------------------------

          (Updated 2011-07-26 14:44:01)

          Review request for hive and John Sichi.

          Summary

          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.

          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs

          -----

          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION

          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing

          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/#review1212 ----------------------------------------------------------- ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2711 > Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files. ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2695 > Don't you need to reuse the compact implementation here so that the index can be used for WHERE (not just GROUP BY)? ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java < https://reviews.apache.org/r/1194/#comment2696 > This method is redundant now. ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java < https://reviews.apache.org/r/1194/#comment2698 > I can't think of a case where it would be worse. ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java < https://reviews.apache.org/r/1194/#comment2699 > Actually group-by is now preserved in all cases. ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java < https://reviews.apache.org/r/1194/#comment2700 > Please use HTML bullet syntax for Javadoc (otherwise it all gets run together into one line when rendered). ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java < https://reviews.apache.org/r/1194/#comment2701 > indentation ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java < https://reviews.apache.org/r/1194/#comment2703 > Shouldn't this be BIGINT? Also, I think you're supposed to use a TypeInfoFactory for this purpose. ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java < https://reviews.apache.org/r/1194/#comment2702 > indentation ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java < https://reviews.apache.org/r/1194/#comment2704 > typo: Repace ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java < https://reviews.apache.org/r/1194/#comment2707 > Not sure why this new constructor is needed...after using it, all you do is get the table out of it. ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q < https://reviews.apache.org/r/1194/#comment2709 > This should not be using the index, since the index is built on count(l_shipdate), and l_shipdate may contain nulls, whereas the query is referencing count(1), which is insensitive to nulls. ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q < https://reviews.apache.org/r/1194/#comment2710 > Need additional tests to verify all the cases where the optimization should not be used: when configuration disables it when index partitions do not cover table partitions (I still don't see the code for this case) ... all the other conditions checked for in the code ... John On 2011-07-26 14:44:01, Prajakta Kalmegh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-07-26 14:44:01) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs ----- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6 ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 61

          > <https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line61>

          >

          > Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files.

          >

          Done! The code is still having checkstyle formatting errors only for places where we have used LinkedHashMap, HashMap and ArrayList. The error states "Declaring variables, return values or parameters of type 'HashMap' is not allowed".

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 184

          > <https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line184>

          >

          > Don't you need to reuse the compact implementation here so that the index can be used for WHERE (not just GROUP BY)?

          >

          The AggregateIndexHandler now extends from CompactIndexHandler instead of TableBasedIndexHandler. We override only analyzeIndexDefinition(...) and getIndexBuilderMapRedTask(...) methods from CompactIndexHandler.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 700

          > <https://reviews.apache.org/r/1194/diff/1/?file=27054#file27054line700>

          >

          > This method is redundant now.

          Removed. Sorry to have missed that.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java, line 252

          > <https://reviews.apache.org/r/1194/diff/1/?file=27056#file27056line252>

          >

          > I can't think of a case where it would be worse.

          Ok.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java, line 164

          > <https://reviews.apache.org/r/1194/diff/1/?file=27057#file27057line164>

          >

          > Actually group-by is now preserved in all cases.

          Forgot to change a few comments after fixing the bug. Done!

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java, line 66

          > <https://reviews.apache.org/r/1194/diff/1/?file=27058#file27058line66>

          >

          > Please use HTML bullet syntax for Javadoc (otherwise it all gets run together into one line when rendered).

          >

          Done!

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java, line 93

          > <https://reviews.apache.org/r/1194/diff/1/?file=27060#file27060line93>

          >

          > Shouldn't this be BIGINT?

          >

          > Also, I think you're supposed to use a TypeInfoFactory for this purpose.

          Yes. Changed it to bigint. Also changed it in AggregateIndexHandler where I had declared the type to be "int".

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java, line 603

          > <https://reviews.apache.org/r/1194/diff/1/?file=27062#file27062line603>

          >

          > Not sure why this new constructor is needed...after using it, all you do is get the table out of it.

          The only other constructor option for tableSpec needs the ASTNode as one of its parameters. Since we need to construct a new tableSpec using only the index table name, and we do not have a ASTNode for this, I need this constructor. If you have any other way in mind, please let me know. That would be helpful.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q, line 27

          > <https://reviews.apache.org/r/1194/diff/1/?file=27063#file27063line27>

          >

          > This should not be using the index, since the index is built on count(l_shipdate), and l_shipdate may contain nulls, whereas the query is referencing count(1), which is insensitive to nulls.

          Yes true.
          I have now changed the count(1) queries with count(l_shipdate) in ql_rewrite_gbtoidx.q file. Also, verified that the count(1) queries are not using the index.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q, line 61

          > <https://reviews.apache.org/r/1194/diff/1/?file=27063#file27063line61>

          >

          > Need additional tests to verify all the cases where the optimization should not be used:

          >

          > * when configuration disables it

          > * when index partitions do not cover table partitions (I still don't see the code for this case)

          > * ... all the other conditions checked for in the code ...

          >

          Added new queries to verify that optimization is not used in case of:

          • when configuration disables it
          • ... all the other conditions checked for in the code ...

          About "* when index partitions do not cover table partitions ", still pending (also the code for it). I will upload the new patch once this is done.

          • Prajakta

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/#review1212
          -----------------------------------------------------------

          On 2011-07-26 14:44:01, Prajakta Kalmegh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1194/

          -----------------------------------------------------------

          (Updated 2011-07-26 14:44:01)

          Review request for hive and John Sichi.

          Summary

          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.

          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs

          -----

          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION

          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing

          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 61 > < https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line61 > > > Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files. > Done! The code is still having checkstyle formatting errors only for places where we have used LinkedHashMap, HashMap and ArrayList. The error states "Declaring variables, return values or parameters of type 'HashMap' is not allowed". On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 184 > < https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line184 > > > Don't you need to reuse the compact implementation here so that the index can be used for WHERE (not just GROUP BY)? > The AggregateIndexHandler now extends from CompactIndexHandler instead of TableBasedIndexHandler. We override only analyzeIndexDefinition(...) and getIndexBuilderMapRedTask(...) methods from CompactIndexHandler. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 700 > < https://reviews.apache.org/r/1194/diff/1/?file=27054#file27054line700 > > > This method is redundant now. Removed. Sorry to have missed that. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java, line 252 > < https://reviews.apache.org/r/1194/diff/1/?file=27056#file27056line252 > > > I can't think of a case where it would be worse. Ok. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java, line 164 > < https://reviews.apache.org/r/1194/diff/1/?file=27057#file27057line164 > > > Actually group-by is now preserved in all cases. Forgot to change a few comments after fixing the bug. Done! On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java, line 66 > < https://reviews.apache.org/r/1194/diff/1/?file=27058#file27058line66 > > > Please use HTML bullet syntax for Javadoc (otherwise it all gets run together into one line when rendered). > Done! On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java, line 93 > < https://reviews.apache.org/r/1194/diff/1/?file=27060#file27060line93 > > > Shouldn't this be BIGINT? > > Also, I think you're supposed to use a TypeInfoFactory for this purpose. Yes. Changed it to bigint. Also changed it in AggregateIndexHandler where I had declared the type to be "int". On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java, line 603 > < https://reviews.apache.org/r/1194/diff/1/?file=27062#file27062line603 > > > Not sure why this new constructor is needed...after using it, all you do is get the table out of it. The only other constructor option for tableSpec needs the ASTNode as one of its parameters. Since we need to construct a new tableSpec using only the index table name, and we do not have a ASTNode for this, I need this constructor. If you have any other way in mind, please let me know. That would be helpful. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q, line 27 > < https://reviews.apache.org/r/1194/diff/1/?file=27063#file27063line27 > > > This should not be using the index, since the index is built on count(l_shipdate), and l_shipdate may contain nulls, whereas the query is referencing count(1), which is insensitive to nulls. Yes true. I have now changed the count(1) queries with count(l_shipdate) in ql_rewrite_gbtoidx.q file. Also, verified that the count(1) queries are not using the index. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q, line 61 > < https://reviews.apache.org/r/1194/diff/1/?file=27063#file27063line61 > > > Need additional tests to verify all the cases where the optimization should not be used: > > * when configuration disables it > * when index partitions do not cover table partitions (I still don't see the code for this case) > * ... all the other conditions checked for in the code ... > Added new queries to verify that optimization is not used in case of: when configuration disables it ... all the other conditions checked for in the code ... About "* when index partitions do not cover table partitions ", still pending (also the code for it). I will upload the new patch once this is done. Prajakta ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/#review1212 ----------------------------------------------------------- On 2011-07-26 14:44:01, Prajakta Kalmegh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-07-26 14:44:01) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs ----- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6 ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 61

          > <https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line61>

          >

          > Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files.

          >

          Prajakta Kalmegh wrote:

          Done! The code is still having checkstyle formatting errors only for places where we have used LinkedHashMap, HashMap and ArrayList. The error states "Declaring variables, return values or parameters of type 'HashMap' is not allowed".

          Best practice is to only use interfaces (Map/List) except at the point of instantiation where you select a concrete class. Hive violates this in a number of places, and sometimes that forces you to violate it in new code too; but otherwise, please follow this one.

          On 2011-07-28 21:40:30, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java, line 603

          > <https://reviews.apache.org/r/1194/diff/1/?file=27062#file27062line603>

          >

          > Not sure why this new constructor is needed...after using it, all you do is get the table out of it.

          Prajakta Kalmegh wrote:

          The only other constructor option for tableSpec needs the ASTNode as one of its parameters. Since we need to construct a new tableSpec using only the index table name, and we do not have a ASTNode for this, I need this constructor. If you have any other way in mind, please let me know. That would be helpful.

          I'm asking why you even need to construct a new tableSpec instance. All you do with it is reference ts.tableHandle. And to create that tableHandle, you can just do db.getTable(tableName). So I don't see the purpose of the tableSpec instance.

          • John

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/#review1212
          -----------------------------------------------------------

          On 2011-07-26 14:44:01, Prajakta Kalmegh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1194/

          -----------------------------------------------------------

          (Updated 2011-07-26 14:44:01)

          Review request for hive and John Sichi.

          Summary

          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.

          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs

          -----

          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION

          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing

          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 61 > < https://reviews.apache.org/r/1194/diff/1/?file=27052#file27052line61 > > > Please run ant checkstyle and fix all the formatting discrepancies it reports for your new files. > Prajakta Kalmegh wrote: Done! The code is still having checkstyle formatting errors only for places where we have used LinkedHashMap, HashMap and ArrayList. The error states "Declaring variables, return values or parameters of type 'HashMap' is not allowed". Best practice is to only use interfaces (Map/List) except at the point of instantiation where you select a concrete class. Hive violates this in a number of places, and sometimes that forces you to violate it in new code too; but otherwise, please follow this one. On 2011-07-28 21:40:30, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java, line 603 > < https://reviews.apache.org/r/1194/diff/1/?file=27062#file27062line603 > > > Not sure why this new constructor is needed...after using it, all you do is get the table out of it. Prajakta Kalmegh wrote: The only other constructor option for tableSpec needs the ASTNode as one of its parameters. Since we need to construct a new tableSpec using only the index table name, and we do not have a ASTNode for this, I need this constructor. If you have any other way in mind, please let me know. That would be helpful. I'm asking why you even need to construct a new tableSpec instance. All you do with it is reference ts.tableHandle. And to create that tableHandle, you can just do db.getTable(tableName). So I don't see the purpose of the tableSpec instance. John ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/#review1212 ----------------------------------------------------------- On 2011-07-26 14:44:01, Prajakta Kalmegh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-07-26 14:44:01) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs ----- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 2ca63b3 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 77a6dc6 ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          Prajakta Kalmegh added a comment -

          Changes after comments on Reviewboard <https://reviews.apache.org/r/1194/>

          Show
          Prajakta Kalmegh added a comment - Changes after comments on Reviewboard < https://reviews.apache.org/r/1194/ >
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/
          -----------------------------------------------------------

          (Updated 2011-08-03 10:31:42.024369)

          Review request for hive and John Sichi.

          Changes
          -------

          Updated the patch to HIVE-1694.5.patch with changes for review comments for patch 4.

          Summary
          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.
          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs (updated)


          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f
          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff
          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b
          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing
          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-08-03 10:31:42.024369) Review request for hive and John Sichi. Changes ------- Updated the patch to HIVE-1694 .5.patch with changes for review comments for patch 4. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs (updated) common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/#review1303
          -----------------------------------------------------------

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2955>

          Can't you just look up "AGGREGATES" in the map?

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2953>

          Add a helper method to avoid duplicating the code in the else block below.

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2954>

          Can't you just look up "AGGREGATES" in the map?

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          <https://reviews.apache.org/r/1194/#comment2956>

          See recent changes in corresponding CompactIndexHandler code for HIVEOPTINDEXFILTER; need the same here (or better, factor out common code here and elsewhere).

          On a related note, you may be able to use the same technique instead of isQueryInsertToTable; this would be preferable since it's nice to be able to use the index rewrite in cases where it's a normal INSERT table with index being used for GROUP BY on SELECT from some other table.

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java
          <https://reviews.apache.org/r/1194/#comment2957>

          @params here don't match actual params

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java
          <https://reviews.apache.org/r/1194/#comment2958>

          Shouldn't this be the same as COUNT?

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          <https://reviews.apache.org/r/1194/#comment2980>

          Besides EXPLAIN, you should include a few queries against a non-empty table verifying that you get the correct results both with and without the optimization applied. Remember to include an ORDER BY for test determinism.

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          <https://reviews.apache.org/r/1194/#comment2978>

          Isn't this set redundant?

          • John

          On 2011-08-03 10:31:42, Prajakta Kalmegh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1194/

          -----------------------------------------------------------

          (Updated 2011-08-03 10:31:42)

          Review request for hive and John Sichi.

          Summary

          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.

          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs

          -----

          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION

          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing

          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/#review1303 ----------------------------------------------------------- ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2955 > Can't you just look up "AGGREGATES" in the map? ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2953 > Add a helper method to avoid duplicating the code in the else block below. ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2954 > Can't you just look up "AGGREGATES" in the map? ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java < https://reviews.apache.org/r/1194/#comment2956 > See recent changes in corresponding CompactIndexHandler code for HIVEOPTINDEXFILTER; need the same here (or better, factor out common code here and elsewhere). On a related note, you may be able to use the same technique instead of isQueryInsertToTable; this would be preferable since it's nice to be able to use the index rewrite in cases where it's a normal INSERT table with index being used for GROUP BY on SELECT from some other table. ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java < https://reviews.apache.org/r/1194/#comment2957 > @params here don't match actual params ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java < https://reviews.apache.org/r/1194/#comment2958 > Shouldn't this be the same as COUNT ? ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q < https://reviews.apache.org/r/1194/#comment2980 > Besides EXPLAIN, you should include a few queries against a non-empty table verifying that you get the correct results both with and without the optimization applied. Remember to include an ORDER BY for test determinism. ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q < https://reviews.apache.org/r/1194/#comment2978 > Isn't this set redundant? John On 2011-08-03 10:31:42, Prajakta Kalmegh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-08-03 10:31:42) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs ----- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          John Sichi added a comment -

          Added some comments on Review Board. Thanks for carrying this one all the way through such a long review process; let's see if we can get this one committed before it turns one year old

          Show
          John Sichi added a comment - Added some comments on Review Board. Thanks for carrying this one all the way through such a long review process; let's see if we can get this one committed before it turns one year old
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-08-05 21:20:21, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 172

          > <https://reviews.apache.org/r/1194/diff/2/?file=30443#file30443line172>

          >

          > See recent changes in corresponding CompactIndexHandler code for HIVEOPTINDEXFILTER; need the same here (or better, factor out common code here and elsewhere).

          >

          > On a related note, you may be able to use the same technique instead of isQueryInsertToTable; this would be preferable since it's nice to be able to use the index rewrite in cases where it's a normal INSERT table with index being used for GROUP BY on SELECT from some other table.

          >

          I have factored out the common code in all Index handler classes and placed it in IndexUtils file.

          I also removed the code for "isQueryInsertToTable" and am setting the HIVEOPTINDEXFILTER to false instead.

          On 2011-08-05 21:20:21, John Sichi wrote:

          > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java, line 153

          > <https://reviews.apache.org/r/1194/diff/2/?file=30449#file30449line153>

          >

          > Shouldn't this be the same as COUNT?

          >

          Yes it is. I missed to change this part from the previous code.

          • Prajakta

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/#review1303
          -----------------------------------------------------------

          On 2011-08-03 10:31:42, Prajakta Kalmegh wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1194/

          -----------------------------------------------------------

          (Updated 2011-08-03 10:31:42)

          Review request for hive and John Sichi.

          Summary

          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.

          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs

          -----

          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f

          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff

          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687

          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b

          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION

          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing

          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-08-05 21:20:21, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java, line 172 > < https://reviews.apache.org/r/1194/diff/2/?file=30443#file30443line172 > > > See recent changes in corresponding CompactIndexHandler code for HIVEOPTINDEXFILTER; need the same here (or better, factor out common code here and elsewhere). > > On a related note, you may be able to use the same technique instead of isQueryInsertToTable; this would be preferable since it's nice to be able to use the index rewrite in cases where it's a normal INSERT table with index being used for GROUP BY on SELECT from some other table. > I have factored out the common code in all Index handler classes and placed it in IndexUtils file. I also removed the code for "isQueryInsertToTable" and am setting the HIVEOPTINDEXFILTER to false instead. On 2011-08-05 21:20:21, John Sichi wrote: > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java, line 153 > < https://reviews.apache.org/r/1194/diff/2/?file=30449#file30449line153 > > > Shouldn't this be the same as COUNT ? > Yes it is. I missed to change this part from the previous code. Prajakta ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/#review1303 ----------------------------------------------------------- On 2011-08-03 10:31:42, Prajakta Kalmegh wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-08-03 10:31:42) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs ----- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b46976f ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a57f9cf ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java 8295687 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/
          -----------------------------------------------------------

          (Updated 2011-09-09 01:14:16.218940)

          Review request for hive and John Sichi.

          Summary
          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.
          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs (updated)


          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION
          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java dcdfb9e
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 66ee0be
          data/files/lineitem.txt PRE-CREATION
          data/files/tbl.txt PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff
          ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 5053576
          ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 7a00c00
          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bec8787
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing
          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-09-09 01:14:16.218940) Review request for hive and John Sichi. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs (updated) ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java dcdfb9e ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 66ee0be data/files/lineitem.txt PRE-CREATION data/files/tbl.txt PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 5053576 ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 7a00c00 ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bec8787 ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          John Sichi added a comment -

          Looks great. One last change: for all the SELECT queries in the .q file, can you add an ORDER BY on a full key for test determinism.

          Show
          John Sichi added a comment - Looks great. One last change: for all the SELECT queries in the .q file, can you add an ORDER BY on a full key for test determinism.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1194/
          -----------------------------------------------------------

          (Updated 2011-09-10 21:10:06.178279)

          Review request for hive and John Sichi.

          Changes
          -------

          Added order-by to queries for test determinism.

          Summary
          -------

          This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries.

          This addresses bug HIVE-1694.
          https://issues.apache.org/jira/browse/HIVE-1694

          Diffs (updated)


          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 66ee0be
          data/files/lineitem.txt PRE-CREATION
          data/files/tbl.txt PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff
          ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 5053576
          ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 7a00c00
          ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bec8787
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java dcdfb9e
          ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b
          ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION
          ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION

          Diff: https://reviews.apache.org/r/1194/diff

          Testing
          -------

          Thanks,

          Prajakta

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1194/ ----------------------------------------------------------- (Updated 2011-09-10 21:10:06.178279) Review request for hive and John Sichi. Changes ------- Added order-by to queries for test determinism. Summary ------- This patch has defined a new AggregateIndexHandler which is used to optimize the query plan for groupby queries. This addresses bug HIVE-1694 . https://issues.apache.org/jira/browse/HIVE-1694 Diffs (updated) common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 66ee0be data/files/lineitem.txt PRE-CREATION data/files/tbl.txt PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java 591c9ff ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 5053576 ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 7a00c00 ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java bec8787 ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 590d69a ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java dcdfb9e ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java 699519b ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q PRE-CREATION ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out PRE-CREATION Diff: https://reviews.apache.org/r/1194/diff Testing ------- Thanks, Prajakta
          Hide
          John Sichi added a comment -

          Prajakta, can you re-attach your latest patch granting rights to ASF (so the feather shows up next to the attachment), and then click the Submit Patch button?

          Show
          John Sichi added a comment - Prajakta, can you re-attach your latest patch granting rights to ASF (so the feather shows up next to the attachment), and then click the Submit Patch button?
          Hide
          Prajakta Kalmegh added a comment -

          Accelerate group-by queries using aggregate index.

          Show
          Prajakta Kalmegh added a comment - Accelerate group-by queries using aggregate index.
          Hide
          John Sichi added a comment -

          +1. Will commit when tests pass.

          Show
          John Sichi added a comment - +1. Will commit when tests pass.
          Hide
          John Sichi added a comment -

          Committed to trunk. Thanks Prajakta!

          Show
          John Sichi added a comment - Committed to trunk. Thanks Prajakta!
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #949 (See https://builds.apache.org/job/Hive-trunk-h0.21/949/)
          HIVE-1694. Accelerate GROUP BY execution using indexes
          (Prajakta Kalmegh via jvs)

          jvs : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1170007
          Files :

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/data/files/lineitem.txt
          • /hive/trunk/data/files/tbl.txt
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java
          • /hive/trunk/ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
          • /hive/trunk/ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          Show
          Hudson added a comment - Integrated in Hive-trunk-h0.21 #949 (See https://builds.apache.org/job/Hive-trunk-h0.21/949/ ) HIVE-1694 . Accelerate GROUP BY execution using indexes (Prajakta Kalmegh via jvs) jvs : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1170007 Files : /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java /hive/trunk/data/files/lineitem.txt /hive/trunk/data/files/tbl.txt /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/AggregateIndexHandler.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/HiveIndex.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/IndexUtils.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyCtx.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteCanApplyProcFactory.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteGBUsingIndex.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndex.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteQueryUsingAggregateIndexCtx.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereProcessor.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/index/IndexWhereTaskDispatcher.java /hive/trunk/ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q /hive/trunk/ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
          Hide
          Lefty Leverenz added a comment -

          This added the hive.optimize.index.groupby configuration property in Hive 0.8.1 (branch08r2).

          Show
          Lefty Leverenz added a comment - This added the hive.optimize.index.groupby configuration property in Hive 0.8.1 (branch08r2).

            People

            • Assignee:
              Prajakta Kalmegh
              Reporter:
              Nikhil Deshpande
            • Votes:
              2 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development