HIVE-1938

Cost Based Query optimization for Joins in Hive

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor, Statistics
    • Labels: None
    • Environment: *nix, Java
    • Tags: Query optimization, Query processor

      Description

      The current optimization in Hive is purely rule-based: a fixed set of rules is applied to the plan tree, and the optimizer depends on hints given by the user (which may or may not be correct), which can result in the execution of costlier plans. This JIRA aims at building a cost model that can estimate the cost of the various plans beforehand (using metadata that has already been collected), so that we can choose the plan that incurs the least cost.

          Activity

          bharath v added a comment -

          I have a few basic ideas about join costs. Consider the case of the Common Join (we can extend it to other joins), for which I propose a mathematical approach.

          We can consider disk I/O and communication cost as the major bottlenecks and optimize the join based on these factors. The challenge here is predicting the size of the intermediate tables' output (when joining more than 2 tables).

          Suppose there are "N" machines in the Hive-Hadoop cluster.

          Suppose we are joining two tables, "A" on column "a" with "B" on column "b". Table A contains n1 rows and table B contains n2 rows. Suppose A.a has "d1" distinct values and B.b has "d2" distinct values, and suppose d1 < d2 without loss of generality. The cost computation of a single join can be divided into 3 phases:

          1) Map phase:
          Both tables are scanned from disk once, and we can assume this is done in parallel on the N machines. So the total disk I/O cost is estimated as O((n1+n2)/N), assuming a constant page size.

          2) Shuffle phase:
          The cost estimate in this phase depends on the type of join. In the worst case, every row gets transferred to a reducer on a machine other than the one on which it resides. Assuming some constant ping time (obtained from average measurements), the cost is O(k*(n1+n2)), where k is the cost per unit of data transferred. Actual shuffle costs may be lower than this due to the locality-maximizing behavior of Hadoop MapReduce; I am just providing an upper bound, and we can surely improve upon it depending on the type of join.

          For example, for a map-side join we can add the cost of moving the smaller table to all the mappers, which is O(k*N*N3), where N3 is the number of rows in the smaller table.

          3) Reduce phase:

          In each reducer we just do a simple nested-loop join. Assuming an average (uniform) distribution of data, we get (n1/d1) rows of "A" and (n2/d2) rows of "B" per reducer call, so the cost of a single call is O((n1/d1)*(n2/d2)). There can be at most d1/N sequential reducer waves, so multiplying by that factor, the total cost is O((n1*n2)/(N*d2)).
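
          As a rough illustration only, the three phase estimates above (plus the map-join variant) could be wired up as in the following sketch. This is not Hive code; every class, method, and constant here is a hypothetical placeholder, and the inputs (row counts, distinct counts, k) are assumed to come from collected statistics and measurements.

          // Rough sketch of the phase-by-phase cost formulas above. These classes do not
          // exist in Hive; all names, constants, and inputs are illustrative assumptions.
          public class CommonJoinCostSketch {

              /** Phase 1: both tables scanned from disk once, in parallel on N machines. */
              static double mapScanCost(long n1, long n2, int machines) {
                  return (double) (n1 + n2) / machines;   // O((n1 + n2) / N), constant page size assumed
              }

              /** Phase 2, common join (upper bound): every row may move to a remote reducer. */
              static double shuffleCost(long n1, long n2, double k) {
                  return k * (n1 + n2);                   // k = assumed cost per unit of data transferred
              }

              /** Phase 2, map-side join: ship the smaller table to every machine instead. */
              static double mapJoinBroadcastCost(long n3, int machines, double k) {
                  return k * machines * n3;               // O(k * N * N3), N3 = rows in the smaller table
              }

              /** Phase 3: nested-loop join per reducer call, uniform key distribution assumed. */
              static double reduceCost(long n1, long n2, long d1, long d2, int machines) {
                  double perCall = ((double) n1 / d1) * ((double) n2 / d2);        // (n1/d1) * (n2/d2) per key
                  double waves = Math.max(1.0, (double) Math.min(d1, d2) / machines); // ~ d1/N waves when d1 < d2
                  return perCall * waves;                 // ~ O(n1 * n2 / (N * d2))
              }

              public static void main(String[] args) {
                  long n1 = 1_000_000, n2 = 5_000_000, d1 = 1_000, d2 = 10_000;    // assumed statistics
                  int machines = 20;
                  double k = 0.01;                        // assumed per-row transfer cost
                  double common = mapScanCost(n1, n2, machines)
                          + shuffleCost(n1, n2, k)
                          + reduceCost(n1, n2, d1, d2, machines);
                  System.out.println("Estimated common-join cost: " + common);
                  System.out.println("Map-join broadcast cost (A is the smaller table): "
                          + mapJoinBroadcastCost(n1, machines, k));
              }
          }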

          The above cost assumes a uniform distribution of data per table. We can improve this by maintaining per-node histograms of statistics, which would refine the calculation.

          After calculating the join cost, we need to predict the size of the resulting join table so that its statistics can be used to continue the above procedure for the rest of the tables.

          Estimating the result table size:

          The resulting table will have O(n1*n2/d1) rows, assuming that A.a and B.b have the same distribution of distinct values. This is where we can improve a great deal by maintaining good statistics, so that we can predict the correct size of the resulting table.
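
          As a side note, the standard textbook estimate for this is n1*n2/max(d1,d2), which coincides with the n1*n2/d1 figure above when the two join columns have the same number of distinct values. A minimal, hypothetical helper (again, not Hive code):

          // Hypothetical helper, not part of Hive: textbook join-output cardinality estimate.
          public class JoinCardinalitySketch {

              /** Assumes the smaller set of distinct join-key values is contained in the larger one. */
              static long estimateJoinOutputRows(long n1, long d1, long n2, long d2) {
                  // Each of the n1 * n2 row pairs matches with probability ~ 1 / max(d1, d2);
                  // with d1 == d2 this reduces to the n1 * n2 / d1 estimate above.
                  return (long) (((double) n1 * n2) / Math.max(d1, d2));
              }

              public static void main(String[] args) {
                  // 1M rows with 1,000 distinct keys joined to 5M rows with 10,000 distinct keys.
                  System.out.println(estimateJoinOutputRows(1_000_000, 1_000, 5_000_000, 10_000)); // 500000000
              }
          }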

          We can then follow a simple System R-style dynamic programming approach, where we keep the best plans in each iteration using the above cost formulae.
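
          To make the System R idea concrete, the following is a minimal, self-contained sketch of bottom-up enumeration of left-deep join orders. The joinCost and joinRows functions are crude placeholders for the estimates above, and none of the class or method names correspond to real Hive code:

          import java.util.HashMap;
          import java.util.Map;

          // Minimal sketch of System R-style bottom-up join enumeration (left-deep plans only).
          public class JoinOrderSketch {

              static class Plan {
                  final String order;  // human-readable join order, e.g. "((A x B) x C)"
                  final double rows;   // estimated output cardinality of this plan
                  final double cost;   // estimated cumulative cost of producing it
                  Plan(String order, double rows, double cost) {
                      this.order = order; this.rows = rows; this.cost = cost;
                  }
              }

              // Crude placeholder estimates; a real optimizer would plug in the per-phase
              // cost formulas and column statistics (distinct counts, histograms).
              static double joinCost(double leftRows, double rightRows) { return leftRows + rightRows; }
              static double joinRows(double leftRows, double rightRows) { return leftRows * rightRows / 1000.0; }

              /** Dynamic programming over table subsets: keep the cheapest plan per subset. */
              static Plan bestPlan(String[] names, double[] baseRows) {
                  int n = names.length;
                  Map<Integer, Plan> best = new HashMap<>();
                  for (int i = 0; i < n; i++) {
                      best.put(1 << i, new Plan(names[i], baseRows[i], 0.0));
                  }
                  // Masks are processed in increasing order, so every subset's best plan is
                  // final before we try to extend it by one more table.
                  for (int mask = 1; mask < (1 << n); mask++) {
                      Plan left = best.get(mask);
                      if (left == null) continue;
                      for (int i = 0; i < n; i++) {
                          if ((mask & (1 << i)) != 0) continue;
                          int extended = mask | (1 << i);
                          Plan candidate = new Plan("(" + left.order + " x " + names[i] + ")",
                                  joinRows(left.rows, baseRows[i]),
                                  left.cost + joinCost(left.rows, baseRows[i]));
                          Plan existing = best.get(extended);
                          if (existing == null || candidate.cost < existing.cost) {
                              best.put(extended, candidate);
                          }
                      }
                  }
                  return best.get((1 << n) - 1);
              }

              public static void main(String[] args) {
                  Plan p = bestPlan(new String[] {"A", "B", "C"},
                          new double[] {1_000_000, 5_000, 200_000});
                  System.out.println(p.order + "  estimated cost = " + p.cost);
              }
          }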

          This is a pretty long post already. This is the basic approach, and we can improve upon it further depending on the feedback I get.

          I presented this work as a poster at ACM-SIGMOD 2010 and I got some positive feedback.

          Namit Jain added a comment -

          Currently, Hive does not maintain statistics (distinct values per table/partition), which is the basis for the cost model in this discussion.

          Do you want to work on collecting such statistics first, and then we can use them for various plan optimizations?

          I can think of some advantages of the cost model right away (and I am sure there are many more):
          1. Predicting a query's "progress" and the time it will take.
          2. Determining the join order.

          bharath v added a comment -

          Can you please see https://issues.apache.org/jira/browse/HIVE-33? The column-level statistics part remains unresolved, though!

          Namit Jain added a comment -

          Yes, but we do need column stats for this.

          bharath v added a comment -

          So do I need to start by working on collecting column stats first and then move on to join reordering? Any hints or suggestions on where to start?

          bharath v added a comment -

          We need https://issues.apache.org/jira/browse/HIVE-1362 to be resolved before we can start this, but it seems to be inactive.

          bharath v added a comment -

          We need column level statistics per table to implement our cost model.

          bharath v added a comment -

          We need column level statistics to implement our cost model.

          Srinivasan Sembakkam Rajivelu added a comment -

          Bharath,
          Are you still working on this ticket? I am interested in doing join optimization in Hive, and I am also interested in the idea of implementing statistics in Hive. Has that been implemented?

          If you are not working on it, I can take this ticket and work on it, or please suggest a related ticket that I can take up and work on.

          bharath v added a comment -

          Hey,

          I am currently working on HIVE-33 (which blocks this JIRA) and I am submitting a proposal to GSoC 2012. If everything goes as planned, I'll submit some patches for it in a couple of months or so.


            People

            • Assignee: bharath v
            • Reporter: bharath v
            • Votes: 0
            • Watchers: 17
