HIVE-5369: Annotate Hive operator tree with statistics from metastore

    Details

      Description

      Currently, the statistics gathered at the table/partition level and the column level are not used during the query planning stage, although they could be used to optimize query plans. Basic statistics such as uncompressed data size can be used for better reducer estimation. Other statistics, such as the number of rows, the number of distinct values per column, and the average column length, can be used by the Cost Based Optimizer (CBO) to select better query plans.

      As a first step in improving query planning, the statistics that are available in the metastore should be attached to the Hive operator tree. The operator tree should be walked and annotated with statistics information. The attached statistics will vary for each operator depending on the operation it performs. For example, a select operator changes the average row size but does not affect the number of rows. Similarly, a filter operator changes the number of rows but does not change the average row size. Similar rules can be applied to other operators as well.
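
      To illustrate how uncompressed data size could feed reducer estimation, here is a minimal sketch in Python. It assumes a simple "one reducer per N bytes, capped at a maximum" rule; the function and its parameters (estimate_reducers, bytes_per_reducer, max_reducers) are hypothetical names chosen for illustration and are not taken from the patch.

```python
def estimate_reducers(uncompressed_bytes, bytes_per_reducer, max_reducers):
    """Estimate a reducer count from the uncompressed input size.

    Hypothetical helper: one reducer per `bytes_per_reducer` bytes,
    never fewer than 1 and never more than `max_reducers`.
    """
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids float rounding on very large sizes.
    needed = -(-max(uncompressed_bytes, 0) // bytes_per_reducer)
    return max(1, min(max_reducers, needed))


GIB = 1024 ** 3
# 10 GiB of uncompressed input at 1 GiB per reducer.
reducers = estimate_reducers(10 * GIB, GIB, 999)
```

      The point of using the uncompressed size is that a heavily compressed table would otherwise look much smaller than the data the reducers actually have to process.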

      Rules for different operators are added as comments in the code. For more detailed information, the reference book that I am using is "Database Systems: The Complete Book" by Garcia-Molina et al.
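
      The select and filter rules described above can be sketched as a small tree walk. This is an illustrative Python model, not the code in the patch: the Stats and Operator classes, the field names, and the fixed selectivity of 0.5 are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stats:
    num_rows: int
    avg_row_size: float  # bytes

    @property
    def data_size(self) -> float:
        return self.num_rows * self.avg_row_size


@dataclass
class Operator:
    kind: str  # "scan", "select" or "filter" in this sketch
    inputs: List["Operator"] = field(default_factory=list)
    table_stats: Optional[Stats] = None         # scan: stats read from the metastore
    projected_row_size: Optional[float] = None  # select: row size after projection
    selectivity: Optional[float] = None         # filter: assumed fraction of rows kept
    stats: Optional[Stats] = None               # filled in by annotate()


def annotate(op: Operator) -> Stats:
    """Annotate inputs first, then derive this operator's statistics."""
    for upstream_op in op.inputs:
        annotate(upstream_op)
    if op.kind == "scan":
        op.stats = op.table_stats
    elif op.kind == "select":
        upstream = op.inputs[0].stats
        # Select changes the average row size, not the number of rows.
        op.stats = Stats(upstream.num_rows, op.projected_row_size)
    elif op.kind == "filter":
        upstream = op.inputs[0].stats
        # Filter changes the number of rows, not the average row size.
        rows = math.ceil(upstream.num_rows * op.selectivity)
        op.stats = Stats(rows, upstream.avg_row_size)
    else:
        raise ValueError(f"no statistics rule for operator kind {op.kind!r}")
    return op.stats


# scan -> filter -> select
scan = Operator("scan", table_stats=Stats(num_rows=1000, avg_row_size=100.0))
filt = Operator("filter", inputs=[scan], selectivity=0.5)
sel = Operator("select", inputs=[filt], projected_row_size=20.0)
annotate(sel)
```

      After the walk, each operator carries its own estimate, so a later planning step can read sel.stats.data_size without revisiting the metastore.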

      Attachments

      1. HIVE-5369.WIP.txt (146 kB, Prasanth Jayachandran)
      2. HIVE-5369.2.WIP.txt (874 kB, Prasanth Jayachandran)
      3. HIVE-5369.1.txt (750 kB, Prasanth Jayachandran)
      4. HIVE-5369.refactor.WIP.txt (700 kB, Prasanth Jayachandran)
      5. HIVE-5369.2.patch.txt (725 kB, Prasanth Jayachandran)
      6. HIVE-5369.3.patch.txt (718 kB, Prasanth Jayachandran)
      7. HIVE-5369.4.patch.txt (796 kB, Prasanth Jayachandran)
      8. HIVE-5369.5.patch.txt (800 kB, Prasanth Jayachandran)
      9. HIVE-5369.6.patch.txt (803 kB, Prasanth Jayachandran)
      10. HIVE-5369.7.patch.txt (1.23 MB, Prasanth Jayachandran)
      11. HIVE-5369.8.patch.txt (1.27 MB, Prasanth Jayachandran)
      12. HIVE-5369.9.patch.txt (1.29 MB, Prasanth Jayachandran)
      13. HIVE-5369.9.patch (1.29 MB, Gunther Hagleitner)
      14. HIVE-5369.10.patch (1.29 MB, Prasanth Jayachandran)

        Issue Links

        1. Improve the stats of operators based on heuristics in the absence of any column statistics (Sub-task, Resolved, Prasanth Jayachandran)
        2. Make fetching of column statistics configurable (Sub-task, Resolved, Prasanth Jayachandran)
        3. Better heuristics for worst case statistics estimates for join, limit and filter operator (Sub-task, Resolved, Prasanth Jayachandran)
        4. Fix statistics annotation related test failures in hadoop2 (Sub-task, Resolved, Prasanth Jayachandran)
        5. In statistics annotation add flag to say if statistics is estimated or accurate (Sub-task, Open, Prasanth Jayachandran)
        6. Support column statistics for expressions in GBY attributes, JOIN condition etc. when annotating operator tree with statistics (Sub-task, Open, Prasanth Jayachandran)
        7. Add statistics rule for Union operator (Sub-task, Open, Prasanth Jayachandran)
        8. Support for operators like PTF, Script, Extract etc. in statistics annotation. (Sub-task, Open, Prasanth Jayachandran)
        9. Update statistics rules for different types of joins (Sub-task, Open, Prasanth Jayachandran)
        10. Add documentation for stats configs to hive-default.xml.template (Sub-task, Resolved, Prasanth Jayachandran)
        11. Add protection against divide by zero in stats annotation (Sub-task, Resolved, Prasanth Jayachandran)
        12. Update column stats based on filter expression in stats annotation (Sub-task, Open, Prasanth Jayachandran)
        13. Stats annotation fails to evaluate constant expressions in filter operator (Sub-task, Closed, Prasanth Jayachandran)
        14. Make use of number of nulls column statistics in filter rule (Sub-task, Closed, Prasanth Jayachandran)
        15. Make use of decimal column statistics in statistics annotation (Sub-task, Closed, Prasanth Jayachandran)
        16. Some fixes and improvements to statistics annotation rules (Sub-task, Closed, Prasanth Jayachandran)
        17. JOIN operator should update the column stats when number of rows changes (Sub-task, Closed, Prasanth Jayachandran)
        18. Join stats annotation rule is not updating columns statistics correctly (Sub-task, Closed, Prasanth Jayachandran)
        19. Ease-out denominator for multi-attribute join case in statistics annotation (Sub-task, Closed, Prasanth Jayachandran)
        20. Missing null check cause NPE when updating join column stats in statistics annotation (Sub-task, Closed, Prasanth Jayachandran)
        21. Column statistics from expression does not handle fields within complex types (Sub-task, Closed, Prasanth Jayachandran)
        22. With fetch column stats disabled number of elements in grouping set is not taken into account (Sub-task, Closed, Prasanth Jayachandran)
        23. Incorrect calculation of number of rows in JoinStatsRule.process results in overflow (Sub-task, Closed, Prasanth Jayachandran)
        24. StatsRulesProcFactory should gracefully handle overflows (Sub-task, Closed, Prasanth Jayachandran)
        25. Group-By operator stat-annotation only uses distinct approx to generate rollups (Sub-task, Closed, Prasanth Jayachandran)
        26. Select Operator does not rename column stats properly in case of select star (Sub-task, Closed, Prasanth Jayachandran)
        27. With dynamic partition enabled fact table selectivity is not taken into account when generating the physical plan (Use CBO cardinality using physical plan generation) (Sub-task, Closed, Prasanth Jayachandran)
        28. NPE in PK-FK inference when one side of join is complex tree (Sub-task, Closed, Prasanth Jayachandran)
        29. Support LateralViewJoinOperator and LateralViewForwardOperator in stats annotation (Sub-task, Closed, Prasanth Jayachandran)

          Activity

          Lefty Leverenz made changes -
          Link This issue is related to HIVE-6300 [ HIVE-6300 ]
          Harish Butani made changes -
          Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.10.patch [ 12614426 ]
          Gunther Hagleitner made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-5369.9.patch [ 12614341 ]
          Gunther Hagleitner made changes -
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.9.patch.txt [ 12614028 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.8.patch.txt [ 12613969 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.7.patch.txt [ 12613940 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.6.patch.txt [ 12613777 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.5.patch.txt [ 12613517 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.4.patch.txt [ 12613466 ]
          Prasanth Jayachandran made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.3.patch.txt [ 12612255 ]
          Prasanth Jayachandran made changes -
          Remote Link This issue links to "Review Board Link (Web Link)" [ 13312 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.2.patch.txt [ 12612081 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.refactor.WIP.txt [ 12611993 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.1.txt [ 12607973 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.2.WIP.txt [ 12607765 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.txt [ 12607434 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.2.txt [ 12607430 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.2.txt [ 12607432 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.2.txt [ 12607432 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.txt [ 12605224 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-5369.WIP.2.txt [ 12607430 ]
          Prasanth Jayachandran made changes -
          Link This issue is blocked by HIVE-5325 [ HIVE-5325 ]
          Prasanth Jayachandran made changes -
          Description: appended the paragraph stating that rules for different operators are added as comments in the code, with "Database Systems: The Complete Book" by Garcia-Molina et al. as the reference. The rest of the description is unchanged.
          Prasanth Jayachandran made changes -
          Field Original Value New Value
          Attachment HIVE-5369.WIP.txt [ 12605224 ]
          Prasanth Jayachandran created issue -

            People

            • Assignee: Prasanth Jayachandran
            • Reporter: Prasanth Jayachandran
            • Votes: 0
            • Watchers: 6
