Hive / HIVE-11160

Auto-gather column stats


Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Statistics
    • Labels: None

    Description

      When hive.stats.autogather=true, Hive collects table stats during the INSERT OVERWRITE command, but users must still collect the column stats themselves with the "Analyze" command. With this patch, the column stats are also collected automatically. More specifically, INSERT OVERWRITE automatically creates new column stats, and INSERT INTO automatically merges the new column stats with the existing ones.
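
      The difference between the two insert modes can be sketched as follows. This is only an illustration of the create-versus-merge behavior described above, not the code in the patch: the ColumnStats fields, the merge and apply_insert helpers, and the choice to keep the larger distinct-value estimate are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical summary for one numeric column (not Hive's metastore schema)."""
    low: int
    high: int
    num_nulls: int
    ndv: int  # estimated number of distinct values


def merge(old: ColumnStats, new: ColumnStats) -> ColumnStats:
    """Combine stats of the existing rows with stats of newly appended rows."""
    return ColumnStats(
        low=min(old.low, new.low),
        high=max(old.high, new.high),
        num_nulls=old.num_nulls + new.num_nulls,
        # Assumption: without per-value sketches the true distinct count lies
        # between max(old, new) and old + new; this example keeps the lower bound.
        ndv=max(old.ndv, new.ndv),
    )


def apply_insert(existing: Optional[ColumnStats], fresh: ColumnStats,
                 overwrite: bool) -> ColumnStats:
    """INSERT OVERWRITE replaces the stats; INSERT INTO merges them."""
    if overwrite or existing is None:
        return fresh
    return merge(existing, fresh)
```

      Here apply_insert(old, new, overwrite=True) discards the old stats entirely, while overwrite=False widens the value range and adds the null counts, so the stored stats keep describing all rows in the table.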

      Attachments

        1. HIVE-11160.01.patch
          24 kB
          Pengcheng Xiong
        2. HIVE-11160.02.patch
          103 kB
          Pengcheng Xiong
        3. HIVE-11160.03.patch
          101 kB
          Pengcheng Xiong
        4. HIVE-11160.04.patch
          327 kB
          Pengcheng Xiong
        5. HIVE-11160.05.patch
          327 kB
          Pengcheng Xiong
        6. HIVE-11160.06.patch
          343 kB
          Pengcheng Xiong
        7. HIVE-11160.07.patch
          3.55 MB
          Pengcheng Xiong
        8. HIVE-11160.08.patch
          3.79 MB
          Pengcheng Xiong
        9. HIVE-11160.09.patch
          4.10 MB
          Pengcheng Xiong

        Issue Links

          1. Deprecate HIVESTATSAUTOGATHER (Sub-task, Patch Available, Pengcheng Xiong)
          2. Move filesystem stats collection from metastore to ql (Sub-task, Patch Available, Zoltan Haindrich)
          3. StatsUtils.getColStatisticsFromExprMap may only provide info for a column once (Sub-task, Patch Available, Zoltan Haindrich)
          4. Deprecate hive.typecheck.on.insert (Sub-task, Open, Bertalan Kondrat)
          5. UpdateColumnStatsTask should set column stats as inaccurate (Sub-task, Open, Pengcheng Xiong)
          6. improve explain when invalidate stats (Sub-task, Open, Pengcheng Xiong)
          7. Consolidate basic stats logic for standalone table / partitioned (Sub-task, Open, Zoltan Haindrich)
          8. Support date type for merging column stats (Sub-task, Open, Pengcheng Xiong)
          9. Support vectorization for UDAF compute_stats (Sub-task, Open, Pengcheng Xiong)
          10. Stats: Consolidate stat state for limit 0 and where false (Sub-task, Open, Zoltan Haindrich)
          11. Tables which are known to be empty should not have NONE basic stat state (Sub-task, Open, Denys Kuzmenko; progress 100%, time spent 0.5h)
          12. Possible misuse of getDataSizeFromColumnStats (Sub-task, Open, Unassigned)
          13. Derby throws java.lang.StackOverflowError when it tries to get column stats from a table with thousands columns (Sub-task, Open, Pengcheng Xiong)
          14. Revise basic stat states for estimations (Sub-task, Open, Unassigned)
          15. Investigate bucketed table stats (Sub-task, Open, Unassigned)
          16. Differentiate table level stat / operator level stats (Sub-task, Open, Unassigned)
          17. Make StatsTask use less metastore calls (Sub-task, Open, Unassigned)
          18. Incorrect rownum estimation in joins (Sub-task, Open, Unassigned)
          19. Estimate avgrowsize for stats calc in mixed case (Sub-task, Open, Unassigned)
          20. Statistics: rawDataSize seems to be underestimated for text tables (Sub-task, Open, Unassigned)
          21. Support "analyze table T" (Sub-task, In Progress, Unassigned)

            People

              Assignee: Pengcheng Xiong (pxiong)
              Reporter: Pengcheng Xiong (pxiong)
              Votes: 1
              Watchers: 15

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m