Hive / HIVE-11160

Auto-gather column stats


    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Statistics
    • Labels: None
    • Target Version/s:

      Description

      When hive.stats.autogather=true, Hive collects table stats automatically during the INSERT OVERWRITE command. Users must then collect column stats themselves with the "Analyze" command. With this patch, column stats are collected automatically as well. More specifically, INSERT OVERWRITE automatically creates new column stats, and INSERT INTO automatically merges the new column stats with the existing ones.
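
      The difference between the two insert modes can be sketched in a few lines of Python. This is an illustration only, not the code in the attached patches: the LongColumnStats type, its fields, and the function names are hypothetical, and taking the larger distinct-value count is an assumed conservative rule, since an exact distinct count cannot be recovered from two summaries alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LongColumnStats:
    """Hypothetical summary of an integer column (not a Hive class)."""
    low: int        # smallest non-null value seen
    high: int       # largest non-null value seen
    num_nulls: int  # number of NULL values
    ndv: int        # estimated number of distinct values


def merge(existing: LongColumnStats, new: LongColumnStats) -> LongColumnStats:
    """Combine stats of the existing rows with stats of newly inserted rows."""
    return LongColumnStats(
        low=min(existing.low, new.low),
        high=max(existing.high, new.high),
        num_nulls=existing.num_nulls + new.num_nulls,
        # Assumption: keep the larger estimate. The true value lies
        # somewhere between max(a, b) and a + b.
        ndv=max(existing.ndv, new.ndv),
    )


def after_insert_overwrite(existing: Optional[LongColumnStats],
                           new: LongColumnStats) -> LongColumnStats:
    """INSERT OVERWRITE replaces the data, so the old stats are discarded."""
    return new


def after_insert_into(existing: Optional[LongColumnStats],
                      new: LongColumnStats) -> LongColumnStats:
    """INSERT INTO appends data, so the old and new stats are merged."""
    return new if existing is None else merge(existing, new)


old = LongColumnStats(low=1, high=50, num_nulls=2, ndv=40)
added = LongColumnStats(low=-5, high=30, num_nulls=1, ndv=25)
overwritten = after_insert_overwrite(old, added)
appended = after_insert_into(old, added)
```

      Here overwritten equals the stats of the new rows alone, while appended widens the value range and sums the null counts across both sets of rows.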

        Attachments

        1. HIVE-11160.09.patch
          4.10 MB
          Pengcheng Xiong
        2. HIVE-11160.08.patch
          3.79 MB
          Pengcheng Xiong
        3. HIVE-11160.07.patch
          3.55 MB
          Pengcheng Xiong
        4. HIVE-11160.06.patch
          343 kB
          Pengcheng Xiong
        5. HIVE-11160.05.patch
          327 kB
          Pengcheng Xiong
        6. HIVE-11160.04.patch
          327 kB
          Pengcheng Xiong
        7. HIVE-11160.03.patch
          101 kB
          Pengcheng Xiong
        8. HIVE-11160.02.patch
          103 kB
          Pengcheng Xiong
        9. HIVE-11160.01.patch
          24 kB
          Pengcheng Xiong

          Issue Links

          1. Deprecate HIVESTATSAUTOGATHER (Sub-task, Patch Available, Pengcheng Xiong)
          2. Move filesystem stats collection from metastore to ql (Sub-task, Patch Available, Zoltan Haindrich)
          3. StatsUtils.getColStatisticsFromExprMap may only provide info for a column once (Sub-task, Patch Available, Zoltan Haindrich)
          4. Deprecate hive.typecheck.on.insert (Sub-task, Open, Bertalan Kondrat)
          5. UpdateColumnStatsTask should set column stats as inaccurate (Sub-task, Open, Pengcheng Xiong)
          6. improve explain when invalidate stats (Sub-task, Open, Pengcheng Xiong)
          7. Consolidate basic stats logic for standalone table / partitioned (Sub-task, Open, Zoltan Haindrich)
          8. Support date type for merging column stats (Sub-task, Open, Pengcheng Xiong)
          9. Support vectorization for UDAF compute_stats (Sub-task, Open, Pengcheng Xiong)
          10. Stats: Consolidate stat state for limit 0 and where false (Sub-task, Open, Zoltan Haindrich)
          11. Tables which are known to be empty should not have NONE basic stat state (Sub-task, Open, Unassigned)
          12. Possible misuse of getDataSizeFromColumnStats (Sub-task, Open, Unassigned)
          13. Derby throws java.lang.StackOverflowError when it tries to get column stats from a table with thousands columns (Sub-task, Open, Pengcheng Xiong)
          14. Revise basic stat states for estimations (Sub-task, Open, Unassigned)
          15. Investigate bucketed table stats (Sub-task, Open, Unassigned)
          16. Differentiate table level stat / operator level stats (Sub-task, Open, Unassigned)
          17. Make StatsTask use less metastore calls (Sub-task, Open, Unassigned)
          18. Incorrect rownum estimation in joins (Sub-task, Open, Unassigned)
          19. Estimate avgrowsize for stats calc in mixed case (Sub-task, Open, Unassigned)
          20. Statistics: rawDataSize seems to be underestimated for text tables (Sub-task, Open, Unassigned)
          21. Support "analyze table T" (Sub-task, In Progress, Unassigned)

              People

              • Assignee:
                pxiong Pengcheng Xiong
                Reporter:
                pxiong Pengcheng Xiong
              • Votes:
                1
                Watchers:
                15

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  20m