HIVE-11160: Auto-gather column stats

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Statistics
    • Labels: None
    • Target Version/s:

      Description

      When hive.stats.autogather=true, Hive collects table stats during the INSERT OVERWRITE command, but users must then collect column stats themselves with the ANALYZE command. With this patch, column stats are also collected automatically. Specifically, INSERT OVERWRITE automatically creates new column stats, and INSERT INTO automatically merges the new column stats with the existing ones.
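      The difference between the two cases can be sketched as follows. This is a minimal illustration, not the code in the patch: the LongColumnStats class, its fields, and both function names are hypothetical, and the merge rule for the distinct-value estimate (taking the larger of the two) is an assumption chosen because the exact count for the combined data cannot be derived from two separate counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LongColumnStats:
    """Hypothetical, simplified stats for one integer column (not a Hive class)."""
    low: int           # smallest non-null value seen
    high: int          # largest non-null value seen
    num_nulls: int     # number of NULL values
    ndv_estimate: int  # estimated number of distinct values


def stats_after_insert_overwrite(
    existing: Optional[LongColumnStats], new: LongColumnStats
) -> LongColumnStats:
    # INSERT OVERWRITE replaces the data, so the old stats no longer apply
    # and the freshly gathered stats are used as-is.
    return new


def stats_after_insert_into(
    existing: Optional[LongColumnStats], new: LongColumnStats
) -> LongColumnStats:
    # INSERT INTO appends rows, so stats for the new rows are merged
    # with whatever was already recorded.
    if existing is None:
        return new
    return LongColumnStats(
        low=min(existing.low, new.low),
        high=max(existing.high, new.high),
        num_nulls=existing.num_nulls + new.num_nulls,
        # Assumption: the true distinct count of the combined data lies between
        # max(a, b) and a + b; this sketch keeps the lower bound.
        ndv_estimate=max(existing.ndv_estimate, new.ndv_estimate),
    )
```

      With existing stats of (low=1, high=10, num_nulls=2, ndv_estimate=5) and newly gathered stats of (low=5, high=20, num_nulls=1, ndv_estimate=8), the overwrite case keeps only the new stats, while the append case widens the range to 1..20 and sums the null counts to 3.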

        Attachments

        1. HIVE-11160.01.patch
          24 kB
          Pengcheng Xiong
        2. HIVE-11160.02.patch
          103 kB
          Pengcheng Xiong
        3. HIVE-11160.03.patch
          101 kB
          Pengcheng Xiong
        4. HIVE-11160.04.patch
          327 kB
          Pengcheng Xiong
        5. HIVE-11160.05.patch
          327 kB
          Pengcheng Xiong
        6. HIVE-11160.06.patch
          343 kB
          Pengcheng Xiong
        7. HIVE-11160.07.patch
          3.55 MB
          Pengcheng Xiong
        8. HIVE-11160.08.patch
          3.79 MB
          Pengcheng Xiong
        9. HIVE-11160.09.patch
          4.10 MB
          Pengcheng Xiong

          Issue Links

          1. thrift change [Sub-task, Closed, Pengcheng Xiong]
          2. Auto-gather column stats - phase 1 [Sub-task, Closed, Pengcheng Xiong]
          3. "Create table like" command should initialize the basic stats for the table [Sub-task, Closed, Pengcheng Xiong]
          4. GenMRFileSink1.java may refer to a wrong MR task in multi-insert case [Sub-task, Resolved, Pengcheng Xiong]
          5. Support auto gather column stats for columns with trailing white spaces [Sub-task, Resolved, Pengcheng Xiong]
          6. Support stats computation for column in QuotedIdentifier [Sub-task, Resolved, Pengcheng Xiong]
          7. With column stats, mergejoin.q throws NPE [Sub-task, Resolved, Pengcheng Xiong]
          8. Column pruner should continue to work when SEL has more than 1 child [Sub-task, Resolved, Pengcheng Xiong]
          9. Fix failing test org.apache.hive.jdbc.TestJdbcDriver2.testResultSetMetaData [Sub-task, Resolved, Pengcheng Xiong]
          10. Fix failing test columnstats_partlvl_invalid_values when autogather column stats is on [Sub-task, Resolved, Pengcheng Xiong]
          11. analyze table compute statistics fails due to presence of Infinity value in double column [Sub-task, Resolved, Pengcheng Xiong]
          12. Skip column stats when colStats is empty [Sub-task, Closed, Pengcheng Xiong]
          13. ColumnStats merge should consider the accuracy of the current stats [Sub-task, Resolved, Zoltan Haindrich]
          14. Set column stats default as true when creating new tables/partitions [Sub-task, Closed, Pengcheng Xiong]
          15. Merge stats task and column stats task into a single task [Sub-task, Closed, Zoltan Haindrich]
          16. hive.optimize.bucketingsorting should compare the schema before removing RS [Sub-task, Closed, Pengcheng Xiong]
          17. remove ColumnStatsDesc usage from columnstatsupdatetask [Sub-task, Closed, Gergely Hajós]
          18. retire ANALYZE TABLE ... PARTIALSCAN [Sub-task, Closed, Zoltan Haindrich]
          19. TableScanOperator might miss vectorization on flag [Sub-task, Closed, Zoltan Haindrich]
          20. Merging Statistics are promoted to COMPLETE (most of the time) [Sub-task, Closed, Zoltan Haindrich]
          21. Fix exception on tables handled by HBaseHandler if columnsstats are auto-gathered [Sub-task, Closed, Zoltan Haindrich]
          22. Remove mixed partitions/table schema support [Sub-task, Resolved, Zoltan Haindrich]
          23. Aggregation of an empty set doesn't pass constants to the UDAF [Sub-task, Resolved, Zoltan Haindrich]
          24. Fix StatsUtils.combineRange to combine intervals [Sub-task, Closed, Zoltan Haindrich]
          25. Stats: create materialized view should also collect stats [Sub-task, Closed, Zoltan Haindrich]
          26. Stats: Remove usage of clone() methods [Sub-task, Closed, Bertalan Kondrat]
          27. Improve size estimation for array() to be not 0 [Sub-task, Closed, Zoltan Haindrich]
          28. Fix columnstats problem in case schema evolution [Sub-task, Closed, Zoltan Haindrich; Time Spent: 20m]
          29. Enable auto-gather column stats by default [Sub-task, Closed, Zoltan Haindrich]
          30. Stats: rownum estimation from datasize underestimates in most cases [Sub-task, Closed, Zoltan Haindrich]
          31. Fix test TestAcidOnTez [Sub-task, Resolved, Zoltan Haindrich]
          32. in case basic stats are missing; rowcount estimation depends on the selected columns size [Sub-task, Resolved, Zoltan Haindrich]
          33. Remove hive.stats.atomic [Sub-task, Closed, Bertalan Kondrat]
          34. Columnstats gather on mm tables: re-enable disabled test [Sub-task, Closed, Zoltan Haindrich]
          35. Partitioned tables statistics can go wrong in basic stats mixed case [Sub-task, Resolved, Zoltan Haindrich]
          36. Fix columnstats merge NPE [Sub-task, Closed, László Bodor]
          37. Fill stats for temporary tables [Sub-task, Resolved, Unassigned]
          38. Aggregate row traffic for acid tables [Sub-task, Resolved, Zoltan Haindrich]
          39. Deprecate HIVESTATSAUTOGATHER [Sub-task, Patch Available, Pengcheng Xiong]
          40. Move filesystem stats collection from metastore to ql [Sub-task, Patch Available, Zoltan Haindrich]
          41. StatsUtils.getColStatisticsFromExprMap may only provide info for a column once [Sub-task, Patch Available, Zoltan Haindrich]
          42. Deprecate hive.typecheck.on.insert [Sub-task, Open, Bertalan Kondrat]
          43. UpdateColumnStatsTask should set column stats as inaccurate [Sub-task, Open, Pengcheng Xiong]
          44. Support CTAS for auto gather column stats [Sub-task, Resolved, Jesus Camacho Rodriguez]
          45. improve explain when invalidate stats [Sub-task, Open, Pengcheng Xiong]
          46. Consolidate basic stats logic for standalone table / partitioned [Sub-task, Open, Zoltan Haindrich]
          47. Support date type for merging column stats [Sub-task, Open, Pengcheng Xiong]
          48. Support vectorization for UDAF compute_stats [Sub-task, Open, Pengcheng Xiong]
          49. Stats: Consolidate stat state for limit 0 and where false [Sub-task, Open, Zoltan Haindrich]
          50. Tables which are known to be empty should not have NONE basic stat state [Sub-task, Open, Denys Kuzmenko; Time Spent: 0.5h]
          51. Possible misuse of getDataSizeFromColumnStats [Sub-task, Open, Unassigned]
          52. Derby throws java.lang.StackOverflowError when it tries to get column stats from a table with thousands columns [Sub-task, Open, Pengcheng Xiong]
          53. Revise basic stat states for estimations [Sub-task, Open, Unassigned]
          54. Investigate bucketed table stats [Sub-task, Open, Unassigned]
          55. Differentiate table level stat / operator level stats [Sub-task, Open, Unassigned]
          56. Column stats are not autogathered for materialized views [Sub-task, Resolved, Jesus Camacho Rodriguez]
          57. Make StatsTask use less metastore calls [Sub-task, Open, Unassigned]
          58. Incorrect rownum estimation in joins [Sub-task, Open, Unassigned]
          59. Estimate avgrowsize for stats calc in mixed case [Sub-task, Open, Unassigned]
          60. Statistics: rawDataSize seems to be underestimated for text tables [Sub-task, Open, Unassigned]
          61. Support "analyze table T" [Sub-task, In Progress, Unassigned]
          62. Support date type for column stats autogather [Sub-task, Resolved, Zoltan Haindrich]

              People

              • Assignee: Pengcheng Xiong (pxiong)
              • Reporter: Pengcheng Xiong (pxiong)
              • Votes: 1
              • Watchers: 15

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m