Hive / HIVE-11160

Auto-gather column stats


Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Statistics
    • Labels: None

    Description

      When hive.stats.autogather=true, Hive collects table stats automatically during the INSERT OVERWRITE command. Column stats, however, still have to be collected by the user with the "Analyze" command. With this patch, column stats are collected automatically as well. More specifically, INSERT OVERWRITE automatically creates new column stats, and INSERT INTO automatically merges the new column stats with the existing ones.
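
      The difference between the two insert modes can be illustrated with a small sketch. It is not taken from the patch: the ColumnStats fields, the merge rules, and the function names are assumptions chosen for illustration, and Hive's actual stats objects and merge logic may differ. In particular, distinct-value counts cannot be merged exactly from two summaries, so the sketch keeps the larger estimate as a lower bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column summary; not Hive's metastore schema."""
    low: int        # smallest value seen
    high: int       # largest value seen
    num_nulls: int  # number of NULL values
    ndv: int        # estimated number of distinct values


def merge(old: ColumnStats, new: ColumnStats) -> ColumnStats:
    """Combine stats of existing rows with stats of newly inserted rows."""
    return ColumnStats(
        low=min(old.low, new.low),
        high=max(old.high, new.high),
        num_nulls=old.num_nulls + new.num_nulls,
        # Assumption: without a sketch of the values, the true distinct
        # count is unknown; the larger estimate is only a lower bound.
        ndv=max(old.ndv, new.ndv),
    )


def stats_after_insert(existing: Optional[ColumnStats],
                       fresh: ColumnStats,
                       overwrite: bool) -> ColumnStats:
    """INSERT OVERWRITE replaces the stats; INSERT INTO merges them."""
    if overwrite or existing is None:
        return fresh
    return merge(existing, fresh)


existing = ColumnStats(low=1, high=10, num_nulls=2, ndv=8)
fresh = ColumnStats(low=5, high=20, num_nulls=1, ndv=6)

after_overwrite = stats_after_insert(existing, fresh, overwrite=True)
after_into = stats_after_insert(existing, fresh, overwrite=False)
```

      Under these assumptions, after_overwrite equals the stats of the new rows alone, while after_into covers the range 1 to 20 with 3 nulls.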

      Attachments

        1. HIVE-11160.09.patch
          4.10 MB
          Pengcheng Xiong
        2. HIVE-11160.08.patch
          3.79 MB
          Pengcheng Xiong
        3. HIVE-11160.07.patch
          3.55 MB
          Pengcheng Xiong
        4. HIVE-11160.06.patch
          343 kB
          Pengcheng Xiong
        5. HIVE-11160.05.patch
          327 kB
          Pengcheng Xiong
        6. HIVE-11160.04.patch
          327 kB
          Pengcheng Xiong
        7. HIVE-11160.03.patch
          101 kB
          Pengcheng Xiong
        8. HIVE-11160.02.patch
          103 kB
          Pengcheng Xiong
        9. HIVE-11160.01.patch
          24 kB
          Pengcheng Xiong

        Issue Links

        1. thrift change (Sub-task, Closed, Pengcheng Xiong)
        2. Auto-gather column stats - phase 1 (Sub-task, Closed, Pengcheng Xiong)
        3. "Create table like" command should initialize the basic stats for the table (Sub-task, Closed, Pengcheng Xiong)
        4. GenMRFileSink1.java may refer to a wrong MR task in multi-insert case (Sub-task, Resolved, Pengcheng Xiong)
        5. Support auto gather column stats for columns with trailing white spaces (Sub-task, Resolved, Pengcheng Xiong)
        6. Support stats computation for column in QuotedIdentifier (Sub-task, Resolved, Pengcheng Xiong)
        7. With column stats, mergejoin.q throws NPE (Sub-task, Resolved, Pengcheng Xiong)
        8. Column pruner should continue to work when SEL has more than 1 child (Sub-task, Resolved, Pengcheng Xiong)
        9. Fix failing test org.apache.hive.jdbc.TestJdbcDriver2.testResultSetMetaData (Sub-task, Resolved, Pengcheng Xiong)
        10. Fix failing test columnstats_partlvl_invalid_values when autogather column stats is on (Sub-task, Resolved, Pengcheng Xiong)
        11. analyze table compute statistics fails due to presence of Infinity value in double column (Sub-task, Resolved, Pengcheng Xiong)
        12. Skip column stats when colStats is empty (Sub-task, Closed, Pengcheng Xiong)
        13. ColumnStats merge should consider the accuracy of the current stats (Sub-task, Resolved, Zoltan Haindrich)
        14. Set column stats default as true when creating new tables/partitions (Sub-task, Closed, Pengcheng Xiong)
        15. Merge stats task and column stats task into a single task (Sub-task, Closed, Zoltan Haindrich)
        16. hive.optimize.bucketingsorting should compare the schema before removing RS (Sub-task, Closed, Pengcheng Xiong)
        17. remove ColumnStatsDesc usage from columnstatsupdatetask (Sub-task, Closed, Gergely Hajós)
        18. retire ANALYZE TABLE ... PARTIALSCAN (Sub-task, Closed, Zoltan Haindrich)
        19. TableScanOperator might miss vectorization on flag (Sub-task, Closed, Zoltan Haindrich)
        20. Merging Statistics are promoted to COMPLETE (most of the time) (Sub-task, Closed, Zoltan Haindrich)
        21. Fix exception on tables handled by HBaseHandler if columnsstats are auto-gathered (Sub-task, Closed, Zoltan Haindrich)
        22. Remove mixed partitions/table schema support (Sub-task, Resolved, Zoltan Haindrich)
        23. Aggregation of an empty set doesn't pass constants to the UDAF (Sub-task, Resolved, Zoltan Haindrich)
        24. Fix StatsUtils.combineRange to combine intervals (Sub-task, Closed, Zoltan Haindrich)
        25. Stats: create materialized view should also collect stats (Sub-task, Closed, Zoltan Haindrich)
        26. Stats: Remove usage of clone() methods (Sub-task, Closed, Bertalan Kondrat)
        27. Improve size estimation for array() to be not 0 (Sub-task, Closed, Zoltan Haindrich)
        28. Fix columnstats problem in case schema evolution (Sub-task, Closed, Zoltan Haindrich; Original Estimate: Not Specified; Time Spent: 20m)
        29. Enable auto-gather column stats by default (Sub-task, Closed, Zoltan Haindrich)
        30. Stats: rownum estimation from datasize underestimates in most cases (Sub-task, Closed, Zoltan Haindrich)
        31. Fix test TestAcidOnTez (Sub-task, Resolved, Zoltan Haindrich)
        32. in case basic stats are missing; rowcount estimation depends on the selected columns size (Sub-task, Resolved, Zoltan Haindrich)
        33. Remove hive.stats.atomic (Sub-task, Closed, Bertalan Kondrat)
        34. Columnstats gather on mm tables: re-enable disabled test (Sub-task, Closed, Zoltan Haindrich)
        35. Partitioned tables statistics can go wrong in basic stats mixed case (Sub-task, Resolved, Zoltan Haindrich)
        36. Fix columnstats merge NPE (Sub-task, Closed, László Bodor)
        37. Fill stats for temporary tables (Sub-task, Resolved, Unassigned)
        38. Aggregate row traffic for acid tables (Sub-task, Resolved, Zoltan Haindrich)
        39. Deprecate HIVESTATSAUTOGATHER (Sub-task, Patch Available, Pengcheng Xiong)
        40. Move filesystem stats collection from metastore to ql (Sub-task, Patch Available, Zoltan Haindrich)
        41. StatsUtils.getColStatisticsFromExprMap may only provide info for a column once (Sub-task, Patch Available, Zoltan Haindrich)
        42. Deprecate hive.typecheck.on.insert (Sub-task, Open, Bertalan Kondrat)
        43. UpdateColumnStatsTask should set column stats as inaccurate (Sub-task, Open, Pengcheng Xiong)
        44. Support CTAS for auto gather column stats (Sub-task, Resolved, Jesus Camacho Rodriguez)
        45. improve explain when invalidate stats (Sub-task, Open, Pengcheng Xiong)
        46. Consolidate basic stats logic for standalone table / partitioned (Sub-task, Open, Zoltan Haindrich)
        47. Support date type for merging column stats (Sub-task, Open, Pengcheng Xiong)
        48. Support vectorization for UDAF compute_stats (Sub-task, Open, Pengcheng Xiong)
        49. Stats: Consolidate stat state for limit 0 and where false (Sub-task, Open, Zoltan Haindrich)
        50. Tables which are known to be empty should not have NONE basic stat state (Sub-task, Open, Denys Kuzmenko; Original Estimate: Not Specified; Time Spent: 0.5h)
        51. Possible misuse of getDataSizeFromColumnStats (Sub-task, Open, Unassigned)
        52. Derby throws java.lang.StackOverflowError when it tries to get column stats from a table with thousands columns (Sub-task, Open, Pengcheng Xiong)
        53. Revise basic stat states for estimations (Sub-task, Open, Unassigned)
        54. Investigate bucketed table stats (Sub-task, Open, Unassigned)
        55. Differentiate table level stat / operator level stats (Sub-task, Open, Unassigned)
        56. Column stats are not autogathered for materialized views (Sub-task, Resolved, Jesus Camacho Rodriguez)
        57. Make StatsTask use less metastore calls (Sub-task, Open, Unassigned)
        58. Incorrect rownum estimation in joins (Sub-task, Open, Unassigned)
        59. Estimate avgrowsize for stats calc in mixed case (Sub-task, Open, Unassigned)
        60. Statistics: rawDataSize seems to be underestimated for text tables (Sub-task, Open, Unassigned)
        61. Support "analyze table T" (Sub-task, In Progress, Unassigned)
        62. Support date type for column stats autogather (Sub-task, Resolved, Zoltan Haindrich)


          People

            Assignee: Pengcheng Xiong (pxiong)
            Reporter: Pengcheng Xiong (pxiong)

            Dates

              Created:
              Updated:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 50m
