Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16026

Cost-based Optimizer Framework

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Done
    • None
    • 2.3.0
    • SQL

    Description

      This is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. This framework can be used to implement some useful optimizations such as join reordering.

      The design should discuss how to break the work down into multiple, smaller logical units. For example, changes to statistics class, system catalog, cost estimation/propagation in expressions, cost estimation/propagation in operators can be done in decoupled pull requests.

      Attachments

        Issue Links

        1.
        Define the data structure to hold the statistics for CBO Sub-task Closed Unassigned Actions
        2.
        generate table level stats:stats generation/storing/loading Sub-task Resolved Zhenhua Wang Actions
        3.
        generate basic stats for column Sub-task Resolved Zhenhua Wang Actions
        4.
        Simplify AnalyzeColumnCommand Sub-task Resolved Reynold Xin Actions
        5.
        Create explicit contract for column stats serialization Sub-task Resolved Reynold Xin Actions
        6.
        Decouple Statistics and CatalogTable Sub-task Resolved Zhenhua Wang Actions
        7.
        Cardinality Estimation of Predicate Expressions Sub-task Resolved Ron Hu Actions
        8.
        Cardinality estimation of join operator Sub-task Resolved Zhenhua Wang Actions
        9.
        Cardinality estimation of project operator Sub-task Resolved Zhenhua Wang Actions
        10.
        Cardinality estimation of aggregate operator Sub-task Resolved Zhenhua Wang Actions
        11.
        show estimated stats when doing explain Sub-task Resolved Zhenhua Wang Actions
        12.
        broadcast decision based on cbo Sub-task Closed Unassigned Actions
        13.
        build side decision based on cbo Sub-task Closed Unassigned Actions
        14.
        join reorder Sub-task Resolved Zhenhua Wang Actions
        15.
        Support DESC FORMATTED TABLE COLUMN command to show column-level statistics Sub-task Resolved Zhenhua Wang Actions
        16.
        Implement ALTER TABLE UPDATE STATISTICS SET Sub-task Closed Unassigned Actions
        17.
        Aggregation function for computing bins (distinct value, count) pairs for equi-width histograms Sub-task Closed Unassigned Actions
        18.
        SQL aggregate function for CountMinSketch Sub-task Resolved Zhenhua Wang Actions
        19.
        Simplify CountMinSketch aggregate implementation Sub-task Resolved Reynold Xin Actions
        20.
        NPE when collecting column stats for string/binary column having only null values Sub-task Resolved Zhenhua Wang Actions
        21.
        Add a cbo conf to switch between default statistics and cbo estimated statistics Sub-task Resolved Zhenhua Wang Actions
        22.
        Add test cases for row size estimation Sub-task Resolved Zhenhua Wang Actions
        23.
        Unify two sets of statistics in LogicalPlan Sub-task Resolved Zhenhua Wang Actions
        24.
        Change non-cbo estimation of aggregate Sub-task Resolved Zhenhua Wang Actions
        25.
        Cardinality estimation of Limit and Sample Sub-task Resolved Zhenhua Wang Actions
        26.
        cardinality estimation involving two columns of the same table Sub-task Resolved Ron Hu Actions
        27.
        Improve join reorder: Exclude cartesian product candidates to reduce the search space Sub-task Resolved Zhenhua Wang Actions
        28.
        Join reorder should keep the same order of final project attributes Sub-task Resolved Zhenhua Wang Actions
        29.
        Don't estimate IsNull or IsNotNull predicates for non-leaf node Sub-task Resolved Zhenhua Wang Actions
        30.
        BroadcastHint should use child's stats Sub-task Resolved Zhenhua Wang Actions
        31.
        Use Catalyst type for min/max in ColumnStat for ease of estimation Sub-task Resolved Zhenhua Wang Actions
        32.
        Fix recursive join reordering: inside joins are not reordered Sub-task Resolved Zhenhua Wang Actions
        33.
        StatisticsColumnSuite failures on big endian platforms Sub-task Resolved Peter George Robbins Actions
        34.
        Ndv for columns not in filter condition should also be updated Sub-task Resolved Zhenhua Wang Actions
        35.
        Clearly document the mechanism to choose between two sources of statistics Sub-task Resolved Zhenhua Wang Actions
        36.
        Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats Sub-task Resolved Zhenhua Wang Actions
        37.
        Store zero size and row count after analyzing empty table Sub-task Resolved Zhenhua Wang Actions
        38.
        Invalidate stats once table data is changed Sub-task Resolved Zhenhua Wang Actions
        39.
        Update statistics after data changing commands Sub-task Resolved Zhenhua Wang Actions
        40.
        Remove conf from stats functions since now we have conf in LogicalPlan Sub-task Resolved Zhenhua Wang Actions
        41.
        Improve statistics test suites Sub-task Resolved Zhenhua Wang Actions
        42.
        Estimation relation size based on numRows * rowSize Sub-task Resolved Zhenhua Wang Actions
        43.
        Relation stats should be consistent with other plans based on cbo config Sub-task Resolved Zhenhua Wang Actions
        44.
        Wrong Hive table statistics may trigger OOM if enables CBO Sub-task Resolved Yuming Wang Actions
        45.
        Simplify some estimation logic by using double instead of decimal Sub-task Resolved Zhenhua Wang Actions
        46.
        CBO rowcount statistics doesn't work for partitioned parquet external table Sub-task Open Unassigned Actions
        47.
        Should use old partition stats to decide whether to update stats when analyzing partition Sub-task Resolved Zhenhua Wang Actions

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            ZenWzh Zhenhua Wang Assign to me
            rxin Reynold Xin
            Votes:
            12 Vote for this issue
            Watchers:
            97 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment