SPARK-16026: Cost-based Optimizer Framework

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: SQL

    Description

      This is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. This framework can be used to implement some useful optimizations such as join reordering.
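
      To illustrate why join reordering benefits from cost estimates, here is a minimal sketch in plain Python. It is not Spark's algorithm: the table names, row counts, selectivities, and the helpers `plan_cost` and `best_join_order` are all hypothetical, and the search is a brute-force enumeration of left-deep orders that scores each one by the sum of its estimated intermediate result sizes.

```python
from itertools import permutations

# Hypothetical table-level statistics (row counts); not taken from the ticket.
ROW_COUNTS = {"orders": 1_000_000, "customers": 10_000, "nations": 25}

# Hypothetical selectivities for the join predicates between table pairs.
# A pair without an entry has no join predicate, i.e. it is a cartesian product.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 10_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def plan_cost(order):
    """Sum of estimated intermediate row counts for a left-deep join order."""
    joined = {order[0]}
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        selectivity = 1.0
        for other in joined:
            selectivity *= SELECTIVITY.get(frozenset({table, other}), 1.0)
        rows = rows * ROW_COUNTS[table] * selectivity
        cost += rows
        joined.add(table)
    return cost


def best_join_order(tables):
    """Brute-force search over all left-deep orders (fine for a handful of tables)."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    order = best_join_order(list(ROW_COUNTS))
    print(order, plan_cost(order))
```

      With these made-up numbers, the cheapest orders join the two small tables first and bring in the large table last, while orders that start with a cartesian product are the most expensive.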

      The design should discuss how to break the work down into multiple smaller logical units. For example, changes to the statistics class, changes to the system catalog, cost estimation/propagation in expressions, and cost estimation/propagation in operators can each be done in decoupled pull requests.
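
      The units named above (a statistics structure, then estimation that propagates through expressions and operators) can be sketched as follows. This is a simplified illustration in plain Python, not Spark's implementation: the classes `ColumnStat` and `Statistics` and both estimator functions are hypothetical, and the formulas are the textbook ones that assume uniformly distributed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStat:
    """Hypothetical per-column statistics."""
    distinct_count: int
    null_count: int = 0


@dataclass(frozen=True)
class Statistics:
    """Hypothetical plan-level statistics: a row count plus per-column stats."""
    row_count: int
    column_stats: dict


def estimate_equality_filter(stats, column):
    """Estimate `column = <literal>`, assuming values are uniformly distributed."""
    col = stats.column_stats[column]
    non_null = stats.row_count - col.null_count
    rows = max(1, round(non_null / col.distinct_count)) if col.distinct_count else 0
    new_cols = dict(stats.column_stats)
    # After the filter the column holds a single non-null value.
    # Simplification: statistics of the other columns are left untouched.
    new_cols[column] = ColumnStat(distinct_count=1 if rows else 0, null_count=0)
    return Statistics(rows, new_cols)


def estimate_inner_join(left, right, left_key, right_key):
    """Estimate an equi-join as |L| * |R| / max(ndv(L.key), ndv(R.key))."""
    ndv = max(left.column_stats[left_key].distinct_count,
              right.column_stats[right_key].distinct_count)
    rows = (left.row_count * right.row_count) // ndv if ndv else 0
    return Statistics(rows, {**left.column_stats, **right.column_stats})


if __name__ == "__main__":
    orders = Statistics(1000, {"cust_id": ColumnStat(100)})
    customers = Statistics(100, {"id": ColumnStat(100), "country": ColumnStat(10)})
    filtered = estimate_equality_filter(customers, "country")
    joined = estimate_inner_join(orders, filtered, "cust_id", "id")
    print(filtered.row_count, joined.row_count)
```

      Because each estimator takes a `Statistics` value and returns a new one, support for further operators can be added independently, which is the kind of decoupling the description asks for.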

        Issue Links

          #. Summary | Type | Status | Assignee
          1. Define the data structure to hold the statistics for CBO | Sub-task | Closed | Unassigned
          2. generate table level stats:stats generation/storing/loading | Sub-task | Resolved | Zhenhua Wang
          3. generate basic stats for column | Sub-task | Resolved | Zhenhua Wang
          4. Simplify AnalyzeColumnCommand | Sub-task | Resolved | Reynold Xin
          5. Create explicit contract for column stats serialization | Sub-task | Resolved | Reynold Xin
          6. Decouple Statistics and CatalogTable | Sub-task | Resolved | Zhenhua Wang
          7. Cardinality Estimation of Predicate Expressions | Sub-task | Resolved | Ron Hu
          8. Cardinality estimation of join operator | Sub-task | Resolved | Zhenhua Wang
          9. Cardinality estimation of project operator | Sub-task | Resolved | Zhenhua Wang
          10. Cardinality estimation of aggregate operator | Sub-task | Resolved | Zhenhua Wang
          11. show estimated stats when doing explain | Sub-task | Resolved | Zhenhua Wang
          12. broadcast decision based on cbo | Sub-task | Closed | Unassigned
          13. build side decision based on cbo | Sub-task | Closed | Unassigned
          14. join reorder | Sub-task | Resolved | Zhenhua Wang
          15. Support DESC FORMATTED TABLE COLUMN command to show column-level statistics | Sub-task | Resolved | Zhenhua Wang
          16. Implement ALTER TABLE UPDATE STATISTICS SET | Sub-task | Closed | Unassigned
          17. Aggregation function for computing bins (distinct value, count) pairs for equi-width histograms | Sub-task | Closed | Unassigned
          18. SQL aggregate function for CountMinSketch | Sub-task | Resolved | Zhenhua Wang
          19. Simplify CountMinSketch aggregate implementation | Sub-task | Resolved | Reynold Xin
          20. NPE when collecting column stats for string/binary column having only null values | Sub-task | Resolved | Zhenhua Wang
          21. Add a cbo conf to switch between default statistics and cbo estimated statistics | Sub-task | Resolved | Zhenhua Wang
          22. Add test cases for row size estimation | Sub-task | Resolved | Zhenhua Wang
          23. Unify two sets of statistics in LogicalPlan | Sub-task | Resolved | Zhenhua Wang
          24. Change non-cbo estimation of aggregate | Sub-task | Resolved | Zhenhua Wang
          25. Cardinality estimation of Limit and Sample | Sub-task | Resolved | Zhenhua Wang
          26. cardinality estimation involving two columns of the same table | Sub-task | Resolved | Ron Hu
          27. Improve join reorder: Exclude cartesian product candidates to reduce the search space | Sub-task | Resolved | Zhenhua Wang
          28. Join reorder should keep the same order of final project attributes | Sub-task | Resolved | Zhenhua Wang
          29. Don't estimate IsNull or IsNotNull predicates for non-leaf node | Sub-task | Resolved | Zhenhua Wang
          30. BroadcastHint should use child's stats | Sub-task | Resolved | Zhenhua Wang
          31. Use Catalyst type for min/max in ColumnStat for ease of estimation | Sub-task | Resolved | Zhenhua Wang
          32. Fix recursive join reordering: inside joins are not reordered | Sub-task | Resolved | Zhenhua Wang
          33. StatisticsColumnSuite failures on big endian platforms | Sub-task | Resolved | Peter George Robbins
          34. Ndv for columns not in filter condition should also be updated | Sub-task | Resolved | Zhenhua Wang
          35. Clearly document the mechanism to choose between two sources of statistics | Sub-task | Resolved | Zhenhua Wang
          36. Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats | Sub-task | Resolved | Zhenhua Wang
          37. Store zero size and row count after analyzing empty table | Sub-task | Resolved | Zhenhua Wang
          38. Invalidate stats once table data is changed | Sub-task | Resolved | Zhenhua Wang
          39. Update statistics after data changing commands | Sub-task | Resolved | Zhenhua Wang
          40. Remove conf from stats functions since now we have conf in LogicalPlan | Sub-task | Resolved | Zhenhua Wang
          41. Improve statistics test suites | Sub-task | Resolved | Zhenhua Wang
          42. Estimation relation size based on numRows * rowSize | Sub-task | Resolved | Zhenhua Wang
          43. Relation stats should be consistent with other plans based on cbo config | Sub-task | Resolved | Zhenhua Wang
          44. Wrong Hive table statistics may trigger OOM if enables CBO | Sub-task | Resolved | Yuming Wang
          45. Simplify some estimation logic by using double instead of decimal | Sub-task | Resolved | Zhenhua Wang
          46. CBO rowcount statistics doesn't work for partitioned parquet external table | Sub-task | Open | Unassigned
          47. Should use old partition stats to decide whether to update stats when analyzing partition | Sub-task | Resolved | Zhenhua Wang

            People

              Assignee: Zhenhua Wang (ZenWzh)
              Reporter: Reynold Xin (rxin)
              Votes: 12
              Watchers: 97
