Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16026 Cost-based Optimizer Framework
  3. SPARK-22515

Estimation relation size based on numRows * rowSize

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.0
    • 2.3.0
    • SQL
    • None

    Description

      Currently, relation size is computed as the sum of file size, which is error-prone because storage format like parquet may have a much smaller file size compared to in-memory size. When we choose broadcast join based on file size, there's a risk of OOM. But if the number of rows is available in statistics, we can get a better estimation by `numRows * rowSize`, which helps to alleviate this problem.

      Attachments

        Activity

          People

            ZenWzh Zhenhua Wang
            ZenWzh Zhenhua Wang
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: