SPARK-23445: ColumnStat refactoring

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Refactor ColumnStat to be more flexible.

      • Split ColumnStat into ColumnStat and CatalogColumnStat, just as CatalogStatistics is split from Statistics. This decouples how the statistics are stored from how they are processed in the query plan. CatalogColumnStat keeps min and max as String, so it does not depend on dataType information.
      • For CatalogColumnStat, parse column names from property names in the metastore (the {{KEY_VERSION}} property), not from the metastore schema. This allows the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself is not in the metastore.
      • Make all fields optional. The min, max, and histogram fields were already optional. Making every field optional is more consistent and allows, for example, dropping fields through transformations when they are difficult or impossible to calculate.
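
      The split and the all-optional fields can be illustrated with a minimal sketch. This is not Spark's code: the two classes below are simplified Python stand-ins that borrow the names from this issue, and the field set, the `PARSERS` table, and `to_plan_stat` are assumptions made for illustration only. The point is that the stored form holds min and max as strings and needs no type information until it is converted for use in a plan.

      ```python
      from dataclasses import dataclass
      from typing import Optional


      @dataclass(frozen=True)
      class CatalogColumnStat:
          """Storage-side form: every field optional, min/max kept as strings."""
          distinct_count: Optional[int] = None
          min: Optional[str] = None
          max: Optional[str] = None
          null_count: Optional[int] = None


      @dataclass(frozen=True)
      class ColumnStat:
          """Plan-side form: min/max are typed values, still optional."""
          distinct_count: Optional[int] = None
          min: Optional[object] = None
          max: Optional[object] = None
          null_count: Optional[int] = None


      # Hypothetical type-name -> parser table (an assumption for this sketch).
      PARSERS = {"int": int, "double": float, "string": str}


      def to_plan_stat(stat: CatalogColumnStat, data_type: str) -> ColumnStat:
          """Convert the stored form to the typed form once the type is known."""
          parse = PARSERS[data_type]
          return ColumnStat(
              distinct_count=stat.distinct_count,
              min=None if stat.min is None else parse(stat.min),
              max=None if stat.max is None else parse(stat.max),
              null_count=stat.null_count,
          )


      # null_count was never collected, so it simply stays None.
      stored = CatalogColumnStat(distinct_count=3, min="1", max="42")
      typed = to_plan_stat(stored, "int")
      ```

      Because missing fields are carried through as `None`, a transformation can drop a statistic it cannot compute without invalidating the rest.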

      The added flexibility makes alternative implementations of stats possible and separates stats collection from stats and estimation processing in plans.
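
      The second point in the list above, recovering column names from property names rather than from a schema, could look roughly like the following. The key layout (`colStats.<column>.<field>`), the `PREFIX` value, and the field list are hypothetical and chosen for this sketch; only the idea of using the version key to discover columns comes from the description.

      ```python
      from typing import Dict, List

      PREFIX = "colStats."     # hypothetical key prefix, not Spark's actual one
      KEY_VERSION = "version"  # per-column field used to discover column names
      FIELDS = ("version", "distinctCount", "min", "max", "nullCount")  # assumed


      def column_names(props: Dict[str, str]) -> List[str]:
          """Find columns by looking for '<PREFIX><column>.version' keys."""
          suffix = "." + KEY_VERSION
          names = []
          for key in props:
              has_name = len(key) > len(PREFIX) + len(suffix)
              if has_name and key.startswith(PREFIX) and key.endswith(suffix):
                  names.append(key[len(PREFIX):-len(suffix)])
          return sorted(names)


      def read_column(props: Dict[str, str], column: str) -> Dict[str, str]:
          """Collect the known fields for one column; absent fields are omitted."""
          base = PREFIX + column + "."
          return {f: props[base + f] for f in FIELDS if base + f in props}


      props = {
          "colStats.id.version": "1",
          "colStats.id.min": "1",
          "colStats.id.max": "42",
          "colStats.user.name.version": "1",
          "colStats.user.name.distinctCount": "7",
          "numRows": "100",
      }

      names = column_names(props)
      id_stats = read_column(props, "id")
      ```

      Stripping a fixed prefix and a fixed suffix, instead of splitting on dots, keeps column names that themselves contain dots intact, and no table schema is consulted at any point.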

            People

            • Assignee: Juliusz Sompolski (juliuszsompolski)
            • Reporter: Juliusz Sompolski (juliuszsompolski)