Spark / SPARK-23445

ColumnStat refactoring

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Refactor ColumnStat to be more flexible.

      • Split ColumnStat into ColumnStat and CatalogColumnStat, just as CatalogStatistics is split from Statistics. This decouples how the statistics are stored from how they are processed in the query plan. CatalogColumnStat keeps min and max as String, so it does not depend on dataType information.
      • For CatalogColumnStat, parse column names from property names in the metastore (the {{KEY_VERSION}} property), not from the metastore schema. This allows the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself is not in the metastore.
      • Make all fields optional. min, max and histogram for columns were optional already. Making all fields optional is more consistent and gives flexibility, e.g. to drop some of the fields through transformations if they are difficult or impossible to calculate.
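
      The split described above can be illustrated with a small language-neutral sketch. It is written in Python for brevity and is not the Spark (Scala) implementation: the field names are simplified, the histogram field is left out, and the `PARSERS` table and `to_plan_stat` helper are hypothetical names introduced only for this example. It shows a storage-side record that keeps min and max as strings, a plan-side record with typed values, and a conversion that needs the data type only at the moment of conversion.

      ```python
      from dataclasses import dataclass
      from typing import Any, Callable, Dict, Optional


      @dataclass(frozen=True)
      class CatalogColumnStat:
          """Storage-side stats: min/max stay as strings, every field is optional."""
          distinct_count: Optional[int] = None
          min: Optional[str] = None
          max: Optional[str] = None
          null_count: Optional[int] = None
          avg_len: Optional[int] = None
          max_len: Optional[int] = None


      @dataclass(frozen=True)
      class ColumnStat:
          """Plan-side stats: min/max are typed values, every field is still optional."""
          distinct_count: Optional[int] = None
          min: Optional[Any] = None
          max: Optional[Any] = None
          null_count: Optional[int] = None
          avg_len: Optional[int] = None
          max_len: Optional[int] = None


      # Hypothetical parsers keyed by a data type name (assumption for this sketch).
      PARSERS: Dict[str, Callable[[str], Any]] = {
          "int": int,
          "double": float,
          "string": str,
      }


      def to_plan_stat(stat: CatalogColumnStat, data_type: str) -> ColumnStat:
          """Convert stored stats to plan stats; the data type is needed only here."""
          parse = PARSERS[data_type]
          return ColumnStat(
              distinct_count=stat.distinct_count,
              min=None if stat.min is None else parse(stat.min),
              max=None if stat.max is None else parse(stat.max),
              null_count=stat.null_count,
              avg_len=stat.avg_len,
              max_len=stat.max_len,
          )
      ```

      Because every field defaults to absent, a record such as `CatalogColumnStat(min="1", max="10")` is valid on its own, and fields that were never collected stay absent after conversion.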

      The added flexibility makes it possible to have alternative implementations for stats, and separates stats collection from stats processing and estimation in plans.
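
      Reading column names from property keys rather than from the schema, as the description proposes, could look roughly like the sketch below. This is Python for illustration only, not the Spark code. The key prefix `spark.sql.statistics.colStats.` and the value `"version"` for `KEY_VERSION` are assumptions about the property layout, and `column_names_with_stats` is a hypothetical helper name.

      ```python
      from typing import Dict, List

      # Assumed property layout: <prefix><column name>.<stat key>
      COL_STATS_PREFIX = "spark.sql.statistics.colStats."
      KEY_VERSION = "version"


      def column_names_with_stats(properties: Dict[str, str]) -> List[str]:
          """Return the names of columns that have a version property stored.

          The name is whatever sits between the prefix and the ".version" suffix,
          so column names containing dots are handled and no schema is consulted.
          """
          suffix = "." + KEY_VERSION
          names = []
          for key in properties:
              if key.startswith(COL_STATS_PREFIX) and key.endswith(suffix):
                  name = key[len(COL_STATS_PREFIX):-len(suffix)]
                  if name:  # skip a malformed key with no column name
                      names.append(name)
          return sorted(names)
      ```

      Matching on the version key alone means each column is discovered exactly once, regardless of how many other statistics are stored for it.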

          People

            Assignee: Juliusz Sompolski (juliuszsompolski)
            Reporter: Juliusz Sompolski (juliuszsompolski)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
