  1. IMPALA
  2. IMPALA-1570

DROP / COMPUTE incremental stats with dynamic partition specs

    Details

      Description

      COMPUTE INCREMENTAL STATS and its counterpart DROP INCREMENTAL STATS can both take PARTITION ... clauses to specify the precise partition to act on. If a small set of partitions needs updating (or if the user wants to batch incremental stats computations), it would be good to allow dynamic partition specs, and to drop or update only those partitions matched by the looser specification.

      For example, COMPUTE INCREMENTAL STATS tbl PARTITION(year=2009, month) would update all months in 2009 that are missing incremental stats.
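
A minimal sketch of how such a dynamic spec could be resolved, written in Python for illustration only. The `Partition` record, its `has_incremental_stats` flag, and `match_dynamic_spec` are hypothetical names, not Impala frontend code. A key given with a value must match exactly, while a key given as `None` (the bare `month` in the spec) matches any value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Partition:
    # Hypothetical stand-in for a table partition; not an Impala class.
    year: int
    month: int
    has_incremental_stats: bool


def match_dynamic_spec(
    partitions: List[Partition],
    spec: Dict[str, Optional[int]],
) -> List[Partition]:
    """Return partitions matched by spec that still lack incremental stats.

    A spec value of None marks the key as dynamic: it matches any value.
    """
    selected = []
    for part in partitions:
        matches = all(
            wanted is None or getattr(part, key) == wanted
            for key, wanted in spec.items()
        )
        if matches and not part.has_incremental_stats:
            selected.append(part)
    return selected


# Illustrative equivalent of PARTITION(year=2009, month).
partitions = [
    Partition(2009, 1, True),
    Partition(2009, 2, False),
    Partition(2009, 3, False),
    Partition(2010, 1, False),
]
to_update = match_dynamic_spec(partitions, {"year": 2009, "month": None})
print([(p.year, p.month) for p in to_update])
```

Under these assumptions, only the 2009 partitions without stats are selected; the 2009 partition that already has stats and the 2010 partition are skipped.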

      For maximum benefit, we could add a set of constraints:

      COMPUTE INCREMENTAL STATS tbl PARTITION(year, month) WHERE year > 2009 and month=10

      would update all the October partitions from 2010 onward.
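
The constrained form amounts to evaluating a predicate over each partition's key values and keeping the partitions for which it holds. The following Python sketch shows that idea only; `select_partitions` and the dictionary representation of partition keys are assumptions made for illustration, not part of Impala.

```python
from typing import Callable, Dict, List

PartitionKey = Dict[str, int]


def select_partitions(
    partition_keys: List[PartitionKey],
    predicate: Callable[[PartitionKey], bool],
) -> List[PartitionKey]:
    """Keep only the partitions whose key values satisfy the predicate."""
    return [key for key in partition_keys if predicate(key)]


# Hypothetical table with September and October partitions for 2009-2011.
keys = [
    {"year": year, "month": month}
    for year in (2009, 2010, 2011)
    for month in (9, 10)
]

# Illustrative equivalent of: WHERE year > 2009 and month=10
octobers = select_partitions(
    keys, lambda key: key["year"] > 2009 and key["month"] == 10
)
print(octobers)
```

Here the October partitions for 2010 and 2011 are kept, and every 2009 or September partition is dropped.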

      We already have logic to do partition pruning in the frontend; this is the reverse. COMPUTE INCREMENTAL STATS can already work on a subset of partitions, so the work here should be mostly in the frontend.

        Activity

        marcelk Marcel Kornacker added a comment -

        Regarding COMPUTE INCREMENTAL STATS tbl PARTITION(year=2009, month):
        why not simply specify that as COMPUTE INCREMENTAL STATS tbl PARTITION(year=2009)?

        Regarding COMPUTE INCREMENTAL STATS tbl PARTITION(year, month) WHERE year > 2009 and month=10:
        this will definitely require more work in the frontend, and given that ultimately we want to retire 'compute stats' altogether, I don't think it's worth making that investment.

        tomas79_impala_72f3 Tomas Farkas added a comment -

        Marcel, why will compute stats be retired?
        Regarding COMPUTE INCREMENTAL STATS where the upper partition level is constant and the lower level (subdirectory) is variable: it does not work.
        COMPUTE INCREMENTAL STATS tbl PARTITION(year=2009) throws an error:

        ERROR: AnalysisException: Items in partition spec must exactly match the partition columns in the table definition
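
The message implies that the keys in the partition spec must be exactly the table's partition columns, so a spec naming only `year` is rejected. A rough Python sketch of a check of that kind follows; `AnalysisError`, `check_partition_spec`, and the case-insensitive comparison are assumptions for illustration and do not reproduce Impala's actual analyzer.

```python
from typing import List


class AnalysisError(Exception):
    """Hypothetical stand-in for the AnalysisException in the message above."""


def check_partition_spec(spec_keys: List[str], partition_columns: List[str]) -> None:
    """Raise unless the spec names exactly the table's partition columns."""
    spec = sorted(key.lower() for key in spec_keys)
    columns = sorted(col.lower() for col in partition_columns)
    if spec != columns:
        raise AnalysisError(
            "Items in partition spec must exactly match the partition "
            "columns in the table definition"
        )


table_partition_columns = ["year", "month"]

# A fully specified spec passes the check.
check_partition_spec(["year", "month"], table_partition_columns)

# A spec that names only the upper level is rejected.
try:
    check_partition_spec(["year"], table_partition_columns)
except AnalysisError as err:
    print(f"ERROR: {err}")
```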

        marcelk Marcel Kornacker added a comment -

        Regarding retiring Compute Stats: eventually we want to do this computation asynchronously in the background, as new data appears, and the explicit Compute Stats command would go away.

        Regarding the exact syntax: we should adopt what we're doing for https://issues.cloudera.org/browse/IMPALA-1654 (Where predicate in the Partition clause)

        gatsbylee Gatsby Lee added a comment -

        @Marcel:

        If the computation runs asynchronously in the background, couldn't it give unexpected results?

        1. How can we know whether the computation has finished in production after ETL?
        2. How should we handle the overhead of the computation, since the process is CPU-intensive?

        alex.behm Alexander Behm added a comment -

        Amos Bird, assigning to you since you are already fixing this

        amosbird Amos Bird added a comment -

        Fixed by https://gerrit.cloudera.org/#/c/3942/

        jbapple Jim Apple added a comment -

        John Russell, I suspect this needs documentation.


          People

          • Assignee:
            amosbird Amos Bird
            Reporter:
            henryr Henry Robinson