Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3369

Add a variant of alter table to manually add column stats

    Details

      Description

      Current we have SQLs to add row stats for partitioned/non_partitioned tables. For ex:

      alter table analysis_data set tblproperties('numRows'='1001000000'); #non-partitioned
      alter table partitioned_data set tblproperties ('numRows'='1030000'); #partitioned
      

      We should have something similar for column stats too.

        Activity

        Hide
        alex.behm Alexander Behm added a comment -

        commit b5dcd8dcccbbd31ec5c8205fc7a929b969628cca
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Mon May 23 11:01:39 2016 -0700

        IMPALA-3369: Add ALTER TABLE SET COLUMN STATS statement.

        Adds a new command to manually set the table-level column stats.

        Syntax:
        ALTER TABLE [<db_name>.]<tbl_name> SET COLUMN STATS <col_name>
        ('statsKey'='val','statsKey2',='val2')

        Valid values for 'statsKey': numDVs, numNulls, avgSize, maxSize

        The 'val' portion needs to be a number appropriate for the given stats
        key (e.g., a long for numDVs, a float for avgSize).

        The special value of '-1' is allowed to reset stats to 'unknown'.

        The keys as well as the values are specified as string literals to be
        consistent with the existing DDL for setting TBLPROPERTIES/SERDEPROPERTIES,
        in particular, setting the 'numRows' table/partition property.

        Testing: Ran the tests locally on exhaustive. Did private runs
        on core/hdfs and core/S3.

        Change-Id: I45cd8aa7241ea962788ba9ca7d0bbfd864c4304f
        Reviewed-on: http://gerrit.cloudera.org:8080/3189
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        alex.behm Alexander Behm added a comment - commit b5dcd8dcccbbd31ec5c8205fc7a929b969628cca Author: Alex Behm <alex.behm@cloudera.com> Date: Mon May 23 11:01:39 2016 -0700 IMPALA-3369 : Add ALTER TABLE SET COLUMN STATS statement. Adds a new command to manually set the table-level column stats. Syntax: ALTER TABLE [<db_name>.] <tbl_name> SET COLUMN STATS <col_name> ('statsKey'='val','statsKey2',='val2') Valid values for 'statsKey': numDVs, numNulls, avgSize, maxSize The 'val' portion needs to be a number appropriate for the given stats key (e.g., a long for numDVs, a float for avgSize). The special value of '-1' is allowed to reset stats to 'unknown'. The keys as well as the values are specified as string literals to be consistent with the existing DDL for setting TBLPROPERTIES/SERDEPROPERTIES, in particular, setting the 'numRows' table/partition property. Testing: Ran the tests locally on exhaustive. Did private runs on core/hdfs and core/S3. Change-Id: I45cd8aa7241ea962788ba9ca7d0bbfd864c4304f Reviewed-on: http://gerrit.cloudera.org:8080/3189 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins
        Hide
        morhidi Matyas Orhidi added a comment -

        Would it be possible to display/set the column level statistics per partition? What is the recommended way to compute partition level statistics and update these properties?

        Show
        morhidi Matyas Orhidi added a comment - Would it be possible to display/set the column level statistics per partition? What is the recommended way to compute partition level statistics and update these properties?
        Hide
        bharathv bharath v added a comment -

        With compute stats, Impala computes the column stats aggregated at the table level, even for partitioned tables. So as such, you can't manually add per partition column stats. However there is something called "compute incremental stats" that you might want to checkout. Its intended for partitioned tables and only computes stats for the newly added partitions, which I believe is the recommended way. However please keep in mind that the "incremental stats" has some scalability issues and may not work with very large no. of partitions

        Show
        bharathv bharath v added a comment - With compute stats, Impala computes the column stats aggregated at the table level, even for partitioned tables. So as such, you can't manually add per partition column stats. However there is something called "compute incremental stats" that you might want to checkout. Its intended for partitioned tables and only computes stats for the newly added partitions, which I believe is the recommended way. However please keep in mind that the "incremental stats" has some scalability issues and may not work with very large no. of partitions
        Hide
        jrussell John Russell added a comment -

        This information is now added under ALTER TABLE, Performance :: Manually Setting Stats, and as a new features bullet. Blanked out "doc text" field so this issue doesn't show up on my to-do list.

        Show
        jrussell John Russell added a comment - This information is now added under ALTER TABLE, Performance :: Manually Setting Stats, and as a new features bullet. Blanked out "doc text" field so this issue doesn't show up on my to-do list.
        Hide
        HuaisiXu Huaisi Xu added a comment -

        I am not sure how this is more useful than adding something like "compute stats column xyz" since user has to "compute" those stats herself? not sure how to do that painlessly.

        Show
        HuaisiXu Huaisi Xu added a comment - I am not sure how this is more useful than adding something like "compute stats column xyz" since user has to "compute" those stats herself? not sure how to do that painlessly.
        Hide
        bharathv bharath v added a comment -

        IMO its good to have both. Sometimes uses already know the characteristics/distribution of data they are ingesting and would like to directly update the stats in which case, this variant helps. The one you are talking about is tracked in IMPALA-3562.

        Show
        bharathv bharath v added a comment - IMO its good to have both. Sometimes uses already know the characteristics/distribution of data they are ingesting and would like to directly update the stats in which case, this variant helps. The one you are talking about is tracked in IMPALA-3562 .

          People

          • Assignee:
            alex.behm Alexander Behm
            Reporter:
            bharathv bharath v
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development