Hive
  1. Hive
  2. HIVE-2471

Add timestamp column to the partition stats table.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Occasionally, when entries are added to the partition stats table the program is halted before it can delete those entries, by an exception, keyboard interrupt, etc. These build up to the point where the table gets very large, and it hurts the performance of the update statement which is often called. In order to fix this, I am adding a column to the table which is auto-populated with the current timestamp. This will allow us to create scripts that go through periodically and clean out old entries from the table.

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/)
        HIVE-2471 Add timestamp column to the partition stats table.
        (Kevin Wilfong via namit) (Revision 1302739)

        Result = ABORTED
        namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302739
        Files :

        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Show
        Hudson added a comment - Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/ ) HIVE-2471 Add timestamp column to the partition stats table. (Kevin Wilfong via namit) (Revision 1302739) Result = ABORTED namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302739 Files : /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Hide
        Ashutosh Chauhan added a comment -

        This issue is closed now. It was released with the fix in 0.9.0. If there is a problem, please open a new jira and link this one with that.

        Show
        Ashutosh Chauhan added a comment - This issue is closed now. It was released with the fix in 0.9.0. If there is a problem, please open a new jira and link this one with that.
        Hide
        Hudson added a comment -

        Integrated in Hive-trunk-h0.21 #1322 (See https://builds.apache.org/job/Hive-trunk-h0.21/1322/)
        HIVE-2471 Add timestamp column to the partition stats table.
        (Kevin Wilfong via namit) (Revision 1302739)

        Result = FAILURE
        namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302739
        Files :

        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Show
        Hudson added a comment - Integrated in Hive-trunk-h0.21 #1322 (See https://builds.apache.org/job/Hive-trunk-h0.21/1322/ ) HIVE-2471 Add timestamp column to the partition stats table. (Kevin Wilfong via namit) (Revision 1302739) Result = FAILURE namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302739 Files : /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Hide
        Namit Jain added a comment -

        Committed. Thanks Kevin

        Show
        Namit Jain added a comment - Committed. Thanks Kevin
        Hide
        Phabricator added a comment -

        njain has accepted the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".

        Running tests

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        BRANCH
        svn

        Show
        Phabricator added a comment - njain has accepted the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". Running tests REVISION DETAIL https://reviews.facebook.net/D2367 BRANCH svn
        Hide
        Phabricator added a comment -

        njain has commented on the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".

        INLINE COMMENTS
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 Let us discuss offline - I am not sure I understood

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        Show
        Phabricator added a comment - njain has commented on the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 Let us discuss offline - I am not sure I understood REVISION DETAIL https://reviews.facebook.net/D2367
        Hide
        Phabricator added a comment -

        kevinwilfong updated the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".
        Reviewers: JIRA, njain

        Added a big comment saying it is up to the Hive administrator to drop old partition stats tables.

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        AFFECTED FILES
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java

        Show
        Phabricator added a comment - kevinwilfong updated the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". Reviewers: JIRA, njain Added a big comment saying it is up to the Hive administrator to drop old partition stats tables. REVISION DETAIL https://reviews.facebook.net/D2367 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Hide
        Phabricator added a comment -

        kevinwilfong has commented on the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".

        INLINE COMMENTS
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java:26 Will do
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 This is invoked by the StatsPublisher, it is used for the case where a row was not deleted by a previous StatsPublisher, otherwise there is a conflict between the primary keys.

        The StatsAggregator only invokes SELECT and DELETE statements. The aggregated stats are added to the metastore via a call to the metastore's alter_table method.

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        Show
        Phabricator added a comment - kevinwilfong has commented on the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java:26 Will do ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 This is invoked by the StatsPublisher, it is used for the case where a row was not deleted by a previous StatsPublisher, otherwise there is a conflict between the primary keys. The StatsAggregator only invokes SELECT and DELETE statements. The aggregated stats are added to the metastore via a call to the metastore's alter_table method. REVISION DETAIL https://reviews.facebook.net/D2367
        Hide
        Phabricator added a comment -

        njain has commented on the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".

        INLINE COMMENTS
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java:26 Write a big comment here that it is the users responsibility
        to delete the old table
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 I am not sure this will work -
        I am assuming this is invoked by StatsAggregator, but the data is
        inserted by StatsPublisher.

        The timestamp will be different in the 2 places

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        Show
        Phabricator added a comment - njain has commented on the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java:26 Write a big comment here that it is the users responsibility to delete the old table ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java:128 I am not sure this will work - I am assuming this is invoked by StatsAggregator, but the data is inserted by StatsPublisher. The timestamp will be different in the 2 places REVISION DETAIL https://reviews.facebook.net/D2367
        Hide
        Phabricator added a comment -

        kevinwilfong updated the revision "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".
        Reviewers: JIRA, njain

        Changed the name of the stats table so that this update will apply automatically and immediately, otherwise the update command will fail on old schemas. Also introduced versioning to the name while I think is better than the old method of coming up with a new combination of (PARTITION, PART) and (STATISTICS, STATS).

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        AFFECTED FILES
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java

        Show
        Phabricator added a comment - kevinwilfong updated the revision " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". Reviewers: JIRA, njain Changed the name of the stats table so that this update will apply automatically and immediately, otherwise the update command will fail on old schemas. Also introduced versioning to the name while I think is better than the old method of coming up with a new combination of (PARTITION, PART) and (STATISTICS, STATS). REVISION DETAIL https://reviews.facebook.net/D2367 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
        Hide
        Phabricator added a comment -

        kevinwilfong requested code review of "HIVE-2471 [jira] Add timestamp column with index to the partition stats table.".
        Reviewers: JIRA

        https://issues.apache.org/jira/browse/HIVE-2471

        Added a timestamp column to the stats table. It defaults to the current timestamp on inserts. I also updated the update query to update the timestamp, since derby does not support the on update option as far as I can tell.

        I modified the insert query to specify the columns it's inserting to. This is not only necessary to prevent the query from inserting into the timestamp column, it is safer in general.

        Occasionally, when entries are added to the partition stats table the program is halted before it can delete those entries, by an exception, keyboard interrupt, etc. These build up to the point where the table gets very large, and it hurts the performance of the update statement which is often called. In order to fix this, I am adding a column to the table which is auto-populated with the current timestamp. I am also adding an index on this column. This will allow us to create scripts that go through periodically and clean out old entries from the table. The index will help to keep the runtime of these scripts short, and hence reduce the amount of time they need to lock the table/indexes for.

        TEST PLAN
        EMPTY

        REVISION DETAIL
        https://reviews.facebook.net/D2367

        AFFECTED FILES
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
        ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java

        MANAGE HERALD DIFFERENTIAL RULES
        https://reviews.facebook.net/herald/view/differential/

        WHY DID I GET THIS EMAIL?
        https://reviews.facebook.net/herald/transcript/5253/

        Tip: use the X-Herald-Rules header to filter Herald messages in your client.

        Show
        Phabricator added a comment - kevinwilfong requested code review of " HIVE-2471 [jira] Add timestamp column with index to the partition stats table.". Reviewers: JIRA https://issues.apache.org/jira/browse/HIVE-2471 Added a timestamp column to the stats table. It defaults to the current timestamp on inserts. I also updated the update query to update the timestamp, since derby does not support the on update option as far as I can tell. I modified the insert query to specify the columns it's inserting to. This is not only necessary to prevent the query from inserting into the timestamp column, it is safer in general. Occasionally, when entries are added to the partition stats table the program is halted before it can delete those entries, by an exception, keyboard interrupt, etc. These build up to the point where the table gets very large, and it hurts the performance of the update statement which is often called. In order to fix this, I am adding a column to the table which is auto-populated with the current timestamp. I am also adding an index on this column. This will allow us to create scripts that go through periodically and clean out old entries from the table. The index will help to keep the runtime of these scripts short, and hence reduce the amount of time they need to lock the table/indexes for. TEST PLAN EMPTY REVISION DETAIL https://reviews.facebook.net/D2367 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java MANAGE HERALD DIFFERENTIAL RULES https://reviews.facebook.net/herald/view/differential/ WHY DID I GET THIS EMAIL? https://reviews.facebook.net/herald/transcript/5253/ Tip: use the X-Herald-Rules header to filter Herald messages in your client.
        Hide
        Kevin Wilfong added a comment -

        Given the issues we have seen in the past with indexes on this table, I have decided not to add an index on the timestamp column.

        Show
        Kevin Wilfong added a comment - Given the issues we have seen in the past with indexes on this table, I have decided not to add an index on the timestamp column.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2079/
        -----------------------------------------------------------

        Review request for hive, Yongqiang He and Ning Zhang.

        Summary
        -------

        I added a timestamp column ts to the partition statistics table which defaults to the current_timestamp. I also added code to create an index on that column, and verify that index exists when we check if the table exists.

        I also took the opportunity to fix another problem. Every time we change the schema of the partition statistics table we give it a slightly different name, like PARTITION_STATS, PARITION_STATISTICS, PARTITION_STAT_TBL, etc. Instead, I want to put a number at the end of the table name, here I have PARTITION_STATS_V2, instead of trying to come up on a new variation of name, we can just increment the final number, this will also make it easy to identify old tables which can be dropped.

        Checking whether the index exists may not be worth the time it takes. We have to check this every time we init JDBCStatsPublisher, unless the table doesn't exist, and if it doesn't exist, it's not the end of the world, it just means any scripts which try to use the index will be slower, and the index can always be added later. Also, the chance the program creates the table, but is interrupted before it can create the index is low. I added the check because I thought the chance of having to try and find the reason why Hive slowed down, and having to find that a clean up script is running slow, and hence holding the locks for a long time, sounded painful, and hence the check would be worth it, but I am open to debate.

        This addresses bug HIVE-2471.
        https://issues.apache.org/jira/browse/HIVE-2471

        Diffs


        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1175957
        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1175957
        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1175957

        Diff: https://reviews.apache.org/r/2079/diff

        Testing
        -------

        I ran TestStatsPublisherEnhanced using both derby and MySQL, and verified all the tests succeeded.

        I also ran a few queries and verified that the table and index were created and that the rows, including timestamp, appeared in the table.

        Thanks,

        Kevin

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2079/ ----------------------------------------------------------- Review request for hive, Yongqiang He and Ning Zhang. Summary ------- I added a timestamp column ts to the partition statistics table which defaults to the current_timestamp. I also added code to create an index on that column, and verify that index exists when we check if the table exists. I also took the opportunity to fix another problem. Every time we change the schema of the partition statistics table we give it a slightly different name, like PARTITION_STATS, PARITION_STATISTICS, PARTITION_STAT_TBL, etc. Instead, I want to put a number at the end of the table name, here I have PARTITION_STATS_V2, instead of trying to come up on a new variation of name, we can just increment the final number, this will also make it easy to identify old tables which can be dropped. Checking whether the index exists may not be worth the time it takes. We have to check this every time we init JDBCStatsPublisher, unless the table doesn't exist, and if it doesn't exist, it's not the end of the world, it just means any scripts which try to use the index will be slower, and the index can always be added later. Also, the chance the program creates the table, but is interrupted before it can create the index is low. I added the check because I thought the chance of having to try and find the reason why Hive slowed down, and having to find that a clean up script is running slow, and hence holding the locks for a long time, sounded painful, and hence the check would be worth it, but I am open to debate. This addresses bug HIVE-2471 . https://issues.apache.org/jira/browse/HIVE-2471 Diffs trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1175957 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1175957 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1175957 Diff: https://reviews.apache.org/r/2079/diff Testing ------- I ran TestStatsPublisherEnhanced using both derby and MySQL, and verified all the tests succeeded. I also ran a few queries and verified that the table and index were created and that the rows, including timestamp, appeared in the table. Thanks, Kevin

          People

          • Assignee:
            Kevin Wilfong
            Reporter:
            Kevin Wilfong
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development