Hive
  1. Hive
  2. HIVE-2144

reduce workload generated by JDBCStatsPublisher

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query.

      1. HIVE-2144.patch
        15 kB
        Tomasz Nykiel
      2. HIVE-2144.1.patch
        21 kB
        Tomasz Nykiel
      3. HIVE-2144.2.patch
        21 kB
        Tomasz Nykiel

        Activity

        Hide
        Tomasz Nykiel added a comment -

        IMPORTANT NOTE!

        Before deployment, the primary key constraint needs to be added manually on the ID column of PARTITION_STAT_TBL, if the table already exists.
        Otherwise, the statistics might be duplicated for some entries, and the aggregated statistics will be silently incorrect.

        If the table does not exist, it will be created in the proper format.

        Show
        Tomasz Nykiel added a comment - IMPORTANT NOTE! Before deployment, the primary key constraint needs to be added manually on the ID column of PARTITION_STAT_TBL, if the table already exists. Otherwise, the statistics might be duplicated for some entries, and the aggregated statistics will be silently incorrect. If the table does not exist, it will be created in the proper format.
        Hide
        Hudson added a comment -

        Integrated in Hive-trunk-h0.21 #748 (See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/748/)
        HIVE-2144. reduce workload generated by JDBCStatsPublisher (Tomasz Nykiel via Ning Zhang)

        nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127229
        Files :

        • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        Show
        Hudson added a comment - Integrated in Hive-trunk-h0.21 #748 (See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/748/ ) HIVE-2144 . reduce workload generated by JDBCStatsPublisher (Tomasz Nykiel via Ning Zhang) nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127229 Files : /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        Hide
        Tomasz Nykiel added a comment -

        Thanks

        Show
        Tomasz Nykiel added a comment - Thanks
        Hide
        Ning Zhang added a comment -

        Committed. Thanks Tom!

        Congrats on your 1st commit!

        Show
        Ning Zhang added a comment - Committed. Thanks Tom! Congrats on your 1st commit!
        Hide
        Ning Zhang added a comment -

        Actually the failure was due to port confliction on my test machine. Tested on another box and it passed.

        Show
        Ning Zhang added a comment - Actually the failure was due to port confliction on my test machine. Tested on another box and it passed.
        Hide
        Ning Zhang added a comment -

        Tom, TestMetaStoreAuthorization failed with 1 Error. Can you take a look?

        Show
        Ning Zhang added a comment - Tom, TestMetaStoreAuthorization failed with 1 Error. Can you take a look?
        Hide
        Ning Zhang added a comment -

        +1. Will commit if tests pass.

        Show
        Ning Zhang added a comment - +1. Will commit if tests pass.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/765/#review709
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1417>

        can you add a comment on what situation this exception will be thrown? Just for the sake of reader that didn't notice there is a primary key constraint in the DDL.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1418>

        remove this?

        • Ning

        On 2011-05-21 01:49:07, Tomasz Nykiel wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/765/

        -----------------------------------------------------------

        (Updated 2011-05-21 01:49:07)

        Review request for hive.

        Summary

        -------

        Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.

        The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

        HIVE-2144 improves on performance by greedily performing the insertion statement.

        Then instead of executing two queries per row inserted, we can execute one INSERT query.

        In the case primary key constraint violation, we perform a single UPDATE query.

        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.

        This addresses bug HIVE-2144.

        https://issues.apache.org/jira/browse/HIVE-2144

        Diffs

        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

        Diff: https://reviews.apache.org/r/765/diff

        Testing

        -------

        TestStatsPublisher JUnit test:

        - basic behaviour

        - multiple updates

        - cleanup of the statistics table after aggregation

        Standalone testing on the cluster.

        - insert/analyze queries over non-partitioned/partitioned tables

        NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.

        Thanks,

        Tomasz

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review709 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java < https://reviews.apache.org/r/765/#comment1417 > can you add a comment on what situation this exception will be thrown? Just for the sake of reader that didn't notice there is a primary key constraint in the DDL. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java < https://reviews.apache.org/r/765/#comment1418 > remove this? Ning On 2011-05-21 01:49:07, Tomasz Nykiel wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ----------------------------------------------------------- (Updated 2011-05-21 01:49:07) Review request for hive. Summary ------- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then instead of executing two queries per row inserted, we can execute one INSERT query. In the case primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table. This addresses bug HIVE-2144 . https://issues.apache.org/jira/browse/HIVE-2144 Diffs ----- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing ------- TestStatsPublisher JUnit test: - basic behaviour - multiple updates - cleanup of the statistics table after aggregation Standalone testing on the cluster. - insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz
        Hide
        Tomasz Nykiel added a comment -

        Fixed after revision 1.

        Show
        Tomasz Nykiel added a comment - Fixed after revision 1.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/765/
        -----------------------------------------------------------

        (Updated 2011-05-21 01:49:07.819494)

        Review request for hive.

        Changes
        -------

        -Amended the test cases to accommodate prefix aggregation
        -Fixed unnecessary conf settings
        -Fixed exception handling in JDBCStatsPublisher.publishStats -> SQLRecoverableException is handled when executing the update statement.

        Summary
        -------

        Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
        The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

        HIVE-2144 improves on performance by greedily performing the insertion statement.
        Then instead of executing two queries per row inserted, we can execute one INSERT query.
        In the case primary key constraint violation, we perform a single UPDATE query.
        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.

        This addresses bug HIVE-2144.
        https://issues.apache.org/jira/browse/HIVE-2144

        Diffs (updated)


        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468
        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

        Diff: https://reviews.apache.org/r/765/diff

        Testing
        -------

        TestStatsPublisher JUnit test:

        • basic behaviour
        • multiple updates
        • cleanup of the statistics table after aggregation

        Standalone testing on the cluster.

        • insert/analyze queries over non-partitioned/partitioned tables

        NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.

        Thanks,

        Tomasz

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ----------------------------------------------------------- (Updated 2011-05-21 01:49:07.819494) Review request for hive. Changes ------- -Amended the test cases to accommodate prefix aggregation -Fixed unnecessary conf settings -Fixed exception handling in JDBCStatsPublisher.publishStats -> SQLRecoverableException is handled when executing the update statement. Summary ------- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then instead of executing two queries per row inserted, we can execute one INSERT query. In the case primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table. This addresses bug HIVE-2144 . https://issues.apache.org/jira/browse/HIVE-2144 Diffs (updated) trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing ------- TestStatsPublisher JUnit test: basic behaviour multiple updates cleanup of the statistics table after aggregation Standalone testing on the cluster. insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/765/#review702
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1402>

        Yes. That's correct.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1403>

        ok.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1404>

        I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over exact match.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1405>

        The original value inserted in line 120 is 200. Neither 100, nor 150 should change the values.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1406>

        As disscussed before, I will improve the test cases to aggregate over prefixes.

        • Tomasz

        On 2011-05-19 23:14:26, Tomasz Nykiel wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/765/

        -----------------------------------------------------------

        (Updated 2011-05-19 23:14:26)

        Review request for hive.

        Summary

        -------

        Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.

        The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

        HIVE-2144 improves on performance by greedily performing the insertion statement.

        Then instead of executing two queries per row inserted, we can execute one INSERT query.

        In the case primary key constraint violation, we perform a single UPDATE query.

        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.

        This addresses bug HIVE-2144.

        https://issues.apache.org/jira/browse/HIVE-2144

        Diffs

        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

        Diff: https://reviews.apache.org/r/765/diff

        Testing

        -------

        TestStatsPublisher JUnit test:

        - basic behaviour

        - multiple updates

        - cleanup of the statistics table after aggregation

        Standalone testing on the cluster.

        - insert/analyze queries over non-partitioned/partitioned tables

        NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.

        Thanks,

        Tomasz

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review702 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java < https://reviews.apache.org/r/765/#comment1402 > Yes. That's correct. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1403 > ok. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1404 > I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over exact match. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1405 > The original value inserted in line 120 is 200. Neither 100, nor 150 should change the values. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1406 > As disscussed before, I will improve the test cases to aggregate over prefixes. Tomasz On 2011-05-19 23:14:26, Tomasz Nykiel wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ----------------------------------------------------------- (Updated 2011-05-19 23:14:26) Review request for hive. Summary ------- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then instead of executing two queries per row inserted, we can execute one INSERT query. In the case primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table. This addresses bug HIVE-2144 . https://issues.apache.org/jira/browse/HIVE-2144 Diffs ----- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing ------- TestStatsPublisher JUnit test: - basic behaviour - multiple updates - cleanup of the statistics table after aggregation Standalone testing on the cluster. - insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/765/#review696
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1395>

        Here we need to catch SQLRecoverableException and retry.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1396>

        If these parameters present in conf/hive-default.xml, you don't need to set them again here since new JobConf() should read from hive-default.xml.

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1397>

        the usual use case for aggregateStats() is that the key should be the prefix (e.g., file_000) of the string inserted by publishStats, so that all keys that match the prefix will be aggregated.

        Can you add one more test for aggregateStats('file_000')?

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1398>

        won't this also change the stats at the 2nd publishStat()?

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
        <https://reviews.apache.org/r/765/#comment1399>

        also add another aggStats for prefix.

        • Ning

        On 2011-05-19 23:14:26, Tomasz Nykiel wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/765/

        -----------------------------------------------------------

        (Updated 2011-05-19 23:14:26)

        Review request for hive.

        Summary

        -------

        Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.

        The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

        HIVE-2144 improves on performance by greedily performing the insertion statement.

        Then instead of executing two queries per row inserted, we can execute one INSERT query.

        In the case primary key constraint violation, we perform a single UPDATE query.

        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.

        This addresses bug HIVE-2144.

        https://issues.apache.org/jira/browse/HIVE-2144

        Diffs

        -----

        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140

        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

        Diff: https://reviews.apache.org/r/765/diff

        Testing

        -------

        TestStatsPublisher JUnit test:

        - basic behaviour

        - multiple updates

        - cleanup of the statistics table after aggregation

        Standalone testing on the cluster.

        - insert/analyze queries over non-partitioned/partitioned tables

        NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.

        Thanks,

        Tomasz

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review696 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java < https://reviews.apache.org/r/765/#comment1395 > Here we need to catch SQLRecoverableException and retry. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1396 > If these parameters present in conf/hive-default.xml, you don't need to set them again here since new JobConf() should read from hive-default.xml. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1397 > the usual use case for aggregateStats() is that the key should be the prefix (e.g., file_000) of the string inserted by publishStats, so that all keys that match the prefix will be aggregated. Can you add one more test for aggregateStats('file_000')? trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1398 > won't this also change the stats at the 2nd publishStat()? trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java < https://reviews.apache.org/r/765/#comment1399 > also add another aggStats for prefix. Ning On 2011-05-19 23:14:26, Tomasz Nykiel wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ----------------------------------------------------------- (Updated 2011-05-19 23:14:26) Review request for hive. Summary ------- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then instead of executing two queries per row inserted, we can execute one INSERT query. In the case primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table. This addresses bug HIVE-2144 . https://issues.apache.org/jira/browse/HIVE-2144 Diffs ----- trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing ------- TestStatsPublisher JUnit test: - basic behaviour - multiple updates - cleanup of the statistics table after aggregation Standalone testing on the cluster. - insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/765/
        -----------------------------------------------------------

        Review request for hive.

        Summary
        -------

        Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
        The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

        HIVE-2144 improves on performance by greedily performing the insertion statement.
        Then instead of executing two queries per row inserted, we can execute one INSERT query.
        In the case primary key constraint violation, we perform a single UPDATE query.
        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.

        This addresses bug HIVE-2144.
        https://issues.apache.org/jira/browse/HIVE-2144

        Diffs


        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140
        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

        Diff: https://reviews.apache.org/r/765/diff

        Testing
        -------

        TestStatsPublisher JUnit test:

        • basic behaviour
        • multiple updates
        • cleanup of the statistics table after aggregation

        Standalone testing on the cluster.

        • insert/analyze queries over non-partitioned/partitioned tables

        NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.

        Thanks,

        Tomasz

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ ----------------------------------------------------------- Review request for hive. Summary ------- Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row. The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks. Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). HIVE-2144 improves on performance by greedily performing the insertion statement. Then instead of executing two queries per row inserted, we can execute one INSERT query. In the case primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table. This addresses bug HIVE-2144 . https://issues.apache.org/jira/browse/HIVE-2144 Diffs trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION Diff: https://reviews.apache.org/r/765/diff Testing ------- TestStatsPublisher JUnit test: basic behaviour multiple updates cleanup of the statistics table after aggregation Standalone testing on the cluster. insert/analyze queries over non-partitioned/partitioned tables NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared. Thanks, Tomasz
        Hide
        Tomasz Nykiel added a comment -

        Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it does not. So in general, PK constraint will be more suitable.

        CREATE TABLE PARTITION_STAT_TBL ( IDE VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT ) works for both Derby and MySql.
        After a quick check it seems that it's supported by Oracle/MSSQL as well.

        Show
        Tomasz Nykiel added a comment - Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it does not. So in general, PK constraint will be more suitable. CREATE TABLE PARTITION_STAT_TBL ( IDE VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT ) works for both Derby and MySql. After a quick check it seems that it's supported by Oracle/MSSQL as well.
        Hide
        Ning Zhang added a comment -

        Great! I like the idea.

        One comment about the primary key constraint: I'm not sure if UNIQUE is the standard way to specify primary key constraint. There are people using Oralce/MS SQL sever/Postgres as metastore, we should use a standard way. I think 'id varchar(255) PRIMARY KEY' is more widely supported. Can you double check with mysql and derby?

        Show
        Ning Zhang added a comment - Great! I like the idea. One comment about the primary key constraint: I'm not sure if UNIQUE is the standard way to specify primary key constraint. There are people using Oralce/MS SQL sever/Postgres as metastore, we should use a standard way. I think 'id varchar(255) PRIMARY KEY' is more widely supported. Can you double check with mysql and derby?
        Hide
        Tomasz Nykiel added a comment -

        Currently the schema of the stat table is the following:

        PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

        We can amend it to:

        PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE , ROW_COUNT BIGINT ).

        Then instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently.
        In the case when the integrity constraint is violated, via the unique index, which can be caught by an exception, we perform a single UPDATE query.
        The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table:

        UPDATE PARTITION_STAT_TBL SET ROW_COUNT = new_value
        WHERE ID = "rowID" AND
        (0) new_value >
        (1) (SELECT TEMP.ROW_COUNT FROM
        (2) (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = "rowID") TEMP )

        --(0) is a condition that checks if the newly inserted value is greater that the one we already have.
        --(1) and (2) is a work-around for MySQL, which does not allow to refer to the table that occurs in the update statement. Here, we basically materialize the value that we need for comparison.
        --(1) should theoretically have (LIMIT 1) to choose exactly one tuple, however Derby does not support it, and by the unique constraint, and the fact that the insert failed, there exists exactly one tuple matching the ID predicate.

        To summarize, for non existing rows, only one insert query will be executed, instead of two.
        For existing rows, which seems to occur very infrequently, two queries instead of three will be executed.

        Show
        Tomasz Nykiel added a comment - Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared. We can amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE , ROW_COUNT BIGINT ). Then instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently. In the case when the integrity constraint is violated, via the unique index, which can be caught by an exception, we perform a single UPDATE query. The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table: UPDATE PARTITION_STAT_TBL SET ROW_COUNT = new_value WHERE ID = "rowID" AND (0) new_value > (1) (SELECT TEMP.ROW_COUNT FROM (2) (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = "rowID") TEMP ) --(0) is a condition that checks if the newly inserted value is greater that the one we already have. --(1) and (2) is a work-around for MySQL, which does not allow to refer to the table that occurs in the update statement. Here, we basically materialize the value that we need for comparison. --(1) should theoretically have (LIMIT 1) to choose exactly one tuple, however Derby does not support it, and by the unique constraint, and the fact that the insert failed, there exists exactly one tuple matching the ID predicate. To summarize, for non existing rows, only one insert query will be executed, instead of two. For existing rows, which seems to occur very infrequently, two queries instead of three will be executed.

          People

          • Assignee:
            Tomasz Nykiel
            Reporter:
            Ning Zhang
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development