Hive
  1. Hive
  2. HIVE-2430

Performance degradation in stats DB after JIRA HIVE-2144 (https://issues.apache.org/jira/browse/HIVE-2144)

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      After JIRA HIVE-2144 (https://issues.apache.org/jira/browse/HIVE-2144), the performance in stats DB degraded significantly due to MySQL's inefficient index maintenance.

      We should remove the primary index introduced in that JIRA and resolve duplicates in the stats aggregation
      phase.

        Activity

        Kevin Wilfong created issue -
        Kevin Wilfong made changes -
        Field Original Value New Value
        Summary Performance degradation after After JIRA HIVE-2144 (https://issues.apache.org/jira/browse/HIVE-2144) Performance degradation in stats DB after JIRA HIVE-2144 (https://issues.apache.org/jira/browse/HIVE-2144)
        Kevin Wilfong made changes -
        Attachment HIVE-2430.1.patch.txt [ 12493355 ]
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1737/
        -----------------------------------------------------------

        Review request for hive and Ning Zhang.

        Summary
        -------

        I considered two different strategies for handling duplicates after removing the primary key from the table.

        1) Go back to performing a select, and then updating if a row for the file exists, or inserting a new record otherwise.

        2) Always insert records and then during aggregation get the max value for each statistic with a group by on the file name, and then aggregate those statistics.

        This diff contains the code for option 2. I determined this to be the better option by adding a couple stress tests to TestStatsPublisherEnhanced, and then comparing the run times for the two implementations using derby and MySQL. The two tests checked the performance when inserting a couple hundred rows for each of two files, and inserting several hundred rows, each for a different file. In each case, when i ran the tests on my machine there wasn't much difference for derby, but for MySQL I was seeing both tests run about 100 ms faster for MySQL. I ran both tests several times, to confirm what I was seeing.

        Note that previously, if statistics were added for a file, and then statistics were added again for that same file, but missing some number of values, those missing values were erased from the row. With this new implementation the old values for those missing statistics will be used. This case will probably never happen in the field.

        This addresses bug HIVE-2430.
        https://issues.apache.org/jira/browse/HIVE-2430

        Diffs


        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1165899
        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1165899
        trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1165899
        trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java 1165899

        Diff: https://reviews.apache.org/r/1737/diff

        Testing
        -------

        I added two new stress tests to TestStatsPublisherEnhanced. I also modified one of the tests to reflect the modified behavior described in the Description.

        I ran the unit test queries as well.

        Thanks,

        Kevin

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1737/ ----------------------------------------------------------- Review request for hive and Ning Zhang. Summary ------- I considered two different strategies for handling duplicates after removing the primary key from the table. 1) Go back to performing a select, and then updating if a row for the file exists, or inserting a new record otherwise. 2) Always insert records and then during aggregation get the max value for each statistic with a group by on the file name, and then aggregate those statistics. This diff contains the code for option 2. I determined this to be the better option by adding a couple stress tests to TestStatsPublisherEnhanced, and then comparing the run times for the two implementations using derby and MySQL. The two tests checked the performance when inserting a couple hundred rows for each of two files, and inserting several hundred rows, each for a different file. In each case, when i ran the tests on my machine there wasn't much difference for derby, but for MySQL I was seeing both tests run about 100 ms faster for MySQL. I ran both tests several times, to confirm what I was seeing. Note that previously, if statistics were added for a file, and then statistics were added again for that same file, but missing some number of values, those missing values were erased from the row. With this new implementation the old values for those missing statistics will be used. This case will probably never happen in the field. This addresses bug HIVE-2430 . https://issues.apache.org/jira/browse/HIVE-2430 Diffs trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1165899 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1165899 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java 1165899 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java 1165899 Diff: https://reviews.apache.org/r/1737/diff Testing ------- I added two new stress tests to TestStatsPublisherEnhanced. I also modified one of the tests to reflect the modified behavior described in the Description. I ran the unit test queries as well. Thanks, Kevin

          People

          • Assignee:
            Kevin Wilfong
            Reporter:
            Kevin Wilfong
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development