Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9436

RetryingMetaStoreClient does not retry JDOExceptions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 0.13.1
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:
      None

      Description

      RetryingMetaStoreClient has a bug in the following bit of code:

              } else if ((e.getCause() instanceof MetaException) &&
                  e.getCause().getMessage().matches("JDO[a-zA-Z]*Exception")) {
                caughtException = (MetaException) e.getCause();
              } else {
                throw e.getCause();
              }
      

      The bug here is that java String.matches matches the entire string to the regex, and thus, that match will fail if the message contains anything before or after JDO[a-zA-Z]*Exception. The solution, however, is very simple, we should match (?s).*JDO[a-zA-Z]*Exception.*

      1. HIVE-9436.2.patch
        0.9 kB
        Sushanth Sowmyan
      2. HIVE-9436.3.patch
        0.9 kB
        Sushanth Sowmyan
      3. HIVE-9436.patch
        0.9 kB
        Sushanth Sowmyan

        Issue Links

          Activity

          Hide
          sushanth Sushanth Sowmyan added a comment -

          Attaching patch. Thejas M Nair, could you please review?

          Show
          sushanth Sushanth Sowmyan added a comment - Attaching patch. Thejas M Nair , could you please review?
          Hide
          thejas Thejas M Nair added a comment -

          +1

          Show
          thejas Thejas M Nair added a comment - +1
          Hide
          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12693708/HIVE-9436.patch

          ERROR: -1 due to 1 failed/errored test(s), 7347 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_schemeAuthority
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2470/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2470/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2470/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 1 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12693708 - PreCommit-HIVE-TRUNK-Build

          Show
          hiveqa Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12693708/HIVE-9436.patch ERROR: -1 due to 1 failed/errored test(s), 7347 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_schemeAuthority Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2470/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2470/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2470/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed This message is automatically generated. ATTACHMENT ID: 12693708 - PreCommit-HIVE-TRUNK-Build
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Canceling patch to update one more thing - this needs to be a multiline match, from the looks of it.

          Show
          sushanth Sushanth Sowmyan added a comment - Canceling patch to update one more thing - this needs to be a multiline match, from the looks of it.
          Hide
          hsubramaniyan Hari Sankar Sivarama Subramaniyan added a comment -

          A minor comment since I remember looking at RetryingHMSHandler code for catching Nucleus Exceptions a while back. I think RetryingHMSHandler::invoke and RetryingMetaStoreClient::invoke should have similar exceptions caught(except might be TException or something that might be client specific or HMS handler specific). Is it possible to factorize the catch part and have them called from RetryingMetaStoreClient::invoke and RetryingHMSHandler::invoke to avoid any redundancy/ missing cases. If this is not possible, manually comparing these two functions for the exception type handled might help in resolving any similar missing exception handling cases. Another point is comparing the exception messages for JDOException regex looks a bit hairy to me. Please correct me if this understanding is wrong.

          Thanks
          Hari

          Show
          hsubramaniyan Hari Sankar Sivarama Subramaniyan added a comment - A minor comment since I remember looking at RetryingHMSHandler code for catching Nucleus Exceptions a while back. I think RetryingHMSHandler::invoke and RetryingMetaStoreClient::invoke should have similar exceptions caught(except might be TException or something that might be client specific or HMS handler specific). Is it possible to factorize the catch part and have them called from RetryingMetaStoreClient::invoke and RetryingHMSHandler::invoke to avoid any redundancy/ missing cases. If this is not possible, manually comparing these two functions for the exception type handled might help in resolving any similar missing exception handling cases. Another point is comparing the exception messages for JDOException regex looks a bit hairy to me. Please correct me if this understanding is wrong. Thanks Hari
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Updated patch to multi-line match

          Show
          sushanth Sushanth Sowmyan added a comment - Updated patch to multi-line match
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Hari, good point - I had a look at RetryingHMSHandler, and the code is rather similar. There seems to be two main differences though: First, on the HMSHandler side, the exact cause-chain of exceptions are still available, and we can compare using "instanceof", whereas on the client-side, we have only the first level exception object, and all internal objects are serialized, so we have to look at the messages instead. Secondly, the client has thrift-related exceptions which are acceptable as well.

          Still, possibly, we should undertake to refactor this into MetastoreUtils with a parameter which asks whether to look strictly at types, or look inside the messages as well, and call that from both sides, and also pass in a list of acceptable exception types. In that scenario, this would be usable all over the place where we need to retry. I would, however, like to tackle the refactor as an improvement rather than in a bug-fix jira if that's okay.

          Show
          sushanth Sushanth Sowmyan added a comment - Hari, good point - I had a look at RetryingHMSHandler, and the code is rather similar. There seems to be two main differences though: First, on the HMSHandler side, the exact cause-chain of exceptions are still available, and we can compare using "instanceof", whereas on the client-side, we have only the first level exception object, and all internal objects are serialized, so we have to look at the messages instead. Secondly, the client has thrift-related exceptions which are acceptable as well. Still, possibly, we should undertake to refactor this into MetastoreUtils with a parameter which asks whether to look strictly at types, or look inside the messages as well, and call that from both sides, and also pass in a list of acceptable exception types. In that scenario, this would be usable all over the place where we need to retry. I would, however, like to tackle the refactor as an improvement rather than in a bug-fix jira if that's okay.
          Hide
          thejas Thejas M Nair added a comment -

          Hari Sankar Sivarama Subramaniyan that is a good point regarding the two places where exception checks and retries are happening. IMO, the metastore client should ideally only retry any issue related to communication with metastore server. HMSHandler is the one that should retry on any temporary errors such as JDOException. As Sushanth Sowmyan said we have the real cause available there and we can do an instanceof check over there. It seems like a simple and cleaner change to do the check there. Sushanth Sowmyan What do you think ?

          Show
          thejas Thejas M Nair added a comment - Hari Sankar Sivarama Subramaniyan that is a good point regarding the two places where exception checks and retries are happening. IMO, the metastore client should ideally only retry any issue related to communication with metastore server. HMSHandler is the one that should retry on any temporary errors such as JDOException. As Sushanth Sowmyan said we have the real cause available there and we can do an instanceof check over there. It seems like a simple and cleaner change to do the check there. Sushanth Sowmyan What do you think ?
          Hide
          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12694069/HIVE-9436.2.patch

          ERROR: -1 due to 67 failed/errored test(s), 7346 tests executed
          Failed tests:

          TestCustomAuthentication - did not produce a TEST-*.xml file
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_excludeHadoop20
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_multi
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join11
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join12
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join13
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join27
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join4
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join5
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_simple_select
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_subq_in
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constantPropagateForSubQuery
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer1
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer6
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_explain_logical
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask2
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1_23
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1_23
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_metadataOnlyOptimizer
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_metadataonly1
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_gby2
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join_filter
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_outer_join5
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_union_view
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_vc
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_rcfile_union
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_25
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in_having
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin_having
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_views
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_table_access_keys_stats
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union24
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union28
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union30
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_null
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_10
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_2
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_5
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_6_subq
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_8
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_9
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_mapjoin
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_mapjoin_reduce
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_simple_select
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_subq_in
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning_2
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_filter_join_breaktask
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_filter_join_breaktask2
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_mrr
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_subquery_in
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_decimal_mapjoin
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_mapjoin_reduce
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_dynamic_partition_pruning
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_constprog_partitioner
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_1
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_2
          org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser
          org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore
          org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2495/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2495/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2495/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 67 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12694069 - PreCommit-HIVE-TRUNK-Build

          Show
          hiveqa Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12694069/HIVE-9436.2.patch ERROR: -1 due to 67 failed/errored test(s), 7346 tests executed Failed tests: TestCustomAuthentication - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_excludeHadoop20 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_multi org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join11 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join13 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join27 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_simple_select org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_subq_in org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constantPropagateForSubQuery org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer6 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_explain_logical org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1_23 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1_23 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_metadataOnlyOptimizer org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_metadataonly1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_gby2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join_filter org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_outer_join5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_union_view org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_vc org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_rcfile_union org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_25 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in_having org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin_having org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_views org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_table_access_keys_stats org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union24 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union28 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union30 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_null org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_10 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_6_subq org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_9 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_mapjoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_mapjoin_reduce org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_simple_select org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_subq_in org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynamic_partition_pruning_2 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_filter_join_breaktask org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_filter_join_breaktask2 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_mrr org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_subquery_in org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_decimal_mapjoin org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_mapjoin_reduce org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_dynamic_partition_pruning org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_constprog_partitioner org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_1 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_2 org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2495/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2495/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2495/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 67 tests failed This message is automatically generated. ATTACHMENT ID: 12694069 - PreCommit-HIVE-TRUNK-Build
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Randomly sampling some of the test failures:

          source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/udaf_histogram_numeric.q.out /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/udaf_histogram_numeric.q.out
          9c9
          < [{"x":139.16078431372554,"y":255.0},{"x":386.1428571428572,"y":245.0}]
          ---
          > [{"x":135.0284552845532,"y":246.0},{"x":381.39370078740143,"y":254.0}]
          
          Running: diff -a /home/hiveptest/54.159.206.72-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/auto_join12.q.out /home/hiveptest/54.159.206.72-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/auto_join12.q.out
          32c32
          <         $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 
          ---
          >         $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 
          35c35
          <         $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:src 
          ---
          >         $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:$hdt$_1:src 
          39c39
          <         $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 
          ---
          >         $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 
          54c54
          <         $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:src 
          ---
          >         $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:$hdt$_1:src 
          

          Neither of those have anything whatsoever to do with retries.

          Also, I see some other errors reporting "junit.framework.AssertionFailedError: Client Execution failed with error code = 40000 running", which is similar to what's reported in HIVE-8997. Is there a test flakiness issue right now in trunk?

          Show
          sushanth Sushanth Sowmyan added a comment - Randomly sampling some of the test failures: source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/udaf_histogram_numeric.q.out /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/udaf_histogram_numeric.q.out 9c9 < [{"x":139.16078431372554,"y":255.0},{"x":386.1428571428572,"y":245.0}] --- > [{"x":135.0284552845532,"y":246.0},{"x":381.39370078740143,"y":254.0}] Running: diff -a /home/hiveptest/54.159.206.72-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/auto_join12.q.out /home/hiveptest/54.159.206.72-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/auto_join12.q.out 32c32 < $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src --- > $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 35c35 < $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:src --- > $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:$hdt$_1:src 39c39 < $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src --- > $hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:$hdt$_0:src 54c54 < $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:src --- > $hdt$_0:$hdt$_0:$hdt$_1:$hdt$_1:$hdt$_1:$hdt$_1:src Neither of those have anything whatsoever to do with retries. Also, I see some other errors reporting "junit.framework.AssertionFailedError: Client Execution failed with error code = 40000 running", which is similar to what's reported in HIVE-8997 . Is there a test flakiness issue right now in trunk?
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Thejas M Nair/Hari Sankar Sivarama Subramaniyan : I have a couple of thoughts about moving JDOException retries solely to the metastore:

          a) Firstly, we have had cases so far where a JDOException invalidates the connection on the metastore side, and retrying from the metastore has not helped. Retrying from the client-side, though, causes a fresh openTransaction() that clears the connection and all history, sometimes by hitting a different HMSHandler, and this causes the retry from client to be more successful than a retry from server. Admittedly, this is more likely because we need to clean up our metastore code to make sure that the retry from the metastore-side handles this properly, and thus, is something we should attempt to improve.
          b) Second, from a perspective of a loaded metastore, having a metastore thread do retries, thus using up valuable metastore resources/time is more wasteful than having the client do retries. We thus tend to keep our metastore-side retries to a low amount, but the fact that we have client-side retries as well gives us an ability to be fail-fast on the metastore, but retry a large number of times in particular clients if we find the need to do so. Particularly, in HA configurations, I've seen a large number of retries and longer retry-intervals on the client side that allow a connection to go through despite metastore HUPs.
          c) Thirdly, speaking of HA, retrying on the client-side allows us to hit alternate metastores as well, if configured, if we have scenarios where one metastore is getting bogged down. As you mention, client should ideally only be retrying connection exceptions, but JDOExceptions are frequently the result of connection exceptions raised by the connection pool from the metastore to the db.

          There is definitely scope for refactoring and improvement in all this, I will look into it further, but for now, this is a simpler bugfix to enable the already-existing regex to work correctly.

          Show
          sushanth Sushanth Sowmyan added a comment - Thejas M Nair / Hari Sankar Sivarama Subramaniyan : I have a couple of thoughts about moving JDOException retries solely to the metastore: a) Firstly, we have had cases so far where a JDOException invalidates the connection on the metastore side, and retrying from the metastore has not helped. Retrying from the client-side, though, causes a fresh openTransaction() that clears the connection and all history, sometimes by hitting a different HMSHandler, and this causes the retry from client to be more successful than a retry from server. Admittedly, this is more likely because we need to clean up our metastore code to make sure that the retry from the metastore-side handles this properly, and thus, is something we should attempt to improve. b) Second, from a perspective of a loaded metastore, having a metastore thread do retries, thus using up valuable metastore resources/time is more wasteful than having the client do retries. We thus tend to keep our metastore-side retries to a low amount, but the fact that we have client-side retries as well gives us an ability to be fail-fast on the metastore, but retry a large number of times in particular clients if we find the need to do so. Particularly, in HA configurations, I've seen a large number of retries and longer retry-intervals on the client side that allow a connection to go through despite metastore HUPs. c) Thirdly, speaking of HA, retrying on the client-side allows us to hit alternate metastores as well, if configured, if we have scenarios where one metastore is getting bogged down. As you mention, client should ideally only be retrying connection exceptions, but JDOExceptions are frequently the result of connection exceptions raised by the connection pool from the metastore to the db. There is definitely scope for refactoring and improvement in all this, I will look into it further, but for now, this is a simpler bugfix to enable the already-existing regex to work correctly.
          Hide
          sushanth Sushanth Sowmyan added a comment -

          As to the precommit tests, it looks like the majority of those tests were already failing before this build:

          Test Result (66 failures / +3)
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
          org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore
          org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser
          org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
          ...
          

          This indicates that tests were already flaky (probably due to an earlier commit?) and there were a total of 3 reported regressions. Diffing between this build(2495) and the previous one where tests ran (2493) to find changes, I get:

          < org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket5
          ---
          > org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
          > org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore
          > org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser
          > org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
          

          For org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric:

          Running: diff -a /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/udaf_histogram_numeric.q.out /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/udaf_histogram_numeric.q.out
          9c9
          < [{"x":139.16078431372554,"y":255.0},{"x":386.1428571428572,"y":245.0}]
          ---
          > [{"x":135.0284552845532,"y":246.0},{"x":381.39370078740143,"y":254.0}]
          

          This does not look connected to this issue.

          As for org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore and

          java.lang.NullPointerException: null
          	at org.apache.hadoop.hive.metastore.HiveMetaStore.getDelegationToken(HiveMetaStore.java:5596)
          	at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.getDelegationTokenStr(TestHadoop20SAuthBridge.java:318)
          	at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.obtainTokenAndAddIntoUGI(TestHadoop20SAuthBridge.java:339)
          	at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore(TestHadoop20SAuthBridge.java:231)
          
          java.lang.NullPointerException: null
          	at org.apache.hadoop.hive.metastore.HiveMetaStore.getDelegationToken(HiveMetaStore.java:5596)
          	at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.getDelegationTokenStr(TestHadoop20SAuthBridge.java:318)
          	at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.access$100(TestHadoop20SAuthBridge.java:62)
          

          Both of these seem to be errors getting a delegation token, again something not connected with this connection retry issue.

          As to the last one, org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection, that looks like an issue with derby setup? I'm not certain. It does look like a valid issue in and of itself, but again, not related to this patch.

          java.sql.SQLException: Table/View 'TXNS' already exists in Schema 'APP'.
          	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
          	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
          	at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.duplicateDescriptorException(Unknown Source)
          	at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.addDescriptor(Unknown Source)
          	at org.apache.derby.impl.sql.execute.CreateTableConstantAction.executeConstantAction(Unknown Source)
          	at org.apache.derby.impl.sql.execute.MiscResultSet.open(Unknown Source)
          	at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
          	at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
          	at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
          	at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
          	at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
          	at org.apache.hadoop.hive.metastore.txn.TxnDbUtil.prepDb(TxnDbUtil.java:72)
          	at org.apache.hadoop.hive.metastore.txn.TxnDbUtil.prepDb(TxnDbUtil.java:131)
          	at org.apache.hive.hcatalog.streaming.TestStreaming.<init>(TestStreaming.java:157)
          
          Show
          sushanth Sushanth Sowmyan added a comment - As to the precommit tests, it looks like the majority of those tests were already failing before this build: Test Result (66 failures / +3) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection ... This indicates that tests were already flaky (probably due to an earlier commit?) and there were a total of 3 reported regressions. Diffing between this build(2495) and the previous one where tests ran (2493) to find changes, I get: < org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket5 --- > org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric > org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore > org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testMetastoreProxyUser > org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection For org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric: Running: diff -a /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/udaf_histogram_numeric.q.out /home/hiveptest/54.205.215.38-hiveptest-1/apache-svn-trunk-source/itests/qtest/../../ql/src/test/results/clientpositive/udaf_histogram_numeric.q.out 9c9 < [{"x":139.16078431372554,"y":255.0},{"x":386.1428571428572,"y":245.0}] --- > [{"x":135.0284552845532,"y":246.0},{"x":381.39370078740143,"y":254.0}] This does not look connected to this issue. As for org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore and java.lang.NullPointerException: null at org.apache.hadoop.hive.metastore.HiveMetaStore.getDelegationToken(HiveMetaStore.java:5596) at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.getDelegationTokenStr(TestHadoop20SAuthBridge.java:318) at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.obtainTokenAndAddIntoUGI(TestHadoop20SAuthBridge.java:339) at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.testSaslWithHiveMetaStore(TestHadoop20SAuthBridge.java:231) java.lang.NullPointerException: null at org.apache.hadoop.hive.metastore.HiveMetaStore.getDelegationToken(HiveMetaStore.java:5596) at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.getDelegationTokenStr(TestHadoop20SAuthBridge.java:318) at org.apache.hadoop.hive.thrift.TestHadoop20SAuthBridge.access$100(TestHadoop20SAuthBridge.java:62) Both of these seem to be errors getting a delegation token, again something not connected with this connection retry issue. As to the last one, org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection, that looks like an issue with derby setup? I'm not certain. It does look like a valid issue in and of itself, but again, not related to this patch. java.sql.SQLException: Table/View 'TXNS' already exists in Schema 'APP'. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.duplicateDescriptorException(Unknown Source) at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.addDescriptor(Unknown Source) at org.apache.derby.impl.sql.execute.CreateTableConstantAction.executeConstantAction(Unknown Source) at org.apache.derby.impl.sql.execute.MiscResultSet.open(Unknown Source) at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.hadoop.hive.metastore.txn.TxnDbUtil.prepDb(TxnDbUtil.java:72) at org.apache.hadoop.hive.metastore.txn.TxnDbUtil.prepDb(TxnDbUtil.java:131) at org.apache.hive.hcatalog.streaming.TestStreaming.<init>(TestStreaming.java:157)
          Hide
          thejas Thejas M Nair added a comment -

          +1
          I agree in context of metastore HA setup, it does makes sense to retry on JDOException as well.

          Show
          thejas Thejas M Nair added a comment - +1 I agree in context of metastore HA setup, it does makes sense to retry on JDOException as well.
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Cancelling patch to resubmit an identical .3.patch so that the precommit tests don't skip it after the test issues we've had in the past couple of days.

          Show
          sushanth Sushanth Sowmyan added a comment - Cancelling patch to resubmit an identical .3.patch so that the precommit tests don't skip it after the test issues we've had in the past couple of days.
          Hide
          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12694681/HIVE-9436.3.patch

          ERROR: -1 due to 3 failed/errored test(s), 7397 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
          org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
          org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2528/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2528/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2528/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 3 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12694681 - PreCommit-HIVE-TRUNK-Build

          Show
          hiveqa Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12694681/HIVE-9436.3.patch ERROR: -1 due to 3 failed/errored test(s), 7397 tests executed Failed tests: org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1 org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2528/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2528/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2528/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed This message is automatically generated. ATTACHMENT ID: 12694681 - PreCommit-HIVE-TRUNK-Build
          Hide
          sushanth Sushanth Sowmyan added a comment -

          The test failure now reported is down to 3, and they're unconnected to the issue being fixed here, so I will go ahead and commit this.

          Show
          sushanth Sushanth Sowmyan added a comment - The test failure now reported is down to 3, and they're unconnected to the issue being fixed here, so I will go ahead and commit this.
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Committed to trunk

          Show
          sushanth Sushanth Sowmyan added a comment - Committed to trunk
          Hide
          sushanth Sushanth Sowmyan added a comment -

          Vikram Dixit K - if you do respin an rc for 1.0, it would be useful to have this fix in as well - it's a simple fix which fixes retries from the client, and is a robustness fix. It isn't, however a breaking bug as I see it, since it is used only in the case of connection issues, where we give it a chance to retry instead of failing directly.

          Show
          sushanth Sushanth Sowmyan added a comment - Vikram Dixit K - if you do respin an rc for 1.0, it would be useful to have this fix in as well - it's a simple fix which fixes retries from the client, and is a robustness fix. It isn't, however a breaking bug as I see it, since it is used only in the case of connection issues, where we give it a chance to retry instead of failing directly.
          Hide
          vikram.dixit Vikram Dixit K added a comment -

          Committed to RC for 1.0.

          Show
          vikram.dixit Vikram Dixit K added a comment - Committed to RC for 1.0.
          Hide
          thejas Thejas M Nair added a comment -

          Updating release version for jiras resolved in 1.0.0 .

          Show
          thejas Thejas M Nair added a comment - Updating release version for jiras resolved in 1.0.0 .
          Hide
          thejas Thejas M Nair added a comment -

          This issue has been fixed in Apache Hive 1.0.0. If there is any issue with the fix, please open a new jira to address it.

          Show
          thejas Thejas M Nair added a comment - This issue has been fixed in Apache Hive 1.0.0. If there is any issue with the fix, please open a new jira to address it.

            People

            • Assignee:
              sushanth Sushanth Sowmyan
              Reporter:
              sushanth Sushanth Sowmyan
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development