Hive / HIVE-6893

out of sequence error in HiveMetastore server

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: HiveServer2
    • Labels: None

      Description

      Calls listing databases or tables fail. It seems to be a concurrency problem.

      2014-03-06 05:34:00,785 ERROR hive.log: org.apache.thrift.TApplicationException: get_databases failed: out of sequence response
              at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_databases(ThriftHiveMetastore.java:472)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_databases(ThriftHiveMetastore.java:459)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:648)
              at org.apache.hive.service.cli.operation.GetSchemasOperation.run(GetSchemasOperation.java:66)
              at org.apache.hive.service.cli.session.HiveSessionImpl.getSchemas(HiveSessionImpl.java:278)
              at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:62)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
              at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:582)
              at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:57)
              at com.sun.proxy.$Proxy9.getSchemas(Unknown Source)
              at org.apache.hive.service.cli.CLIService.getSchemas(CLIService.java:192)
              at org.apache.hive.service.cli.thrift.ThriftCLIService.GetSchemas(ThriftCLIService.java:263)
              at org.apache.hive.service.cli.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1433)
              at org.apache.hive.service.cli.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1418)
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
              at org.apache.hive.service.cli.thrift.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:38)
              at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:724)
      
      Attachments

      1. HIVE-6893.1.patch (5 kB, Naveen Gangam)

        Activity

        Gunther Hagleitner made changes -
        Fix Version/s: 0.14.0
        Gilad Wolff added a comment -

        I encountered the same issue: we get a socket read timeout and then an out-of-sequence error. In one case we got an OOM in our client, and I suspect it's the same underlying issue. Here is the metastore sequence of events. Our client tried to drop a table starting at 14:02:25. Note that we use a 20-second timeout for our client:

        2014-06-23 14:02:25,181 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 11: source:/10.20.93.47 drop_table : db=cloudera_manager_metastore_canary_test_db tbl=CM_TEST_TABLE
        2014-06-23 14:02:25,181 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=hue	ip=/10.20.93.47	cmd=source:/10.20.93.47 drop_table : db=cloudera_manager_metastore_canary_test_db tbl=CM_TEST_TABLE	
        2014-06-23 14:02:25,182 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 11: source:/10.20.93.47 get_table : db=cloudera_manager_metastore_canary_test_db tbl=CM_TEST_TABLE
        2014-06-23 14:02:25,182 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=hue	ip=/10.20.93.47	cmd=source:/10.20.93.47 get_table : db=cloudera_manager_metastore_canary_test_db tbl=CM_TEST_TABLE	
        2014-06-23 14:02:46,596 INFO hive.metastore.hivemetastoressimpl: deleting  hdfs://jenkins-debian60-17.ent.cloudera.com:8020/user/hue/.cloudera_manager_hive_metastore_canary/HIVE_1_HIVEMETASTORE_627a77825bb851bf2db30317a698dded/2014_06_23_14_02_11/cm_test_table
        2014-06-23 14:02:46,694 INFO hive.metastore.hivemetastoressimpl: Moved to trash: hdfs://jenkins-debian60-17.ent.cloudera.com:8020/user/hue/.cloudera_manager_hive_metastore_canary/HIVE_1_HIVEMETASTORE_627a77825bb851bf2db30317a698dded/2014_06_23_14_02_11/cm_test_table
        

        On our client we get a socket timeout for the drop table call at 14:02:45:

        2:02:45.209 PM 	WARN 	com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary Metastore HIVE-1-HIVEMETASTORE-627a77825bb851bf2db30317a698dded: Failed to drop table 
        com.cloudera.cmf.cdhclient.common.hive.MetaException: java.net.SocketTimeoutException: Read timed out
        

        We then try to drop the database immediately afterwards, and the next message in our logs is:

        2:02:46.697 PM 	WARN 	com.cloudera.cmf.cdh4client.hive.MetastoreClientImpl 	Could not drop hive database: cloudera_manager_metastore_canary_test_db
        com.cloudera.cdh4client.hive.shaded.org.apache.thrift.TApplicationException: get_database failed: out of sequence response
        	at com.cloudera.cdh4client.hive.shaded.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
        	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
        	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
        	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
        	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropDatabase(HiveMetaStoreClient.java:479)
        	at com.cloudera.cmf.cdh4client.hive.MetastoreClientImpl.dropDatabase(MetastoreClientImpl.java:160)
        

        Note that the moved-to-trash message in the Hive metastore is from 14:02:46,694 and the out-of-sequence exception is from 2:02:46.697. I know that order in time does not imply causation, but is it possible that we are getting the drop_table acknowledgment instead of the get_database response?
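
        A minimal sketch (plain Java, not Hive or Thrift code) of the failure mode described above, assuming the usual Thrift-style pairing of each request with a sequence id: the first call times out while its reply is still in flight on the shared connection, so the next call on the same client reads that stale reply and the sequence ids no longer match. Method names, delays, and timeouts below are illustrative only.

        // Illustrative only: a shared connection modeled as a FIFO queue of replies,
        // each reply carrying the sequence id of the request it answers.
        import java.util.concurrent.*;

        public class OutOfSequenceDemo {
            private static final BlockingQueue<Integer> wire = new LinkedBlockingQueue<>();
            private static final ScheduledExecutorService server =
                    Executors.newSingleThreadScheduledExecutor();
            private static int nextSeqId = 0;

            // One RPC: send a request with a fresh seqid, wait up to timeoutMs for the reply.
            private static void call(String method, long serverDelayMs, long timeoutMs)
                    throws InterruptedException {
                final int seqId = ++nextSeqId;
                server.schedule(() -> { wire.add(seqId); }, serverDelayMs, TimeUnit.MILLISECONDS);

                Integer replySeqId = wire.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (replySeqId == null) {
                    // The client gives up, but the reply is still coming on the connection.
                    System.out.println(method + " failed: read timed out (seqid " + seqId + ")");
                } else if (replySeqId != seqId) {
                    // The next caller reads the previous call's late reply.
                    System.out.println(method + " failed: out of sequence response (expected "
                            + seqId + ", got " + replySeqId + ")");
                } else {
                    System.out.println(method + " ok (seqid " + seqId + ")");
                }
            }

            public static void main(String[] args) throws InterruptedException {
                call("drop_table", 300, 100);    // slow server + short timeout: client times out
                call("get_database", 300, 1000); // reads the stale drop_table reply first
                server.shutdownNow();
            }
        }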

        Ashutosh Chauhan added a comment -

        The patch in its current form sidesteps the underlying problem. I think it's worth spending time to figure out what's going on. FWIW, I have seen the TestRetryingHMSHandler.testRetryingHMSHandler test fail on trunk in Hive QA runs with the following trace:

        org.apache.thrift.TApplicationException: create_database failed: out of sequence response
        	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
        	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:511)
        	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:498)
        	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:534)
        	at org.apache.hadoop.hive.metastore.TestRetryingHMSHandler.testRetryingHMSHandler(TestRetryingHMSHandler.java:76)
        

        The above test failure, which is independent of HS2, provides an independent repro case. It fails only rarely in Hive QA runs, though, hinting at some race condition.

        Navis made changes -
        Fix Version/s: 0.14.0
        Fix Version/s: 0.13.0
        Szehon Ho added a comment -

        FYI, these are known test errors due to the recent switch to hadoop-2. For reference, see the bottom of a clean hadoop-2 build run from two days back: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/precommit-hive-15/execution.txt

        Hive QA added a comment -

        Overall: -1, at least one test failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12641303/HIVE-6893.1.patch

        ERROR: -1 due to 41 failed/errored test(s), 5417 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join32
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_numeric
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby2_map_skew
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_infer_bucket_sort_list_bucket
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_6
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_7
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_8
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_test_outer
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nullgroup3
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_createas1
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join4
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_select_dummy_source
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_partscan_1_23
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_symlink_text_input_format
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_truncate_column_list_bucket
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udf_current_database
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_1
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_10
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_12
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_13
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_14
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_17
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_19
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_2
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_20
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_21
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_22
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_23
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_24
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_4
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_5
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_7
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_8
        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_9
        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dynamic_partitions_with_whitelist
        org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_stats_partialscan_autogether
        

        Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/9/testReport
        Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/9/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 41 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12641303

        Naveen Gangam added a comment -

        Review posted to the review board.
        https://reviews.apache.org/r/20569/

        Vaibhav Gumashta added a comment -

        Naveen Gangam Can you also upload the patch to the review board?

        Naveen Gangam made changes -
        Status: Open → Patch Available
        Fix Version/s: 0.13.0
        Naveen Gangam made changes -
        Attachment: HIVE-6893.1.patch
        Naveen Gangam added a comment -

        This proposed fix makes HS2 use a synchronized wrapper for the metastore client when the HMS is running as a standalone process. In embedded mode, we would not be reading sequenced responses over a socket.

        This fix does NOT synchronize all the calls to the HMS. For example, Hive.getMSC() returns a metastore client instance that is not thread-safe. There is a bit of risk in synchronizing these calls, as it might have a regressive performance impact.

        I will measure the performance impact of using a synchronized wrapper everywhere and see if the reward outweighs the risk.
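
        For illustration, a minimal sketch of what such a synchronized wrapper might look like, built as a dynamic proxy over IMetaStoreClient. This is an assumption about the general approach, not the contents of HIVE-6893.1.patch; in embedded mode the unwrapped client could simply be kept.

        // Illustrative sketch only -- not the actual HIVE-6893.1.patch. One way to
        // serialize all calls on a shared, non-thread-safe metastore client is to hand
        // out a dynamic proxy whose handler synchronizes on the underlying client, so
        // only one request/response pair is in flight on the socket at a time.
        import java.lang.reflect.InvocationHandler;
        import java.lang.reflect.InvocationTargetException;
        import java.lang.reflect.Method;
        import java.lang.reflect.Proxy;

        import org.apache.hadoop.hive.metastore.IMetaStoreClient;

        public final class SynchronizedMetaStoreClient {

          private SynchronizedMetaStoreClient() {}

          /** Wraps the given client so that only one thread at a time can call into it. */
          public static IMetaStoreClient wrap(final IMetaStoreClient client) {
            InvocationHandler handler = new InvocationHandler() {
              @Override
              public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                synchronized (client) {
                  try {
                    return method.invoke(client, args);
                  } catch (InvocationTargetException e) {
                    throw e.getCause();   // rethrow the metastore client's original exception
                  }
                }
              }
            };
            return (IMetaStoreClient) Proxy.newProxyInstance(
                IMetaStoreClient.class.getClassLoader(),
                new Class<?>[] { IMetaStoreClient.class },
                handler);
          }
        }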

        Naveen Gangam made changes -
        Assignee: Naveen Gangam
        Naveen Gangam added a comment -

        The metastore is running as a standalone process/service outside the HS2 JVM. This issue was seen after the following occurred (there is no indication of a problem prior):
        1) Upgrade of the HMS DB (MySQL).
        2) Running with a slightly newer version of HUE.
        3) About 15 parallel Beeswax sessions to HS2, each running a query.

        We haven't been able to reproduce this in-house using a JDBC client that spins off threads that connect to HS2, run a query, and exit.
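
        A hypothetical sketch of the kind of multi-threaded JDBC harness described above; host, port, credentials, and query are placeholders, and this is not a known-good reproduction of the bug.

        // Hypothetical repro harness: many threads concurrently open JDBC connections
        // to HS2, run a query, and exit.
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        public class Hs2ConcurrencyRepro {
          private static final String URL = "jdbc:hive2://hs2-host:10000/default";
          private static final int SESSIONS = 15;   // mirrors the ~15 parallel sessions reported

          public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            ExecutorService pool = Executors.newFixedThreadPool(SESSIONS);
            for (int i = 0; i < SESSIONS; i++) {
              pool.submit(() -> {
                try (Connection conn = DriverManager.getConnection(URL, "hive", "");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                  while (rs.next()) {
                    // drain the result set; metadata queries like this go through the HMS client
                  }
                } catch (Exception e) {
                  e.printStackTrace();   // an out-of-sequence failure would surface here
                }
              });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
          }
        }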

        We suspected one of two causes:
        1) Socket timeouts during reads by parallel invocations between HS2 (the HMS client) and the HMS.
        2) Concurrency issues in the HMS on invocations to the HMS.

        After noticing socket timeout exceptions in the HS2 logs, we had the customer increase the socket read timeout from 90s to 120s. That did not help; moreover, the Hive logs indicate that the "out of sequence" errors occurred first and the socket timeouts later, pretty far apart in time. So the socket read timeout is not likely the root cause.

        It must be concurrent access to the HMS by multiple Beeswax sessions that's causing this issue.

        Hope this helps.

        sukhendu chakraborty made changes -
        Summary: Out of sequence response → out of sequence error in HiveMetastore server
        Vaibhav Gumashta added a comment -

        Romain Rigaux As a follow-up, can you also try starting HiveServer2 with -hiveconf hive.metastore.uris=" " and see if you get the same error?

        Vaibhav Gumashta added a comment -

        Romain Rigaux This seems like a metastore concurrency issue: the metastore client socket appears to be reading the RPC response for a different call, hence the out-of-sequence exception. I think we'll have a better idea once we learn more about Ashutosh Chauhan's queries.

        Ashutosh Chauhan added a comment -

        Also, did you see any other error messages in your log, such as TException or TableAlreadyExistsException, or only the one you posted?

        Ashutosh Chauhan added a comment -

        Romain Rigaux Can you provide more details about your setup?
        Was the metastore running as a separate process or embedded within HS2?
        Were the HS2 clients ODBC or JDBC? How many concurrent clients were there, and how many queries were they firing?

        Romain Rigaux made changes -
        Summary: ut of sequence response → Out of sequence response
        Romain Rigaux made changes -
        Summary: out of sequence response → ut of sequence response
        Romain Rigaux created issue -

          People

          • Assignee: Naveen Gangam
          • Reporter: Romain Rigaux
          • Votes: 2
          • Watchers: 17
