Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.5.0
-
None
Description
Test steps:
1. Configure Knox service discovery; Knox is in HA
1. Updated hive on tez-cm config hive.server2.transport.mode from binary to http
2. Restarted Hive on Tez service
Expected:
After 2 mins, check if the HIVE service have the updated endpoint URL in the relevant topology
Issue:
HIVE endpoint URL is updated on only one of the Knox instances.
Let's see the relevant gateway logs and the related commands from CM of my new findings:
gateway.log - knox-1-host ----------------------------- 2020-12-03 00:32:23,751 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes... 2020-12-03 00:32:23,769 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:23.681Z 2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false 2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false 2020-12-03 00:32:32,042 DEBUG knox.gateway (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96)) - Skipping CM descriptor monitoring of /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false; lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified = 2020-12-03T08:20:24.486931Z) ... 2020-12-03 00:33:23,862 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes... 2020-12-03 00:33:23,874 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:23.769Z 2020-12-03 00:33:23,938 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any activation event found within the given time period gateway.log - knox-host-2 ----------------------------- 2020-12-03 00:32:07,745 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes... 2020-12-03 00:32:07,920 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:07.670Z 2020-12-03 00:32:08,039 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false 2020-12-03 00:32:08,043 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false ... 2020-12-03 00:33:08,045 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes... 2020-12-03 00:33:08,083 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:07.922Z 2020-12-03 00:33:08,184 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true AUDIT events fetched from CM ---------------------------- https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z { totalResults: 3, items: [{ id: "7537adcb-d7da-4591-aa66-1f857584c099", content: "Command Start on subject ATLAS-1 succeeded.", timeOccurred: "2020-12-03T08:32:20.366Z", timeReceived: "2020-12-03T08:32:20.369Z", category: "AUDIT_EVENT", severity: "INFORMATIONAL", alert: false, attributes: [{ values: [ "SvcCmdArgs{targetRoles=[], args=[]}" ], name: "COMMAND_ARGS" }, { values: [ "Successfully started service." ], name: "COMMAND_RESULT" }, { values: [ "556" ], name: "COMMAND_ID" }, { values: [ "clouderaManager" ], name: "USER" }, { values: [ "Cluster 1" ], name: "CLUSTER_DISPLAY_NAME" }, { values: [ "https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099" ], name: "URL" }, { values: [ "ATLAS-1" ], name: "SERVICE_DISPLAY_NAME" }, { values: [ "SUCCEEDED" ], name: "COMMAND_STATUS" }, { values: [ "EV_SERVICE_STARTED" ], name: "EVENTCODE" }, { values: [ "false" ], name: "ALERT_SUPPRESSED" }, { values: [ "Start" ], name: "COMMAND" }, { values: [ "ATLAS-1" ], name: "SERVICE" }, { values: [ "ATLAS" ], name: "SERVICE_TYPE" }, { values: [ "1" ], name: "CLUSTER_ID" }, { values: [ "Cluster 1" ], name: "CLUSTER" }, { values: [ "COMMAND_SUCCEEDED_WITH_TARGET" ], name: "MESSAGE_CODES" } ] }, { id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2", content: "User admin created a new revision. Message was: Modified Server HTTPS Port.", timeOccurred: "2020-12-03T08:32:01.769Z", timeReceived: "2020-12-03T08:32:01.772Z", category: "AUDIT_EVENT", severity: "INFORMATIONAL", alert: false, attributes: [{ values: [ "257" ], name: "REVISION" }, { values: [ "admin" ], name: "USER" }, { values: [ "Cluster 1" ], name: "CLUSTER_DISPLAY_NAME" }, { values: [ "https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2" ], name: "URL" }, { values: [ "ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63" ], name: "ROLE" }, { values: [ "ATLAS-1-ATLAS_SERVER-BASE" ], name: "ROLE_CONFIG_GROUP" }, { values: [ "ATLAS-1" ], name: "SERVICE_DISPLAY_NAME" }, { values: [ "EV_REVISION_CREATED" ], name: "EVENTCODE" }, { values: [ "false" ], name: "ALERT_SUPPRESSED" }, { values: [ "ATLAS-1" ], name: "SERVICE" }, { values: [ "ATLAS" ], name: "SERVICE_TYPE" }, { values: [ "1" ], name: "CLUSTER_ID" }, { values: [ "Cluster 1" ], name: "CLUSTER" }, { values: [ "REVISION_CREATED" ], name: "MESSAGE_CODES" } ] }, { id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd", content: "User admin started service Start.", timeOccurred: "2020-12-03T08:31:58.360Z", timeReceived: "2020-12-03T08:31:58.361Z", category: "AUDIT_EVENT", severity: "INFORMATIONAL", alert: false, attributes: [{ values: [ "SvcCmdArgs{targetRoles=[DbRole{id=65, name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63, hostName=knox-host-2}], args=[]}" ], name: "COMMAND_ARGS" }, { values: [ "556" ], name: "COMMAND_ID" }, { values: [ "admin" ], name: "USER" }, { values: [ "Cluster 1" ], name: "CLUSTER_DISPLAY_NAME" }, { values: [ "https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd" ], name: "URL" }, { values: [ "ATLAS-1" ], name: "SERVICE_DISPLAY_NAME" }, { values: [ "STARTED" ], name: "COMMAND_STATUS" }, { values: [ "EV_SERVICE_STARTED" ], name: "EVENTCODE" }, { values: [ "false" ], name: "ALERT_SUPPRESSED" }, { values: [ "Start" ], name: "COMMAND" }, { values: [ "ATLAS-1" ], name: "SERVICE" }, { values: [ "ATLAS" ], name: "SERVICE_TYPE" }, { values: [ "1" ], name: "CLUSTER_ID" }, { values: [ "Cluster 1" ], name: "CLUSTER" }, { values: [ "SERVICE_STARTED" ], name: "MESSAGE_CODES" } ] } ] }
On knox-host-2 everything worked as expected. However, on knox-host-1, my ATLAS service configuration change got lost:
- The start command successfully finished on 2020-12-03T08:32:20.366Z
- Polling configuration analyzer fetched audit events from CM server at 2020-12-03 08:32:23,769Z since 2020-12-03T08:31:23.681Z. However, CM did not return the start command that finished 3 seconds ago, even it should have returned it
- A minute later, the polling configuration analyzer fetched audit events from CM server at 2020-12-03 08:33:23,874Z since 2020-12-03T08:32:23.769Z (the last query timestamp). This query returned nothing which is correct since there was no activation within this time frame
As a result, the command with ID of 7537adcb-d7da-4591-aa66-1f857584c099 has been lost on knox-host-1. It's important to note that the first timestamp, when Knox issued a CM query, was sort of close to the timestamp when the successful start event occurred. There seems to be a lag between the CM server/agent communication which may cause this issue. This means Knox should be smarter in this respect and try to mitigate the chance of missing relevant audit events.
I'm going to implement the following approach:
- Knox will look back a little bit more time than is does currently: instead of checking the [lastQueryTimestamp -> currentTimestamp] interval, we'll check [lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]
- Knox will maintain a cache of all relevant events that it processed to avoid re-processing them
Attachments
Issue Links
- links to