Uploaded image for project: 'Apache Knox'
  1. Apache Knox
  2. KNOX-2549

Knox CM discovery may lose relevant audit events

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: cm-discovery
    • Labels:
      None

      Description

       Test steps:
      1. Configure Knox service discovery; Knox is in HA
      1. Updated hive on tez-cm config hive.server2.transport.mode from binary to http
      2. Restarted Hive on Tez service

      Expected:
      After 2 mins, check if the HIVE service have the updated endpoint URL in the relevant topology 

      Issue:
      HIVE endpoint URL is updated on only one of the Knox instances.

       
      Let's see the relevant gateway logs and the related commands from CM of my new findings:

      gateway.log - knox-1-host
      -----------------------------
      2020-12-03 00:32:23,751 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
      2020-12-03 00:32:23,769 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:23.681Z
      2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
      2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
      2020-12-03 00:32:32,042 DEBUG knox.gateway (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96)) - Skipping CM descriptor monitoring of /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false; lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified = 2020-12-03T08:20:24.486931Z)
      ...
      2020-12-03 00:33:23,862 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
      2020-12-03 00:33:23,874 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:23.769Z
      2020-12-03 00:33:23,938 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any activation event found within the given time period
      
      
      gateway.log - knox-host-2
      -----------------------------
      
      2020-12-03 00:32:07,745 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
      2020-12-03 00:32:07,920 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:07.670Z
      2020-12-03 00:32:08,039 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
      2020-12-03 00:32:08,043 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
      ...
      2020-12-03 00:33:08,045 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
      2020-12-03 00:33:08,083 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:07.922Z
      2020-12-03 00:33:08,184 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true
      
      
      AUDIT events fetched from CM
      ----------------------------
      
      https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
      {
      	totalResults: 3,
      	items: [{
      			id: "7537adcb-d7da-4591-aa66-1f857584c099",
      			content: "Command Start on subject ATLAS-1 succeeded.",
      			timeOccurred: "2020-12-03T08:32:20.366Z",
      			timeReceived: "2020-12-03T08:32:20.369Z",
      			category: "AUDIT_EVENT",
      			severity: "INFORMATIONAL",
      			alert: false,
      			attributes: [{
      					values: [
      						"SvcCmdArgs{targetRoles=[], args=[]}"
      					],
      					name: "COMMAND_ARGS"
      				},
      				{
      					values: [
      						"Successfully started service."
      					],
      					name: "COMMAND_RESULT"
      				},
      				{
      					values: [
      						"556"
      					],
      					name: "COMMAND_ID"
      				},
      				{
      					values: [
      						"clouderaManager"
      					],
      					name: "USER"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
      					],
      					name: "URL"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"SUCCEEDED"
      					],
      					name: "COMMAND_STATUS"
      				},
      				{
      					values: [
      						"EV_SERVICE_STARTED"
      					],
      					name: "EVENTCODE"
      				},
      				{
      					values: [
      						"false"
      					],
      					name: "ALERT_SUPPRESSED"
      				},
      				{
      					values: [
      						"Start"
      					],
      					name: "COMMAND"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE"
      				},
      				{
      					values: [
      						"ATLAS"
      					],
      					name: "SERVICE_TYPE"
      				},
      				{
      					values: [
      						"1"
      					],
      					name: "CLUSTER_ID"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER"
      				},
      				{
      					values: [
      						"COMMAND_SUCCEEDED_WITH_TARGET"
      					],
      					name: "MESSAGE_CODES"
      				}
      			]
      		},
      		{
      			id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
      			content: "User admin created a new revision. Message was: Modified Server HTTPS Port.",
      			timeOccurred: "2020-12-03T08:32:01.769Z",
      			timeReceived: "2020-12-03T08:32:01.772Z",
      			category: "AUDIT_EVENT",
      			severity: "INFORMATIONAL",
      			alert: false,
      			attributes: [{
      					values: [
      						"257"
      					],
      					name: "REVISION"
      				},
      				{
      					values: [
      						"admin"
      					],
      					name: "USER"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
      					],
      					name: "URL"
      				},
      				{
      					values: [
      						"ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
      					],
      					name: "ROLE"
      				},
      				{
      					values: [
      						"ATLAS-1-ATLAS_SERVER-BASE"
      					],
      					name: "ROLE_CONFIG_GROUP"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"EV_REVISION_CREATED"
      					],
      					name: "EVENTCODE"
      				},
      				{
      					values: [
      						"false"
      					],
      					name: "ALERT_SUPPRESSED"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE"
      				},
      				{
      					values: [
      						"ATLAS"
      					],
      					name: "SERVICE_TYPE"
      				},
      				{
      					values: [
      						"1"
      					],
      					name: "CLUSTER_ID"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER"
      				},
      				{
      					values: [
      						"REVISION_CREATED"
      					],
      					name: "MESSAGE_CODES"
      				}
      			]
      		},
      		{
      			id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
      			content: "User admin started service Start.",
      			timeOccurred: "2020-12-03T08:31:58.360Z",
      			timeReceived: "2020-12-03T08:31:58.361Z",
      			category: "AUDIT_EVENT",
      			severity: "INFORMATIONAL",
      			alert: false,
      			attributes: [{
      					values: [
      						"SvcCmdArgs{targetRoles=[DbRole{id=65, name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63, hostName=knox-host-2}], args=[]}"
      					],
      					name: "COMMAND_ARGS"
      				},
      				{
      					values: [
      						"556"
      					],
      					name: "COMMAND_ID"
      				},
      				{
      					values: [
      						"admin"
      					],
      					name: "USER"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
      					],
      					name: "URL"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE_DISPLAY_NAME"
      				},
      				{
      					values: [
      						"STARTED"
      					],
      					name: "COMMAND_STATUS"
      				},
      				{
      					values: [
      						"EV_SERVICE_STARTED"
      					],
      					name: "EVENTCODE"
      				},
      				{
      					values: [
      						"false"
      					],
      					name: "ALERT_SUPPRESSED"
      				},
      				{
      					values: [
      						"Start"
      					],
      					name: "COMMAND"
      				},
      				{
      					values: [
      						"ATLAS-1"
      					],
      					name: "SERVICE"
      				},
      				{
      					values: [
      						"ATLAS"
      					],
      					name: "SERVICE_TYPE"
      				},
      				{
      					values: [
      						"1"
      					],
      					name: "CLUSTER_ID"
      				},
      				{
      					values: [
      						"Cluster 1"
      					],
      					name: "CLUSTER"
      				},
      				{
      					values: [
      						"SERVICE_STARTED"
      					],
      					name: "MESSAGE_CODES"
      				}
      			]
      		}
      	]
      }
      

      On knox-host-2 everything worked as expected. However, on knox-host-1, my ATLAS service configuration change got lost:

      • The start command successfully finished on 2020-12-03T08:32:20.366Z
      • Polling configuration analyzer fetched audit events from CM server at 2020-12-03 08:32:23,769Z since 2020-12-03T08:31:23.681Z. However, CM did not return the start command that finished 3 seconds ago, even it should have returned it
      • A minute later, the polling configuration analyzer fetched audit events from CM server at 2020-12-03 08:33:23,874Z since 2020-12-03T08:32:23.769Z (the last query timestamp). This query returned nothing which is correct since there was no activation within this time frame

      As a result, the command with ID of 7537adcb-d7da-4591-aa66-1f857584c099 has been lost on knox-host-1. It's important to note that the first timestamp, when Knox issued a CM query, was sort of close to the timestamp when the successful start event occurred. There seems to be a lag between the CM server/agent communication which may cause this issue. This means Knox should be smarter in this respect and try to mitigate the chance of missing relevant audit events.

      I'm going to implement the following approach:

      • Knox will look back a little bit more time than is does currently: instead of checking the [lastQueryTimestamp -> currentTimestamp] interval, we'll check [lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]
      • Knox will maintain a cache of all relevant events that it processed to avoid re-processing them

        Attachments

          Activity

            People

            • Assignee:
              smolnar Sandor Molnar
              Reporter:
              smolnar Sandor Molnar

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 20m
                20m

                  Issue deployment