This issue was identified by analyzing the logs, code, etc., of the system tests. Many of the system tests require that the flow be torn down after each test (or after a set of tests). Teardown stops all processors and reporting tasks and disables all controller services, then waits until the REST API reports them as fully stopped/disabled. It then purges any queues and deletes all components.
However, occasionally we see a failure in the step that deletes the components. One node reports that the component cannot be deleted because it is still running, so the REST API sends back a 409. Yet before making this request, we have already requested all components and verified that each one is STOPPED/DISABLED and has no active threads.
If we look at the code that determines whether or not they are STOPPED/DISABLED, it uses the "status" field in the Entity objects (reportingTaskEntity.getStatus().getRunStatus(), for example).
However, the DTO also has a state field: ReportingTaskDTO.getState()
The same pair of fields exists for Processors, Reporting Tasks, and Controller Services.
In order to maintain backward compatibility, we need to leave both of these fields. However, the issue we have appears to be in the ReportingTaskEntityMerger, ProcessorEntityMerger, and ControllerServiceEntityMerger.
These mergers do not merge the status field in the Entity. They take into account only the fields in the DTO. As a result, we can have one node indicating that the status is STOPPED with 0 threads while another node indicates STOPPED with 1 thread. The merged response may carry the STOPPED with 0 threads status, which the test takes as confirmation that the component is fully stopped. At this point, a delete or update will fail because the component is not in the desired state on all nodes.
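To illustrate the kind of merge that would avoid this, here is a minimal sketch. It does not use the actual NiFi classes: NodeStatus, merge, and isFullyStopped are hypothetical names introduced only for this example, and summing the thread counts is an assumption about how the per-node values should be combined. The point is only that the merged status should look "fully stopped" when, and only when, every node agrees.

```java
import java.util.List;

public class StatusMergeSketch {

    // Hypothetical stand-in for the per-node "status" carried by an Entity.
    // This is NOT the real NiFi status DTO.
    static final class NodeStatus {
        final String runStatus;
        final int activeThreadCount;

        NodeStatus(String runStatus, int activeThreadCount) {
            this.runStatus = runStatus;
            this.activeThreadCount = activeThreadCount;
        }
    }

    // Merge the per-node statuses conservatively:
    // - thread counts are summed (assumption), so one busy node keeps the total above 0
    // - any node that is not STOPPED/DISABLED overrides the merged run status
    static NodeStatus merge(List<NodeStatus> perNode) {
        if (perNode.isEmpty()) {
            throw new IllegalArgumentException("at least one node status is required");
        }
        String runStatus = perNode.get(0).runStatus;
        int threads = 0;
        for (NodeStatus s : perNode) {
            threads += s.activeThreadCount;
            if (!isTerminal(s.runStatus)) {
                runStatus = s.runStatus;
            }
        }
        return new NodeStatus(runStatus, threads);
    }

    private static boolean isTerminal(String runStatus) {
        return "STOPPED".equals(runStatus) || "DISABLED".equals(runStatus);
    }

    // The check a teardown routine would apply before issuing a delete.
    static boolean isFullyStopped(NodeStatus merged) {
        return isTerminal(merged.runStatus) && merged.activeThreadCount == 0;
    }
}
```

With this approach, merging "STOPPED with 0 threads" and "STOPPED with 1 thread" yields STOPPED with 1 thread, so the teardown keeps waiting instead of issuing a delete that one node will reject.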
We need to update the three Entity Mergers to ensure that they properly merge the status in the Entity objects as well.