Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
After stressing a cluster for several days, we found that there are a lot of CLOSED EC pipelines.
[ozoneadmin@TENCENT64 ~/ozone-1.3.0-SNAPSHOT]$ ./bin/ozone admin pipeline list --state=CLOSED | wc -l
997
This makes commands such as `ozone admin datanode list` and `ozone admin pipeline list` return slowly, and it may add unnecessary load to SCM HA, so these CLOSED EC pipelines should be cleaned up properly.
Several ways to consider:
1. We close pipelines in `WritableECContainerProvider` by calling `pipelineManager.closePipeline(pipeline, true);`. Here `true` means the pipeline record is not removed until a timeout expires. However, the removal itself only happens for Ratis pipelines, in `BackgroundPipelineCreator`, when it calls `pipelineManager.scrubPipeline(replicationConfig);`. We could change the argument to `false`. The records of CLOSED pipelines that get selected would then be removed, but the records of CLOSED pipelines that are never selected would remain.
2. We could try to close the pipeline after the container close event from the DN is received. But a container close follows a lifecycle of OPEN -> CLOSING -> QUASI_CLOSED -> CLOSED, so I think it would be tricky to hook a pipeline close action onto the point where an EC container is closed.
3. We could have a dedicated background thread that runs periodically and cleans up the CLOSED pipelines in a batch. This also benefits SCM HA compared to solution 1, since we tend to do batch cleanups instead of removing pipelines one by one.
I think we could choose solution 3 to solve this problem.
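As a rough illustration of solution 3, here is a minimal, self-contained sketch of a periodic batch scrubber. It does not use the real SCM classes: `PipelineRecord`, `PipelineState`, `ClosedPipelineScrubber`, `scrubOnce`, and the interval parameter are all hypothetical names made up for this example, and the in-memory map only stands in for the pipeline state kept by SCM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for illustration; these are NOT the Ozone SCM types.
enum PipelineState { OPEN, CLOSED }

final class PipelineRecord {
  final String id;
  final boolean erasureCoded;
  volatile PipelineState state;

  PipelineRecord(String id, boolean erasureCoded, PipelineState state) {
    this.id = id;
    this.erasureCoded = erasureCoded;
    this.state = state;
  }
}

final class ClosedPipelineScrubber {
  // Stand-in for the pipeline records that SCM keeps.
  private final Map<String, PipelineRecord> pipelines = new ConcurrentHashMap<>();

  void add(PipelineRecord p) {
    pipelines.put(p.id, p);
  }

  int size() {
    return pipelines.size();
  }

  /** Collects all CLOSED EC pipelines, removes them in one pass, returns the count. */
  int scrubOnce() {
    List<String> toRemove = new ArrayList<>();
    for (PipelineRecord p : pipelines.values()) {
      if (p.erasureCoded && p.state == PipelineState.CLOSED) {
        toRemove.add(p.id);
      }
    }
    // Assumption: in an HA setup this batch would be applied as a group
    // rather than as one replicated operation per pipeline.
    for (String id : toRemove) {
      pipelines.remove(id);
    }
    return toRemove.size();
  }

  /** Runs scrubOnce() on a daemon thread every intervalSeconds. */
  ScheduledExecutorService start(long intervalSeconds) {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "closed-pipeline-scrubber");
          t.setDaemon(true);
          return t;
        });
    executor.scheduleWithFixedDelay(
        this::scrubOnce, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    return executor;
  }
}
```

The sketch only touches EC pipelines in the CLOSED state, so open pipelines and Ratis pipelines (which already have their own scrubbing path) are left alone. How long a CLOSED pipeline should be kept before removal, and how the batch maps onto SCM HA, are left open here.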