Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: None
Description
We need metrics about du runs, for example:
# HELP total count of du started per data directory
du_started_count{path="/ozone/data/storage1", node="node1.example.com"} 234
# HELP total count of du done per data directory
du_finished_count{path="/ozone/data/storage1", node="node1.example.com"} 233
# HELP du latency in total (milli)seconds
du_latency_time{path="/ozone/data/storage1", node="node1.example.com"} 123423e+10
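As a rough illustration of the three metrics above, here is a minimal Java sketch of per-directory counters wrapped around a du run. DuMetrics, PathStats, timed and render are hypothetical names made up for this example, not existing Ozone classes, and a real patch would more likely use the datanode's existing metrics framework. The sketch assumes a du run that throws should count as started but not finished.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Hypothetical per-directory du metrics holder; not an existing Ozone class. */
class DuMetrics {

  /** Counters kept for one data directory. */
  static final class PathStats {
    final LongAdder started = new LongAdder();
    final LongAdder finished = new LongAdder();
    final LongAdder latencyMillis = new LongAdder();
  }

  private final String node;
  private final Map<String, PathStats> stats = new ConcurrentHashMap<>();

  DuMetrics(String node) {
    this.node = node;
  }

  /** Runs the given du task and records start, finish and elapsed time. */
  long timed(String path, LongSupplier duTask) {
    PathStats s = stats.computeIfAbsent(path, p -> new PathStats());
    s.started.increment();
    long begin = System.nanoTime();
    // If the task throws, "finished" and the latency are not updated (assumption).
    long usedBytes = duTask.getAsLong();
    s.latencyMillis.add((System.nanoTime() - begin) / 1_000_000);
    s.finished.increment();
    return usedBytes;
  }

  /** Renders the counters in the same text layout as the sample above. */
  String render() {
    StringBuilder sb = new StringBuilder();
    stats.forEach((path, s) -> {
      String labels = String.format("{path=\"%s\", node=\"%s\"}", path, node);
      sb.append("du_started_count").append(labels)
          .append(' ').append(s.started.sum()).append('\n');
      sb.append("du_finished_count").append(labels)
          .append(' ').append(s.finished.sum()).append('\n');
      sb.append("du_latency_time").append(labels)
          .append(' ').append(s.latencyMillis.sum()).append('\n');
    });
    return sb.toString();
  }
}
```

With counters like these, a started count that stays ahead of the finished count would suggest a du run that is still in flight or that failed, and the accumulated latency divided by the finished count would give the average duration of a run.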
Datanodes run the du command to measure the disk usage of block files. Because du walks the directory tree recursively, it can put a fairly heavy load on the disk device, especially when block files are relatively small (e.g. the small-file problem in local file systems). du alone is not that heavy, but when it overlaps with container scan tasks it is hard to tell how much additional load du contributes to the disk. (The default interval of the container metadata scan is 3h and the du interval is 1h; I have already changed them in our environment.)
We can't easily observe the du load unless we log in to the datanode and run "top" or similar, or unless the log level is set to debug. IMO du runs should be logged at INFO level.