This epic is all about simplifying the operational aspects of the TarMK. Broadly this can be broken down into the following four topics.
- We need to improve monitoring for system load and health. It should be easy for operators to figure out which parts of the TarMK are within safe bounds and which are not.
- Failures should be easy to diagnose and their root cause easy to pinpoint. It should be evident if and how a failure can be fixed by the operator.
- Management tasks should be easy to use, clear and safe. It should be evident how to achieve a certain task, what it means to execute it and what its parameters mean (discoverability). Executing a task should not cause harm to the system because the system is not in the right state (e.g. running restore concurrently with backup should be safe).
- We need better tooling for diagnosing systems, e.g. analysis of file stores (what content, how much content, distribution over space and time, reachability, retention time, garbage, etc.), both online and offline (i.e. post mortem).
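To illustrate the safety requirement for management tasks (e.g. restore must not run while backup is in progress), here is a minimal sketch of a guard that rejects a second operation while one is running. The class `ManagementOperationGuard` and its methods are hypothetical and not part of Oak; how such a guard would be wired into the actual backup, restore or compaction code is left open.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical guard allowing at most one management operation at a time.
 * Illustrative only; this is not an existing Oak API.
 */
public class ManagementOperationGuard {

    /** Thrown when an operation is rejected because another one is running. */
    public static class OperationRejectedException extends Exception {
        public OperationRejectedException(String message) {
            super(message);
        }
    }

    // Name of the operation currently running, or null when idle.
    private final AtomicReference<String> running = new AtomicReference<>();

    /** Runs the task if no other operation is active, otherwise rejects it. */
    public <T> T run(String name, Callable<T> task) throws Exception {
        if (!running.compareAndSet(null, name)) {
            // The reported name is best effort: the other operation
            // may have finished between the two calls.
            throw new OperationRejectedException(
                    "Cannot start '" + name + "': '" + running.get() + "' is still running");
        }
        try {
            return task.call();
        } finally {
            running.set(null); // always release, even if the task failed
        }
    }

    /** Returns the name of the running operation, or null when idle. */
    public String current() {
        return running.get();
    }

    public static void main(String[] args) throws Exception {
        ManagementOperationGuard guard = new ManagementOperationGuard();
        guard.run("backup", () -> {
            try {
                guard.run("restore", () -> null);
            } catch (OperationRejectedException e) {
                System.out.println(e.getMessage());
            }
            return null;
        });
    }
}
```

Rejecting the second operation with a clear message, rather than blocking until the first one finishes, keeps the outcome evident to the operator. A queueing variant would be an equally valid design.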
Below is a list of items to address in no specific order. Let's start extracting them into individual issues linked to this epic as we start tackling them.
- Throughput (e.g. time to commit, time to save, etc.)
- Thrashing (onset thereof)
- SNFE, i.e. SegmentNotFoundException (transient vs. catastrophic)
- FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, etc.)
- Cold standby (progress, liveness, latency, etc.)
- Revisit backup/restore (OAK-5103, OAK-4866)
- Coordination of management operations (ability to run conditionally, prevent them from running concurrently, etc.)
- Progress monitor for oak-run compact
- Crash recovery for oak-run compact (e.g. run cleanup only to remove garbage left by prior crash)
- Bring oak-run check up to date. Address scalability and performance issues. Include more useful statistics (e.g. node types, child node lists, content distribution, etc.)
- Changes over time
- Consolidation of various (unversioned) scripts into oak-run like 'node count script', 'node remove script'.
- Allow connecting tools to a running instance.
- Snapshotting support: restartable stats collection (snapshot at certain revision, diff to collect extras)
- "Friendly" output formats that can be easily used by other tools (e.g. Unix tools, Kibana, etc.)
- Proper usage of stdin and stdout
- Proper exit codes
- The current gap in tooling is around the idea of healing a repository plagued with SNFEs: bridge the gap between oak-run check and the 'oak console node count script', and provide options to plug the holes to restore the repository to a consistent state. One idea would be to complement rolling back the segment store to the last good revision with rolling it forward to a new, fixed good revision. The simplest way of fixing is to just replace unreadable items with empty ones (i.e. "plugging the holes"). From there one could diff this new fixed revision against the last good revision to assess the damage and see what else needs fixing (e.g. to regain consistency with respect to JCR).
- Classification of tools between development / research / experimental and production (customer facing). The latter need a different level of support, maintenance, QE, documentation etc. Possibly mark via documentation which is which.
- Group commands from oak-run in namespaces. Assign a different namespace to each persistence implementation in Oak. Let every implementation parse its own commands. Move commands closer to their implementation and relieve oak-run from code bloat. See OAK-5437 for further details.
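The items above on "friendly" output formats, proper usage of stdout, and proper exit codes could look roughly like the following sketch. Everything in it is an assumption for illustration: the class `ToolOutputSketch`, the exit code values, the statistic names and the placeholder numbers are hypothetical and do not reflect the current behaviour of oak-run.

```java
import java.io.PrintStream;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of tool output conventions: machine-readable results
 * on stdout, diagnostics on stderr, distinct exit codes. Not an Oak API.
 */
public class ToolOutputSketch {

    public static final int EXIT_OK = 0;           // ran, nothing wrong found
    public static final int EXIT_INCONSISTENT = 1; // ran, problems found
    public static final int EXIT_USAGE = 2;        // invalid invocation

    /** Renders the statistics as one JSON object on a single line. */
    public static String toJsonLine(Map<String, Long> stats) {
        // Keys are assumed to be plain identifiers that need no escaping.
        StringBuilder sb = new StringBuilder("{");
        String separator = "";
        for (Map.Entry<String, Long> e : stats.entrySet()) {
            sb.append(separator).append('"').append(e.getKey()).append("\":").append(e.getValue());
            separator = ",";
        }
        return sb.append('}').toString();
    }

    /** Runs the tool and returns the exit code instead of exiting, for testability. */
    public static int run(String[] args, PrintStream out, PrintStream err) {
        if (args.length != 1) {
            err.println("usage: tool <path-to-store>");
            return EXIT_USAGE;
        }
        // Placeholder numbers; a real tool would collect them from the store.
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("tarFiles", 3L);
        stats.put("segments", 1200L);
        stats.put("unreadableSegments", 0L);

        err.println("checked " + args[0]); // human-oriented progress goes to stderr
        out.println(toJsonLine(stats));    // only the result goes to stdout
        return stats.get("unreadableSegments") == 0 ? EXIT_OK : EXIT_INCONSISTENT;
    }

    public static void main(String[] args) {
        System.exit(run(args, System.out, System.err));
    }
}
```

Keeping stdout free of progress messages means the result can be piped directly into other Unix tools or a log shipper, while the exit code lets scripts distinguish a clean run from a failed check or a usage error.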