This epic is all about simplifying the operational aspects of the TarMK. Broadly this can be broken down into the following four topics.
- We need to improve monitoring for system load and health. It should be easy for operators to figure out which parts of the TarMK are within safe bounds and which are not.
- Failures should be easy to diagnose and their root cause easy to pinpoint. It should be evident if and how a failure can be fixed by the operator.
- Management tasks should be easy to use, clear and safe. It should be evident how to achieve a certain task, what it means to execute it and what its parameters mean (discoverability). Executing a task should not cause harm to the system because the system is not in the right state (e.g. running restore concurrently with backup should be safe).
- We need better tooling for diagnosing systems, e.g. analysis of file stores (what content, how much content, distribution over space and time, reachability, retention time, garbage, etc.), both online and offline (i.e. post mortem).
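The safety requirement for management tasks (e.g. restore must not run while backup is in progress) could be met by a small guard that admits only one operation at a time. The following is a minimal sketch only: `OperationGuard`, `tryRun` and `current` are hypothetical names and not part of any existing Oak API.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical guard (not an Oak API) that admits one management operation at a time. */
final class OperationGuard {
    // Name of the operation currently in progress, or null when idle.
    private final AtomicReference<String> running = new AtomicReference<>();

    /**
     * Runs the task only if no other operation is in progress.
     * Returns false without running the task when another operation holds the guard.
     */
    boolean tryRun(String name, Runnable task) {
        if (!running.compareAndSet(null, name)) {
            return false; // rejected: the caller can report which operation is running
        }
        try {
            task.run();
            return true;
        } finally {
            running.set(null); // release the guard even if the task throws
        }
    }

    /** Name of the operation in progress, if any. */
    Optional<String> current() {
        return Optional.ofNullable(running.get());
    }
}
```

With such a guard, a restore requested while a backup is running would be rejected with a clear message rather than being allowed to interfere. Rejecting instead of blocking is a design choice of this sketch; queueing the second operation would be an equally valid policy.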
Below is a list of items to address in no specific order. Let's start extracting them into individual issues linked to this epic as we start tackling this.
- Throughput (e.g. time to commit, time to save, etc.)
- Thrashing (onset thereof)
- SNFE (transient vs. catastrophic)
- FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, etc.)
- Cold standby (progress, liveliness, latency, etc.)
- Revisit backup/restore (OAK-5103, OAK-4866)
- Coordination of management operations (ability to run conditionally, prevent them from running concurrently, etc.)
- Progress monitor for oak-run compact
- Crash recovery for oak-run compact (e.g. run cleanup only to remove garbage left by prior crash)
- Bring oak-run check up to date. Address scalability and performance issues. Include more useful statistics (e.g. node types, child node lists, content distribution, etc.)
- Changes over time
- Consolidation of various (unversioned) scripts into oak-run like 'node count script', 'node remove script'.
- Allow connecting tools to a running instance.
- Snapshotting support: restartable stats collection (snapshot at certain revision, diff to collect extras)
- "Friendly" output formats that can be easily used by other tools (e.g. Unix tools, Kibana, etc.)
- Proper usage of stdin and stdout
- Proper exit codes
- A current gap in tooling is around healing a repository plagued with SNFEs: bridge the gap between oak-run check and the 'oak console node count script', and provide options to plug the holes so that the repository is restored to a consistent state. One idea would be to complement rolling back the segment store to the last good revision with rolling it forward to a new, fixed good revision. The simplest way of fixing is to just replace unreadable items with empty ones (i.e. "plugging the holes"). From there one could diff this new fixed revision against the last good revision to assess the damage and see what else needs fixing (e.g. to regain consistency wrt. JCR).
- Classification of tools into development/research/experimental and production (customer facing). The latter need a different level of support, maintenance, QE, documentation etc. Possibly mark via documentation which is which.
- Group commands from oak-run in namespaces. Assign a different namespace to each persistence implementation in Oak. Let every implementation parse its own commands. Move commands closer to their implementation and relieve oak-run from code bloat. See OAK-5437 for further details.
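The items on proper exit codes, proper use of stdout, and tool-friendly output formats can be illustrated together. The sketch below is an assumption about how a command wrapper might look, not how oak-run is implemented: `ToolRunner`, its constants and the tab-separated output format are all hypothetical.

```java
import java.io.PrintStream;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical wrapper (not part of oak-run) mapping a tool's outcome to an exit code. */
final class ToolRunner {
    static final int EXIT_OK = 0;
    static final int EXIT_FAILURE = 1;

    /**
     * Runs the tool body. Results go to 'out' as tab-separated key/value lines
     * that Unix tools such as cut or awk can consume; diagnostics go to 'err'.
     * The exit code is returned rather than passed to System.exit so that the
     * logic stays testable.
     */
    static int run(Callable<Map<String, Long>> tool, PrintStream out, PrintStream err) {
        try {
            Map<String, Long> stats = tool.call();
            stats.forEach((key, value) -> out.print(key + "\t" + value + "\n"));
            out.flush();
            return EXIT_OK;
        } catch (Exception e) {
            // Never report success when the tool failed.
            err.print("error: " + e.getMessage() + "\n");
            err.flush();
            return EXIT_FAILURE;
        }
    }
}
```

A command's `main` method would then end with `System.exit(ToolRunner.run(tool, System.out, System.err))`, so that shell scripts and monitoring systems can rely on a non-zero exit code whenever the command fails.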
Issues in Epic
||Key||Summary||Status||Assignee||
|OAK-7174|The check command returns a zero exit code on error|Closed|Francesco Mari|
|OAK-7058|oak-run compact reports success even when it was cancelled|Closed|Michael Dürig|
|OAK-6941|Compatibility matrix for oak-run compact|Open|Unassigned|
|OAK-6584|Add tooling API|Closed|Michael Dürig|
|OAK-7207|Define porcelain and plumbing tools for the Segment Store|Open|Unassigned|
|OAK-7043|Collect SegmentStore stats as part of status zip|Open|Unassigned|
|OAK-5792|TarMK: Implement tooling to repair broken nodes|Open|Andrei Dulceanu|
|OAK-7224|oak-run check should have an option to check the segments checksums|Open|Andrei Dulceanu|
|OAK-5360|Cancellation of gc should be reflected by RevisionGC.getRevisionGCStatus()|Open|Unassigned|
|OAK-6373|oak-run check should also check checkpoints|Closed|Andrei Dulceanu|
|OAK-7075|Document oak-run compact arguments and system properties|Closed|Michael Dürig|
|OAK-6784|Exceptions are inhibited in oak-run compact|Closed|Francesco Mari|
|OAK-5635|Revisit FileStoreStats mbean stats format|Resolved|Michael Dürig|
|OAK-6626|Replace standby blob chunk size configuration with feature flag|Closed|Andrei Dulceanu|
|OAK-5956|Improve cache statistics of the segment cache|Closed|Michael Dürig|
|OAK-5634|Expose IOMonitor stats via JMX|Closed|Michael Dürig|
|OAK-4619|Unify RecordCacheStats and CacheStats|Closed|Michael Dürig|
|OAK-5352|Enable RevisionGC task for non primary SegmentNodeStore|Closed|Tomek Rękawek|
|OAK-3679|Rollback to timestamp|Resolved|Unassigned|
|OAK-6371|Implement better tools for reparing a corrupt repository|Resolved|Unassigned|
|OAK-6553|Progress indicator for compaction|Closed|Michael Dürig|
|OAK-7168|The debug command returns a zero exit code on error|Closed|Francesco Mari|
|OAK-7169|The datastorecheck returns a zero exit code on error|Closed|Francesco Mari|
|OAK-7171|The history command returns a zero exit code on error|Closed|Francesco Mari|
|OAK-7234|Check for outdated journal at startup|Open|Unassigned|
|OAK-5885|segment-tar should have a tarmkrecovery command|Resolved|Michael Dürig|
|OAK-4689|Add information about amount of data vs. waste to oak-run|Open|Unassigned|
|OAK-4994|Implement additional record types|Reopened|Unassigned|
|OAK-7504|Include dynamic commit information in the persisted repository data|Open|Unassigned|
|OAK-7634|Repository migration docs should include info on TAR <-> Azure sidegrade|Open|Andrei Dulceanu|
|OAK-7635|oak-run check should support Azure Segment Store|Open|Andrei Dulceanu|
|OAK-7728|Oak run check command fails with SegmentNotFound exception|Closed|Andrei Dulceanu|