[OAK-5468] Ease TarMK Operations - ASF JIRA

Details

Type: Epic
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: segment-tar
Labels:

Epic Name:
TarMK Operations

Description

Ease of TarMK Operations

This epic is about simplifying the operational aspects of the TarMK. Broadly, this breaks down into the following three topics.

Monitoring

We need to improve monitoring for system load and health. It should be easy for operators to figure out which parts of the TarMK are within safe bounds and which are not.
Failures should be easy to diagnose, down to their root cause. It should be evident if and how a failure can be fixed by the operator.

Management

Management tasks should be easy to use, clear and safe. It should be evident how to achieve a certain task, what it means to execute it and what its parameters mean (discoverability). Executing a task should not cause harm to the system because the system is not in the right state (e.g. running restore concurrently with backup should be safe).

Tooling

We need better tooling for diagnosing systems, e.g. analysis of file stores (what content, how much content, distribution over space and time, reachability, retention time, garbage, etc.), both online and offline (i.e. post mortem).

Individual improvements

Below is a list of items to address, in no specific order. Let's start extracting them into individual issues linked to this epic as we start tackling them.

Monitoring

Throughput (e.g. time to commit, time to save, etc.)
Thrashing (onset thereof)
SNFE (SegmentNotFoundException; transient vs. catastrophic)
DSGC (data store garbage collection)
FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, etc.)
Cold standby (progress, liveness, latency, etc.)
...

Management

Revisit backup/restore (OAK-5103, OAK-4866)
Coordination of management operations (ability to run conditionally, prevent them from running concurrently, etc.)
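
The coordination item above could be sketched as follows. This is a minimal illustration only, not an existing Oak API: the class `ManagementCoordinator`, its `tryRun` method and the `Outcome` values are hypothetical names. It assumes a single JVM and grants at most one management operation the right to run at a time; the precondition is evaluated while the slot is held, so no other management operation can change the state between the check and the task.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

/** Hypothetical coordinator: at most one management operation runs at a time. */
final class ManagementCoordinator {
    enum Outcome { COMPLETED, REJECTED_BUSY, REJECTED_PRECONDITION }

    // Name of the operation currently holding the slot, or null when idle.
    private final AtomicReference<String> running = new AtomicReference<>();

    /**
     * Runs the task only if no other operation is in progress and the
     * precondition holds. Never blocks: a busy slot leads to rejection.
     */
    Outcome tryRun(String name, BooleanSupplier precondition, Runnable task) {
        Objects.requireNonNull(name, "name");
        if (!running.compareAndSet(null, name)) {
            return Outcome.REJECTED_BUSY;
        }
        try {
            if (!precondition.getAsBoolean()) {
                return Outcome.REJECTED_PRECONDITION;
            }
            task.run();
            return Outcome.COMPLETED;
        } finally {
            running.set(null); // release the slot even if the task throws
        }
    }

    /** Name of the operation currently running, if any. */
    Optional<String> current() {
        return Optional.ofNullable(running.get());
    }

    public static void main(String[] args) {
        ManagementCoordinator c = new ManagementCoordinator();
        // A restore attempted while a backup is running gets rejected.
        Outcome backup = c.tryRun("backup", () -> true, () ->
                System.out.println("restore during backup: "
                        + c.tryRun("restore", () -> true, () -> { })));
        System.out.println("backup: " + backup);
    }
}
```

Rejecting instead of queueing keeps the behaviour easy to reason about for an operator; a real implementation would also need to coordinate with operations started from outside the JVM (e.g. offline tools), which this sketch does not attempt.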

Tooling

Progress monitor for oak-run compact
Crash recovery for oak-run compact (e.g. run cleanup only to remove garbage left by prior crash)
Bring oak-run check up to date. Address scalability and performance issues. Include more useful statistics (e.g. node types, child node lists, content distribution, etc.)
Changes over time
Consolidation of various (unversioned) scripts, such as the 'node count script' and the 'node remove script', into oak-run.
Allow connecting tools to a running instance.
Snapshotting support: restartable stats collection (snapshot at certain revision, diff to collect extras)
"Friendly" output formats that can be easily used by other tools (e.g. Unix tools, Kibana, etc.)
Proper usage of stdin and stdout
Proper exit codes
A current gap in tooling is around healing a repository plagued with SNFEs: bridge the gap between oak-run check and the 'oak console node count script', and provide options to plug the holes to restore the repository to a consistent state. One idea would be to complement rolling back the segment store to the last good revision with rolling it forward to a new, fixed good revision. The simplest way of fixing is to just replace unreadable items with empty ones (i.e. "plugging the holes"). From there one could diff this new fixed revision against the last good revision to assess the damage and see what else needs fixing (e.g. to regain consistency with respect to JCR).
Classification of tools into development / research / experimental and production (customer facing). The latter need a different level of support, maintenance, QE, documentation etc. Possibly mark via documentation which is which.
Group commands from oak-run in namespaces. Assign a different namespace to each persistence implementation in Oak. Let every implementation parse its own commands. Move commands closer to their implementation and relieve oak-run from code bloat. See OAK-5437 for further details.
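
The "plugging the holes" idea from the list above can be illustrated on a toy model. The sketch below does not use any Oak API: `RepairSketch`, `Node`, `plugHoles` and `missing` are hypothetical names, and a node flagged as unreadable merely stands in for content whose segment can no longer be found. It rolls a damaged revision forward by replacing unreadable nodes with empty ones, then diffs the result against the last good revision to list what was lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of rolling a damaged revision forward; not an Oak API. */
final class RepairSketch {

    /** A node with named children; readable == false simulates an SNFE on access. */
    static final class Node {
        final boolean readable;
        final Map<String, Node> children = new TreeMap<>();

        Node(boolean readable) {
            this.readable = readable;
        }

        Node child(String name, Node node) {
            children.put(name, node);
            return this;
        }
    }

    /** Copies the tree, replacing each unreadable node with an empty one and recording its path. */
    static Node plugHoles(Node node, String path, List<String> plugged) {
        if (!node.readable) {
            plugged.add(path.isEmpty() ? "/" : path);
            return new Node(true); // empty replacement: the subtree below is lost
        }
        Node copy = new Node(true);
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            copy.children.put(e.getKey(),
                    plugHoles(e.getValue(), path + "/" + e.getKey(), plugged));
        }
        return copy;
    }

    /** Collects the top-most paths present in lastGood but absent from fixed. */
    static void missing(Node lastGood, Node fixed, String path, List<String> out) {
        for (Map.Entry<String, Node> e : lastGood.children.entrySet()) {
            String childPath = path + "/" + e.getKey();
            Node other = fixed.children.get(e.getKey());
            if (other == null) {
                out.add(childPath);
            } else {
                missing(e.getValue(), other, childPath, out);
            }
        }
    }

    public static void main(String[] args) {
        Node lastGood = new Node(true)
                .child("a", new Node(true).child("x", new Node(true)))
                .child("b", new Node(true));
        Node damagedHead = new Node(true)
                .child("a", new Node(false)) // unreadable in the damaged revision
                .child("b", new Node(true));

        List<String> plugged = new ArrayList<>();
        Node fixed = plugHoles(damagedHead, "", plugged);
        List<String> lost = new ArrayList<>();
        missing(lastGood, fixed, "", lost);
        System.out.println("plugged: " + plugged + ", lost since last good: " + lost);
    }
}
```

The diff only reports structural loss in this toy model. Assessing JCR-level consistency (node types, references and so on) would need additional checks on top.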

Issue Links

incorporates: OAK-1576 SegmentMK: Implement refined conflict resolution for addExistingNode conflicts (Open)

is blocked by: OAK-5973 Metrics integration should expose endpoints in an non flat name space (Open)

People

Assignee: Unassigned

Reporter: Michael Dürig

Votes: 0

Watchers: 3

Dates

Created: 17/Jan/17 14:19

Updated: 04/Oct/19 17:51