Production experience has shown that big repositories are prone to thrashing:
Monitoring showed as massive level of major page faults, load averages several times the number of cores, system cpu levels well above 50% and extreme levels of IO. As more IOPS was provisioned the instance consumed all available IOPS. The TechOps team reported many TB of read IO per hour and hardly any write IO.
Investigation revealed that the repository size was just larger than the available RAM on the machine. The instance was running in MMAPED mode and the IO was due to major page faults mapping in and out pages of memory. This was made worse by transparent huge page settings causing huge pages to be mapped proactively on major page faults. Compaction reduced the repository size to less than RAM. The TechOps team now monitor the total tar file size and dont let it exceed the RAM on the machine, scheduling compactions to keep within limits. Since the default to TarMK was to run memory mapped rather than on heap, the JVM had no visibility of the mayhem being caused at OS level.
This epic is all about improving scalability of the TarMK wrt. the content. Below are some initial points to consider. Let's create issues and link them to this epic as we go.
- What kind of internal / external monitoring do we need to understand and optimally predict thrashing? Can we monitor the working set (active pages)? The number of segments in the segment cache might be a good starting point.
- (How) can we reproduce the thrashing (easily enough)? Can we scale it down (i.e. to an instance with littler RAM)?
- What is the impact of transparent huge pages (and switching it off)? How much do we suffer from read amplification? What would be the impact of not memory mapping but instead increasing the size of the segment buffer accordingly? Both approaches aim at having finer grained control over the data actually being loaded into RAM.
- What other OS level tweaks should / can we look at?
- Can we reduce the working set by keeping it more compact? E.g. running GC/compaction, reducing read amplification (see above), improving de-duplication of values, storing values more efficiently (e.g. dates, and boolean), can we on the fly compress buffers (e.g. segments)?
- How do we testing with big repositories?
- What is a big repository? (Potential target: 100 GB segment store - 500M nodes, TBC)
- What to measure (indicators of size): size on disk (after compaction), number of JCR nodes, number of node records (reachable vs. waste)
- How to measure?
- oak-run debug (needs improvements for better scalability)
- one-line tool to provide all the info?
- How to obtain big repositories (generate or re-use existing)?
- What to analyze / monitor / debug?
- Possible limits: number of nodes (relative to RAM) for which trashing starts to occur, max. number of direct children, max. concurrent requests during online garbage collection.
- Platform monitoring:
- basic: disc size, IO, CPU, memory
- Asses impact of hardware upgrades on performance. E.g. what impact does doubling RAM/IO/CPU have on our test scenarios.
- in depth: page faults, writes / reads per process, working set of nodes, commit statistics, incoming requests vs Oak operations, other hiccups
- Tools: Ganglia, jHiccups, AppDynamics