Currently the DocumentNodeState has two revisions:
- getRevision() returns the read revision of this node state. This revision was used to read the node state from the underlying NodeDocument.
- getLastRevision() returns the revision when this node state was last modified. This revision also reflects changes made further down the tree, even when the node state itself was not directly affected by the change.
The lastRevision of a state is then used as the read revision of the child node states. This avoids reading the entire tree again with a different revision after the head revision changed because of a commit.
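The following is a minimal sketch of this propagation, not the actual DocumentNodeState code. It assumes revisions can be modeled as plain long values, and the names SketchNodeState, getChild and childLastRevision are hypothetical.

```java
// Hypothetical, simplified model: revisions are plain long values here,
// which is an assumption of this sketch and not Oak's revision type.
final class SketchNodeState {
    final String path;
    final long readRevision; // revision used to read this state
    final long lastRevision; // last change at or below this node

    SketchNodeState(String path, long readRevision, long lastRevision) {
        this.path = path;
        this.readRevision = readRevision;
        this.lastRevision = lastRevision;
    }

    // The child is read at the parent's lastRevision, not at the parent's
    // read revision. childLastRevision stands in for the value that would
    // come from the child's document.
    SketchNodeState getChild(String name, long childLastRevision) {
        String childPath = "/".equals(path) ? "/" + name : path + "/" + name;
        return new SketchNodeState(childPath, lastRevision, childLastRevision);
    }
}

final class ReadRevisionDemo {
    public static void main(String[] args) {
        // The root is read at head 10 and again at head 12.
        // The commit at 12 is assumed not to touch /a or anything below it.
        for (long head : new long[] {10, 12}) {
            SketchNodeState root = new SketchNodeState("/", head, head);
            SketchNodeState a = root.getChild("a", 5);
            SketchNodeState b = a.getChild("b", 5);
            System.out.println(b.path + " read at " + b.readRevision);
        }
    }
}
```

In both iterations /a/b is read at revision 5, so a state already read for that revision can be reused after the head moved from 10 to 12.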
This approach has at least two problems related to comparing node states:
- It does not work well with the current DiffCache implementation and lowers the hit rate of this cache. The DiffCache is pro-actively populated after a commit. The key for a diff is a combination of the previous and current commit revisions and the path. The value then tells which child nodes were added/removed/changed. As the comparison of node states proceeds and traverses the tree, the revision of a state may go back in time because the lastRevision is used as the read revision of the child nodes. This will cause misses in the diff cache, because the revisions do not match the previous and current commit revisions that were used to create the cache entries.
OAK-2562 tried to address this by keeping the read revision for child nodes at the read revision of the parent in calls of compareAgainstBaseState() when there is a diff cache hit. However, it turns out that node state comparison does not always start at the root state. The EventQueue implementation in oak-jcr starts at the paths indicated by the filter of the listener. This means OAK-2562 is not effective in this case, and the diff needs to be calculated again based on a set of revisions that differs from those of the original commit.
- When a diff is calculated for a parent with many child nodes, the DocumentNodeStore will perform a query on the underlying DocumentStore to get child nodes modified after a given timestamp. This timestamp is derived from the lower revision of the two lastRevisions of the parent node states to compare. The query gets problematic for the DocumentStore if the timestamp is too far in the past. This will happen when the parent node (and sub-tree) was not modified for some time. E.g. the MongoDocumentStore has an index on the _id and the _modified field. But if there are many child nodes, the _id index will not be that helpful, and if the timestamp is too far in the past, the _modified index is not selective either. This problem was already reported in OAK-1970 and linked issues.
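The cache miss described in the first problem can be illustrated with a small sketch. This is not the DiffCache API: DiffKey, SketchDiffCache, the long-valued revisions and the "changed: b" value format are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical cache key: (previous revision, current revision, path).
final class DiffKey {
    final long from;
    final long to;
    final String path;

    DiffKey(long from, long to, String path) {
        this.from = from;
        this.to = to;
        this.path = path;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DiffKey)) {
            return false;
        }
        DiffKey k = (DiffKey) o;
        return from == k.from && to == k.to && path.equals(k.path);
    }

    @Override
    public int hashCode() {
        return Objects.hash(from, to, path);
    }
}

// Hypothetical stand-in for a diff cache; values are free-form strings.
final class SketchDiffCache {
    private final Map<DiffKey, String> entries = new HashMap<>();

    void put(long from, long to, String path, String changes) {
        entries.put(new DiffKey(from, to, path), changes);
    }

    // Returns null on a cache miss.
    String get(long from, long to, String path) {
        return entries.get(new DiffKey(from, to, path));
    }
}

final class DiffCacheDemo {
    public static void main(String[] args) {
        SketchDiffCache cache = new SketchDiffCache();
        // Populated after a commit that moved the head from 10 to 12.
        cache.put(10, 12, "/a", "changed: b");
        // Lookup with the commit revisions.
        System.out.println(cache.get(10, 12, "/a"));
        // Lookup where the earlier revision went back in time to 7.
        System.out.println(cache.get(7, 12, "/a"));
    }
}
```

The second lookup returns null even though the same change is being asked about, because the key no longer matches the revision pair of the commit.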
Both of the above problems could be addressed by keeping track of the read revision of the root node state in each of the node states as the tree is traversed. The revision of the root state would then be used e.g. to derive the timestamp for the _modified constraint in the query. Because the revision of the root state is rather recent, the _modified constraint is very selective and the index on it would be the preferred choice.
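A rough sketch of the idea, under the same simplifying assumption that a revision is a long value that can be used directly as a timestamp. RootAwareState and ModifiedQuery are hypothetical names and do not exist in Oak.

```java
// Hypothetical node state that also carries the read revision of the root.
final class RootAwareState {
    final String path;
    final long rootRevision; // read revision of the root, inherited on traversal
    final long lastRevision; // last change at or below this node

    RootAwareState(String path, long rootRevision, long lastRevision) {
        this.path = path;
        this.rootRevision = rootRevision;
        this.lastRevision = lastRevision;
    }

    // The root revision is passed down unchanged to every child.
    RootAwareState getChild(String name, long childLastRevision) {
        String childPath = "/".equals(path) ? "/" + name : path + "/" + name;
        return new RootAwareState(childPath, rootRevision, childLastRevision);
    }
}

// Hypothetical helpers that derive the lower bound for the _modified constraint.
final class ModifiedQuery {
    // Current behaviour as described above: lower of the two lastRevisions.
    static long sinceLastRevisions(RootAwareState before, RootAwareState after) {
        return Math.min(before.lastRevision, after.lastRevision);
    }

    // Proposed behaviour: lower of the two root revisions.
    static long sinceRootRevisions(RootAwareState before, RootAwareState after) {
        return Math.min(before.rootRevision, after.rootRevision);
    }
}

final class RootRevisionDemo {
    public static void main(String[] args) {
        // /content was last modified at 100, long before the commit at 1010.
        RootAwareState before = new RootAwareState("/", 1000, 1000).getChild("content", 100);
        RootAwareState after = new RootAwareState("/", 1010, 1010).getChild("content", 1010);
        System.out.println(ModifiedQuery.sinceLastRevisions(before, after));
        System.out.println(ModifiedQuery.sinceRootRevisions(before, after));
    }
}
```

With these sample values the lastRevision-based bound is 100, far in the past, while the root-based bound is 1000, which keeps the _modified constraint close to the commit being compared.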