Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5.1
    • Component/s: None
    • Labels:
      None

      Description

      Currently the consistency checker loads in a batch of node ids and for each node id fetches the corresponding bundle, its child bundles, and parent bundle separately. This makes the consistency checker perform less than optimal and may take hours (days?) to complete for large repositories.

      I've been able to make the checker execute about 20 times faster on my local machine by loading in batches of node prop bundles at once. For 17000 nodes in the workspace the current implementation ran for about 23 seconds whereas with the enhancements I made it finished in 1.2 seconds.

      Now the problem lies in the fact that loading in node prop bundles in batches may require a lot of memory. And it is not very predictable how much per batch size because the sizes of the individual bundles are unpredictable.

      Also the node prop bundle contains much more information than is needed for a consistency check.

      What would be ideal in this situation is to introduce a new type - call it NodeInfo - that contains only the structural information the checker needs to do its work. Meaning the node id, the parent id and the child ids. In order to allow for a possible future referential integrity check perhaps also its reference type propeties.

      The IterablePersistenceManager interface would then get an additional method:
      Map<NodeId, NodeInfo> getAllNodeInfos();

      If this is an acceptable proposal I would like to work on this and contribute a patch.

        Activity

        Unico Hommes created issue -
        Hide
        Thomas Mueller added a comment -

        > perform less than optimal and may take hours (days?) to complete for large repositories.

        It sounds like the problem is caused by the fact that node ids are currently randomly distributed. The getAllNodeInfos() loads the bundles sorted by node id (so performance isn't affected by the fact that node ids are randomly distributed), and the regular traversal loads the nodes by traversal (that means the database has to do a lot of random seeks).

        A solution for this problem would be JCR-2857 (Support sequential (non-random) node ids). This would also improve performance of other traversal operations.

        Show
        Thomas Mueller added a comment - > perform less than optimal and may take hours (days?) to complete for large repositories. It sounds like the problem is caused by the fact that node ids are currently randomly distributed. The getAllNodeInfos() loads the bundles sorted by node id (so performance isn't affected by the fact that node ids are randomly distributed), and the regular traversal loads the nodes by traversal (that means the database has to do a lot of random seeks). A solution for this problem would be JCR-2857 (Support sequential (non-random) node ids). This would also improve performance of other traversal operations.
        Hide
        Bart van der Schans added a comment -

        Hi Thomas,

        The problem is not the sequential node ids. As long as you get the node ids in the same order as the index the performance is ok (and as long as you don't try to count them in InnoDB). The big performance gain is from fetching a bunch of bundles in one query at once instead of fetching a bunch of node ids and then fetch each bundle separately with a query. With the former (in a slightly different consistency checker we created at Hippo) I can check about a 1M nodes in three and a half minute with a batch site of 1k.

        Bart

        Show
        Bart van der Schans added a comment - Hi Thomas, The problem is not the sequential node ids. As long as you get the node ids in the same order as the index the performance is ok (and as long as you don't try to count them in InnoDB). The big performance gain is from fetching a bunch of bundles in one query at once instead of fetching a bunch of node ids and then fetch each bundle separately with a query. With the former (in a slightly different consistency checker we created at Hippo) I can check about a 1M nodes in three and a half minute with a batch site of 1k. Bart
        Hide
        Unico Hommes added a comment -

        >It sounds like the problem is caused by the fact that node ids are currently randomly distributed. The getAllNodeInfos() loads the bundles sorted by node id (so performance isn't affected by the fact that node ids are randomly distributed), and the regular traversal loads the nodes by traversal (that means the database has to do a lot of random seeks).

        No, that isn't the problem. It is actually the amount of database calls that need to be done by the consistency checker.

        Show
        Unico Hommes added a comment - >It sounds like the problem is caused by the fact that node ids are currently randomly distributed. The getAllNodeInfos() loads the bundles sorted by node id (so performance isn't affected by the fact that node ids are randomly distributed), and the regular traversal loads the nodes by traversal (that means the database has to do a lot of random seeks). No, that isn't the problem. It is actually the amount of database calls that need to be done by the consistency checker.
        Hide
        Thomas Mueller added a comment - - edited

        > No, that isn't the problem.

        Did you run a benchmark using sequential node ids (patch for JCR-2857)?

        > batches of node prop bundles at once

        It sounds like in this case you loaded the node bundles sorted by node id, that means the randomly distributed nodes don't hurt performance here.

        To reduce the number of network round trips (if this is really the problem), another solution would be a new method

        Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids).

        This method could be used to load any number of nodes, and it could be used to traverse a part of the repository (where getAllNodeInfos could only be used to read the repository completely).

        Show
        Thomas Mueller added a comment - - edited > No, that isn't the problem. Did you run a benchmark using sequential node ids (patch for JCR-2857 )? > batches of node prop bundles at once It sounds like in this case you loaded the node bundles sorted by node id, that means the randomly distributed nodes don't hurt performance here. To reduce the number of network round trips (if this is really the problem), another solution would be a new method Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids). This method could be used to load any number of nodes, and it could be used to traverse a part of the repository (where getAllNodeInfos could only be used to read the repository completely).
        Unico Hommes made changes -
        Field Original Value New Value
        Attachment checkerperformance.patch [ 12519946 ]
        Hide
        Unico Hommes added a comment -

        Patch against 2.6 trunk. Includes patches JCR-3267, JCR-3265, JCR-3269 and JCR-3277

        Show
        Unico Hommes added a comment - Patch against 2.6 trunk. Includes patches JCR-3267 , JCR-3265 , JCR-3269 and JCR-3277
        Hide
        Unico Hommes added a comment -

        Hi Thomas, I didn't see your last comment until now. Your suggestion for the following method:

        Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids)

        seems problematic to me. You would end up with a huge statement with possibly thousands of WHERE id=? OR id=? OR id=? clauses.

        Or did you mean something else?

        Show
        Unico Hommes added a comment - Hi Thomas, I didn't see your last comment until now. Your suggestion for the following method: Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids) seems problematic to me. You would end up with a huge statement with possibly thousands of WHERE id=? OR id=? OR id=? clauses. Or did you mean something else?
        Hide
        Thomas Mueller added a comment -

        Hi,

        > You would end up with a huge statement with possibly thousands of WHERE id=? OR id=? OR id=? clauses.

        Slightly shorter is WHERE ID IN(?,?,?,?,...). If the size of the statement is a problem, you could create multiple statements (and fetch 1000 nodes each). Or you could create a temporary table with the node ids (fill this table using a batch update), and then use a join. Or use another database specific optimization for WHERE ID IN(?,?,...) (for example use ID = ANY(ARRAY) in PostgreSQL).

        Show
        Thomas Mueller added a comment - Hi, > You would end up with a huge statement with possibly thousands of WHERE id=? OR id=? OR id=? clauses. Slightly shorter is WHERE ID IN(?,?,?,?,...). If the size of the statement is a problem, you could create multiple statements (and fetch 1000 nodes each). Or you could create a temporary table with the node ids (fill this table using a batch update), and then use a join. Or use another database specific optimization for WHERE ID IN(?,?,...) (for example use ID = ANY(ARRAY) in PostgreSQL).
        Hide
        Unico Hommes added a comment -

        OK. But I think that until we have a concrete use-case for such a method, we should keep it as simple as possible. The current patch simply expands upon the tried and tested getAllNodeIds method of querying the database. WDYT?

        Show
        Unico Hommes added a comment - OK. But I think that until we have a concrete use-case for such a method, we should keep it as simple as possible. The current patch simply expands upon the tried and tested getAllNodeIds method of querying the database. WDYT?
        Hide
        Thomas Mueller added a comment -

        > until we have a concrete use-case for such a method

        Well, this isn't really about not having a concrete use case... You already seem to have a use case. It's just that you proposed to add a new method:

        Map<NodeId, NodeInfo> getAllNodeInfos(NodeId after, int maxCount)

        and I have suggested to use a different method instead:

        Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids)

        But other than that, I can't really comment on the patch as I'm not familiar with the consistency checker implementation. With 69 KB, the patch you provided is quite large.

        Show
        Thomas Mueller added a comment - > until we have a concrete use-case for such a method Well, this isn't really about not having a concrete use case... You already seem to have a use case. It's just that you proposed to add a new method: Map<NodeId, NodeInfo> getAllNodeInfos(NodeId after, int maxCount) and I have suggested to use a different method instead: Map<NodeId, NodeInfo> getNodeInfos(List<NodeId> ids) But other than that, I can't really comment on the patch as I'm not familiar with the consistency checker implementation. With 69 KB, the patch you provided is quite large.
        Hide
        Unico Hommes added a comment -

        For the consistency checker the method you propose would not make a lot of sense IMO. To use it you would have to do something like:

        Collection<NodeId> nodeIds = pm.getAllNodeIds(after, maxcount);
        Map<NodeId, NodeInfo> pm.getNodeInfos(nodeIds);

        Which underneath would result in two database calls to be made:
        SELECT NODE_ID FROM WS_BUNDLE WHERE NODE_ID > $

        {after}

        etc.
        and
        SELECT NODE_ID, BUNDLE_DATA FROM WS_BUNDLE WHERE NODE_ID IN (?,?,? etc.

        At least in the case of the consistency checker I don't think it makes a lot of sense to do it this way.

        The reason the patch is so large is that it includes patches for 4 other issues. It will be some work to create a patch for this issue alone but if you need it I would gladly provide it.

        Show
        Unico Hommes added a comment - For the consistency checker the method you propose would not make a lot of sense IMO. To use it you would have to do something like: Collection<NodeId> nodeIds = pm.getAllNodeIds(after, maxcount); Map<NodeId, NodeInfo> pm.getNodeInfos(nodeIds); Which underneath would result in two database calls to be made: SELECT NODE_ID FROM WS_BUNDLE WHERE NODE_ID > $ {after} etc. and SELECT NODE_ID, BUNDLE_DATA FROM WS_BUNDLE WHERE NODE_ID IN (?,?,? etc. At least in the case of the consistency checker I don't think it makes a lot of sense to do it this way. The reason the patch is so large is that it includes patches for 4 other issues. It will be some work to create a patch for this issue alone but if you need it I would gladly provide it.
        Hide
        Thomas Mueller added a comment -

        > I don't think it makes a lot of sense to do it this way

        Why not? Is it a performance problem? Does it look complicated?

        One reason for getNodeInfos(List<NodeId> ids) is to avoid duplicate code (copy code from getAllNodeIds, which is already quite complex in some implementations).

        Show
        Thomas Mueller added a comment - > I don't think it makes a lot of sense to do it this way Why not? Is it a performance problem? Does it look complicated? One reason for getNodeInfos(List<NodeId> ids) is to avoid duplicate code (copy code from getAllNodeIds, which is already quite complex in some implementations).
        Hide
        Thomas Mueller added a comment -

        It doesn't matter to me actually that much whether you want to implement getNodeInfos(List<NodeId> ids) or getAllNodeInfos(NodeId after, int maxCount), as I'm not the right person to review this patch.

        Show
        Thomas Mueller added a comment - It doesn't matter to me actually that much whether you want to implement getNodeInfos(List<NodeId> ids) or getAllNodeInfos(NodeId after, int maxCount), as I'm not the right person to review this patch.
        Unico Hommes made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 2.6 [ 12319480 ]
        Resolution Fixed [ 1 ]
        Hide
        Jukka Zitting added a comment -

        I adjusted the code slightly in revision 1352849 by adding a basic getAllNodeInfos() implementation in AbstractBundlePersistenceManager so that only those PMs that can implement the logic more efficiently need to override the method.

        Show
        Jukka Zitting added a comment - I adjusted the code slightly in revision 1352849 by adding a basic getAllNodeInfos() implementation in AbstractBundlePersistenceManager so that only those PMs that can implement the logic more efficiently need to override the method.
        Hide
        Unico Hommes added a comment -

        Ah yes, good thinking. Thanks Jukka.

        Show
        Unico Hommes added a comment - Ah yes, good thinking. Thanks Jukka.
        Jukka Zitting made changes -
        Fix Version/s 2.5.1 [ 12321651 ]
        Fix Version/s 2.6 [ 12319480 ]
        Alex Parvulescu made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Unico Hommes
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development