Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: jackrabbit-core
    • Labels:
      None

      Description

      The current best practice with Jackrabbit is to avoid flat content structures due to performance concerns.

      These concerns are caused by the fact that the NodeState implementation requires the list of child node names and identifiers to be available at all times. In fact many (all?) current persistence managers implement this requirement by storing and loading this list as a part of the serialized node state. When this list grows, the performance and memory overhead of managing the list grows as well. As a side note, this also creates potential consistency issues since the parent/child links are stored both within the child list of the parent node and as the parent link of the child node.

      To solve this issue, I believe we need to break the tight bonding between the node state and the list of child nodes. This will likely require major refactoring of the Jackrabbit core, including breaking the NodeState and PersistenceManager interfaces, so I don't expect a solution in near future. However, we should start thinking about how to best do this, and at least be concerned about building in any more assumptions about the list of child nodes always being readily available.

        Activity

        Hide
        Stefan Guggisberg added a comment -

        > Is this issue still under consideration?

        yes

        > It's a big issue for our organization, and we are considering whether we need to build a workaround for it or wait for resolution in Jackrabbit itself.

        support for flat hierarchies is IMO one of the design goals of a complete redesign/rewrite of the jackrabbit core (i.e. version 3.0).
        so i wouldn't hold my breath for jackrabbit supporting it within the next couple of months...

        Show
        Stefan Guggisberg added a comment - > Is this issue still under consideration? yes > It's a big issue for our organization, and we are considering whether we need to build a workaround for it or wait for resolution in Jackrabbit itself. support for flat hierarchies is IMO one of the design goals of a complete redesign/rewrite of the jackrabbit core (i.e. version 3.0). so i wouldn't hold my breath for jackrabbit supporting it within the next couple of months...
        Hide
        Jeff Yemin added a comment -

        Is this issue still under consideration? It's a big issue for our organization, and we are considering whether we need to build a workaround for it or wait for resolution in Jackrabbit itself.

        Show
        Jeff Yemin added a comment - Is this issue still under consideration? It's a big issue for our organization, and we are considering whether we need to build a workaround for it or wait for resolution in Jackrabbit itself.
        Hide
        Stefan Guggisberg added a comment -

        @lars:

        i am less enthousiastic about thomas' proposal. please keep in mind that a lot of code all over jackrabbit's core would probably need to be touched, with a high risk of introducing new bugs. personally i doubt that the 'feature' is worth taking this risk...

        if we decide that very flat hierarchies is a use case we want to support, then this should be taken into account when redesigning jackrabbit's core from scratch.

        Show
        Stefan Guggisberg added a comment - @lars: i am less enthousiastic about thomas' proposal. please keep in mind that a lot of code all over jackrabbit's core would probably need to be touched, with a high risk of introducing new bugs. personally i doubt that the 'feature' is worth taking this risk... if we decide that very flat hierarchies is a use case we want to support, then this should be taken into account when redesigning jackrabbit's core from scratch.
        Hide
        Lars Trieloff added a comment -

        > There would be no change to the persistence manager!
        > ...unfortunately, quite a lot of effort elsewhere in the core.

        I think that's the best news I have heard on this issue in a long time. We do not have to

        • break APIs
        • break existing third-party code
        • break existing persisted repositories

        Once we fix it, it is only an incremental change for existing users with long lists - their B-Tree that they did not know about will be re-balanced with the next write and
        that's it.

        Show
        Lars Trieloff added a comment - > There would be no change to the persistence manager! > ...unfortunately, quite a lot of effort elsewhere in the core. I think that's the best news I have heard on this issue in a long time. We do not have to break APIs break existing third-party code break existing persisted repositories Once we fix it, it is only an incremental change for existing users with long lists - their B-Tree that they did not know about will be re-balanced with the next write and that's it.
        Hide
        Thomas Mueller added a comment -

        > "hidden" nodes would mean only small effort is required to change existing persistence managers?

        There would be no change to the persistence manager!
        ...unfortunately, quite a lot of effort elsewhere in the core.

        > no performance impact for deep and thin trees, right?

        The performance impact would be minimal (probably not measurable).

        Show
        Thomas Mueller added a comment - > "hidden" nodes would mean only small effort is required to change existing persistence managers? There would be no change to the persistence manager! ...unfortunately, quite a lot of effort elsewhere in the core. > no performance impact for deep and thin trees, right? The performance impact would be minimal (probably not measurable).
        Hide
        Lars Trieloff added a comment -

        Thomas, do I see it right that introducing "hidden" nodes would mean only small effort is required to change existing persistence managers? And if we only introduce
        hidden nodes once we have more child nodes than we would put into a node in the B-Tree, there would be no performance impact for deep and thin trees, right?

        Show
        Lars Trieloff added a comment - Thomas, do I see it right that introducing "hidden" nodes would mean only small effort is required to change existing persistence managers? And if we only introduce hidden nodes once we have more child nodes than we would put into a node in the B-Tree, there would be no performance impact for deep and thin trees, right?
        Hide
        Bart van der Schans added a comment -

        @Thomas, Stefan,

        Thanks for your answerers! Obviously I need to give it a lot more thought...

        Bart.

        Show
        Bart van der Schans added a comment - @Thomas, Stefan, Thanks for your answerers! Obviously I need to give it a lot more thought... Bart.
        Hide
        Stefan Guggisberg added a comment -

        @bart;

        > Would it be an idea to store the hierarchical information in a separate table with only (node_id, parent_id)?

        yes, that would be the obvious approach for db-based pm's.

        some random thoughts:

        • you'd need a 3rd column ('name')
        • you'd need a 4rd column (e.g. 'pos', in order to support ordered child nodes)
        • read performance would suffer, especially for remote db's (2+ instead of 1 db access)
        • write performance would suffer (updating/reorganizing indices of very large tables
          can be very expensive)
        • same-name sibling support would be a challenge (both implementation and performance-wise)
        • Node.hasNode() e.g. would be significantly slower
        Show
        Stefan Guggisberg added a comment - @bart; > Would it be an idea to store the hierarchical information in a separate table with only (node_id, parent_id)? yes, that would be the obvious approach for db-based pm's. some random thoughts: you'd need a 3rd column ('name') you'd need a 4rd column (e.g. 'pos', in order to support ordered child nodes) read performance would suffer, especially for remote db's (2+ instead of 1 db access) write performance would suffer (updating/reorganizing indices of very large tables can be very expensive) same-name sibling support would be a challenge (both implementation and performance-wise) Node.hasNode() e.g. would be significantly slower
        Hide
        Thomas Mueller added a comment -

        > hierarchical information in a separate table
        That's possible, but it would be quite a big change. Also, most likely it would be slower and would need more space.

        Show
        Thomas Mueller added a comment - > hierarchical information in a separate table That's possible, but it would be quite a big change. Also, most likely it would be slower and would need more space.
        Hide
        Bart van der Schans added a comment -

        Would it be an idea to store the hierarchical information in a separate table with only (node_id, parent_id)? That way if you would add or remove a node to/from a large child node list no serialization of the complete child node list would be needed at the cost of doing one simple extra query. I would also make it easier to lazy initialize the child node list, because it wouldn't be part of the bundle de-serialization when loading the bundles from the database.

        Show
        Bart van der Schans added a comment - Would it be an idea to store the hierarchical information in a separate table with only (node_id, parent_id)? That way if you would add or remove a node to/from a large child node list no serialization of the complete child node list would be needed at the cost of doing one simple extra query. I would also make it easier to lazy initialize the child node list, because it wouldn't be part of the bundle de-serialization when loading the bundles from the database.
        Hide
        Thomas Mueller added a comment -

        I would (try to) use the b-tree data structure: http://en.wikipedia.org/wiki/B-tree

        I would keep the internal implementation (PersistenceManagerAPI, NodeState and so on),
        but add a property 'hidden' to node (maybe using an empty name). Hidden nodes
        would have no properties, only child nodes. When adding a child to a hidden
        node X, the child would be added to a child of X, and the b-tree would be
        re-balanced. The same when removing a child from a hidden node. Other than that,
        user-facing algorithms would need to be changed, specially getPath, and
        NodeIterator. Basically hidden nodes would never be returned to the application and
        not be directly accessible using the JCR API.

        Show
        Thomas Mueller added a comment - I would (try to) use the b-tree data structure: http://en.wikipedia.org/wiki/B-tree I would keep the internal implementation (PersistenceManagerAPI, NodeState and so on), but add a property 'hidden' to node (maybe using an empty name). Hidden nodes would have no properties, only child nodes. When adding a child to a hidden node X, the child would be added to a child of X, and the b-tree would be re-balanced. The same when removing a child from a hidden node. Other than that, user-facing algorithms would need to be changed, specially getPath, and NodeIterator. Basically hidden nodes would never be returned to the application and not be directly accessible using the JCR API.
        Hide
        Jukka Zitting added a comment -

        > Do you think it is necessary to break the NodeState or PersistenceManager interfaces?

        Yes. NodeState is currently a class that's pretty difficult to subclass, and the PersistenceManager interface refers to it directly. If we change that (make NodeState an interface), we can just as well adjust the interfaces to better support the needs of this issue.

        Show
        Jukka Zitting added a comment - > Do you think it is necessary to break the NodeState or PersistenceManager interfaces? Yes. NodeState is currently a class that's pretty difficult to subclass, and the PersistenceManager interface refers to it directly. If we change that (make NodeState an interface), we can just as well adjust the interfaces to better support the needs of this issue.
        Hide
        Stefan Guggisberg added a comment -

        @lars:

        i am rather sceptic that your proposal can be implemented with the current core architecture
        and without sacrificing existing functionality and without seriously impacting performance.
        please keep in mind that child node entries e.g. need to support same-name sibling semantics.

        however, feel free to provide a patch and proof me wrong! i'd be happy to admit having been too pessimistic

        Show
        Stefan Guggisberg added a comment - @lars: i am rather sceptic that your proposal can be implemented with the current core architecture and without sacrificing existing functionality and without seriously impacting performance. please keep in mind that child node entries e.g. need to support same-name sibling semantics. however, feel free to provide a patch and proof me wrong! i'd be happy to admit having been too pessimistic
        Hide
        Lars Trieloff added a comment -

        >> I think that limiting children to 100k nodes is artificial.

        I did not say that - because I know that this limitation is organic.

        > The good thing about that limitation is that it forces you to think about a good content model, ie. one that is also browsable by a human.
        > Nevertheless in some cases it might be good if Jackrabbit would scale better with flat hierarchies.

        Yes, especially if there are aspects about your content model that are outside your control and introducing a fake hierarchy only makes things more complicated at an application level.

        As far as I can see, the other source of NodeStates is in the ItemStateManager, which is being created by the RepositoryImpl with a reference to the PersistenceManager, so that there are no API breaks necessary.

        Making a lazy list writable is certainly hard, but not impossible. In the end following things can happen:

        • a node is being removed - this node has to be fetched before it can be removed
        • a node is being added at the end of the list - easy
        • a node is being added relative to another node - this other node has to be fetched beforehand
        Show
        Lars Trieloff added a comment - >> I think that limiting children to 100k nodes is artificial. I did not say that - because I know that this limitation is organic. > The good thing about that limitation is that it forces you to think about a good content model, ie. one that is also browsable by a human. > Nevertheless in some cases it might be good if Jackrabbit would scale better with flat hierarchies. Yes, especially if there are aspects about your content model that are outside your control and introducing a fake hierarchy only makes things more complicated at an application level. As far as I can see, the other source of NodeStates is in the ItemStateManager, which is being created by the RepositoryImpl with a reference to the PersistenceManager, so that there are no API breaks necessary. Making a lazy list writable is certainly hard, but not impossible. In the end following things can happen: a node is being removed - this node has to be fetched before it can be removed a node is being added at the end of the list - easy a node is being added relative to another node - this other node has to be fetched beforehand
        Hide
        Alexander Klimetschek added a comment -

        > I think that limiting children to 100k nodes is artificial.

        The good thing about that limitation is that it forces you to think about a good content model, ie. one that is also browsable by a human. Nevertheless in some cases it might be good if Jackrabbit would scale better with flat hierarchies.

        > If a PersistenceManager would return a LazyNodeState that would itself contain a LazyChildNodeEntries

        This has to work in both directions: if the long list of child node entries is created in the first place, this happens in the transient space before the PM gets that node in order to store it. Already at that early point things can get slow and memory-consuming (although I have no measurements to present). Also, the implementation of the child node entries mus therefore be working in both directions, ie. a LazyChildNodeEntries that fetches the entries step by step is probably hard to make writeable. Last but not least it also depends on how the list is actually used inside Jackrabbit.

        Show
        Alexander Klimetschek added a comment - > I think that limiting children to 100k nodes is artificial. The good thing about that limitation is that it forces you to think about a good content model, ie. one that is also browsable by a human. Nevertheless in some cases it might be good if Jackrabbit would scale better with flat hierarchies. > If a PersistenceManager would return a LazyNodeState that would itself contain a LazyChildNodeEntries This has to work in both directions: if the long list of child node entries is created in the first place, this happens in the transient space before the PM gets that node in order to store it. Already at that early point things can get slow and memory-consuming (although I have no measurements to present). Also, the implementation of the child node entries mus therefore be working in both directions, ie. a LazyChildNodeEntries that fetches the entries step by step is probably hard to make writeable. Last but not least it also depends on how the list is actually used inside Jackrabbit.
        Hide
        Lars Trieloff added a comment -

        Do you think it is necessary to break the NodeState or PersistenceManager interfaces? If a PersistenceManager would return a LazyNodeState that would itself contain a LazyChildNodeEntries, which contains a reference to the PersistenceManager, which would allow the LazyChildNodeEntries to retrieve the list of child nodes in batches of
        thousands, then the memory consumption of the ChildNodeEntries object can be controlled.

        This would not provide a generic solution for the problem, but a solution that works with certain persistence managers.

        Show
        Lars Trieloff added a comment - Do you think it is necessary to break the NodeState or PersistenceManager interfaces? If a PersistenceManager would return a LazyNodeState that would itself contain a LazyChildNodeEntries, which contains a reference to the PersistenceManager, which would allow the LazyChildNodeEntries to retrieve the list of child nodes in batches of thousands, then the memory consumption of the ChildNodeEntries object can be controlled. This would not provide a generic solution for the problem, but a solution that works with certain persistence managers.
        Hide
        Alan Cabrera added a comment -

        I think that limiting children to 100k nodes is artificial.

        Show
        Alan Cabrera added a comment - I think that limiting children to 100k nodes is artificial.

          People

          • Assignee:
            Unassigned
            Reporter:
            Jukka Zitting
          • Votes:
            7 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development