  1. Jackrabbit Oak
  2. OAK-1236

Query: optimize for sling's i18n support

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13
    • Component/s: query
    • Labels: None

    Description

      There are some performance issues with the query used by Sling's internationalization support [0].

      The query for a specific locale looks like the following:

      //element(*,mix:language)[@jcr:language='en']//element(*,sling:Message)[@sling:message]/(@sling:key|@sling:message)
      

      This turns into a join and it looks like it cannot properly leverage the index on the left side to filter out content on the right side of the join.

      I'm going to use a standard CQ setup for the following analysis.

      The left side of the join is quite efficient with a property index:

      //element(*,mix:language)[@jcr:language='en']
      /libs/foundation/components/search/i18n/en
      /libs/foundation/components/mobilefooter/i18n/en
      /libs/commerce/components/search/i18n/en
      /libs/cq/searchpromote/components/pagination/i18n/en
      

      This is a fast query; so far so good.

      The trouble begins when running the right side:

      //element(*,sling:Message)[@sling:message]/(@sling:key|@sling:message)
      

      As far as I can see, the biggest issue here is that the second query doesn't leverage the information from the left side of the join. This hurts the overall query time in two ways:

      • First, it doesn't know that we're only looking for 'en', so the query traverses all the existing translations in all the languages (up to 91k rows). It fetches those 91k rows each time and only filters for English at a later phase.
      • Second, it appears to run the query once for each left-side hit, in our case 4 times, which makes the first issue 4 times worse.
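To make the combined effect of the two points concrete, here is a rough cost sketch in Python. It is a toy model, not Oak code: the function name `right_side_rows_read` and the per-subtree message counts are assumptions made up for illustration; only the four paths and the figure of roughly 91k rows come from the analysis above.

```python
# Toy cost model, not Oak code: counts rows read from the right side of the join.

def right_side_rows_read(left_hits, total_messages, messages_under_hit, pushdown):
    """Return how many right-side rows the join reads.

    left_hits          -- paths returned by the left side of the join
    total_messages     -- all sling:Message rows, in every language
    messages_under_hit -- dict: left-side path -> number of messages below it
    pushdown           -- True if the right side is restricted to each hit's subtree
    """
    if pushdown:
        # Only the messages below each language root are read.
        return sum(messages_under_hit[path] for path in left_hits)
    # Without pushdown the full right side is re-read once per left-side hit.
    return len(left_hits) * total_messages


left_hits = [
    "/libs/foundation/components/search/i18n/en",
    "/libs/foundation/components/mobilefooter/i18n/en",
    "/libs/commerce/components/search/i18n/en",
    "/libs/cq/searchpromote/components/pagination/i18n/en",
]
TOTAL_MESSAGES = 91_000  # approximate figure from the description

# Hypothetical per-subtree counts, chosen only for illustration.
messages_under_hit = {path: 25 for path in left_hits}

print(right_side_rows_read(left_hits, TOTAL_MESSAGES, messages_under_hit, pushdown=False))
print(right_side_rows_read(left_hits, TOTAL_MESSAGES, messages_under_hit, pushdown=True))
```

Under these assumptions the unrestricted plan reads 4 x 91k rows, while a plan that restricts the right side to each hit's subtree reads only the rows that can actually match.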

      [0] http://sling.apache.org/site/internationalization-support.html

          Activity

            thomasm Thomas Mueller added a comment -

            Do you know how many messages there are with language = 'en'?

            The full query is converted to the SQL-2 statement:

            //element(*,mix:language)[@jcr:language='en']
              //element(*,sling:Message)[@sling:message]/(@sling:key|@sling:message)
              
            select b.[jcr:path] as [jcr:path], b.[jcr:score] as [jcr:score], 
              b.[sling:key] as [sling:key], b.[sling:message] as [sling:message] 
            from [mix:language] as a 
            inner join [sling:Message] as b 
            on isdescendantnode(b, a) 
            where a.[jcr:language] = 'en' 
            and b.[sling:message] is not null 
            

            As far as I know, the left hand side (selector a) is using an index, and the right hand side (selector b) is evaluated by traversing all child nodes of the result of a, and then checking if sling:message is not null. The alternative I see would be to use an index on selector b (the index on sling:message), and then traversing all those nodes, and check for each node whether one of the parent nodes has jcr:language = 'en'. But I don't currently see an easy way to somehow use both indexes at the same time.
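The two evaluation strategies described in this comment can be sketched on a toy in-memory tree. This is an illustration in Python under stated assumptions, not how Oak's query engine is implemented: the `repo` dictionary, its paths, and both function names are hypothetical, the index lookups are simulated by scanning the dictionary, and language roots are assumed not to be nested inside each other.

```python
# Toy in-memory repository: path -> properties. Not Oak code.
repo = {
    "/libs/a/i18n/en": {"jcr:language": "en"},
    "/libs/a/i18n/en/hello": {"sling:message": "Hello"},
    "/libs/a/i18n/en/bye": {"sling:message": "Bye"},
    "/libs/a/i18n/de": {"jcr:language": "de"},
    "/libs/a/i18n/de/hello": {"sling:message": "Hallo"},
}


def is_descendant(path, ancestor):
    """True if path lies strictly below ancestor."""
    return path.startswith(ancestor + "/")


def plan_left_index_then_traverse(repo, lang):
    """Look up selector a via jcr:language, then walk each hit's subtree."""
    roots = [p for p, props in repo.items() if props.get("jcr:language") == lang]
    return sorted(
        p
        for root in roots
        for p, props in repo.items()
        if is_descendant(p, root) and "sling:message" in props
    )


def plan_right_index_then_check_ancestors(repo, lang):
    """Look up selector b via sling:message, then check the ancestors of each hit."""
    messages = [p for p, props in repo.items() if "sling:message" in props]
    result = []
    for path in messages:
        parent = path.rsplit("/", 1)[0]
        while parent:  # stops once the root ("") is reached
            if repo.get(parent, {}).get("jcr:language") == lang:
                result.append(path)
                break
            parent = parent.rsplit("/", 1)[0]
    return sorted(result)


print(plan_left_index_then_traverse(repo, "en"))
print(plan_right_index_then_check_ancestors(repo, "en"))
```

Both plans return the same two paths here; they differ only in which side drives the lookup and therefore in how many nodes each one has to visit.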

            jukkaz Jukka Zitting added a comment -

            The join engine in Jackrabbit 2.x would handle the query by first executing the left side of the join:

            SELECT a.[jcr:path] FROM [mix:language] AS a  WHERE a.[jcr:language] = 'en'
            

            So far it's equivalent to what Oak does. But the right side is then handled more efficiently, by using the left-side results to rewrite it to:

            SELECT b.[jcr:path] FROM [sling:Message] AS b 
            WHERE b.[sling:message] IS NOT NULL AND
                (ISDESCENDANTNODE(b, '/libs/foundation/components/search/i18n/en') OR
                 ISDESCENDANTNODE(b, '/libs/foundation/components/mobilefooter/i18n/en') OR
                 ISDESCENDANTNODE(b, '/libs/commerce/components/search/i18n/en') OR
                 ISDESCENDANTNODE(b, '/libs/cq/searchpromote/components/pagination/i18n/en'))
            

            Finally the results of the two sides are merged back together. I would suggest that we do something similar also in Oak.
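The rewrite step described above can be sketched as a small string-building helper. This is a Python illustration only, not the Jackrabbit 2.x implementation: the function name `rewrite_right_side` is hypothetical, and doubling single quotes inside the path literal is an assumption about how the literal should be escaped.

```python
def rewrite_right_side(left_paths):
    """Build the right-side SQL-2 statement restricted to the left-side hits.

    Returns None when the left side is empty, since an inner join
    then has no rows and the right side need not run at all.
    """
    if not left_paths:
        return None

    def quote(path):
        # Assumption: single quotes in a string literal are escaped by doubling.
        return "'" + path.replace("'", "''") + "'"

    conditions = " OR\n     ".join(
        f"ISDESCENDANTNODE(b, {quote(p)})" for p in left_paths
    )
    return (
        "SELECT b.[jcr:path] FROM [sling:Message] AS b\n"
        "WHERE b.[sling:message] IS NOT NULL AND\n"
        f"    ({conditions})"
    )


left_paths = [
    "/libs/foundation/components/search/i18n/en",
    "/libs/foundation/components/mobilefooter/i18n/en",
]
print(rewrite_right_side(left_paths))
```

With many left-side hits the generated OR list grows accordingly, so a real implementation would likely need to batch the paths; that concern is outside this sketch.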


            tripod Tobias Bocanegra added a comment -

            would it be faster, to just search for all language roots and then traverse the subtree instead of querying it?

            thomasm Thomas Mueller added a comment -

            I wonder what would happen if there is no index on the mixin type sling:Message? Wouldn't that make the query fast?

            stillalex Alex Deparvu added a comment -

            Funnily enough, I think the following two statements have the same effect:

            would it be faster, to just search for all language roots and then traverse the subtree instead of querying it?

            and

            I wonder what would happen if there is no index on the mixin type sling:Message? Wouldn't that make the query fast?

            I've tested this (and fixed OAK-1269 in the process) and it looks like it would solve this issue: removing the node type index for sling:Message causes a traversal, which has minimal impact compared to the original issue.

            More broadly, I agree with Jukka that we should look into applying an optimization similar to the one in Jackrabbit: buffer the left-side results and push the intermediate values to the right side of the join as a filter. This could be tracked in a dedicated issue.

            This issue is now a matter of index configuration, which is outside the indexing code, so I will mark it as resolved soon if nobody objects.

            stillalex Alex Deparvu added a comment -

            bulk close for the 0.13 release


            People

              stillalex Alex Deparvu
              stillalex Alex Deparvu