Jackrabbit Oak
  1. Jackrabbit Oak
  2. OAK-890

Query: advanced fulltext search conditions

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15
    • Component/s: query
    • Labels:
      None

      Description

      Currently, the query engine does not use a fulltext index if there are multiple fulltext conditions combined with "or". Also, the QueryIndex interface does not support boosts, and does not support fulltext conditions on properties (just on nodes) - Filter.getFulltextConditions is a collection of strings, combined with "and", but does not contain the information whether a condition is on a property or on all properties. Also, the popular sorting by score (specially descending) is not currently supported.

      Marcel Reutegger and me discussed how we could support those features (including boost) in a way that is backward compatible with Jackrabbit 2.x, but without adding a lot of complexity. Example Jackrabbit 2.x query:

      /jcr:root/content//*[(@jcr:primaryType='page' 
        and (jcr:contains(jcr:content/@tags, 'it:blue') 
        or jcr:contains(jcr:content/@tags, '/tags/it/blue')))]
      
      /jcr:root/content//element(*, nt:hierarchyNode)[
        (jcr:contains(jcr:content, 'SomeTextToSearch') 
        or jcr:contains(jcr:content/@jcr:title, 'SomeTextToSearch') 
        or jcr:contains(jcr:content/@jcr:description, 'SomeTextToSearch'))]
        /rep:excerpt(.) order by @jcr:score descending 
      

      A possible solution is to extend the internal fulltext syntax to support those features. The internal fulltext syntax is the one used by
      Filter.getFulltextCondition (not the one used within the original XPath, SQL, or SQL-2 query). The proposed syntax (work in progress, just a rough draft so far) is:

      FullTextSearch ::= Or
        ['order by score' [' desc']]
      Or ::= And {' OR ' And}* 
      And ::= Term {' ' Term}*
      Term ::= '(' Or ')' | ['-'] SimpleTerm
      SimpleTerm ::= [Property ':'] '"' Word {' ' Word}* '"' ['^' Boost]
      Property ::= <property name>
      Boost ::= <number>
      

      The idea is that the syntax matches the syntax used by Lucene (except for the 'order by' part), so that the Lucene and Solr index implementations should get simpler (only need minimal parsing, possibly just the 'order by' part). Search terms (phrases, words) are always within double quotes. That means, the above queries would result in the following condition:

      jcr:content/tags:"it:blue" 
      OR jcr:content/tags:"/tags/it/blue"
      
      jcr:content/*:"SomeTextToSearch" 
      OR jcr:content/jcr:title:"SomeTextToSearch"
      OR jcr:content/jcr:description:"SomeTextToSearch"
      order by score desc
      

      It would also allow to switch back from

      Collection<String> getFulltextConditions()
      

      to

      String getFulltextCondition()
      

        Issue Links

          Activity

          Hide
          Thomas Mueller added a comment -

          An open question is, how can we best support relative properties in Lucene such as in the example above. How does Jackrabbit 2.x do that?

          Show
          Thomas Mueller added a comment - An open question is, how can we best support relative properties in Lucene such as in the example above. How does Jackrabbit 2.x do that?
          Hide
          Marcel Reutegger added a comment -

          > How does Jackrabbit 2.x do that?

          It evaluates each contains separately for the given property and then resolves the relative path backwards.

          Show
          Marcel Reutegger added a comment - > How does Jackrabbit 2.x do that? It evaluates each contains separately for the given property and then resolves the relative path backwards.
          Hide
          Thomas Mueller added a comment -

          OK, so there is some mechanism in Jackrabbit 2.x that plugs into Lucene to support relative properties... I guess this is something we want to avoid for Oak, as it wouldn't be easy to support that for Solr (or other fulltext search engines). For the example above, the SQL-2 query could be re-written as an outer join, from:

          select [jcr:path], [jcr:score], [rep:excerpt] 
          from [nt:hierarchyNode] as a 
          where (contains([jcr:content/*], 'SomeTextToSearch') 
          or contains([jcr:content/jcr:title], 'SomeTextToSearch') 
          or contains([jcr:content/jcr:description], 'SomeTextToSearch')) 
          and isdescendantnode(a, '/content') 
          order by [jcr:score] desc
          

          to:

          select a.[jcr:path], a.[jcr:score], a.[rep:excerpt] 
          from [nt:hierarchyNode] as a 
          left outer join [nt:base] as b on ischildnode(b, a)
          where (contains(b.*, 'SomeTextToSearch')
          or contains(b.[jcr:title], 'SomeTextToSearch')
          or contains(b.[jcr:description], 'SomeTextToSearch'))
          and isdescendantnode(a, '/content') 
          and name(b) = 'jcr:content'
          order by a.[jcr:score] desc
          

          But using outer joins is problematic; it wouldn't work any longer if there is another "or" constraint (for example "or a.x = 1". The problem is the "and name(b) = 'jcr:content'" that should be moved to the "on" condition but can't currently due to the limited SQL-2 syntax (I believe the condition "or name(b) is null" is not supported either). So the query would potentially return the wrong data.

          It's probably easier if a wrapper index of the Lucene/Solr index (similar to the NodeTypeIndex) could support relative properties. That way each component itself (the query engine, and the fulltext index) could stay relatively simple.

          Show
          Thomas Mueller added a comment - OK, so there is some mechanism in Jackrabbit 2.x that plugs into Lucene to support relative properties... I guess this is something we want to avoid for Oak, as it wouldn't be easy to support that for Solr (or other fulltext search engines). For the example above, the SQL-2 query could be re-written as an outer join, from: select [jcr:path], [jcr:score], [rep:excerpt] from [nt:hierarchyNode] as a where (contains([jcr:content/*], 'SomeTextToSearch') or contains([jcr:content/jcr:title], 'SomeTextToSearch') or contains([jcr:content/jcr:description], 'SomeTextToSearch')) and isdescendantnode(a, '/content') order by [jcr:score] desc to: select a.[jcr:path], a.[jcr:score], a.[rep:excerpt] from [nt:hierarchyNode] as a left outer join [nt:base] as b on ischildnode(b, a) where (contains(b.*, 'SomeTextToSearch') or contains(b.[jcr:title], 'SomeTextToSearch') or contains(b.[jcr:description], 'SomeTextToSearch')) and isdescendantnode(a, '/content') and name(b) = 'jcr:content' order by a.[jcr:score] desc But using outer joins is problematic; it wouldn't work any longer if there is another "or" constraint (for example "or a.x = 1". The problem is the "and name(b) = 'jcr:content'" that should be moved to the "on" condition but can't currently due to the limited SQL-2 syntax (I believe the condition "or name(b) is null" is not supported either). So the query would potentially return the wrong data. It's probably easier if a wrapper index of the Lucene/Solr index (similar to the NodeTypeIndex) could support relative properties. That way each component itself (the query engine, and the fulltext index) could stay relatively simple.
          Hide
          Jukka Zitting added a comment -

          Ideally it should IMO be possible to execute the first example query against a custom index configured as follows:

          • include all nodes under /content
          • include the jcr:primaryType property
          • include the tags property of a jcr:content child, analyzed for full text searching

          (or any superset of such a configuration). Similarly for the second example query. Such custom indices should be able to massively outperform Jackrabbit 2.x at least in some cases.

          As far as I see it, solutions that rely on mapping such constraints to an extended full text syntax or on rewriting the query to a join would end up making it harder for an index like the one described above to figure out whether it can be used to evaluate the query. Thus instead of such preprocessed data, I'd just pass the full abstract query tree to the index implementations for evaluation. The implementations can still opt to apply such transformations if they're helpful (or necessary to make the query evaluable), but it should also be possible for them to use the original query.

          Features like boosts, etc. could be implemented by extending the query syntax and associated abstract syntax tree.

          Show
          Jukka Zitting added a comment - Ideally it should IMO be possible to execute the first example query against a custom index configured as follows: include all nodes under /content include the jcr:primaryType property include the tags property of a jcr:content child, analyzed for full text searching (or any superset of such a configuration). Similarly for the second example query. Such custom indices should be able to massively outperform Jackrabbit 2.x at least in some cases. As far as I see it, solutions that rely on mapping such constraints to an extended full text syntax or on rewriting the query to a join would end up making it harder for an index like the one described above to figure out whether it can be used to evaluate the query. Thus instead of such preprocessed data, I'd just pass the full abstract query tree to the index implementations for evaluation. The implementations can still opt to apply such transformations if they're helpful (or necessary to make the query evaluable), but it should also be possible for them to use the original query. Features like boosts, etc. could be implemented by extending the query syntax and associated abstract syntax tree.
          Hide
          Thomas Mueller added a comment -

          Hi Jukka,

          Yes, a specialized index would help a lot. I'm not against doing that (in the contrary) but my issue is a bit different, it is about features we had in Jackrabbit 2.x that are not supported right now in Oak (and are hard to support), such as "contains(..) or contains(..)".

          > rewriting the query to a join

          I think automatic rewriting within the query engine (for this case) would be hard to achieve, so I wouldn't try to do that. (Manual rewriting within the application might make sense in some cases; but on the other hand we want to be compatible with Jackrabbit 2.x as much as possible, so it's also not my preferred option).

          > extended full text syntax

          Well it's not so much about extending the fulltext syntax as it's about provide a way to let the index know the whole fulltext condition. Currently, for some cases, such as "contains(..) or contains(..)" as above, currently the query index filter (FilterImpl) doesn't contain either condition - it can't really, due to the way the Filter interface is specified. That's the main problem.

          > I'd just pass the full abstract query tree to the index implementations for evaluation

          That's an option. We could do it in a way such that the index implementation doesn't have to understand it, but if it does understand the AST, then it could use it. Specially for the fulltext condition it's an option; that way we might avoid having to parse the condition multiple times (avoid needing multiple parsers and avoid the small runtime overhead). Specially as soon as there is "fulltext index wrapper" similar to the NodeTypeIndex.

          I'm actually working on such an addition, right now just for fulltext conditions, to the Filter interface.

          > Features like boosts, etc. could be implemented by extending the query syntax and associated abstract syntax tree.

          Yes, I'm working on that.

          Show
          Thomas Mueller added a comment - Hi Jukka, Yes, a specialized index would help a lot. I'm not against doing that (in the contrary) but my issue is a bit different, it is about features we had in Jackrabbit 2.x that are not supported right now in Oak (and are hard to support), such as "contains(..) or contains(..)". > rewriting the query to a join I think automatic rewriting within the query engine (for this case) would be hard to achieve, so I wouldn't try to do that. (Manual rewriting within the application might make sense in some cases; but on the other hand we want to be compatible with Jackrabbit 2.x as much as possible, so it's also not my preferred option). > extended full text syntax Well it's not so much about extending the fulltext syntax as it's about provide a way to let the index know the whole fulltext condition. Currently, for some cases, such as "contains(..) or contains(..)" as above, currently the query index filter (FilterImpl) doesn't contain either condition - it can't really, due to the way the Filter interface is specified. That's the main problem. > I'd just pass the full abstract query tree to the index implementations for evaluation That's an option. We could do it in a way such that the index implementation doesn't have to understand it, but if it does understand the AST, then it could use it. Specially for the fulltext condition it's an option; that way we might avoid having to parse the condition multiple times (avoid needing multiple parsers and avoid the small runtime overhead). Specially as soon as there is "fulltext index wrapper" similar to the NodeTypeIndex. I'm actually working on such an addition, right now just for fulltext conditions, to the Filter interface. > Features like boosts, etc. could be implemented by extending the query syntax and associated abstract syntax tree. Yes, I'm working on that.
          Hide
          Jukka Zitting added a comment -

          specialized index

          Yep, I'm not suggesting that we implement something like that specifically for this issue, just suggesting that whatever solution we come up with here should be compatible with such an index (i.e. at least not make it harder to implement something like that).

          a way to let the index know the whole fulltext condition

          Right. In general I'd opt to pass as much information to the index as possible, ideally the full abstract syntax tree of the query. An index can always opt to ignore query parts it doesn't support or understand (assuming such parts only restrict the set of potentially matching results), but going in the other direction (trying to use information that's not provided) is impossible.

          Show
          Jukka Zitting added a comment - specialized index Yep, I'm not suggesting that we implement something like that specifically for this issue, just suggesting that whatever solution we come up with here should be compatible with such an index (i.e. at least not make it harder to implement something like that). a way to let the index know the whole fulltext condition Right. In general I'd opt to pass as much information to the index as possible, ideally the full abstract syntax tree of the query. An index can always opt to ignore query parts it doesn't support or understand (assuming such parts only restrict the set of potentially matching results), but going in the other direction (trying to use information that's not provided) is impossible.
          Hide
          Thomas Mueller added a comment -

          > ideally the full abstract syntax tree of the query

          There are a few problems, for example the syntax tree can contain entries for other selectors, and an index is used for each selector. More importantly, the full AST API is quite complex and not stable; it changes whenever we want to add a new feature or even just new syntax. If we have multiple index implementations relying on the AST, then the whole system would be quite brittle and not modular. That's why I would try to avoid it if possible. But it might not always be possible to avoid it.

          But fulltext index conditions are slightly different. The syntax should be relatively stable, and providing the AST (for the fulltext conditions) would have the benefit of only requiring one parser. For example, I'm not sure if the parser in the LuceneIndex really works correctly around "OR" conditions and escaping. I guess nobody would be unhappy if we don't need it at all

          Show
          Thomas Mueller added a comment - > ideally the full abstract syntax tree of the query There are a few problems, for example the syntax tree can contain entries for other selectors, and an index is used for each selector. More importantly, the full AST API is quite complex and not stable; it changes whenever we want to add a new feature or even just new syntax. If we have multiple index implementations relying on the AST, then the whole system would be quite brittle and not modular. That's why I would try to avoid it if possible. But it might not always be possible to avoid it. But fulltext index conditions are slightly different. The syntax should be relatively stable, and providing the AST (for the fulltext conditions) would have the benefit of only requiring one parser. For example, I'm not sure if the parser in the LuceneIndex really works correctly around "OR" conditions and escaping. I guess nobody would be unhappy if we don't need it at all
          Hide
          Thomas Mueller added a comment -

          Revision 1504114: As a temporary workaround for "contains(x, 'y') or contains(, 'y')", now such conditions are converted to just "contains(, 'y')". This means two Jackrabbit core tests are now failing, but it's not really a regression as they failed before when using the Lucene index (the Lucene index isn't used for those tests yet by default).

          Show
          Thomas Mueller added a comment - Revision 1504114: As a temporary workaround for "contains(x, 'y') or contains( , 'y')", now such conditions are converted to just "contains( , 'y')". This means two Jackrabbit core tests are now failing, but it's not really a regression as they failed before when using the Lucene index (the Lucene index isn't used for those tests yet by default).
          Hide
          Thomas Mueller added a comment -

          Revision 1505940, 1505952: the feature is implemented, but not yet enabled. A few parts are still missing, for example LuceneIndex support for multiple distinct relative properties ("conains(a/x, 'hello') or contains(b/x, 'hello').

          Revision 1505953: the feature is now enabled for testing.

          Show
          Thomas Mueller added a comment - Revision 1505940, 1505952: the feature is implemented, but not yet enabled. A few parts are still missing, for example LuceneIndex support for multiple distinct relative properties ("conains(a/x, 'hello') or contains(b/x, 'hello'). Revision 1505953: the feature is now enabled for testing.
          Hide
          Thomas Mueller added a comment -

          The Travis build with the feature enabled was successful, so if nobody objects I will remove the old code soon.

          Show
          Thomas Mueller added a comment - The Travis build with the feature enabled was successful, so if nobody objects I will remove the old code soon.
          Hide
          Alex Parvulescu added a comment -

          I currently get an OOME whenever there is a fulltext query that has a lot of results [0]

          I think we should revisit the need for the FullTextSearchImpl to recheck the returned results of a fulltext index and revert to the previous way things worked [1].

          [0]

          java.lang.OutOfMemoryError: Java heap space
          at java.util.Arrays.copyOf(Arrays.java:2367)
          at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
          at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
          at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
          at java.lang.StringBuilder.append(StringBuilder.java:214)
          at org.apache.jackrabbit.oak.query.ast.FullTextSearchImpl.appendString(FullTextSearchImpl.java:231)
          at org.apache.jackrabbit.oak.query.ast.FullTextSearchImpl.evaluate(FullTextSearchImpl.java:218)
          at org.apache.jackrabbit.oak.query.ast.SelectorImpl.next(SelectorImpl.java:244)
          at org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.fetchNext(QueryImpl.java:447)
          at org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.hasNext(QueryImpl.java:467)
          

          [1] http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/query/ast/FullTextSearchImpl.java?view=markup#l179

          Show
          Alex Parvulescu added a comment - I currently get an OOME whenever there is a fulltext query that has a lot of results [0] I think we should revisit the need for the FullTextSearchImpl to recheck the returned results of a fulltext index and revert to the previous way things worked [1] . [0] java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587) at java.lang.StringBuilder.append(StringBuilder.java:214) at org.apache.jackrabbit.oak.query.ast.FullTextSearchImpl.appendString(FullTextSearchImpl.java:231) at org.apache.jackrabbit.oak.query.ast.FullTextSearchImpl.evaluate(FullTextSearchImpl.java:218) at org.apache.jackrabbit.oak.query.ast.SelectorImpl.next(SelectorImpl.java:244) at org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.fetchNext(QueryImpl.java:447) at org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.hasNext(QueryImpl.java:467) [1] http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/query/ast/FullTextSearchImpl.java?view=markup#l179
          Hide
          Alex Parvulescu added a comment -

          I've created OAK-955 for what I think it's still a small piece on info missing from the filter: fulltext constraints on joins.

          Show
          Alex Parvulescu added a comment - I've created OAK-955 for what I think it's still a small piece on info missing from the filter: fulltext constraints on joins.
          Hide
          Thomas Mueller added a comment -

          About the OOME: this is now fixed: FullTextSearchImpl.evaluate (again) doesn't check the condition if a FulltextQueryIndex is used.

          Show
          Thomas Mueller added a comment - About the OOME: this is now fixed: FullTextSearchImpl.evaluate (again) doesn't check the condition if a FulltextQueryIndex is used.
          Hide
          Thomas Mueller added a comment -

          There are some remaining issues with aggregates, tracked with issue OAK-828.

          Show
          Thomas Mueller added a comment - There are some remaining issues with aggregates, tracked with issue OAK-828 .
          Hide
          Alex Parvulescu added a comment -

          bulk close for the 0.15 release

          Show
          Alex Parvulescu added a comment - bulk close for the 0.15 release

            People

            • Assignee:
              Thomas Mueller
              Reporter:
              Thomas Mueller
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development