Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3, 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Follow-up to LUCENE-2048:

      Adds factory methods getPhraseQuery/getMultiPhraseQuery to the query parser (QP) so that you can subclass it and customize behavior. In particular:

      • By default, Solr throws an exception here if the fieldtype omits positions, rather than 3.x's silent failure of returning no results. Even for trunk, it's nicer to fail during query parsing than to wait for Lucene's failure during execution.
      • Adds phraseAsBoolean, which lets you downgrade these phrase/multiphrase queries to boolean queries. This is a nice option in conjunction with our word n-gram filters (shingle/commongrams/etc.) for a fast "approximation", if your application can tolerate some false positives, e.g. "foo bar" -> termQuery(foo_bar), "foo bar baz" -> BQ(foo_bar AND bar_baz).
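
      The two behaviors described above can be sketched in a few lines. This is a minimal, library-free illustration and not the code in the patch: the function name get_phrase_query, its parameters, the query classes, and the "_" shingle separator (taken from the foo_bar notation in the example) are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    term: str


@dataclass(frozen=True)
class BooleanAnd:
    clauses: tuple  # every clause is required (a conjunction)


@dataclass(frozen=True)
class PhraseQuery:
    terms: tuple


class PositionsOmittedError(ValueError):
    """Raised at parse time when a phrase is requested on a position-less field."""


def get_phrase_query(terms, has_positions, phrase_as_boolean=False, sep="_"):
    """Hypothetical stand-in for the getPhraseQuery factory method.

    terms             -- analyzed query tokens, in order
    has_positions     -- whether the field indexes positions (assumed flag)
    phrase_as_boolean -- downgrade the phrase to a conjunction of bigram terms
    """
    terms = tuple(terms)
    if not terms:
        raise ValueError("a phrase needs at least one term")
    if len(terms) == 1:
        return TermQuery(terms[0])
    if phrase_as_boolean:
        # One bigram (shingle) term per adjacent pair of tokens.
        bigrams = [a + sep + b for a, b in zip(terms, terms[1:])]
        if len(bigrams) == 1:
            return TermQuery(bigrams[0])
        return BooleanAnd(tuple(TermQuery(b) for b in bigrams))
    if not has_positions:
        # Fail while parsing instead of silently matching nothing later.
        raise PositionsOmittedError(
            "phrase query requested on a field that omits positions")
    return PhraseQuery(terms)
```

      With phrase_as_boolean enabled, ["foo", "bar"] becomes a single term query on foo_bar and ["foo", "bar", "baz"] becomes a conjunction of foo_bar and bar_baz, which mirrors the example in the description. The false positives mentioned there come from documents that contain both bigrams but not adjacent to each other.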
      1. SOLR-2660.patch, 15 kB, Robert Muir

        Activity

        Robert Muir added a comment -

        I think this could be a good option (in combination with shingles, as mentioned) to accelerate
        the phrase queries that Solr query parsers generate in order to boost closer matches.

        Again, the idea is to omit positions entirely and instead use ShingleFilter (unigrams and bigrams), approximating phrase
        queries with n-gram conjunctions. For the sloppy case, I think we should use an n-gram disjunction, perhaps interpreting
        the slop factor as minNrShouldMatch?
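
        One possible reading of the sloppy case, as a rough sketch. The comment leaves the slop-to-minNrShouldMatch mapping open, so the rule used here (each unit of slop forgives one missing bigram, with at least one bigram always required) is purely an assumption, as are the function names and the "_" separator.

```python
def sloppy_phrase_as_disjunction(terms, slop, sep="_"):
    """Return (bigram_terms, min_should_match) for a sloppy phrase.

    ASSUMPTION: each unit of slop tolerates one missing bigram, and at
    least one bigram must always match. The thread does not fix this rule.
    """
    terms = list(terms)
    if len(terms) < 2:
        raise ValueError("need at least two terms to form a bigram")
    if slop < 0:
        raise ValueError("slop must be non-negative")
    bigrams = [a + sep + b for a, b in zip(terms, terms[1:])]
    min_should_match = max(1, len(bigrams) - slop)
    return bigrams, min_should_match


def matches(doc_terms, bigrams, min_should_match):
    """True if enough of the optional bigram clauses occur in the document."""
    present = sum(1 for b in bigrams if b in doc_terms)
    return present >= min_should_match
```

        Under this rule, slop 0 degenerates to the conjunction used for exact phrases, since every bigram must then be present. For a document containing "foo bar qux baz" (so its shingled terms include foo_bar but not bar_baz), the query "foo bar baz" fails with slop 0 and matches with slop 1.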

        This basically means you are substituting an n-gram approximation for Levenshtein distance in both cases.

        In general it's a classic indexing/search tradeoff: in my tests on Wikipedia, indexing takes roughly twice as long with the shingles,
        but in exchange, for a lot of these use cases you don't need to consult the positions file at all.
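
        To make the indexing side of that tradeoff concrete, here is a small sketch of what a unigram-plus-bigram shingle stream looks like. It only imitates the shape of the output: the function name and the "_" separator are assumptions, and it says nothing about actual indexing time, which depends on much more than the token count.

```python
def shingle_terms(tokens, sep="_"):
    """Emit each unigram, followed by the bigram that starts at it (if any)."""
    tokens = list(tokens)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            out.append(tok + sep + tokens[i + 1])
    return out
```

        A field with n tokens yields n unigrams plus n - 1 bigrams, so 2n - 1 terms are indexed instead of n, and the bigram vocabulary is far larger than the unigram one. In return, a phrase lookup becomes a lookup of ordinary terms and no position data is needed.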

        As a parameter to the fieldtype, it's easily pluggable without messing with any query parsers, and ordinary queries (term, boolean, etc.)
        are totally 'pass-thru'. However, the thing I guess I don't like about this patch is that this is really a different
        'query intent'. In other words, I think it's a perfect approach when you just want to boost scores of close matches
        (e.g. when generated by the dismax query parser), but when your 'intent' is to actually limit matches to a phrase
        (e.g. when keyed in by a user directly), this approximation isn't as good a fit.

        Either way, I'm open to other opinions before doing anything. If we decide to do it, I think the next step is to update the patch with
        the SloppyPhraseQuery approximation.

        Jan Høydahl added a comment -

        Can we consider committing a first part of this to lay the foundation for fixing the exception, as discussed in this thread: http://search-lucene.com/m/t168517UJ5l1

        Robert Muir added a comment -

        There is no exception to fix. I think people discussing in that thread have a misunderstanding of what this issue is about.
        If you ask to omit positions, and then you ask for a phrase query, or configure a stupid query parser that generates them automatically, then you deserve an exception.


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 3
          • Watchers: 3
