Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      According to the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory), PositionFilter is mainly useful to make query parsers generate boolean queries instead of phrase queries although this problem can be solved at query parsing level instead of analysis level (eg. using QueryParser.setAutoGeneratePhraseQueries).

      So given that PositionFilter corrupts token graphs (see TestRandomChains), I propose to deprecate it.

      1. LUCENE-4981.patch
        7 kB
        Adrien Grand
      2. LUCENE-4981.patch
        6 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is the patch for 4.x. The patch for trunk is simpler as PositionFilter and PositionFilterFactory would simply be removed.

        Show
        Adrien Grand added a comment - Here is the patch for 4.x. The patch for trunk is simpler as PositionFilter and PositionFilterFactory would simply be removed.
        Hide
        Steve Rowe added a comment -

        Adrien, can you hold off committing for a little bit? I'm not sure if QueryParser.setAutoGeneratePhraseQueries is sufficient for all cases that the PositionFilter hack addresses - I want to do some investigation.

        Show
        Steve Rowe added a comment - Adrien, can you hold off committing for a little bit? I'm not sure if QueryParser.setAutoGeneratePhraseQueries is sufficient for all cases that the PositionFilter hack addresses - I want to do some investigation.
        Hide
        Adrien Grand added a comment -

        Sure I can wait. (Even when committed, the old behavior will still be available by using luceneMatchVersion=4.3).

        I would like to start marking all our broken components (the offenders in TestRandomChains) as deprecated so that people start thinking about ways to solve their problems without them, stop getting highlighting bugs and can eventually smoothly upgrade to 5.0 when we release it. I already started deprecating/fixing some tokenizers / token filters for 4.4 (LUCENE-4955 and LUCENE-4963) and would like to get as many of them fixed as possible for the next release.

        Show
        Adrien Grand added a comment - Sure I can wait. (Even when committed, the old behavior will still be available by using luceneMatchVersion=4.3). I would like to start marking all our broken components (the offenders in TestRandomChains) as deprecated so that people start thinking about ways to solve their problems without them, stop getting highlighting bugs and can eventually smoothly upgrade to 5.0 when we release it. I already started deprecating/fixing some tokenizers / token filters for 4.4 ( LUCENE-4955 and LUCENE-4963 ) and would like to get as many of them fixed as possible for the next release.
        Hide
        Steve Rowe added a comment -

        Thanks for working on fixing the broken stuff.

        In addition to use cases, I want to investigate the exact nature of the brokenness PositionFilter introduces - maybe it's fixable? I'll re-enable it in TestRandomChains and iterate until it breaks.

        Show
        Steve Rowe added a comment - Thanks for working on fixing the broken stuff. In addition to use cases, I want to investigate the exact nature of the brokenness PositionFilter introduces - maybe it's fixable? I'll re-enable it in TestRandomChains and iterate until it breaks.
        Hide
        Robert Muir added a comment -

        I'm not sure its fixable: by definition it corrupts the structure because you lose all posincs. so synonyms no longer become synonyms, holes disappear, or whatever. and this doesnt even factor in posLength...

        Show
        Robert Muir added a comment - I'm not sure its fixable: by definition it corrupts the structure because you lose all posincs. so synonyms no longer become synonyms, holes disappear, or whatever. and this doesnt even factor in posLength...
        Hide
        Steve Rowe added a comment -

        The comment in TestRandomChains says:

        // TODO: corrumpts graphs (offset consistency check):
        PositionFilter.class,
        

        which is what made me wonder what about the nature of brokenness: why are offsets a problem?

        I agree, Robert, PositionFilter corrupts by design. And if we do end up keeping it, position length should be addressed (it's not now), maybe by always setting it to 1.

        Show
        Steve Rowe added a comment - The comment in TestRandomChains says: // TODO: corrumpts graphs (offset consistency check): PositionFilter.class, which is what made me wonder what about the nature of brokenness: why are offsets a problem? I agree, Robert, PositionFilter corrupts by design. And if we do end up keeping it, position length should be addressed (it's not now), maybe by always setting it to 1.
        Hide
        Adrien Grand added a comment -

        why are offsets a problem?

        There are invariants that need to be maintained by token filters: all tokens that start at the same position must have the same start offset and all tokens that end at the same position (start position + position length) must have the same end offset (see ValidatingFilter). By arbitrarily changing position increments, PositionFilter breaks these invariants.

        Show
        Adrien Grand added a comment - why are offsets a problem? There are invariants that need to be maintained by token filters: all tokens that start at the same position must have the same start offset and all tokens that end at the same position (start position + position length) must have the same end offset (see ValidatingFilter). By arbitrarily changing position increments, PositionFilter breaks these invariants.
        Hide
        Robert Muir added a comment -

        which is what made me wonder what about the nature of brokenness: why are offsets a problem?

        I think Adrien describes it correctly: afaik it doesn't do anything super-evil like make start offsets go backwards or anything, but it breaks those invariants Adrien describes which can cause a follow-on-filter (e.g. shingle) to cause further craziness, e.g. things going backwards or endOffset < startOffset or other problems.

        Show
        Robert Muir added a comment - which is what made me wonder what about the nature of brokenness: why are offsets a problem? I think Adrien describes it correctly: afaik it doesn't do anything super-evil like make start offsets go backwards or anything, but it breaks those invariants Adrien describes which can cause a follow-on-filter (e.g. shingle) to cause further craziness, e.g. things going backwards or endOffset < startOffset or other problems.
        Hide
        Steve Rowe added a comment -

        Thanks for the pointer Adrien, I'll take a look at ValidatingFilter.

        It might be possible, by creating new positions, to enable offset consistency in PositionFilter. Not sure it's worth the effort though.

        Show
        Steve Rowe added a comment - Thanks for the pointer Adrien, I'll take a look at ValidatingFilter. It might be possible, by creating new positions, to enable offset consistency in PositionFilter. Not sure it's worth the effort though.
        Hide
        Robert Muir added a comment -

        The Validatingfilter should be the same logic in BaseTokenStreamTestCase:196

        I think its in a separate filter because then its applied at each "stage" of the analysis in TestRandomChains so if there is a bug in a complex analysis chain we know the culprit.

        Show
        Robert Muir added a comment - The Validatingfilter should be the same logic in BaseTokenStreamTestCase:196 I think its in a separate filter because then its applied at each "stage" of the analysis in TestRandomChains so if there is a bug in a complex analysis chain we know the culprit.
        Hide
        Adrien Grand added a comment -

        Steve, may I commit this patch?

        Show
        Adrien Grand added a comment - Steve, may I commit this patch?
        Hide
        Steve Rowe added a comment -

        Adrien, 24 more hours, please?

        Show
        Steve Rowe added a comment - Adrien, 24 more hours, please?
        Hide
        Adrien Grand added a comment -

        Oh sure, I didn't want to rush you. Take the time you need.

        Show
        Adrien Grand added a comment - Oh sure, I didn't want to rush you. Take the time you need.
        Hide
        Steve Rowe added a comment -

        Adrien,

        I looked http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory, and where it originally came from (http://markmail.org/message/g4habmbyeuckmix6 and LUCENE-1380), and I don't think existing query parser functionality, including QueryParser.setAutoGeneratePhraseQueries, will cover the use case it was created to handle.

        That use case is roughly: given an indexed non-tokenized string field (e.g. "a b"), and a multi-word query against that field, create disjunction query of all possible word n-grams, where 0<n<N and N is as large as the expected longest query. E.g. a query "a b c" would result in "'a' OR 'a b' OR 'a b c' OR 'b' OR 'b c' OR 'c'", and would match a doc with field value "a b".

        mck, the guy who started the thread and created the issue, was able to handle this use case by stringing together:

        1. Quoting the query, to allow the configured analyzer to see all of the terms instead of one-at-a-time
        2. ShingleFilter, to create the n-grams
        3. The new PositionFilter, to place all terms at the same position
        4. QueryParser's synonym handling functionality, which produces a MultiPhraseQuery, which when given multiple terms at the same single position, creates a BooleanQuery with one SHOULD TermQuery for each term.

        Without PositionFilter, is there some way to achieve the same goal?

        I don't think we should get rid of PositionFilter unless we have an alternate way to handle the (IMHO legitimate) use case it was originally designed to cover.

        Show
        Steve Rowe added a comment - Adrien, I looked http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory , and where it originally came from ( http://markmail.org/message/g4habmbyeuckmix6 and LUCENE-1380 ), and I don't think existing query parser functionality, including QueryParser.setAutoGeneratePhraseQueries , will cover the use case it was created to handle. That use case is roughly: given an indexed non-tokenized string field (e.g. "a b"), and a multi-word query against that field, create disjunction query of all possible word n-grams, where 0<n<N and N is as large as the expected longest query. E.g. a query "a b c" would result in "'a' OR 'a b' OR 'a b c' OR 'b' OR 'b c' OR 'c'", and would match a doc with field value "a b". mck , the guy who started the thread and created the issue, was able to handle this use case by stringing together: Quoting the query, to allow the configured analyzer to see all of the terms instead of one-at-a-time ShingleFilter, to create the n-grams The new PositionFilter, to place all terms at the same position QueryParser's synonym handling functionality, which produces a MultiPhraseQuery, which when given multiple terms at the same single position, creates a BooleanQuery with one SHOULD TermQuery for each term. Without PositionFilter, is there some way to achieve the same goal? I don't think we should get rid of PositionFilter unless we have an alternate way to handle the (IMHO legitimate) use case it was originally designed to cover.
        Hide
        Adrien Grand added a comment -

        Without PositionFilter, is there some way to achieve the same goal?

        My understanding is that this use-case would like the query parser to interpret phrase queries as sub-phrase queries. But instead of creating a specific query parser in order to process phrase queries differently (by overriding newFieldQuery for example), it tries to hack the token stream so that the default query parser generates the expected query. So I don't really think this is a use-case for PositionFilter?

        Show
        Adrien Grand added a comment - Without PositionFilter, is there some way to achieve the same goal? My understanding is that this use-case would like the query parser to interpret phrase queries as sub-phrase queries. But instead of creating a specific query parser in order to process phrase queries differently (by overriding newFieldQuery for example), it tries to hack the token stream so that the default query parser generates the expected query. So I don't really think this is a use-case for PositionFilter?
        Hide
        Steve Rowe added a comment -

        So I don't really think this is a use-case for PositionFilter?

        I agree, subclassing the QP and overriding newFieldQuery and getFieldQuery should be sufficient to handle this use case. Current PositionFilter users will have to maintain their own code outside of Lucene and Solr's codebase, rather than having a configuration-only solution.

        I think the @deprecated annotation on PositionFilter in branch_4x should be augmented to help people find this alternative. Similarly, in the backcompat section of trunk CHANGES.txt, and/or MIGRATE.txt, this issue should be mentioned.

        Show
        Steve Rowe added a comment - So I don't really think this is a use-case for PositionFilter? I agree, subclassing the QP and overriding newFieldQuery and getFieldQuery should be sufficient to handle this use case. Current PositionFilter users will have to maintain their own code outside of Lucene and Solr's codebase, rather than having a configuration-only solution. I think the @deprecated annotation on PositionFilter in branch_4x should be augmented to help people find this alternative. Similarly, in the backcompat section of trunk CHANGES.txt, and/or MIGRATE.txt, this issue should be mentioned.
        Hide
        Adrien Grand added a comment -

        Thanks for your feedback Steve. I updated the patch to say that the problems that were solved by PositionFilter should be solved at query parsing level, does it look better?

        Show
        Adrien Grand added a comment - Thanks for your feedback Steve. I updated the patch to say that the problems that were solved by PositionFilter should be solved at query parsing level, does it look better?
        Hide
        Steve Rowe added a comment - - edited

        +1, thanks Adrien!

        Show
        Steve Rowe added a comment - - edited +1, thanks Adrien!

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development