Lucene - Core
LUCENE-4828

BooleanQuery.extractTerms should not recurse into MUST_NOT clauses

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 4.2.1, 6.0
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New
    1. LUCENE-4828.patch
      3 kB
      Michael McCandless

      Activity

      Commit Tag Bot added a comment -

      [trunk commit] Michael McCandless
      http://svn.apache.org/viewvc?view=revision&revision=1456074

      LUCENE-4828: BooleanQuery.extractTerms no longer includes terms from MUST_NOT clauses

      Commit Tag Bot added a comment -

      [branch_4x commit] Michael McCandless
      http://svn.apache.org/viewvc?view=revision&revision=1456076

      LUCENE-4828: BooleanQuery.extractTerms no longer includes terms from MUST_NOT clauses

      Yonik Seeley added a comment -

      This seems highly dependent on what the extractTerms is being used for?

      Michael McCandless added a comment -

      > This seems highly dependent on what the extractTerms is being used for?

      True, but as far as I know the dominant use case is for highlighting, where it's confusing to see your MUST_NOT terms highlighted.

      And SpanNotQuery doesn't include terms from its excluded query ...

      It also causes problems for users, e.g. http://stackoverflow.com/questions/13633409/skipping-terms-of-must-not-clauses-during-term-extraction
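
The behavior described here can be illustrated with a minimal, self-contained sketch. It does not use Lucene: `ToyTermQuery`, `ToyBooleanQuery`, `Clause`, and `termsOf` are hypothetical stand-ins invented for this example, and only the `MUST`/`SHOULD`/`MUST_NOT` occurrence names are taken from the discussion. The point is simply that term extraction skips any clause marked `MUST_NOT`, including everything nested beneath it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a boolean query tree; none of these types are Lucene classes. */
public class ExtractTermsSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    interface Query {
        /** Adds the terms this query would match on to the given set. */
        void extractTerms(Set<String> terms);
    }

    /** Hypothetical leaf query holding a single term. */
    static final class ToyTermQuery implements Query {
        private final String term;
        ToyTermQuery(String term) { this.term = term; }
        @Override public void extractTerms(Set<String> terms) { terms.add(term); }
    }

    /** Pairs a sub-query with how it must occur. */
    static final class Clause {
        final Query query;
        final Occur occur;
        Clause(Query query, Occur occur) { this.query = query; this.occur = occur; }
    }

    /** Hypothetical boolean query that does not recurse into prohibited clauses. */
    static final class ToyBooleanQuery implements Query {
        private final List<Clause> clauses = new ArrayList<>();

        ToyBooleanQuery add(Query query, Occur occur) {
            clauses.add(new Clause(query, occur));
            return this;
        }

        @Override public void extractTerms(Set<String> terms) {
            for (Clause clause : clauses) {
                // Prohibited clauses never match a returned document,
                // so their terms are left out of the extracted set.
                if (clause.occur != Occur.MUST_NOT) {
                    clause.query.extractTerms(terms);
                }
            }
        }
    }

    /** Convenience helper: collects the terms in insertion order. */
    static Set<String> termsOf(Query query) {
        Set<String> terms = new LinkedHashSet<>();
        query.extractTerms(terms);
        return terms;
    }

    public static void main(String[] args) {
        Query query = new ToyBooleanQuery()
            .add(new ToyTermQuery("lucene"), Occur.MUST)
            .add(new ToyTermQuery("solr"), Occur.MUST_NOT);
        System.out.println(termsOf(query)); // prints [lucene]
    }
}
```

A highlighter fed from `termsOf` would therefore never be asked to highlight "solr" in this toy setup.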

      Yonik Seeley added a comment -

      For highlighting, I've heard it argued both ways (i.e. prohibited terms can be important too). I wasn't thinking about highlighting as much as something like distributed IDF or other global term statistics. Depending on what type of query is in the prohibited clause, using global statistics could be important.

      Michael McCandless added a comment -

      > For highlighting, I've heard it argued both ways (i.e. prohibited terms can be important too).

      Hmm can you give an example where it's useful to highlight the prohibited terms?

      > I wasn't thinking about highlighting as much as something like distributed IDF or other global term statistics. Depending on what type of query is in the prohibited clause, using global statistics could be important.

      But, normally, prohibited clauses don't contribute to scoring so the stats of terms inside them don't need to be distributed?

      Yonik Seeley added a comment -

      >> For highlighting, I've heard it argued both ways (i.e. prohibited terms can be important too).

      > Hmm can you give an example where it's useful to highlight the prohibited terms?

      It wasn't my argument, but I guess it was along the lines that there can be info/relevance in the fact that the user did not want documents with a specific term, and thus it can make sense to highlight them (maybe with a different color...)

      >> I wasn't thinking about highlighting as much as something like distributed IDF or other global term statistics.

      > But, normally, prohibited clauses don't contribute to scoring so the stats of terms inside them don't need to be distributed?

      The key word there is "normally". As I said, it depends on the type of query in the prohibited clause, and the boolean query does not have the knowledge to know if it will matter or not. Something other than extractTerms could be used for distributed term stats though.
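
One way to picture "something other than extractTerms" is a collector where the caller, not the boolean query, decides which clause types to descend into. The sketch below is purely illustrative and assumes nothing about Lucene's API: `Node`, `collect`, `forHighlighting`, and `forStatistics` are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Toy term collector with a caller-supplied policy; not a Lucene API. */
public class TermCollectionPolicySketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A node is either a leaf term or a list of (child, occur) clauses. */
    static final class Node {
        final String term; // non-null only for leaves
        final List<Node> children = new ArrayList<>();
        final List<Occur> occurs = new ArrayList<>();

        private Node(String term) { this.term = term; }
        static Node leaf(String term) { return new Node(term); }
        static Node bool() { return new Node(null); }

        Node add(Node child, Occur occur) {
            children.add(child);
            occurs.add(occur);
            return this;
        }
    }

    /** Collects terms, descending only into clauses the policy accepts. */
    static void collect(Node node, Predicate<Occur> descend, Set<String> out) {
        if (node.term != null) {
            out.add(node.term);
            return;
        }
        for (int i = 0; i < node.children.size(); i++) {
            if (descend.test(node.occurs.get(i))) {
                collect(node.children.get(i), descend, out);
            }
        }
    }

    /** Highlighting-style policy: skip prohibited clauses. */
    static Set<String> forHighlighting(Node query) {
        Set<String> out = new TreeSet<>();
        collect(query, occur -> occur != Occur.MUST_NOT, out);
        return out;
    }

    /** Statistics-style policy: visit every clause, prohibited or not. */
    static Set<String> forStatistics(Node query) {
        Set<String> out = new TreeSet<>();
        collect(query, occur -> true, out);
        return out;
    }

    public static void main(String[] args) {
        Node query = Node.bool()
            .add(Node.leaf("lucene"), Occur.MUST)
            .add(Node.leaf("solr"), Occur.MUST_NOT);
        System.out.println(forHighlighting(query)); // prints [lucene]
        System.out.println(forStatistics(query));   // prints [lucene, solr]
    }
}
```

Under this assumed design, a highlighter and a global-statistics gatherer can share one traversal while disagreeing about prohibited clauses.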

      Robert Muir added a comment -

      In my opinion SpanNotQuery should be consistent with BooleanQuery here (whichever way we go, it's no big deal to me).

      As for distributed scoring, I think ideally we would not weight or score MUST_NOT or constant-scored clauses at all. I know this isn't the case today, but I just think it's dumb.

      Otis Gospodnetic added a comment -

      But how do you highlight a term that is not there?

      Walter Underwood added a comment - edited

      I was about to make the same comment about highlighting forbidden terms, but then realized that you might search on one field and highlight another.

      I think that goes against the point of highlighting, which is to make more clear why the engine chose that document, but some people have odd requirements.

      Yonik Seeley added a comment -

      > I think ideally we would not weight or score MUST_NOT or constant-scored clauses at all. I know this isn't the case today, but I just think it's dumb.

      Not weighting prohibited clauses would needlessly break certain types of queries.

      Robert Muir added a comment -

      What kind of queries would this break?

      Just to be clear, when I say "weight", I mean similarity: we'd still call createWeight, it just wouldn't fetch any term statistics.

      Commit Tag Bot added a comment -

      [trunk commit] Michael McCandless
      http://svn.apache.org/viewvc?view=revision&revision=1458472

      LUCENE-4828: move CHANGES entry to 4.2.1

      Uwe Schindler added a comment -

      Closed after release.


        People

        • Assignee: Michael McCandless
        • Reporter: Michael McCandless
        • Votes: 0
        • Watchers: 5
