Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6603

consider restrictions on what Similarity.coord() can return

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • None
    • None
    • None
    • None
    • New

    Description

      Today, Similarity.coord() can really return anything, though all of our similarities are well-behaved, issues are all about custom ones:

         * @param overlap the number of query terms matched in the document
         * @param maxOverlap the total number of terms in the query
         * @return a score factor based on term overlap with the query
         */
        public float coord(int overlap, int maxOverlap)
      

      But problems arise when a custom similarity implements coord(1, 1) to return something crazy (say 2). In this case their coord impl is ignored (see LUCENE-4300). coord(1,1) is always treated as 1, or things make no sense, for example A NOT B would score differently from A (which would be a simple termquery and have no BQ around it). Same goes with filters, which should not change scoring but would, if we didn't mandate this.

      Now we see the same problem again, with LUCENE-6585. For this optimization to work in the current world, it will have to check if coord(N,N) == 1 in order to do it safely. Otherwise it cannot safely collapse conjunctions for such custom similarities.

      I would like to enforce that coord(N,N) is always treated as 1, to prevent all these crazy codepaths for wierd, not-so-well-tested cases. So we could change the current hack in BooleanWeight.coord():

      --   } else if (maxOverlap == 1) {
      ++   } else if (overlap == maxOverlap) {
         return 1f;
      

      Alternatively though, we could change javadocs of Similarity.coord() and add a check to BooleanWeight to throw an exception.

      In either case, doing this would be a break, because it would break some custom sims out there, at least this one (https://mail-archives.apache.org/mod_mbox/lucene-java-user/201208.mbox/%3CF00509B7-8C6B-4496-951B-E89B168D91A2@local.ch%3E). That one returns 1/overlap which is kinda like averaging the scores, and caused us to look into this and open bugs like LUCENE-4297 and LUCENE-4300 and probably others. But perhaps coord() is not the way such things should be done and there is another way instead, to allow BQ to be simpler and more efficient.

      Attachments

        Activity

          People

            Unassigned Unassigned
            rcmuir Robert Muir
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: