Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 5.4
    • 6.1
    • modules/other
    • New

    Description

      Currently the query is generated :
      org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
      1) we extract the terms from the interesting fields, adding them to a map :
      Map<String, Int> termFreqMap = new HashMap<>();
      ( we lose the relation field-> term, we don't know anymore where the term was coming ! )

      org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
      2) we build the queue that will contain the query terms, at this point we connect again there terms to some field, but :
      ...
      // go through all the fields and find the largest document frequency
      String topField = fieldNames[0];
      int docFreq = 0;
      for (String fieldName : fieldNames) {
      int freq = ir.docFreq(new Term(fieldName, word));
      topField = (freq > docFreq) ? fieldName : topField;
      docFreq = (freq > docFreq) ? freq : docFreq;
      }
      ...

      We identify the topField as the field with the highest document frequency for the term t .
      Then we build the termQuery :

      queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));

      In this way we lose a lot of precision.
      Not sure why we do that.
      I would prefer to keep the relation between terms and fields.
      The MLT query can improve a lot the quality.
      If i run the MLT on 2 fields : weSell and weDontSell for example.
      It is likely I want to find documents with similar terms in the weSell and similar terms in the weDontSell, without mixing up the things and loosing the semantic of the terms.

      Attachments

        1. LUCENE-6954.patch
          14 kB
          Alessandro Benedetti

        Activity

          People

            teofili Tommaso Teofili
            abenedetti Alessandro Benedetti
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment