Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6954

More Like This Query: keep fields separated

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.4
    • Fix Version/s: 6.1
    • Component/s: modules/other
    • Labels:
    • Lucene Fields:
      New

      Description

      Currently the query is generated :
      org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
      1) we extract the terms from the interesting fields, adding them to a map :
      Map<String, Int> termFreqMap = new HashMap<>();
      ( we lose the relation field-> term, we don't know anymore where the term was coming ! )

      org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
      2) we build the queue that will contain the query terms, at this point we connect again there terms to some field, but :
      ...
      // go through all the fields and find the largest document frequency
      String topField = fieldNames[0];
      int docFreq = 0;
      for (String fieldName : fieldNames) {
      int freq = ir.docFreq(new Term(fieldName, word));
      topField = (freq > docFreq) ? fieldName : topField;
      docFreq = (freq > docFreq) ? freq : docFreq;
      }
      ...

      We identify the topField as the field with the highest document frequency for the term t .
      Then we build the termQuery :

      queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));

      In this way we lose a lot of precision.
      Not sure why we do that.
      I would prefer to keep the relation between terms and fields.
      The MLT query can improve a lot the quality.
      If i run the MLT on 2 fields : weSell and weDontSell for example.
      It is likely I want to find documents with similar terms in the weSell and similar terms in the weDontSell, without mixing up the things and loosing the semantic of the terms.

        Attachments

        1. LUCENE-6954.patch
          14 kB
          Alessandro Benedetti

          Activity

            People

            • Assignee:
              teofili Tommaso Teofili
              Reporter:
              alessandro.benedetti Alessandro Benedetti
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: