  1. Lucene - Core
  2. LUCENE-6954

More Like This Query: keep fields separated


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.4
    • Fix Version/s: 6.1
    • Component/s: modules/other
    • Lucene Fields: New


      Currently the query is generated as follows:
      1) We extract the terms from the interesting fields and add them to a map:
      Map<String, Int> termFreqMap = new HashMap<>();
      (We lose the field -> term relation: we no longer know which field a term came from!)

      2) We build the queue that will contain the query terms. At this point we associate each term with a field again, but:
      // go through all the fields and find the largest document frequency
      String topField = fieldNames[0];
      int docFreq = 0;
      for (String fieldName : fieldNames) {
        int freq = ir.docFreq(new Term(fieldName, word));
        topField = (freq > docFreq) ? fieldName : topField;
        docFreq = (freq > docFreq) ? freq : docFreq;
      }

      We identify topField as the field with the highest document frequency for the term.
      Then we build the term query:

      queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));

      In this way we lose a lot of precision.
      Not sure why we do that.
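
To make the loss concrete, here is a minimal, self-contained sketch of the two steps described above. It is not the MoreLikeThis code itself: the class name, the helper methods, and the docFreqs map (a stand-in for ir.docFreq(new Term(fieldName, word))) are assumptions made for illustration, and the frequencies are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlattenedTermsSketch {

  // Step 1: count the terms of every field into a single map.
  // The field each term came from is discarded here.
  static Map<String, Integer> flatten(Map<String, List<String>> docFields) {
    Map<String, Integer> termFreqMap = new HashMap<>();
    for (List<String> tokens : docFields.values()) {
      for (String token : tokens) {
        termFreqMap.merge(token, 1, Integer::sum);
      }
    }
    return termFreqMap;
  }

  // Step 2: pick the field with the largest document frequency for the word.
  // docFreqs (field -> term -> document frequency) is a hypothetical
  // replacement for the index lookup.
  static String topField(String word, String[] fieldNames,
                         Map<String, Map<String, Integer>> docFreqs) {
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      int freq = docFreqs.getOrDefault(fieldName, Map.of()).getOrDefault(word, 0);
      if (freq > docFreq) {
        topField = fieldName;
        docFreq = freq;
      }
    }
    return topField;
  }

  public static void main(String[] args) {
    // The source document mentions "bikes" only in weDontSell.
    Map<String, List<String>> doc = Map.of(
        "weSell", List.of("cars"),
        "weDontSell", List.of("bikes"));
    // In this made-up index, "bikes" is more frequent in weSell.
    Map<String, Map<String, Integer>> docFreqs = Map.of(
        "weSell", Map.of("cars", 10, "bikes", 7),
        "weDontSell", Map.of("bikes", 2));
    String[] fieldNames = {"weSell", "weDontSell"};
    for (String word : flatten(doc).keySet()) {
      System.out.println(word + " -> " + topField(word, fieldNames, docFreqs));
    }
  }
}
```

With these sample numbers, "bikes" is assigned to weSell even though the source document only has it in weDontSell, which is exactly the mixing described in this issue.
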
      I would prefer to keep the relation between terms and fields.
      Doing so could improve the quality of the MLT query a lot.
      For example, suppose I run MLT on 2 fields: weSell and weDontSell.
      I most likely want to find documents with similar terms in weSell and similar terms in weDontSell, without mixing the two up and losing the semantics of the terms.
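
One possible way to keep the relation is sketched below: term frequencies are counted per field, and every query term carries the field it was extracted from. This only illustrates the idea and is not necessarily what the attached patch does. PerFieldTermsSketch, FieldTerm, perFieldTermFreqs, and queryTerms are hypothetical names, and scoring (idf, docFreq) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PerFieldTermsSketch {

  // Hypothetical value type: a query term together with its source field.
  static final class FieldTerm {
    final String field;
    final String word;
    final int tf;

    FieldTerm(String field, String word, int tf) {
      this.field = field;
      this.word = word;
      this.tf = tf;
    }
  }

  // Count term frequencies separately for each field (field -> term -> tf),
  // so the field -> term relation is preserved.
  static Map<String, Map<String, Integer>> perFieldTermFreqs(
      Map<String, List<String>> docFields) {
    Map<String, Map<String, Integer>> result = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : docFields.entrySet()) {
      Map<String, Integer> freqs =
          result.computeIfAbsent(e.getKey(), k -> new TreeMap<>());
      for (String token : e.getValue()) {
        freqs.merge(token, 1, Integer::sum);
      }
    }
    return result;
  }

  // Emit one (field, term) pair per map entry; the same word appearing in
  // two fields yields two separate query terms.
  static List<FieldTerm> queryTerms(Map<String, Map<String, Integer>> perField) {
    List<FieldTerm> terms = new ArrayList<>();
    perField.forEach((field, freqs) ->
        freqs.forEach((word, tf) -> terms.add(new FieldTerm(field, word, tf))));
    return terms;
  }

  public static void main(String[] args) {
    Map<String, List<String>> doc = Map.of(
        "weSell", List.of("cars", "cars"),
        "weDontSell", List.of("bikes"));
    for (FieldTerm t : queryTerms(perFieldTermFreqs(doc))) {
      System.out.println(t.field + ":" + t.word + " tf=" + t.tf);
    }
  }
}
```

Here "bikes" stays attached to weDontSell regardless of how frequent it is in weSell, so each field can contribute its own term queries.
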


        1. LUCENE-6954.patch
          14 kB
          Alessandro Benedetti



            Assignee: teofili Tommaso Teofili
            Reporter: abenedetti Alessandro Benedetti