Lucene - Core
  1. Lucene - Core
  2. LUCENE-1908

Similarity javadocs for scoring function to relate more tightly to scoring models in effect

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      See discussion in the related issue.

      1. LUCENE-1908.patch
        15 kB
        Doron Cohen
      2. LUCENE-1908.patch
        15 kB
        Doron Cohen
      3. LUCENE-1908.patch
        15 kB
        Doron Cohen
      4. LUCENE-1908.patch
        19 kB
        Doron Cohen
      5. LUCENE-1908.patch
        19 kB
        Doron Cohen

        Issue Links

          Activity

          Hide
          Doron Cohen added a comment -

          Committed,
          Thanks Mark, Shai, Ted, Jiri, and Marvin!

          Show
          Doron Cohen added a comment - Committed, Thanks Mark, Shai, Ted, Jiri, and Marvin!
          Hide
          Doron Cohen added a comment -

          Updated patch according to comments for the norms section

          To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html

          Show
          Doron Cohen added a comment - Updated patch according to comments for the norms section To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
          Hide
          Doron Cohen added a comment -

          The rationale behind the coarseness of the norms is that since the accuracy of
          search engines in retrieving the documents that the user really wants is so
          poor, only big differences matter. (It's not just poor "recall" against a
          given query, but the difficulty that the user experiences in formulating a
          proper query to express what they're really looking for in the first place.)

          Doug wrote at least once about this some years back, but I haven't been
          able to track down the post.

          Thanks! I too failed to find that post.

          I like the part about users difficulty to express their information need in the query.

          So I am updating like this:

          However the resulted norm value is encoded as a single byte before being 
          stored. At search time, the norm byte value is read from the index directory 
          and decoded back to a float norm value. This encoding/decoding, while reducing 
          index size, comes with the price of precision loss - it is not guaranteed that 
          decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. 
           
          Compression of norm values to a single byte saves memory at search time, 
          because once a field is referenced at search time, its norms - for all 
          documents - are maintained in memory. 
           
          The rationale supporting such lossy compression of norm values is that 
          given the difficulty (and inaccuracy) of users to express their true information 
          need by a query, only big differences matter. 
           
          Last, note that search time is too late to modify this norm part of scoring, 
          e.g. by using a different Similarity for search. 
          
          Show
          Doron Cohen added a comment - The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.) Doug wrote at least once about this some years back, but I haven't been able to track down the post. Thanks! I too failed to find that post. I like the part about users difficulty to express their information need in the query. So I am updating like this: However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory. The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter. Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
          Hide
          Marvin Humphrey added a comment -

          The rationale behind the coarseness of the norms is that since the accuracy of
          search engines in retrieving the documents that the user really wants is so
          poor, only big differences matter. (It's not just poor "recall" against a
          given query, but the difficulty that the user experiences in formulating a
          proper query to express what they're really looking for in the first place.)

          Doug wrote at least once about this some years back, but I haven't been
          able to track down the post.

          Show
          Marvin Humphrey added a comment - The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.) Doug wrote at least once about this some years back, but I haven't been able to track down the post.
          Hide
          Jiri Kuhn added a comment -

          Yes, sounds great.

          The consequence of encoding norms as bytes is that norms can't be used to fine tune computed document score as I tried to (and failed). I was forced to use function query to achieve my goals.

          Show
          Jiri Kuhn added a comment - Yes, sounds great. The consequence of encoding norms as bytes is that norms can't be used to fine tune computed document score as I tried to (and failed). I was forced to use function query to achieve my goals.
          Hide
          Doron Cohen added a comment -

          However the resulted norm value is encoded as a single byte before being stored.

          Maybe this is an ideal time to document reasons why norms are encoded as bytes.

          Perhaps something like: "Once a field is referenced at search time, its norms - for all documents - are maintained in memory, hence it is important to keep norms footprint low." ...?

          Show
          Doron Cohen added a comment - However the resulted norm value is encoded as a single byte before being stored. Maybe this is an ideal time to document reasons why norms are encoded as bytes. Perhaps something like: "Once a field is referenced at search time, its norms - for all documents - are maintained in memory, hence it is important to keep norms footprint low." ...?
          Hide
          Jiri Kuhn added a comment -

          The last paragraph starts with:

          However the resulted norm value is encoded as a single byte before being stored.

          Maybe this is an ideal time to document reasons why norms are encoded as bytes.

          Show
          Jiri Kuhn added a comment - The last paragraph starts with: However the resulted norm value is encoded as a single byte before being stored. Maybe this is an ideal time to document reasons why norms are encoded as bytes.
          Hide
          Doron Cohen added a comment -

          Updated patch according to comments by Ted.

          To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html

          Show
          Doron Cohen added a comment - Updated patch according to comments by Ted. To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
          Hide
          Doron Cohen added a comment -

          Thanks Ted (for both your comments and kindness).

          I modified to "varies" as you suggested, and added the note about efficient computation of maxDoc(), but omitted the part of "pretty much correct except" because this might not be correct for certain applications.

          Patch to follow...

          Show
          Doron Cohen added a comment - Thanks Ted (for both your comments and kindness). I modified to "varies" as you suggested, and added the note about efficient computation of maxDoc(), but omitted the part of "pretty much correct except" because this might not be correct for certain applications. Patch to follow...
          Hide
          Ted Dunning added a comment -

          Thanks for reviewing this Ted. ... Might have helped to have better English than mine for this

          No problem reviewing this. Thanks to you for doing the hard part of actually writing it. Being a critic is easy in comparison.

          And you English is doing fine. If you get the ideas down, a hundred people can chime in with grammatical and spelling fixes. And frankly, your English is so much better than any of my other languages that I would never be brave enough to complain.

          I think all 3 formulas are required, just the gluing text should improve. ... I think I know how to write it better in this sense.

          Good point. This is the essence of my grumpiness.

          Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is related to Searcher.docFreq(Term), i.e., when one is inaccurate, so is the other, and in the same direction.

          Actually, in this case, I think that the motivation is that maxDoc is very commonly exactly correct and typically vastly more efficient. As you say, when it is wrong, docFreq can also be wrong the same way. My suggestion would be this:

          Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is more efficient to compute and is pretty much correct except for when many documents have been deleted. In any case Searcher.docFreq(Term) is likely to have a similar problem.

          Regarding the proportional/related issue, I think that your language is fine. At most, I would suggest "varies with" instead of "related" because it is slightly stronger, but you make the relationship abundantly clear in your text.

          Show
          Ted Dunning added a comment - Thanks for reviewing this Ted. ... Might have helped to have better English than mine for this No problem reviewing this. Thanks to you for doing the hard part of actually writing it. Being a critic is easy in comparison. And you English is doing fine. If you get the ideas down, a hundred people can chime in with grammatical and spelling fixes. And frankly, your English is so much better than any of my other languages that I would never be brave enough to complain. I think all 3 formulas are required, just the gluing text should improve. ... I think I know how to write it better in this sense. Good point. This is the essence of my grumpiness. Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is related to Searcher.docFreq(Term), i.e., when one is inaccurate, so is the other, and in the same direction. Actually, in this case, I think that the motivation is that maxDoc is very commonly exactly correct and typically vastly more efficient. As you say, when it is wrong, docFreq can also be wrong the same way. My suggestion would be this: Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is more efficient to compute and is pretty much correct except for when many documents have been deleted. In any case Searcher.docFreq(Term) is likely to have a similar problem. Regarding the proportional/related issue, I think that your language is fine. At most, I would suggest "varies with" instead of "related" because it is slightly stronger, but you make the relationship abundantly clear in your text.
          Hide
          Doron Cohen added a comment -

          Thanks for reviewing this Ted.

          the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".

          I see what you mean.

          I tried to take the reader of this from VSM to the actual elements computed and aggregated in Lucene scoring code. This would also answer questions several times asked in the lists: "but what is the scoring model of Lucene" - it is not that straightforward to tell why a certain method is called during scoring.

          But I think you have a good point - the reader is told "this is the scoring formula" just to discover 20 lines ahead that in fact "that is the formula" and yet again the same thing in another paragraph.

          I think all 3 formulas are required, just the gluing text should improve. Might have helped to have better English than mine for this, but I'll give it a try, I think I know how to write it better in this sense.

          There are also many small errors such as claiming that tf is proportional to term frequency and idf is proportional to inverse of document frequency. Proportional means that there is a linear relationship which is definitely not the case here. It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.

          I agree. "Proportional" is wrong. Thanks for catching this! In fact it appears wrongly in two other places in Similarity - idf() and in idfExplain(). In these two other places I think replacing it to "related" would be correct, i.e. like this:

          Note that Searcher.maxDoc() is used instead of
          org.apache.lucene.index.IndexReader.numDocs() 
          because it is related to Searcher.docFreq(Term) , 
          i.e., when one is inaccurate, so is the other, and 
          in the same direction.
          

          For tf and idf I think this will do:

          Tf and Idf are described in more detail below, 
          but for now, for completion, let's just say that 
          for given term t and document (or query) x, 
          Tf(t,x) is related to the number of occurrences of 
          term t in x - when one increases so does 
          the other - and idf(t) is similarly related to the 
          inverse of the number of index documents 
          containing term t. 
          
          Show
          Doron Cohen added a comment - Thanks for reviewing this Ted. the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>". I see what you mean. I tried to take the reader of this from VSM to the actual elements computed and aggregated in Lucene scoring code. This would also answer questions several times asked in the lists: "but what is the scoring model of Lucene" - it is not that straightforward to tell why a certain method is called during scoring. But I think you have a good point - the reader is told "this is the scoring formula" just to discover 20 lines ahead that in fact "that is the formula" and yet again the same thing in another paragraph. I think all 3 formulas are required, just the gluing text should improve. Might have helped to have better English than mine for this, but I'll give it a try, I think I know how to write it better in this sense. There are also many small errors such as claiming that tf is proportional to term frequency and idf is proportional to inverse of document frequency. Proportional means that there is a linear relationship which is definitely not the case here. It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency. I agree. "Proportional" is wrong. Thanks for catching this! In fact it appears wrongly in two other places in Similarity - idf() and in idfExplain(). In these two other places I think replacing it to "related" would be correct, i.e. like this: Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is related to Searcher.docFreq(Term) , i.e., when one is inaccurate, so is the other, and in the same direction. For tf and idf I think this will do: Tf and Idf are described in more detail below, but for now, for completion, let's just say that for given term t and document (or query) x, Tf(t,x) is related to the number of occurrences of term t in x - when one increases so does the other - and idf(t) is similarly related to the inverse of the number of index documents containing term t.
          Hide
          Ted Dunning added a comment -

          I apologize for complaining here without actually editing this text to illustrate what I would rather see, but the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".

          There are also many small errors such as claiming that tf is proportional to term frequency and idf is proportional to inverse of document frequency. Proportional means that there is a linear relationship which is definitely not the case here. It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.

          Show
          Ted Dunning added a comment - I apologize for complaining here without actually editing this text to illustrate what I would rather see, but the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>". There are also many small errors such as claiming that tf is proportional to term frequency and idf is proportional to inverse of document frequency. Proportional means that there is a linear relationship which is definitely not the case here. It would be better to say tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.
          Hide
          Doron Cohen added a comment -

          New patch is more accurate about doc-length-normalization choices.

          For easy reviewing I updated http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html accordingly.

          I plan to commit this in a day or two if there are no objections.

          Show
          Doron Cohen added a comment - New patch is more accurate about doc-length-normalization choices. For easy reviewing I updated http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html accordingly. I plan to commit this in a day or two if there are no objections.
          Hide
          Doron Cohen added a comment -

          The intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and its just comparing it to the computed relevancy curve that tells you its not great.

          To my (current) understanding it goes like this: normalizing all V(d)'s to unit vector is losing all information about lengths of documents. For a large document made by duplicating a smaller one this is probably ok. For a large document which just contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that would not normalize V(d) to the unit vector. (very much in line with what you (Mark) wrote above, finally I understand this...).

          The pivoted length normalization which you mentioned is one different such normalization. Juru in fact is using this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing such that all require components were indexed as stored/cached fields and at search time we could try various scoring algorithms). It is interesting that pivoted length normalization did not work well - by our experiments - with Lucene for TREC.

          The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good - at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of same N distinct terms, but its length is 2N because each term appears twice. For doc1 DS will normalize to the unit vector same as EN, and for doc2 DS will normalize to a vector larger than the unit vector. However I think the desired behavior is the other way around - for doc2 to be the same as EN, and for doc1 to be normalized to a vector larger than the unit vector.

          Back to the documentation patch, again it seems wrong presenting as if both EU and some additional doc length normalization are required - fixed patch to follow...

          Show
          Doron Cohen added a comment - The intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and its just comparing it to the computed relevancy curve that tells you its not great. To my (current) understanding it goes like this: normalizing all V(d)'s to unit vector is losing all information about lengths of documents. For a large document made by duplicating a smaller one this is probably ok. For a large document which just contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that would not normalize V(d) to the unit vector. (very much in line with what you (Mark) wrote above, finally I understand this...). The pivoted length normalization which you mentioned is one different such normalization. Juru in fact is using this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing such that all require components were indexed as stored/cached fields and at search time we could try various scoring algorithms). It is interesting that pivoted length normalization did not work well - by our experiments - with Lucene for TREC. The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good - at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of same N distinct terms, but its length is 2N because each term appears twice. For doc1 DS will normalize to the unit vector same as EN, and for doc2 DS will normalize to a vector larger than the unit vector. However I think the desired behavior is the other way around - for doc2 to be the same as EN, and for doc1 to be normalized to a vector larger than the unit vector. Back to the documentation patch, again it seems wrong presenting as if both EU and some additional doc length normalization are required - fixed patch to follow...
          Hide
          Mark Miller added a comment -

          In that work we got best results from Lucene

          Thats funny - I didn't notice you were an author one that one! Small world.

          The original idea of why the cosine normalization for the doc vector is bad, I got from the free intro to info retrieval book thats out there -
          but what it says doesn't fully jive with the info I am finding elsewhere, or my own common sense. Thats what has me most confused
          at the moment - the intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says
          the math tells you no such thing, and its just comparing it to the computed relevancy curve that tells you its not great.

          I dunno - though at least I know a lot more than I did a few days ago - it never even occurred to me how the scoring we did equated to any kind
          of dot product before this - I used to read Lucene's scoring algorithm and then look at the code and it was like .... okay - sure ... - so I've come a long way.

          Show
          Mark Miller added a comment - In that work we got best results from Lucene Thats funny - I didn't notice you were an author one that one! Small world. The original idea of why the cosine normalization for the doc vector is bad, I got from the free intro to info retrieval book thats out there - but what it says doesn't fully jive with the info I am finding elsewhere, or my own common sense. Thats what has me most confused at the moment - the intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and its just comparing it to the computed relevancy curve that tells you its not great. I dunno - though at least I know a lot more than I did a few days ago - it never even occurred to me how the scoring we did equated to any kind of dot product before this - I used to read Lucene's scoring algorithm and then look at the code and it was like .... okay - sure ... - so I've come a long way.
          Hide
          Doron Cohen added a comment -

          I'm still a little confused I guess

          That makes too of us...

          The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/ under punish is in comparison to a relevancy curve they figure out ( a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad ? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass.

          In that work we got best results from Lucene (for TREC) with SweetSpot similarity and normalizing tf by average term-freq in doc.

          For me it was mainly experimental and intuitive, but I was not able to convince Hoss (or even convince myslf once I read Hoss comments) that this was justified theoretically.

          At that time I was not aware of the V(d) normalization delicacy we are discussing now. I think I understand things better now, and still I am not sure. Need to read http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html and sleep on it...

          Show
          Doron Cohen added a comment - I'm still a little confused I guess That makes too of us... The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/ under punish is in comparison to a relevancy curve they figure out ( a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad ? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass. In that work we got best results from Lucene (for TREC) with SweetSpot similarity and normalizing tf by average term-freq in doc. For me it was mainly experimental and intuitive, but I was not able to convince Hoss (or even convince myslf once I read Hoss comments) that this was justified theoretically. At that time I was not aware of the V(d) normalization delicacy we are discussing now. I think I understand things better now, and still I am not sure. Need to read http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html and sleep on it...
          Hide
          Doron Cohen added a comment -

          Attached fixes according to comments by Shai and Mark:

          • spell/typos
          • slanting definitions and in-text formulas.
          • normalization of doc-vector/doc-length is description was fixed.
          • more "eye friendly" colors in the two large formulas.
          • added link to Euclidean-norm in the last paragraph on query-norm computation.

          For convenient review of the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html

          Show
          Doron Cohen added a comment - Attached fixes according to comments by Shai and Mark: spell/typos slanting definitions and in-text formulas. normalization of doc-vector/doc-length is description was fixed. more "eye friendly" colors in the two large formulas. added link to Euclidean-norm in the last paragraph on query-norm computation. For convenient review of the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
          Hide
          Mark Miller added a comment - - edited

          Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure.

          Its only desired behavior if you have a corpus that favors it, but most do. I think you can think of the |V(d)| as taking out information about the document length - you start with an m space vector, with each term given a weight depending on how many times it occurs - in other words, there is information about the length of that document there, and when you normalize by |V(d)|, you will take out that information - but it will appear more similar the more unique terms it started with and the higher the tf's. So that method favors long docs, witch will naturally have more of both, though of course not always be more similar.

          All of the normalizations I have seen appear in the vein of |V(d)| -eg 1/root(something). All of the others also try and make up for this doc length problem - by messing with the curve so that ultra long docs are not favored too highly. Our default method is a known not very good one - buts its also very fast and efficient compared to the better ones. Sweetspot is much better and I think still efficient - but you need to tune the right params I believe.

          • edit *

          I'm still a little confused I guess The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/ under punish is in comparison to a relevancy curve they figure out ( a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad ? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass.

          Show
          Mark Miller added a comment - - edited Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure. Its only desired behavior if you have a corpus that favors it, but most do. I think you can think of the |V(d)| as taking out information about the document length - you start with an m space vector, with each term given a weight depending on how many times it occurs - in other words, there is information about the length of that document there, and when you normalize by |V(d)|, you will take out that information - but it will appear more similar the more unique terms it started with and the higher the tf's. So that method favors long docs, witch will naturally have more of both, though of course not always be more similar. All of the normalizations I have seen appear in the vein of |V(d)| -eg 1/root(something). All of the others also try and make up for this doc length problem - by messing with the curve so that ultra long docs are not favored too highly. Our default method is a known not very good one - buts its also very fast and efficient compared to the better ones. Sweetspot is much better and I think still efficient - but you need to tune the right params I believe. edit * I'm still a little confused I guess The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/ under punish is in comparison to a relevancy curve they figure out ( a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad ? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass.
          Hide
          Doron Cohen added a comment -

          Mark and Shai Thanks for reviewing!

          Mark, I think you have a point here (and I am definitely no more an IR guy than you are ).

          Truth is I was surprised to find out (through your comments in LUCENE-1896) that this component of the score is "missing", and I indeed thought that the "right thing to do" (if there is such thing as "right") really is to do both: normalize to the unit vector, and then normalize by length to compensate for "unfair" advantage of long documents.

          But you're right, and the way I presented V(d) normalization and doc-length normalization is incorrect, as if it is a the right thing to do both, and the way it is presented is not doing justice to Lucene. I will change the writing.

          Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have two occurrences of each of the N distinct terms, its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's default similarity's length norm would be 1/sqrt(2N). So we will punish documents with duplicate terms less than would the Euclidean norm...

          I am not aware of an evaluation or discussion of this - I mean - why was this approach selected, and so I assumed (under question) that it was merely for performance considerations. You said in Lucene-1896:

          not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it.

          Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure.

          Anyhow this issue more about describing what Lucene is doing today than on what should Lucene do, so think I have the correct picture now (except for historical justification which is interesting but not a show stopper).

          Shai thanks for the fixes.

          (updated patch to follow).

          Show
          Doron Cohen added a comment - Mark and Shai Thanks for reviewing! Mark, I think you have a point here (and I am definitely no more an IR guy than you are ). Truth is I was surprised to find out (through your comments in LUCENE-1896 ) that this component of the score is "missing", and I indeed thought that the "right thing to do" (if there is such thing as "right") really is to do both: normalize to the unit vector, and then normalize by length to compensate for "unfair" advantage of long documents. But you're right, and the way I presented V(d) normalization and doc-length normalization is incorrect, as if it is a the right thing to do both, and the way it is presented is not doing justice to Lucene. I will change the writing. Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have two occurrences of each of the N distinct terms, its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's default similarity's length norm would be 1/sqrt(2N). So we will punish documents with duplicate terms less than would the Euclidean norm... I am not aware of an evaluation or discussion of this - I mean - why was this approach selected, and so I assumed (under question) that it was merely for performance considerations. You said in Lucene-1896: not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it. Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure. Anyhow this issue more about describing what Lucene is doing today than on what should Lucene do, so think I have the correct picture now (except for historical justification which is interesting but not a show stopper). Shai thanks for the fixes. (updated patch to follow).
          Hide
          Shai Erera added a comment -

          This looks great ! Really helpful. Few comments:

          • typo: "perscoective" should be "perspective"
          • Where you describe the different components, some bullets are missing their "component" name, like query-boost (you write it for doc-boost for example).
          • I think if you make those "component" slanted or something, they'll stand out better? This is in general for this page, not just this section.
          Show
          Shai Erera added a comment - This looks great ! Really helpful. Few comments: typo: "perscoective" should be "perspective" Where you describe the different components, some bullets are missing their "component" name, like query-boost (you write it for doc-boost for example). I think if you make those "component" slanted or something, they'll stand out better? This is in general for this page, not just this section.
          Hide
          Mark Miller added a comment - - edited

          Looks great!

          Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerstions .

          Hmm - I'm not sure if that is right either. Are we not replacing the |V(d)| normalization factor with our document length factor?

          That's how it appears to me anyway -

          for |V(d)| you have many options right?

          the cosine normalization - your standard euclidean length - |V(d)|
          or none (eg 1)
          or pivoted normalized doc length
          or SweetSpotSimilarity's formula
          or the quick,dirty,easy, not great default doc length normalization that Lucene uses by default
          or Okapi's formula,
          or ...

          So we are replacing (which everyone generally does) not dropping right?

          And I don't think we are replacing for performance reasons (though it is complicated to calculate) - we are replacing because its not a great normalization factor.
          Using |V(d)| eliminates info on the length of the orig document - but longer documents will still have higher tf's and more distinct terms - so it unnaturally gives them
          an advantage (most long docs will be repeated pieces or cover multiple topics). So its not generally a good normalization factor, and we have a chosen another?
          (the one we have chosen isnt great either - long docs are punished too much and short preferred too much)

          Again, I'm not an IR guy, but thats what my modest take is.

          • edit: I suppose you could argue that you could do cosine normalization and then a further normalization approach on that, and in that sense we are dropping the cosine normalization because its too expensive. But from what I can see, it appears more the case that you try and use a normalization factor that can just replace cosine normalization - like the pivoted normalization which rotates the cosine normalization curve. I think pivoted is something like 1/root(stuff, ie unique terms), so our norm of 1/root(L) is of a similar, simpler, vein. So thats why I think we are not really dropping it - we are choosing one of the variety of replacements.
          Show
          Mark Miller added a comment - - edited Looks great! Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerstions . Hmm - I'm not sure if that is right either. Are we not replacing the |V(d)| normalization factor with our document length factor? That's how it appears to me anyway - for |V(d)| you have many options right? the cosine normalization - your standard euclidean length - |V(d)| or none (eg 1) or pivoted normalized doc length or SweetSpotSimilarity's formula or the quick,dirty,easy, not great default doc length normalization that Lucene uses by default or Okapi's formula, or ... So we are replacing (which everyone generally does) not dropping right? And I don't think we are replacing for performance reasons (though it is complicated to calculate) - we are replacing because its not a great normalization factor. Using |V(d)| eliminates info on the length of the orig document - but longer documents will still have higher tf's and more distinct terms - so it unnaturally gives them an advantage (most long docs will be repeated pieces or cover multiple topics). So its not generally a good normalization factor, and we have a chosen another? (the one we have chosen isnt great either - long docs are punished too much and short preferred too much) Again, I'm not an IR guy, but thats what my modest take is. edit: I suppose you could argue that you could do cosine normalization and then a further normalization approach on that, and in that sense we are dropping the cosine normalization because its too expensive. But from what I can see, it appears more the case that you try and use a normalization factor that can just replace cosine normalization - like the pivoted normalization which rotates the cosine normalization curve. I think pivoted is something like 1/root(stuff, ie unique terms), so our norm of 1/root(L) is of a similar, simpler, vein. So thats why I think we are not really dropping it - we are choosing one of the variety of replacements.
          Hide
          Doron Cohen added a comment -
          Show
          Doron Cohen added a comment - Suggested Similarity javadoc can be reiewed here: http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
          Hide
          Doron Cohen added a comment -

          just relates to, not blocking...

          Show
          Doron Cohen added a comment - just relates to, not blocking...
          Hide
          Doron Cohen added a comment -

          Attached patch adds explanations on how current scoring function is derived from VSM.
          I hope it is not to "colorful", and that it indeed clarifies rather than confuses.
          I marked it for 2.9, because if others like it I think it would be good to include it, but if there is no concenzus on this modification let's move it to a future release.

          Show
          Doron Cohen added a comment - Attached patch adds explanations on how current scoring function is derived from VSM. I hope it is not to "colorful", and that it indeed clarifies rather than confuses. I marked it for 2.9, because if others like it I think it would be good to include it, but if there is no concenzus on this modification let's move it to a future release.

            People

            • Assignee:
              Doron Cohen
              Reporter:
              Doron Cohen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development