Mark and Shai, thanks for reviewing!
Mark, I think you have a point here (and I am definitely no more an IR guy than you are).
Truth is, I was surprised to find out (through your comments in LUCENE-1896) that this component of the score is "missing", and I indeed thought that the "right thing to do" (if there is such a thing as "right") really is to do both: normalize to the unit vector, and then normalize by length to compensate for the "unfair" advantage of long documents.
But you're right: the way I presented V(d) normalization and doc-length normalization is incorrect, since it reads as if doing both is the right thing to do, and that does not do justice to Lucene. I will change the writing.
Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have two occurrences of each of the N distinct terms, its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's default similarity's length norm would be 1/sqrt(2N). So we will punish documents with duplicate terms less than would the Euclidean norm...
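To double-check the arithmetic above, here is a small sketch (Python, not Lucene code). It assumes V(d) is weighted by raw term frequencies with idf ignored, and that the length norm is simply 1/sqrt(number of tokens) with no boosts or norm encoding. The function names are made up for illustration.

```python
import math
from collections import Counter

def euclidean_norm_factor(tokens):
    # 1 / ||V(d)||, assuming raw term frequencies as weights (idf ignored).
    tf = Counter(tokens)
    return 1.0 / math.sqrt(sum(f * f for f in tf.values()))

def length_norm_factor(tokens):
    # 1 / sqrt(number of tokens), the shape of the default length norm
    # (assumption: no boosts, no lossy norm encoding).
    return 1.0 / math.sqrt(len(tokens))

N = 4
doc = [f"t{i}" for i in range(N)]  # N distinct terms, one occurrence each
doubled = doc * 2                  # same terms, two occurrences each

for name, d in (("single", doc), ("doubled", doubled)):
    print(name, euclidean_norm_factor(d), length_norm_factor(d))
```

For the single document the two factors coincide at 1/sqrt(N); for the doubled one the Euclidean factor is 1/sqrt(4N) while the length norm is 1/sqrt(2N), so the length norm is the larger (less punishing) of the two.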
I am not aware of any evaluation or discussion of why this approach was selected, so I assumed (with a question mark) that it was merely for performance considerations. You said in LUCENE-1896:
"not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it."
Is it really better? Compared with the Euclidean norm, it seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really the desired behavior? My intuition says no, but I am not sure.
Anyhow, this issue is more about describing what Lucene is doing today than about what Lucene should do, so I think I have the correct picture now (except for the historical justification, which is interesting but not a show stopper).
Shai, thanks for the fixes.
(updated patch to follow).