Details

Type: Improvement

Status: Closed

Priority: Minor

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 2.9

Component/s: general/javadocs

Labels: None

Lucene Fields: New, Patch Available
Description
See http://markmail.org/message/arai6silfiktwcer
The javadoc confuses me as well.

 LUCENE-1896.patch
 1 kB
 Mark Miller

 queryNormAlternate.patch
 11 kB
 Mark Miller
Issue Links
 is related to

LUCENE-1907 sumOfSquared weights should be calculated as part of queryNorm
 Open

LUCENE-1908 Similarity javadocs for scoring function to relate more tightly to scoring models in effect
 Closed
Activity
Hmm ... so perhaps the idea that the query scores are more comparable comes from this:
if we take it out, that means the score will be the cosine times the product of the euclidean lengths, and that product will be rather large for a small query vector and a large doc vector (in magnitude). So a query that matched many large docs could scale a lot harder than a query that hit small docs?
I guess that makes sense. With scores not being very comparable anyway though, it hardly seems worth the extra compute cost ...
heh, the javadoc actually makes sense to me now. It is correct, and it's even not confusing. Perhaps cryptic, but not confusing:
bq. It makes scores from different queries more comparable
bq. it hardly seems worth the extra compute cost
Marvin mentioned the compute cost is essentially 0, so that statement is likely bs ...
At essentially no cost, I can see why the IR dudes leave it in.
FWIW, after all that fuss, I wound up leaving it in.
From the standpoint of ordinary users, queryNorm() is harmless or mildly
beneficial. Scores are never going to be comparable across multiple queries
without what I normally think of as "normalization" (given my background
in audio): setting the top score to 1.0, and multiplying all other scores by the
same factor. Nevertheless, it's better for them to be closer together than
farther apart.
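(A minimal sketch of the kind of post-hoc normalization described here; the method name and the raw scores array are mine, not Lucene's:)

// Scale one query's result scores so the top hit is exactly 1.0 and every
// other hit keeps its relative ratio to the top.
static float[] normalizeScores(float[] scores) {
    float max = 0f;
    for (float s : scores) {
        if (s > max) max = s;
    }
    if (max == 0f) return scores;          // nothing matched; nothing to scale
    float[] normalized = new float[scores.length];
    for (int i = 0; i < scores.length; i++) {
        normalized[i] = scores[i] / max;   // top score becomes 1.0
    }
    return normalized;
}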
From the standpoint of users trying to write Query subclasses, it's a wash.
On the one hand, it's not the most important method, since it doesn't affect
ranking within a single query – and zapping it would mean one less thing to
think about. On the other hand, it's nice to have it in there for the sake of
completeness in the implementation of cosine similarity.
I eventually wound up messing with how the query norm gets applied to
achieve my devoodoofication goals. Essentially, I hid away queryNorm() so
that you don't need to think about it unless you really need it.
I know the queryNorm was based on some long-ago research, but I don't know that it has proven to be all that useful. There are certainly others who have done work on this, too. In the end, I don't think the current implementation is all that useful, and we could likely save a few ops by removing it, or giving people more powerful ways of changing it.
The bottom line, I think, is you shouldn't compare scores across queries. Oftentimes, you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".
bq. The bottom line, I think, is you shouldn't compare scores across queries. Oftentimes, you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity" nor the comment about scores being "closer together than farther apart".
It's not about being able to compare scores across queries per se.
The cosine similarity is calculated using the definition
cos(a) = V(q) · V(d) / (|V(q)| |V(d)|)
queryNorm corresponds to the denominator of the right side of that equation. Suppose we take it out; that is like multiplying cos(a) by the product of the magnitudes of the vectors:
|V(q)| |V(d)| * cos(a) = V(q) · V(d)
So rather than getting the cosine as a factor, we get the cosine times a number that depends on the euclidean lengths of the query/doc vectors.
That's what he means by completeness of the implementation of the cosine similarity. Rather than getting a cosine score, you get the cosine scaled by a somewhat arbitrary amount (that depends mostly on the doc vector).
To get rid of that skew at essentially no cost makes a lot of sense to me, which jibes with why the IR lit and Doug keep it around. It's not there to make scores between queries comparable, but it makes them far more comparable than they would be otherwise, and it adds to the "completeness in the implementation of cosine similarity", at what I am trusting is essentially no cost. It's a keeper from my point of view. It's not based on research; it's just the math of the formula, and if it had any real expense, it would likely have been tossed long ago (in the IR world).
In other words, we get a measure that relates to the angle between the vectors, not a measure that relates to the angle as well as the magnitudes of the vectors (which doesn't help as a similarity measurement and skews the scores across queries). Another way to think of it is that we are converting the vectors to unit vectors and just considering the angles between them. Not entirely necessary, but nice if it's free.
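(To make the algebra concrete, a tiny self-contained sketch; plain vectors, not Lucene code, and the helper names are mine:)

// Plain-vector illustration of the identity above: the cosine is the dot
// product divided by the product of the euclidean lengths, so dropping the
// denominator leaves a score that grows with the vector magnitudes.
public class CosineSketch {
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static float magnitude(float[] v) {
        return (float) Math.sqrt(dot(v, v));
    }

    static float cosine(float[] q, float[] d) {
        return dot(q, d) / (magnitude(q) * magnitude(d));
    }

    public static void main(String[] args) {
        float[] q = {1f, 2f};
        float[] bigDoc = {30f, 40f};    // same direction as {3, 4}, 10x longer
        float[] smallDoc = {3f, 4f};
        // Same angle, so the cosines match ...
        System.out.println(cosine(q, bigDoc) + " == " + cosine(q, smallDoc));
        // ... but the raw dot products differ by the magnitude ratio:
        System.out.println(dot(q, bigDoc) + " vs " + dot(q, smallDoc));
    }
}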
> at what I am trusting is essentially no cost.
Here's the snippet from TermQuery.score() where queryNorm() actually
gets applied to each document's score:
float raw =                                   // compute tf(f)*weight
    f < SCORE_CACHE_SIZE                      // check cache
    ? scoreCache[f]                           // cache hit
    : getSimilarity().tf(f)*weightValue;      // cache miss
At this point, queryNorm() has already been factored into weightValue (and
scoreCache). It happens during setup. You can either scale weightValue by
queryNorm() during setup or not – the per-document computational cost is
unaffected.
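(For context, the setup Marvin describes happens once per query. Paraphrasing from memory of Query#weight in Lucene 2.9; details may differ slightly from the shipped code:)

// Roughly how queryNorm gets folded into the weights at setup time:
public Weight weight(Searcher searcher) throws IOException {
    Query query = searcher.rewrite(this);
    Weight weight = query.createWeight(searcher);
    float sum = weight.sumOfSquaredWeights();            // one pass over clauses
    float norm = getSimilarity(searcher).queryNorm(sum); // 1/sqrt(sum) by default
    weight.normalize(norm);                              // scales weightValue once
    return weight;
}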
as i mentioned in the linked thread, the biggest advantage i know of for queryNorm is that it is a reduction factor applied to the constituent parts of a complex score prior to multiplication – so it helps prevent loss of information due to floating point accuracy that could arise otherwise.
but then again: that's what the default queryNorm does ... an alternate queryNorm could do something else (like be a no-op)
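(A toy illustration of the reduction-factor point; the magnitudes are made up and exaggerated to force the issue:)

// Made-up, exaggerated weights: multiplying the raw parts overflows float,
// while scaling each part by a queryNorm-style factor first keeps the
// relative information intact.
public class ReductionFactorSketch {
    public static void main(String[] args) {
        float w1 = 3e19f, w2 = 3e19f;
        System.out.println(w1 * w2);                   // Infinity (overflow)
        double sumOfSquares = (double) w1 * w1 + (double) w2 * w2;
        float norm = (float) (1.0 / Math.sqrt(sumOfSquares));
        System.out.println((w1 * norm) * (w2 * norm)); // ~0.5, finite
    }
}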
Since the target audience of the Similarity javadocs is mainly people who are interested in customizing the scoring, perhaps it should be something like...
The queryNorm is a uniform normalization factor computed from the sumOfSquaredWeights for the query, which is then applied to each of the clauses of the query. In some cases this can be useful for attempting to keep scores from simple queries semi-comparable. The default implementation is...
We can't be that worried about decimal accuracy; we use float, which is good for about 6 digits. If we were, wouldn't we use double (good for over double that)?
something about this argument hasn't clicked in my head yet.
If that were helpful (to scale by a somewhat arbitrary damping factor), we would just start by multiplying every score by .01 or whatever. It doesn't make sense to me.
you could have a queryNorm implementation that always returned 0.01f ... but if you were dealing with weights that were all really large, it might not be enough ... and if you were dealing with weights that were extremely small, it might actually be counterproductive. that's why the default isn't arbitrary – it's a function of the weight.
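(For concreteness, such a constant or no-op queryNorm is a one-method Similarity override; a sketch against the 2.9-era API:)

import org.apache.lucene.search.DefaultSimilarity;

// Sketch of a constant queryNorm: every query gets the same scale factor,
// regardless of how large or small its summed squared weights are.
public class ConstantQueryNormSimilarity extends DefaultSimilarity {
    public float queryNorm(float sumOfSquaredWeights) {
        return 0.01f; // or 1.0f for a true no-op
    }
}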
i never said it was a reason why queryNorm was there ... i just said it was an advantage i've observed in having it.
I also didn't argue in favor of adding anything about that to the javadocs – i mentioned it only to explain one type of benefit that can arise from having "a uniform normalization factor computed from the sumOfSquaredWeights for the query which is then applied to each of the clauses of the query"
I'm not arguing about it, I'm just trying to understand it, and whether it's actually helpful or not. Since it doesn't click in my head yet, I wouldn't have a leg to stand on in terms of an argument.
argue was a poor choice of word, it implies someone is arguing back. what i should have said was that i don't advocate adding a mention of it to the docs.
Okay, I think I was a tad off base.
Here is the cosine definition used:
cos(a) = V(q) · V(d) / (|V(q)| |V(d)|)
So the cosine is the query vector dotted with the document vector, divided by the product of the magnitudes of the vectors. Classically, |V(q)| |V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine):
cos(a) = v(q) · v(d)
This is because the magnitude of a unit vector is 1 by definition.
But we don't care about absolute numbers, just relative numbers (as has been often pointed out), so the IR guys already fudge this stuff.
While I thought before that the queryNorm corresponds to |V(q)| |V(d)|, I was off: it's just |V(q)|. |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties (a longer doc would most likely have a larger magnitude). edit: not just similar properties, but in many ways better properties; the standard normalization would not factor in document length at all, it essentially removes it.
So one strategy is just to not normalize the query, though the lit I see doing this is very inefficiently calculating the query norm in the inner loop. We are not doing that, so it's not much of an optimization for us.
V(q) · V(d) / |V(d)| == cos(a) * |V(q)|
And it does make queries more comparable (an odd goal I know, but for free?)
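(A made-up worked example of that comparability point: say query A has weights (3, 4), so |V(q)| = 5, and two docs get dot products 10 and 20. Unnormalized, the scores are 10 and 20; divided by |V(q)| they become 2 and 4. Now say query B has weights (30, 40), so |V(q)| = 50, and the same two docs get dot products 100 and 200. Unnormalized, B's scores are 10x A's for no good reason; divided by |V(q)| they land at 2 and 4 as well. The ranking within each query never changes, but the scales across queries line up.)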
Sorry I was a little off earlier; I just tried to learn all this myself, linear algebra was years ago, and open book tests lured my younger, more irresponsible self into not going to the classes ...
Anyhow, that's my current understanding. Please point out if you know I have something wrong.
The real problem with this is that while it wants to be pluggable, because the sumOfSquaredWeights is calculated in the Weight, it's really not so pluggable.
The algorithm is not properly contained. This queryNorm is one divided by the square root of the sum of squared weights, but that's the whole norm. If you wanted to replace it with one, you shouldn't still be doing sumOfSquaredWeights.
Perhaps it's a compromise on a difficult piece (I have not considered it fully), but technically it's kind of off.
Finally, after all that hot air, an attempt at a patch to address the issue.
edit: I'll fix the atrocious spelling before committing.
I've committed for RC4. I'll leave this open in case someone has an issue, and we can modify for final if we need to.
This is a tricky change I think, but shouldn't it be more like this?
queryNorm takes the Weight rather than the sum of squared weights, and then you can call sumOfSquaredWeights on the Weight or skip it, saving the few calcs and isolating the work to the queryNorm impl.
The patch is unfinished though:
it's an issue that queryNorm throws IOException. I don't think it should, and I removed it here, but that's not back-compat. Pain in the butt.
also, I haven't given much thought to how we would migrate without breaking back-compat. I guess the introspection again ...
And perhaps we just leave it.
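(I haven't seen inside queryNormAlternate.patch, so this is only a guess at the shape of the change being described; signature and body assumed:)

// Hypothetical sketch of the proposed API: queryNorm receives the Weight, so
// an implementation can call sumOfSquaredWeights() or skip it entirely.
public float queryNorm(Weight weight) throws IOException {
    // Default keeps today's behavior:
    return (float) (1.0 / Math.sqrt(weight.sumOfSquaredWeights()));
    // A "norm of one" implementation would just return 1.0f and never
    // pay for sumOfSquaredWeights() at all.
}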
I kinda like the modification of queryNorm to get Weight as parameter.
And I don't mind it throwing an exception; besides being perhaps redundant, is it a real problem?
But in any case, perhaps this part (modifying the API) belongs in a separate issue?
Back to the original queryNorm discussion, thanks Mark and Hoss for the explanations  I just realized I did not fully understand the use of query norm. (Now I think I do, but I thought so too before this discussion.... )
I am working on a patch to the large javadoc part at the top of Similarity, to show more tightly how the scoring formula described there today (which I think is accurate) evolves from the VSM; something along the lines of Mark's description above. I hope to complete it over the weekend, and I hope the added text will also clear up the confusion about the purpose and practice of the query norm.
hi Mark, I moved the work on the Similarity javadoc for scoring to LUCENE-1908, so I think this one can be closed?
Suggestions? I am no IR guy.
It would appear to me that this component is simply the part where you convert the vectors to unit vectors. Or are we just dividing by the product of the euclidean lengths? It appears to be the same in either case to me.
So it looks like we kind of factor that out as a constant (I barely understand it myself too, don't worry).
In which case, why do we need it? Is it just there as a remnant of the math? Yes, it will give us the sim measure as the cosine, but why do we care if it's a scaled cosine (i.e. not the cosine, but the same relative scores) for less work? What do we need 1/sqrt(Sum(w^2)) for? Seems like wasted calculations. Who is the academic that kept this in! ...
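(For reference, the factor in question is the default queryNorm; as I recall, DefaultSimilarity in 2.9 implements it as the line below, though check the shipped source:)

// DefaultSimilarity's queryNorm, from memory of the 2.9 source:
public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}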
Actually, I don't have a clue if it makes sense to keep it or not. Or even whether or not I am talking out my ...
Wherefore art thou Doug Cutting.