Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: general/javadocs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      See http://markmail.org/message/arai6silfiktwcer

      The javadoc confuses me as well.

      Attachments

      1. LUCENE-1896.patch (1 kB) - Mark Miller
      2. queryNormAlternate.patch (11 kB) - Mark Miller

        Issue Links

          Activity

          Mark Miller added a comment -

          Suggestions? I am no IR guy.

          It would appear to me that this component is simply the part where you convert the vectors to unit vectors. Or are we just dividing by the product of the Euclidean lengths? It appears to be the same thing in either case to me.

          So it looks like we kind of factor that out as a constant (I barely understand it myself - don't worry) -

          In which case, why do we need it? Is it just there as a remnant of the math? Yes, it will give us the sim measure as the cosine - but why do we care if it's a scaled cosine (e.g. not the cosine, but the same relative scores) for less work? What do we need 1/(Sum(W^2))^1/2 for? Seems like wasted calculations. Who is the academic that kept this in! ...

          Actually, I don't have a clue if it makes sense to keep it or not. Or even whether or not I am talking out my ...

          Wherefore art thou Doug Cutting.
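
          For reference, the 1/(Sum(W^2))^1/2 factor in question is just the default queryNorm. A minimal sketch of roughly what DefaultSimilarity does with it (a paraphrase for illustration, not a verbatim copy of the source):

          // sumOfSquaredWeights is the sum of the squared per-clause query
          // weights, computed once per query by the Weight during setup.
          public float queryNorm(float sumOfSquaredWeights) {
            return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
          }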

          Mark Miller added a comment - edited

          Hmm ... so perhaps the idea that the query scores are more comparable comes from this:

          If we take it out, that means the score will be the cosine times the Euclidean lengths - and those will be rather large for a small query vector and a large doc vector (in magnitude). So a query that matched many large docs could scale a lot harder than a query that hit small docs?

          I guess that makes sense - with scores not being very comparable anyway, though, it hardly seems worth the extra compute cost ...

          Mark Miller added a comment -

          heh - the javadoc actually makes sense to me now. It is correct, and it's not even confusing - perhaps cryptic, but not confusing:

          "It makes scores from different queries more comparable"

          Mark Miller added a comment -

          bq. it hardly seems worth the extra compute cost

          Marvin mentioned that the compute cost is essentially 0, so that statement is likely bs ...

          At essentially no cost, I can see why the IR dudes leave it in.

          Marvin Humphrey added a comment -

          FWIW, after all that fuss, I wound up leaving it in.

          From the standpoint of ordinary users, queryNorm() is harmless or mildly
          beneficial. Scores are never going to be comparable across multiple queries
          without what I normally think of as "normalization" (given my background
          in audio): setting the top score to 1.0, and multiplying all other scores by the
          same factor. Nevertheless, it's better for them to be closer together than
          farther apart.

          From the standpoint of users trying to write Query subclasses, it's a wash.
          On the one hand, it's not the most important method, since it doesn't affect
          ranking within a single query – and zapping it would mean one less thing to
          think about. On the other hand, it's nice to have it in there for the sake of
          completeness in the implementation of cosine similarity.

          I eventually wound up messing with how the query norm gets applied to
          achieve my de-voodoo-fication goals. Essentially, I hid away queryNorm() so
          that you don't need to think about it unless you really need it.

          Grant Ingersoll added a comment -

          I know the queryNorm was based on some long-ago research, but I don't know that it has proven to be all that useful. There are certainly others who have done work on this, too. In the end, I don't think the current implementation is all that useful, and we could likely save a few ops by removing it, or giving people more powerful ways of changing it.

          The bottom line, I think, is that you shouldn't compare scores across queries. Oftentimes you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity", nor the comment about scores being "closer together than farther apart".

          Mark Miller added a comment - edited

          bq. The bottom line, I think, is that you shouldn't compare scores across queries. Oftentimes you can't even compare scores for the same query if the underlying index changed. I also don't understand Marvin's comment about "completeness in the implementation of cosine similarity", nor the comment about scores being "closer together than farther apart".

          It's not about being able to compare scores across queries per se.

          The cosine similarity is calculated using the definition:

           cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
          

          queryNorm corresponds to the denominator of the right side of that equation. Suppose we take it out - that's like multiplying cos(a) by the product of the magnitudes of the vectors:

           |V(q)||V(d)| * cos(a) = V(q) dot V(d)
          

          So rather than getting the cosine as a factor, we get the cosine times a number that depends on the Euclidean lengths of the query/doc vectors.

          That's what he means by completeness of the implementation of the cosine sim. Rather than getting a cosine score, you get the cosine scaled by a somewhat arbitrary amount (that depends on the doc vector mostly).

          Getting rid of that skew at essentially no cost makes a lot of sense to me - which jibes with why the IR lit and Doug keep it around. It's not there to make scores between queries comparable - but it makes them way more comparable than they would be, and it adds to the "completeness in the implementation of cosine similarity" - at what I am trusting is essentially no cost. It's a keeper from my point of view. It's not based on research - it's just the math of the formula - and if it had any real expense, it would likely have been tossed long ago (in the IR world).

          Mark Miller added a comment -

          In other words - we get a measure that relates to the angle between the vectors, not one that relates to the angle as well as the magnitudes of the vectors (which doesn't help as a sim measurement and skews the scores across queries). Another way to think of it is that we are converting the vectors to unit vectors and just considering the angles between them - not entirely necessary, but nice if it's free.
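
          A toy 2-D example of that point (invented numbers; d1 and d2 point in the same direction, but d2 is ten times longer - think of a longer doc with the same term mix):

          // Raw dot products reward sheer vector length; dividing by the
          // magnitudes recovers the angle-only (cosine) measure.
          double[] q  = {1, 1};
          double[] d1 = {2, 1};
          double[] d2 = {20, 10};                     // same direction as d1, 10x longer

          double dotQD1 = q[0]*d1[0] + q[1]*d1[1];    // 3.0
          double dotQD2 = q[0]*d2[0] + q[1]*d2[1];    // 30.0 - favors the longer vector

          double normQ  = Math.sqrt(q[0]*q[0] + q[1]*q[1]);
          double normD1 = Math.sqrt(d1[0]*d1[0] + d1[1]*d1[1]);
          double normD2 = Math.sqrt(d2[0]*d2[0] + d2[1]*d2[1]);

          double cos1 = dotQD1 / (normQ * normD1);    // ~0.949
          double cos2 = dotQD2 / (normQ * normD2);    // ~0.949 - same angle, same score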

          Marvin Humphrey added a comment -

          > at what I am trusting is essentially no cost.

          Here's the snippet from TermQuery.score() where queryNorm() actually
          gets applied to each document's score:

          float raw =                                   // compute tf(f)*weight
            f < SCORE_CACHE_SIZE                        // check cache
            ? scoreCache[f]                             // cache hit
            : getSimilarity().tf(f)*weightValue;        // cache miss
          

          At this point, queryNorm() has already been factored into weightValue (and
          scoreCache). It happens during setup. You can either scale weightValue by
          queryNorm() during setup or not – the per-document computational cost is
          unaffected.
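
          For context, roughly how that happens at setup time - a paraphrase of TermWeight.normalize() with approximate field names, not a verbatim copy:

          float idf;          // idf(t), computed once per query
          float queryWeight;  // idf * boost, before normalization
          float value;        // the weightValue that score() uses per document

          public void normalize(float queryNorm) {
            queryWeight *= queryNorm;    // queryNorm folded in once, at setup
            value = queryWeight * idf;   // per-document cost is unchanged either way
          }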

          Hoss Man added a comment -

          As I mentioned in the linked thread, the biggest advantage I know of for queryNorm is that it is a reduction factor applied to the constituent parts of a complex score prior to multiplication – so it helps prevent loss of information due to floating point accuracy that could arise otherwise.

          But then again: that's what the default queryNorm does ... an alternate queryNorm could do something else (like be a no-op).

          Since the target audience of the Similarity javadocs is mainly people who are interested in customizing the scoring, perhaps it should be something like...

          The queryNorm is a uniform normalization factor computed from the sumOfSquareWeights for the query which is then applied to each of the clauses of the query. In some cases this can be useful for attempting to keep scores from simple queries semi-comparable. The default implementation is...

          Mark Miller added a comment -

          We can't be that worried about decimal accuracy - we use float, which is good for about 6 digits - if we were, wouldn't we use double (good for over double that)?

          Something about this argument hasn't clicked in my head yet.

          If that was helpful (to scale by a somewhat arbitrary dampening factor), we would just start by multiplying every score by .01 or whatever. It doesn't make sense to me.

          Hoss Man added a comment -

          You could have a queryNorm implementation that always returned 0.01f ... but if you were dealing with weights that were all really large, it might not be enough ... and if you were dealing with weights that were extremely small, it might actually be counterproductive. That's why the default isn't arbitrary – it's a function of the weight.

          I never said it was a reason why queryNorm was there ... I just said it was an advantage I've observed in having it.

          I also didn't argue in favor of adding anything about that to the javadocs – I mentioned it only to explain one type of benefit that can arise from having "a uniform normalization factor computed from the sumOfSquareWeights for the query which is then applied to each of the clauses of the query".
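
          A toy illustration of the "function of the weight" point (the numbers are invented and exaggerated purely to show the scaling behaviour):

          float w1 = 4.0e18f, w2 = 3.0e18f;          // absurdly large clause weights

          float sumOfSquares = w1 * w1 + w2 * w2;    // 2.5e37f, still inside float range
          float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquares));   // ~2.0e-19f

          float n1 = w1 * queryNorm;                 // ~0.8f - pulled back toward 1.0
          float n2 = w2 * queryNorm;                 // ~0.6f

          float fixed = w1 * 0.01f;                  // ~4.0e16f - a fixed factor barely helps here,
                                                     // and would crush already-tiny weights instead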

          Mark Miller added a comment -

          I'm not arguing about it, I'm just trying to understand it, and whether it's actually helpful or not. Since it doesn't click in my head yet, I wouldn't have a leg to stand on in terms of an argument.

          Hoss Man added a comment -

          "Argue" was a poor choice of word; it implies someone is arguing back. What I should have said was that I don't advocate adding a mention of it to the docs.

          Mark Miller added a comment - edited

          Okay - I think I was a tad off base -

          Here is the cosine def used:

          cos(a) = V(q) dot V(d) /  |V(q)||V(d)|
          

          So the cosine is the query vector dotted with the document vector, divided by the product of the magnitudes of the vectors. Classically, |V(q)||V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine):

          cos(a) = v(q) dot v(d) 
          

          This is because the magnitude of a unit vector is 1 by definition.

          But we don't care about absolute numbers, just relative numbers (as has been often pointed out) - so the IR guys already fudge this stuff.

          While I thought before that the queryNorm corresponds to |V(q)||V(d)|, I was off - it's just |V(q)|. |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties - a longer doc would most likely have a larger magnitude. edit: not just similar properties, but many times better properties - the standard normalization would not factor in document length at all; it essentially removes it.

          So one strategy is just to not normalize the query - though the lit I see doing this is very inefficiently calculating the query norm in the inner loop. We are not doing that, and so it's not much of an optimization for us.

          score = V(q) dot V(d) / |V(d)|  =  cos(a) * |V(q)|
          

          And it does make queries more comparable (an odd goal, I know, but for free?).

          Sorry I was a little off earlier - I just tried to learn all this myself - and linear alg was years ago, and open book tests lured my younger, more irresponsible self into not going to the classes ...

          Anyhow, that's my current understanding - please point out if you know I have something wrong.
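
          For reference, this lines up with the practical scoring formula from the Similarity javadocs of this era roughly as follows (paraphrased from memory, so treat the exact factoring as approximate):

          score(q,d) = coord(q,d) * queryNorm(q) * Sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

          where queryNorm(q) = 1/(sumOfSquaredWeights)^1/2 plays the role of 1/|V(q)|, and norm(t,d) (the doc length normalization) stands in for 1/|V(d)|.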

          Mark Miller added a comment -

          The real problem with this is that while it wants to be pluggable, because the sumOfSquaredWeights is calculated in the Weight, it's really not so pluggable.

          The alg is not properly contained - this queryNorm is one divided by the square root of the sumOfSquaredWeights, but that whole thing is the norm. If you wanted to replace it with one, you shouldn't still be doing sumOfSquaredWeights.

          Perhaps it's a compromise on a difficult piece (I haven't considered that fully) - but technically it's kind of off.

          Mark Miller added a comment - edited

          Finally, after all that hot air, an attempt at a patch to address the issue.

          edit: I'll fix the atrocious spelling before committing.

          Mark Miller added a comment -

          I've committed for RC4 - I'll leave this open in case someone has an issue, and we can modify for final if we need to.

          Mark Miller added a comment -

          This is a tricky change I think, but shouldn't it be more like this?

          queryNorm takes the Weight rather than sumSquared - and then you can call sumSquared on the weight or skip it, saving the few calcs and isolating the work to the queryNorm impl (roughly as sketched below).

          The patch is unfinished though -

          It's an issue that queryNorm throws IOException - I don't think it should, and I removed it here, but that's not back compat. Pain in the butt.

          Also, I haven't given much thought to how we would migrate without breaking back compat - I guess the introspection again ...

          And perhaps we just leave it.
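
          A hypothetical sketch of the shape being proposed here - the idea behind queryNormAlternate.patch as described above, not the committed Lucene API, and the method bodies are illustrative only:

          // In a Similarity that keeps the classic behaviour, the proposed
          // signature would just delegate to the Weight:
          public float queryNorm(Weight weight) throws IOException {
            return (float) (1.0 / Math.sqrt(weight.sumOfSquaredWeights()));
          }

          // ...while a Similarity that wants no query normalization never
          // triggers the sum-of-squares pass at all:
          public float queryNorm(Weight weight) {
            return 1.0f;
          }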

          Doron Cohen added a comment -

          I kinda like the modification of queryNorm to get the Weight as a parameter.
          And I don't mind it throwing an exception - besides being perhaps redundant, is it a real problem?
          But in any case, perhaps this part (modifying the API) belongs in a separate issue?

          Back to the original queryNorm discussion - thanks Mark and Hoss for the explanations - I just realized I did not fully understand the use of the query norm. (Now I think I do, but I thought so too before this discussion....)

          I am working on a patch to the large javadoc part at the top of Similarity - to show more tightly how the scoring formula described there today (which I think is accurate) evolves from VSM - something along the lines of Mark's description above. I hope to complete it over the weekend, and I hope the added text will also clear up the confusion about the purpose and practice of the query norm.

          Doron Cohen added a comment -

          Hi Mark, I moved the work on the Similarity javadoc for scoring to LUCENE-1908, so I think this one can be closed?


            People

            • Assignee:
              Mark Miller
            • Reporter:
              Jiri Kuhn
            • Votes:
              0
            • Watchers:
              0
