Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-664

[PATCH] small fixes to the new scoring.html doc

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: general/website
    • Labels:
      None

      Description

      This is an awesome initiative. We need more docs that cleanly explain the inner workings of Lucene in general... thanks Grant & Steve & others!

      I have a few small initial proposed fixes, largely just adding some more description around the components of the formula. But also a couple typos, another link out to Wikipedia, a missing closing ), etc. I've only made it through the "Understanding the Scoring Formula" section so far.

      1. ASF.LICENSE.NOT.GRANTED--scoring-small-fixes.patch
        8 kB
        Michael McCandless
      2. boosts_plus_scoring_formula.patch
        62 kB
        Doron Cohen
      3. lucene.uxf
        11 kB
        Karl Wettin
      4. scoring_formula_2.patch
        13 kB
        Doron Cohen
      5. scoring-small-fixes2.patch
        31 kB
        Michael McCandless
      6. scoring-small-fixes3.patch
        24 kB
        Michael McCandless

        Activity

        Hide
        gsingers Grant Ingersoll added a comment -

        Thanks, Michael. I have applied the patch in scoring-small-fixes.patch and will update the docs.

        I will leave this Issue open so that you and others can add patches w/o opening a new Issue for every patch, at least until this document becomes "official".

        Show
        gsingers Grant Ingersoll added a comment - Thanks, Michael. I have applied the patch in scoring-small-fixes.patch and will update the docs. I will leave this Issue open so that you and others can add patches w/o opening a new Issue for every patch, at least until this document becomes "official".
        Hide
        karl.wettin Karl Wettin added a comment -

        This is the class diagram (in UMLet format) linked to in the document. It does not explain scoring that much - more of an abstract overview of Lucene than anything else. I might find some time to sequence up the comments.

        Show
        karl.wettin Karl Wettin added a comment - This is the class diagram (in UMLet format) linked to in the document. It does not explain scoring that much - more of an abstract overview of Lucene than anything else. I might find some time to sequence up the comments.
        Hide
        mikemccand Michael McCandless added a comment -

        OK I did some more small fixes to the sections "Query Classes" and "Changing Similarity". Just added some missing links, fixed a broken link, tightened up wording, etc. I'll attach a patch.

        Also a couple random questions:

        • Should this go into the Wiki or into the Web site? Is there a general criteria / guideline that we've converged on for Wiki vs Web site?
        • Is there a reason why we are linking to the "ViewVC" page in the Big Picture section? Can't we just link to the javadoc?
        • Should we change the title / purpose to be an overview of the Query classes and also details of scoring? I think the Query Classes section is a great intro to the different Query classes.
        Show
        mikemccand Michael McCandless added a comment - OK I did some more small fixes to the sections "Query Classes" and "Changing Similarity". Just added some missing links, fixed a broken link, tightened up wording, etc. I'll attach a patch. Also a couple random questions: Should this go into the Wiki or into the Web site? Is there a general criteria / guideline that we've converged on for Wiki vs Web site? Is there a reason why we are linking to the "ViewVC" page in the Big Picture section? Can't we just link to the javadoc? Should we change the title / purpose to be an overview of the Query classes and also details of scoring? I think the Query Classes section is a great intro to the different Query classes.
        Hide
        gsingers Grant Ingersoll added a comment -

        Hi Michael,

        Correct me if I am wrong, but does the second patch include the first patch, too? I have already committed the first, so if you could update and add a new version of the patch with your latest edits, that would be great.

        Thanks,
        Grant

        Show
        gsingers Grant Ingersoll added a comment - Hi Michael, Correct me if I am wrong, but does the second patch include the first patch, too? I have already committed the first, so if you could update and add a new version of the patch with your latest edits, that would be great. Thanks, Grant
        Hide
        mikemccand Michael McCandless added a comment -

        Doh! You are correct. OK I re-based the diffs off the current version. Sorry about that.

        Show
        mikemccand Michael McCandless added a comment - Doh! You are correct. OK I re-based the diffs off the current version. Sorry about that.
        Hide
        gsingers Grant Ingersoll added a comment -

        Committed patch 3. Thanks Michael!

        Show
        gsingers Grant Ingersoll added a comment - Committed patch 3. Thanks Michael!
        Hide
        cutting Doug Cutting added a comment -

        This is good stuff. Might it be better to include this in the javadoc? That way, as the code evolves, the documentation is more naturally maintained, since search-and-replace operations etc. will find it. I worry that as soon as this documentation is complete, we'll change the implementation and invalidate it.

        Show
        cutting Doug Cutting added a comment - This is good stuff. Might it be better to include this in the javadoc? That way, as the code evolves, the documentation is more naturally maintained, since search-and-replace operations etc. will find it. I worry that as soon as this documentation is complete, we'll change the implementation and invalidate it.
        Hide
        gsingers Grant Ingersoll added a comment -

        Yeah, I thought about putting it in Javadoc and I share your concern about invalidation, but I think that is always the case with all documentation, including javadocs, and is just something we have to live with and work on. I also think it is bigger picture than javadocs and spans multiple packages, so it doesn't really fit as nicely into the Javadoc model of package level documentation. My hope is that by making it a part of the main site and something people can easily find it will be better maintained.

        Show
        gsingers Grant Ingersoll added a comment - Yeah, I thought about putting it in Javadoc and I share your concern about invalidation, but I think that is always the case with all documentation, including javadocs, and is just something we have to live with and work on. I also think it is bigger picture than javadocs and spans multiple packages, so it doesn't really fit as nicely into the Javadoc model of package level documentation. My hope is that by making it a part of the main site and something people can easily find it will be better maintained.
        Hide
        steven_parkes Steven Parkes added a comment -

        I went through basically the same reasoning that Grant describes, particularly with loading the javadocs w/too much stuff. But I do like the wiki ability to give you linking akin to that in javadocs.

        As to renaming the wiki url (mentioned on dev though not in jira), I'm happy to approach infrastructure. My concern would be external links into the wiki that reference the old jakarata url. I guess we could find out if they can handle a redirect from jakarata-lucene/* to lucene/*?

        Show
        steven_parkes Steven Parkes added a comment - I went through basically the same reasoning that Grant describes, particularly with loading the javadocs w/too much stuff. But I do like the wiki ability to give you linking akin to that in javadocs. As to renaming the wiki url (mentioned on dev though not in jira), I'm happy to approach infrastructure. My concern would be external links into the wiki that reference the old jakarata url. I guess we could find out if they can handle a redirect from jakarata-lucene/* to lucene/*?
        Hide
        cutting Doug Cutting added a comment -

        Javadoc is easiest to keep in sync with changes to the code, since it is in the same files. The wiki is the hardest to keep in sync with the code, since it is not versioned with the code. We website is somewhere between: it is in subversion, but in separate files in a separate tree.

        The javadoc is thus the preferred location for documentation that is specific to the code. The website and wiki are better for stuff that's specific to the project: policys, procedures, etc. The wiki is great for user-generated stuff like benchmarks, porting tricks, use cases, etc.

        This stuff seems pretty closely tied to the code, so I'd put it in the javadoc. It's nearly all stuff that's in the search package, so much of this could go in search/package.html with pointers to the javadoc for Query, Weight, Scorer, etc.

        Show
        cutting Doug Cutting added a comment - Javadoc is easiest to keep in sync with changes to the code, since it is in the same files. The wiki is the hardest to keep in sync with the code, since it is not versioned with the code. We website is somewhere between: it is in subversion, but in separate files in a separate tree. The javadoc is thus the preferred location for documentation that is specific to the code. The website and wiki are better for stuff that's specific to the project: policys, procedures, etc. The wiki is great for user-generated stuff like benchmarks, porting tricks, use cases, etc. This stuff seems pretty closely tied to the code, so I'd put it in the javadoc. It's nearly all stuff that's in the search package, so much of this could go in search/package.html with pointers to the javadoc for Query, Weight, Scorer, etc.
        Hide
        gsingers Grant Ingersoll added a comment -

        OK, I have split this up into the javadocs and the scoring page. All the Query stuff and Similarity stuff is moved to o.a.l.s package documentation.

        Show
        gsingers Grant Ingersoll added a comment - OK, I have split this up into the javadocs and the scoring page. All the Query stuff and Similarity stuff is moved to o.a.l.s package documentation.
        Hide
        doronc Doron Cohen added a comment -

        (1) added a section in Scoring.xml for "Search Results Boosts", on ways to boost in Lucene, at search time and at indexing time.

        (2) updated the presentation of the scoring formula in Similarity.java, to:

        • closely reflect the scoring code/process.
        • distinguish between indexing time factors and search time factors, and
        • point to differences between a scoring notion (e.g.tf, idf) to the way it is computed.

        As result the scoring formula is presented differently in Similarity.java and in Scoring.html. I can update this if there are no objections to the updated formula presentation.

        Show
        doronc Doron Cohen added a comment - (1) added a section in Scoring.xml for "Search Results Boosts", on ways to boost in Lucene, at search time and at indexing time. (2) updated the presentation of the scoring formula in Similarity.java, to: closely reflect the scoring code/process. distinguish between indexing time factors and search time factors, and point to differences between a scoring notion (e.g.tf, idf) to the way it is computed. As result the scoring formula is presented differently in Similarity.java and in Scoring.html. I can update this if there are no objections to the updated formula presentation.
        Hide
        gsingers Grant Ingersoll added a comment -

        Hi Doron,

        Thanks for the updates. I like the content in scoring.xml, although I am inclined to place it after boosts have been introduced in the Scoring Formula section (what do you think? No need to send a new patch if you agree, I can take care of it)

        On the Similarity changes, I think I like them, but would like to hear from other people. One of the problems I had with the old writing of the formula is it is a bit hard to tell whether the coord factor is inside the sum or not at first glance and I always have to think about the theory a bit before being sure. With your formula, it is much clearer. You also have some nice explanations, but I also feel it takes a bit longer to get an understanding of the formula b/c you introduce some new factors (normalizer, searchBoost, indexBoost) that require an extra layer of indirection (but maybe it is just b/c I am late and I'm tired and need to take a fresh look this weekend).

        I think the scoring.xml and Similarity.java formulas should be kept in sync. Ideally we could do some type of include and only have one file with the formula, but I am not sure how to do this given the current documentation frameworks.

        Also, I think you need a link on the indexBoost factor to the indexBoost anchor (but again, I can take care of that if others are in favor of this part of the patch).

        So, in summary, I am +1 on part 1 of this patch and am all for applying that, but after the Scoring Formula section. Part 1 can be applied separately from part 2 and I will do so this weekend unless I hear objections otherwise.

        The 2nd part, I am +0.5, if there is such a thing. I really like the added explanation and the organization of the formula, but am not sure on the searchBoost, indexBoost and normalizer factors and would like to hear from others what they favor.

        -Grant

        Show
        gsingers Grant Ingersoll added a comment - Hi Doron, Thanks for the updates. I like the content in scoring.xml, although I am inclined to place it after boosts have been introduced in the Scoring Formula section (what do you think? No need to send a new patch if you agree, I can take care of it) On the Similarity changes, I think I like them, but would like to hear from other people. One of the problems I had with the old writing of the formula is it is a bit hard to tell whether the coord factor is inside the sum or not at first glance and I always have to think about the theory a bit before being sure. With your formula, it is much clearer. You also have some nice explanations, but I also feel it takes a bit longer to get an understanding of the formula b/c you introduce some new factors (normalizer, searchBoost, indexBoost) that require an extra layer of indirection (but maybe it is just b/c I am late and I'm tired and need to take a fresh look this weekend). I think the scoring.xml and Similarity.java formulas should be kept in sync. Ideally we could do some type of include and only have one file with the formula, but I am not sure how to do this given the current documentation frameworks. Also, I think you need a link on the indexBoost factor to the indexBoost anchor (but again, I can take care of that if others are in favor of this part of the patch). So, in summary, I am +1 on part 1 of this patch and am all for applying that, but after the Scoring Formula section. Part 1 can be applied separately from part 2 and I will do so this weekend unless I hear objections otherwise. The 2nd part, I am +0.5, if there is such a thing. I really like the added explanation and the organization of the formula, but am not sure on the searchBoost, indexBoost and normalizer factors and would like to hear from others what they favor. -Grant
        Hide
        doronc Doron Cohen added a comment -

        Hi Grant,

        For part 1, I am ok with having it after the scoring formula.

        For part 2, my motivation was to make it more clear in:

        • - what's inside the sum and what's outside (as you said).
        • - what's decided at indexing time and what's still controllable at search time.
        • - how boosts and encoding/decoding play in.
        • - what's fixed and what can be modified, by subclassing, say, DefaultSimilarity.

        So

        {indexBoost, searchBoost, normalizer}

        were the tools to clear this up, and also to make the formula shorter and easier to read in a glance.

        Naturally, after delving so deep into it is now clear to me, but you are right, it would be good to hear from others how they like this part.

        Thanks,
        Doron

        Show
        doronc Doron Cohen added a comment - Hi Grant, For part 1, I am ok with having it after the scoring formula. For part 2, my motivation was to make it more clear in: - what's inside the sum and what's outside (as you said). - what's decided at indexing time and what's still controllable at search time. - how boosts and encoding/decoding play in. - what's fixed and what can be modified, by subclassing, say, DefaultSimilarity. So {indexBoost, searchBoost, normalizer} were the tools to clear this up, and also to make the formula shorter and easier to read in a glance. Naturally, after delving so deep into it is now clear to me, but you are right, it would be good to hear from others how they like this part. Thanks, Doron
        Hide
        cutting Doug Cutting added a comment -

        I like the idea of refactoring this formula, but think we should be cautious about introducing new terms that look like java methods but are not.

        I don't see the distinction between queryNorm and normalizer. I am not convinced we need a new term here.

        I think 'norm' is a good term for the product of lengthNorm(d) and field boost. That's what it is called consistently in the code and API. This quantity is represented in two places, but seems like a logical candidate for the sort of factoring done here. This could be placed to the left of the sigma, since it does not depend on t.

        Finally, I think it might be clearer to call "searchBoost" instead t.getBoost().

        Show
        cutting Doug Cutting added a comment - I like the idea of refactoring this formula, but think we should be cautious about introducing new terms that look like java methods but are not. I don't see the distinction between queryNorm and normalizer. I am not convinced we need a new term here. I think 'norm' is a good term for the product of lengthNorm(d) and field boost. That's what it is called consistently in the code and API. This quantity is represented in two places, but seems like a logical candidate for the sort of factoring done here. This could be placed to the left of the sigma, since it does not depend on t. Finally, I think it might be clearer to call "searchBoost" instead t.getBoost().
        Hide
        doronc Doron Cohen added a comment -

        Two quick questions:

        > I think 'norm' is a good term for the product of lengthNorm(d) and
        > field boost. That's what it is called consistently in the code and API.
        > This quantity is represented in two places, but seems like a logical
        > candidate for the sort of factoring done here.

        Norm would also Include the doc boost, right?
        So this means replacing indexBoost with norm ?

        > This could be placed to the left of the sigma, since it does not depend on t.

        I think that norm depends on the field name and there may be terms of more than one field in the query.?

        Show
        doronc Doron Cohen added a comment - Two quick questions: > I think 'norm' is a good term for the product of lengthNorm(d) and > field boost. That's what it is called consistently in the code and API. > This quantity is represented in two places, but seems like a logical > candidate for the sort of factoring done here. Norm would also Include the doc boost, right? So this means replacing indexBoost with norm ? > This could be placed to the left of the sigma, since it does not depend on t. I think that norm depends on the field name and there may be terms of more than one field in the query.?
        Hide
        cutting Doug Cutting added a comment -

        > So this means replacing indexBoost with norm ?

        Yes, that was my suggestion.

        > I think that norm depends on the field name [ ...]

        You're right. Both components are indeed field-specific (lengthNorm and field.getBoost()). So maybe it can't move to the left of the sigma.

        Show
        cutting Doug Cutting added a comment - > So this means replacing indexBoost with norm ? Yes, that was my suggestion. > I think that norm depends on the field name [ ...] You're right. Both components are indeed field-specific (lengthNorm and field.getBoost()). So maybe it can't move to the left of the sigma.
        Hide
        gsingers Grant Ingersoll added a comment -

        Doron, if you can submit a new patch, that would be great, otherwise I can try to do it sometime soon.

        I also think I want to drop the formula (and explanation) in scoring.html and just put a link to similarity in, so that we don't have to maintain two copies. Ideally, there would be a way to do a server side include so that we could have one file that contains the formula and then have it included in both the javadocs and this page. Does anyone know if this possible using our current setup?

        Show
        gsingers Grant Ingersoll added a comment - Doron, if you can submit a new patch, that would be great, otherwise I can try to do it sometime soon. I also think I want to drop the formula (and explanation) in scoring.html and just put a link to similarity in, so that we don't have to maintain two copies. Ideally, there would be a way to do a server side include so that we could have one file that contains the formula and then have it included in both the javadocs and this page. Does anyone know if this possible using our current setup?
        Hide
        doronc Doron Cohen added a comment -

        Going to work on this now, according to comments by Doug and Grant.

        Will give a try to the include idea - client side iframe as Chris suggested, see how it works. Iframe don't rely on Javascript (might be turned off for some users). There are downsides to iframe too - possible scrollbars etc. - so need to see how it looks, and need to check if it is possible to somehow also include it in Scoring.html, otherwise guess we just link to it from there.

        http://www.boutell.com/newfaq/creating/include.html

        Show
        doronc Doron Cohen added a comment - Going to work on this now, according to comments by Doug and Grant. Will give a try to the include idea - client side iframe as Chris suggested, see how it works. Iframe don't rely on Javascript (might be turned off for some users). There are downsides to iframe too - possible scrollbars etc. - so need to see how it looks, and need to check if it is possible to somehow also include it in Scoring.html, otherwise guess we just link to it from there. http://www.boutell.com/newfaq/creating/include.html
        Hide
        doronc Doron Cohen added a comment -

        I played with including the formula from a separate file, Client Side Include.

        === Summary ===
        I think the "include" is not going to work well enough and hence not worth to invest in it.
        So, bottom line, I give up for now on "include", and so I will make the changes in Similarity.java.

        === Details ===
        I know of 3 ways to do this: Javascript, Iframe, Object-Embed.

        Iframe can work for both the javadocs and the xdocs - I think that Embed also would work though I did not try it.

        Both Iframe and Embed have a problem of appearance: you have to decide on the size of the frame showm, in pixels. If you set too large an area, blank space will remain. If you set a too small area, scroll bars would show up. If the user changes the text side, the required area size is changing, but not the allocated area, so scrollbars or blank space are showing up and disapearing. Very ugly.

        Iframe also has an issue with inner links navigation: once you navigate to an anchor text in the iframe part (this works), the back action from some reason does not work (both Firefox and IE).

        The javascript approach should not have these issues, because the imported text becomes part of the embedding page (the imported text is "dynamically generated"). I saw that Javadocs themselves use javascript (at least in 1.5) so I feel better with using this. However to use Javascript you have to put some javascript code in the HTML Header, as well as an onload event in the BODY tag. I didn't find a way to do this with Javadocs.

        (Another tricky issue with Javascript is that outgoing links from the imported text have the base address of the embedding page. So references going out from the embedded text should be different in Similarity.html and Scoring.html (which are in separate directories). But I think this can be resolved with passing a 'base' param to the include() function.)

        Bottom line, I give up for now on "include", and so I will make the changes in Similarity.java.

        Show
        doronc Doron Cohen added a comment - I played with including the formula from a separate file, Client Side Include. === Summary === I think the "include" is not going to work well enough and hence not worth to invest in it. So, bottom line, I give up for now on "include", and so I will make the changes in Similarity.java. === Details === I know of 3 ways to do this: Javascript, Iframe, Object-Embed. Iframe can work for both the javadocs and the xdocs - I think that Embed also would work though I did not try it. Both Iframe and Embed have a problem of appearance: you have to decide on the size of the frame showm, in pixels. If you set too large an area, blank space will remain. If you set a too small area, scroll bars would show up. If the user changes the text side, the required area size is changing, but not the allocated area, so scrollbars or blank space are showing up and disapearing. Very ugly. Iframe also has an issue with inner links navigation: once you navigate to an anchor text in the iframe part (this works), the back action from some reason does not work (both Firefox and IE). The javascript approach should not have these issues, because the imported text becomes part of the embedding page (the imported text is "dynamically generated"). I saw that Javadocs themselves use javascript (at least in 1.5) so I feel better with using this. However to use Javascript you have to put some javascript code in the HTML Header, as well as an onload event in the BODY tag. I didn't find a way to do this with Javadocs. (Another tricky issue with Javascript is that outgoing links from the imported text have the base address of the embedding page. So references going out from the embedded text should be different in Similarity.html and Scoring.html (which are in separate directories). But I think this can be resolved with passing a 'base' param to the include() function.) Bottom line, I give up for now on "include", and so I will make the changes in Similarity.java.
        Hide
        gsingers Grant Ingersoll added a comment -

        That sounds good to me. I will remove the formula from scoring and just make it link for now until we can figure something else out.

        Show
        gsingers Grant Ingersoll added a comment - That sounds good to me. I will remove the formula from scoring and just make it link for now until we can figure something else out.
        Hide
        doronc Doron Cohen added a comment -

        I am attaching scoring_formula_2.patch - modifed scoring formula as suggested.

        Additional changes here:

        • order of the explanation parts: detailed norm part moved to end; tf and idf moved to start, so most of the stuff is visble at first glance.
        • links in the formula go to the appropriate explanation bullet.
        • formula itself is "framed" (border=1) for easier orientation within all the other text.
        Show
        doronc Doron Cohen added a comment - I am attaching scoring_formula_2.patch - modifed scoring formula as suggested. Additional changes here: order of the explanation parts: detailed norm part moved to end; tf and idf moved to start, so most of the stuff is visble at first glance. links in the formula go to the appropriate explanation bullet. formula itself is "framed" (border=1) for easier orientation within all the other text.
        Hide
        gsingers Grant Ingersoll added a comment -

        Doron,

        I really like this. It is so much cleaner and easier to read, IMO. I will commit this tomorrow or Tuesday unless I hear any objections.

        -Grant

        Show
        gsingers Grant Ingersoll added a comment - Doron, I really like this. It is so much cleaner and easier to read, IMO. I will commit this tomorrow or Tuesday unless I hear any objections. -Grant
        Hide
        gsingers Grant Ingersoll added a comment -

        OK, I have committed the changes and updated the site javadocs and scoring. Changes always take 15-30 mins. to propagate.

        Show
        gsingers Grant Ingersoll added a comment - OK, I have committed the changes and updated the site javadocs and scoring. Changes always take 15-30 mins. to propagate.
        Hide
        doronc Doron Cohen added a comment -

        One comment for Scoring.html:

        Tthe last sentence in the "Score Boosting" paragraph says:

        "At scoring (search) time, this norm is brought into the score
        of document as indexBoost, as shown by the formula in Similarity."

        To fix this, we should replace "indexBoost" by "norm(t,d)"

        Show
        doronc Doron Cohen added a comment - One comment for Scoring.html: Tthe last sentence in the "Score Boosting" paragraph says: "At scoring (search) time, this norm is brought into the score of document as indexBoost, as shown by the formula in Similarity." To fix this, we should replace "indexBoost" by "norm(t,d)"
        Hide
        doronc Doron Cohen added a comment -

        I just noticed that the link to "TermScorer" in "Understanding the Scoring Formula" is broken b/c TermScorer has package visibility.

        Can be fixed by saying instead "..., especially the scorer for TermQuery" and linking to TermQuery.

        Show
        doronc Doron Cohen added a comment - I just noticed that the link to "TermScorer" in "Understanding the Scoring Formula" is broken b/c TermScorer has package visibility. Can be fixed by saying instead "..., especially the scorer for TermQuery" and linking to TermQuery.
        Hide
        gsingers Grant Ingersoll added a comment -

        Going to close this one out for now, as I think we have gotten past the initial flurry of changes for the new docs.

        Show
        gsingers Grant Ingersoll added a comment - Going to close this one out for now, as I think we have gotten past the initial flurry of changes for the new docs.

          People

          • Assignee:
            Unassigned
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development