Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      Here's a patch that implements simple support of Lucene's MoreLikeThis class.

      The MoreLikeThisHelper code is heavily based on (hmm... "lifted from" might be more appropriate) Erik Hatcher's example mentioned in http://www.mail-archive.com/solr-user@lucene.apache.org/msg00878.html

      To use it, add at least the following parameters to a standard or dismax query:

      mlt=true
      mlt.fl=list,of,fields,which,define,similarity

      See the MoreLikeThisHelper source code for more parameters.
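
A minimal sketch of assembling such a request from client code, assuming the host, path and parameter values used in the example URLs of this issue. The class and method names (MltUrlExample, buildUrl, exampleParams) are hypothetical and not part of the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class MltUrlExample {

    // Joins the parameters into a query string, URL-encoding names and values.
    static String buildUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Parameter values copied from the example URLs in this issue.
    static Map<String, String> exampleParams() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "apache");
        params.put("qt", "standard");      // or "dismax"
        params.put("mlt", "true");         // enable MoreLikeThis
        params.put("mlt.fl", "manu,cat");  // fields that define similarity
        params.put("mlt.mindf", "1");
        params.put("fl", "id,score");
        return params;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/select/", exampleParams()));
    }
}
```

The commas in mlt.fl and fl come out percent-encoded (%2C), which a servlet container decodes back to commas before Solr sees the parameters.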

      Here are two URLs that work with the example config, after loading all documents found in exampledocs in the index (just to show that it seems to work - of course you need a larger corpus to make it interesting):

      http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score

      http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score

      Results are added to the output like this:
      <response>
      ...
      <lst name="moreLikeThis">
      <result name="UTF8TEST" numFound="1" start="0" maxScore="1.5293242">
      <doc>
      <float name="score">1.5293242</float>
      <str name="id">SOLR1000</str>
      </doc>
      </result>
      <result name="SOLR1000" numFound="1" start="0" maxScore="1.5293242">
      <doc>
      <float name="score">1.5293242</float>
      <str name="id">UTF8TEST</str>
      </doc>
      </result>
      </lst>
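
A client could read this section with the JDK's DOM parser. The sketch below assumes the response looks exactly like the excerpt above (a closing </response> tag is added to make it well-formed); MltResponseParser and its methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MltResponseParser {

    // Sample response, reduced to the moreLikeThis section shown in this issue.
    static final String SAMPLE =
          "<response>"
        + "<lst name=\"moreLikeThis\">"
        + "<result name=\"UTF8TEST\" numFound=\"1\" start=\"0\" maxScore=\"1.5293242\">"
        + "<doc><float name=\"score\">1.5293242</float><str name=\"id\">SOLR1000</str></doc>"
        + "</result>"
        + "<result name=\"SOLR1000\" numFound=\"1\" start=\"0\" maxScore=\"1.5293242\">"
        + "<doc><float name=\"score\">1.5293242</float><str name=\"id\">UTF8TEST</str></doc>"
        + "</result>"
        + "</lst>"
        + "</response>";

    // Maps each source document (the name attribute of a <result>) to the
    // ids of the similar documents listed inside that <result>.
    static Map<String, List<String>> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable XML response", e);
        }
        Map<String, List<String>> similar = new LinkedHashMap<>();
        NodeList lists = doc.getElementsByTagName("lst");
        for (int i = 0; i < lists.getLength(); i++) {
            Element lst = (Element) lists.item(i);
            if (!"moreLikeThis".equals(lst.getAttribute("name"))) continue;
            NodeList results = lst.getElementsByTagName("result");
            for (int j = 0; j < results.getLength(); j++) {
                Element result = (Element) results.item(j);
                List<String> ids = new ArrayList<>();
                NodeList strs = result.getElementsByTagName("str");
                for (int k = 0; k < strs.getLength(); k++) {
                    Element str = (Element) strs.item(k);
                    if ("id".equals(str.getAttribute("name"))) ids.add(str.getTextContent());
                }
                similar.put(result.getAttribute("name"), ids);
            }
        }
        return similar;
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE));
    }
}
```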

      I haven't tested this extensively yet, but will do so in the next few days. Comments are welcome, of course.

      1. lucene-queries-2.0.0.jar
        23 kB
        Bertrand Delacretaz
      2. lucene-queries-2.1.1-dev.jar
        23 kB
        Ryan McKinley
      3. SOLR-69.patch
        10 kB
        Ryan McKinley
      4. SOLR-69.patch
        9 kB
        Bertrand Delacretaz
      5. SOLR-69.patch
        9 kB
        Bertrand Delacretaz
      6. SOLR-69.patch
        9 kB
        Bertrand Delacretaz
      7. SOLR-69-MoreLikeThisRequestHandler.patch
        22 kB
        Ryan McKinley
      8. SOLR-69-MoreLikeThisRequestHandler.patch
        22 kB
        Ryan McKinley
      9. SOLR-69-MoreLikeThisRequestHandler.patch
        15 kB
        Ryan McKinley
      10. SOLR-69-MoreLikeThisRequestHandler.patch
        12 kB
        Ryan McKinley
      11. SOLR-69-MoreLikeThisRequestHandler.patch
        8 kB
        Ryan McKinley

        Activity

        Bertrand Delacretaz added a comment -

        The MoreLikeThis class comes from the lucene-queries jar, I enclose the version used for my tests

        Erik Hatcher added a comment -

        I love it when features get implemented by others! Thanks Bertrand!

        Yonik Seeley added a comment -

        I finally got around to checking this out... looks cool!
        In your example URL, it looks like mindf=1 is repeated... is that right, or should one of them have been mintf=1?

        Ryan McKinley added a comment -

        Thanks, it works great. The only problem I ran into is a NullPointerException if you do not specify the fields to return (the default is all of them, without the score).

        Just add a not-null check at line 102 of MoreLikeThisHelper.java:

        <code>

        // Returns true if the fl parameter lists the "score" pseudo-field.
        protected boolean usesScoreField(SolrQueryRequest req) {
          String fl = req.getParams().get(SolrParams.FL);
          if (fl != null) {  // fl is null when no return fields are specified
            for (String field : splitList.split(fl)) {
              if ("score".equals(field)) return true;
            }
          }
          return false;
        }

        </code>

        Bertrand Delacretaz added a comment -

        Yonik, you're right about the mindf parameter duplication. Here's the corrected example URL:

        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

        Bertrand Delacretaz added a comment -

        Thanks Ryan for spotting the fl param problem. I'll attach a revised patch which fixes it.

        Before that, the following request caused an NPE; it works now:

        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1

        Bertrand Delacretaz added a comment -

        SOLR-69.patch updated

        Bertrand Delacretaz added a comment -

        The method used to compute includeScore in MoreLikeThisHelper was inconsistent with what the XMLWriter does. I have changed it to take this info from SolrQueryResponse.getReturnFields().

        The md5 sum of the current SOLR-69 patch is b6178d11d33f19b296b741a67df00d45

        With this change, all the following requests should work (standard and dismax handlers, with no fl param, id only and id + score as return fields):

        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1
        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id
        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1
        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id
        http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

        mrball added a comment -

        Thanks for writing this!

        Just a shot in the dark: would it be possible to use this on fields that are not stored? Maybe the client would have to supply the content of the field?

        The reason is that I'd rather not store the field, as that basically duplicates the data already in my (normal, non-Lucene) database.

        Bertrand Delacretaz added a comment -

        Intuitively, without having checked exactly how it's implemented, I think MoreLikeThis queries should work irrespective of whether fields are stored or not, as it's based on what's indexed. Maybe someone who knows Lucene's internals better than I do can comment.

        Did you find a case where non-stored fields cause problems?

        mrball added a comment -

        Yep, it doesn't seem to work with non-stored fields (if you only use non-stored fields in mlt.fl).

        I believe the stored field values are used to build the query.

        Yonik Seeley added a comment -

        > MoreLikeThis queries should work irrespective of whether fields are stored or not, as it's based on what's indexed

        I haven't looked at the Lucene code for more-like-this, but it's just like highlighting... to get the tokens for a specific document, you need to either get its stored field and re-analyze it, or store term vectors and use them.
        Looking up those terms in other documents is then fast (that's where the inverted index comes in).
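
The two steps described above can be sketched with a toy in-memory index. This is a conceptual illustration only and not Lucene's MoreLikeThis implementation: the class MltSketch is hypothetical, the per-document term counts stand in for term vectors (or re-analyzed stored fields), and ranking by the number of shared terms is a simplifying assumption. The minTf and minDf arguments are meant to loosely mirror what mlt.mintf and mlt.mindf appear to control.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MltSketch {

    // Toy inverted index: term -> ids of the documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Term frequencies per document; stands in for term vectors or for
    // re-analyzing a stored field.
    private final Map<String, Map<String, Integer>> termVectors = new HashMap<>();

    void add(String id, String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            tf.merge(term, 1, Integer::sum);
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
        termVectors.put(id, tf);
    }

    // Step 1: take the source document's terms, dropping those below the
    // frequency thresholds. Step 2: look each remaining term up in the
    // inverted index and count how many terms every other document shares.
    Map<String, Integer> moreLikeThis(String id, int minTf, int minDf) {
        Map<String, Integer> shared = new TreeMap<>();
        Map<String, Integer> tf = termVectors.get(id);
        if (tf == null) return shared; // no tokens available for this document
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Set<String> docs = postings.get(e.getKey());
            if (e.getValue() < minTf || docs.size() < minDf) continue;
            for (String other : docs) {
                if (!other.equals(id)) shared.merge(other, 1, Integer::sum);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        MltSketch index = new MltSketch();
        index.add("A", "apache lucene search");
        index.add("B", "apache solr search");
        index.add("C", "ipod video");
        System.out.println(index.moreLikeThis("A", 1, 1));
    }
}
```

The early return for a missing term vector corresponds to the situation reported above: without a stored field or term vectors there are no source tokens to build a query from.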

        Brian Whitman added a comment -

        There's a typo in the latest uploaded patch:

        - map.put(MIN_DOC_FREQ, String.valueOf(MoreLikeThis.DEFALT_MIN_DOC_FREQ));
        + map.put(MIN_DOC_FREQ, String.valueOf(MoreLikeThis.DEFAULT_MIN_DOC_FREQ));

        Yonik Seeley added a comment -

        Should this be an integrated part of the standard/dismax handlers, or should it be a separate request handler?
        I guess the answer would depend on how it's used most of the time:

        Case 1)
        The GUI queries the standard request handler and displays a list of documents with a little "more-like-this" button next to each document. When pressed, the GUI queries the more-like-this handler with the specific document id, and then displays the results to the user.

        Case 2)
        The GUI queries the standard request handler to display a list of documents, with a sub-list of similar "mlt" documents automatically displayed under each. Or, those lists could be collapsed by default, but instantly displayed since the GUI already has the info.

        If case (2) were rare, then perhaps mlt should be a separate handler. Case (2) can still be done, it would just require more requests from the GUI to do it.

        In either case, will highlighting be desired on any of the mlt docs? Other thoughts?

        Bertrand Delacretaz added a comment -

        Making this a separate handler would probably make the code easier to understand, but the current code makes case 2) easier, while making case 1) easy as well (just query on the document's unique ID, with MoreLikeThis enabled).

        I'm for keeping it as is, integrated in the handlers as an option. If someone needs it as a separate handler, it wouldn't be hard to factor out the common parts.

        I have no strong feelings, however, as I built this patch to experiment with this feature but I'm not using it yet.

        Erik Hatcher added a comment -

        In Collex, we do more-like-this on a single object, not within search results. A separate handler would be sufficient for our current needs and would keep the other handlers from becoming overloaded with options.

        Highlighting is not needed on MLT documents in our case.

        Bertrand Delacretaz added a comment -

        See Ken Krugler's comments about term vectors at http://www.nabble.com/MoreLikeThis-and-term-vectors---documentation-suggestion-tf3295459.html

        Brian Whitman added a comment -

        Is there a way to get this patch to listen to start & rows on the moreLikeThis result section?

        Bertrand Delacretaz added a comment -

        > Is there a way to get this patch to listen to start & rows on the moreLikeThis result section?

        IIUC you want to use the start & rows request parameters to limit the number of results in the moreLikeThis section?

        This is not implemented currently, and if we did it we'd have to use different parameter names (mlt.start and mlt.rows maybe) so that they don't interfere with the "main" part of the result set.
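
A minimal sketch of what such paging could look like, assuming the proposed (not implemented) mlt.start and mlt.rows values have already been parsed into integers. MltPaging and page are hypothetical names; out-of-range values are clamped rather than rejected, which is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MltPaging {

    // Returns the window [start, start + rows) of the similar-document list,
    // clamped to the bounds of the list.
    static <T> List<T> page(List<T> hits, int start, int rows) {
        int from = Math.max(0, Math.min(start, hits.size()));
        long end = (long) from + Math.max(0, rows); // long avoids int overflow
        int to = (int) Math.min((long) hits.size(), end);
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> similar = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(similar, 1, 2));
    }
}
```

Keeping these values separate from start and rows leaves the paging of the main result list untouched.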

        Brian Whitman added a comment -

        Yes, paging and size would be helpful in the MLT section. mlt.start and mlt.rows would be great.

        Ryan McKinley added a comment -

        trivial changes so it applies to trunk without conflicts...

        Ryan McKinley added a comment -

        Changed the MoreLikeThis implementation to be a standalone request handler rather than tacked on to the standard/dismax request handlers.

        How are other people using this patch? I found that I am always looking for things that are similar to a single document.

        This is still in progress, but posting for feedback.

        An example command would be:
        http://localhost:8983/solr/mlt?q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score

        Brian Whitman added a comment -

        I've personally never understood the "more documents that don't match this query but are like the documents in this query" usage of SOLR-69. MLT results (to me) should be like any other result, except that instead of querying by text you are querying by document ID. I'm confused as to how querying by query would work – if a query for 'apache' returned 10 docs, would MLT work on each one and generate n more docs per doc? And would the original query results get returned? What's the ordering?

        But I do know that paging and faceting should definitely work on MLT results. (Ryan's patch seems to implement this but I haven't tested it.) MLT results should look and operate like any other results.

        Ken Krugler added a comment -

        Ryan & Brian's comments above are (I think) indicative of how most people want to use MLT - you've got a single document, and you want to show other documents that are similar.

        The way we deal with this is to do a query on the <uniqueKey> field (as defined in the schema).

        If this was the only use case, then the syntax could be something like:

        http://localhost:8983/solr/mlt?uid=xxx&mlt.fl=manu,cat&mindf=1&rows=10

        The uid parameter would implicitly be applied against the <uniqueKey> field as specified in the schema.

        But that's just for my use case - others may want the ability to have mlt results returned for the first hit result of an arbitrary query.

        Hoss Man added a comment -

        looking back at the two main use cases Yonik described in his comment from 06/Feb/07...

        At the most basic level, a request for MLT results for a single doc by uniqueKey (case#1) is just a simplistic example of asking for MLT results for an arbitrary query (case#2) ... that arbitrary query just happens to be on a uniqueKey field, and only returns one result.

        Where things get more complicated is when you start returning other "tier 2" type information about the request – which begs the question "what is tier 1 data"? If the MLT results are added as "tier 2" data to StandardRequestHandler response, then all of the other "tier 2" data blocks (highlighting, faceting, debugQuery score explanation, etc..) still refer to the main result from the original query ... this may be what you want in use case #2, but doesn't really make sense for use case #1, where the "tier 1" main result only contains the single document you asked for by id ... the score explanation and facet count numbers aren't very interesting in that case.

        for case #1, what you really want is for the MLT data to be treated as the primary ("tier 1") result set, and all of the "tier 2" data is about those results ... highlighting is done on the MLT docs, facet counts are for the MLT docs, debugQuery score explanation tells you why the MLT docs are like your original docs, etc..

        Case #1 and case #2 are both useful, to address Brian's 02/May/07 comment..

        > I've personally never understood the "more documents
        > that don't match this query but are like the documents
        > in this query" ... I'm confused as to how querying by
        > query would work – if a query for 'apache' returned 10
        > docs, would MLT work on each one and generate n more
        > docs per doc? And would the original query results get
        > returned? What's the ordering?

        in your example, yes ... the user's main search on "apache" would return 10 results sorted by whatever sort they specified. for each of those 10 results, N similar results might be listed to the side (in a smaller font, or as a pop up widget) sorted most likely by how similar they are. even if you don't want to surface those similar docs right there on the main result page, you still need to execute the MLT logic as part of the initial request to know if there are any similar docs (so you can surface the link/button for displaying them to the user).

        I would even argue there is actually a third use case ...


        Case 3)
        The GUI queries the standard request handler to display a list of documents, with a single subsequent list of similar "mlt" documents that have things in common with all of the docs in the current page of results displayed elsewhere on the page.

        ...where case #2 is about having separate MLT lists for each of the matching results, this case is about having a single "if you are interested in all of these items, you might also be interested in these other items" list.

        case#1 and case#3 can both easily be satisfied with a single "MoreLikeThisHandler" which takes as its input a generic query (i.e. "q=id:12345" for case#1, and "q=apache" for case#3) and then generates a single "tier 1" result block of MLT results that relate to all of the docs matching that query (simple case of 1 doc for case#1) ... all other "tier 2" data would be in regards to this main MLT result set.

        case#2 would still easily be handled by having some new "tier 2" MLT data added to the StandardRequestHandler.
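
The difference between case #2 (one MLT list per hit) and case #3 (a single list for the whole page) can be sketched as follows. This is only an illustration under stated assumptions: MltCases is a hypothetical class, the similarTo map stands in for whatever per-document MLT lookup the handler performs, and ranking the combined list by how many hits a candidate resembles is an assumption of this sketch, not something specified in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MltCases {

    // Case #2: a separate list of similar documents for every hit.
    static Map<String, List<String>> perDocument(
            List<String> hits, Map<String, List<String>> similarTo) {
        Map<String, List<String>> lists = new LinkedHashMap<>();
        for (String hit : hits) {
            lists.put(hit, similarTo.getOrDefault(hit, Collections.<String>emptyList()));
        }
        return lists;
    }

    // Case #3: one list for the whole page. Candidates similar to more hits
    // rank higher; documents already on the page are left out.
    static List<String> combined(
            List<String> hits, Map<String, List<String>> similarTo) {
        Map<String, Integer> votes = new TreeMap<>();
        for (String hit : hits) {
            for (String candidate : similarTo.getOrDefault(hit, Collections.<String>emptyList())) {
                if (!hits.contains(candidate)) votes.merge(candidate, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(votes.keySet());
        // The sort is stable, so ties keep the alphabetical order of the TreeMap.
        ranked.sort((a, b) -> Integer.compare(votes.get(b), votes.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> similarTo = new HashMap<>();
        similarTo.put("d1", Arrays.asList("x", "y"));
        similarTo.put("d2", Arrays.asList("y", "z", "d1"));
        List<String> hits = Arrays.asList("d1", "d2");
        System.out.println(perDocument(hits, similarTo));
        System.out.println(combined(hits, similarTo));
    }
}
```

Case #1 falls out of either method when the hit list contains a single document.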

        Ryan McKinley added a comment -

        Refactored the MoreLikeThisRequestHandler so that it can support case #1, #2, #3

        • added faceting to the MoreLikeThisHandler
        • made it possible to remove the original match from the response. This makes the response look the same as ones that come from /select
        • Added documentation to: http://wiki.apache.org/solr/MoreLikeThis

        Brian Whitman added a comment -

        Ryan, it seems the handler doesn't listen to the fl parameter either in the result section or the morelikethis section. It always returns everything.

        Brian Whitman added a comment -

        Oof Ryan, my apologies, I was running an older version of this patch. fl is listened to. This is an excellent job, btw; I love that you can hide the original response.

        Brian Whitman added a comment -

        R, one useful feature would be mlt.fq=query, where query is a filter query, like type:book. Or since we're moving to a solo handler for mlt, just supporting fq would be good.

        like

        /mlt?q=id:BOOK01&mlt.fl=contents&fq=type:BOOK

        (Because in a single solr instance you've got information about books & authors, and you only want the mlt results to be books.)

        Brian Whitman added a comment -

        The mlt.exclude is similar to what I'm looking for but an mlt.fq is generally more useful.

        Also, mlt.exclude does not seem to support more than a single term query, e.g.

        mlt.exclude=+type:AUTHOR +type:PUBLISHER

        still lets type:PUBLISHER through.

        Brian Whitman added a comment -

        Also (sorry to keep commenting on this!) asking for fl=score doesn't work, I get this:

        java.lang.NullPointerException
        at org.apache.solr.search.DocSlice$1.score(DocSlice.java:117)
        at org.apache.solr.request.XMLWriter.writeDocList(XMLWriter.java:369)
        at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:408)
        at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:126)
        at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:35)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:169)
        at com.caucho.server.dispatch.FilterFilterChain.doFilter(FilterFilterChain.java:70)
        at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:173)
        at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:229)
        at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:274)
        at com.caucho.server.port.TcpConnection.run(TcpConnection.java:511)
        at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:520)
        at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
        at java.lang.Thread.run(Thread.java:619)

        if I do a query like

        /mlt?q=id:100&mlt.fl=content&fl=content,score

        If I take out the score from the fl it doesn't NPE.

        Ryan McKinley added a comment -

        Updating with a bunch of minor changes...

        1. Got rid of the "exclude" parameter and it is now using standard fq filters

        2. If only one field is specified, it uses the field's analyzer, as Ken suggested in:
        http://www.nabble.com/MoreLikeThis-and-term-vectors---documentation-suggestion-tf3295459.html#a9188723

        3. set termVectors="true" for 'cat' in the example solrconfig.xml and added a comment describing 'termVectors'

        4. Added standard debug info

        5. Fixed the 'score' issue: it was squawking because the original match did not have a score field...

        Andrew Nagy added a comment -

        A really nice feature would be to allow for boosting for fields, for example:

        ?q=id:1&mlt=true&mlt.fl=title^5,author^3,topic

        This would find items that are more similar to the title over the author, etc.
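
To make the proposed syntax concrete, here is a rough sketch of how a `field^boost` list could be split into per-field boosts. This only illustrates the suggested parameter format; `parse_boosted_fields` is a hypothetical helper, and nothing here implies that the handler accepts boosts in `mlt.fl`.

```python
def parse_boosted_fields(spec, default_boost=1.0):
    """Split a spec such as 'title^5,author^3,topic' into a field -> boost map.

    Hypothetical helper for illustration only; fields without an explicit
    '^boost' suffix get default_boost.
    """
    boosts = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        name, sep, boost = item.partition("^")
        boosts[name] = float(boost) if sep else default_boost
    return boosts


boosts = parse_boosted_fields("title^5,author^3,topic")
print(boosts)
```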

        Ryan McKinley added a comment -

        Updated patch to:

        • use searcher.getSchema().getAnalyzer()
        • be able to find similar documents from posted text
        • be able to return the "interesting terms" used for the MLT query

        Andrew: about field boosting... This handler uses the Lucene contrib MoreLikeThis implementation, which does not have a way to boost one field above another. If it did, we could easily add it.

        Ryan McKinley added a comment -

        Added param mlt.boost, which calls mlt.setBoost() to boost the interesting terms (or not). This parameter is required if you want a real number returned with mlt.interestingTerms=details.
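
As a sketch, a request combining the two parameters might be assembled like this. The value `true` for `mlt.boost` is an assumption (the comment only says the parameter turns boosting on or off), and the document id and field are borrowed from the earlier `/mlt?q=id:100` example.

```python
from urllib.parse import urlencode

params = [
    ("q", "id:100"),                      # source document, as in the earlier example
    ("mlt.fl", "content"),                # field used to judge similarity
    ("mlt.boost", "true"),                # assumed boolean value; enables term boosting
    ("mlt.interestingTerms", "details"),  # ask for the terms along with their boosts
]

# Keep ':' and ',' readable; other reserved characters are percent-encoded.
query_string = urlencode(params, safe=":,")
print("/mlt?" + query_string)
```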

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked "Resolved" and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.3 as of today 2008-03-15

        The Fix Version for all 29 issues found was set to 1.3; email notification was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this (hopefully) unique string: batch20070315hossman1


          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Bertrand Delacretaz
          • Votes:
            3
            Watchers:
            0
