Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: query parsers
    • Labels:
      None

      Description

      MLT Component doesn't let people highlight/paginate and the handler comes with an cost of maintaining another piece in the config. Also, any changes to the default (number of results to be fetched etc.) /select handler need to be copied/synced with this handler too.

      Having an MLT QParser would let users get back docs based on a query for them to paginate, highlight etc. It would also give them the flexibility to use this anywhere i.e. q,fq,bq etc.

      A bit of history about MLT (thanks to Hoss)

      MLT Handler pre-dates the existence of QParsers and was meant to take an arbitrary query as input, find docs that match that
      query, club them together to find interesting terms, and then use those
      terms as if they were my main query to generate a main result set.

      This result would then be used as the set to facet, highlight etc.

      The flow: Query -> DocList(m) -> Bag (terms) -> Query -> DocList(y)

      The MLT component on the other hand solved a very different purpose of augmenting the main result set. It is used to get similar docs for each of the doc in the main result set.

      DocSet(n) -> n * Bag (terms) -> n * (Query) -> n * DocList(m)

      The new approach:

      All of this can be done better and cleaner (and makes more sense too) using an MLT QParser.

      An important thing to handle here is the case where the user doesn't have TermVectors, in which case, it does what happens right now i.e. parsing stored fields.

      Also, in case the user doesn't have a field (to be used for MLT) indexed, the field would need to be a TextField with an index analyzer defined. This analyzer will then be used to extract terms for MLT.

      In case of SolrCloud mode, '/get-termvectors' can be used after looking at the schema (if TermVectors are enabled for the field). If not, a /get call can be used to fetch the field and parse it.

      1. SOLR-6248.patch
        5 kB
        Anshum Gupta
      2. SOLR-6248.patch
        27 kB
        Anshum Gupta
      3. SOLR-6248.patch
        27 kB
        Anshum Gupta
      4. SOLR-6248.patch
        25 kB
        Anshum Gupta
      5. SOLR-6248.patch
        47 kB
        Anshum Gupta
      6. SOLR-6248.patch
        71 kB
        Vitaliy Zhovtyuk
      7. SOLR-6248-4x.patch
        32 kB
        Markus Jelsma
      8. SOLR-6248-4x.patch
        31 kB
        Markus Jelsma

        Issue Links

          Activity

          Hide
          Steve Molloy added a comment -

          Would that approach also support sending in text that isn't in the index? This is the main reason we're using the MLT handler, which we need to be distributed (thus SOLR-5480). but if we can have a single approach for both, I agree that not maintaining 2 configurations (and 2 handlers in the code) would be much better. Let me know if I can help out.

          Show
          Steve Molloy added a comment - Would that approach also support sending in text that isn't in the index? This is the main reason we're using the MLT handler, which we need to be distributed (thus SOLR-5480 ). but if we can have a single approach for both, I agree that not maintaining 2 configurations (and 2 handlers in the code) would be much better. Let me know if I can help out.
          Hide
          Anshum Gupta added a comment -

          What do you mean by "text that isn't in the index"? If you mean pseudo-random text to find documents similar to that? Yes, it would handle that.

          Show
          Anshum Gupta added a comment - What do you mean by "text that isn't in the index"? If you mean pseudo-random text to find documents similar to that? Yes, it would handle that.
          Hide
          Steve Molloy added a comment -

          I meant passing in text as parameter as opposed to finding it in the index. With current MLT handler (not component), you can pass it in as body or stream.body to get documents similar to the text you pass in. In our case, we use it to find documents in one collection similar to a document found in another, or to some text directly provided by user. So, I know that at some point the SearchHandler started rejecting search requests with stream body, which would prevent this unless it could be achieved in another way. That's why I'm asking.

          Show
          Steve Molloy added a comment - I meant passing in text as parameter as opposed to finding it in the index. With current MLT handler (not component), you can pass it in as body or stream.body to get documents similar to the text you pass in. In our case, we use it to find documents in one collection similar to a document found in another, or to some text directly provided by user. So, I know that at some point the SearchHandler started rejecting search requests with stream body, which would prevent this unless it could be achieved in another way. That's why I'm asking.
          Hide
          Anshum Gupta added a comment -

          I don't think this would really work across 2 collections straight out of the box, but yes, as long as you have 'text' to pass, that is exactly what this parser would take. In other words, for now, it would more or less maintain the same mechanism of the handler (but in a manner that makes it work under SolrCloud mode).

          Show
          Anshum Gupta added a comment - I don't think this would really work across 2 collections straight out of the box, but yes, as long as you have 'text' to pass, that is exactly what this parser would take. In other words, for now, it would more or less maintain the same mechanism of the handler (but in a manner that makes it work under SolrCloud mode).
          Hide
          Vitaliy Zhovtyuk added a comment -

          Added mlt qparser, works in single and cloud mode. Added support for numeric id.
          Result of mlt written as query result - not in MoreLikeThis. Added tests to call in single and cloud modes.

          Show
          Vitaliy Zhovtyuk added a comment - Added mlt qparser, works in single and cloud mode. Added support for numeric id. Result of mlt written as query result - not in MoreLikeThis. Added tests to call in single and cloud modes.
          Hide
          Steve Molloy added a comment -

          I'd like to give this a spin, but looking at the attached patch, it's unclear how to pass in text. The parsers seem to be looking at "id" parameter, I haven't seen any reference to stream.body. What parameter would be used to pass in text to be analyzed and for which to return similar documents?

          Show
          Steve Molloy added a comment - I'd like to give this a spin, but looking at the attached patch, it's unclear how to pass in text. The parsers seem to be looking at "id" parameter, I haven't seen any reference to stream.body. What parameter would be used to pass in text to be analyzed and for which to return similar documents?
          Hide
          Vitaliy Zhovtyuk added a comment -

          With current implementation in patch mlt qparser can match document by unique field configured in schema and find similar document out of it. Parser syntax now look like

          {!mlt id=17 qf=lowerfilt}lowerfilt:*

          where id is value of unique field configure (not "id" column in schema), qf is matched fields to search.

          About passing text this parser can be extended with text parameter, search document by this term and look for similar document using existing implementation.

          Show
          Vitaliy Zhovtyuk added a comment - With current implementation in patch mlt qparser can match document by unique field configured in schema and find similar document out of it. Parser syntax now look like {!mlt id=17 qf=lowerfilt}lowerfilt:* where id is value of unique field configure (not "id" column in schema), qf is matched fields to search. About passing text this parser can be extended with text parameter, search document by this term and look for similar document using existing implementation.
          Hide
          Steve Molloy added a comment -

          In this case it cannot replace the current MoreLikeThisHandler implementation which can analyze incoming text (as opposed to searching for a matching document in the index) in order to find similar documents in the index. Being able to query by unique field and returning similar documents is already covered by the MoreLikeThisComponent if you use rows=1 to get a single document and its set of similar ones. The use case that forces the MoreLikeThisHandler currently (at least that I know of) is really this on-the-fly analysis of text that is nowhere in the index.

          Show
          Steve Molloy added a comment - In this case it cannot replace the current MoreLikeThisHandler implementation which can analyze incoming text (as opposed to searching for a matching document in the index) in order to find similar documents in the index. Being able to query by unique field and returning similar documents is already covered by the MoreLikeThisComponent if you use rows=1 to get a single document and its set of similar ones. The use case that forces the MoreLikeThisHandler currently (at least that I know of) is really this on-the-fly analysis of text that is nowhere in the index.
          Hide
          Anshum Gupta added a comment -

          My bad, this was my mistake. The last time I'd looked at this patch was about 10 months ago.

          This works like a component but also lets you paginate and do other stuff with it.
          Let me check out if accepting text would make sense here (or if we could have something on similar lines).

          Show
          Anshum Gupta added a comment - My bad, this was my mistake. The last time I'd looked at this patch was about 10 months ago. This works like a component but also lets you paginate and do other stuff with it. Let me check out if accepting text would make sense here (or if we could have something on similar lines).
          Hide
          Upayavira added a comment -

          I also, after a conversation with Hoss, have knocked up a MLTQuery parser.

          I very much doubt your query parser will allow you to pass in a stream.body, because the first few lines of o.a.s.handler.component.SearchHandler.handleRequestBody() say:

          if (req.getContentStreams() != null && req.getContentStreams().iterator().hasNext()) {
              throw new SolrException(ErrorCode.BAD_REQUEST, "Search requests cannot accept content streams");
          }
          

          This needs to be removed for stream.body to be available to the query parser.

          I can post my patch later if anyone is interested. It doesn't have any tests yet. My next task is to work out how to make it work across cores (recommend docs in one core based upon docs in another).

          Regarding the patch in this ticket, I'm curious why you needed a SolrCloud specific query parser? Is it because the doc you are using might be in a different shard?

          Also, it appears from a cursory look that LWMoreLikeThis is a fork of Lucene's MoreLikeThis class. Is there a reason that is needed, and if so, why isn't it still in Lucene?

          I expect to be working on my own version this week, and if what I produce can be useful to others (via this ticket or otherwise), I'd be happy to contribute it.
          Thx!

          Show
          Upayavira added a comment - I also, after a conversation with Hoss, have knocked up a MLTQuery parser. I very much doubt your query parser will allow you to pass in a stream.body, because the first few lines of o.a.s.handler.component.SearchHandler.handleRequestBody() say: if (req.getContentStreams() != null && req.getContentStreams().iterator().hasNext()) { throw new SolrException(ErrorCode.BAD_REQUEST, "Search requests cannot accept content streams" ); } This needs to be removed for stream.body to be available to the query parser. I can post my patch later if anyone is interested. It doesn't have any tests yet. My next task is to work out how to make it work across cores (recommend docs in one core based upon docs in another). Regarding the patch in this ticket, I'm curious why you needed a SolrCloud specific query parser? Is it because the doc you are using might be in a different shard? Also, it appears from a cursory look that LWMoreLikeThis is a fork of Lucene's MoreLikeThis class. Is there a reason that is needed, and if so, why isn't it still in Lucene? I expect to be working on my own version this week, and if what I produce can be useful to others (via this ticket or otherwise), I'd be happy to contribute it. Thx!
          Hide
          Anshum Gupta added a comment - - edited

          Updated patch. I'm keeping it simple for now (and we can always get more advanced stuff come in later).

          Here's the format:

          {!mlt id=docId qf=fieldNames}

          The users can also pass the min/max word length, mintf, mindf as a local params.
          I'll just update another patch which asserts against the parsedQuery from the query response (debug) to validate that the final query is actually generated using the passed documentId.

          Also, the changes MoreLikeThis changes are now in the exiting Lucene MoreLikeThis class.

          Show
          Anshum Gupta added a comment - - edited Updated patch. I'm keeping it simple for now (and we can always get more advanced stuff come in later). Here's the format: {!mlt id=docId qf=fieldNames} The users can also pass the min/max word length, mintf, mindf as a local params. I'll just update another patch which asserts against the parsedQuery from the query response (debug) to validate that the final query is actually generated using the passed documentId. Also, the changes MoreLikeThis changes are now in the exiting Lucene MoreLikeThis class.
          Hide
          Anshum Gupta added a comment -

          For now, I think this should be good to go. It'd be nice if someone takes a look though.

          I've refactored stuff from the last patch and fixed a few things too. Added a test for cloud mode to assert for parse queries from the debug response.

          Show
          Anshum Gupta added a comment - For now, I think this should be good to go. It'd be nice if someone takes a look though. I've refactored stuff from the last patch and fixed a few things too. Added a test for cloud mode to assert for parse queries from the debug response.
          Hide
          Erik Hatcher added a comment -

          Anshum Gupta - your latest patch does not have the new files (need to svn add them). This new qparser, IMO, should be registered automatically in QParserPlugin, so it doesn't need to be registered in solrconfig.xml manually.

          Overall looks great (looking back and previous patches to see the new files)! +1

          Show
          Erik Hatcher added a comment - Anshum Gupta - your latest patch does not have the new files (need to svn add them). This new qparser, IMO, should be registered automatically in QParserPlugin, so it doesn't need to be registered in solrconfig.xml manually. Overall looks great (looking back and previous patches to see the new files)! +1
          Hide
          Anshum Gupta added a comment -

          Erik Hatcher Thanks for looking at it. I merged the changes into MoreLikeThis.java instead of duplicating code (so the files are actually gone). The patch has everything that's required but yes I'll have this automatically registered in QParserPlugin.

          Show
          Anshum Gupta added a comment - Erik Hatcher Thanks for looking at it. I merged the changes into MoreLikeThis.java instead of duplicating code (so the files are actually gone). The patch has everything that's required but yes I'll have this automatically registered in QParserPlugin.
          Hide
          Anshum Gupta added a comment -

          Patch that auto registers the mlt QParser in the QParserPlugin. Also, Query equality test for the same. I plan to commit it tomorrow morning.

          Show
          Anshum Gupta added a comment - Patch that auto registers the mlt QParser in the QParserPlugin. Also, Query equality test for the same. I plan to commit it tomorrow morning.
          Hide
          Anshum Gupta added a comment -

          Another updated patch that fixes precommit issues.
          The precommit still fails due to another (unrelated) commit. Shall fix that one tomorrow morning before I commit this.

          Show
          Anshum Gupta added a comment - Another updated patch that fixes precommit issues. The precommit still fails due to another (unrelated) commit. Shall fix that one tomorrow morning before I commit this.
          Hide
          Noble Paul added a comment -

          doesn't it make sense to put an example query in the description ?

          Show
          Noble Paul added a comment - doesn't it make sense to put an example query in the description ?
          Hide
          Noble Paul added a comment -

          I guess it is good to go

          Show
          Noble Paul added a comment - I guess it is good to go
          Hide
          ASF subversion and git services added a comment -

          Commit 1634937 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1634937 ]

          SOLR-6248: MoreLikeThis QParser that works in standalone/cloud mode

          Show
          ASF subversion and git services added a comment - Commit 1634937 from Anshum Gupta in branch 'dev/trunk' [ https://svn.apache.org/r1634937 ] SOLR-6248 : MoreLikeThis QParser that works in standalone/cloud mode
          Hide
          Markus Jelsma added a comment -

          Anshun, very cool stuff here!

          {!mlt id=docId qf=fieldNames}

          I assume this is not the Lucene DocID but the document's UniqueKey field value? Also, must we query the correct shard for it to work?

          Show
          Markus Jelsma added a comment - Anshun, very cool stuff here! {!mlt id=docId qf=fieldNames} I assume this is not the Lucene DocID but the document's UniqueKey field value? Also, must we query the correct shard for it to work?
          Hide
          ASF subversion and git services added a comment -

          Commit 1634939 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1634939 ]

          SOLR-6248: MoreLikeThis QParser that works in standalone/cloud mode (merge from trunk)

          Show
          ASF subversion and git services added a comment - Commit 1634939 from Anshum Gupta in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1634939 ] SOLR-6248 : MoreLikeThis QParser that works in standalone/cloud mode (merge from trunk)
          Hide
          Anshum Gupta added a comment -

          Thanks Markus Jelsma. This is indeed the documents` unique key field value.
          Also, I don't think you'd need to target the correct shard as, in case of Cloud mode, it uses the /get handler.

          This has a lot of room for improvements/enhancements but I thought this was a good point to start with.

          Show
          Anshum Gupta added a comment - Thanks Markus Jelsma . This is indeed the documents` unique key field value. Also, I don't think you'd need to target the correct shard as, in case of Cloud mode, it uses the /get handler. This has a lot of room for improvements/enhancements but I thought this was a good point to start with.
          Hide
          Anshum Gupta added a comment -

          Noble Paul Thanks for looking at the patch.
          I've added a sample query in there. Also, there's basic description as a part of the package.html.

          I'll also be adding the usage in the ref guide.

          Show
          Anshum Gupta added a comment - Noble Paul Thanks for looking at the patch. I've added a sample query in there. Also, there's basic description as a part of the package.html. I'll also be adding the usage in the ref guide.
          Hide
          ASF subversion and git services added a comment -

          Commit 1634941 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1634941 ]

          SOLR-6248: Removing svn:keywords that got auto-added with the commit hook.

          Show
          ASF subversion and git services added a comment - Commit 1634941 from Anshum Gupta in branch 'dev/trunk' [ https://svn.apache.org/r1634941 ] SOLR-6248 : Removing svn:keywords that got auto-added with the commit hook.
          Hide
          ASF subversion and git services added a comment -

          Commit 1634942 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1634942 ]

          SOLR-6248: Removing svn:keywords that got auto-added with the commit hook. (merge from trunk)

          Show
          ASF subversion and git services added a comment - Commit 1634942 from Anshum Gupta in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1634942 ] SOLR-6248 : Removing svn:keywords that got auto-added with the commit hook. (merge from trunk)
          Hide
          Anshum Gupta added a comment -

          After a discussion with Hoss, I'm changing the format of the query parser. It wouldn't have an 'id' key in the request i.e. the new request would look like:

          {!mlt qf=fieldname}docId

          This would eliminate the need to document/maintain and track a new parameter name.

          Show
          Anshum Gupta added a comment - After a discussion with Hoss, I'm changing the format of the query parser. It wouldn't have an 'id' key in the request i.e. the new request would look like: {!mlt qf=fieldname}docId This would eliminate the need to document/maintain and track a new parameter name.
          Hide
          ASF subversion and git services added a comment -

          Commit 1635329 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1635329 ]

          SOLR-6248: Changing the format of mlt query parser

          Show
          ASF subversion and git services added a comment - Commit 1635329 from Anshum Gupta in branch 'dev/trunk' [ https://svn.apache.org/r1635329 ] SOLR-6248 : Changing the format of mlt query parser
          Hide
          ASF subversion and git services added a comment -

          Commit 1635336 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1635336 ]

          SOLR-6248: Changing request format for mlt queryparser (merge from trunk)

          Show
          ASF subversion and git services added a comment - Commit 1635336 from Anshum Gupta in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1635336 ] SOLR-6248 : Changing request format for mlt queryparser (merge from trunk)
          Hide
          Anshum Gupta added a comment -

          A few more improvements to the query parser. An additional test too for a fix that's a part of the patch.

          Show
          Anshum Gupta added a comment - A few more improvements to the query parser. An additional test too for a fix that's a part of the patch.
          Hide
          ASF subversion and git services added a comment -

          Commit 1636784 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1636784 ]

          SOLR-6248: Fixing an exception in case of missing qf

          Show
          ASF subversion and git services added a comment - Commit 1636784 from Anshum Gupta in branch 'dev/trunk' [ https://svn.apache.org/r1636784 ] SOLR-6248 : Fixing an exception in case of missing qf
          Hide
          ASF subversion and git services added a comment -

          Commit 1636788 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1636788 ]

          SOLR-6248: Fixing an exception in case of missing qf (merge from trunk)

          Show
          ASF subversion and git services added a comment - Commit 1636788 from Anshum Gupta in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1636788 ] SOLR-6248 : Fixing an exception in case of missing qf (merge from trunk)
          Hide
          Markus Jelsma added a comment -

          Patch for 4.10!

          Show
          Markus Jelsma added a comment - Patch for 4.10!
          Hide
          Markus Jelsma added a comment -

          Hi Anshum - ive uploaded a new patch. This adds a NPE check in retrieveTerms(Map<String, Collection<Object>> fields) in MoreLikeThis.java.

          In CloudMLTQParser i added a check if the requested document exists and returns an 404, you get a NPE otherwise. I also added support for glob there. One assert in the unit test uses that glob feature.

          Tests pass

          I haven't added support for it in the single parser.

          Markus

          Show
          Markus Jelsma added a comment - Hi Anshum - ive uploaded a new patch. This adds a NPE check in retrieveTerms(Map<String, Collection<Object>> fields) in MoreLikeThis.java. In CloudMLTQParser i added a check if the requested document exists and returns an 404, you get a NPE otherwise. I also added support for glob there. One assert in the unit test uses that glob feature. Tests pass I haven't added support for it in the single parser. Markus
          Hide
          Anshum Gupta added a comment -

          Markus Jelsma Thanks for the patch for 4.10 but that can't go in. It's a new feature and will be released with 5.0 (sometime really soon).
          I haven't looked at the patch yet but users who are running 4.10 and want to use this patch are free to do so.

          We can work on getting the bug fixes/tests into 5x now though. Can you provide a patch for trunk/5x for the tests/fixes?

          Show
          Anshum Gupta added a comment - Markus Jelsma Thanks for the patch for 4.10 but that can't go in. It's a new feature and will be released with 5.0 (sometime really soon). I haven't looked at the patch yet but users who are running 4.10 and want to use this patch are free to do so. We can work on getting the bug fixes/tests into 5x now though. Can you provide a patch for trunk/5x for the tests/fixes?
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              Anshum Gupta
              Reporter:
              Anshum Gupta
            • Votes:
              1 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development