Solr
  1. Solr
  2. SOLR-3229

TermVectorComponent does not return terms in distributed search

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0, 6.0
    • Labels:
    • Environment:

      Ubuntu 11.10, openjdk-6

      Description

      TermVectorComponent does not return terms in distributed search, the distributedProcess() incorrectly uses Solr Unique Key to do subrequests, while process() expects Lucene document ids. Also, parameters are transferred in different format thus making distributed search returns no result.

      1. SOLR-3229.patch
        24 kB
        Hoss Man
      2. TermVectorComponent.patch
        3 kB
        Hang Xie

        Issue Links

          Activity

          Hide
          Hang Xie added a comment -

          patch to TermVectorComponent.java

          Show
          Hang Xie added a comment - patch to TermVectorComponent.java
          Hide
          Hang Xie added a comment -

          Patch attached, tested in 4.0 environment (both distributed and non-distributed), it should work with 3.x but I didn't test.

          Everything is compatible with previous except name of lst, which used to be "doc-<doc id>", and I changed it to Solr Unique Key as former may not be unique in multi-shard environment.

          Show
          Hang Xie added a comment - Patch attached, tested in 4.0 environment (both distributed and non-distributed), it should work with 3.x but I didn't test. Everything is compatible with previous except name of lst, which used to be "doc-<doc id>", and I changed it to Solr Unique Key as former may not be unique in multi-shard environment.
          Hide
          Hang Xie added a comment -

          Revised patch, no longer fails unit test.

          Show
          Hang Xie added a comment - Revised patch, no longer fails unit test.
          Hide
          Hang Xie added a comment -

          Revised, it seems distributedProcess() is unnecessary and can be removed, the only major change is add finishStage() to merge responses for subrequests to shards.

          Show
          Hang Xie added a comment - Revised, it seems distributedProcess() is unnecessary and can be removed, the only major change is add finishStage() to merge responses for subrequests to shards.
          Hide
          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Show
          Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Hoss Man added a comment -

          Hang: Thank you for your patch.

          I agree, the "docid" as a key is dangerous and misleading in distributed mode, and we should switch to using the uniqueKey when available, but if we leave things as you had it in your patch, existing (single node) users who don't have a uniqueKey field would no longer be able to get term vectors at all.

          I updated your patch to leave the key alone if there is no uniqueKey, and eliminate the "doc-" prefix when there is one. I also added a new distributed test to prove that everything is working, and that turned up a few problems - some of which i fixed (dealing with warnings, and ensuring that TVC results are in the correct order for the result documents).

          One thing i discovered that i'm not sure about is what to do about the "df" and "tf-idf" values when requested. in the test they have to be ignored because the way the distributed test works is to create a single node instance and compare it with a multi-node instance that has identical documents, and in the distributed TVC code, these won't match up – but i'm not sure if that's a bug (because the df & tf-idf values aren't "merged" from all nodes) or a feature (because you get the real df & tf-idf values for that term for that doc from the shard it lives in) ... either way it shouldn't stop fixing the basic problem of TVC failing painfully in a distributed request, so i've opened SOLR-3720 to track this in the future.

          feedback on this revised patch/test would be appreciated

          Show
          Hoss Man added a comment - Hang: Thank you for your patch. I agree, the "docid" as a key is dangerous and misleading in distributed mode, and we should switch to using the uniqueKey when available, but if we leave things as you had it in your patch, existing (single node) users who don't have a uniqueKey field would no longer be able to get term vectors at all. I updated your patch to leave the key alone if there is no uniqueKey, and eliminate the "doc-" prefix when there is one. I also added a new distributed test to prove that everything is working, and that turned up a few problems - some of which i fixed (dealing with warnings, and ensuring that TVC results are in the correct order for the result documents). One thing i discovered that i'm not sure about is what to do about the "df" and "tf-idf" values when requested. in the test they have to be ignored because the way the distributed test works is to create a single node instance and compare it with a multi-node instance that has identical documents, and in the distributed TVC code, these won't match up – but i'm not sure if that's a bug (because the df & tf-idf values aren't "merged" from all nodes) or a feature (because you get the real df & tf-idf values for that term for that doc from the shard it lives in) ... either way it shouldn't stop fixing the basic problem of TVC failing painfully in a distributed request, so i've opened SOLR-3720 to track this in the future. feedback on this revised patch/test would be appreciated
          Hide
          Hang Xie added a comment -

          I use "doc-0" to make it compatible with single node mode so far as I can recall, as my client was expecting that for parser. It's all up to you to keep "doc-" or not - it seems to me if you keep it, you can reduce lots of changes in tests. Other than that I don't have any comment on test thinking of my little to no knowledge on solr's test framework.

          I remember I read something regarding df/tf-idf in distributed mode is a highly anticipated feature, I don't expect that can be done easily, I'm good to have a bug there.

          Show
          Hang Xie added a comment - I use "doc-0" to make it compatible with single node mode so far as I can recall, as my client was expecting that for parser. It's all up to you to keep "doc-" or not - it seems to me if you keep it, you can reduce lots of changes in tests. Other than that I don't have any comment on test thinking of my little to no knowledge on solr's test framework. I remember I read something regarding df/tf-idf in distributed mode is a highly anticipated feature, I don't expect that can be done easily, I'm good to have a bug there.
          Hide
          Hoss Man added a comment -

          I use "doc-0" to make it compatible with single node mode so far as I can recall ...

          well, the biggest problem as i mentioned was thta you were only including the vector information in the reqponse if there was a uniqueKey value

          the format might have looked consistent, and the resulting string values were consistent in the test – but that was only a fluke of the fact that the uniqueKey values for the test docs were monotomicly increasing integers starting at "0" – so they just happened to correspond with the internal lucene docids.

          i think changing the format to only use the "doc-" when there is not uniqueKeyField in the schema makes the most sense – both because it helps make it clear when the output key is coming from the uniqueKey instead of the docid, and because moving forward that's the most logical thing for most users (who use a uniqueKey field)

          Show
          Hoss Man added a comment - I use "doc-0" to make it compatible with single node mode so far as I can recall ... well, the biggest problem as i mentioned was thta you were only including the vector information in the reqponse if there was a uniqueKey value the format might have looked consistent, and the resulting string values were consistent in the test – but that was only a fluke of the fact that the uniqueKey values for the test docs were monotomicly increasing integers starting at "0" – so they just happened to correspond with the internal lucene docids. i think changing the format to only use the "doc-" when there is not uniqueKeyField in the schema makes the most sense – both because it helps make it clear when the output key is coming from the uniqueKey instead of the docid, and because moving forward that's the most logical thing for most users (who use a uniqueKey field)
          Hide
          Hoss Man added a comment -

          Committed revision 1370870. - trunk
          Committed revision 1370871. - 4x

          Show
          Hoss Man added a comment - Committed revision 1370870. - trunk Committed revision 1370871. - 4x
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          David Smiley added a comment -

          Hoss Man HighlightComponent, DebugComponent, and TermVectorComponent have a very similar bit of code in their finishStage() method. HighlightComponent & DebugComponent's version was recently found to be buggy – SOLR-8060 and SOLR-8059. The Highlight side was recently fixed and I'm about to do the same for Debug side. But I'd like to refactor out some common lines of code between all 3 to ease maintenance. However the TV side has this odd bit where it if it can't lookup the shard doc by it's unique key, that it adds this to the response any way (~line 458). I would rather we remove this; I think it's not something we should support. UniqueKey should be required for distributed-search to get TV info back. The code that's here now incorrectly assumes that if it was unable to lookup the key in the resultIds that it's because the schema has no uniqueKey. But another reason is just that it's a distrib.singlePass distributed search (related to the 2 bugs I'm looking at in Highlight & Debug components). Do you support my recommendation?

          Show
          David Smiley added a comment - Hoss Man HighlightComponent, DebugComponent, and TermVectorComponent have a very similar bit of code in their finishStage() method. HighlightComponent & DebugComponent's version was recently found to be buggy – SOLR-8060 and SOLR-8059 . The Highlight side was recently fixed and I'm about to do the same for Debug side. But I'd like to refactor out some common lines of code between all 3 to ease maintenance. However the TV side has this odd bit where it if it can't lookup the shard doc by it's unique key, that it adds this to the response any way (~line 458). I would rather we remove this; I think it's not something we should support. UniqueKey should be required for distributed-search to get TV info back . The code that's here now incorrectly assumes that if it was unable to lookup the key in the resultIds that it's because the schema has no uniqueKey. But another reason is just that it's a distrib.singlePass distributed search (related to the 2 bugs I'm looking at in Highlight & Debug components). Do you support my recommendation?
          Hide
          Hoss Man added a comment -

          I don't remember this issue at all, and w/o digging into it's history or looking at teh commits, i'm going to reply simply to this sentence...

          UniqueKey should be required for distributed-search to get TV info back.

          I have no objection to this. if that's not how it works now, then i'm surprised and If i'm responsible for the code/decision in question then my suspicion is that it's simply because this issue predated SolrCloud and most of the other current "rules" regarding "distributed search" – back when it was a query time only concept and people manually partitioned their shards. it certainly pre-dates distrib.singlePass.

          open/link a new jira w/whatever changes you think make sense to the existing functionality.

          Show
          Hoss Man added a comment - I don't remember this issue at all, and w/o digging into it's history or looking at teh commits, i'm going to reply simply to this sentence... UniqueKey should be required for distributed-search to get TV info back. I have no objection to this. if that's not how it works now, then i'm surprised and If i'm responsible for the code/decision in question then my suspicion is that it's simply because this issue predated SolrCloud and most of the other current "rules" regarding "distributed search" – back when it was a query time only concept and people manually partitioned their shards. it certainly pre-dates distrib.singlePass. open/link a new jira w/whatever changes you think make sense to the existing functionality.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Hang Xie
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development