Solr
  1. Solr
  2. SOLR-3229

TermVectorComponent does not return terms in distributed search

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0, 5.0
    • Labels:
    • Environment:

      Ubuntu 11.10, openjdk-6

      Description

      TermVectorComponent does not return terms in distributed search, the distributedProcess() incorrectly uses Solr Unique Key to do subrequests, while process() expects Lucene document ids. Also, parameters are transferred in different format thus making distributed search returns no result.

      1. SOLR-3229.patch
        24 kB
        Hoss Man
      2. TermVectorComponent.patch
        3 kB
        Hang Xie

        Issue Links

          Activity

          Hang Xie created issue -
          Hide
          Hang Xie added a comment -

          patch to TermVectorComponent.java

          Show
          Hang Xie added a comment - patch to TermVectorComponent.java
          Hang Xie made changes -
          Field Original Value New Value
          Attachment TermVectorComponent.patch [ 12517820 ]
          Hide
          Hang Xie added a comment -

          Patch attached, tested in 4.0 environment (both distributed and non-distributed), it should work with 3.x but I didn't test.

          Everything is compatible with previous except name of lst, which used to be "doc-<doc id>", and I changed it to Solr Unique Key as former may not be unique in multi-shard environment.

          Show
          Hang Xie added a comment - Patch attached, tested in 4.0 environment (both distributed and non-distributed), it should work with 3.x but I didn't test. Everything is compatible with previous except name of lst, which used to be "doc-<doc id>", and I changed it to Solr Unique Key as former may not be unique in multi-shard environment.
          Hide
          Hang Xie added a comment -

          Revised patch, no longer fails unit test.

          Show
          Hang Xie added a comment - Revised patch, no longer fails unit test.
          Hang Xie made changes -
          Attachment TermVectorComponent.patch [ 12518060 ]
          Hide
          Hang Xie added a comment -

          Revised, it seems distributedProcess() is unnecessary and can be removed, the only major change is add finishStage() to merge responses for subrequests to shards.

          Show
          Hang Xie added a comment - Revised, it seems distributedProcess() is unnecessary and can be removed, the only major change is add finishStage() to merge responses for subrequests to shards.
          Hang Xie made changes -
          Attachment TermVectorComponent.patch [ 12518132 ]
          Hang Xie made changes -
          Attachment TermVectorComponent.patch [ 12518060 ]
          Hang Xie made changes -
          Attachment TermVectorComponent.patch [ 12517820 ]
          Hide
          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Show
          Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
          Hoss Man made changes -
          Fix Version/s 4.0 [ 12322455 ]
          Fix Version/s 4.0-ALPHA [ 12314992 ]
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Robert Muir made changes -
          Fix Version/s 4.0 [ 12322551 ]
          Fix Version/s 4.0-BETA [ 12322455 ]
          Hoss Man made changes -
          Assignee Hoss Man [ hossman ]
          Hoss Man made changes -
          Link This issue is duplicated by SOLR-3718 [ SOLR-3718 ]
          Hide
          Hoss Man added a comment -

          Hang: Thank you for your patch.

          I agree, the "docid" as a key is dangerous and misleading in distributed mode, and we should switch to using the uniqueKey when available, but if we leave things as you had it in your patch, existing (single node) users who don't have a uniqueKey field would no longer be able to get term vectors at all.

          I updated your patch to leave the key alone if there is no uniqueKey, and eliminate the "doc-" prefix when there is one. I also added a new distributed test to prove that everything is working, and that turned up a few problems - some of which i fixed (dealing with warnings, and ensuring that TVC results are in the correct order for the result documents).

          One thing i discovered that i'm not sure about is what to do about the "df" and "tf-idf" values when requested. in the test they have to be ignored because the way the distributed test works is to create a single node instance and compare it with a multi-node instance that has identical documents, and in the distributed TVC code, these won't match up – but i'm not sure if that's a bug (because the df & tf-idf values aren't "merged" from all nodes) or a feature (because you get the real df & tf-idf values for that term for that doc from the shard it lives in) ... either way it shouldn't stop fixing the basic problem of TVC failing painfully in a distributed request, so i've opened SOLR-3720 to track this in the future.

          feedback on this revised patch/test would be appreciated

          Show
          Hoss Man added a comment - Hang: Thank you for your patch. I agree, the "docid" as a key is dangerous and misleading in distributed mode, and we should switch to using the uniqueKey when available, but if we leave things as you had it in your patch, existing (single node) users who don't have a uniqueKey field would no longer be able to get term vectors at all. I updated your patch to leave the key alone if there is no uniqueKey, and eliminate the "doc-" prefix when there is one. I also added a new distributed test to prove that everything is working, and that turned up a few problems - some of which i fixed (dealing with warnings, and ensuring that TVC results are in the correct order for the result documents). One thing i discovered that i'm not sure about is what to do about the "df" and "tf-idf" values when requested. in the test they have to be ignored because the way the distributed test works is to create a single node instance and compare it with a multi-node instance that has identical documents, and in the distributed TVC code, these won't match up – but i'm not sure if that's a bug (because the df & tf-idf values aren't "merged" from all nodes) or a feature (because you get the real df & tf-idf values for that term for that doc from the shard it lives in) ... either way it shouldn't stop fixing the basic problem of TVC failing painfully in a distributed request, so i've opened SOLR-3720 to track this in the future. feedback on this revised patch/test would be appreciated
          Hoss Man made changes -
          Attachment SOLR-3229.patch [ 12539769 ]
          Hide
          Hang Xie added a comment -

          I use "doc-0" to make it compatible with single node mode so far as I can recall, as my client was expecting that for parser. It's all up to you to keep "doc-" or not - it seems to me if you keep it, you can reduce lots of changes in tests. Other than that I don't have any comment on test thinking of my little to no knowledge on solr's test framework.

          I remember I read something regarding df/tf-idf in distributed mode is a highly anticipated feature, I don't expect that can be done easily, I'm good to have a bug there.

          Show
          Hang Xie added a comment - I use "doc-0" to make it compatible with single node mode so far as I can recall, as my client was expecting that for parser. It's all up to you to keep "doc-" or not - it seems to me if you keep it, you can reduce lots of changes in tests. Other than that I don't have any comment on test thinking of my little to no knowledge on solr's test framework. I remember I read something regarding df/tf-idf in distributed mode is a highly anticipated feature, I don't expect that can be done easily, I'm good to have a bug there.
          Hide
          Hoss Man added a comment -

          I use "doc-0" to make it compatible with single node mode so far as I can recall ...

          well, the biggest problem as i mentioned was thta you were only including the vector information in the reqponse if there was a uniqueKey value

          the format might have looked consistent, and the resulting string values were consistent in the test – but that was only a fluke of the fact that the uniqueKey values for the test docs were monotomicly increasing integers starting at "0" – so they just happened to correspond with the internal lucene docids.

          i think changing the format to only use the "doc-" when there is not uniqueKeyField in the schema makes the most sense – both because it helps make it clear when the output key is coming from the uniqueKey instead of the docid, and because moving forward that's the most logical thing for most users (who use a uniqueKey field)

          Show
          Hoss Man added a comment - I use "doc-0" to make it compatible with single node mode so far as I can recall ... well, the biggest problem as i mentioned was thta you were only including the vector information in the reqponse if there was a uniqueKey value the format might have looked consistent, and the resulting string values were consistent in the test – but that was only a fluke of the fact that the uniqueKey values for the test docs were monotomicly increasing integers starting at "0" – so they just happened to correspond with the internal lucene docids. i think changing the format to only use the "doc-" when there is not uniqueKeyField in the schema makes the most sense – both because it helps make it clear when the output key is coming from the uniqueKey instead of the docid, and because moving forward that's the most logical thing for most users (who use a uniqueKey field)
          Hoss Man committed 1370871 (63 files)
          Reviews: none

          SOLR-3229: Fixed TermVectorComponent to work with distributed search (merge r1370870)

          Hide
          Hoss Man added a comment -

          Committed revision 1370870. - trunk
          Committed revision 1370871. - 4x

          Show
          Hoss Man added a comment - Committed revision 1370870. - trunk Committed revision 1370871. - 4x
          Hoss Man made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 5.0 [ 12321664 ]
          Resolution Fixed [ 1 ]
          Hoss Man committed 1370904 (1 file)
          Reviews: none

          SOLR-3229: forgot credit in CHANGES.txt

          Hoss Man committed 1370905 (60 files)
          Reviews: none

          SOLR-3229: forgot credit in CHANGES.txt (merge r1370904)

          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Hoss Man
              Reporter:
              Hang Xie
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development