Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7340

MemoryIndex.toString is broken if you enable payloads

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.4.1, 6.0.1, 7.0
    • Fix Version/s: 6.2
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Noticed this as we use Luwak which creates a MemoryIndex(true, true) storing both offsets and payloads (though in reality we never put any payloads in it).

      We used to use MemoryIndex.toString() for debugging and noticed it broke in Lucene 5.x and beyond. I think LUCENE-6155 broke it when it added support for payloads?

      Creating default memoryindex (as all the tests currently do) works fine, as does one with just offsets, it is just the payload version which is broken.

      1. LUCENE-7340.diff
        4 kB
        Daniel Collins
      2. LUCENE-7340.diff
        4 kB
        Daniel Collins
      3. LUCENE-7340.patch
        4 kB
        David Smiley
      4. LUCENE-7340.patch
        5 kB
        David Smiley

        Issue Links

          Activity

          Hide
          dancollins Daniel Collins added a comment -

          This patches toString() so that it works and adds some tests to verify that.
          What it doesn't do is actually add a field with payloads into the MI to validate the resulting format.... Still trying to work out how to do that.

          Show
          dancollins Daniel Collins added a comment - This patches toString() so that it works and adds some tests to verify that. What it doesn't do is actually add a field with payloads into the MI to validate the resulting format.... Still trying to work out how to do that.
          Hide
          romseygeek Alan Woodward added a comment -

          The payload query tests in the queries module should give you an idea of how to add payloads to a TokenStream for tests.

          Show
          romseygeek Alan Woodward added a comment - The payload query tests in the queries module should give you an idea of how to add payloads to a TokenStream for tests.
          Hide
          dancollins Daniel Collins added a comment -

          Updated patch, this extends an existing test instead of adding new ones

          Show
          dancollins Daniel Collins added a comment - Updated patch, this extends an existing test instead of adding new ones
          Hide
          dancollins Daniel Collins added a comment -

          Thanks Alan, there was an existing test that did actually check the payload API, so I've put the toString() checks in that existing test

          Show
          dancollins Daniel Collins added a comment - Thanks Alan, there was an existing test that did actually check the payload API, so I've put the toString() checks in that existing test
          Hide
          dancollins Daniel Collins added a comment -

          Alan Woodward and David Smiley, any more thoughts on this?

          Show
          dancollins Daniel Collins added a comment - Alan Woodward and David Smiley , any more thoughts on this?
          Hide
          dsmiley David Smiley added a comment -

          I tweaked your patch a bit, including modifying the output a little. I think it's no value to count the number of payloads so I removed that; do you disagree? I added a dedicated toString() test.

          What I really don't like about MemoryIndex.toString() is that it isn't generic when it so obviously could be. Why have MemoryIndex specific logic that needs to be maintained (it broke here causing this bug) when there might be a Terms.toString() or at least a utility method on Terms (or Fields?) that a better MemoryIndex might call? I'm filing a separate issue for that.

          Show
          dsmiley David Smiley added a comment - I tweaked your patch a bit, including modifying the output a little. I think it's no value to count the number of payloads so I removed that; do you disagree? I added a dedicated toString() test. What I really don't like about MemoryIndex.toString() is that it isn't generic when it so obviously could be. Why have MemoryIndex specific logic that needs to be maintained (it broke here causing this bug) when there might be a Terms.toString() or at least a utility method on Terms (or Fields?) that a better MemoryIndex might call? I'm filing a separate issue for that.
          Hide
          dancollins Daniel Collins added a comment -

          The only potential issue I see with the output as it stands with your patch, is that the bracketing suggests that the payload is part of the position information (at least that's how I would interpret it), when really its something separate? But payloads aren't an area I know well, we came upon this bug by accident, so I don't feel that strongly about it.

          Agreed, there is no real value in the number of payloads, I only added it as both terms and positions had counts, so it was purely for consistency with them.

          Show
          dancollins Daniel Collins added a comment - The only potential issue I see with the output as it stands with your patch, is that the bracketing suggests that the payload is part of the position information (at least that's how I would interpret it), when really its something separate? But payloads aren't an area I know well, we came upon this bug by accident, so I don't feel that strongly about it. Agreed, there is no real value in the number of payloads, I only added it as both terms and positions had counts, so it was purely for consistency with them.
          Hide
          dsmiley David Smiley added a comment -

          the bracketing suggests that the payload is part of the position information (at least that's how I would interpret it), when really its something separate?

          Payloads are associated with the position just as much as the offsets are, and the offsets are enclosed in parenthesis here too.

          I'll commit this when I next get a chance; could be a couple days (I'm on vacation).

          Show
          dsmiley David Smiley added a comment - the bracketing suggests that the payload is part of the position information (at least that's how I would interpret it), when really its something separate? Payloads are associated with the position just as much as the offsets are, and the offsets are enclosed in parenthesis here too. I'll commit this when I next get a chance; could be a couple days (I'm on vacation).
          Hide
          dsmiley David Smiley added a comment -

          I think it's dangerous that MemoryIndex.toString() can potentially generate a massive string. The functionality is useful but specifically overriding toString() is bad (IMO). If LUCENE-7361 results in a utility in misc/ then this code can simply be removed resulting in a default toString() since someone can simply use the proposed utility there to get this feature. If it results in nothing then this code can move to a new different method like toStringDebug that takes an Appendable (FYI StringBuilder & Writer implement that).

          Show
          dsmiley David Smiley added a comment - I think it's dangerous that MemoryIndex.toString() can potentially generate a massive string. The functionality is useful but specifically overriding toString() is bad (IMO). If LUCENE-7361 results in a utility in misc/ then this code can simply be removed resulting in a default toString() since someone can simply use the proposed utility there to get this feature. If it results in nothing then this code can move to a new different method like toStringDebug that takes an Appendable (FYI StringBuilder & Writer implement that).
          Hide
          dsmiley David Smiley added a comment -

          Updated patch renaming toString to toStringDebug and marking with lucene.experimental. I plan to commit this tomorrow.

          LUCENE-7340: MemoryIndex.toString() could throw NPE; fixed. Renamed to toStringDebug(). (Daniel Collins, David Smiley)

          Show
          dsmiley David Smiley added a comment - Updated patch renaming toString to toStringDebug and marking with lucene.experimental. I plan to commit this tomorrow. LUCENE-7340 : MemoryIndex.toString() could throw NPE; fixed. Renamed to toStringDebug(). (Daniel Collins, David Smiley)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3ca4fea5786430130f25d180440f765e96ac9c74 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3ca4fea ]

          LUCENE-7340: MemoryIndex.toString renamed to toStringDebug; fix NPE

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3ca4fea5786430130f25d180440f765e96ac9c74 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3ca4fea ] LUCENE-7340 : MemoryIndex.toString renamed to toStringDebug; fix NPE
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4a1b78af715a33d1d17aadc09b82003a3d11d157 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a1b78a ]

          LUCENE-7340: MemoryIndex.toString renamed to toStringDebug; fix NPE
          (cherry picked from commit 3ca4fea)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4a1b78af715a33d1d17aadc09b82003a3d11d157 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a1b78a ] LUCENE-7340 : MemoryIndex.toString renamed to toStringDebug; fix NPE (cherry picked from commit 3ca4fea)
          Hide
          dsmiley David Smiley added a comment -

          Committed. Thanks Dan!

          Show
          dsmiley David Smiley added a comment - Committed. Thanks Dan!
          Hide
          mikemccand Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.

          Show
          mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dancollins Daniel Collins
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development