Lucene - Core / LUCENE-1101

TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Tokenizers that implement the reuse form of the next method:
      next(Token result)
      should reset the positionIncrement of the returned token to 1.
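As a rough sketch of the contract being proposed (the `Token` and `WordTokenizer` classes below are simplified stand-ins invented for illustration, not Lucene's real classes), a producer-side reset looks like this:

```java
// Simplified stand-in for org.apache.lucene.analysis.Token; field and method
// names mirror the real class, but this is illustration only.
class Token {
    int positionIncrement = 1;
    String term = "";
    Object payload = null;

    // Reset all reusable state to defaults.
    void clear() {
        positionIncrement = 1;
        term = "";
        payload = null;
    }
}

// A hypothetical producer implementing the reuse form next(Token).
class WordTokenizer {
    private final String[] words;
    private int i = 0;

    WordTokenizer(String... words) { this.words = words; }

    Token next(Token result) {
        if (i == words.length) return null;  // end of stream
        result.clear();            // the fix: the producer resets the token,
        result.term = words[i++];  // so stale positionIncrement/payload from
        return result;             // a previous use cannot leak into this one
    }
}
```

Because the producer calls clear() itself, a caller can hand in a token carrying arbitrary stale state and still get back a token with defaults restored.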

      1. lucene-1101.patch
        4 kB
        Doron Cohen
      2. lucene-1101.patch
        2 kB
        Doron Cohen
      3. lucene-1101.patch
        2 kB
        Doron Cohen


          Activity

          doronc Doron Cohen added a comment -

          Simple patch fixing this.
          Planning to commit this shortly if no objections.

          yseeley@gmail.com Yonik Seeley added a comment -

          Resetting tokens (including the position) is currently handled via Token.clear() before the consumer reuses a token.

          doronc Doron Cohen added a comment -

          I think I checked that and found that Token.clear() is too strong to be invoked in those cases, but let me check again...

          yseeley@gmail.com Yonik Seeley added a comment -

          In what cases? The protocol we are currently using requires that the caller of next(token) provide a suitably initialized token (positionIncrement defaults to 1, payload removed, etc).
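The consumer-side protocol described here can be sketched roughly as follows (`Token`, `ListStream`, and `Consumer` are simplified stand-ins for illustration, not Lucene's actual API):

```java
// Simplified stand-in for Lucene's Token, for illustration only.
class Token {
    int positionIncrement = 1;
    String term = "";
    void clear() { positionIncrement = 1; term = ""; }
}

// A hypothetical producer that, under the pre-fix protocol, assumes the
// caller hands in a token already reset to defaults.
class ListStream {
    private final String[] words;
    private int i = 0;
    ListStream(String... words) { this.words = words; }

    Token next(Token result) {
        if (i == words.length) return null;
        result.term = words[i++];  // only fills in what it produces; no clear()
        return result;
    }
}

class Consumer {
    // Under this protocol the consumer initializes the reusable token
    // before every call to next(Token).
    static int countTokens(ListStream s) {
        Token reusable = new Token();
        int n = 0;
        while (true) {
            reusable.clear();  // consumer's responsibility here
            if (s.next(reusable) == null) break;
            n++;
        }
        return n;
    }
}
```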

          doronc Doron Cohen added a comment -

          Currently Token.clear() is used only for un-tokenized fields in DocumentsWriter - Tokenizer implementations of next(Token) do not call it.
          I think they can be modified to call it (instead of explicitly resetting just the pos-incr).
          But since these methods already set the start-offset anyway, calling clear() might eat into the speed-up gained by reusing tokens.
          Then again, shouldn't tokenizers also reset the payload info? (It seems wrong to assume there's no payload in the input reusable token.)
          So I guess the right thing to do is to call clear() in all tokenizers (3 actually) - will work that path.

          doronc Doron Cohen added a comment -

          Updated patch - tokenizers now call clear() as suggested.
          This seems cleaner - Token manages the data members that should get cleared, so if more data members are added to Token in the future only Token.clear() needs to be updated - thanks Yonik!
          All core tests pass.

          yseeley@gmail.com Yonik Seeley added a comment -

          Currently Token.clear() is used only for un-tokenized fields in DocumentsWriter

          I think it's used for both tokenized and un-tokenized... see line 1319.
          It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).

          doronc Doron Cohen added a comment -

          I think it's used for both tokenized and un-tokenized... see line 1319.
          It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).

          You're right again Yonik, I missed line 1319.

          But I think it would be cleaner/safer to move the responsibility to clear() from consumers to producers.
          (Producer being the deepest tokenstream in the call sequence, the one that would instantiate a new Token if it implemented next()).

          Otherwise you get bugs like the one I had in testStopPositons() in the patch for LUCENE-1095:
          The test chains two stop filters:

          • a = WhitespaceAnalyzer()
          • f1 = StopFilter(a)
          • f2 = StopFilter(f1)

          Now the test itself calls next().
          StopFilter implements only next(Token).
          So this is the sequence we get:

          • the test calls f2.next()
          • TokenStream.next() calls f2.next(new Token())
          • f2.next(r) calls f1.next(r) repeatedly (until r is not stopped).
          • f1.next(r) calls a's tokenizer's next(r) repeatedly (until r is not stopped).

          The cause of the bug was that when f1 returns a token r, it may have set r's pos_incr to something other than 1. But when f2 calls f1 again (because f2 stopped r), that pos_incr should have been cleared back to 1. This could also be fixed by changing StopFilter to call clear() before every call to f1.next(r) except the first, because on the first call it can assume its consumer already cleared r. Since most words are not stopped, this distinction between the first call and successive calls matters performance-wise.

          Now, this is a little complicated (and not only because of my writing style : - ) ),
          so I think moving the clear() responsibility to the (deepest) producer would make things simpler and safer.
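A rough sketch of why producer-side clear() keeps position increments correct across chained filters (all class names below are invented for illustration; this is not Lucene's actual StopFilter):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

interface TokenProducer { Token next(Token t); }

// Simplified stand-in for Lucene's Token, for illustration only.
class Token {
    int positionIncrement = 1;
    String term = "";
    void clear() { positionIncrement = 1; term = ""; }
}

class WordSource implements TokenProducer {
    private final String[] words;
    private int i = 0;
    WordSource(String... words) { this.words = words; }

    public Token next(Token t) {
        if (i == words.length) return null;
        t.clear();             // producer clears, per the fix in this issue
        t.term = words[i++];
        return t;
    }
}

class StopFilterSketch implements TokenProducer {
    private final TokenProducer in;
    private final Set<String> stop;
    StopFilterSketch(TokenProducer in, String... stopWords) {
        this.in = in;
        this.stop = new HashSet<>(Arrays.asList(stopWords));
    }

    // Accumulates the increments of stopped tokens onto the next kept token.
    // This composes correctly across chained filters only because the deepest
    // producer resets positionIncrement to 1 on every call.
    public Token next(Token t) {
        int skipped = 0;
        while ((t = in.next(t)) != null) {
            if (!stop.contains(t.term)) {
                t.positionIncrement += skipped;
                return t;
            }
            skipped += t.positionIncrement;
        }
        return null;
    }
}
```

Chaining two such filters over the stream "x a b y" (with "a" stopped by the inner filter and "b" by the outer) yields "x" with increment 1 and "y" with increment 3; the increments add up correctly precisely because each call into the source re-clears the token.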

          mikemccand Michael McCandless added a comment -

          But I think it would be cleaner/safer to move the responsibility to clear() from consumers to producers.

          +1

          So, the contract of next(Token) is: if you are the source of Tokens, you must call Token.clear() before setting the fields in it & returning it.

          doronc Doron Cohen added a comment -

          Thanks Mike, I'll remove the call from DocumentsWriter and update the javadocs accordingly.

          doronc Doron Cohen added a comment -

          Updated patch: Token.clear() is now called only by producers, and the javadocs are updated as discussed.

          doronc Doron Cohen added a comment -

          Committed, thanks for the reviews Yonik & Mike!


            People

            • Assignee:
              doronc Doron Cohen
              Reporter:
              doronc Doron Cohen
            • Votes:
              0
              Watchers:
              0
