SOLR-1662: BufferedTokenStream incorrect cloning

      Description

      As part of writing tests for SOLR-1657, I rewrote one of the base classes (BaseTokenTestCase) to use the new TokenStream API, and added some extra safety as well.

       public static String tsToString(TokenStream in) throws IOException {
          StringBuilder out = new StringBuilder();
          TermAttribute termAtt = (TermAttribute) in.addAttribute(TermAttribute.class);
          // extra safety: clear the attributes and assign bogus values, so that a
          // stream which preserves state without cloning shows up as a failure
          in.clearAttributes();
          termAtt.setTermBuffer("bogusTerm");
          while (in.incrementToken()) {
            if (out.length() > 0)
              out.append(' ');
            out.append(termAtt.term());
            in.clearAttributes();
            termAtt.setTermBuffer("bogusTerm");
          }

          in.close();
          return out.toString();
        }

      Setting the term text to bogus values helps find bugs in token streams that do not clear or clone properly. In this case the problem is with AB_AAB_Stream, a token stream in TestBufferedTokenStream: it converts A B -> A A B but does not clone, so the values get overwritten.
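
      As a rough illustration (not the exact code from TestBufferedTokenStream), the problematic pattern looks something like the sketch below, assuming the process()/write() API of BufferedTokenStream described in the comments:

       // Hypothetical sketch: on seeing "A", the stream buffers the very same
       // Token instance it is about to return, so both "copies" share state and
       // a consumer that overwrites the returned token also corrupts the queue.
       public static class AB_AAB_Stream extends BufferedTokenStream {
         public AB_AAB_Stream(TokenStream input) { super(input); }

         protected Token process(Token t) throws IOException {
           if ("A".equals(t.term()))
             write(t);   // BUG: should buffer a clone, not the returned instance
           return t;
         }
       }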

      This can be fixed in two ways:

      • BufferedTokenStream does the cloning
      • subclasses are responsible for the cloning

      The question is: which one should it be?

          Activity

          Uwe Schindler added a comment -

          Just the short description from the API side in Lucene:
          Lucene's documentation of TokenStream.next() says: "The returned Token is a "full private copy" (not re-used across calls to next())". AB_AAB_Stream.process() duplicates the token by just putting it uncloned into the outQueue. Since the consumer of the BufferedTokenStream assumes that the Token is private, it is allowed to change it - and by doing so it also changes the token in the outQueue. If you e.g. put another TokenFilter in front of this AB_AAB_Stream and modify the token there, it would break.
          In my opinion, the responsibility to clone lies with AB_AAB_Stream; BufferedTokenStream will never return the same token twice by itself. So it is a bug in the test. But Robert told me that e.g. RemoveDuplicates has a similar problem.
          The general contract for writing such streams is: whenever you return a Token from next(), never put it somewhere else uncloned, because the caller can change it.

          The fix is to do: write((Token) t.clone());
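
          Applied to the sketch of AB_AAB_Stream above (still a hypothetical reconstruction, not the actual test source), that one-line fix would look like this:

           protected Token process(Token t) throws IOException {
             if ("A".equals(t.term()))
               write((Token) t.clone());  // buffer a private copy; the returned token stays independent
             return t;
           }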

          Robert Muir added a comment -

          but Robert told me that e.g. RemoveDuplicates has a similar problem.

          Right, there is no cloning in RemoveDuplicates. CommonGrams creates a new Token() when it grams, but it's not clear that this one is correct either.

          So if we decide it's the responsibility of the subclass, these implementations need thorough tests to see whether they are OK or not.
          If we add the cloning to BufferedTokenStream itself, then we know they are OK...
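
          For comparison, pushing the cloning into BufferedTokenStream itself would roughly mean a defensive write() - a sketch against an assumed outQueue field, not the actual Solr source:

           // Hypothetical defensive variant: every buffered token is cloned, so a
           // subclass can never accidentally share an instance with its caller,
           // at the cost of a clone even when the token is ultimately thrown away.
           protected void write(Token t) {
             outQueue.add((Token) t.clone());
           }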

          Shalin Shekhar Mangar added a comment -

          So if we decide it's the responsibility of the subclass, these implementations need thorough tests to see whether they are OK or not.
          If we add the cloning to BufferedTokenStream itself, then we know they are OK...

          I think cloning should be done by subclasses before writing. If BufferedTokenStream clones the token, then every subclass pays the price even though the use case may just be to throw the token away.

          Uwe Schindler added a comment -

          I think cloning should be done by subclasses before writing. If BufferedTokenStream clones the token, then every subclass pays the price even though the use case may just be to throw the token away.

          +1, that was what I said in my first comment, too: BufferedTokenStream itself never reuses the token. The problem is the test and RemoveDuplicates, which push the same instance twice into the queue.

          Robert Muir added a comment -

          Cool, I will work up a patch with javadoc wording (I think we need this at least - it's not obvious and there is no mention of it) and any fixes.

          Robert Muir added a comment - edited

          Attached is a patch to fix the javadocs and the test (this same test is also an example in the javadocs header...).

          Edit: also, by the way, I finished converting the tests to assertTokenStreamContents for the other BufferedTokenStreams; there are no problems in DuplicateFilter or CommonGrams, so this should be all we need.
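
          For context, a test converted to assertTokenStreamContents would read roughly like the sketch below; the exact helper signature and tokenizer construction depend on the base test class and Lucene version in use, so treat them as assumptions:

           // Hypothetical converted test: feed "A B" through the stream and assert
           // the full expected term sequence in one call, instead of concatenating
           // terms by hand as tsToString() does.
           public void testABAAB() throws Exception {
             TokenStream ts = new AB_AAB_Stream(
                 new WhitespaceTokenizer(new StringReader("A B")));
             assertTokenStreamContents(ts, new String[] { "A", "A", "B" });
           }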

          Uwe Schindler added a comment -

          +1 Looks good!

          Shalin Shekhar Mangar added a comment -

          Committed revision 891889.

          Thanks Robert and Uwe!

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt; see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Hoss Man added a comment -

          Committed revision 949472.

          Merged to branch-1.4 for 1.4.1.


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Robert Muir