Solr
  1. Solr
  2. SOLR-1394

HTML stripper is splitting tokens

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      The Solr HTML stripper is replacing any removed HTML with whitespace. This is to keep offsets correct for highlighting.

      However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially interantional).

      The current code is actually deficient for BOTH highlighting and indexing, where the previous incarnation (that did not insert spaces) only had problems with highlighting.

      The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most situations. If the client is required to transform entities before handing it to Solr, it might as well be required to also strip tags, and then the HTML stripper would not be needed at all.

      Today, we have a better solution that can be used: offset correction. We can then avoid inserting extra whitespace, but still get correct offsets. The attached patch implements just that.

      1. hex-entity.patch
        2 kB
        Anders Melchiorsen
      2. SOLR-1394.patch
        15 kB
        Anders Melchiorsen
      3. SOLR-1394.patch
        7 kB
        Anders Melchiorsen

        Activity

        Hide
        Anders Melchiorsen added a comment -

        Proof-of-concept fixes. I don't expect them to be perfect.

        There is one test that still breaks. I think the test does not make sense with my changes, but I did not remove it.

        Also, the new functionality (offset correction) is not tested. I don't know how to do it.

        Show
        Anders Melchiorsen added a comment - Proof-of-concept fixes. I don't expect them to be perfect. There is one test that still breaks. I think the test does not make sense with my changes, but I did not remove it. Also, the new functionality (offset correction) is not tested. I don't know how to do it.
        Hide
        Anders Melchiorsen added a comment -

        Right version of patch file.

        Show
        Anders Melchiorsen added a comment - Right version of patch file.
        Hide
        Jason Rutherglen added a comment -

        I'm also seeing this problem and will try out the patch (after a number of other issues are fixed!)

        Show
        Jason Rutherglen added a comment - I'm also seeing this problem and will try out the patch (after a number of other issues are fixed!)
        Hide
        Jason Rutherglen added a comment -

        Here's the exception:

        Caused by: java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(BufferedReader.java:485)
        at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
        at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
        at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
        at org.apache.lucene.analysis.CharReader.read(CharReader.java:51)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
        at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:198)
        at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
        at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
        at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:108)
        at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:245)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:162)

        Show
        Jason Rutherglen added a comment - Here's the exception: Caused by: java.io.IOException: Mark invalid at java.io.BufferedReader.reset(BufferedReader.java:485) at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171) at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728) at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742) at org.apache.lucene.analysis.CharReader.read(CharReader.java:51) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637) at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:198) at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84) at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53) at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:108) at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:245) at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:162)
        Hide
        Anders Melchiorsen added a comment -

        In which situation do you get this exception?

        Show
        Anders Melchiorsen added a comment - In which situation do you get this exception?
        Hide
        Jason Rutherglen added a comment -

        In which situation do you get this exception?

        We need to log the HTML. I'll post it when we implement the logging.

        Show
        Jason Rutherglen added a comment - In which situation do you get this exception? We need to log the HTML. I'll post it when we implement the logging.
        Hide
        Anders Melchiorsen added a comment -

        Jason, did you get that exception with (and not without) my patch? You said that you had other problems as well, so I just want to make sure that it is a problem with the patch ...

        Show
        Anders Melchiorsen added a comment - Jason, did you get that exception with (and not without) my patch? You said that you had other problems as well, so I just want to make sure that it is a problem with the patch ...
        Hide
        Jason Rutherglen added a comment -

        Anders, The error occurs with the existing HTML stripper code in Solr's trunk.

        Show
        Jason Rutherglen added a comment - Anders, The error occurs with the existing HTML stripper code in Solr's trunk.
        Hide
        Yonik Seeley added a comment -

        What's clear is that the html stripper still has problems - less clear to me is if this patch as it currently exists is better than what's in trunk.... if people think it is, we could commit it quickly for 1.4

        Show
        Yonik Seeley added a comment - What's clear is that the html stripper still has problems - less clear to me is if this patch as it currently exists is better than what's in trunk.... if people think it is, we could commit it quickly for 1.4
        Hide
        Anders Melchiorsen added a comment -

        To me (the author), things are much better with the patch, and I have seen no ill effects. The exception that Jason reported happens without the patch.

        My problem is that I cannot really vouch for the patch, as I have no previous experience with the Solr code. So, it would be really nice if someone with the experience could take fifteen minutes to review the three tiny modifications.

        When replacing read() with next() in one of the patches, I noted that I was not sure why it worked better. I have later figured out that read() is the external interface, so it should (probably?) not be used internally by the stripper.

        Show
        Anders Melchiorsen added a comment - To me (the author), things are much better with the patch, and I have seen no ill effects. The exception that Jason reported happens without the patch. My problem is that I cannot really vouch for the patch, as I have no previous experience with the Solr code. So, it would be really nice if someone with the experience could take fifteen minutes to review the three tiny modifications. When replacing read() with next() in one of the patches, I noted that I was not sure why it worked better. I have later figured out that read() is the external interface, so it should (probably?) not be used internally by the stripper.
        Hide
        Jason Rutherglen added a comment -

        Anders,

        We're seeing the error again, we're going to try this patch and
        HTMLStripReader and we'll see what happens. Here's the latest
        stacktrace, which is pretty much the same as the last one:

        SEVERE: java.io.IOException: Mark invalid
        	at java.io.BufferedReader.reset(BufferedReader.java:485)
        	at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
        	at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:170)
        	at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:727)
        	at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:741)
        	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451
        )
        	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java
        :637)
        	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:174)
        	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50)
        	at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
        	at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116
        )
        	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401)
        	at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
        	at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:80)
        	at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:316)
        	at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:224)
        	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
        	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.jav
        a:244)
        	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
        	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2611)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2583)
        	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
        	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        
        Show
        Jason Rutherglen added a comment - Anders, We're seeing the error again, we're going to try this patch and HTMLStripReader and we'll see what happens. Here's the latest stacktrace, which is pretty much the same as the last one: SEVERE: java.io.IOException: Mark invalid at java.io.BufferedReader.reset(BufferedReader.java:485) at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63) at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:170) at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:727) at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:741) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451 ) at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java :637) at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:174) at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50) at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38) at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116 ) at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401) at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94) at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:80) at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:316) at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:224) at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138) at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.jav a:244) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2611) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2583) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        Hide
        Anders Melchiorsen added a comment -

        A reworked patch that should be easier to review.

        Now also includes a few new tests to show the problems that this patch is fixing.

        Show
        Anders Melchiorsen added a comment - A reworked patch that should be easier to review. Now also includes a few new tests to show the problems that this patch is fixing.
        Hide
        Anders Melchiorsen added a comment -

        Clarifying the description, the original report was just a pasted e-mail.

        Show
        Anders Melchiorsen added a comment - Clarifying the description, the original report was just a pasted e-mail.
        Hide
        Yonik Seeley added a comment -

        I've been testing this with a bunch of different HTML, and I don't see any places where this is worse, and it prevents splitting of tokens when it shouldn't.
        Given that the splitting is clearly a bug, and that changes to this filter won't affect the rest of Solr, I plan on committing this shortly.

        Things still aren't perfect as far as offsets and highlighting, but this patch makes it no worse.

        I modified the solr.xml document to escape the '&' and then added the strip char filter to the text field.
        The query was héllo OR hello OR unicode
        Before this patch: Good <em>unicode</em> support: héllo <em>(hell</em>o with an accent over the e)
        After this patch: Good <em>unicode</em> support: <em>héll</em>o <em>(hell</em>o with é accent over the e)

        Show
        Yonik Seeley added a comment - I've been testing this with a bunch of different HTML, and I don't see any places where this is worse, and it prevents splitting of tokens when it shouldn't. Given that the splitting is clearly a bug, and that changes to this filter won't affect the rest of Solr, I plan on committing this shortly. Things still aren't perfect as far as offsets and highlighting, but this patch makes it no worse. I modified the solr.xml document to escape the '&' and then added the strip char filter to the text field. The query was héllo OR hello OR unicode Before this patch: Good <em>unicode</em> support: héllo <em>(hell</em>o with an accent over the e) After this patch: Good <em>unicode</em> support: <em>héll</em>o <em>(hell</em>o with é accent over the e)
        Hide
        Anders Melchiorsen added a comment -

        Thanks, that sounds great.

        There is an existing off-by-one error in the numWhitespace calculation with hexadecimal numeric entities.

        I noticed that while reworking the patch, but did not bother to report it in here because I was annoyed from being ignored. Now you got me in a better mood, so I can fix that error if you like?

        Show
        Anders Melchiorsen added a comment - Thanks, that sounds great. There is an existing off-by-one error in the numWhitespace calculation with hexadecimal numeric entities. I noticed that while reworking the patch, but did not bother to report it in here because I was annoyed from being ignored. Now you got me in a better mood, so I can fix that error if you like?
        Hide
        Yonik Seeley added a comment -

        Committed. Thanks Anders!

        Show
        Yonik Seeley added a comment - Committed. Thanks Anders!
        Hide
        Yonik Seeley added a comment -

        Now you got me in a better mood, so I can fix that error if you like?

        Didn't see your message before the commit - yes, a patch would be great! Could possibly make it into 1.4 still if you're quick

        Show
        Yonik Seeley added a comment - Now you got me in a better mood, so I can fix that error if you like? Didn't see your message before the commit - yes, a patch would be great! Could possibly make it into 1.4 still if you're quick
        Hide
        Anders Melchiorsen added a comment -

        I did not even test that this patch compiles, but it should show what I had in mind.

        I will not have time to work on this during the weekend.

        Show
        Anders Melchiorsen added a comment - I did not even test that this patch compiles, but it should show what I had in mind. I will not have time to work on this during the weekend.
        Hide
        Yonik Seeley added a comment -

        It works! I added some tests, and committed. Thanks again!

        Show
        Yonik Seeley added a comment - It works! I added some tests, and committed. Thanks again!
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4

          People

          • Assignee:
            Unassigned
            Reporter:
            Anders Melchiorsen
          • Votes:
            2 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development