SOLR-5983

HTMLStripCharFilter is treating CDATA sections incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.7.1
    • Fix Version/s: 4.8, 4.9, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None
    • Environment:

      Red Hat, running on an AWS Large instance (4 processors, 16 GB RAM), working on attached storage.

      Description

      I'm hammering on this Solr instance. I have three cores that I'm using to store millions of small bits of reference data. I'm using a heavily tweaked Tika to parse XML files and ingest them into Solr while referencing this data, so I'm making hundreds of query requests against Solr while also making some substantial posts. (I queue up the posts, generally sending 100 documents at a time.)

      Stack Trace:

      4099640 [qtp39890933-24] WARN org.eclipse.jetty.servlet.ServletHandler – Error for /solr/us_patent_grant/update
      java.lang.AssertionError: Attempting to read past the end of a segment.
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter$TextSegment.nextChar(HTMLStripCharFilter.java:30885)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.zzDoEOF(HTMLStripCharFilter.java:31150)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.nextChar(HTMLStripCharFilter.java:31802)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30829)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30842)
      at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.zzRefill(StandardTokenizerImpl40.java:916)
      at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.getNextToken(StandardTokenizerImpl40.java:1123)
      at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:175)
      at org.apache.lucene.analysis.payloads.TokenOffsetPayloadTokenFilter.incrementToken(TokenOffsetPayloadTokenFilter.java:45)
      at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
      at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:182)
      at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
      at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:236)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:160)
      at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
      at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:704)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:858)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
      at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
      at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
      at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
      at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

      1. SOLR-5983.patch
        28 kB
        Steve Rowe
      2. temp.txt
        165 kB
        Dan

        Activity

        Steve Rowe added a comment -

        Hi Dan,

        Do you know which document triggered the problem? If so, can you post it here, ideally in the form you're indexing (after Tika etc. pre-processing)?

        Steve

        Dan added a comment -

        Steve, it is an intermittent issue. I can go back and re-index the same document set without a problem.

        Dan added a comment -

        Well, darn it, I just proved myself wrong. I was able to reproduce it with the same data set. Please give me a bit to track down the exact file.

        Dan added a comment -

        Here is the offending solr document that is causing the error.

        Dan added a comment -

        Hi Steve, I've attached the results of calling toString on the solr document that is causing this error.

        Steve Rowe added a comment -

        Dan,

        Strings of this form (from the description_html field) trigger the exception:

        <! [CDATA[Ultraflexible Series Cable] ] >
        

        The above string alone hits the assert. The characters between <! and [CDATA[, ] and ], and ] and > are all U+2009 THIN SPACE.

        I'm working on tracking down why - looks like it's related to the U+2009 char in front of [CDATA[.

        By the way, if you're inserting the U+2009 intentionally to block recognition of CDATA sections and force HTML stripping, an alternate technique is to run text through HTMLStripCharFilter twice.
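Because U+2009 is easy to mistake for an ordinary space, it can help to build the triggering string explicitly and print its non-ASCII code points. The following is a minimal sketch using only the JDK; the class and method names (`ThinSpaceCdata`, `triggerString`, `visible`) are hypothetical helpers for illustration and are not part of Lucene or Solr.

```java
public class ThinSpaceCdata {
    // U+2009 THIN SPACE, the character reported in the description_html field.
    static final char THIN_SPACE = '\u2009';

    /** Builds the reported string with U+2009 at the three positions described above. */
    static String triggerString() {
        return "<!" + THIN_SPACE + "[CDATA[Ultraflexible Series Cable]"
                + THIN_SPACE + "]" + THIN_SPACE + ">";
    }

    /** Renders every non-ASCII code point as {U+XXXX} so invisible characters show up. */
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("{U+%04X}", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: <!{U+2009}[CDATA[Ultraflexible Series Cable]{U+2009}]{U+2009}>
        System.out.println(visible(triggerString()));
    }
}
```

Note that the resulting string contains neither the exact opener `<![CDATA[` nor the exact closer `]]>`, which is why it should not be recognized as a CDATA section at all.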

        Steve Rowe added a comment -

        Looks like there are two problems:

        1. Any chars between <! and [CDATA[ should block recognition of a CDATA section, but those chars are now passed through to the output, and a CDATA section is improperly recognized.
        2. The immediate cause of the assert is an unclosed CDATA section. HTMLStripCharFilter requires the exact string ]]> to close out a CDATA section, following the XML spec. When a CDATA section is started (even improperly, as in the first problem above), but the CDATA closing string is not found, the assert is hit at end-of-input. So this is the minimal error-triggering string:
        <![CDATA[
        

        I'm working on a fix.
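The two rules described above (only the exact opener `<![CDATA[` starts a section, only the exact closer `]]>` ends it, and end-of-input inside a section must not fail) can be illustrated with a small standalone sketch. This is not the code from the patch and does not use HTMLStripCharFilter; `CdataSketch` and `unwrapCdata` are hypothetical names, the sketch does no HTML stripping, and its choice to treat an unclosed section as running to end of input is an assumption made for illustration only.

```java
public class CdataSketch {
    static final String OPEN = "<![CDATA[";
    static final String CLOSE = "]]>";

    /**
     * Unwraps CDATA sections and copies all other text through unchanged.
     * Assumption of this sketch: an unclosed section extends to end of input.
     */
    static String unwrapCdata(String in) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < in.length()) {
            // Only the exact opener counts; "<!" + any other char + "[CDATA[" does not match.
            int open = in.indexOf(OPEN, pos);
            if (open < 0) {
                out.append(in, pos, in.length());
                break;
            }
            out.append(in, pos, open);
            int contentStart = open + OPEN.length();
            // Only the exact closer counts; "] ]>" or "]] >" does not match.
            int close = in.indexOf(CLOSE, contentStart);
            if (close < 0) {
                // Unclosed section: emit what is left instead of reading past the end.
                out.append(in, contentStart, in.length());
                break;
            }
            out.append(in, contentStart, close);
            pos = close + CLOSE.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unwrapCdata("a<![CDATA[b]]>c"));        // prints "abc"
        System.out.println(unwrapCdata("<![CDATA[").isEmpty());    // prints "true"
    }
}
```

With this policy, the minimal string `<![CDATA[` yields empty output rather than an error, and the thin-space variant from the earlier comment passes through untouched because it never matches the opener.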

        Steve Rowe added a comment -

        Patch fixing the problem. Tests added to HTMLStripCharFilterTest, and extracted redundant char filter testing out into HTMLStripCharFilterTest.assertHTMLStripsTo() methods.

        Committing shortly.

        ASF subversion and git services added a comment -

        Commit 1588136 from sarowe@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1588136 ]

        SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly

        ASF subversion and git services added a comment -

        Commit 1588137 from sarowe@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1588137 ]

        SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly (merged trunk r1588136)

        ASF subversion and git services added a comment -

        Commit 1588138 from sarowe@apache.org in branch 'dev/branches/lucene_solr_4_8'
        [ https://svn.apache.org/r1588138 ]

        SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly (merged trunk r1588136)

        Steve Rowe added a comment -

        Committed to trunk, branch_4x, and the lucene_solr_4_8 branch.

        Thanks Dan!

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Dan
          • Votes:
            0
            Watchers:
            4
