  1. Solr
  2. SOLR-5983

HTMLStripCharFilter is treating CDATA sections incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.7.1
    • Fix Version/s: 4.8, 4.9, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None
    • Environment:

      Red Hat, running on an AWS Large instance (4 processors, 16 GB RAM), working on attached storage.

      Description

      I'm hammering on this Solr instance. I have three cores that store millions of small pieces of reference data. I'm using a heavily tweaked Tika to parse XML files and ingest them into Solr while referencing this data, so I'm making hundreds of query requests against Solr while also making some substantial posts. (I queue up the posts, generally sending 100 documents at a time.)
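
      The batched posting described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual ingestion code: the class name UpdateBatcher, the method partition, and the "doc-N" placeholder strings are all hypothetical, and the step that actually sends each batch to Solr is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups queued documents into batches of a fixed size
// before they are posted to the update handler.
public class UpdateBatcher {

    // Splits docs into consecutive batches of at most batchSize elements.
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(docs.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Placeholder documents; real input would be the parsed XML records.
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            docs.add("doc-" + i);
        }
        // 100 documents per post, as in the description above.
        for (List<String> batch : partition(docs, 100)) {
            System.out.println("would post " + batch.size() + " documents");
        }
    }
}
```

      With 250 queued documents this yields two batches of 100 and a final batch of 50.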

      Stack Trace:

      4099640 [qtp39890933-24] WARN org.eclipse.jetty.servlet.ServletHandler – Error for /solr/us_patent_grant/update
      java.lang.AssertionError: Attempting to read past the end of a segment.
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter$TextSegment.nextChar(HTMLStripCharFilter.java:30885)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.zzDoEOF(HTMLStripCharFilter.java:31150)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.nextChar(HTMLStripCharFilter.java:31802)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30829)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30842)
      at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.zzRefill(StandardTokenizerImpl40.java:916)
      at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.getNextToken(StandardTokenizerImpl40.java:1123)
      at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:175)
      at org.apache.lucene.analysis.payloads.TokenOffsetPayloadTokenFilter.incrementToken(TokenOffsetPayloadTokenFilter.java:45)
      at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
      at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:182)
      at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
      at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:236)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:160)
      at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
      at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:704)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:858)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
      at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
      at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
      at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
      at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

        Attachments

        1. SOLR-5983.patch
          28 kB
          Steve Rowe
        2. temp.txt
          165 kB
          Dan


            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              danfunk Dan
            • Votes:
              0
              Watchers:
              5
