Solr / SOLR-5983

HTMLStripCharFilter is treating CDATA sections incorrectly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.7.1
    • Fix Version/s: 4.8, 4.9, 6.0
    • Component/s: Schema and Analysis
    • Labels: None
    • Environment: Rhat, running on an AWS Large instance (4 processors, 16 GB RAM), working in attached storage.

    Description

      I'm hammering on this Solr instance. I've got three cores that I'm using to store millions of small bits of reference data. I'm using a heavily tweaked Tika to parse XML files and ingest them into Solr while referencing this data, so I'm making hundreds of query requests against Solr while also making some substantial posts. (I queue up the posts, generally sending 100 documents at a time.)
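
The batching mentioned above (queueing posts and sending roughly 100 documents at a time) might look something like the following. This is only a minimal sketch: the class name `PostBatcher` and the `partition` helper are hypothetical and not taken from the reporter's code, and the actual HTTP post to the core's update handler is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class PostBatcher {
    // Split the queued documents into consecutive batches of at most batchSize.
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so each batch is independent of the queue.
            batches.add(new ArrayList<>(docs.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical queue of 250 document identifiers.
        List<String> queued = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            queued.add("doc-" + i);
        }
        for (List<String> batch : partition(queued, 100)) {
            // A real client would post the batch to the update handler here.
            System.out.println("would post " + batch.size() + " documents");
        }
    }
}
```

With batches of this size, one failing document (such as one that trips the assertion below) fails the whole update request it was posted in.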

      Stack Trace:

      4099640 [qtp39890933-24] WARN org.eclipse.jetty.servlet.ServletHandler – Error for /solr/us_patent_grant/update
      java.lang.AssertionError: Attempting to read past the end of a segment.
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter$TextSegment.nextChar(HTMLStripCharFilter.java:30885)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.zzDoEOF(HTMLStripCharFilter.java:31150)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.nextChar(HTMLStripCharFilter.java:31802)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30829)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30842)
        at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.zzRefill(StandardTokenizerImpl40.java:916)
        at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.getNextToken(StandardTokenizerImpl40.java:1123)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:175)
        at org.apache.lucene.analysis.payloads.TokenOffsetPayloadTokenFilter.incrementToken(TokenOffsetPayloadTokenFilter.java:45)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:182)
        at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:236)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:160)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:704)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:858)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

      Attachments

        1. SOLR-5983.patch (28 kB, Steven Rowe)
        2. temp.txt (165 kB, Dan)

          People

            Assignee: Steven Rowe (sarowe)
            Reporter: Dan (danfunk)
            Votes: 0
            Watchers: 5
