Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-5983

HTMLStripCharFilter is treating CDATA sections incorrectly

Agile BoardAttach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 4.7.1
    • 4.8, 4.9, 6.0
    • Schema and Analysis
    • None
    • Rhat - running in AWS Large Instance (4processors, 16gb ram) working in attached storage.

    Description

      I'm hammering on this Solr Instance. I've got three cores that I'm using to store millions of small bits of reference data. I'm using a heavily tweaked Tika to parse xml files and ingest them into Solr, while referencing this data. So I'm making hundreds of query requests against solr, while also making some substantial posts. (I queue up the posts, in general sending in 100 documents at a time).

      Stack Trace:

      4099640 [qtp39890933-24] WARN org.eclipse.jetty.servlet.ServletHandler – Error for /solr/us_patent_gran
      t/update
      java.lang.AssertionError: Attempting to read past the end of a segment.
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter$TextSegment.nextChar(HTMLStripCharFi
      lter.java:30885)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.zzDoEOF(HTMLStripCharFilter.java:311
      50)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.nextChar(HTMLStripCharFilter.java:31
      802)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30829)
      at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:30842) at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.zzRefill(StandardTokenizerImpl40.java:916)
      at org.apache.lucene.analysis.standard.std40.StandardTokenizerImpl40.getNextToken(StandardTokenizerImpl40.java:1123)
      at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:17
      5)
      at org.apache.lucene.analysis.payloads.TokenOffsetPayloadTokenFilter.incrementToken(TokenOffsetPa
      yloadTokenFilter.java:45)
      at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
      at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:182)
      at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
      at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:236)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:160)
      at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:
      69)
      at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java
      :51)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProces
      sor.java:704)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProces
      sor.java:858)
      at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProces
      sor.java:557)
      at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:
      100)
      at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
      at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
      at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.ja
      va:74)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            sarowe Steven Rowe
            danfunk Dan
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment