Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu 8.04, Sun Java 6

Description

      When indexing large (1 megabyte) documents I get a lot of exceptions with stack traces like the below. It happens both in the Solr 1.3 release and in the July 9 1.4 nightly. I believe this to NOT be the same issue as SOLR-42. I found some further discussion on solr-user: http://www.nabble.com/IOException:-Mark-invalid-while-analyzing-HTML-td17052153.html

      In that discussion, Grant asked the original poster to open a Jira issue, but I didn't see one so I'm opening one; please feel free to merge or close if it's redundant.

      My stack trace follows.

      Jul 15, 2009 8:36:42 AM org.apache.solr.core.SolrCore execute
      INFO: [] webapp=/solr path=/update params={} status=500 QTime=3
      Jul 15, 2009 8:36:42 AM org.apache.solr.common.SolrException log
      SEVERE: java.io.IOException: Mark invalid
      at java.io.BufferedReader.reset(BufferedReader.java:485)
      at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
      at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
      at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
      at java.io.Reader.read(Reader.java:123)
      at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:108)
      at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:178)
      at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
      at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
      at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:347)
      at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
      at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
      at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:748)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2512)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2484)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
      at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
      at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
      at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1292)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
      at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
      at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
      at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
      at org.mortbay.jetty.Server.handle(Server.java:285)
      at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
      at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
      at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
      at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
      at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
      at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
      at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

      Thanks.

      Attachments

      1. SOLR-1283.modules.patch (1 kB, Hoss Man)
      2. SOLR-1283.patch (1 kB, Julien Coloos)

        Activity

        Grant Ingersoll added a comment -

        We should make the buffer size configurable, I guess. However, there's always the potential to go past it, or to use up a lot of memory in the meantime (if one is expecting really large files).

        solrize added a comment -

        Right now I'm getting a ton of these errors. It doesn't seem strictly dependent on the doc size. If I can crank up the buffer size enough that the error happens only occasionally instead of frequently, that would be a big improvement over the present situation. Thanks!

        solrize added a comment -

        Is the buffer size the parameter DEFAULT_READ_AHEAD (set to 8192) in HTMLStripReader.java?

        Should I set it to be the same as maxFieldLength from solrconfig.xml? That would let it hold the entire document. I currently have that config parameter set to 10000000 (10 MB).

        Thanks
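
        For reference, the "Mark invalid" message comes straight from java.io.BufferedReader: a mark() is only guaranteed to survive as many subsequent reads as the limit passed to it, and raising the buffer size only moves that threshold. A minimal standalone demonstration of the JDK behaviour (hypothetical class name, unrelated to the Solr code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {
    // Mark the stream, then read further than the declared read-ahead limit;
    // reset() fails once the reader has had to refill its buffer past the mark.
    static String tryReset() {
        try {
            BufferedReader r = new BufferedReader(new StringReader("abcdefghij"), 4);
            r.mark(2);                     // mark valid for at most 2 chars of read-ahead
            for (int i = 0; i < 5; i++) {  // read past both the limit and the buffer
                r.read();
            }
            r.reset();
            return "reset ok";
        } catch (IOException e) {
            return "reset failed: " + e.getMessage();  // "Mark invalid" on OpenJDK
        }
    }

    public static void main(String[] args) {
        System.out.println(tryReset());
    }
}
```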

        solrize added a comment -

        I now have a workaround. The documents I'm indexing don't actually have html in them, but the schema was set up to use HTMLStripReader anyway. I switched to the standard analyzer and the problem went away, and indexing also seems to be running faster than before. I do still think the issue needs fixing since I'm sure some people use solr to index large web pages which do need html stripping. Anyway, thanks to Erik H. for advice about this.
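
        For anyone applying the same workaround: it amounts to analyzing the field without the HTML-stripping stage in schema.xml. A hypothetical sketch (the field type name and exact factories are illustrative, not taken from the reporter's schema):

```xml
<!-- Hypothetical fieldType with no HTML stripping; appropriate only when
     the indexed content is known to contain no markup. -->
<fieldType name="text_plain" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```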

        David Bowen added a comment -

        It seems to me that the code should bail out and just assume that a "<" did not begin an HTML tag if it still isn't sure after reading the DEFAULT_READ_AHEAD (8,192) characters. It looks like the code was intended to do that (see the checks against safeReadAheadLimit) but must be missing some case.

        Julien Coloos added a comment -

        The issue is also happening on current trunk (revision 903234), with the class HTMLStripCharFilter (which appears to replace the deprecated HTMLStripReader).

        Example of stacktrace:

        26 janv. 2010 16:02:56 org.apache.solr.common.SolrException log
        GRAVE: java.io.IOException: Mark invalid
                at java.io.BufferedReader.reset(BufferedReader.java:485)
                at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
                at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:172)
                at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
                at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
                at java.io.Reader.read(Reader.java:122)
                at org.apache.lucene.analysis.CharTokenizer.incrementToken(CharTokenizer.java:77)
                at org.apache.lucene.analysis.ISOLatin1AccentFilter.incrementToken(ISOLatin1AccentFilter.java:43)
                at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:383)
                at org.apache.lucene.analysis.ISOLatin1AccentFilter.next(ISOLatin1AccentFilter.java:64)
                at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:379)
                at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:318)
                at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
                at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
                at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116)
                at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:406)
                at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:97)
                at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:83)
                at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:321)
                at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
                at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
                at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:781)
                at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:764)
                at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2630)
                at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2602)
                at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
                at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
                at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
                at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1317)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
                at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
                at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
                at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
                at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
                at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
                at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
                at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
                at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
                at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
                at org.mortbay.jetty.Server.handle(Server.java:285)
                at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
                at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
                at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:723)
                at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
                at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
                at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
                at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
        

        After a quick code review, it seems this one is due to the peek function, which can read a character from the input stream without incrementing the numRead variable (as the next function does); the functions that check whether the read-ahead limit has been reached rely on numRead.
        The exception can then be triggered when reading exceeds the read-ahead limit, for example with a big document containing a malformed processing instruction like

        <?>   ?????
        ... (anything except  '?>')
        

        Note: the issue is triggered here because readProcessingInstruction calls peek whenever the character '?' is found (to check whether it is followed by '>').
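
        A minimal sketch of the accounting problem (class and method bodies are hypothetical; only the names numRead, next and peek mirror the actual source):

```java
import java.io.IOException;
import java.io.Reader;

class ReadAheadSketch {
    private final Reader input;  // must support mark()/reset(), e.g. StringReader
    int numRead = 0;             // chars pulled from the underlying stream

    ReadAheadSketch(Reader input) { this.input = input; }

    int next() throws IOException {
        int ch = input.read();
        if (ch >= 0) numRead++;  // next() has always counted its reads
        return ch;
    }

    int peek() throws IOException {
        input.mark(1);
        int ch = input.read();
        if (ch >= 0) numRead++;  // the fix: peek() also consumes a char from the
                                 // stream, so it must count it. Over-counting only
                                 // shortens the usable buffer; under-counting lets
                                 // reads run past the mark() limit ("Mark invalid").
        input.reset();
        return ch;
    }
}
```

        Before the patch, repeated peek() calls advanced the underlying stream without numRead ever moving, so the checks against the read-ahead limit never fired even though the mark had long been exceeded.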

        You will find attached a patch to fix the issue, as well as an updated JUnit test (which currently only checks the malformed-processing-instruction case; a more general test of the next/peek functions may be worthwhile).

        Regards

        Hoss Man added a comment -

        Updated the patch for trunk (where the charfilter stuff has been refactored into the new top-level "modules" directory).

        I'm not familiar with the HTMLStripCharFilter stuff, so I can't say whether the "fix" is correct (no idea if "peek" should be incrementing that counter; that's why even private methods should have javadocs), but the test certainly looks valid to me.

        Hoss Man added a comment -

        We have a patch that seems to work, so we should definitely try to get this into the next release ... I'm hoping someone more familiar with the code can sanity-check it soon.

        Grant Ingersoll added a comment -

        From IRC:

        I wonder if the issue isn't that in next()
        [21:35] gsingers: if it gets something off the stack (pushed) it doesn't increment numRead
        [21:37] gsingers: but, I guess one could argue that numRead should track exactly what is read off the InputStream
        [21:38] gsingers: and in that case, peek is still doing a read
        [21:38] gsingers: so it should inc. it
        [21:38] gsingers: I suppose the only harm in more aggressively incrementing it is that you don't hold as much in buffer as you could otherwise

        Hoss Man added a comment -

        As I mentioned in IRC (prior to Grant's previously posted comments), the core issue is: what is the intended purpose of the "numRead" counter?

        • If it's supposed to count the number of times "input.read()" is called (i.e. "num read from inner stream"), then "peek" has a bug by not incrementing it.
        • If it's supposed to count the number of times "next()" returns a char (i.e. "num read from outer stream"), then, as Grant mentioned, "next" has a bug by not incrementing when using the stack.

        The patch currently assumes the former and seems to fix the bug. I haven't tried the same test case with the latter approach, but I suspect that may also work.

        Yonik Seeley added a comment -

        Since the primary use of numRead is in relation to mark() and reset() on the underlying stream, it does look like the first interpretation is correct (i.e. the patch looks correct).

        Yonik Seeley added a comment -

        Committed to 3x and trunk.

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release

        Steve Rowe added a comment -

        The below-listed exception, which appears to be the same as that in other reports on this issue, is triggered when running HTMLStripCharFilter over the ClueWeb09 documents with TREC-IDs clueweb09-en0000-00-14171, clueweb09-en0000-00-14228, clueweb09-en0000-00-14235, clueweb09-en0000-00-14240, clueweb09-en0000-00-14248, and clueweb09-en0000-00-14265:

        java.io.IOException: Mark invalid
                at java.io.BufferedReader.reset(BufferedReader.java:485)
                at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
                at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
                at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
        

        Once LUCENE-3690 has been committed, this will only affect the (deprecated) old implementation, which will be renamed to LegacyHTMLStripCharFilter.


          People

          • Assignee: Yonik Seeley
          • Reporter: solrize
          • Votes: 2
          • Watchers: 3
