Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Labels: None
    • Environment:

Description

      We see intermittent issues with OutOfMemoryError exceptions caused by Tika failing to process content. Here is an example:

      Dec 29, 2011 7:12:05 AM org.apache.solr.common.SolrException log
      SEVERE: java.lang.OutOfMemoryError: Java heap space
      at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
      at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
      at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
      at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
      at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
      at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
      at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
      at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
      at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)

Activity

        Rob Tulloh added a comment -

        Agreed. It is our plan to move content extraction out of band. However, during our prototyping and testing, we want to be sure that Tika will meet all our requirements. So, even if we move content handling out of band, we still need it to work reliably and correctly.

        Thanks for the note. It confirms what we thought and that is helpful.

        Eric Pugh added a comment -

        I have found that Solr CELL is great for small numbers of documents, or quick prototyping. But as you scale up in either number or complexity of documents, it becomes a bottleneck. The Tika CLI is very easy to use, and you can throw more resources at doing Tika extraction if you do it outside of Solr and then just send the text in, versus doing it inside of Solr. And there's less risk that you bring down Solr! I wonder if we should put something in the wiki that recommends that if you start having problems with Solr CELL, then move to running Tika outside, and maybe include some sample code?

        Solr Cell is an awesome feature, but it can also cut you!
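
        A minimal sketch of that out-of-band approach, assuming SolrJ's HttpSolrServer, the Tika facade, and hypothetical "id"/"text" field names (none of these come from this report):

        import java.io.File;

        import org.apache.solr.client.solrj.impl.HttpSolrServer;
        import org.apache.solr.common.SolrInputDocument;
        import org.apache.tika.Tika;

        public class OutOfBandIndexer {
            public static void main(String[] args) throws Exception {
                // Extract the text outside Solr, so a parser failure or OOM only affects this JVM.
                Tika tika = new Tika();
                String text = tika.parseToString(new File(args[0]));

                // Send only the extracted plain text to Solr; Solr Cell is never involved.
                HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", args[0]);   // hypothetical unique key field
                doc.addField("text", text);    // hypothetical text field
                solr.add(doc);
                solr.commit();
            }
        }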

        Rob Tulloh added a comment -

        Opened POI ticket: https://issues.apache.org/bugzilla/show_bug.cgi?id=52400
        Martijn van Groningen added a comment -

        To me this seems like an Apache POI issue. I don't think this is a Solr Cell issue, which is also what Hoss is suggesting.

        The only few times I saw Solr Cell having OOM issues were when very large files were being uploaded (a few hundred megabytes per file). In such cases it is better to do the parsing outside of Solr, using plain Tika in the application in front of Solr.

        Rob Tulloh added a comment -

        Opened Tika ticket: https://issues.apache.org/jira/browse/TIKA-835
        Rob Tulloh added a comment -

        I'll open a ticket against Tika for this issue. I'll also try to document a case where the first try of a document fails with an error and the second attempt fails with OOM. That might be a Solr issue, as I would expect a retry to fail with the same error as the first try.

        Rob Tulloh added a comment -

        Looks reproducible when I run the downloaded tika-app-1.0.jar directly:

        [rtulloh@chiemsim-500 oom2]$ java -Xmx2G -jar ../tika-app-1.0.jar XXXX
        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
                at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
                at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
                at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:82)
                at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
                at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
                at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:128)
                at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:392)
                at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:99)
        

        I have seen that some documents get processed correctly the first time I submit them to Solr, but fail with OOM when submitted again during a retry.

        Hoss Man added a comment -

        Have you tried parsing these docs using Tika on the command line?

        https://tika.apache.org/1.0/gettingstarted.html#Using_Tika_as_a_command_line_utility

        ...nothing in these stack traces seems to suggest a problem specifically in Solr.

        (It's completely possible that Solr is doing something inefficient, memory-wise, when using Tika that is contributing to the OOM, but if you're getting errors on these docs even when you don't get OOM, that suggests a more fundamental underlying problem.)
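
        For reference, a typical tika-app invocation for such a test might look like the following (the file name is hypothetical; --text writes the extracted text to stdout, --metadata prints only the detected metadata):

        java -Xmx2G -jar tika-app-1.0.jar --text suspect-document.dat > extracted.txt
        java -Xmx2G -jar tika-app-1.0.jar --metadata suspect-document.dat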

        Rob Tulloh added a comment -

        I have another document that causes OOM in a batch, but when I submit it by itself, it produces this Tika error. Maybe this is helpful?

        Dec 29, 2011 10:37:21 AM org.apache.solr.common.SolrException log
        SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.TNEFParser@19ed00d1
                at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:201)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
                at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
                at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
                at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
                at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
                at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
                at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
                at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
                at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
                at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
                at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
                at org.mortbay.jetty.Server.handle(Server.java:326)
                at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
                at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
                at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
                at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
                at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
                at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
                at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
               ... 23 more
        Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
                at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
                at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
                at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
                at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                ... 26 more
        
        Rob Tulloh added a comment -

        I successfully isolated one document that causes an OOM. Note the input size is small (only 40K).

        Thu Dec 29 08:53:57 2011 feedBatch out: solCol2 # docs 1 bytes 40265 # err 1 # millis 3549
        Thu Dec 29 08:53:57 2011 ContentError: 1466911872::1 ContentError [m_contentID=1466911872::1, m_module=SolrContentManager, m_error=Java heap space  java.lang.OutOfMemoryError: Java heap space         at]
        

        The stack trace from Solr looks the same as originally reported:

        Dec 29, 2011 8:53:57 AM org.apache.solr.common.SolrException log
        SEVERE: java.lang.OutOfMemoryError: Java heap space
                at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
                at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
                at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
                at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
                at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
                at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
                at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
                at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
                at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
                at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
                at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
                at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
                at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
                at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        

        The document in question here appears to be of type application/ms-tnef. I will add more information to the ticket as I drill down and learn more.

        Rob Tulloh added a comment -

        I will attempt to isolate some documents that cause OOM and feed them one at a time. If I can reproduce an OOM via this mechanism, I will try the CLI. I have never used the Tika CLI, so I will take a look at the project page to see how to invoke it.

        Martijn van Groningen added a comment -

        I have had cases where large files were being extracted with Solr Cell and eventually Solr crashed because a lot of parsed content was held in memory.
        Just an idea to see whether this is a Solr or a Tika issue: can you try to use the Tika command-line utility to parse the document(s) that cause OOM?

        Rob Tulloh added a comment -

        Here is an example of a batch containing one document that took 49 seconds to process. This is typical of the slow/sluggish behavior we see. The content in this particular case is a PDF document.

        Thu Dec 29 07:19:41 2011 feedBatch out: solr2Col2 # docs 1 bytes 6348940 # err 0 # millis 49018
        Thu Dec 29 07:19:41 2011 Long running batch (t= 49018 ) doc 1493434104::2 mime = application/octet-stream
        

        I will see if I can isolate a document that causes an OOM. The most recent OOM I captured was from a batch containing more than one document, and I am not sure which document may have been the root cause of the OOM.

        Rob Tulloh added a comment -

        In this particular test, we are using two threads to feed a single Solr instance. We batch documents according to these parameters:

        1. Max bytes: 5M
        2. Max docs: 200

        These are thresholds. So, it is possible for a large document of size greater than 5M to get fed to Solr by itself. However, what I observe is that it is content type, rather than size, that is causing issues. I have seen two particular behaviors of concern. The first is slow/sluggish behavior. I have some outputs from our load generator that show that Solr/Tika sometimes takes over 10 minutes to ingest some content. I have one test set where I feed 4 documents in a single batch and it takes over 13 minutes for these 4 documents to get indexed. This was run against an empty Solr index. The other behavior is OOM.
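
        For illustration only, a minimal sketch of the threshold-based batching described above (class and method names are hypothetical, not taken from our feeder):

        import java.util.ArrayList;
        import java.util.List;

        class BatchFeeder {
            private static final long MAX_BYTES = 5L * 1024 * 1024; // 5M threshold
            private static final int MAX_DOCS = 200;

            private final List<byte[]> batch = new ArrayList<byte[]>();
            private long batchBytes = 0;

            void add(byte[] doc) {
                batch.add(doc);
                batchBytes += doc.length;
                // These are thresholds, not hard caps: a single document larger
                // than 5M still gets fed to Solr by itself.
                if (batchBytes >= MAX_BYTES || batch.size() >= MAX_DOCS) {
                    flush();
                }
            }

            void flush() {
                // submit the current batch to Solr (omitted), then reset
                batch.clear();
                batchBytes = 0;
            }
        }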

        I cannot share the content as it is proprietary. I am happy to provide more details from Solr and/or Tika if you can tell me what to look for or what debug output I should enable to capture helpful information.

        Martijn van Groningen added a comment -

        What is the average size of the files you're sending to Solr? How many files are you sending concurrently to Solr?

        I believe that Solr Cell internally saves the parsed content in a String before it adds it to the index. In other words, the parsed content is kept in RAM, and this can cause OOM issues.
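
        If that is the case, one way to keep memory bounded when parsing outside Solr is to stream Tika's output to a Writer instead of building one large String. A minimal sketch, with hypothetical class and file names:

        import java.io.FileInputStream;
        import java.io.FileWriter;
        import java.io.InputStream;
        import java.io.Writer;

        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class StreamingExtract {
            public static void main(String[] args) throws Exception {
                AutoDetectParser parser = new AutoDetectParser();
                InputStream in = new FileInputStream(args[0]);
                Writer out = new FileWriter(args[0] + ".txt");
                try {
                    // BodyContentHandler(Writer) streams the extracted text to the file
                    // as it is produced instead of buffering it all in memory.
                    parser.parse(in, new BodyContentHandler(out), new Metadata(), new ParseContext());
                } finally {
                    out.close();
                    in.close();
                }
            }
        }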


People

    • Assignee: Unassigned
    • Reporter: Rob Tulloh
    • Votes: 0
    • Watchers: 0
