Solr / SOLR-2886

Out of Memory Error with DIH and TikaEntityProcessor

    Description

      I recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war, and then to apache-solr-4.0-2011-10-30_09-00-00.war, to index ~5300 PDFs of various sizes using the TikaEntityProcessor. Under the June build, indexing ran to completion and was completely successful. The only problem was that the full text was unreadable in highlighting, which was fixed in Tika 0.10 (TIKA-611). I chose the October 14 build of Solr because Tika 0.10 had recently been included (SOLR-2372).

      On the same machine, without changing any memory settings, the first problem I hit is a PermGen error. Fine: I increased the PermGen space.

      I've set the "onError" parameter to "skip" for the TikaEntityProcessor. Now I get several (6) pairs of errors like this:

      SEVERE: Exception thrown while getting data
      java.net.SocketTimeoutException: Read timed out
      SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975

      After ~3881 documents, with auto commit set unreasonably frequently, I consistently get an OutOfMemoryError:

      SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space

      The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151) and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).

      The October 30 build performs identically.

      Funny thing is that monitoring via JConsole doesn't reveal any memory issues.
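
      One way to check whether JConsole's periodic sampling is simply missing a short-lived allocation spike is to read the peak usage that the JVM records for each memory pool. The sketch below is only an illustration and not part of the original report: the class name MemoryPeaks is hypothetical, and it assumes the code is run inside the same JVM as Solr (run standalone, it only reports on its own JVM). It uses the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints current and peak usage for every JVM memory pool.
public class MemoryPeaks {

    /** Builds one line per memory pool with current, peak, and maximum size in MB. */
    public static String report() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage current = pool.getUsage();
            MemoryUsage peak = pool.getPeakUsage();
            if (current == null || peak == null) {
                continue; // the pool is no longer valid
            }
            long max = peak.getMax(); // -1 means the maximum is undefined
            String maxText = (max < 0) ? "undefined" : (max >> 20) + "MB";
            sb.append(String.format("%-30s used=%dMB peak=%dMB max=%s%n",
                    pool.getName(),
                    current.getUsed() >> 20,
                    peak.getUsed() >> 20,
                    maxText));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

      If the peak for a heap pool sits near its maximum while the sampled "used" value stays low, that would be consistent with a single large allocation (for example, one oversized PDF buffer) rather than a steady leak. That interpretation is an assumption to be verified, not a diagnosis.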

      Because the OutOfMemoryError did not occur with the June build, I believe a bug has been introduced into the code since then.

          People

            Assignee: Unassigned
            Reporter: Tricia Jenkins (pgwillia)