Solr
  1. Solr
  2. SOLR-2886

Out of Memory Error with DIH and TikaEntityProcessor

    Details

      Description

      I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war and then apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various sizes, using the TikaEntityProcessor. My indexing would run to completion and was completely successful under the June build. The only error was readability of the fulltext in highlighting. This was fixed in Tika 0.10 (TIKA-611). I chose to use the October 14 build of Solr because Tika 0.10 had recently been included (SOLR-2372).

      On the same machine without changing any memory settings my initial problem is a Perm Gen error. Fine, I increase the PermGen space.

      I've set the "onError" parameter to "skip" for the TikaEntityProcessor. Now I get several (6)

      SEVERE: Exception thrown while getting data
      java.net.SocketTimeoutException: Read timed out
      SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImport
      HandlerException: Exception in invoking url <url removed> # 2975

      pairs. And after ~3881 documents, with auto commit set unreasonably frequently I consistently get an Out of Memory Error

      SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space

      The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151) and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).

      The October 30 build performs identically.

      Funny thing is that monitoring via JConsole doesn't reveal any memory issues.

      Because the out of Memory error did not occur in June, this leads me to believe that a bug has been introduced to the code since then.

        Activity

        Hide
        Hoss Man added a comment -

        removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward)

        FWIW: it's not clear to me reading the comments how Solr would/could use the suggested workaround in the PDFBOX issue, since Solr dones't invoke PDFBox directly, and delegates to Tika.

        If someone with more tika knowledge can suggest a way in which solr users can configure/influence how Tika uses PDFBox to control this setting, that seems like it would resolve things

        Show
        Hoss Man added a comment - removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward) FWIW: it's not clear to me reading the comments how Solr would/could use the suggested workaround in the PDFBOX issue, since Solr dones't invoke PDFBox directly, and delegates to Tika. If someone with more tika knowledge can suggest a way in which solr users can configure/influence how Tika uses PDFBox to control this setting, that seems like it would resolve things
        Hide
        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Show
        Robert Muir added a comment - rmuir20120906-bulk-40-change
        Hide
        Hoss Man added a comment -

        bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

        Show
        Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
        Hide
        Tricia Jenkins added a comment - - edited

        Some further bug tracing shows that pdfbox' RandomAccessBuffer is responsible for the Out of Memory error. It creates a new larger buffer and then copies the existing buffer into the new buffer. There are no checks to see if the space is available to create a new larger buffer. Use of a RandomAccessBuffer was introduced with PDFBOX-948 at revision 1072678 on Feb 20, 2011 and in for the pdfbox 1.5 and 1.6 releases, hence TIKA-594 for tika 0.10 in revision 1080162 on Mar 10, 2011 and revision 1171497 on Sep 16, 2011. Moving my DIH workflow to another machine with more memory available is able to run to completion.

        It appears that I am the victim of the caviate stated in PDFBOX-948:

        For normal sized PDFs files, the in-memory implementation RandomAccessBuffer should not increase the memory usage too much, while providing faster IO as all access operations are only memory copies.

        Therefore, please consider switching the default to in-memory scratch buffers. Users with very large files can still pass a temporary directory.

        Now to track down how to detect "large files" and use a temporary directory instead. This may turn out to be a TIKA issue rather than Solr.

        Show
        Tricia Jenkins added a comment - - edited Some further bug tracing shows that pdfbox' RandomAccessBuffer is responsible for the Out of Memory error. It creates a new larger buffer and then copies the existing buffer into the new buffer. There are no checks to see if the space is available to create a new larger buffer. Use of a RandomAccessBuffer was introduced with PDFBOX-948 at revision 1072678 on Feb 20, 2011 and in for the pdfbox 1.5 and 1.6 releases, hence TIKA-594 for tika 0.10 in revision 1080162 on Mar 10, 2011 and revision 1171497 on Sep 16, 2011. Moving my DIH workflow to another machine with more memory available is able to run to completion. It appears that I am the victim of the caviate stated in PDFBOX-948 : For normal sized PDFs files, the in-memory implementation RandomAccessBuffer should not increase the memory usage too much, while providing faster IO as all access operations are only memory copies. Therefore, please consider switching the default to in-memory scratch buffers. Users with very large files can still pass a temporary directory. Now to track down how to detect "large files" and use a temporary directory instead. This may turn out to be a TIKA issue rather than Solr.

          People

          • Assignee:
            Unassigned
            Reporter:
            Tricia Jenkins
          • Votes:
            2 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development