Tika / TIKA-521

OutOfMemoryError Parsing XLSX File

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.8
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None

      Description

      I have several XLSX files I'm trying to parse with Tika that are failing with an OutOfMemoryError even when using a large heap size. For instance, the attached 1.26MB Excel file fails using a 512MB heap.

      1. Out of memory issue in 1.0.jpg
        229 kB
        samraj
      2. Out of memory issue in 1.0.jpg
        229 kB
        samraj
      3. TikaExcelEventBasedExtraction.diff
        21 kB
        Nick Burch
      4. tika-diff.txt
        2 kB
        Sjoerd Smeets
      5. tika-new-files.tar.bz2
        5 kB
        Sjoerd Smeets
      6. memory-test.xlsx
        1.27 MB
        Stephen Duncan Jr

        Activity

        Ken Krugler added a comment -

        Tika CLI uses BoilerpipeContentHandler in regular (don't include markup) mode. Here the content handler is essentially dispatching to the Boilerpipe package, so any memory issues would be in that 3rd party code base.

        Maxim Valyanskiy added a comment -

        Sorry, I missed the screenshot with the stack trace.

        Here is what I found: in your case, memory is wasted in BoilerpipeContentHandler. This class is used in the Tika GUI and in the Tika CLI with the "-T" switch. The Tika command-line application works fine with HTML or plain text (-t) output.

        Maxim Valyanskiy added a comment -

        Tika from trunk with POI from trunk parses this test file with -Xmx64M.

        Please post the stack trace for the OOM that you have.

        samraj added a comment -

        I tried with Tika 1.0 and got an error while parsing the same document. I have attached a screenshot of the error.

        samraj added a comment -

        Issue occurred with Tika 1.0.

        Nick Burch added a comment -

        Tika has been updated, in r1081392. Until the next formal release, you'll need to build from svn / grab a nightly build if you want to try out the new changes.

        samraj added a comment -

        Hi Nick,

        POI 3.8 beta 1 has been released. Can you please update Tika 0.9 with that? PDFBox 1.5 has also been released.

        We are also facing problems with xlsx extraction for files >2MB. My whole system hangs if I try to extract the data.

        Nick Burch added a comment -

        POI dependency bumped and patch applied in r1081392.

        Nick Burch added a comment -

        POI 3.8 beta 1 is being voted on at the moment. Once it has been released, I'll upgrade the dependency in Tika and apply the patch to switch Tika to using event based XSSF parsing.

        Nick Burch added a comment -

        Updated TikaExcelEventBasedExtraction.diff which allows you to build against the latest POI snapshot (assuming you manually installed the POI jars locally)

        Nick Burch added a comment -

        Updated TikaExcelEventBasedExtraction.diff which does allow all tests to pass

        This can be applied once there's a newer POI out to depend against

        Nick Burch added a comment - - edited

        The attached patch TikaExcelEventBasedExtraction.diff works against the latest POI Trunk, and switches the Tika .xlsx extraction to using event based parsing.

        However, it isn't quite finished - the sheet protected/not protected information needed by the metadata extraction isn't yet supported. This will need to be added in before we could commit this patch (which will also need a new POI release too)

        Chris A. Mattmann added a comment -
        • classify
        Sjoerd Smeets added a comment -

        Proposed patch

        Sjoerd Smeets added a comment -

        Attached a proposed patch for bigger XLSX files. It has been tested with an XLSX spreadsheet of 70 MB with a heap size of 1024 MB. It should be able to handle bigger files, since it uses SAX parsing. However, using a smaller heap size for the test file resulted in an OutOfMemoryError when extracting the different parts of the XLSX document:

        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:118)
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:154)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:68)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
        at com.ravn.test.tika.XLSTester.parse(XLSTester.java:47)
        at com.ravn.test.TikaTester.main(TikaTester.java:39)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

        The proposed patch is an attempt to generate the same information about an XLSX document as the XSSFExcelExtractorDecorator parser does. There are still some issues to look into, which are marked with TODO comments. Some advice on these matters would be welcome. Could someone check whether the proposed patch is acceptable? If so, I'll try to implement the TODO items and write some test cases. Maybe this can then be the default parser.

        I also changed/created certain parts in POI in order to get the patch working. See https://issues.apache.org/bugzilla/show_bug.cgi?id=50076 for the proposed changes for POI.
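
The trace above fails inside ZipInputStreamZipEntrySource$FakeZipEntry, at ByteArrayOutputStream.toByteArray, which suggests that zip entries are copied into byte arrays when the package is opened from a stream. A minimal sketch of the alternative, using only the JDK: open the archive as a file and read one entry through a fixed-size buffer, so memory use does not grow with the size of the entry. This is not POI or Tika code; the class name, the method names, the entry name and the sample content are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class StreamingZipSketch {

    /** Counts the bytes of one entry through an 8 KB buffer; the entry is never held in memory whole. */
    static long countEntryBytes(Path archive, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            long total = 0;
            byte[] buffer = new byte[8192];
            try (InputStream in = zip.getInputStream(entry)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    total += read; // a real consumer would hand these bytes to a SAX parser
                }
            }
            return total;
        }
    }

    /** Writes a small stand-in archive with a single entry (hypothetical test data). */
    static Path writeSampleArchive(String entryName, byte[] content) throws IOException {
        Path archive = Files.createTempFile("sketch", ".zip");
        try (OutputStream out = Files.newOutputStream(archive);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        }
        return archive;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "<worksheet/>".getBytes(StandardCharsets.UTF_8);
        Path archive = writeSampleArchive("xl/worksheets/sheet1.xml", content);
        System.out.println(countEntryBytes(archive, "xl/worksheets/sheet1.xml"));
        Files.delete(archive);
    }
}
```

The sketch needs a file on disk because ZipFile uses random access; a parser that only receives an InputStream would first have to spool it to a temporary file to read it this way.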

        Maxim Valyanskiy added a comment -

        If a plain text is enough for you, you can apply patch from TIKA-511 and call ExtractorFactory.setAllThreadsPreferEventExtractors(true) before running Tika

        Sjoerd Smeets added a comment -

        Ok, I'll see if I can create a patch for this.

        Nick Burch added a comment -

        It would need someone to work up a patch. We can't simply use XSSFEventBasedExcelExtractor, as that produces limited plain text, but we want to generate HTML + include headers, footers, links, comments etc

        So, we'd need code that was similar to XSSFEventBasedExcelExtractor, but which also did the additional work to include the extra parts we currently have
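
To make the idea concrete, here is a minimal sketch of event-based parsing that emits table markup instead of plain text. It uses only the JDK's SAX parser and is not the Tika patch or POI's extractor. The class and method names are hypothetical, and the input is a simplified sheet with inline cell values; shared strings, styles, headers, footers, links and comments are not handled.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SheetToHtmlSketch {

    /** Turns sheetData/row/c/v events into table markup as they arrive; no object model of the sheet is built. */
    static class HtmlHandler extends DefaultHandler {
        // For brevity the output is collected in memory; a streaming version would write it out as it goes.
        final StringBuilder html = new StringBuilder();
        private boolean inValue;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            switch (qName) {
                case "sheetData": html.append("<table>"); break;
                case "row":       html.append("<tr>");    break;
                case "c":         html.append("<td>");    break;
                case "v":         inValue = true;         break;
                default:          break;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            switch (qName) {
                case "sheetData": html.append("</table>"); break;
                case "row":       html.append("</tr>");    break;
                case "c":         html.append("</td>");    break;
                case "v":         inValue = false;         break;
                default:          break;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!inValue) {
                return; // ignore text outside cell values
            }
            for (int i = start; i < start + length; i++) {
                char c = ch[i];
                if (c == '&') html.append("&amp;");
                else if (c == '<') html.append("&lt;");
                else if (c == '>') html.append("&gt;");
                else html.append(c);
            }
        }
    }

    /** Parses the given sheet XML with the default (non-namespace-aware) SAX parser. */
    static String toHtml(String sheetXml) throws Exception {
        HtmlHandler handler = new HtmlHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.html.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet><sheetData><row><c><v>1</v></c><c><v>2</v></c></row></sheetData></worksheet>";
        System.out.println(toHtml(xml));
    }
}
```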

        Sjoerd Smeets added a comment -

        I'm facing the same issue. Increasing the heap size to the maximum will cover a certain number of xlsx files, but there are still a lot of files causing an OutOfMemoryError (> 10 MB XLSX files). The XSSFEventBasedExcelExtractor does process these files as we would like. What would be the drawback of using XSSFEventBasedExcelExtractor?

        Stephen Duncan Jr added a comment -

        I have 7MB files that can't be handled when given 2GB of RAM; they required 3GB to process. I'm looking at likely needing to run on 32-bit Java, so increasing the heap size that high is not really an option. Besides, at the growth rate I see, a 20MB file might require 10GB of heap. That simply doesn't scale for reasonable file sizes. Meanwhile, the same 7MB file can be parsed using the alternate API with a 128MB heap. That should allow any reasonable file to be processed assuming a reasonable 1GB heap size.
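
As a back-of-the-envelope check of that estimate, assuming heap use grows in proportion to file size (an assumption; the only measurement reported here is the 7MB file needing 3GB):

```latex
\[
\frac{3\ \text{GB}}{7\ \text{MB}} \approx 0.43\ \text{GB per MB},
\qquad
20\ \text{MB} \times 0.43\ \text{GB/MB} \approx 8.6\ \text{GB}
\]
```

That is in the same range as the 10GB figure quoted above.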

        Nick Burch added a comment -

        Excel files really really munch memory. XLSX is worse than XLS, as the XML processing into objects takes lots of memory.

        Some files are worse than others; it depends on the kinds of things in them. I'd suggest you just up your heap size.

        Stephen Duncan Jr added a comment -

        Using the POI API directly, and using their event-based model, I was able to parse the file using less than 20MB of heap space (less than 64MB of heap size allocated). Can Tika be modified to use the event-based API when extracting text? Here's the sample code used:

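        import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;

        // Path to the attached test file
        final String filePath = "C:\\Users\\stephen.duncan\\tmp\\memory-test.xlsx";
        // Event-based extractor: reads the sheet XML as a stream of events
        XSSFEventBasedExcelExtractor extractor = new XSSFEventBasedExcelExtractor(filePath);

        String text = extractor.getText();
        System.out.println(text);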


          People

          • Assignee:
            Nick Burch
            Reporter:
            Stephen Duncan Jr
          • Votes:
            1
            Watchers:
            2
