Tika / TIKA-1866

Out of memory error on Word document

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11, 1.12
    • Fix Version/s: 2.0, 1.13
    • Component/s: parser
    • Labels: None

      Description

      Extracting text from the attached MS Word document throws an OutOfMemoryError. I worked my way up from no memory arguments to 2G, 3G, and 8G of heap; all result in the same error.

      The document is only 220K; the number of tables appears to be causing the issue.

      java -Xms8G -Xmx8G -jar tika-app-1.12.jar --text EPA-HQ-RCRA-2013-0396-0010.docx 
      

      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
      at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
      at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
      at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
      at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
      at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:177)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
      at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
      at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
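
      The description suspects the number of tables, and the trace shows the failure occurring while XMLBeans loads the main document part inside XWPFDocument.onDocumentRead. One way to test that hypothesis without going through POI or XMLBeans is to stream word/document.xml with the JDK's StAX reader and count the table elements. The following is a minimal sketch under that assumption; the class name DocxTableCounter and its methods are hypothetical and are not part of Tika or POI. It only measures the document and does not reproduce or fix the error.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not a Tika or POI class): counts table markup in a .docx
// by streaming the main document part, so memory use stays small.
final class DocxTableCounter {
    // WordprocessingML main namespace (the usual "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Locates word/document.xml inside the .docx (a ZIP archive) and returns
    // {tables, rows, cells}.
    static long[] countTables(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return countInPart(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    // Streams one XML part and counts w:tbl, w:tr and w:tc start tags.
    static long[] countInPart(InputStream part) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The input is untrusted, so disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(part);
        long[] counts = new long[3];
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && W_NS.equals(reader.getNamespaceURI())) {
                    switch (reader.getLocalName()) {
                        case "tbl": counts[0]++; break;
                        case "tr":  counts[1]++; break;
                        case "tc":  counts[2]++; break;
                        default:    break;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            long[] c = countTables(in);
            System.out.printf("tables=%d rows=%d cells=%d%n", c[0], c[1], c[2]);
        }
    }
}
```

      Run it with the path to the .docx as the only argument. Nested tables are counted individually, so an unusually large count here would support the table hypothesis, while a small count would point elsewhere in the document.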

        Attachments

        1. U77VVDMDHSQ6M2CLZH3AM2IEZOIUEJWI.pptx
          1.67 MB
          Tim Allison
        2. tika-enemy.docx
          205 kB
          Shawn Johnson

              People

              • Assignee: Unassigned
              • Reporter: Shawn Johnson (shawnjohnson159)
              • Votes: 0
              • Watchers: 3