Tika / TIKA-1866

Out of memory error on Word document

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11, 1.12
    • Fix Version/s: 1.13, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      Extracting the text from the attached MS Word document throws an OutOfMemoryError. I worked my way up from no memory arguments to 2G, 3G, and 8G of heap; all result in the same error.

      The document is only 220K; the number of tables appears to be what causes the issue.

      java -Xms8G -Xmx8G -jar tika-app-1.12.jar --text EPA-HQ-RCRA-2013-0396-0010.docx 
      

      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
      at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
      at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
      at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
      at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
      at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:177)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
      at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
      at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
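
      One possible way to check the table hypothesis without going through POI or XMLBeans is to stream the main document part and count table elements. The sketch below is illustrative only, not part of Tika: the class name DocxTableCount is hypothetical, and it assumes the body lives in the conventional word/document.xml entry of the .docx archive. It uses only the JDK (java.util.zip and StAX), so it never builds the whole XML tree in memory.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic helper; not part of Tika or POI.
final class DocxTableCount {
    // WordprocessingML main namespace (the same one that appears in the stack trace's schema package).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams the XML and counts w:tbl start elements, including nested tables. */
    static int countTables(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected in OOXML parts
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        int tables = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "tbl".equals(reader.getLocalName())
                        && W_NS.equals(reader.getNamespaceURI())) {
                    tables++;
                }
            }
        } finally {
            reader.close();
        }
        return tables;
    }

    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile(args[0])) {
            // Assumption: the main part is at the conventional location.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("word/document.xml not found in " + args[0]);
                return;
            }
            // getSize() reports the uncompressed size, or -1 if it is not recorded.
            System.out.printf("compressed=%d bytes, uncompressed=%d bytes%n",
                    entry.getCompressedSize(), entry.getSize());
            try (InputStream in = zip.getInputStream(entry)) {
                System.out.println("tables=" + countTables(in));
            }
        }
    }
}
```

      Run it with the path to the .docx as the only argument. A large gap between the compressed and uncompressed sizes, or a very high table count, would be consistent with the observation above, but neither by itself proves where the memory goes.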

      Attachments

        1. tika-enemy.docx
          205 kB
          Shawn Johnson
        2. U77VVDMDHSQ6M2CLZH3AM2IEZOIUEJWI.pptx
          1.67 MB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Shawn Johnson (shawnjohnson159)
              Votes: 0
              Watchers: 3