Tika / TIKA-1866

Out of memory error on Word document

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11, 1.12
    • Fix Version/s: 1.13, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      Extracting text from the attached MS Word document throws an OutOfMemoryError (Java heap space). I worked my way up from no memory arguments to 2G, 3G, and 8G of heap; all result in the same error.

      The document is only 220K; the number of tables appears to be causing the issue.
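
      A .docx file is a ZIP archive, so the 220K figure is the compressed size. The stack trace in this report fails while XMLBeans parses the main document part (XWPFDocument.onDocumentRead), so one cheap thing to check is how large word/document.xml is once uncompressed. The following is a minimal sketch using only java.util.zip; the class name DocxEntrySizes and its method are hypothetical helpers, not part of Tika or POI, and a large uncompressed size would only be a hint, not a confirmed cause.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper; not part of Tika or POI.
final class DocxEntrySizes {

    // Maps each ZIP entry name to its uncompressed size in bytes
    // (-1 if the archive does not record the size).
    static Map<String, Long> uncompressedSizes(Path docx) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            zip.stream().forEach(e -> sizes.put(e.getName(), e.getSize()));
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to the .docx file as the first argument.
        for (Map.Entry<String, Long> e : uncompressedSizes(Paths.get(args[0])).entrySet()) {
            System.out.printf("%12d  %s%n", e.getValue(), e.getKey());
        }
    }
}
```

      Run it as `java DocxEntrySizes EPA-HQ-RCRA-2013-0396-0010.docx` and look at the line for word/document.xml.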

      java -Xms8G -Xmx8G -jar tika-app-1.12.jar --text EPA-HQ-RCRA-2013-0396-0010.docx 
      

      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
      at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
      at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
      at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
      at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
      at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:177)
      at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
      at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
      at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
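
      Because 2G, 3G, and 8G all fail the same way, it can be worth confirming that the heap setting actually reaches the JVM in the environment where the command runs. This is a minimal sketch, not part of the original report; the class name HeapCheck is hypothetical, and it uses only the standard Runtime API.

```java
// Hypothetical sanity check; not part of Tika.
final class HeapCheck {

    // Converts a byte count to whole mebibytes (rounding down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will attempt to use,
        // which reflects the -Xmx setting when one is given.
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: " + toMiB(max) + " MiB");
    }
}
```

      Running `java -Xms8G -Xmx8G HeapCheck` with the same flags as the failing command should print a value close to 8192 MiB; the exact number can be slightly lower depending on the JVM and garbage collector.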

          People

            Assignee: Unassigned
            Reporter: Shawn Johnson (shawnjohnson159)
            Votes: 0
            Watchers: 3
