[TIKA-1866] Out of memory error on Word document - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.11, 1.12
Fix Version/s: 1.13, 2.0.0
Component/s: parser
Labels: None

Description

Trying to extract the text from the attached MS Word document throws an OutOfMemoryError. I worked my way up from no memory arguments to 2G, 3G, and 8G of heap; all result in the same error.

The document is only 220K; the number of tables appears to be what causes the issue.

java -Xms8G -Xmx8G -jar tika-app-1.12.jar --text EPA-HQ-RCRA-2013-0396-0010.docx

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:177)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
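
One way to check the suspicion that the table count is the trigger is to count the `w:tbl` elements in the document's main part with a streaming reader, so the check itself never builds the in-memory tree that fails in the trace above. The following is a minimal sketch using only the JDK, not part of Tika or POI. The class name `DocxTableCounter` is hypothetical, and the sketch assumes the main part is stored at the conventional path `word/document.xml` and uses the transitional WordprocessingML namespace.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic helper; not part of Tika or POI.
public class DocxTableCounter {

    // Assumption: transitional WordprocessingML namespace (not the "strict" variant).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: the main document part lives at the conventional path.
    private static final String MAIN_PART = "word/document.xml";

    /** Returns the number of w:tbl elements (nested ones included), or -1 if the part is missing. */
    public static int countTables(InputStream docx) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!MAIN_PART.equals(entry.getName())) {
                    continue;
                }
                // Stream the entry event by event; nothing is retained in memory.
                XMLStreamReader reader = factory.createXMLStreamReader(zip);
                int tables = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "tbl".equals(reader.getLocalName())
                            && W_NS.equals(reader.getNamespaceURI())) {
                        tables++;
                    }
                }
                reader.close();
                return tables;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("w:tbl elements: " + countTables(in));
        }
    }
}
```

Running it as `java DocxTableCounter EPA-HQ-RCRA-2013-0396-0010.docx` prints the table count for the file from the report. A high count would be consistent with the suspicion but would not by itself prove that tables are the cause.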

Attachments

tika-enemy.docx (205 kB), 22/Feb/16 22:40, Shawn Johnson
U77VVDMDHSQ6M2CLZH3AM2IEZOIUEJWI.pptx (1.67 MB), 02/Mar/16 19:10, Tim Allison

Issue Links

depends upon: TIKA-1895 Upgrade to POI 3.15-beta1 when available (Resolved)
duplicates: TIKA-1473 Apache Tika is not working for .docx documents (Resolved)
is duplicated by: TIKA-2326 java.lang.OutOfMemoryError: Java heap space (Closed)
relates to: TIKA-1961 OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode characters (Closed)

People

Assignee: Unassigned
Reporter: Shawn Johnson
Votes: 0
Watchers: 3

Dates

Created: 22/Feb/16 22:39
Updated: 12/Apr/21 13:01
Resolved: 18/Apr/16 16:50