  1. Tika
  2. TIKA-1961

OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode characters


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 1.13, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      The Piccolo parser used by xmlbeans appears to read XML files in chunks of 8192 bytes. Problems arise when a chunk boundary falls in the middle of a multi-byte Unicode character.
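
      The general failure class described above can be illustrated with a minimal sketch. It is not Piccolo's code and does not reproduce the OutOfMemoryError; it only shows, under the assumption of UTF-8 input and a fixed 8192-byte chunk size, what happens when each chunk is decoded on its own compared with a decoder that carries incomplete bytes over to the next chunk. The class and method names are hypothetical.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ChunkBoundaryDemo {
    static final int CHUNK = 8192; // chunk size reported in this issue

    /** Builds input in which the 4-byte character U+1D43A starts at offset CHUNK - 3. */
    static byte[] buildInput() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < CHUNK - 3; i++) {
            sb.append('a'); // single-byte padding
        }
        sb.appendCodePoint(0x1D43A);
        sb.append("tail");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Naive approach: decodes every chunk independently, so a split character is lost. */
    static String decodePerChunk(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += CHUNK) {
            int len = Math.min(CHUNK, data.length - off);
            // Malformed bytes at the chunk edges are replaced with U+FFFD.
            out.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    /** Stateful approach: bytes of an incomplete character are kept for the next chunk. */
    static String decodeStreaming(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.allocate(CHUNK + 4);   // room for up to 3 carried bytes
        CharBuffer chars = CharBuffer.allocate(CHUNK + 4);
        StringBuilder out = new StringBuilder();
        try {
            for (int off = 0; off < data.length; off += CHUNK) {
                int len = Math.min(CHUNK, data.length - off);
                boolean last = off + len == data.length;
                in.put(data, off, len);
                in.flip();
                CoderResult result = decoder.decode(in, chars, last);
                if (result.isError()) {
                    result.throwException();
                }
                if (last) {
                    decoder.flush(chars);
                }
                chars.flip();
                out.append(chars);
                chars.clear();
                in.compact(); // keep the undecoded tail, if any
            }
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = buildInput();
        String expected = new String(data, StandardCharsets.UTF_8);
        System.out.println("per-chunk decoding intact: " + decodePerChunk(data).equals(expected));
        System.out.println("streaming decoding intact: " + decodeStreaming(data).equals(expected));
    }
}
```

      In this sketch the per-chunk variant silently corrupts the text, while the streaming variant returns the original string.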

      I managed to create a problematic file myself: dmsu1332-reproduced.xlsx (attached).
      Some files were fixed simply by opening and re-saving them in Office 2013, but this one is not fixed by that open/save trick, even without any modification.

      The file xl/drawings/drawing1.xml within this xlsx contains a formula. The boundary between the first and second chunks (at offset 0x2000) splits the same Unicode character in the same way: F09D90-BA, i.e. the first three bytes end one chunk and the last byte starts the next.
      I noticed that the character immediately before this multi-byte character must be a single-byte character for the OutOfMemoryError to occur. Otherwise a different failure results (not OutOfMemory, just a failure to parse the XML file within the xlsx).
      I don't know whether this can be reproduced with two- or three-byte Unicode characters, or whether other split patterns (i.e. F0-9D90BA and F09D-90BA) would also cause problems.
      Problematic character: http://unicode.scarfboy.com/?s=U%2B1d43a
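
      To check the untested split patterns, inputs can be generated so that U+1D43A straddles offset 0x2000 with one, two or three of its bytes before the boundary. The following is only a sketch: the element name `t` and the ASCII padding are hypothetical and are not the real drawing1.xml structure, so the output would still have to be embedded in a valid drawing part before it could be fed to the parser.

```java
import java.nio.charset.StandardCharsets;

final class SplitPatterns {
    static final int BOUNDARY = 0x2000;   // chunk boundary reported in this issue
    static final int CODE_POINT = 0x1D43A; // problematic character from this issue

    /** Returns the UTF-8 bytes of a single code point. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Builds XML bytes in which exactly bytesBefore bytes of the character
     * lie before BOUNDARY (1 = F0-9D90BA, 2 = F09D-90BA, 3 = F09D90-BA).
     */
    static byte[] xmlWithSplit(int bytesBefore) {
        int charLength = utf8(CODE_POINT).length;
        if (bytesBefore < 1 || bytesBefore >= charLength) {
            throw new IllegalArgumentException("bytesBefore must be in 1.." + (charLength - 1));
        }
        String open = "<t>"; // hypothetical element, ASCII only
        int padding = BOUNDARY - bytesBefore - open.length();
        StringBuilder sb = new StringBuilder(open);
        for (int i = 0; i < padding; i++) {
            sb.append('x'); // single-byte characters precede the split character
        }
        sb.appendCodePoint(CODE_POINT);
        sb.append("</t>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as upper-case hex without separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 of U+1D43A: " + hex(utf8(CODE_POINT)));
        for (int k = 1; k <= 3; k++) {
            byte[] xml = xmlWithSplit(k);
            System.out.println(k + " byte(s) before 0x2000, character starts at offset "
                    + (BOUNDARY - k) + ", total length " + xml.length);
        }
    }
}
```

      The padding consists of single-byte characters on purpose, matching the observation that the character before the split one has to be single-byte.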

      Finally, this is easier to reproduce with formulas, because each symbol in a formula that is automatically set in italic, such as "a", "x" or "dx" (the latter is two symbols), is represented by a 4-byte Unicode character.
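
      The size difference can be checked with a short sketch. It assumes that the italic formula symbols come from the Unicode Mathematical Alphanumeric Symbols block, as the problematic character U+1D43A suggests; the specific code points below (italic small a, d and x) are chosen for illustration and were not taken from the attached file.

```java
import java.nio.charset.StandardCharsets;

final class MathItalicBytes {
    // Assumed code points from the Mathematical Alphanumeric Symbols block.
    static final int ITALIC_SMALL_A = 0x1D44E;
    static final int ITALIC_SMALL_D = 0x1D451;
    static final int ITALIC_SMALL_X = 0x1D465;

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Builds a string from a sequence of code points. */
    static String of(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String plain = "dx";
        String italic = of(ITALIC_SMALL_D, ITALIC_SMALL_X);
        System.out.println("plain  dx: " + utf8Length(plain) + " bytes");
        System.out.println("italic dx: " + utf8Length(italic) + " bytes, "
                + italic.codePointCount(0, italic.length()) + " symbols");
    }
}
```

      With four bytes per symbol, even a short formula contains many characters that can land on a chunk boundary, which is consistent with formulas reproducing the problem more readily.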

      stack trace:
      java.lang.OutOfMemoryError: Java heap space
      at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
      at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
      at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
      at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
      at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(Unknown Source)
      at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:84)
      at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:294)
      at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:148)
      at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:114)
      at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:94)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

      Attachments

        1. dmsu1332-reproduced.xlsx
          10 kB
          Andrei Rebegea
        2. problem char separation.png
          67 kB
          Andrei Rebegea


          People

            Assignee: Tim Allison (tallison)
            Reporter: Andrei Rebegea (andrei.rebegea)
            Votes: 0
            Watchers: 3
