Tika / TIKA-1961

OutOfMemory when parsing shapes xml from xlsx files with multi-byte Unicode characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 2.0, 1.13
    • Component/s: parser
    • Labels:
      None

      Description

      The Piccolo parser used by xmlbeans appears to read XML files in chunks of 8192 bytes. Problems appear when a chunk boundary falls inside a multi-byte (UTF-8 encoded) Unicode character.
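
To illustrate the failure mode described above, here is a minimal Java sketch of what goes wrong when fixed-size byte chunks are decoded independently of each other. It is not Piccolo's actual code: the class `ChunkSplitDemo`, the method `decodePerChunk`, and the tiny chunk size are hypothetical and chosen only to make the effect visible.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika, POI, xmlbeans or Piccolo.
class ChunkSplitDemo {

    // Naive approach: decode every fixed-size chunk on its own.
    // A character whose bytes straddle a chunk boundary is seen as
    // malformed input in both chunks (Java substitutes U+FFFD).
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a", U+1D43A (UTF-8: F0 9D 90 BA), "b": 6 bytes in total.
        String text = "a" + new String(Character.toChars(0x1D43A)) + "b";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);

        // Chunk size 4 puts the boundary at F0 9D 90 | BA, the split from the report.
        String naive = decodePerChunk(data, 4);
        String whole = new String(data, StandardCharsets.UTF_8);

        System.out.println("per-chunk decoding intact: " + text.equals(naive));
        System.out.println("whole-buffer decoding intact: " + text.equals(whole));
    }
}
```

A decoder that keeps state across reads (for example `java.io.InputStreamReader`, or a reused `java.nio.charset.CharsetDecoder`) avoids this, because it carries the incomplete byte sequence over to the next chunk.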

      I managed to create a problematic file myself: dmsu1332-reproduced.xlsx (attached).
      Some affected files were fixed simply by opening and re-saving them in Office 2013, but this one is not fixed by an open/save cycle without modification.

      The file xl/drawings/drawing1.xml within this xlsx contains a formula. The boundary between the 1st and 2nd chunks (at offset 0x2000) splits the same Unicode character in the same way: F09D90-BA (three bytes fall before the boundary, one after).
      I noticed that the character before this multi-byte Unicode character must be a single-byte character. Otherwise a different issue occurs (not an OutOfMemoryError, just a failure to parse the XML file within the xlsx).
      I don't know whether this can be reproduced with two- or three-byte Unicode characters, or whether other split patterns would also cause issues (e.g. F0-9D90BA and F09D-90BA).
      Problematic character: U+1D43A, see http://unicode.scarfboy.com/?s=U%2B1d43a
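
To check other files for the split patterns mentioned above, a small diagnostic along these lines could report whether a given offset falls inside a multi-byte UTF-8 sequence. This is only a sketch: `Utf8BoundaryCheck` and `bytesBeforeBoundary` are hypothetical names, and the input is assumed to be well-formed UTF-8 (such as an extracted xl/drawings/drawing1.xml).

```java
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of Tika, POI or xmlbeans.
class Utf8BoundaryCheck {

    // Returns how many bytes of a multi-byte UTF-8 sequence lie before
    // `boundary`, or 0 if the boundary falls between two characters.
    // Assumes `data` is well-formed UTF-8.
    static int bytesBeforeBoundary(byte[] data, int boundary) {
        if (boundary <= 0 || boundary >= data.length) {
            return 0;
        }
        // Continuation bytes have the bit pattern 10xxxxxx.
        if ((data[boundary] & 0xC0) != 0x80) {
            return 0; // boundary sits on the first byte of a character
        }
        int lead = boundary - 1;
        while (lead > 0 && (data[lead] & 0xC0) == 0x80) {
            lead--; // walk back to the lead byte of the sequence
        }
        return boundary - lead;
    }

    public static void main(String[] args) {
        // ASCII filler with U+1D43A placed so that the 0x2000 boundary
        // splits it as F09D90-BA, the pattern from the report.
        byte[] data = new byte[0x2000 + 2];
        Arrays.fill(data, (byte) 'a');
        byte[] g = {(byte) 0xF0, (byte) 0x9D, (byte) 0x90, (byte) 0xBA};
        System.arraycopy(g, 0, data, 0x2000 - 3, g.length);

        System.out.println(bytesBeforeBoundary(data, 0x2000));
    }
}
```

For a 4-byte character, return values of 1, 2 and 3 correspond to the splits F0-9D90BA, F09D-90BA and F09D90-BA respectively.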

      Finally, this is easier to reproduce with formulas, because each symbol that is automatically typeset in italic, such as "a", "x" or "dx" (which is two symbols), is represented by a 4-byte Unicode character.
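
The byte counts can be checked with a short Java sketch. It assumes the italic formula symbols are code points from the Mathematical Alphanumeric Symbols block, like the U+1D43A reported above; U+1D44E and U+1D465 (mathematical italic small "a" and "x") are used here as illustrative examples, not as values taken from the attached file. The class name `ItalicSymbolBytes` and the method `utf8Hex` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika, POI or xmlbeans.
class ItalicSymbolBytes {

    // UTF-8 bytes of a single code point, formatted as upper-case hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain "a" versus three code points outside the Basic Multilingual Plane.
        int[] codePoints = {'a', 0x1D44E, 0x1D465, 0x1D43A};
        for (int cp : codePoints) {
            System.out.printf("U+%04X -> %s (%d Java chars)%n",
                    cp, utf8Hex(cp), Character.charCount(cp));
        }
    }
}
```

Because each such symbol occupies four bytes in UTF-8 (and a surrogate pair in Java), a formula-heavy drawing part makes it more likely that some character straddles the 8192-byte boundary.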

      stack trace:
      java.lang.OutOfMemoryError: Java heap space
      at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
      at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
      at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
      at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
      at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
      at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
      at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
      at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
      at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
      at org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTDrawing$Factory.parse(Unknown Source)
      at org.apache.poi.xssf.usermodel.XSSFDrawing.<init>(XSSFDrawing.java:84)
      at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getShapes(XSSFReader.java:294)
      at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:148)
      at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:114)
      at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:94)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
      at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

        Attachments

        1. problem char separation.png
          67 kB
          Andrei Rebegea
        2. dmsu1332-reproduced.xlsx
          10 kB
          Andrei Rebegea

              People

              • Assignee:
                tallison@apache.org Tim Allison
              • Reporter:
                andrei.rebegea Andrei Rebegea
              • Votes:
                0
              • Watchers:
                3
