Tika / TIKA-1132

Parsing some XLS documents hangs entire JVM, requires kill -9

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Description

      Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required.

      We're running Tika inside an email server application, parsing documents to extract the text of all attachments. When we hit a message with an affected attachment, the entire JVM hangs. We mark the message so that text extraction is skipped for it on the next attempt. Unfortunately, the hang stops all email processing on the server until the internal watchdogs kill -9 the application.

      We have seen the issue for several months with different documents, but they are always Excel files. Excel complains when opening some of them, but not all.

      In addition to hitting the problem on our Linux servers, I have tested on OS X and seen the same behavior. Whether I open the affected file in the Tika GUI or run the CLI, the result is the same.
      Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

      When running on multi-CPU machines there are two threads running at 100% every time.

      I have attached a document that triggers the error.

      I have tested with 1.2 and 1.3 with the same result. With 1.1 the text is extracted correctly.

      1. mod.xls
        36 kB
        Ryan Krueger
      2. mod3.xlsx
        25 kB
        Ryan Krueger

        Activity

        Ryan Krueger added a comment - - edited

        This file triggers the error.

        Nick Burch added a comment -

        I can confirm that it goes into an infinite loop for me too

        Any chance that you could run it in a profiler or similar, and track down where the loop is happening? (My hunch is it'll be an edge case in POI / POI not handling a subtle form of corruption)

        Ryan Krueger added a comment -

        Running jvisualvm and pulling a thread dump I get the same trace each time:

        "main" prio=10 tid=0x0000000000606800 nid=0x7799 runnable [0x00007fe26bf1d000]
        java.lang.Thread.State: RUNNABLE
        at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1009)
        at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1033)
        at java.text.Format.format(Format.java:157)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:699)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:669)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:129)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:419)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:323)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
        at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:299)
        at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:151)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)

        Looking at POI 3.8 in grepcode I see the affected code. The methods appear to be unchanged in 3.9.

        I don't know what's causing the issue as it doesn't immediately appear to me to be an infinite loop.

        Here is the relevant section from org.apache.poi.ss.usermodel.DataFormatter.

        1005 double minVal = 1.0;
        1006 double currDenom = Math.pow(10, fractParts[1].length()) - 1d;
        1007 double currNeum = 0;
        1008 for (int i = (int)(Math.pow(10, fractParts[1].length()) - 1d); i > 0; i--) {
        1009     for (int i2 = (int)(Math.pow(10, fractParts[1].length()) - 1d); i2 > 0; i2--) {
        1010         if (minVal >= Math.abs((double)i2/(double)i - decPart)) {
        1011             currDenom = i;
        1012             currNeum = i2;
        1013             minVal = Math.abs((double)i2/(double)i - decPart);
        1014         }
        1015     }
        1016 }
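
To get a feel for how much work the nested loops above do, here is a minimal sketch that reproduces only the loop-bound arithmetic from the quoted lines. The class and method names are made up for illustration, and it assumes that fractParts[1] holds the denominator part of the cell format, so that its length is the number of denominator placeholder characters.

```java
import java.math.BigInteger;

public class FractionLoopCost {

    /** Loop bound exactly as the quoted code computes it: (int)(10^digits - 1). */
    static int loopBound(int denominatorDigits) {
        // Casting a double larger than Integer.MAX_VALUE to int saturates at Integer.MAX_VALUE.
        return (int) (Math.pow(10, denominatorDigits) - 1d);
    }

    /** Inner-loop passes for the two nested loops: bound * bound. */
    static BigInteger totalIterations(int denominatorDigits) {
        BigInteger bound = BigInteger.valueOf(loopBound(denominatorDigits));
        return bound.multiply(bound);
    }

    public static void main(String[] args) {
        // 2 and 3 digits are typical fraction formats; 12 is a hypothetical pathological case.
        for (int digits : new int[] {2, 3, 12}) {
            System.out.println(digits + " digits: bound=" + loopBound(digits)
                    + ", iterations=" + totalIterations(digits));
        }
    }
}
```

With a 2-character denominator the loops run 99 * 99 = 9801 times, which is harmless. With 12 characters the int cast saturates at 2147483647, so the inner body would run roughly 4.6 * 10^18 times. That is finite, but it looks like a hang in practice.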

        Ryan Krueger added a comment -

        I saved the xls file as a new xlsx file; no change.

        I then modified the xls file, removing sections until I was able to zero in on the affected cells.

        It looks like a custom-formatted cell with this format triggers the error:

        ????????????/????????????

        This happens regardless of the number in the cell. I'll upload a new test file.

        Ryan Krueger added a comment -

        This trivial file triggers the error.

        Nick Burch added a comment -

        Thanks for the test file. There's an open bug in POI about fraction formatting, which might be the same thing. I'll hopefully be able to take a look in the next few days, other work permitting.

        Tim Allison added a comment -

        The Tika GUI took longer than I was willing to wait, too. tika.parseToString() returned a value in about 30 seconds. As you both suggested, the fraction formatter was likely the culprit. I just submitted a patch to POI bug 54686.
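
For readers wondering how a fraction can be found without a brute-force search over every numerator and denominator, one well-known technique is continued-fraction expansion. The sketch below is only an illustration of that general technique. It is not the patch attached to the POI bug, the class and method names are invented, and it assumes the input is the decimal part of the cell value (0 <= x < 1) and that maxDenominator is at least 1.

```java
public class FractionApprox {

    /**
     * Approximates x by a fraction whose denominator does not exceed maxDenominator,
     * using continued-fraction convergents. Returns {numerator, denominator}.
     * Assumes 0 <= x < 1 and maxDenominator >= 1.
     */
    static long[] approximate(double x, long maxDenominator) {
        long hPrev = 0, kPrev = 1; // convergent two steps back
        long h = 1, k = 0;         // previous convergent (formal starting value)
        double r = x;
        for (int iter = 0; iter < 64; iter++) {
            long a = (long) Math.floor(r);
            // Stop before the next denominator a*k + kPrev would exceed the limit
            // (written as a division so the multiplication cannot overflow).
            if (k != 0 && a > (maxDenominator - kPrev) / k) {
                break;
            }
            long hNext = a * h + hPrev;
            long kNext = a * k + kPrev;
            hPrev = h;
            kPrev = k;
            h = hNext;
            k = kNext;
            double frac = r - a;
            if (frac < 1e-12) {
                break; // expansion terminated (up to rounding error)
            }
            r = 1.0 / frac;
        }
        return new long[] {h, k};
    }

    public static void main(String[] args) {
        long[] f = approximate(0.75, 99);
        System.out.println(f[0] + "/" + f[1]);
    }
}
```

Each loop pass adds one term of the expansion, so the number of steps grows with the number of digits in the denominator limit rather than with its square. The result is a good approximation, though convergents alone are not guaranteed to be the closest possible fraction under the limit.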

        Tim Allison added a comment -

        Upgrade to POI-3.10-beta2 fixed this.

        Tim Allison added a comment - - edited

        Any recommendations for a test? The underlying problem was that POI was doing on the order of 10^24 division calculations...so not infinite, but exceedingly slow. Would a jUnit timeout of, say, 10 seconds be reasonable?

        Tim Allison added a comment -

        Resolved with upgrade to poi-3.10-beta2.
        Could use help getting jUnit's timeout to work.
        Currently no unit tests for this.
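
As a possible starting point for such a regression test, here is a sketch that enforces a time limit with plain java.util.concurrent, so it does not depend on any particular JUnit feature. JUnit 4's @Test(timeout = ...) attribute is another option. The names TimeoutGuard and completesWithin are hypothetical, and the tasks in main are stand-ins: a real test would call the Tika parser on the attached file inside the Runnable.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuard {

    /** Returns true if the task finishes within timeoutMillis, false if it is still running. */
    static boolean completesWithin(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-guard-worker");
            t.setDaemon(true); // a stuck worker should not keep the JVM alive on exit
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            // cancel(true) only interrupts the worker; a CPU-bound loop that never
            // checks the interrupt flag keeps running until the JVM exits.
            future.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in tasks; replace with the actual parse call in a real test.
        System.out.println(completesWithin(() -> { }, 1000));
        System.out.println(completesWithin(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException ignored) {
                // interrupted by cancel(true) after the timeout
            }
        }, 100));
    }
}
```

A test could then assert that completesWithin(parseTask, 10000) returns true, with the 10-second figure taken from the suggestion above. Because the slow code here is a tight arithmetic loop, the worker cannot be stopped cleanly after a timeout, which is why the sketch marks it as a daemon thread.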


          People

          • Assignee:
            Tim Allison
            Reporter:
            Ryan Krueger
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved:
