Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-2428

EMFParser loops forever with corrupted files

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.15, 1.16
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      EMFParser hangs with the attached corrupted EMF files.

      Sorry Tim Allison! Just now having time to test against our forensic test corpus...

      1. Carved-1285676.emf
        10 kB
        Luis Filipe Nassif
      2. Carved-1296288.emf
        5 kB
        Luis Filipe Nassif
      3. Carved-912866.emf
        7 kB
        Luis Filipe Nassif

        Issue Links

          Activity

          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          Corrupted files

          Show
          lfcnassif Luis Filipe Nassif added a comment - Corrupted files
          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          Seems like the issue is at POI level. Threads are stuck at:

          java.lang.Thread.State: RUNNABLE
                  at java.io.FileInputStream.skip(Native Method)
                  at java.io.BufferedInputStream.skip(Unknown Source)
                  - locked <0x0000000717f30ac0> (a java.io.BufferedInputStream)
                  at org.apache.tika.io.ProxyInputStream.skip(ProxyInputStream.java:117)
                  at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:655)
                  at java.io.FilterInputStream.skip(Unknown Source)
                  at org.apache.poi.util.IOUtils.skipFully(IOUtils.java:364)
                  at org.apache.poi.hemf.record.UnimplementedHemfRecord.init(UnimplementedHemfRecord.java:43)
                  at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator._next(HemfExtractor.java:101)
                  at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:77)
                  at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:60)
                  at org.apache.tika.parser.microsoft.EMFParser.parse(EMFParser.java:82)
                  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
                  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
                  at dpf.sp.gpinf.indexer.parsers.IndexerDefaultParser.parse(IndexerDefaultParser.java:150)
                  at dpf.sp.gpinf.indexer.io.ParsingReader$ParsingTask.run(ParsingReader.java:263)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
                  at java.util.concurrent.FutureTask.run(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
                  at java.lang.Thread.run(Unknown Source)
          
          
          Show
          lfcnassif Luis Filipe Nassif added a comment - Seems like the issue is at POI level. Threads are stuck at: java.lang. Thread .State: RUNNABLE at java.io.FileInputStream.skip(Native Method) at java.io.BufferedInputStream.skip(Unknown Source) - locked <0x0000000717f30ac0> (a java.io.BufferedInputStream) at org.apache.tika.io.ProxyInputStream.skip(ProxyInputStream.java:117) at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:655) at java.io.FilterInputStream.skip(Unknown Source) at org.apache.poi.util.IOUtils.skipFully(IOUtils.java:364) at org.apache.poi.hemf.record.UnimplementedHemfRecord.init(UnimplementedHemfRecord.java:43) at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator._next(HemfExtractor.java:101) at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:77) at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:60) at org.apache.tika.parser.microsoft.EMFParser.parse(EMFParser.java:82) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at dpf.sp.gpinf.indexer.parsers.IndexerDefaultParser.parse(IndexerDefaultParser.java:150) at dpf.sp.gpinf.indexer.io.ParsingReader$ParsingTask.run(ParsingReader.java:263) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang. Thread .run(Unknown Source)
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Thank you, Luis Filipe Nassif, for reporting this and finding the cause.

          From the Javadocs for FileInputStream:

          This method may skip more bytes than are remaining in the backing file. This produces no exception and the number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file. Attempting to read from the stream after skipping past the end will result in -1 indicating the end of the file.

          From the Javadocs for InputStream:

          The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned.

          If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to guava. And it looks like commons-io in IOUtils abandons skip altogether in favor of read in its skipFully.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited Thank you, Luis Filipe Nassif , for reporting this and finding the cause. From the Javadocs for FileInputStream: This method may skip more bytes than are remaining in the backing file. This produces no exception and the number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file. Attempting to read from the stream after skipping past the end will result in -1 indicating the end of the file. From the Javadocs for InputStream: The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned. If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to guava . And it looks like commons-io in IOUtils abandons skip altogether in favor of read in its skipFully .
          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          I just put the stacktrace, you found the cause.

          But I understood it can skip more than are remaining in the source, but no more than was requested, right?

          Show
          lfcnassif Luis Filipe Nassif added a comment - I just put the stacktrace, you found the cause. But I understood it can skip more than are remaining in the source, but no more than was requested, right?
          Hide
          tallison@mitre.org Tim Allison added a comment -

          I wonder why I didn't see this in our common crawl/govdocs1 corpus? When you process EMF, are those literally carved as standalone files or are they part of carved doc/ppt/xls?

          Show
          tallison@mitre.org Tim Allison added a comment - I wonder why I didn't see this in our common crawl/govdocs1 corpus? When you process EMF, are those literally carved as standalone files or are they part of carved doc/ppt/xls?
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          But I understood it can skip more than are remaining in the source, but no more than was requested, right?

          In one of the attached files, the first bad loop has requested == got (both 4,294,902,047). In the second time in the loop (and every one thereafter), requested == 4,294,902,047, got == 4,294,967,296.

          Ugh. Overflow...total needs to be long.

          With more testing, it is possible for FileInputStream to allege that it skipped 20k bytes on a 10k byte file as the javadocs warn.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited But I understood it can skip more than are remaining in the source, but no more than was requested, right? In one of the attached files, the first bad loop has requested == got (both 4,294,902,047). In the second time in the loop (and every one thereafter), requested == 4,294,902,047, got == 4,294,967,296. Ugh. Overflow... total needs to be long. With more testing, it is possible for FileInputStream to allege that it skipped 20k bytes on a 10k byte file as the javadocs warn.
          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          Strange, I don't think the javadocs allow that. Maybe there is an issue with IOUtils.skipFully() or TikaInputStream.skip()?

          Those emf files are deleted files recovered from one of our test images. Our algorithm for recovering deleted files often recovers corruped ones (partially overwritten by OS), so we have a lot of them!

          Show
          lfcnassif Luis Filipe Nassif added a comment - Strange, I don't think the javadocs allow that. Maybe there is an issue with IOUtils.skipFully() or TikaInputStream.skip()? Those emf files are deleted files recovered from one of our test images. Our algorithm for recovering deleted files often recovers corruped ones (partially overwritten by OS), so we have a lot of them!
          Show
          tallison@mitre.org Tim Allison added a comment - https://bz.apache.org/bugzilla/show_bug.cgi?id=61294
          Hide
          tallison@mitre.org Tim Allison added a comment -

          I don't think the javadocs allow that.

          I think the javadocs warn about this for FileInputStream with the following, but I think the implementation is, um, less than ideal and in conflict with the behavior we'd expect from the javadocs for InputStream.

          number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file

          This test passes for me:

              @Test
              public void testFalseAllegationFromFileInputStream() throws IOException {
                  File tmp = File.createTempFile("poi", "");
                  FileOutputStream fos = new FileOutputStream(tmp);
                  for (int i = 0; i < 10000; i++) {
                      fos.write(2);
                  }
                  fos.flush();
                  fos.close();
                  assertEquals(10000, tmp.length());
          
                  InputStream is = new FileInputStream(tmp);
                  assertEquals(20000, is.skip(20000));
                  is.close();
                  tmp.delete();
              }
          
          Show
          tallison@mitre.org Tim Allison added a comment - I don't think the javadocs allow that. I think the javadocs warn about this for FileInputStream with the following, but I think the implementation is, um, less than ideal and in conflict with the behavior we'd expect from the javadocs for InputStream. number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file This test passes for me: @Test public void testFalseAllegationFromFileInputStream() throws IOException { File tmp = File.createTempFile("poi", ""); FileOutputStream fos = new FileOutputStream(tmp); for (int i = 0; i < 10000; i++) { fos.write(2); } fos.flush(); fos.close(); assertEquals(10000, tmp.length()); InputStream is = new FileInputStream(tmp); assertEquals(20000, is.skip(20000)); is.close(); tmp.delete(); }
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Maybe there is an issue with IOUtils.skipFully()

          Y, completely. We need to defend against FileInputStream's potentially incorrect allegations, and we need to defend against an InputStream returning 0, which can mean either that it hit the end of the InputStream or it just didn't skip anything for this particular call.

          So, y, my implementation of skipFully in POI is at fault here, and I need to fix it.

          Show
          tallison@mitre.org Tim Allison added a comment - Maybe there is an issue with IOUtils.skipFully() Y, completely. We need to defend against FileInputStream's potentially incorrect allegations, and we need to defend against an InputStream returning 0, which can mean either that it hit the end of the InputStream or it just didn't skip anything for this particular call. So, y, my implementation of skipFully in POI is at fault here, and I need to fix it.
          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to guava

          Let me clarify my comment, I mean if 20,000 bytes are requested to be skiped in a file with 10,000, it can return more than 10,000 (reproduced by your test), but no more than 20,000.

          Show
          lfcnassif Luis Filipe Nassif added a comment - If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to guava Let me clarify my comment, I mean if 20,000 bytes are requested to be skiped in a file with 10,000, it can return more than 10,000 (reproduced by your test), but no more than 20,000.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Sorry. I misunderstood. Right. That's my belief.

          Show
          tallison@mitre.org Tim Allison added a comment - Sorry. I misunderstood. Right. That's my belief.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Our algorithm for recovering deleted files often recovers corruped ones (partially overwritten by OS)

          I was toying with the notion of a "smoke-test" level set of tests that would randomly permute some of the bytes within our test files to see if we could trigger this kind of thing. Sounds like you have a use case for us to do this...

          Show
          tallison@mitre.org Tim Allison added a comment - Our algorithm for recovering deleted files often recovers corruped ones (partially overwritten by OS) I was toying with the notion of a "smoke-test" level set of tests that would randomly permute some of the bytes within our test files to see if we could trigger this kind of thing. Sounds like you have a use case for us to do this...
          Hide
          lfcnassif Luis Filipe Nassif added a comment -

          That would be very nice!

          Show
          lfcnassif Luis Filipe Nassif added a comment - That would be very nice!
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited
          Show
          tallison@mitre.org Tim Allison added a comment - - edited https://bz.apache.org/bugzilla/show_bug.cgi?id=61295 https://bz.apache.org/bugzilla/show_bug.cgi?id=61300 I suspect quite a few more will come out of the woodwork... See TIKA-2430 .

            People

            • Assignee:
              Unassigned
              Reporter:
              lfcnassif Luis Filipe Nassif
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Development