Tika
  1. Tika
  2. TIKA-645

Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None

      Description

      Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.

      However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

        Issue Links

          Activity

          Hide
          Nick Burch added a comment -

          One solution that springs to mind is to place the hasFile and getFile methods onto an interface. TikaInputStream, TaggedInputStream and CountingInputStream could then all implement this. That way, if the underlying stream is a TikaInputStream, the parser can still find out and grab the file. If it isn't, then nothing changes.

          Show
          Nick Burch added a comment - One solution that springs to mind is to place the hasFile and getFile methods onto an interface. TikaInputStream, TaggedInputStream and CountingInputStream could then all implement this. That way, if the underlying stream is a TikaInputStream, the parser can still find out and grab the file. If it isn't, then nothing changes.
          Hide
          Jukka Zitting added a comment -

          The InputStream passed to the parse() method is still TikaInputStream, isn't it? The parser should then already have access to the underlying file, even if it at some point wants to use wrappers with the given stream.

          Show
          Jukka Zitting added a comment - The InputStream passed to the parse() method is still TikaInputStream, isn't it? The parser should then already have access to the underlying file, even if it at some point wants to use wrappers with the given stream.
          Hide
          Nick Burch added a comment -

          Nope, when the parser is called from the CLI it gets a TaggedInputStream (wrapping a CountingInputStream wrapping a TikaInputStream)

          Show
          Nick Burch added a comment - Nope, when the parser is called from the CLI it gets a TaggedInputStream (wrapping a CountingInputStream wrapping a TikaInputStream)
          Hide
          Jukka Zitting added a comment -

          In revision 1104472 I made TikaInputStream extend TaggedInputStream which allows us to avoid one of these levels of wrapping (and also makes it possible to separate exceptions from the original stream from those caused by temporary file handling, etc.).

          I'll look at what we can do about the CountingInputStream wrapper.

          Show
          Jukka Zitting added a comment - In revision 1104472 I made TikaInputStream extend TaggedInputStream which allows us to avoid one of these levels of wrapping (and also makes it possible to separate exceptions from the original stream from those caused by temporary file handling, etc.). I'll look at what we can do about the CountingInputStream wrapper.
          Hide
          Jukka Zitting added a comment -

          And in revision 1124385 I modified SecureContentHandler to use TikaInputStream instead of CountingInputStream, which avoids the other stream wrapper.

          Now a TikaInputStream given to AutoDetectParser will get passed as-is all the way down to the concrete Parser implementation, so I'm resolving this as fixed.

          Note that this fix required a minor backwards-compatibility break in the SecureContentHandler class, but this shouldn't be a big issue since I don't think the class is widely used outside AutoDetectParser.

          Show
          Jukka Zitting added a comment - And in revision 1124385 I modified SecureContentHandler to use TikaInputStream instead of CountingInputStream, which avoids the other stream wrapper. Now a TikaInputStream given to AutoDetectParser will get passed as-is all the way down to the concrete Parser implementation, so I'm resolving this as fixed. Note that this fix required a minor backwards-compatibility break in the SecureContentHandler class, but this shouldn't be a big issue since I don't think the class is widely used outside AutoDetectParser.
          Hide
          Nick Burch added a comment -

          I've just gone to test this with a 5mb excel file. With 0.9, anything smaller than "-Xms21m -Xmx21m" gives an out of memory

          With the latest trunk, something else breaks though:

          java -Xms21m -Xmx21m -jar ~/java/apache-tika/tika-app/target/tika-app-1.0-SNAPSHOT.jar --metadata ex45698-22488.xls
          Exception in thread "main" org.apache.tika.exception.TikaException: Zip bomb detected!
          at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:128)
          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:132)
          at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
          at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:340)
          at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
          Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 0 input bytes produced 1000020 output characters
          at org.apache.tika.sax.SecureContentHandler.advance(SecureContentHandler.java:144)

          Looks like something else needs to be updated too?

          Show
          Nick Burch added a comment - I've just gone to test this with a 5mb excel file. With 0.9, anything smaller than "-Xms21m -Xmx21m" gives an out of memory With the latest trunk, something else breaks though: java -Xms21m -Xmx21m -jar ~/java/apache-tika/tika-app/target/tika-app-1.0-SNAPSHOT.jar --metadata ex45698-22488.xls Exception in thread "main" org.apache.tika.exception.TikaException: Zip bomb detected! at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:128) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:132) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:340) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97) Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 0 input bytes produced 1000020 output characters at org.apache.tika.sax.SecureContentHandler.advance(SecureContentHandler.java:144) Looks like something else needs to be updated too?
          Hide
          Jukka Zitting added a comment -

          Hmm, yes. I'll look into this.

          Show
          Jukka Zitting added a comment - Hmm, yes. I'll look into this.
          Hide
          Jukka Zitting added a comment -

          The problem was that when the parser was using TikaInputStream.getFile(), no bytes were recorded as being read from the stream and the SecureContentHandler couldn't figure out where the all the output is coming from.

          In revision 1124788 I changed the logic a bit so that when the stream is based on a file, the SecureContentHandler class looks at the total size of the input file instead of the number of bytes read from the input stream.

          Show
          Jukka Zitting added a comment - The problem was that when the parser was using TikaInputStream.getFile(), no bytes were recorded as being read from the stream and the SecureContentHandler couldn't figure out where the all the output is coming from. In revision 1124788 I changed the logic a bit so that when the stream is based on a file, the SecureContentHandler class looks at the total size of the input file instead of the number of bytes read from the input stream.
          Hide
          Nick Burch added a comment -

          Thanks. With this change in, tika-app from svn trunk can parse the same 5mb excel file with "-Xms15m -Xmx15m", so the use of NPOIFS and access to the file has saved us a bit over the size of the office file in memory, which is good news!

          Show
          Nick Burch added a comment - Thanks. With this change in, tika-app from svn trunk can parse the same 5mb excel file with "-Xms15m -Xmx15m", so the use of NPOIFS and access to the file has saved us a bit over the size of the office file in memory, which is good news!

            People

            • Assignee:
              Jukka Zitting
              Reporter:
              Nick Burch
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development