Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-2384

Double close of InputStream in accept text/plain in tika-server

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: server
    • Labels:
      None

      Description

      As reported by Haris Osmanagic on the user list, TikaResource closes the InputStream twice on requests for text/plain.

        Activity

        Hide
        tallison@mitre.org Tim Allison added a comment -

        Haris is correct. The static "parse()" closes the InputStream so we shouldn't wrap the call to parse in an autoclose try(InputStream is = xyz) { TikaResource.parse(...) } Once I remove the autoclosing try, the test passes.

        Show
        tallison@mitre.org Tim Allison added a comment - Haris is correct. The static "parse()" closes the InputStream so we shouldn't wrap the call to parse in an autoclose try(InputStream is = xyz) { TikaResource.parse(...) } Once I remove the autoclosing try, the test passes.
        Hide
        tallison@mitre.org Tim Allison added a comment -

        From Luis Filipe Nassif:

        I think resources should be closed where they are opened, like parser.parse() API contract, no?

        I agree. However, we're wrapping the InputStream in a TikaInputStream within TikaResourse.parse().
        We want to make absolutely sure to close TikaInputStream in case it has created temp files, etc.

                
            public static void parse(Parser parser, Logger logger, String path, InputStream inputStream,
                                     ContentHandler handler, Metadata metadata, ParseContext parseContext) throws TikaServerParseException {
                try (TikaInputStream tikaInputStream = TikaInputStream.get(inputStream)) {
                    parser.parse(tikaInputStream, handler, metadata, parseContext);
                } catch (SAXException e) {
                    throw new TikaServerParseException(e);
                } catch (EncryptedDocumentException e) {
                    logger.warn("{}: Encrypted document", path, e);
                    throw new TikaServerParseException(e);
                } catch (Exception e) {
                    logger.warn("{}: Text extraction failed", path, e);
                    throw new TikaServerParseException(e);
                }
            }
        
        Show
        tallison@mitre.org Tim Allison added a comment - From Luis Filipe Nassif : I think resources should be closed where they are opened, like parser.parse() API contract, no? I agree. However, we're wrapping the InputStream in a TikaInputStream within TikaResourse.parse() . We want to make absolutely sure to close TikaInputStream in case it has created temp files, etc. public static void parse(Parser parser, Logger logger, String path, InputStream inputStream, ContentHandler handler, Metadata metadata, ParseContext parseContext) throws TikaServerParseException { try (TikaInputStream tikaInputStream = TikaInputStream.get(inputStream)) { parser.parse(tikaInputStream, handler, metadata, parseContext); } catch (SAXException e) { throw new TikaServerParseException(e); } catch (EncryptedDocumentException e) { logger.warn("{}: Encrypted document", path, e); throw new TikaServerParseException(e); } catch (Exception e) { logger.warn("{}: Text extraction failed", path, e); throw new TikaServerParseException(e); } }
        Hide
        tallison@mitre.org Tim Allison added a comment -

        This is actually caused by a subtle bug in the DigestingParser. If we wrap the InputStream in a TikaInputStream in the DigestingParser (instead of in CommonsDigester), everything works.

        The problem is that while CommonsDigester resets its TikaInputStream, it cannot reset the wrapped InputStream.

        Once we fix DigestingParser, we won't have to wrap the underlying stream in a TikaInputStream or close the underlying stream...because, theoretically, cxf will do that...is that right Sergey Beryozkin?

        Show
        tallison@mitre.org Tim Allison added a comment - This is actually caused by a subtle bug in the DigestingParser. If we wrap the InputStream in a TikaInputStream in the DigestingParser (instead of in CommonsDigester), everything works. The problem is that while CommonsDigester resets its TikaInputStream, it cannot reset the wrapped InputStream. Once we fix DigestingParser, we won't have to wrap the underlying stream in a TikaInputStream or close the underlying stream...because, theoretically, cxf will do that...is that right Sergey Beryozkin ?
        Hide
        sergey_beryozkin Sergey Beryozkin added a comment - - edited

        Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream (or it can avoid doing it if configured as such), which can only happen if InputStream is returned as a service method response (directly or as JAX-RS Response entity)

        Show
        sergey_beryozkin Sergey Beryozkin added a comment - - edited Hi Tim, CXF will only auto-close InputStream if its task is to copy InputStream (or it can avoid doing it if configured as such), which can only happen if InputStream is returned as a service method response (directly or as JAX-RS Response entity)
        Hide
        tallison@mitre.org Tim Allison added a comment -

        Ah, ok. Thank you. Should we go back to closing it in TikaResourse.parse() then?

        Show
        tallison@mitre.org Tim Allison added a comment - Ah, ok. Thank you. Should we go back to closing it in TikaResourse.parse() then?
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1284 (See https://builds.apache.org/job/Tika-trunk/1284/)
        TIKA-2384 - TikaResource can close an InputStream twice; this revealed a (tallison: https://github.com/apache/tika/commit/6d710da272285e04d6632b501e0752d020063182)

        • (edit) tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java
        • (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
        • (edit) tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Jenkins build Tika-trunk #1284 (See https://builds.apache.org/job/Tika-trunk/1284/ ) TIKA-2384 - TikaResource can close an InputStream twice; this revealed a (tallison: https://github.com/apache/tika/commit/6d710da272285e04d6632b501e0752d020063182 ) (edit) tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java (edit) tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java (edit) tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java
        Hide
        sergey_beryozkin Sergey Beryozkin added a comment -

        Not sure - may be they can be closed in TikaResource's StreamingOutput implementation ?

        Show
        sergey_beryozkin Sergey Beryozkin added a comment - Not sure - may be they can be closed in TikaResource's StreamingOutput implementation ?
        Hide
        lfcnassif Luis Filipe Nassif added a comment -

        To clean temp files we can use

        TemporaryResources tmp = new TemporaryResources();
        Try{
            TikaInputStream.get(inputstream, tmp);
           .... 
        }finally {
            tmp.close();
        }
        
        Show
        lfcnassif Luis Filipe Nassif added a comment - To clean temp files we can use TemporaryResources tmp = new TemporaryResources(); Try{ TikaInputStream.get(inputstream, tmp); .... } finally { tmp.close(); }
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1285 (See https://builds.apache.org/job/Tika-trunk/1285/)
        TIKA-2384 – went too far in earlier commit. We should close the (tallison: https://github.com/apache/tika/commit/f74b089162820b0e80a8217b505164edb1531ef0)

        Show
        hudson Hudson added a comment - FAILURE: Integrated in Jenkins build Tika-trunk #1285 (See https://builds.apache.org/job/Tika-trunk/1285/ ) TIKA-2384 – went too far in earlier commit. We should close the (tallison: https://github.com/apache/tika/commit/f74b089162820b0e80a8217b505164edb1531ef0 ) (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java TIKA-2384 – update changes to reflect fix. (tallison: https://github.com/apache/tika/commit/933ae1c6482312ec87b15d690b4ac3fe71833554 ) (edit) CHANGES.txt

          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development