Tika / TIKA-815

Tika parsers should handle failures more gracefully

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      We encountered an OOM while parsing a Word document. We will report the failure to POI.

      This raises the question of the general robustness of the parsers.

      We've written a small test tool that reproduces the aforementioned OOM and other potential issues, which will be reported to the individual parsers. It's the responsibility of the parsers to handle those failures gracefully.

      Yet it's easy to write generic tools at the Tika level to run this kind of test.

      So we are also submitting this issue here to start a discussion on what role Tika should have when it comes to validating its parsers.

      Code here: https://github.com/lacostej/tika-hardener
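
      As a rough illustration of the kind of generic check described above, here is a minimal sketch. It is not the tika-hardener code: `DocumentParser`, `ParserHardener`, `Failure` and `checkTruncations` are hypothetical names invented for this example, and a real harness would presumably wrap an actual Tika parser instead of the stand-in interface. The idea is to feed every truncated prefix of a document to a parser and record anything that escapes other than the declared checked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parser entry point; this is not a Tika interface. */
interface DocumentParser {
    void parse(InputStream in) throws IOException;
}

/** Feeds truncated copies of a document to a parser and records ungraceful failures. */
final class ParserHardener {

    /** One ungraceful failure: the prefix length that triggered it and what was thrown. */
    static final class Failure {
        final int length;
        final Throwable cause;

        Failure(int length, Throwable cause) {
            this.length = length;
            this.cause = cause;
        }
    }

    /**
     * Parses every prefix of the document, from 0 bytes up to the full content.
     * An IOException counts as graceful handling; anything else is recorded.
     */
    static List<Failure> checkTruncations(DocumentParser parser, byte[] document) {
        List<Failure> failures = new ArrayList<>();
        for (int length = 0; length <= document.length; length++) {
            try (InputStream in = new ByteArrayInputStream(document, 0, length)) {
                parser.parse(in);
            } catch (IOException expected) {
                // Graceful: the parser reported the bad input through its declared exception.
            } catch (Throwable t) {
                // Ungraceful: RuntimeException, OutOfMemoryError, StackOverflowError, ...
                failures.add(new Failure(length, t));
            }
        }
        return failures;
    }
}
```

      Catching `Throwable` is deliberate in this sketch, since the failure that motivated this issue was an `OutOfMemoryError` rather than an ordinary exception. Truncation is only one possible mutation; the same loop could be adapted to flip bytes or inject oversized length fields.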

        Activity

        Chris A. Mattmann added a comment -

        Hi Jerome: what do you think about contributing the Tika hardener? I'm +1 to Jukka's suggestion on that and we'd love to have you helping out and appreciate your contributions so far!

        Jukka Zitting made changes -

        Field        Original Value    New Value
        Status       Open [ 1 ]        Resolved [ 5 ]
        Resolution                     Duplicate [ 3 ]

        Jukka Zitting added a comment -

        Resolving this as a duplicate of all the followup issues mentioned above.

        Jukka Zitting added a comment -

        Would you be interested in contributing the tika-hardener codebase to Tika itself? It would make a great addition to our existing test suite.

        Jerome Lacoste added a comment -

        > For people with strong stability requirements, we provide the ForkParser

        To get the ForkParser to work, I've had to make 6 patches... And I haven't yet stress tested it. That makes me wary of using it in production!

        Please fix TIKA-808, TIKA-827 (optional), TIKA-828, TIKA-829, TIKA-830, TIKA-831 in that order.

        0001-TIKA-808-tika-doesn-t-parse-PDF-file.-The-issue-is-c.patch
        0002-TIKA-827-try-to-report-something-if-the-exception-is.patch (optional)
        0003-TIKA-828-make-sure-the-exceptions-thrown-by-TaggedIn.patch
        0004-TIKA-829-make-sure-tika-identifies-invalid-arguments.patch
        0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch
        0006-TIKA-830-refactor-tests-for-clarity.patch
        0007-TIKA-831fix-for-errors-not-being-reported-properly.patch

        Thanks

        Nick Burch added a comment -

        For people with strong stability requirements, we provide the ForkParser

        For everyone else, we suggest they report bugs when they hit issues, and ideally help work with us + the upstream libraries to fix things

        Jerome Lacoste added a comment -

        Agreed. Yet improving the default parsers might still be a good idea.

        Note: I haven't managed to use the ForkParser yet, so I will mail the user list.

        Nick Burch added a comment -

        FYI, Tika does provide the ForkParser for cases where you want to ensure that parsing can't affect the parent application

        Jerome Lacoste created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Jerome Lacoste
          • Votes:
            0
            Watchers:
            2
