Tika
  1. Tika
  2. TIKA-1074

Extraction should continue if an exception is hit visiting an embedded document

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: parser
    • Labels:
      None

      Description

      Spinoff from TIKA-1072.

      In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents.

      If I change Tika to suppress that exception, the rest of the document extracts fine.

      So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...

      1. TIKA-1074.patch
        10 kB
        Michael McCandless
      2. TIKA-1074.patch
        3 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        TIKA-1079 is another example where if we recorded/logged an exc and moved on we could have parsed the rest of the document ...

        Show
        Michael McCandless added a comment - TIKA-1079 is another example where if we recorded/logged an exc and moved on we could have parsed the rest of the document ...
        Hide
        Michael McCandless added a comment -

        Patch, just logging a warning and continuing, if we hit the exceptions in TIKA-1072, TIKA-1078 or TIKA-1079. I think it's ready.

        Show
        Michael McCandless added a comment - Patch, just logging a warning and continuing, if we hit the exceptions in TIKA-1072 , TIKA-1078 or TIKA-1079 . I think it's ready.
        Hide
        Uwe Schindler added a comment -

        Hi, catching Throwable is bad practice, it should better be Exception only. But donÄt forget to restore interrupt status when catching InterruptedException!

        Show
        Uwe Schindler added a comment - Hi, catching Throwable is bad practice, it should better be Exception only. But donÄt forget to restore interrupt status when catching InterruptedException!
        Hide
        Michael McCandless added a comment -

        Thanks Uwe, I'll change to catching Exception not Throwable, and restoring the interrupt bit for InterruptedException.

        Show
        Michael McCandless added a comment - Thanks Uwe, I'll change to catching Exception not Throwable, and restoring the interrupt bit for InterruptedException.
        Hide
        Michael McCandless added a comment -

        Patch, catching Exception not Throwable, and restoring the interrupt bit if the exc was InterruptedException.

        Show
        Michael McCandless added a comment - Patch, catching Exception not Throwable, and restoring the interrupt bit if the exc was InterruptedException.
        Hide
        Jukka Zitting added a comment -

        If we get an InterruptedException, then we shouldn't just log it and continue. I'd wrap it to a TikaException and re-throw.

        Also instead of:

        } catch (Exception e) {
          ...
          if (e instanceof SomeException) {
              ...
          }
        }
        

        a better pattern might be:

        } catch (SomeException e) {
          ...
        } catch (Exception e) {
          ...
        }
        
        Show
        Jukka Zitting added a comment - If we get an InterruptedException, then we shouldn't just log it and continue. I'd wrap it to a TikaException and re-throw. Also instead of: } catch (Exception e) { ... if (e instanceof SomeException) { ... } } a better pattern might be: } catch (SomeException e) { ... } catch (Exception e) { ... }
        Hide
        Michael McCandless added a comment -

        Thanks Jukka.

        InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry).

        So, the instanceof check for IE is in case in the future we do handle interrupts in these places ... we could just remove it and add it back in the future if we add IE (seems risky).

        Or I can change that code to throw TikaException instead on interrupt (and restore the interrupt bit), except in the TikaCLI case, EmbeddedDocumentExtractor.parseEmbedded doesn't throw TikaException today (the other two places already do). But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

        Show
        Michael McCandless added a comment - Thanks Jukka. InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry). So, the instanceof check for IE is in case in the future we do handle interrupts in these places ... we could just remove it and add it back in the future if we add IE (seems risky). Or I can change that code to throw TikaException instead on interrupt (and restore the interrupt bit), except in the TikaCLI case, EmbeddedDocumentExtractor.parseEmbedded doesn't throw TikaException today (the other two places already do). But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?
        Hide
        Ray Gauss II added a comment -

        But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

        That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

        Show
        Ray Gauss II added a comment - But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE? That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.
        Hide
        Jukka Zitting added a comment -

        InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry).

        It's a checked exception, so if it isn't declared to be thrown by POI, it shouldn't get thrown here (even though the VM doesn't strictly prohibit that). So in that case the extra check shouldn't even be needed.

        I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

        I don't really like this approach. We're essentially saying: "Yes, you asked me to stop what I'm doing, but instead I'll just finish up what I was doing and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to stop immediately, either by letting the IE bubble up or (where necessary) by throwing a TikaException that wraps the IE.

        Show
        Jukka Zitting added a comment - InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry). It's a checked exception, so if it isn't declared to be thrown by POI, it shouldn't get thrown here (even though the VM doesn't strictly prohibit that). So in that case the extra check shouldn't even be needed. I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE? I don't really like this approach. We're essentially saying: "Yes, you asked me to stop what I'm doing, but instead I'll just finish up what I was doing and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to stop immediately, either by letting the IE bubble up or (where necessary) by throwing a TikaException that wraps the IE.
        Hide
        Michael McCandless added a comment -

        InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry).

        It's a checked exception, so if it isn't declared to be thrown by POI, it shouldn't get thrown here (even though the VM doesn't strictly prohibit that).

        Exactly: I'm trying to future proof.

        So in that case the extra check shouldn't even be needed.

        Wait, do you mean I should remove the handling entirely (not bother future proofing)?

        I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

        I don't really like this approach. We're essentially saying: "Yes, you asked me to stop what I'm doing, but instead I'll just finish up what I was doing and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to stop immediately, either by letting the IE bubble up or (where necessary) by throwing a TikaException that wraps the IE.

        OK, maybe we can throw TikaException today (and set the interrupt
        bit), and then in the future (if/when these places really do throw
        IE), we can change this to throwing a IE instead of TikaException. I
        can put that as a TODO.

        Show
        Michael McCandless added a comment - InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry). It's a checked exception, so if it isn't declared to be thrown by POI, it shouldn't get thrown here (even though the VM doesn't strictly prohibit that). Exactly: I'm trying to future proof. So in that case the extra check shouldn't even be needed. Wait, do you mean I should remove the handling entirely (not bother future proofing)? I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE? I don't really like this approach. We're essentially saying: "Yes, you asked me to stop what I'm doing, but instead I'll just finish up what I was doing and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to stop immediately, either by letting the IE bubble up or (where necessary) by throwing a TikaException that wraps the IE. OK, maybe we can throw TikaException today ( and set the interrupt bit), and then in the future (if/when these places really do throw IE), we can change this to throwing a IE instead of TikaException. I can put that as a TODO.
        Hide
        Jukka Zitting added a comment -

        Wait, do you mean I should remove the handling entirely (not bother future proofing)?

        If POI decides to declare IE (or just generic Exception) as thrown by their API, it'll break binary compatibility, and thus in any case we'll need to adjust our code. So adding future proofing code here doesn't win us anything, it just complicates the codebase for no gain.

        Show
        Jukka Zitting added a comment - Wait, do you mean I should remove the handling entirely (not bother future proofing)? If POI decides to declare IE (or just generic Exception) as thrown by their API, it'll break binary compatibility, and thus in any case we'll need to adjust our code. So adding future proofing code here doesn't win us anything, it just complicates the codebase for no gain.
        Hide
        Michael McCandless added a comment -

        OK I'll remove the future proofing.

        Show
        Michael McCandless added a comment - OK I'll remove the future proofing.
        Hide
        Luis Filipe Nassif added a comment -

        I do not like this improvement. Currently I am setting ParseContext with a custom AutoDetectParser that, when an exception is hit, e.g. visiting an embedded, catches the exception, logs it AND extracts raw/binary strings from the problematic doc (or embedded). My app needs to extract text even from corrupt documents. With this "improvement" I will not know the problematic embedded when it is the best time to do something with it. I prefer to receive the exception when it is thrown and work around myself.

        Show
        Luis Filipe Nassif added a comment - I do not like this improvement. Currently I am setting ParseContext with a custom AutoDetectParser that, when an exception is hit, e.g. visiting an embedded, catches the exception, logs it AND extracts raw/binary strings from the problematic doc (or embedded). My app needs to extract text even from corrupt documents. With this "improvement" I will not know the problematic embedded when it is the best time to do something with it. I prefer to receive the exception when it is thrown and work around myself.
        Hide
        Michael McCandless added a comment -

        My app needs to extract text even from corrupt documents.

        That's exactly the intent here as well.

        Currently I am setting ParseContext with a custom AutoDetectParser that, when an exception is hit, e.g. visiting an embedded, catches the exception, logs it AND extracts raw/binary strings from the problematic doc (or embedded)

        Wait, the exceptions that this change now catches & logs is in the
        decoding an OLE10 embedded entry (into its byte[] data), not in
        actually parsing of the resulting byte[] data. If the exception is
        hit later when we recurse into parseEmbedded, the exception is still
        thrown as before, so your custom AutoDetectParser will still
        see/handle the exception.

        But I think this is separately a good idea (an AutoDetectParser
        logging & continuing by default): is this something you could possibly
        contribute...?

        Do you have an example corrupted document? We could test before/after
        this change and see.

        Show
        Michael McCandless added a comment - My app needs to extract text even from corrupt documents. That's exactly the intent here as well. Currently I am setting ParseContext with a custom AutoDetectParser that, when an exception is hit, e.g. visiting an embedded, catches the exception, logs it AND extracts raw/binary strings from the problematic doc (or embedded) Wait, the exceptions that this change now catches & logs is in the decoding an OLE10 embedded entry (into its byte[] data), not in actually parsing of the resulting byte[] data. If the exception is hit later when we recurse into parseEmbedded, the exception is still thrown as before, so your custom AutoDetectParser will still see/handle the exception. But I think this is separately a good idea (an AutoDetectParser logging & continuing by default): is this something you could possibly contribute...? Do you have an example corrupted document? We could test before/after this change and see.
        Hide
        Luis Filipe Nassif added a comment -

        Wait, the exceptions that this change now catches & logs is in the decoding an OLE10 embedded entry (into its byte[] data), not in actually parsing of the resulting byte[] data. If the exception is hit later when we recurse into parseEmbedded, the exception is still thrown as before, so your custom AutoDetectParser will still see/handle the exception.

        Hum you are right, I will still see exceptions from embedded docs. And this will improve parsing of the container.

        But I think this is separately a good idea (an AutoDetectParser logging & continuing by default): is this something you could possibly contribute...?

        I would like to, but I do not think my code has good quality. I think the meaning of "continuing" is application specific. My app has a Raw/Binary StringParser that uses heuristics to extract mixed ISO-8859-1, UTF-8 and UTF-16 strings from unknown files. It is the fallBackParser and it is also called when some exception is thrown by a corrupted doc. I could upload both, but they need a lot of enhacements.

        Do you have an example corrupted document? We could test before/after this change and see.

        Not of the kind you have, but now i see the parsing will be better after this change.

        Show
        Luis Filipe Nassif added a comment - Wait, the exceptions that this change now catches & logs is in the decoding an OLE10 embedded entry (into its byte[] data), not in actually parsing of the resulting byte[] data. If the exception is hit later when we recurse into parseEmbedded, the exception is still thrown as before, so your custom AutoDetectParser will still see/handle the exception. Hum you are right, I will still see exceptions from embedded docs. And this will improve parsing of the container. But I think this is separately a good idea (an AutoDetectParser logging & continuing by default): is this something you could possibly contribute...? I would like to, but I do not think my code has good quality. I think the meaning of "continuing" is application specific. My app has a Raw/Binary StringParser that uses heuristics to extract mixed ISO-8859-1, UTF-8 and UTF-16 strings from unknown files. It is the fallBackParser and it is also called when some exception is thrown by a corrupted doc. I could upload both, but they need a lot of enhacements. Do you have an example corrupted document? We could test before/after this change and see. Not of the kind you have, but now i see the parsing will be better after this change.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development