Tika
  1. Tika
  2. TIKA-830

Tika.parseToString() causes ForkParser to try to serialize itself

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None

      Activity

      Hide
      Nick Burch added a comment -

      I believe this was fixed in Tika 1.1 so I'm marking it as fixed, please re-open though if it's still an issue.

      Show
      Nick Burch added a comment - I believe this was fixed in Tika 1.1 so I'm marking it as fixed, please re-open though if it's still an issue.
      Hide
      Jukka Zitting added a comment -

      Excellent, thanks Nick!

      Show
      Jukka Zitting added a comment - Excellent, thanks Nick!
      Hide
      Nick Burch added a comment -

      I think the ForkParser instanceof check is a good validation to have, with a helpful error

      In r1225448 I've updated ForkClient / ForkParser to have a more helpful message if we can't serialize any of the Parser.parse arguments (eg the ParseContext), along with a unit test

      Show
      Nick Burch added a comment - I think the ForkParser instanceof check is a good validation to have, with a helpful error In r1225448 I've updated ForkClient / ForkParser to have a more helpful message if we can't serialize any of the Parser.parse arguments (eg the ParseContext), along with a unit test
      Hide
      Jukka Zitting added a comment -

      The problem here is the basic assumption that the Tika facade class makes about how the configured parser will use the instance passed in the ParseContext.

      By default (and before we added the constructor that allows a custom parser to be given) the Tika facade will construct and use an AutoDetectParser based on all the available and/or configured format-specific parsers. Format-specific parsers that support embedded documents expect the ParseContext to contain a parser instance that they can delegate parsing tasks to, so to support parsing of embedded documents the Tika facade passes the configured parser instance through the ParseContext.

      The ForkParser on the other hand assumes that anything in the ParseContext is serializable so that it can be sent to the forked JVM process for use from there. Passing a ForkParser instance to the forked JVM like through the ParseContext could easily trigger a recursion of new JVM forks being created, which is why the ForkParser by design is not serializable.

      I agree with Nick that the resulting error message could certainly be better, but I don't it's a good idea to change the basic design of either ForkParser or the Tika facade class in this respect.

      If we want the Tika facade class to support forked parsing, I think it would be better to add a separate flag for that to explicitly make the facade class create and use a ForkParser instance based on the configured normal Parser instance. However, the ForkParser is a pretty complex tool that practically always needs custom configuration (java command, memory limits, class loader, etc.), which is why I don't think we should expose it through the Tika facade that's mostly designed for simpler use cases.

      PS. Instead of the instanceof check we now have in ForkParser (thanks for that, BTW!), it might be a better idea to check for errors from trying to serialize the ParseContext. That'll capture a muhc wider range of cases where a ForkParser instance or some other non-serializable resource is being passed to a forked JVM.

      Show
      Jukka Zitting added a comment - The problem here is the basic assumption that the Tika facade class makes about how the configured parser will use the instance passed in the ParseContext. By default (and before we added the constructor that allows a custom parser to be given) the Tika facade will construct and use an AutoDetectParser based on all the available and/or configured format-specific parsers. Format-specific parsers that support embedded documents expect the ParseContext to contain a parser instance that they can delegate parsing tasks to, so to support parsing of embedded documents the Tika facade passes the configured parser instance through the ParseContext. The ForkParser on the other hand assumes that anything in the ParseContext is serializable so that it can be sent to the forked JVM process for use from there. Passing a ForkParser instance to the forked JVM like through the ParseContext could easily trigger a recursion of new JVM forks being created, which is why the ForkParser by design is not serializable. I agree with Nick that the resulting error message could certainly be better, but I don't it's a good idea to change the basic design of either ForkParser or the Tika facade class in this respect. If we want the Tika facade class to support forked parsing, I think it would be better to add a separate flag for that to explicitly make the facade class create and use a ForkParser instance based on the configured normal Parser instance. However, the ForkParser is a pretty complex tool that practically always needs custom configuration (java command, memory limits, class loader, etc.), which is why I don't think we should expose it through the Tika facade that's mostly designed for simpler use cases. PS. Instead of the instanceof check we now have in ForkParser (thanks for that, BTW!), it might be a better idea to check for errors from trying to serialize the ParseContext. That'll capture a muhc wider range of cases where a ForkParser instance or some other non-serializable resource is being passed to a forked JVM.
      Hide
      Jerome Lacoste added a comment -

      From a desing PoV I find it strange that a special parser isn't supported. In order to write integration tests for either fork parsing or non fork parsing I have to duplicate a big part of the Tika class.

      Another fix I had in mind was to modify Tika#parseToString() to do something like

        context.set(Parser.class, parser.getUnderlyingParser();
      

      with parser.getUnderlyingParser() returning this in AbstractParser, but overriden to return the underlying parser in ForkParser.

      Does that make more sense than the ForkParser modifying the ParseContext ?

      Show
      Jerome Lacoste added a comment - From a desing PoV I find it strange that a special parser isn't supported. In order to write integration tests for either fork parsing or non fork parsing I have to duplicate a big part of the Tika class. Another fix I had in mind was to modify Tika#parseToString() to do something like context.set(Parser.class, parser.getUnderlyingParser(); with parser.getUnderlyingParser() returning this in AbstractParser, but overriden to return the underlying parser in ForkParser. Does that make more sense than the ForkParser modifying the ParseContext ?
      Hide
      Nick Burch added a comment -

      If we're not going to support the ForkParser like this, maybe we should add a check with a helpful error message? (rather than breaking in a slightly obscure manner as now)

      Show
      Nick Burch added a comment - If we're not going to support the ForkParser like this, maybe we should add a check with a helpful error message? (rather than breaking in a slightly obscure manner as now)
      Hide
      Jukka Zitting added a comment -

      I'm not sure if we should try to support passing a ForkParser as an argument to the Tika constructor.
      The ForkParser expects the "real" parser instance to be passed in the ParseContext, which isn't currently supported by the Tika facade.

      Instead of adding workarounds like this, I'd rather consider whether we want to extend the Tika facade to explicitly support forked parsing as a special mode. Personally I'd rather recommend people who need the ForkParser functionality to use the class directly instead of through the Tika facade (that's intended to be just a simple entry point for basic Tika functionality).

      Show
      Jukka Zitting added a comment - I'm not sure if we should try to support passing a ForkParser as an argument to the Tika constructor. The ForkParser expects the "real" parser instance to be passed in the ParseContext, which isn't currently supported by the Tika facade. Instead of adding workarounds like this, I'd rather consider whether we want to extend the Tika facade to explicitly support forked parsing as a special mode. Personally I'd rather recommend people who need the ForkParser functionality to use the class directly instead of through the Tika facade (that's intended to be just a simple entry point for basic Tika functionality).
      Hide
      Jerome Lacoste added a comment -

      Again the patches, with the proper license grant,

      Show
      Jerome Lacoste added a comment - Again the patches, with the proper license grant,
      Hide
      Jerome Lacoste added a comment -

      The fix, and a refactor

      Show
      Jerome Lacoste added a comment - The fix, and a refactor

        People

        • Assignee:
          Unassigned
          Reporter:
          Jerome Lacoste
        • Votes:
          0 Vote for this issue
          Watchers:
          1 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development