Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1508 Add uniformity to parser parameter configuration
  3. TIKA-1986

support parser parameters with type (int, double, etc) in configuration XML file

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: config
    • Labels:
      None

      Description

      Tika Configuration should be enhanced to support for basic types like int, double, boolean, url, file.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user thammegowda opened a pull request:

          https://github.com/apache/tika/pull/123

          TIKA-1986 : support for typed parameters from XML configuration

          This is a sub-task of TIKA-1508 (please merge #91 first).

          This extends configuration file with a place for specifying parameters and their types.

          + Added a class called `Param`
          + Relying on JAXB annotations to convert between XML and Java objects
          + Test case updated to check for types

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/thammegowda/tika TIKA-1986

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/123.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #123


          commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-09T02:23:19Z

          Add uniformity to parser parameter configuration.

          1. Added Configurable interface.
          This can be used for all services like Parser, Detector which can take
          configurable parameters.

          2. Added ConfigurableParser interface which extends Parser interface.
          I didn't add new method to existing Parser because
          that will break the compatibility.

          3. AbstractParser extends ConfigurableParser and has
          default implementation for configure() contract.
          I think it is safe to do so and it doesnt break anything.
          In addition all parsers which extend AbstractParser will can easily
          access config from TikaConfig if they want to

          3. Added a TODO to TikaConfig,
          after this should allow multiple instances of same parser with
          different runtime configurations.

          4. TikaConfig is modified to detect if instance can be configured,
          if so, then checks if params are available in XML file, parses the
          params and invokes configure(ctx) method with these params

          5. Added DummyConfigurableParser that simply copies parameters to
          metadata for the sake of testing

          6. Added a sample XML config file for testing.
          Added ConfigurableParserTest that performs an end to end test of all
          the above.

          commit ae51417d8881dd90b921f02c2677a7d5bfd69a30
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-09T03:23:47Z

          remove unwanted TODO:

          commit 64db9614cfaa3e873a9dc9efc6d201d887f6a4c5
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-12T14:43:44Z

          Added a TikaConfigException, params getter

          commit 0d69ca7540b4350e043c5b9ed34d14a46bd70cf7
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-12T14:51:14Z

          Test Case updated with newer exception and getter

          commit e780d56652d48dd0f50b4e62a58153e95f055022
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-05-23T18:30:13Z

          merged upstream changes and resolved conflicts

          commit b64612dcdb021fbb8b3fbf31d70a02f1bb7736cb
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-05-23T18:52:55Z

          Update javadoc with @since

          commit 01869923533b330ec7728995e3ee5feceee1b90e
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-05-26T00:18:25Z

          Added support for type for runtime parameters

          commit 9e08a6bc0a2b2ffad12e4b6f90725b2201d0a69b
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-05-26T00:50:49Z

          Updated test case with type checking


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/123 TIKA-1986 : support for typed parameters from XML configuration This is a sub-task of TIKA-1508 (please merge #91 first). This extends configuration file with a place for specifying parameters and their types. + Added a class called `Param` + Relying on JAXB annotations to convert between XML and Java objects + Test case updated to check for types You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1986 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #123 commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-09T02:23:19Z Add uniformity to parser parameter configuration. 1. Added Configurable interface. This can be used for all services like Parser, Detector which can take configurable parameters. 2. Added ConfigurableParser interface which extends Parser interface. I didn't add new method to existing Parser because that will break the compatibility. 3. AbstractParser extends ConfigurableParser and has default implementation for configure() contract. I think it is safe to do so and it doesnt break anything. In addition all parsers which extend AbstractParser will can easily access config from TikaConfig if they want to 3. Added a TODO to TikaConfig, after this should allow multiple instances of same parser with different runtime configurations. 4. TikaConfig is modified to detect if instance can be configured, if so, then checks if params are available in XML file, parses the params and invokes configure(ctx) method with these params 5. Added DummyConfigurableParser that simply copies parameters to metadata for the sake of testing 6. Added a sample XML config file for testing. Added ConfigurableParserTest that performs an end to end test of all the above. commit ae51417d8881dd90b921f02c2677a7d5bfd69a30 Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-09T03:23:47Z remove unwanted TODO: commit 64db9614cfaa3e873a9dc9efc6d201d887f6a4c5 Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-12T14:43:44Z Added a TikaConfigException, params getter commit 0d69ca7540b4350e043c5b9ed34d14a46bd70cf7 Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-12T14:51:14Z Test Case updated with newer exception and getter commit e780d56652d48dd0f50b4e62a58153e95f055022 Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-05-23T18:30:13Z merged upstream changes and resolved conflicts commit b64612dcdb021fbb8b3fbf31d70a02f1bb7736cb Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-05-23T18:52:55Z Update javadoc with @since commit 01869923533b330ec7728995e3ee5feceee1b90e Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-05-26T00:18:25Z Added support for type for runtime parameters commit 9e08a6bc0a2b2ffad12e4b6f90725b2201d0a69b Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-05-26T00:50:49Z Updated test case with type checking
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Wow. This will be so cool. I made one trivial recommendation on Github.

          The following is picky, but this is a major new addition to the API, and I think your superb work deserves a critical review.

          Would it be possible to add serialization of the parameters to TikaConfigSerializer? I may have missed this. Not crucial for the initial patch, but we'll want to add this.

          Going forward (Tika 2.0), how do we want the parsers to interact with the configuration? Should they interact directly with the params, or should they initialize their current param variables with the params?

          What's the benefit of configuring with a ParseContext instead of a Map<String, Param<?>>? Along the same lines, configure doesn't actually configure, it just sets the Map... Should we rename it as a setter? Or should we make it do something?

          IIRC AbstractParser is only there as syntactic sugar to gloss over the newer requirement to pass in a ParseContext at parse time. In 2.0, AbstractParser will go away...I think. So, might be better to make ConfigurableParser an abstract class that handles all of the functionality instead of an interface.

          Not crucial for the initial patch, but it would be great if we could add error checking/automatic configuration (perhaps via reflection) at the level of the ConfigurableParser so that each parser (configurable?) doesn't have to set their own params.

          Show
          tallison@mitre.org Tim Allison added a comment - Wow. This will be so cool. I made one trivial recommendation on Github. The following is picky, but this is a major new addition to the API, and I think your superb work deserves a critical review. Would it be possible to add serialization of the parameters to TikaConfigSerializer? I may have missed this. Not crucial for the initial patch, but we'll want to add this. Going forward (Tika 2.0), how do we want the parsers to interact with the configuration? Should they interact directly with the params, or should they initialize their current param variables with the params? What's the benefit of configuring with a ParseContext instead of a Map<String, Param<?>>? Along the same lines, configure doesn't actually configure, it just sets the Map... Should we rename it as a setter? Or should we make it do something? IIRC AbstractParser is only there as syntactic sugar to gloss over the newer requirement to pass in a ParseContext at parse time. In 2.0, AbstractParser will go away...I think. So, might be better to make ConfigurableParser an abstract class that handles all of the functionality instead of an interface. Not crucial for the initial patch, but it would be great if we could add error checking/automatic configuration (perhaps via reflection) at the level of the ConfigurableParser so that each parser (configurable?) doesn't have to set their own params.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          The other major thing...I realize looking back at our discussion. Unless we're very careful with parameter names, we'll have collisions across parsers and even Configurables link.

          What would you think of:

          context.setParams(SpecificConfigurable.class, Map<String, Param<?>)
          context.setParam(SpecificConfigurable.class, String paramName, Param<?>)

          Show
          tallison@mitre.org Tim Allison added a comment - The other major thing...I realize looking back at our discussion. Unless we're very careful with parameter names, we'll have collisions across parsers and even Configurables link . What would you think of: context.setParams(SpecificConfigurable.class, Map<String, Param<?>) context.setParam(SpecificConfigurable.class, String paramName, Param<?>)
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment - - edited

          Thanks for the review and feedback. Tim Allison

          Would it be possible to add serialization of the parameters to TikaConfigSerializer? I may have missed this. Not crucial for the initial patch, but we'll want to add this.

          Yes, The functionality is right here in save(...). However, that needs to be integrated TikaConfigSerializer.
          I was thinking of updating the entire Configuration <-> XML (de)serialization using JAXB. I found that the deserialization and service loading are bundled together, so I could not proceed without separating them

          Going forward (Tika 2.0), how do we want the parsers to interact with the configuration? Should they interact directly with the params, or should they initialize their current param variables with the params?

          I really really liked your suggestion of invoking the setter methods to directly initialize. I +1 for having that integrated in tika 2.0. However, let's have the same functionality in bit more cleaner way by using annotations instead of crude reflection function calls.
          Lets make a contract - if there is a @Param annotation for attributes in parser objects, we shall initialize them. what do you say?

          What's the benefit of configuring with a ParseContext instead of a Map<String, Param<?>>? Along the same lines, configure doesn't actually configure, it just sets the Map... Should we rename it as a setter? Or should we make it do something?

          1. Right! The base class method doesn't configure. Child parsers should override that method to configure themselves, tika-core supplies the required parameters to configure.
          2. I think using a wrapper object like ParseContext allows us to make improvements later without breaking. using Map<String, Param<?>> is too limiting for future enhancements. I have a feeling that the parsers are going to be more complex, and they require many other resources to initialize.

          IIRC AbstractParser is only there as syntactic sugar to gloss over the newer requirement to pass in a ParseContext at parse time. In 2.0, AbstractParser will go away...I think. So, might be better to make ConfigurableParser an abstract class that handles all of the functionality instead of an interface

          I missed Tika 2.0 design discussions. +1 for the suggestion.

          Not crucial for the initial patch, but it would be great if we could add error checking/automatic configuration (perhaps via reflection) at the level of the ConfigurableParser so that each parser (configurable?) doesn't have to set their own params.

          +1, lets define @Param annotation and take it forward.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - - edited Thanks for the review and feedback. Tim Allison Would it be possible to add serialization of the parameters to TikaConfigSerializer? I may have missed this. Not crucial for the initial patch, but we'll want to add this. Yes, The functionality is right here in save(...) . However, that needs to be integrated TikaConfigSerializer. I was thinking of updating the entire Configuration <-> XML (de)serialization using JAXB. I found that the deserialization and service loading are bundled together, so I could not proceed without separating them Going forward (Tika 2.0), how do we want the parsers to interact with the configuration? Should they interact directly with the params, or should they initialize their current param variables with the params? I really really liked your suggestion of invoking the setter methods to directly initialize. I +1 for having that integrated in tika 2.0. However, let's have the same functionality in bit more cleaner way by using annotations instead of crude reflection function calls. Lets make a contract - if there is a @Param annotation for attributes in parser objects, we shall initialize them. what do you say? What's the benefit of configuring with a ParseContext instead of a Map<String, Param<?>>? Along the same lines, configure doesn't actually configure, it just sets the Map... Should we rename it as a setter? Or should we make it do something? 1. Right! The base class method doesn't configure. Child parsers should override that method to configure themselves, tika-core supplies the required parameters to configure. 2. I think using a wrapper object like ParseContext allows us to make improvements later without breaking. using Map<String, Param<?>> is too limiting for future enhancements. I have a feeling that the parsers are going to be more complex, and they require many other resources to initialize. IIRC AbstractParser is only there as syntactic sugar to gloss over the newer requirement to pass in a ParseContext at parse time. In 2.0, AbstractParser will go away...I think. So, might be better to make ConfigurableParser an abstract class that handles all of the functionality instead of an interface I missed Tika 2.0 design discussions. +1 for the suggestion. Not crucial for the initial patch, but it would be great if we could add error checking/automatic configuration (perhaps via reflection) at the level of the ConfigurableParser so that each parser (configurable?) doesn't have to set their own params. +1, lets define @Param annotation and take it forward.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Unless there are objections, let's create a dev branch (tika-1986?) in git and collaborate there?
          Thank you, again!

          Show
          tallison@mitre.org Tim Allison added a comment - Unless there are objections, let's create a dev branch (tika-1986?) in git and collaborate there? Thank you, again!
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          +1

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - +1
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          UPDATE:
          Auto-initialization of params is now supported!
          Please have a look at https://github.com/apache/tika/pull/123#issuecomment-223158522

          CC Tim Allison Chris A. Mattmann

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - UPDATE: Auto-initialization of params is now supported! Please have a look at https://github.com/apache/tika/pull/123#issuecomment-223158522 CC Tim Allison Chris A. Mattmann
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          thanks Thamme Gowda I will check it out

          Show
          chrismattmann Chris A. Mattmann added a comment - thanks Thamme Gowda I will check it out
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Great! Let's move development to the TIKA-1508 branch in asf's git?

          Show
          tallison@mitre.org Tim Allison added a comment - Great! Let's move development to the TIKA-1508 branch in asf's git?
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -


          Can I create TIKA-1508 branch and merge these changes? That will let me test if I have write access or not.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Can I create TIKA-1508 branch and merge these changes? That will let me test if I have write access or not.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          I created the branch. Have at it!

          Show
          tallison@mitre.org Tim Allison added a comment - I created the branch. Have at it!
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Thanks.
          Merged these changes to TIKA-1508 branch and pushed to asf git.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Thanks. Merged these changes to TIKA-1508 branch and pushed to asf git.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Great. I'll try to integrate the PDFParser as an applied test case...might have time today.

          Thank you, again! I cannot tell you how much I look forward to having this, esp. with the new parameters I added to the PDFParser for ocr...so much easier to evaluate by changing the config file instead of updating the jar.

          Show
          tallison@mitre.org Tim Allison added a comment - Great. I'll try to integrate the PDFParser as an applied test case...might have time today. Thank you, again! I cannot tell you how much I look forward to having this, esp. with the new parameters I added to the PDFParser for ocr...so much easier to evaluate by changing the config file instead of updating the jar.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          I added a few more unit tests and added one proof-of-concept parameter to the PDFParser. We'll have to do some rewiring to transfer the params from the PDFParser to the PDFParserConfig so that we can send them to the PDF2XHTML object, but that's just wiring.

          This is really exciting.

          Thank you!

          Going forward (say 1.1[56] or 2.0), will we want to get rid of the current properties files and require configuration by TikaConfig?

          Show
          tallison@mitre.org Tim Allison added a comment - I added a few more unit tests and added one proof-of-concept parameter to the PDFParser. We'll have to do some rewiring to transfer the params from the PDFParser to the PDFParserConfig so that we can send them to the PDF2XHTML object, but that's just wiring. This is really exciting. Thank you! Going forward (say 1.1 [56] or 2.0), will we want to get rid of the current properties files and require configuration by TikaConfig?
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Awesome. I am glad it helped to simplify customization of parsers.

          I too +1 for getting rid of many property files and make the configurations easy to find and modify.
          we may need to start a wiki page to list all params for each parser, so we will have all configs at one place.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Awesome. I am glad it helped to simplify customization of parsers. I too +1 for getting rid of many property files and make the configurations easy to find and modify. we may need to start a wiki page to list all params for each parser, so we will have all configs at one place.
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          we may need to start a wiki page to list all params for each parser, so we will have all configs at one place.

          We could do this, but can't we rely on the javadocs for each parser's getters/setters?

          In looking at this again, and remembering Chris A. Mattmann's and Nick Burch's discussion...do we need the Configurable layer at all? Or, in other words, what does that buy us over just sending in a ParseContext for each parse/translation/langdetect, etc?

          You've solved the xml configuration issue very elegantly with the JAXB code.

          What would you think of changing:

          public void setParam(String key, Param<?> value)
          

          to

          public void setParam(Class clazz, String key, Param<?> value)
          

          ?

          This would accomplish Chris A. Mattmann's suggestion for strict namespacing of the parameters.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited we may need to start a wiki page to list all params for each parser, so we will have all configs at one place. We could do this, but can't we rely on the javadocs for each parser's getters/setters? In looking at this again, and remembering Chris A. Mattmann 's and Nick Burch 's discussion...do we need the Configurable layer at all? Or, in other words, what does that buy us over just sending in a ParseContext for each parse/translation/langdetect, etc? You've solved the xml configuration issue very elegantly with the JAXB code. What would you think of changing: public void setParam( String key, Param<?> value) to public void setParam( Class clazz, String key, Param<?> value) ? This would accomplish Chris A. Mattmann 's suggestion for strict namespacing of the parameters.
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          +1 to including clazz parameter.

          Show
          chrismattmann Chris A. Mattmann added a comment - +1 to including clazz parameter.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Wait, sorry, make that:

          public void setParam(String key, Param<?> value)
          

          to

          public void setParam(Class clazz, Param<?> value)
          

          I started down this road and just pushed a draft to the TIKA-1508 branch.

          I also added a check in case someone configures a parameter in the TikaConfig xml that doesn't exist in the object.

          Show
          tallison@mitre.org Tim Allison added a comment - Wait, sorry, make that: public void setParam( String key, Param<?> value) to public void setParam( Class clazz, Param<?> value) I started down this road and just pushed a draft to the TIKA-1508 branch. I also added a check in case someone configures a parameter in the TikaConfig xml that doesn't exist in the object.
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Regarding documentation for all the parameters:
          We can let javadoc auto generate it for us from annotations.
          My plan of action is : We can add an additional attribute called 'description' where we can describe the field/param. And then make javadoc program include these strings

          {name, required, description}

          in generated javadocs. I will make those changes if I receive +1 on this plan.

          Regarding why do we need Configurable layer ?:
          I think its is not necessary now. My plan was to use it as a Marker Interface to decide whether or not we need to bind parameters from the configuration. (That if condition could save some time as the reflections are usually slower as per my knowledge). Looks like we are going to make most parsers as configurable to some extent. So we can just ignore the configurable check?

          Regarding addition of clazz parameter:
          I am not sure what is the use of it, could you please point me to the place where it was previously discussed?
          If you have started working on it, that's fine too.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Regarding documentation for all the parameters: We can let javadoc auto generate it for us from annotations. My plan of action is : We can add an additional attribute called 'description' where we can describe the field/param. And then make javadoc program include these strings {name, required, description} in generated javadocs. I will make those changes if I receive +1 on this plan. Regarding why do we need Configurable layer ?: I think its is not necessary now. My plan was to use it as a Marker Interface to decide whether or not we need to bind parameters from the configuration. (That if condition could save some time as the reflections are usually slower as per my knowledge). Looks like we are going to make most parsers as configurable to some extent. So we can just ignore the configurable check? Regarding addition of clazz parameter: I am not sure what is the use of it, could you please point me to the place where it was previously discussed? If you have started working on it, that's fine too.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          We can let javadoc auto generate it for us from annotations.

          Ok, I figured the setters on the parsers/langdetectors/translators would be sufficient, no?

          Looks like we are going to make most parsers as configurable to some extent. So we can just ignore the configurable check?

          Y, that's what I just did on the TIKA-1508 branch. I'll remove the rest of the Configurable code now in that branch, if that's ok.

          I am not sure what is the use of it, could you please point me to the place where it was previously discussed?

          The use is to avoid parameter name conflicts (let's imagine two parsers have "isLenient", you'd want to be able to tell one parser to be lenient but the other one not lenient).

          here and here

          Y, I gave it a try in the TIKA-1508 branch just now.

          Show
          tallison@mitre.org Tim Allison added a comment - We can let javadoc auto generate it for us from annotations. Ok, I figured the setters on the parsers/langdetectors/translators would be sufficient, no? Looks like we are going to make most parsers as configurable to some extent. So we can just ignore the configurable check? Y, that's what I just did on the TIKA-1508 branch. I'll remove the rest of the Configurable code now in that branch, if that's ok. I am not sure what is the use of it, could you please point me to the place where it was previously discussed? The use is to avoid parameter name conflicts (let's imagine two parsers have "isLenient", you'd want to be able to tell one parser to be lenient but the other one not lenient). here and here Y, I gave it a try in the TIKA-1508 branch just now.
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Ok, I figured the setters on the parsers/langdetectors/translators would be sufficient, no?

          That would be okay. How about this http://stackoverflow.com/q/20354943/1506477, is it better ?

          I'll remove the rest of the Configurable code now in that branch, if that's ok.

          Does that mean we have to remove void configure(ParseContext context) ; method as well?
          I think we will need this call, IMO this is how we can tell a Parser/translator/langdetector that we are done binding ALL the parameters and it is now a good time to initialize and possibly raise exception if something is not okay.

          The use is to avoid parameter name conflicts (let's imagine two parsers have "isLenient", you'd want to be able to tell one parser to be lenient but the other one not lenient).

          I do not see that coming. Each parser is currently getting its own Map of params so there are no such name conflicts across the parsers with current implementation Check this code snippet .

          The way parameters are specified in XML file is local to each parser (the params is wrapped inside parser tag, which means the map of params is not global and not shared by other parsers.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Ok, I figured the setters on the parsers/langdetectors/translators would be sufficient, no? That would be okay. How about this http://stackoverflow.com/q/20354943/1506477 , is it better ? I'll remove the rest of the Configurable code now in that branch, if that's ok. Does that mean we have to remove void configure(ParseContext context) ; method as well? I think we will need this call, IMO this is how we can tell a Parser/translator/langdetector that we are done binding ALL the parameters and it is now a good time to initialize and possibly raise exception if something is not okay. The use is to avoid parameter name conflicts (let's imagine two parsers have "isLenient", you'd want to be able to tell one parser to be lenient but the other one not lenient). I do not see that coming. Each parser is currently getting its own Map of params so there are no such name conflicts across the parsers with current implementation Check this code snippet . The way parameters are specified in XML file is local to each parser (the params is wrapped inside parser tag, which means the map of params is not global and not shared by other parsers.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          That would be okay. How about this http://stackoverflow.com/q/20354943/1506477, is it better ?

          Would that go on the individual parser's setters or somewhere else? If on the setters, I think the standard javadocs should be sufficient. If elsewhere, y, that'd make sense...very cool.

          Does that mean we have to remove void configure(ParseContext context) ; method as well?

          I think we will need this call, IMO this is how we can tell a Parser/translator/langdetector that we are done binding ALL the parameters and it is now a good time to initialize and possibly raise exception if something is not okay.

          Ah, ok. I hadn't thought about that. I was figuring that individual setting would be sufficient. Do we have use cases where we need to set multiple settings and then initialize? I can imagine some... The reason I'd like to get rid of configure(ParseContext context) is that it adds yet another way to configure the parsers. If the user can easily touch the actual parser programmatically (i.e. it isn't wrapped in the AutoDetectParser), they'd now have three ways of configuring a parser: a) directly with setters, configuring something in the ParseContext and b) then submitting that in the call to parse(...) or c) the configure() method.

          The way parameters are specified in XML file is local to each parser (the params is wrapped inside parser tag, which means the map of params is not global and not shared by other parsers.

          Right, I agree completely about how the XML file works, but if a user is calling the AutoDetectParser programmatically, they have a single ParseContext for every parser. Or am I missing something about how configure is intended to work?

          Thank you, again!

          Show
          tallison@mitre.org Tim Allison added a comment - That would be okay. How about this http://stackoverflow.com/q/20354943/1506477 , is it better ? Would that go on the individual parser's setters or somewhere else? If on the setters, I think the standard javadocs should be sufficient. If elsewhere, y, that'd make sense...very cool. Does that mean we have to remove void configure(ParseContext context) ; method as well? I think we will need this call, IMO this is how we can tell a Parser/translator/langdetector that we are done binding ALL the parameters and it is now a good time to initialize and possibly raise exception if something is not okay. Ah, ok. I hadn't thought about that. I was figuring that individual setting would be sufficient. Do we have use cases where we need to set multiple settings and then initialize? I can imagine some... The reason I'd like to get rid of configure(ParseContext context) is that it adds yet another way to configure the parsers. If the user can easily touch the actual parser programmatically (i.e. it isn't wrapped in the AutoDetectParser), they'd now have three ways of configuring a parser: a) directly with setters, configuring something in the ParseContext and b) then submitting that in the call to parse(...) or c) the configure() method. The way parameters are specified in XML file is local to each parser (the params is wrapped inside parser tag, which means the map of params is not global and not shared by other parsers. Right, I agree completely about how the XML file works, but if a user is calling the AutoDetectParser programmatically, they have a single ParseContext for every parser. Or am I missing something about how configure is intended to work? Thank you, again!
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Do we have use cases where we need to set multiple settings and then initialize?

          We do in PDFParser...I now remember.

                  boolean checkExtractAccessPermission = getBooleanProp(props.getProperty("checkExtractAccessPermission"), false);
                  boolean allowExtractionForAccessibility = getBooleanProp(props.getProperty("allowExtractionForAccessibility"), true);
          
                  if (checkExtractAccessPermission == false) {
                      //silently ignore the crazy configuration of checkExtractAccessPermission = false,
                      //but allowExtractionForAccessibility=false
                      accessChecker = new AccessChecker();
                  } else {
                      accessChecker = new AccessChecker(allowExtractionForAccessibility);
                  }
          

          I think we could work around this with careful setters.

          Show
          tallison@mitre.org Tim Allison added a comment - Do we have use cases where we need to set multiple settings and then initialize? We do in PDFParser...I now remember. boolean checkExtractAccessPermission = getBooleanProp(props.getProperty( "checkExtractAccessPermission" ), false ); boolean allowExtractionForAccessibility = getBooleanProp(props.getProperty( "allowExtractionForAccessibility" ), true ); if (checkExtractAccessPermission == false ) { //silently ignore the crazy configuration of checkExtractAccessPermission = false , //but allowExtractionForAccessibility= false accessChecker = new AccessChecker(); } else { accessChecker = new AccessChecker(allowExtractionForAccessibility); } I think we could work around this with careful setters.
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Thanks for the reply

          Would that go on the individual parser's setters or somewhere else?

          Those annotations will go on individual setters/fields. example :

          @Field(description="Executor, example:python")
          public String executor
          

          Do we have use cases where we need to set multiple settings and then initialize?

          Here is my example: imagine we have a parser that has to download a resource and place it on FS to make it work. It has two params: model 'url' and 'localpath' . We have to wait for both the params to be bound. I recommend to use configure(...) method to download the resource (and raise an exception if it fails). Ofcourse we can otherwise add an if(condition) in parse(..) to do the same, but it is (1) ugly - parse isnt meant to do this. If there are any issues with settings we let the parse fail early itself before even attempting to read the input and call parse (2) inefficient - if(...) is checked for all incoming documents. Having multiple options would be okay, we can use whichever is convenient.

          if a user is calling the AutoDetectParser programmatically, they have a single ParseContext for every parser.

          Ah! I missed that part. Please suggest what to do in that case?

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Thanks for the reply Would that go on the individual parser's setters or somewhere else? Those annotations will go on individual setters/fields. example : @Field(description= "Executor, example:python" ) public String executor Do we have use cases where we need to set multiple settings and then initialize? Here is my example: imagine we have a parser that has to download a resource and place it on FS to make it work. It has two params: model 'url' and 'localpath' . We have to wait for both the params to be bound. I recommend to use configure(...) method to download the resource (and raise an exception if it fails). Ofcourse we can otherwise add an if(condition) in parse(..) to do the same, but it is (1) ugly - parse isnt meant to do this. If there are any issues with settings we let the parse fail early itself before even attempting to read the input and call parse (2) inefficient - if(...) is checked for all incoming documents. Having multiple options would be okay, we can use whichever is convenient. if a user is calling the AutoDetectParser programmatically, they have a single ParseContext for every parser. Ah! I missed that part. Please suggest what to do in that case?
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Those annotations will go on individual setters/fields. example :

          @Field(description="Executor, example:python") public String executor 

          How will this be better than the regular javadocs?

          It has two params: model 'url' and 'localpath'...

          One thought, you could check for null on the other for both of the setters; if the other isn't null, then try to load. I don't like that solution...I need to think a bit more about this...

          Ah! I missed that part. Please suggest what to do in that case?

          I did.

          public void setParam(Class clazz, Param<?> value)
          
          Show
          tallison@mitre.org Tim Allison added a comment - - edited Those annotations will go on individual setters/fields. example : @Field(description= "Executor, example:python" ) public String executor How will this be better than the regular javadocs? It has two params: model 'url' and 'localpath'... One thought, you could check for null on the other for both of the setters; if the other isn't null, then try to load. I don't like that solution...I need to think a bit more about this... Ah! I missed that part. Please suggest what to do in that case? I did. public void setParam( Class clazz, Param<?> value)
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Hi Thamme Gowda, I added an "Initializable" similar to your original "Configurable". I think this will be rarely used, but it will let users carry out the types of initialization that you pointed out might be crucial.

          Going back to Nick Burch's distinction between initialization and run-time/per document parameterization, I'm now thinking that it might be best to use the Param only at initialization...keep our current XParserConfig pattern for parameterization per document. I think this cleans up quite a bit...if you like this simplification, we can also strip out the Param setting/getting in ParseContext entirely.

          I just pushed a draft of this direction. Let me know what you think.

          Again, many thanks...this is a fantastic addition!!!

          Show
          tallison@mitre.org Tim Allison added a comment - Hi Thamme Gowda , I added an "Initializable" similar to your original "Configurable". I think this will be rarely used, but it will let users carry out the types of initialization that you pointed out might be crucial. Going back to Nick Burch 's distinction between initialization and run-time/per document parameterization, I'm now thinking that it might be best to use the Param only at initialization...keep our current XParserConfig pattern for parameterization per document. I think this cleans up quite a bit...if you like this simplification, we can also strip out the Param setting/getting in ParseContext entirely. I just pushed a draft of this direction. Let me know what you think. Again, many thanks...this is a fantastic addition!!!
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Ah! I missed that part. Please suggest what to do in that case?

          Under the current proposal, strip out all Param stuff from the ParseContext and make the user continue to use the XParserConfig pattern.

          Show
          tallison@mitre.org Tim Allison added a comment - Ah! I missed that part. Please suggest what to do in that case? Under the current proposal, strip out all Param stuff from the ParseContext and make the user continue to use the XParserConfig pattern.
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Can someone distinguish the choice here for me? I don't get it.

          Show
          chrismattmann Chris A. Mattmann added a comment - Can someone distinguish the choice here for me? I don't get it.
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          From Nick

          > I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class.

          I view ParseContext as somewhere you configure things on a per-document basis, not a per-parser basis.

          So, need to set where Tesseract lives on your system? Applies to everything, so on the parser. Need to tell Tesseract to use a German not an English dictionary on this particular jpeg? Applies to just this one document being parserd, so on the ParseContext

          link

          My current proposal is to add Thamme Gowda's fantastic beans for the initialization step, but to go back to what we have for runtime/per-file (setting parameters programmatically).

          If we allow users to use the Param stuff programmatically, they'll have some nasty java like so:

                  Param<Boolean> paramVal = new Param<>("sortByPosition", new Boolean(true));
                  context.setParam(PDFParser.class, paramVal);
          

          And there are no compile time guarantees that "sortByPosition" exists for PDFParser...

          Show
          tallison@mitre.org Tim Allison added a comment - - edited From Nick > I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class. I view ParseContext as somewhere you configure things on a per-document basis, not a per-parser basis. So, need to set where Tesseract lives on your system? Applies to everything, so on the parser. Need to tell Tesseract to use a German not an English dictionary on this particular jpeg? Applies to just this one document being parserd, so on the ParseContext link My current proposal is to add Thamme Gowda 's fantastic beans for the initialization step, but to go back to what we have for runtime/per-file (setting parameters programmatically). If we allow users to use the Param stuff programmatically, they'll have some nasty java like so: Param<Boolean> paramVal = new Param<>("sortByPosition", new Boolean(true)); context.setParam(PDFParser.class, paramVal); And there are no compile time guarantees that "sortByPosition" exists for PDFParser...
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          +1 for Initializable and keeping the functionality.

          I think this will be rarely used,

          Thats right, if there is any abstract class we can have no-operation implementation for that method.

          I think this cleans up quite a bit...if you like this simplification, we can also strip out the Param setting/getting in ParseContext entirely.

          +1 for the simplification. Since we guarantee that the Param s will be bound by the framework, we no longer need to wrap them in ParseContext. Agreed.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - +1 for Initializable and keeping the functionality. I think this will be rarely used, Thats right, if there is any abstract class we can have no-operation implementation for that method. I think this cleans up quite a bit...if you like this simplification, we can also strip out the Param setting/getting in ParseContext entirely. +1 for the simplification. Since we guarantee that the Param s will be bound by the framework, we no longer need to wrap them in ParseContext . Agreed.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Ok, I just reverted ParseContext to what it is in trunk, and I added some code to handle configuration of non-primitives in PDFParser.

          I think the only thing remaining...Do DummyConfigurableParser and its tests do anything different from DummyParameterizedParser? If not, let's remove.

          Other than that, please take a look, edit as you see fit and then commit!

          Show
          tallison@mitre.org Tim Allison added a comment - Ok, I just reverted ParseContext to what it is in trunk, and I added some code to handle configuration of non-primitives in PDFParser. I think the only thing remaining...Do DummyConfigurableParser and its tests do anything different from DummyParameterizedParser? If not, let's remove. Other than that, please take a look, edit as you see fit and then commit!
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          I too think DummyConfigurableParser and its tests can be removed as they are redundant.

          I just checked your commits on TIKA-1508 branch. You simplified it very well, thank you

          My little concern is with the initialize() method of Initializable interface.

          For example, I have a parser ObjectRecognitionParser which internally uses Tensorflow Image recogniser (you may treat it as an internal service for the parser) to classify images.
          I have exploited the runtime binding functionality to bind these params to the inner services as well.
          Link to my pull request

          Now I see that with the newer design, we miss out that functionality!
          Whats your opinion on changing initialize() to initialize(Map<String, Param<?>> params) ? It may seem redundant at first sight, but I think binding params to inner services is also a valid case IMO Parser cannot anticipate and declare all the parameters it's inner services accept!

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - I too think DummyConfigurableParser and its tests can be removed as they are redundant. I just checked your commits on TIKA-1508 branch. You simplified it very well, thank you My little concern is with the initialize() method of Initializable interface. For example, I have a parser ObjectRecognitionParser which internally uses Tensorflow Image recogniser (you may treat it as an internal service for the parser) to classify images. I have exploited the runtime binding functionality to bind these params to the inner services as well. Link to my pull request Now I see that with the newer design, we miss out that functionality! Whats your opinion on changing initialize() to initialize(Map<String, Param<?>> params) ? It may seem redundant at first sight, but I think binding params to inner services is also a valid case IMO Parser cannot anticipate and declare all the parameters it's inner services accept!
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Doh. Sorry. Didn't anticipate that. Isn't collaboration a good thing?

          Take a look now.

          I may have hosed the generics... Let me know.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited Doh. Sorry. Didn't anticipate that. Isn't collaboration a good thing? Take a look now. I may have hosed the generics... Let me know.
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          Awesome! thank you.

          All good now.

          +1.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - Awesome! thank you. All good now. +1.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          The honor is all yours! I think this is ready for trunk. Probably best to squash or do a diff patch before merging?

          Show
          tallison@mitre.org Tim Allison added a comment - The honor is all yours! I think this is ready for trunk. Probably best to squash or do a diff patch before merging?
          Hide
          tgowdan@gmail.com Thamme Gowda added a comment -

          nice
          We have tensorflow image recognition PR targetted to TIKA-1508. I think its a good idea to merge it on this and then carry it to master/trunk. Or else I will have to close it on github and raise a new PR.

          Probably best to squash or do a diff patch before merging?

          I have no comments.

          Show
          tgowdan@gmail.com Thamme Gowda added a comment - nice We have tensorflow image recognition PR targetted to TIKA-1508 . I think its a good idea to merge it on this and then carry it to master/trunk. Or else I will have to close it on github and raise a new PR. Probably best to squash or do a diff patch before merging? I have no comments.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Sounds good. Go for it!

          Show
          tallison@mitre.org Tim Allison added a comment - Sounds good. Go for it!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/123

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/123
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Resolved! Thanks everyone! Tim Allison Thamme Gowda

          Show
          chrismattmann Chris A. Mattmann added a comment - Resolved! Thanks everyone! Tim Allison Thamme Gowda
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1092 (See https://builds.apache.org/job/Tika-trunk/1092/)
          TIKA-1986 – add Initializable, strip out handling of params passed via (tallison: rev 01320372fdbfc5e4ff0cfe0fe85fab91b5b369e7)

          • (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
          • (add) tika-core/src/test/java/org/apache/tika/parser/InitializableParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • (edit) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java
          • (add) tika-core/src/test/java/org/apache/tika/parser/DummyInitializableParser.java
          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-initializable.xml
          • (add) tika-core/src/main/java/org/apache/tika/config/Initializable.java
          • (edit) tika-core/src/test/java/org/apache/tika/parser/DummyParameterizedParser.java
          • (edit) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-some-parameters.xml
            TIKA-1986 – revert parsecontext to ab7c325 and update PDFParser to (tallison: rev 30b0f6674450f9aab4bacb5ab071ae96f4835c63)
          • (edit) tika-core/src/main/java/org/apache/tika/parser/ParseContext.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • (edit) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          • (add) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config-non-primitives.xml
            TIKA-1986 – allow params to be passed into initializable, delete (tallison: rev f8fe50a2239cc8875fe3df7de6215e40da48b587)
          • (edit) tika-core/src/main/java/org/apache/tika/config/Initializable.java
          • (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java
          • (delete) tika-core/src/test/java/org/apache/tika/parser/DummyConfigurableParser.java
          • (delete) tika-core/src/test/java/org/apache/tika/parser/ConfigurableParserTest.java
          • (edit) tika-core/src/test/java/org/apache/tika/parser/DummyInitializableParser.java
          • (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
          • (edit) tika-core/src/test/java/org/apache/tika/utils/AnnotationUtilsTest.java
            TIKA-1986 Fixed the outdated Bad paramter test case and removed deadcode (thammegowda: rev a1d1a81d15b287b44041d590d8b124be06af0a19)
          • (edit) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java
          • (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java
            Update changes for TIKA-1508, TIKA-1993, TIKA-1986. (mattmann: rev da82df5e9def9698fd32f85fe706660641d7c31f)
          • (edit) CHANGES.txt
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Tika-trunk #1092 (See https://builds.apache.org/job/Tika-trunk/1092/ ) TIKA-1986 – add Initializable, strip out handling of params passed via (tallison: rev 01320372fdbfc5e4ff0cfe0fe85fab91b5b369e7) (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java (add) tika-core/src/test/java/org/apache/tika/parser/InitializableParserTest.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java (edit) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java (add) tika-core/src/test/java/org/apache/tika/parser/DummyInitializableParser.java (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -initializable.xml (add) tika-core/src/main/java/org/apache/tika/config/Initializable.java (edit) tika-core/src/test/java/org/apache/tika/parser/DummyParameterizedParser.java (edit) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -some-parameters.xml TIKA-1986 – revert parsecontext to ab7c325 and update PDFParser to (tallison: rev 30b0f6674450f9aab4bacb5ab071ae96f4835c63) (edit) tika-core/src/main/java/org/apache/tika/parser/ParseContext.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java (edit) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java (add) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config-non-primitives.xml TIKA-1986 – allow params to be passed into initializable, delete (tallison: rev f8fe50a2239cc8875fe3df7de6215e40da48b587) (edit) tika-core/src/main/java/org/apache/tika/config/Initializable.java (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java (delete) tika-core/src/test/java/org/apache/tika/parser/DummyConfigurableParser.java (delete) tika-core/src/test/java/org/apache/tika/parser/ConfigurableParserTest.java (edit) tika-core/src/test/java/org/apache/tika/parser/DummyInitializableParser.java (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java (edit) tika-core/src/test/java/org/apache/tika/utils/AnnotationUtilsTest.java TIKA-1986 Fixed the outdated Bad paramter test case and removed deadcode (thammegowda: rev a1d1a81d15b287b44041d590d8b124be06af0a19) (edit) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java Update changes for TIKA-1508 , TIKA-1993 , TIKA-1986 . (mattmann: rev da82df5e9def9698fd32f85fe706660641d7c31f) (edit) CHANGES.txt

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              tgowdan@gmail.com Thamme Gowda
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development