Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1508

Add uniformity to parser parameter configuration

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: None
    • Labels:
      None

      Description

      We can currently configure parsers by the following means:
      1) programmatically by direct calls to the parsers or their config objects
      2) sending in a config object through the ParseContext
      3) modifying .properties files for specific parsers (e.g. PDFParser)

      Rather than scattering the landscape with .properties files for each parser, it would be great if we could specify parser parameters in the main config file, something along the lines of this:

          <parser class="org.apache.tika.parser.audio.AudioParser">
            <params>
              <int name="someparam1">2</int>
              <str name="someOtherParam2">something or other</str>
            </params>
            <mime>audio/basic</mime>
            <mime>audio/x-aiff</mime>
            <mime>audio/x-wav</mime>
          </parser>
      

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Hide
          tallison@mitre.org Tim Allison added a comment -

          TIKA-1508 emerged from a conversation started on TIKA-1445.

          Show
          tallison@mitre.org Tim Allison added a comment - TIKA-1508 emerged from a conversation started on TIKA-1445 .
          Hide
          tallison@mitre.org Tim Allison added a comment -

          From the dev list, it looks like there is interest in moving forward on this....somehow. Let's move discussion on this issue back to this ticket.

          Show
          tallison@mitre.org Tim Allison added a comment - From the dev list, it looks like there is interest in moving forward on this....somehow. Let's move discussion on this issue back to this ticket.
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          To respond to Nick's point about including concrete examples, I think that there are two parsers that use a .properties file to configure baseline configuration: TesseractOCR and PDFParser. The external parser configuration is far more complex and beyond the scope of this issue...I think.

          So for the PDFParser, there are these Boolean properties:

          enableAutospace true
          extractAnnotationText true
          sortByPosition	false
          suppressDuplicateOverlappingText	false
          useNonSequentialParser	false
          extractAcroFormContent	true
          extractInlineImages false
          extractUniqueInlineImagesOnly true
          checkExtractAccessPermission false
          allowExtractionForAccessibility true
          

          This would be changed to something like this in the TikaConfig file:

              <parser class="org.apache.tika.parser.pdf.PDFParser">
                <params>
                  <boolean name="enableAutospace">true</boolean>
                  <boolean name="extractAnnotationText">true</boolean>
                   ....
                </params>
                 <mime>application/pdf</mime>
              </parser>
          

          These can be set via the .properties file, or through code by passing in a PDFParserConfig object via the ParseContext or via direct calls to setters in PDFParser.

          For this issue, I propose a simple unifying mechanism to specify default behavior for parser parameters so that a user can see and manipulate the options from one file.

          This proposal would not break (I don't think; but I've had failures of imagination before...) any dynamic code that clients are using via ParseContext.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited To respond to Nick's point about including concrete examples, I think that there are two parsers that use a .properties file to configure baseline configuration: TesseractOCR and PDFParser. The external parser configuration is far more complex and beyond the scope of this issue...I think. So for the PDFParser, there are these Boolean properties: enableAutospace true extractAnnotationText true sortByPosition false suppressDuplicateOverlappingText false useNonSequentialParser false extractAcroFormContent true extractInlineImages false extractUniqueInlineImagesOnly true checkExtractAccessPermission false allowExtractionForAccessibility true This would be changed to something like this in the TikaConfig file: <parser class="org.apache.tika.parser.pdf.PDFParser"> <params> <boolean name="enableAutospace">true</boolean> <boolean name="extractAnnotationText">true</boolean> .... </params> <mime>application/pdf</mime> </parser> These can be set via the .properties file, or through code by passing in a PDFParserConfig object via the ParseContext or via direct calls to setters in PDFParser. For this issue, I propose a simple unifying mechanism to specify default behavior for parser parameters so that a user can see and manipulate the options from one file. This proposal would not break (I don't think; but I've had failures of imagination before...) any dynamic code that clients are using via ParseContext.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          From Nick Burch,

          > My personal view is that properties/configuration which apply to all documents of a type should be set at Parser creation time, either from a Tika Config object or someone in code doing "Parser p = new FooParser(); p.setblah();". Properties/config which vary from document to document should be set on the ParseContext
          >
          > Not sure if we had consensus on that as a policy though?

          +1 (from me)

          Show
          tallison@mitre.org Tim Allison added a comment - From Nick Burch , > My personal view is that properties/configuration which apply to all documents of a type should be set at Parser creation time, either from a Tika Config object or someone in code doing "Parser p = new FooParser(); p.setblah();". Properties/config which vary from document to document should be set on the ParseContext > > Not sure if we had consensus on that as a policy though? +1 (from me)
          Hide
          davemeikle Dave Meikle added a comment -
          • Pushed to 1.11 following 1.10 release
          Show
          davemeikle Dave Meikle added a comment - Pushed to 1.11 following 1.10 release
          Hide
          tallison@mitre.org Tim Allison added a comment -

          We need easier XML serialization of TikaConfig before we go any further in adding features.

          Show
          tallison@mitre.org Tim Allison added a comment - We need easier XML serialization of TikaConfig before we go any further in adding features.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Thamme Gowda, now I remember why I paused on this – TIKA-1657.

          Show
          tallison@mitre.org Tim Allison added a comment - Thamme Gowda , now I remember why I paused on this – TIKA-1657 .
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user thammegowda opened a pull request:

          https://github.com/apache/tika/pull/91

          TIKA-1508 : Add uniformity to parser parameter configuration - contributed by Thamme Gowda

          1. Added `Configurable` interface.
          This can be used for all services like `Parser`, `Detector` which can take
          configurable parameters.

          2. Added `ConfigurableParser` interface which extends `Parser` interface.
          I didn't add new method to existing `Parser` because
          that will break the compatibility.

          3. `AbstractParser` extends `ConfigurableParser` and has
          default implementation for configure() contract.
          I think it is safe to do so and it doesn't break anything.
          In addition, all parsers which extend `AbstractParser` can easily
          access config from TikaConfig if they want to

          3. Added a TODO to `TikaConfig`,
          after this should allow multiple instances of same parser with
          different runtime configurations.

          4. `TikaConfig` is modified to detect if instance can be configured,
          if so, then checks if params are available in XML file, parses the
          params and invokes configure(ctx) method with these params

          5. Added `DummyConfigurableParser` that simply copies parameters to
          metadata for the sake of testing

          6. Added a sample XML config file for testing.
          Added `ConfigurableParserTest` that performs an end to end test of all
          the above.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/thammegowda/tika TIKA-1508

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/91.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #91


          commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-09T02:23:19Z

          Add uniformity to parser parameter configuration.

          1. Added Configurable interface.
          This can be used for all services like Parser, Detector which can take
          configurable parameters.

          2. Added ConfigurableParser interface which extends Parser interface.
          I didn't add new method to existing Parser because
          that will break the compatibility.

          3. AbstractParser extends ConfigurableParser and has
          default implementation for configure() contract.
          I think it is safe to do so and it doesnt break anything.
          In addition all parsers which extend AbstractParser will can easily
          access config from TikaConfig if they want to

          3. Added a TODO to TikaConfig,
          after this should allow multiple instances of same parser with
          different runtime configurations.

          4. TikaConfig is modified to detect if instance can be configured,
          if so, then checks if params are available in XML file, parses the
          params and invokes configure(ctx) method with these params

          5. Added DummyConfigurableParser that simply copies parameters to
          metadata for the sake of testing

          6. Added a sample XML config file for testing.
          Added ConfigurableParserTest that performs an end to end test of all
          the above.

          commit ae51417d8881dd90b921f02c2677a7d5bfd69a30
          Author: Thamme Gowda <tgowdan@gmail.com>
          Date: 2016-03-09T03:23:47Z

          remove unwanted TODO:


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user thammegowda opened a pull request: https://github.com/apache/tika/pull/91 TIKA-1508 : Add uniformity to parser parameter configuration - contributed by Thamme Gowda 1. Added `Configurable` interface. This can be used for all services like `Parser`, `Detector` which can take configurable parameters. 2. Added `ConfigurableParser` interface which extends `Parser` interface. I didn't add new method to existing `Parser` because that will break the compatibility. 3. `AbstractParser` extends `ConfigurableParser` and has default implementation for configure() contract. I think it is safe to do so and it doesn't break anything. In addition, all parsers which extend `AbstractParser` can easily access config from TikaConfig if they want to 3. Added a TODO to `TikaConfig`, after this should allow multiple instances of same parser with different runtime configurations. 4. `TikaConfig` is modified to detect if instance can be configured, if so, then checks if params are available in XML file, parses the params and invokes configure(ctx) method with these params 5. Added `DummyConfigurableParser` that simply copies parameters to metadata for the sake of testing 6. Added a sample XML config file for testing. Added `ConfigurableParserTest` that performs an end to end test of all the above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thammegowda/tika TIKA-1508 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/91.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #91 commit b2cf23178ede925b0ef23f88ebf1aff95c8c157c Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-09T02:23:19Z Add uniformity to parser parameter configuration. 1. Added Configurable interface. This can be used for all services like Parser, Detector which can take configurable parameters. 2. Added ConfigurableParser interface which extends Parser interface. I didn't add new method to existing Parser because that will break the compatibility. 3. AbstractParser extends ConfigurableParser and has default implementation for configure() contract. I think it is safe to do so and it doesnt break anything. In addition all parsers which extend AbstractParser will can easily access config from TikaConfig if they want to 3. Added a TODO to TikaConfig, after this should allow multiple instances of same parser with different runtime configurations. 4. TikaConfig is modified to detect if instance can be configured, if so, then checks if params are available in XML file, parses the params and invokes configure(ctx) method with these params 5. Added DummyConfigurableParser that simply copies parameters to metadata for the sake of testing 6. Added a sample XML config file for testing. Added ConfigurableParserTest that performs an end to end test of all the above. commit ae51417d8881dd90b921f02c2677a7d5bfd69a30 Author: Thamme Gowda <tgowdan@gmail.com> Date: 2016-03-09T03:23:47Z remove unwanted TODO:
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Thamme Gowda, this looks really good. I merged it on a local branch and made minimal modifications to the PDFParser to make this work...and it did...very straightforwardly.

          Recommendations:
          1) Let's not use ParseContext as the vehicle for param passing, we will have collisions with different parsers if anyone uses configure() outside of the normal course of events...it is simpler to use Map<String,String>. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. context.set{{PDFParser.class, Map<String,String> params. I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.

          2) We need to add a Map<String,String> getParams() to the Configurable interface so that when we serialize the config to XML, we can remember what the params were. We should also add that to the TikaConfigSerializer.

          3) It would be great to add parameter checking into the AbstractParser or somewhere else? I think a configurable (parser? or all configurables?) should need to register valid configuration keys at initialization, and then we can check the validity of the keys passed in during configure() once in the base class so that each extending parser isn't required to do this on its own.

          4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one.

          5) We'll need to add @Override configure() to pass on the configuration information to the wrapped parser in parser wrappers: ParserDecorator, DelegatingParser, ParserPostProcessor...any others? Or, do we need to set the parameters in the wrapped parser before wrapping?

          Questions for the broader dev community:

          A) Are we ok with Map<String,String> parameters? Or should we follow, say, Solr's syntax for type checking?

          <int name="pageWidth">10</int>
          

          B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a configure(Configurable configurable, Map<String, String> params method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?

          Show
          tallison@mitre.org Tim Allison added a comment - - edited Thamme Gowda , this looks really good. I merged it on a local branch and made minimal modifications to the PDFParser to make this work...and it did...very straightforwardly. Recommendations: 1) Let's not use ParseContext as the vehicle for param passing, we will have collisions with different parsers if anyone uses configure() outside of the normal course of events...it is simpler to use Map<String,String>. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. context.set{{PDFParser.class, Map<String,String> params . I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly. 2) We need to add a Map<String,String> getParams() to the Configurable interface so that when we serialize the config to XML, we can remember what the params were. We should also add that to the TikaConfigSerializer. 3) It would be great to add parameter checking into the AbstractParser or somewhere else? I think a configurable (parser? or all configurables?) should need to register valid configuration keys at initialization, and then we can check the validity of the keys passed in during configure() once in the base class so that each extending parser isn't required to do this on its own. 4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one. 5) We'll need to add @Override configure() to pass on the configuration information to the wrapped parser in parser wrappers: ParserDecorator, DelegatingParser, ParserPostProcessor...any others? Or, do we need to set the parameters in the wrapped parser before wrapping? Questions for the broader dev community: A) Are we ok with Map<String,String> parameters? Or should we follow, say, Solr's syntax for type checking? <int name="pageWidth">10</int> B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a configure(Configurable configurable, Map<String, String> params method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Tim and Thamme:

          1) Let's not use ParseContext as the vehicle for param passing, we will have collisions with different parsers if anyone uses configure() outside of the normal course of events...it is simpler to use Map<String,String>. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. context.set{{PDFParser.class, Map<String,String> params. I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.

          I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class.

          4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one.

          +1

          A) Are we ok with Map<String,String> parameters? Or should we follow, say, Solr's syntax for type checking?

          Yes I'm OK with Map<String,String>

          B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a configure(Configurable configurable, Map<String, String> params method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?

          Maybe not too complex, but not as a start Just my 2c.

          Show
          chrismattmann Chris A. Mattmann added a comment - Tim and Thamme: 1) Let's not use ParseContext as the vehicle for param passing, we will have collisions with different parsers if anyone uses configure() outside of the normal course of events...it is simpler to use Map<String,String>. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. context.set{{PDFParser.class, Map<String,String> params. I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly. I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class. 4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one. +1 A) Are we ok with Map<String,String> parameters? Or should we follow, say, Solr's syntax for type checking? Yes I'm OK with Map<String,String> B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a configure(Configurable configurable, Map<String, String> params method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex? Maybe not too complex, but not as a start Just my 2c.
          Hide
          gagravarr Nick Burch added a comment -

          > I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class.

          I view ParseContext as somewhere you configure things on a per-document basis, not a per-parser basis.

          So, need to set where Tesseract lives on your system? Applies to everything, so on the parser. Need to tell Tesseract to use a German not an English dictionary on this particular jpeg? Applies to just this one document being parserd, so on the ParseContext

          Show
          gagravarr Nick Burch added a comment - > I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by property name (FQ) and/or by class. I view ParseContext as somewhere you configure things on a per-document basis, not a per-parser basis. So, need to set where Tesseract lives on your system? Applies to everything, so on the parser. Need to tell Tesseract to use a German not an English dictionary on this particular jpeg? Applies to just this one document being parserd, so on the ParseContext
          Hide
          thammegowda Thamme Gowda added a comment -

          1. Please Let me know the final verdict when all of you agree to one thing, I will make changes as per the recommendation.

          2. +1. Agreed. I will update the code

          3. I really like the suggestion. That would allow us to validate parameters and fail early when they are wrong.
          But I think it requires a lot of rework on the side of Parsers as well. Parsers have to declare what params they expect from the configuration file, it is only after that we will be able to validate. Another simple/lazy approach is to simply assume all params are valid, pass all the params and let the parser raise exception when there are errors. The current PR has the latter approach. Let me know what you think?

          4. +1 Agreed. Will update the code.

          5. Anything that extends AbstractParser is now instance of Configurable. Anything that is an instance of Configurable will be checked and invoked with params while instantiating them. So ParserDecorator, DelegatingParser, ParserPostProcessor are all covered, Yay!! If no params are found in config file, a call is made with empty Map<String, String>. Now it is up to the implementation of these parsers to make use of params by overriding configure() method.

          A & B) I think solr way is complex to implement considering that we dont gain much after the effort (As of now we can just do Integer.parse() or similar ). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration.

          Being said that, I am open to all the suggestions.

          Show
          thammegowda Thamme Gowda added a comment - 1. Please Let me know the final verdict when all of you agree to one thing, I will make changes as per the recommendation. 2. +1. Agreed. I will update the code 3. I really like the suggestion. That would allow us to validate parameters and fail early when they are wrong. But I think it requires a lot of rework on the side of Parsers as well. Parsers have to declare what params they expect from the configuration file, it is only after that we will be able to validate. Another simple/lazy approach is to simply assume all params are valid, pass all the params and let the parser raise exception when there are errors. The current PR has the latter approach. Let me know what you think? 4. +1 Agreed. Will update the code. 5. Anything that extends AbstractParser is now instance of Configurable. Anything that is an instance of Configurable will be checked and invoked with params while instantiating them. So ParserDecorator, DelegatingParser, ParserPostProcessor are all covered, Yay!! If no params are found in config file, a call is made with empty Map<String, String>. Now it is up to the implementation of these parsers to make use of params by overriding configure() method. A & B) I think solr way is complex to implement considering that we dont gain much after the effort (As of now we can just do Integer.parse() or similar ). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration. Being said that, I am open to all the suggestions.
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Maybe not too complex, but not as a start Just my 2c.

          The reason I propose this to start is so that we don't have to worry about changing our config and backward compatibility ...

          I think solr way is complex to implement considering that we dont gain much after the effort (As of now we can just do Integer.parse() or similar ). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration.

          I think we gain quite a bit. The reason I suggested it is tied to 3)...What we would gain is automatic type checking/verification on loading from the config file.

          If the configurator were something like this:

              public static void configure(Configurable configurable, Map<String, ParamValue> params)
                      throws TikaConfigParameterException {
                  for (String k : params.keySet()) {
                      //camel case the first character
                      String setterName = "set"+k.substring(0,1).toUpperCase(Locale.ENGLISH)+k.substring(1);
                      try {
                          ParamValue v = params.get(k);
                          Method method = configurable.getClass().getDeclaredMethod(setterName, v.getTypeClass());
                          switch (v.getType()) {
                              case BOOLEAN:
                                  method.invoke(configurable, v.getBoolean());
                                  break;
                              case INTEGER:
                                  method.invoke(configurable, v.getInteger());
                                  break;
                              //....
                          }
                      } catch (Exception e) {
                          throw new TikaConfigParameterException("Exception with parameter: " + k +" with class: " +
                                  configurable.getClass(), e);
                      }
                  }
              }
          

          Then each parser that had configurations wouldn't have to register its configurable parameters (strike that suggestion above ), but there would be an exception at creation time if the setN method with a correctly typed parameter didn't exist.

          In short, small bit of code at the outset, but each parser wouldn't then have to repeat the parseInt and handle NumberFormatExceptions, etc. Each configurable parser wouldn't have to worry about configuration at all, except to have appropriate setters.

          Show
          tallison@mitre.org Tim Allison added a comment - - edited Maybe not too complex, but not as a start Just my 2c. The reason I propose this to start is so that we don't have to worry about changing our config and backward compatibility ... I think solr way is complex to implement considering that we dont gain much after the effort (As of now we can just do Integer.parse() or similar ). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration. I think we gain quite a bit. The reason I suggested it is tied to 3)...What we would gain is automatic type checking/verification on loading from the config file. If the configurator were something like this: public static void configure(Configurable configurable, Map< String , ParamValue> params) throws TikaConfigParameterException { for ( String k : params.keySet()) { //camel case the first character String setterName = "set" +k.substring(0,1).toUpperCase(Locale.ENGLISH)+k.substring(1); try { ParamValue v = params.get(k); Method method = configurable.getClass().getDeclaredMethod(setterName, v.getTypeClass()); switch (v.getType()) { case BOOLEAN: method.invoke(configurable, v.getBoolean()); break ; case INTEGER: method.invoke(configurable, v.getInteger()); break ; //.... } } catch (Exception e) { throw new TikaConfigParameterException( "Exception with parameter: " + k + " with class: " + configurable.getClass(), e); } } } Then each parser that had configurations wouldn't have to register its configurable parameters (strike that suggestion above ), but there would be an exception at creation time if the setN method with a correctly typed parameter didn't exist. In short, small bit of code at the outset, but each parser wouldn't then have to repeat the parseInt and handle NumberFormatExceptions, etc. Each configurable parser wouldn't have to worry about configuration at all, except to have appropriate setters.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Nick Burch and Chris A. Mattmann, I agree about the distinction, but if we can use the same mechanism for both, let's do it?

          What do you think of a compromise...

          A user can send in a Map<String, String> into the ParseContext with a key of the parser class name that is supposed to use them.

          Want to set pdf config:

          context.set(PDFParser.class, pdfParams)
          

          how about parameters for the RTFParser with clashing parameter names, no problem:

          context.set(RTFParser.class, rtfParams)
          

          Each configurable parser would then be responsible for checking the context to see if there was a value to its own class and then set its own parameters.

          Show
          tallison@mitre.org Tim Allison added a comment - Nick Burch and Chris A. Mattmann , I agree about the distinction, but if we can use the same mechanism for both, let's do it? What do you think of a compromise... A user can send in a Map<String, String> into the ParseContext with a key of the parser class name that is supposed to use them. Want to set pdf config: context.set(PDFParser.class, pdfParams) how about parameters for the RTFParser with clashing parameter names, no problem: context.set(RTFParser.class, rtfParams) Each configurable parser would then be responsible for checking the context to see if there was a value to its own class and then set its own parameters.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          sketch of ParamValue...

          public class ParamValue {
          
              enum Type {
                  BOOLEAN,
                  INTEGER,
                  LONG,
                  FLOAT,
                  DOUBLE,
                  STRING
              }
          
              final Type type;
              final String val;
              
              public ParamValue(int intVal) {
                  this.type = Type.INTEGER;
                  this.val = Integer.toString(intVal); 
              }
              
              //...
              
              public ParamValue(Type type, String value) {
                  this.type = type;
                  this.val = value;
              }
          
              public boolean getBoolean() {
                  if (! type.equals(Type.BOOLEAN)) {
                      throw new IllegalArgumentException("can't cast a "+type+ " to a boolean");
                  }
                  if ("true".equals(val)) {
                      return true;
                  } else if ("false".equals(val)) {
                      return false;
                  }
                  throw new IllegalArgumentException("Couldn't parse "+ val + " as a boolean; must be 'true' or 'false'");
              }
          
          Show
          tallison@mitre.org Tim Allison added a comment - sketch of ParamValue... public class ParamValue { enum Type { BOOLEAN, INTEGER, LONG, FLOAT, DOUBLE, STRING } final Type type; final String val; public ParamValue( int intVal) { this .type = Type.INTEGER; this .val = Integer .toString(intVal); } //... public ParamValue(Type type, String value) { this .type = type; this .val = value; } public boolean getBoolean() { if (! type.equals(Type.BOOLEAN)) { throw new IllegalArgumentException( "can't cast a " +type+ " to a boolean " ); } if ( " true " .equals(val)) { return true ; } else if ( " false " .equals(val)) { return false ; } throw new IllegalArgumentException( "Couldn't parse " + val + " as a boolean ; must be ' true ' or ' false '" ); }
          Hide
          thammegowda Thamme Gowda added a comment -

          Tim Allison Chris A. Mattmann

          Starting the discussion again on this issue.

          Your suggestion regarding specialized exception and a getter for params is already incorporated in PR.
          https://github.com/apache/tika/pull/91#commits-pushed-64db961

          Suggestion related to Type system is pending. I see what we are getting with the support for value Type.
          However, I still have a concern as I feel this is incomplete. Whats our take on multivalued params (aka Arrays of integers and strings as parameter values)?

          Show
          thammegowda Thamme Gowda added a comment - Tim Allison Chris A. Mattmann Starting the discussion again on this issue. Your suggestion regarding specialized exception and a getter for params is already incorporated in PR. https://github.com/apache/tika/pull/91#commits-pushed-64db961 Suggestion related to Type system is pending. I see what we are getting with the support for value Type. However, I still have a concern as I feel this is incomplete. Whats our take on multivalued params (aka Arrays of integers and strings as parameter values)?
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Whats our take on multivalued params (aka Arrays of integers and strings as parameter values)?

          IMHO, might be nice, but Solr has been getting along without them. As soon as I say that, of course, we could use that in the DigestingParser...but we could also just back off to string serialization: "md5,sha1", etc.

          Show
          tallison@mitre.org Tim Allison added a comment - Whats our take on multivalued params (aka Arrays of integers and strings as parameter values)? IMHO, might be nice, but Solr has been getting along without them. As soon as I say that, of course, we could use that in the DigestingParser...but we could also just back off to string serialization: "md5,sha1", etc.
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Thank you for getting this conversation going again!

          Show
          tallison@mitre.org Tim Allison added a comment - Thank you for getting this conversation going again!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/91

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/91
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Resolved! Thanks everyone! Tim Allison Thamme Gowda

          Show
          chrismattmann Chris A. Mattmann added a comment - Resolved! Thanks everyone! Tim Allison Thamme Gowda
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1092 (See https://builds.apache.org/job/Tika-trunk/1092/)
          TIKA-1508 add some unit tests for ParameterizedParserTest (tallison: rev 18ab8f91f19b105270a107174afd64fe81d20f72)

          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-bad-types.xml
          • (delete) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-parametrized.xml
          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-some-parameters.xml
          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-bad-parameters.xml
          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-bad-values.xml
          • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-1986-parameterized.xml
          • (edit) tika-core/src/test/java/org/apache/tika/parser/ConfigurableParserTest.java
          • (delete) tika-core/src/test/java/org/apache/tika/parser/DummyParametrizedParser.java
          • (delete) tika-core/src/test/java/org/apache/tika/parser/ParametrizedParserTest.java
          • (add) tika-core/src/test/java/org/apache/tika/parser/DummyParameterizedParser.java
          • (add) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java
            TIKA-1508 proof of concept with on parameter on PDFParser (tallison: rev 853750d47fa99afad0df6d4f4727a35c96675254)
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • (add) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
            updated with the new changes of TIKA-1508 (thammegowda: rev 31cf12d51539c1ec01c12d73f96e7eadfd2339f0)
          • (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowImageRecParser.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/recognition/tf/TensorflowImageRecParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecogniser.java
            Update changes for TIKA-1508, TIKA-1993, TIKA-1986. (mattmann: rev da82df5e9def9698fd32f85fe706660641d7c31f)
          • (edit) CHANGES.txt
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Tika-trunk #1092 (See https://builds.apache.org/job/Tika-trunk/1092/ ) TIKA-1508 add some unit tests for ParameterizedParserTest (tallison: rev 18ab8f91f19b105270a107174afd64fe81d20f72) (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -bad-types.xml (delete) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -parametrized.xml (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -some-parameters.xml (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -bad-parameters.xml (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -bad-values.xml (add) tika-core/src/test/resources/org/apache/tika/config/ TIKA-1986 -parameterized.xml (edit) tika-core/src/test/java/org/apache/tika/parser/ConfigurableParserTest.java (delete) tika-core/src/test/java/org/apache/tika/parser/DummyParametrizedParser.java (delete) tika-core/src/test/java/org/apache/tika/parser/ParametrizedParserTest.java (add) tika-core/src/test/java/org/apache/tika/parser/DummyParameterizedParser.java (add) tika-core/src/test/java/org/apache/tika/parser/ParameterizedParserTest.java TIKA-1508 proof of concept with on parameter on PDFParser (tallison: rev 853750d47fa99afad0df6d4f4727a35c96675254) (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java (add) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-config.xml (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java updated with the new changes of TIKA-1508 (thammegowda: rev 31cf12d51539c1ec01c12d73f96e7eadfd2339f0) (edit) tika-core/src/main/java/org/apache/tika/utils/AnnotationUtils.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowImageRecParser.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java (edit) tika-parsers/src/test/java/org/apache/tika/parser/recognition/tf/TensorflowImageRecParserTest.java (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecogniser.java Update changes for TIKA-1508 , TIKA-1993 , TIKA-1986 . (mattmann: rev da82df5e9def9698fd32f85fe706660641d7c31f) (edit) CHANGES.txt
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Need to update 2.x branch

          Show
          tallison@mitre.org Tim Allison added a comment - Need to update 2.x branch

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              tallison@mitre.org Tim Allison
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development