Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Tika handles many different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats that Tika handles only partially (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome.

      Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in two different places:
      NUTCH_HOME/lib : tika-core.jar
      NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
      Since the core uses Tika only for its MimeType functionality, only tika-core is needed at the main lib level, whereas the tika plugin needs tika-parsers.jar plus all the jars used internally by Tika.

      Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.

      Unlike most other parsers, Tika handles more than one mime type, which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika.
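
The selection rule described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual ParserFactory code: the class `ParserSelectionSketch`, the method `selectParser`, and the two maps standing in for parse-plugins.xml and the plugin descriptors are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParserSelectionSketch {

    /** Wildcard content type, as used in the Tika plugin descriptor. */
    static final String ANY_TYPE = "*";

    /**
     * Picks a parser id for a mime type.
     * explicitMappings: mime type -> parser id (stands in for parse-plugins.xml).
     * declaredTypes: parser id -> content type declared by the plugin (hypothetical model).
     */
    static String selectParser(String mimeType,
                               Map<String, String> explicitMappings,
                               Map<String, String> declaredTypes) {
        // An explicit association always wins over the wildcard parser.
        String mapped = explicitMappings.get(mimeType);
        if (mapped != null && declaredTypes.containsKey(mapped)) {
            return mapped;
        }
        // Otherwise fall back to the first parser that declares "*".
        for (Map.Entry<String, String> entry : declaredTypes.entrySet()) {
            if (ANY_TYPE.equals(entry.getValue())) {
                return entry.getKey();
            }
        }
        return null; // no suitable parser is available
    }

    public static void main(String[] args) {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("parse-html", "text/html");
        declared.put("parse-tika", ANY_TYPE);

        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("text/html", "parse-html");

        System.out.println(selectParser("text/html", mappings, declared));
        System.out.println(selectParser("application/pdf", mappings, declared));
    }
}
```

With these sample maps, HTML goes to the explicitly mapped parser while any unmapped type falls through to the wildcard parser.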

      The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. This also means that we can use the HTMLParseFilters in exactly the same way. The main difference is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
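
As an illustration of the SAX-to-DOM idea, here is a minimal sketch that uses only the JDK's JAXP classes. It is not the code from the patch: the class `SaxToDomSketch` and the method `buildDom` are hypothetical, the SAX events are hard-coded where a Tika parser would normally emit them, and the elements are created without a namespace to keep the example short.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToDomSketch {

    /** Builds a DOM document by replaying a few hand-written SAX events. */
    static Document buildDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // An identity TransformerHandler is a SAX ContentHandler that
        // writes the events it receives into the given DOMResult.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        // A real parser would emit these events; here they are hard-coded.
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/");

        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.startElement("", "body", "body", new AttributesImpl());
        handler.startElement("", "a", "a", href);
        char[] text = "a link".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "a", "a");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDom();
        // DOM-based utilities (link detection, parse filters) could now walk the tree.
        System.out.println(doc.getElementsByTagName("a").getLength());
    }
}
```

Once the events are in a DOM tree, any utility written against `org.w3c.dom` nodes can be reused regardless of the original document format.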

      The following libraries are required in the lib/ directory of the tika-parser :

      <library name="asm-3.1.jar"/>
      <library name="bcmail-jdk15-144.jar"/>
      <library name="commons-compress-1.0.jar"/>
      <library name="commons-logging-1.1.1.jar"/>
      <library name="dom4j-1.6.1.jar"/>
      <library name="fontbox-0.8.0-incubator.jar"/>
      <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
      <library name="hamcrest-core-1.1.jar"/>
      <library name="jce-jdk13-144.jar"/>
      <library name="jempbox-0.8.0-incubator.jar"/>
      <library name="metadata-extractor-2.4.0-beta-1.jar"/>
      <library name="mockito-core-1.7.jar"/>
      <library name="objenesis-1.0.jar"/>
      <library name="ooxml-schemas-1.0.jar"/>
      <library name="pdfbox-0.8.0-incubating.jar"/>
      <library name="poi-3.5-FINAL.jar"/>
      <library name="poi-ooxml-3.5-FINAL.jar"/>
      <library name="poi-scratchpad-3.5-FINAL.jar"/>
      <library name="tagsoup-1.2.jar"/>
      <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
      <library name="xml-apis-1.0.b2.jar"/>
      <library name="xmlbeans-2.3.0.jar"/>

      There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.

      Again, your comments are welcome. Please bear in mind that this is just a first step.

      Julien
      http://www.digitalpebble.com

      1. NUTCH-766.v2
        93 kB
        Julien Nioche
      2. NUTCH-766-v3.patch
        92 kB
        Julien Nioche
      3. NutchTikaConfig.java
        4 kB
        Sami Siren
      4. sample.tar.gz
        42 kB
        Julien Nioche
      5. TikaParser.java
        8 kB
        Sami Siren

        Issue Links

          Activity

          Julien Nioche added a comment -

          patch for the Tika-plugin

          Julien Nioche added a comment -

          Patch for the ParserFactory to allow * as mimetype value for a parser plugin

          Chris A. Mattmann added a comment -

          Hi Julien:

          I have had a look and was trying to test it out but got sidetracked. Give me this week to try and put together a final reviewable/committable patch; otherwise, it's all yours.

          Cheers,
          Chris

          Julien Nioche added a comment -

          Hi Chris,

          No worries, I'd rather wait for you to have a look at it. It's quite a big change and it would be better if someone else had a look at it. Being the author I might miss something obvious

          Thanks

          J.

          Sami Siren added a comment -

          I took a brief look at the proposed patch; some comments:

          The public API footprint of the new classes should be smaller, e.g. use private, package-private or protected methods/classes as much as possible.

          I think the end result of this plugin should be replacing all Tika-supported parsers (or the parsers we choose to replace) with the TikaParser, and not building parallel ways to parse the same formats. So I think we need to copy all of the existing test files and move and adapt the existing testcases fully before committing this. That is a good way of seeing that the parse result is what is expected and also of finding out about possible differences between the old and the Tika versions.

          Julien Nioche added a comment -

          > I think the end result of this plugin should be replacing all Tika supported parsers (or the parsers we choose to replace) with the TikaParser and not to build a parallel ways to parse same formats.

          That's how I see it. It's just that we have the option of choosing whether or not to use Tika for a given mimetype. Tika is used by default unless an association is created between a parser implementation and a mimetype in parse-plugins.xml.

          > So I think we need to copy all of the the existing test files and move&adapt the existing testcases fully before committing this. That is a good way of seeing that the parse result is what is expected and also find out about possible differences with old vs. Tika version.

          Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.

          BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers.

          Even if we decide to keep using the old plugins for some of the formats to start with, we'd still be able to use the Tika plugin by default for the ones which already have the same coverage.

          Sami Siren added a comment -

          > Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.

          I meant test files for the parsers we replace, not all of them.

          > BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers

          OK, I had missed that one.

          Chris A. Mattmann added a comment - - edited

          > Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.

          +1, I'm going to agree on this one here Julien. Other communities have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

          I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle.

          Cheers,
          Chris

          Sami Siren added a comment -

          >+1, I'm going to agree on this one here Julien. Other communities have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

          Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only one?

          Chris A. Mattmann added a comment -

          Hi Sami:

          > Chris, can you please explain me how keeping two components doing identical work would be more backwards compatible than having only 1?

          Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their nutch configuration (nutch-site.xml, or nutch-default.xml, or even parse-plugins), removing the parsing plugins (i.e., basically saying they don't exist anymore and that the deployed configuration must be updated to use the tika-plugin) would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away? It would make it easier from a configuration backwards-compat perspective.

          HTH,
          Chris

          Andrzej Bialecki added a comment -

          I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent deprecation note, but I feel equally strongly that we should not prolong their life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. We simply don't have resources to maintain so many duplicate plugins, and instead we should direct our efforts to improve those in Tika.

          Sami Siren added a comment -

          > Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their nutch configuration (nutch-site.xml, or nutch-default.xml, or even parse-plugins), to remove out the parsing plugins (e.g., basically say they don't exist anymore and update your deployed configuration to use the tika-plugin), this patch would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away. It would make it easier from a configuration backwards-compat perspective.

          Ok, so you mean that we need to have duplicate parser plugins because we don't want to ask people already using nutch to reconfigure the bits this involves now even though we have to do it later? How is postponing going to ease the task they need to do anyway at some point? I still don't understand the (longer term) benefit.

          I am not strongly against the idea of keeping duplicate plugins; it's just another ~20M in the .job. What I am worried about is that history will repeat itself and we will end up with one more case of duplicate components (in this case many of them) doing the same work and no interest in cleaning up afterwards. Doing it the way I suggested would guarantee that this will not happen.

          Julien Nioche added a comment -

          Here is a slightly better version of the patch which :
          • fixes a small bug in the Tika parser (the API has changed slightly between 1.5beta and 1.5)
          • fixes a bug with the TestParserFactory
          • adds the tika-plugin to the list of plugins to be built in src/plugin/build.xml
          • limits public exposure of methods and classes (see Sami's comment)
          • modified parse-plugins.xml : added parse-tika and commented out associations between some mime-types and the old parsers

          I've also added an ANT script which uses IVY to pull the dependencies and copies them into the lib dir. Obviously this won't be needed when the plugin is committed but should simplify the initial testing. All you need to do after applying the patch is to :

          cd src/plugin/parse-tika/
          ant -f build-ivy.xml

          I am also attaching the content of the sample directory as an archive: just unpack it into src/plugin/parse-tika/ before calling ant test-plugins

          Julien

          Julien Nioche added a comment -

          new version of the patch + archive containing the binary docs used for testing

          Julien Nioche added a comment -

          Updated version of the plugin : uses Tika 0.6

          Andrzej Bialecki added a comment -

          +1 to commit this - please remember to update nutch-default.xml to switch to the tika plugin, perhaps add a comment about the deprecated parse-* plugins - most people look here and not in the parse-plugins, where this change is documented...

          Chris A. Mattmann added a comment -

          > +1 to commit this...

          Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections between now and then...

          Thanks!

          Cheers,
          Chris

          Chris A. Mattmann added a comment -

          I'm going to hold off on committing this tonight. I've updated the docs per Andrzej, and I've also updated CHANGES.txt, but when running:

          ant clean compile-core test
          

          I'm seeing these messages during plugin testing for parse-tika:

          2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - Can't retrieve Tika parser for mime-type application/pdf
          ------------- ---------------- ---------------
          
          Testcase: testIt took 2.684 sec
                  FAILED
          null
          junit.framework.AssertionFailedError
                  at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79)
          

          It seems that the TikaConfig is not being found? I was looking at TikaParser#setConf and it seems that a default config is being created for Tika, but maybe not being loaded correctly? I need to look into this more...

          Sami Siren added a comment -

          I suggest that we still drive this a bit further. Currently this patch does not use Tika for package formats or HTML.

          Julien: was there a reason not to use the AutoDetect parser? The only thing that I could come up with was that the mime type detection would be done twice. We could get around this by implementing something similar to what the composite parser does (it uses a parser (AutodetectParser) class from the context to do further parsing) to cover all supported pkg formats.

          Also, was there a reason not to parse html with tika?

          I have a patch nearby to demonstrate some of the improvements that I will try to post shortly.

          Sami Siren added a comment -

          Extended TikaConfig that is able to load parsers and can be used with existing tika classes. The call to (super) cannot load the parsers, but the config is then processed again locally. This is a hack and hopefully at some point we can drop the class altogether.

          Sami Siren added a comment -

          Modified parser that can process package formats too. To get rid of the mime type detection happening twice we have to extend AutoDetectParser so that it skips the initial detection but still does the detection for the rest of the content (in pkg formats).

          Julien Nioche added a comment -

          @Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk?

          @Sami :

          > was there a reason not to use AutoDetect parser?

          I suppose we could, as long as we give it a clue about the MimeType obtained from the Content. As you pointed out, there could be a duplication with the detection done by Mime-Util. I suppose one way to do it would be to add a new version of the method getParse(Content conte, MimeType type). That's an interesting point.

          > Also was there a reson not to parse html wtih tika?

          It is supposed to do so, if it does not then it's a bug which needs urgent fixing.

          Regarding parsing package formats, I think the plan is that Tika will handle that in the future but we could try to do that now if we find a relatively clean mechanism for doing so. BTW could you please send a diff and not the full code of the class you posted earlier, that would make the comparison much easier.

          Julien Nioche added a comment - - edited

          I had a closer look at the HTML parsing issue. What happens is that the association between the mime-type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and gets the ones with a matching mimetype (or * for Tika). The Tika parser takes no precedence over the default HTML parser, so the latter comes first in the list and is used for parsing.

          Of course that does not happen if parse-html is not specified in plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't think we want to have to specify explicitly that tika should be used in all the mappings; explicit mappings should be reserved for cases where a parser must be used instead of Tika.

          What we could do though is that in the cases where no explicit mapping is set for a mimetype, Tika (or any parser marked as supporting any mimetype) will be put first in the list of discovered parsers so it would remain the default choice unless an explicit mapping is set (even if a plugin is loaded and can handle the type).

          Makes sense?

          The ParserFactory section of the patch v3 can be replaced by:

          Index: src/java/org/apache/nutch/parse/ParserFactory.java
          ===================================================================
          --- src/java/org/apache/nutch/parse/ParserFactory.java (revision 909059)
          +++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy)
          @@ -348,11 +348,23 @@
                         contentType)) {
                     extList.add(extensions[i]);
                   }
          +        else if ("*".equals(extensions[i].getAttribute("contentType"))) {
          +          // default plugins get the priority
          +          extList.add(0, extensions[i]);
          +        }
                 }

                 if (extList.size() > 0) {
                   if (LOG.isInfoEnabled()) {
          -          LOG.info("The parsing plugins: " + extList +
          +          StringBuffer extensionsIDs = new StringBuffer("[");
          +          boolean isFirst = true;
          +          for (Extension ext : extList) {
          +            if (!isFirst) extensionsIDs.append(" - ");
          +            else isFirst=false;
          +            extensionsIDs.append(ext.getId());
          +          }
          +          extensionsIDs.append("]");
          +          LOG.info("The parsing plugins: " + extensionsIDs.toString() +
                         " are enabled via the plugin.includes system " +
                         "property, and all claim to support the content type " +
                         contentType + ", but they are not mapped to it in the " +
          @@ -369,7 +381,7 @@

             private boolean match(Extension extension, String id, String type) {
               return ((id.equals(extension.getId())) &&
          -        (type.equals(extension.getAttribute("contentType")) ||
          +        (type.equals(extension.getAttribute("contentType")) || extension.getAttribute("contentType").equals("*") ||
                   type.equals(DEFAULT_PLUGIN)));
             }
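
          The ordering rule in the first hunk above can be tried outside Nutch with a small standalone sketch. This is only an illustration under stated assumptions: ParserExtension is a hypothetical stand-in for Nutch's Extension class, the plugin ids are made up for the example, and discover() mirrors only the loop that builds extList when no explicit mapping exists in parse-plugins.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParserOrderingSketch {

    /** Hypothetical stand-in for a plugin extension: an id plus its declared contentType. */
    static final class ParserExtension {
        final String id;
        final String contentType;

        ParserExtension(String id, String contentType) {
            this.id = id;
            this.contentType = contentType;
        }
    }

    /**
     * Lists the ids of the parsers that claim to support contentType.
     * Exact matches are appended in discovery order; a wildcard ("*") parser
     * is inserted at the head of the list so that it is tried first.
     */
    static List<String> discover(List<ParserExtension> extensions, String contentType) {
        List<String> ids = new ArrayList<>();
        for (ParserExtension ext : extensions) {
            if (contentType.equals(ext.contentType)) {
                ids.add(ext.id);
            } else if ("*".equals(ext.contentType)) {
                // default plugins get the priority
                ids.add(0, ext.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Assumed plugin ids, for illustration only.
        List<ParserExtension> loaded = Arrays.asList(
            new ParserExtension("parse-html", "text/html"),
            new ParserExtension("parse-tika", "*"),
            new ParserExtension("parse-pdf", "application/pdf"));

        System.out.println(discover(loaded, "text/html")); // prints [parse-tika, parse-html]
    }
}
```

          With this ordering the wildcard parser wins for text/html even though parse-html was discovered first, which is the behaviour described above; an explicit mapping would still be needed to force parse-html.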
          Chris A. Mattmann added a comment -

          Hi Julien:

          @Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk?

          I tried this process last night:

          1. SVN up to r908832
          2. download patch v3
          3. download sample.tgz
          4. apply patch v3 to r908832
          5. untar sample.tgz into src/plugin/parse-tika, creating a sample folder in that dir
          6. ant clean compile-core test

          Any idea why I'm seeing the error?

          Cheers,
          Chris

          Julien Nioche added a comment -

          @Chris : did you run

          ant -f src/plugin/parse-tika/build-ivy.xml

          between steps 5 and 6? This is required in order to populate the lib directory automatically.

          Chris A. Mattmann added a comment -

          @Julien:

          Sigh, no I didn't!

          That's probably why! Thanks for the help. I'll try it later today. If that passes, my +1 to commit.

          @Sami, regarding your updates, would you be OK with me creating another issue to track them, attaching your diffs as patches against this issue, once committed to the trunk? That way we'll make sure they get into 1.1, but we won't block this issue anymore from getting in. Let me know what you think, thanks.

          Cheers,
          Chris

          Chris A. Mattmann added a comment -
          Committed in r909268. Also added the nutch-default.xml comments near the parse-tika plugin.includes enable block. Sami, I'll create a new issue now to track your proposed updates to the Tika parser. I ran the unit tests with the patch I committed, and they all passed.

          Thanks, Julien!

          Chris A. Mattmann added a comment -
          Forgot to add in the dep libs; added in r909269. Thanks!
          Hudson added a comment -

          Integrated in Nutch-trunk #1067 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1067/)

          • 2nd part of Tika parser
          • fix for Tika parser
          Julien Nioche added a comment -

          I have added a small improvement in revision 910187 (prioritise the default Tika parser when discovering plugins matching a mime-type).
          Thanks to Chris for testing and committing it, and to Andrzej and Sami for their comments and suggestions.

          Hudson added a comment -

          Integrated in Nutch-trunk #1071 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/)

          Diego Campo added a comment -

          Hello,
          I'm concerned about the efficiency of TikaParser.java.
          A couple of issues to comment on:

          1/ If I understood correctly, it seems that for each meta tag, it traverses the DOM object beginning from its root
          (I mean the calls to the DOMContentUtils methods).
          Wouldn't it be more efficient to traverse the DOM once and collect each of the required meta tags along the way?

          2/ I want to develop a custom parser. Is there an efficiency penalty in going back and forth between Tika and Nutch, as opposed to developing the
          parser directly as a Nutch-compliant one?
          Thanks a lot.
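
          The single-pass idea in point 1/ could look roughly like the sketch below. It uses only the JDK's org.w3c.dom API, not Nutch's DOMContentUtils; the class name MetaTagCollector and its methods are hypothetical, and whether this is actually faster than the current code in TikaParser.java has not been measured here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class MetaTagCollector {

    /** Walks the tree once and returns every <meta name="..." content="..."> pair found. */
    static Map<String, String> collectMeta(Node root) {
        Map<String, String> metas = new LinkedHashMap<>();
        walk(root, metas);
        return metas;
    }

    /** Depth-first traversal; each node is visited exactly once. */
    private static void walk(Node node, Map<String, String> metas) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            String name = meta.getAttribute("name"); // empty string when absent
            if (!name.isEmpty()) {
                metas.put(name.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, metas);
        }
    }

    /** Helper for the example only: parses a well-formed XHTML string into a DOM. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><head>"
            + "<meta name=\"robots\" content=\"noindex\"/>"
            + "<meta name=\"description\" content=\"demo\"/>"
            + "</head><body><p>text</p></body></html>");
        System.out.println(collectMeta(doc)); // prints {robots=noindex, description=demo}
    }
}
```

          Callers would then look up the tags they need in the returned map instead of starting a new traversal from the root for each one.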


            People

            • Assignee:
              Chris A. Mattmann
              Reporter:
              Julien Nioche
            • Votes:
              0
              Watchers:
              0
