NUTCH-874

Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: nutchgora
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels: None
    • Environment: Nutch 2.0

      Description

      I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin to make sure they all work with Gora/Nutchbase now.

      1. NUTCH-874.patch
        7 kB
        Lewis John McGibbney

          Activity

          kiran added a comment -

          The following plugins need to be ported for compatibility with 2.x:

          i) Feed
          ii) parse-swf
          iii) parse-ext
          iv) parse-zip
          v) parse-metatags (I wrote a patch for this earlier, NUTCH-1478)

          Hudson added a comment -

          Integrated in Nutch-nutchgora #375 (See https://builds.apache.org/job/Nutch-nutchgora/375/)
          NUTCH-874 Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora (part 1) (Revision 1396850)

          Result = SUCCESS
          lewismc :
          Files :

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
          • /nutch/branches/2.x/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
          • /nutch/branches/2.x/src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
          • /nutch/branches/2.x/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
          • /nutch/branches/2.x/src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
          • /nutch/branches/2.x/src/plugin/parse-swf/src/test/org/apache/nutch/parse/swf/TestSWFParser.java
          • /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
          • /nutch/branches/2.x/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java
          • /nutch/branches/2.x/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipTextExtractor.java
          • /nutch/branches/2.x/src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
          Lewis John McGibbney added a comment -

          Part 1 (e.g. removal of unused imports) committed at revision 1396850 in 2.x head.

          Lewis John McGibbney added a comment -

          trivial patch to remove unused classes brought to our attention by Kiran Chitturi. Thanks for this Kiran, your contributions are greatly appreciated.

          Lewis John McGibbney added a comment -

          Set and classify

          Lewis John McGibbney added a comment -

          I know the heat has kind of shifted away from Nutchgora, but it would be great to clarify what this issue actually encapsulates. Was/is it the case that some plugins in Nutchgora are not actually working with the Nutchgora API? I'm kind of confused by this one!

          Julien Nioche added a comment -

          "I think Jukka already worked on something really similar to the ExtParser in Tika. See: http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html"

          Yes, that's the one I had in mind.

          One of the plugins which hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, though there is a substantial difference: the Tika feed parser generates a simple XHTML representation of the document in which the feeds are simply represented as anchors, whereas the Nutch version created new documents for each feed.

          There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it?
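The difference between the two feed-handling approaches described above can be sketched in plain Java. This is only an illustration under stated assumptions: `FeedStrategiesSketch`, `Entry`, `asSingleXhtml` and `asSeparateDocuments` are hypothetical names, and nothing here is taken from the Nutch or Tika code bases.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these types come from Nutch or Tika. */
final class FeedStrategiesSketch {

    /** A single feed entry: a link plus its title. */
    static final class Entry {
        final String url;
        final String title;

        Entry(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** First approach: one XHTML document, each entry rendered as an anchor. */
    static String asSingleXhtml(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (Entry e : entries) {
            sb.append("<a href=\"").append(escape(e.url)).append("\">")
              .append(escape(e.title)).append("</a>");
        }
        return sb.append("</body></html>").toString();
    }

    /** Second approach: one separate document per entry, keyed by the entry URL. */
    static Map<String, String> asSeparateDocuments(List<Entry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.put(e.url, e.title);
        }
        return docs;
    }

    /** Minimal escaping so titles and URLs cannot break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("http://example.org/a", "First post"),
                new Entry("http://example.org/b", "Second post"));
        System.out.println(asSingleXhtml(entries));
        System.out.println(asSeparateDocuments(entries).keySet());
    }
}
```

With the first approach a crawler sees a single document with outlinks; with the second, every entry becomes its own indexable document, which is why the two are not drop-in replacements for each other.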

          Chris A. Mattmann added a comment -

          Hey Julien,

          I think Jukka already worked on something really similar to the ExtParser in Tika. See: http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html

          If we go that route here in Nutch, then I think we should add an encoding attribute similar to NUTCH-564 and flow it through in parse-tika then. If we can do that, I think we're good!

          Cheers,
          Chris
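As a rough sketch of what flowing an encoding attribute through could involve: the code below assumes the attribute names the character set of the external command's output, which the thread does not spell out. `ExternalOutputDecoder`, `resolve` and `decode` are hypothetical names, and the UTF-8 fallback is an illustrative choice, not behaviour confirmed for parse-ext or parse-tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of Nutch or Tika. */
final class ExternalOutputDecoder {

    /**
     * Resolves the configured encoding attribute to a Charset.
     * Falls back to UTF-8 (an assumption) when the attribute is
     * missing, blank, malformed, or names an unsupported charset.
     */
    static Charset resolve(String encodingAttr) {
        if (encodingAttr == null || encodingAttr.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(encodingAttr.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the raw bytes written by the external command. */
    static String decode(byte[] raw, String encodingAttr) {
        return new String(raw, resolve(encodingAttr));
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};
        System.out.println(decode(latin1, "ISO-8859-1"));
    }
}
```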

          Julien Nioche added a comment -

          Some plugins have not been ported to the new API as it does not provide multi-valued parse results. See http://search.lucidimagination.com/search/document/844c48289f2d07db/nutchbase_multi_value_parseresult_missing#4ed6f352ebcce8ef

          This is probably not the case for the ExtParser though. We could rely on Tika's mechanism for external parsing instead of maintaining ours. WDYT?
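To make the multi-valued limitation concrete, here is a minimal sketch of the two contract shapes. All types (`ParseResultSketch`, `Parse`, `SingleParser`, `MultiParser`) are hypothetical stand-ins for illustration and are not the actual Nutch 1.x or 2.x interfaces; the toy "container" format simply treats each non-empty line as a sub-document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical types for illustration; these are NOT the Nutch interfaces. */
final class ParseResultSketch {

    /** Stand-in for a parsed document: just its extracted text. */
    static final class Parse {
        final String text;

        Parse(String text) {
            this.text = text;
        }
    }

    /** Single-valued contract: exactly one Parse per fetched URL. */
    interface SingleParser {
        Parse parse(String url, String content);
    }

    /** Multi-valued contract: one fetched URL may yield several sub-documents. */
    interface MultiParser {
        Map<String, Parse> parse(String url, String content);
    }

    /** Toy container parser: each non-empty line becomes its own sub-document. */
    static final MultiParser LINE_SPLITTER = (url, content) -> {
        Map<String, Parse> out = new LinkedHashMap<>();
        int i = 0;
        for (String line : content.split("\n")) {
            if (!line.trim().isEmpty()) {
                out.put(url + "#" + (i++), new Parse(line.trim()));
            }
        }
        return out;
    };

    /** Adapting to the single-valued contract forces the sub-documents to be merged. */
    static SingleParser flatten(MultiParser multi) {
        return (url, content) -> {
            StringBuilder sb = new StringBuilder();
            for (Parse p : multi.parse(url, content).values()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(p.text);
            }
            return new Parse(sb.toString());
        };
    }

    public static void main(String[] args) {
        String content = "first\n\nsecond\n";
        System.out.println(LINE_SPLITTER.parse("http://example.org/feed", content).keySet());
        System.out.println(flatten(LINE_SPLITTER).parse("http://example.org/feed", content).text);
    }
}
```

A parser that naturally emits several documents (a feed or an archive, say) fits the second shape; squeezing it into the first loses the per-document boundaries, which is the kind of porting obstacle the comment above points at.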


            People

            • Assignee: Chris A. Mattmann
            • Reporter: Chris A. Mattmann
            • Votes: 0
            • Watchers: 3