Nutch / NUTCH-779

Mechanism for passing metadata from parse to crawldb

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      The attached patch allows parse metadata to be passed to the corresponding entry of the crawldb.
      Comments are welcome.

      1. NUTCH-779-v2.patch (6 kB, Julien Nioche)
      2. NUTCH-779 (4 kB, Julien Nioche)

          Activity

          Andrzej Bialecki added a comment -

          You can already achieve this with ScoringFilters, although it requires using three methods instead ... I would also rename the status to "parse_meta", it's less cryptic this way. The property needs some documentation in nutch-default.xml plus a sensible default.

          Julien Nioche added a comment -

          > The property needs some documentation in nutch-default.xml plus a sensible default.

          Sure - just wanted the general approach to be checked before doing the tedious bits. Do you think it makes sense to do things the way I suggested or would you use the ScoringFilters instead?

          Andrzej Bialecki added a comment -

          Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user-friendly, especially for novice users.

          Julien Nioche added a comment -

          Improved version of the patch. Followed AB's recommendations and renamed STATUS_PARSE_META + added description for param 'db.parsemeta.to.crawldb' in nutch-default.xml + fixed issue with IndexerMapReduce
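
          The idea discussed above, copying selected parse metadata onto the matching crawldb entry, can be sketched roughly as follows. This is a minimal illustration only, not the code from the patch: the class and method names are hypothetical, plain java.util maps stand in for Nutch's metadata types, and treating the value of 'db.parsemeta.to.crawldb' as a comma-separated list of keys is an assumption that the thread does not confirm.

          ```java
          import java.util.LinkedHashMap;
          import java.util.Map;

          /** Hypothetical sketch; NOT the implementation from the NUTCH-779 patch. */
          class ParseMetaTransferSketch {

              /**
               * Copies the parse metadata entries named in keyList into the
               * metadata of a crawldb entry. keyList is ASSUMED to be a
               * comma-separated list of keys; keys missing from the parse
               * metadata are skipped.
               */
              static void transfer(String keyList,
                                   Map<String, String> parseMeta,
                                   Map<String, String> crawlDbMeta) {
                  if (keyList == null || parseMeta == null) {
                      return; // nothing configured or nothing parsed
                  }
                  for (String rawKey : keyList.split(",")) {
                      String key = rawKey.trim();
                      if (key.isEmpty()) {
                          continue; // tolerate empty values and stray commas
                      }
                      String value = parseMeta.get(key);
                      if (value != null) {
                          crawlDbMeta.put(key, value);
                      }
                  }
              }

              public static void main(String[] args) {
                  Map<String, String> parseMeta = new LinkedHashMap<>();
                  parseMeta.put("lang", "en");
                  parseMeta.put("author", "unknown");

                  Map<String, String> crawlDbMeta = new LinkedHashMap<>();
                  // Only "lang" is both requested and present in the parse metadata.
                  transfer("lang, category", parseMeta, crawlDbMeta);
                  System.out.println(crawlDbMeta); // prints {lang=en}
              }
          }
          ```

          An empty key list leaves the crawldb entry untouched, which would be one candidate for the "sensible default" requested earlier in the thread.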

          Julien Nioche added a comment -

          Could anyone please review this issue? I would like to commit it in time for the 1.1 release.

          Andrzej Bialecki added a comment -

          In CrawlDbReducer, the cramped line if (metaFromParse!=null){ needs some whitespace fixing.

          Other than that, +1.

          Julien Nioche added a comment -

          Committed revision 929038.

          Thanks Andrzej for your feedback

          Hudson added a comment -

          Integrated in Nutch-trunk #1112 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1112/)
          Mechanism for passing metadata from parse to crawldb


            People

            • Assignee: Julien Nioche
            • Reporter: Julien Nioche
            • Votes: 0
            • Watchers: 0
