Nutch / NUTCH-2261

ParseSegment job does not pass metadata for content-level redirects


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.11, 1.12, 1.13
    • Fix Version/s: 1.21
    • Component/s: metadata, parser
    • Labels: None

    Description

      When Fetcher runs in parsing mode, CrawlDatum metadata is properly passed to a new CrawlDatum for content-level redirects (HTML meta tag "Refresh"). If Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, then metadata other than "repr" is not passed to the new CrawlDatum.

      This means that any filter relying on metadata, such as DepthScoringFilter and URLMetaScoringFilter, will not work.
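
      The difference between the two code paths can be sketched as follows. This is a simplified, self-contained model and not Nutch code: the `Datum` class, the method names, and the `_depth_` and `_repr_` key strings are stand-ins chosen for illustration, and the real CrawlDatum metadata is a Hadoop MapWritable rather than a plain map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the behaviour described above; not the Nutch API.
public class RedirectMetadataSketch {

    // Illustrative key for the representative URL (assumed name).
    static final String REPR_KEY = "_repr_";

    // Stand-in for CrawlDatum: only the metadata map matters here.
    static final class Datum {
        final Map<String, String> metadata = new HashMap<>();
    }

    // Models the reported ParseSegment behaviour: the datum created for the
    // redirect target carries only the "repr" entry.
    static Datum reprOnly(Datum source, String reprUrl) {
        Datum redirect = new Datum();
        redirect.metadata.put(REPR_KEY, reprUrl);
        return redirect;
    }

    // Models the parsing-Fetcher behaviour: all metadata of the source datum
    // is copied before the "repr" entry is set.
    static Datum withMetadata(Datum source, String reprUrl) {
        Datum redirect = new Datum();
        redirect.metadata.putAll(source.metadata);
        redirect.metadata.put(REPR_KEY, reprUrl);
        return redirect;
    }

    public static void main(String[] args) {
        Datum fetched = new Datum();
        fetched.metadata.put("_depth_", "2"); // e.g. set by a depth-based scoring filter

        Datum lossy = reprOnly(fetched, "http://example.com/");
        Datum full = withMetadata(fetched, "http://example.com/");

        System.out.println(lossy.metadata.containsKey("_depth_")); // false
        System.out.println(full.metadata.containsKey("_depth_"));  // true
    }
}
```

      In this model, a filter that looks up `_depth_` on the redirect target finds it only in the second case, which corresponds to the symptom reported for DepthScoringFilter and URLMetaScoringFilter.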

            People

              Assignee: Unassigned
              Reporter: David Astle (dastle)