Nutch / NUTCH-2261

ParseSegment job does not pass metadata for content-level redirects


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.11, 1.12, 1.13
    • Fix Version/s: 1.19
    • Component/s: metadata, parser
    • Labels: None

      Description

      When Fetcher runs in parsing mode, CrawlDatum metadata is properly passed to the new CrawlDatum created for a content-level redirect (HTML meta tag "Refresh"). If Fetcher runs in non-parsing mode and ParseSegment is run as a separate step, metadata other than "repr" is not passed to the new CrawlDatum.

      This means that any filter relying on metadata, such as DepthScoringFilter and URLMetaScoringFilter, will not work.
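
The difference between the two code paths can be modelled with plain maps. This is only a sketch of the reported behavior, not Nutch code: the class, the method names, and the "depth" key are hypothetical, and a Map<String, String> stands in for CrawlDatum metadata.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectMetadataSketch {

    // Hypothetical key names, chosen for illustration only.
    static final String REPR_KEY = "repr";
    static final String DEPTH_KEY = "depth";

    /** Models the reported ParseSegment path: only "repr" reaches the new datum. */
    static Map<String, String> reprOnly(Map<String, String> parentMeta, String reprUrl) {
        Map<String, String> redirectMeta = new HashMap<>();
        if (reprUrl != null) {
            redirectMeta.put(REPR_KEY, reprUrl);
        }
        return redirectMeta; // parentMeta is ignored, so "depth" is lost
    }

    /** Models the parsing-Fetcher path: parent metadata is carried over as well. */
    static Map<String, String> withParentMetadata(Map<String, String> parentMeta, String reprUrl) {
        Map<String, String> redirectMeta = new HashMap<>(parentMeta);
        if (reprUrl != null) {
            redirectMeta.put(REPR_KEY, reprUrl);
        }
        return redirectMeta;
    }

    public static void main(String[] args) {
        Map<String, String> parentMeta = new HashMap<>();
        parentMeta.put(DEPTH_KEY, "3");

        Map<String, String> lost = reprOnly(parentMeta, "http://example.com/");
        Map<String, String> kept = withParentMetadata(parentMeta, "http://example.com/");

        System.out.println("ParseSegment-like path has depth: " + lost.containsKey(DEPTH_KEY));
        System.out.println("Fetcher-like path has depth: " + kept.containsKey(DEPTH_KEY));
    }
}
```

In this model, a depth-based filter that reads "depth" from the redirect target finds nothing on the first path, which is the kind of failure described for DepthScoringFilter and URLMetaScoringFilter.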


            People

            • Assignee: Unassigned
            • Reporter: David Astle (dastle)
