Nutch / NUTCH-2710

Normalize outlinks before checking for internal or external links

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.21
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      We have a normalizer that rewrites external URLs back to internal URLs. However, those URLs never reach the normalizer, because the internal and/or external host/domain checks in ParseOutputFormat.filterNormalize() run first and have already filtered them out.

      This patch proposes moving the normalizers above the checks for internal/external hosts/domains, so that those checks are applied to the normalized URL.
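
      The effect of the ordering can be illustrated with a minimal, self-contained sketch. It is not the actual Nutch code and not the attached patch: the class OutlinkOrderSketch, its methods, the example hosts, and the lambda standing in for a URL normalizer are all hypothetical, and the "internal link" test is simplified to a same-host comparison with external links assumed to be ignored.

```java
import java.net.URI;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative sketch only; names and logic are assumptions, not Nutch's API. */
final class OutlinkOrderSketch {

  /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
  static String host(String url) {
    try {
      String h = URI.create(url).getHost();
      return h == null ? null : h.toLowerCase(Locale.ROOT);
    } catch (IllegalArgumentException e) {
      return null;
    }
  }

  /** Simplified internal-link test: both URLs share the same host. */
  static boolean sameHost(String fromUrl, String toUrl) {
    String from = host(fromUrl);
    return from != null && from.equals(host(toUrl));
  }

  /** Order described in the issue: the host check runs before normalization,
   *  so an external outlink is dropped before the normalizer can rewrite it. */
  static String checkThenNormalize(String fromUrl, String toUrl,
                                   UnaryOperator<String> normalizer) {
    if (!sameHost(fromUrl, toUrl)) {
      return null; // rejected as external; the normalizer never sees it
    }
    return normalizer.apply(toUrl);
  }

  /** Proposed order: normalize first, then check the normalized URL. */
  static String normalizeThenCheck(String fromUrl, String toUrl,
                                   UnaryOperator<String> normalizer) {
    String normalized = normalizer.apply(toUrl);
    if (normalized == null || !sameHost(fromUrl, normalized)) {
      return null; // still external after normalization
    }
    return normalized;
  }

  public static void main(String[] args) {
    // Hypothetical normalizer: maps an external mirror host back to the internal host.
    UnaryOperator<String> toInternal =
        url -> url.replace("://cdn.example.org/", "://www.example.com/");
    String from = "https://www.example.com/index.html";
    String to = "https://cdn.example.org/docs/a.html";

    System.out.println(checkThenNormalize(from, to, toInternal)); // prints "null"
    System.out.println(normalizeThenCheck(from, to, toInternal)); // prints "https://www.example.com/docs/a.html"
  }
}
```

      With the check first, the outlink is discarded even though the normalizer would have made it internal; with normalization first, the rewritten URL passes the same check.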

      Attachments

        1. NUTCH-2710.patch
          1 kB
          Markus Jelsma

          People

            Assignee: markus17 Markus Jelsma
            Reporter: markus17 Markus Jelsma
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: