Nutch / NUTCH-1339

Default URL normalization rules to remove page anchors completely

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora, 1.6
    • Fix Version/s: 1.11
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The default rules of URLNormalizerRegex remove the anchor only up to the first
      occurrence of ? or &. The remaining part of the anchor is kept,
      which may cause a large, possibly infinite number of outlinks when the same document
      is fetched again and again under different URLs;
      see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
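
      The effect described above can be illustrated with a small sketch. It is
      not the Nutch implementation: the two patterns are assumptions that only
      approximate the behaviour described in this issue (strip the anchor up to
      the first ? or &, versus strip it completely), and the function names and
      example.com URLs are hypothetical.

```python
import re

# Assumed approximation of the current default rule: remove everything from
# '#' up to the first '?' or '&' (or the end of the URL), but keep that
# delimiter and whatever follows it.
CURRENT_RULE = re.compile(r"#.*?(\?|&|$)")

# Assumed approximation of the proposed rule: remove the anchor completely.
PROPOSED_RULE = re.compile(r"#.*$")


def normalize_current(url: str) -> str:
    """Apply the partial anchor removal described in the issue."""
    return CURRENT_RULE.sub(r"\1", url)


def normalize_proposed(url: str) -> str:
    """Drop the whole anchor, including any parameters inside it."""
    return PROPOSED_RULE.sub("", url)


if __name__ == "__main__":
    urls = [
        "http://example.com/page.html#top",
        "http://example.com/page.html#view?sort=1",
        "http://example.com/page.html#view?sort=2",
    ]
    for url in urls:
        print(url)
        print("  current :", normalize_current(url))
        print("  proposed:", normalize_proposed(url))
```

      Under these assumptions, the partial rule leaves a distinct URL for each
      parameter value inside the anchor, while the complete removal maps all
      three inputs to the same document URL.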

      Parameters in inner-page anchors are a common practice in AJAX web sites.
      Currently, crawling AJAX content is not supported (NUTCH-1323).

      1. NUTCH-1339-2.patch
        0.5 kB
        Sebastian Nagel
      2. NUTCH-1339.patch
        0.4 kB
        Sebastian Nagel


          People

          • Assignee:
            Unassigned
            Reporter:
            Sebastian Nagel
