Nutch / NUTCH-1339

Default URL normalization rules to remove page anchors completely

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora, 1.6
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The default rules of URLNormalizerRegex remove the anchor only up to the first
      occurrence of ? or &. The remainder of the anchor is kept,
      which may produce a large, possibly infinite number of outlinks when the same document
      is fetched again and again under different URLs;
      see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
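
      The behavior described above can be illustrated with a minimal Java sketch.
      The two patterns are assumptions chosen to reproduce the description (partial
      removal up to the first ? or &, versus complete removal); they are not copied
      from the shipped rule file or from the attached patches. The class name, method
      names and the example URL are hypothetical.

```java
import java.util.regex.Pattern;

public class AnchorNormalizationSketch {

    // Assumed stand-in for the current default rule: the anchor is dropped
    // only up to the first '?' or '&', and that delimiter is kept.
    private static final Pattern PARTIAL = Pattern.compile("#.*?(\\?|&|$)");

    // Assumed stricter rule: everything from '#' to the end of the URL is dropped.
    private static final Pattern COMPLETE = Pattern.compile("#.*$");

    public static String stripAnchorPartially(String url) {
        return PARTIAL.matcher(url).replaceAll("$1");
    }

    public static String stripAnchorCompletely(String url) {
        return COMPLETE.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical URL whose anchor carries parameters, as on AJAX sites.
        String url = "http://example.com/page.html#section?sort=asc&page=2";

        // prints http://example.com/page.html?sort=asc&page=2
        System.out.println(stripAnchorPartially(url));

        // prints http://example.com/page.html
        System.out.println(stripAnchorCompletely(url));
    }
}
```

      With the partial rule, every distinct parameter string inside the anchor yields
      a distinct normalized URL for the same document; with the complete rule they all
      collapse to one URL.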

      Parameters in inner-page anchors are a common practice in AJAX web sites.
      Currently, crawling AJAX content is not supported (NUTCH-1323).

      1. NUTCH-1339.patch
        0.4 kB
        Sebastian Nagel
      2. NUTCH-1339-2.patch
        0.5 kB
        Sebastian Nagel

        Activity

        Julien Nioche made changes -
        Fix Version/s: 1.9 [ 12324611 ] -> 1.10 [ 12327187 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 1.8 [ 12324326 ] -> 1.9 [ 12324611 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 1.7 [ 12323281 ] -> 1.8 [ 12324326 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 1.6 [ 12319941 ] -> 1.7 [ 12323281 ]
        Markus Jelsma made changes -
        Fix Version/s 1.6 [ 12319941 ]
        Sebastian Nagel added a comment -

        BasicURLNormalizer does not remove the anchor for https URLs (NUTCH-1344).
        At least, in my case this was the real reason for the large number of bad URLs.

        The only reason not to remove the anchor completely is the rare case in which the anchor and the query parameters are accidentally swapped.

        Markus Jelsma added a comment -

        The anchor is still removed by the BasicURLNormalizer. We worked around the problem for the AJAXNormalizer by simply changing the normalizer order: first the AJAXNormalizer, then everything else. When indexing, however, run the BasicNormalizer first (if enabled) and only then the AJAXNormalizer.
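
        Why the order matters can be shown with a small Java sketch. Both normalizers
        here are hypothetical stand-ins, not the Nutch classes: the thread does not say
        what the AJAXNormalizer does, so the sketch assumes it rewrites a "#!" fragment
        into an _escaped_fragment_ query parameter (simplified, without any escaping),
        and it assumes the basic step simply drops everything from '#' onwards.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class NormalizerOrderSketch {

    // Hypothetical AJAX step: move a "#!" fragment into a query parameter.
    public static String ajaxNormalize(String url) {
        int i = url.indexOf("#!");
        if (i < 0) {
            return url;
        }
        String base = url.substring(0, i);
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + url.substring(i + 2);
    }

    // Hypothetical basic step: drop the anchor completely.
    public static String stripAnchor(String url) {
        int i = url.indexOf('#');
        return i < 0 ? url : url.substring(0, i);
    }

    // Apply the normalizers one after another, in list order.
    public static String applyChain(List<UnaryOperator<String>> chain, String url) {
        String result = url;
        for (UnaryOperator<String> normalizer : chain) {
            result = normalizer.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> ajaxFirst =
            Arrays.asList(NormalizerOrderSketch::ajaxNormalize, NormalizerOrderSketch::stripAnchor);
        List<UnaryOperator<String>> anchorFirst =
            Arrays.asList(NormalizerOrderSketch::stripAnchor, NormalizerOrderSketch::ajaxNormalize);

        String url = "http://example.com/app#!view=list"; // hypothetical URL

        // prints http://example.com/app?_escaped_fragment_=view=list
        System.out.println(applyChain(ajaxFirst, url));

        // prints http://example.com/app
        System.out.println(applyChain(anchorFirst, url));
    }
}
```

        If the anchor is stripped first, the AJAX state is gone before the second step
        can see it, which is why the workaround puts the AJAX step first while crawling.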

        Sebastian Nagel made changes -
        Attachment NUTCH-1339-2.patch [ 12523020 ]
        Sebastian Nagel added a comment -

        Now the correct patch.

        Sebastian Nagel made changes -
        Field Original Value New Value
        Attachment NUTCH-1339.patch [ 12523019 ]
        Sebastian Nagel created issue -

          People

          • Assignee:
            Unassigned
          • Reporter:
            Sebastian Nagel
          • Votes:
            0
          • Watchers:
            0
