NUTCH-1344: BasicURLNormalizer to normalize https same as http

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora, 1.6
    • Fix Version/s: 1.6, 2.2
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Most of the normalization done by BasicURLNormalizer (lowercasing the host, removing the default port, removing page anchors, cleaning "." and ".." segments in the path) is not applied to URLs with the https protocol.
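
      The normalization steps listed above can be sketched as follows. This is a minimal illustration built on java.net.URI, not the code from the attached patch: the class name SchemeAwareNormalizerSketch, the DEFAULT_PORTS table, and the helper cleanPath are hypothetical names chosen for this example. It treats http, https and ftp identically and passes every other scheme through unchanged. As a simplification, user info in the authority is dropped.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the actual Nutch BasicURLNormalizer.
class SchemeAwareNormalizerSketch {

    // Schemes that receive the same normalization, with their default ports
    // (assumed table for this sketch).
    private static final Map<String, Integer> DEFAULT_PORTS =
        Map.of("http", 80, "https", 443, "ftp", 21);

    // Throws IllegalArgumentException (from URI.create) on syntactically invalid input.
    static String normalize(String urlString) {
        String trimmed = urlString.trim();
        URI uri = URI.create(trimmed);
        String scheme = uri.getScheme() == null
            ? null : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost();

        // Unknown scheme or no parsable host: leave the URL untouched.
        if (scheme == null || host == null || !DEFAULT_PORTS.containsKey(scheme)) {
            return trimmed;
        }

        StringBuilder out = new StringBuilder(scheme).append("://")
            .append(host.toLowerCase(Locale.ROOT));      // lowercase the host

        int port = uri.getPort();
        if (port != -1 && port != DEFAULT_PORTS.get(scheme)) {
            out.append(':').append(port);                 // keep only non-default ports
        }

        out.append(cleanPath(uri.getRawPath()));          // resolve "." and ".."
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // The fragment (page anchor) is intentionally not appended.
        return out.toString();
    }

    // Removes "." segments and resolves ".." segments in an absolute path.
    static String cleanPath(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        String[] segments = path.split("/", -1);
        Deque<String> kept = new ArrayDeque<>();
        for (int i = 1; i < segments.length; i++) {       // index 0 is the empty lead
            String segment = segments[i];
            boolean last = (i == segments.length - 1);
            if (segment.equals(".")) {
                if (last) kept.addLast("");               // preserve trailing slash
            } else if (segment.equals("..")) {
                kept.pollLast();                          // no-op when already at root
                if (last) kept.addLast("");
            } else {
                kept.addLast(segment);
            }
        }
        return "/" + String.join("/", kept);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a/c.html
        System.out.println(normalize("HTTP://WWW.Example.COM:80/a/./b/../c.html#top"));
        // prints https://www.example.com/a/c.html
        System.out.println(normalize("HTTPS://WWW.Example.COM:443/a/./b/../c.html#top"));
    }
}
```

      Keeping the set of normalized schemes in a single table means that adding a scheme cannot accidentally skip one of the steps, which is the kind of divergence this issue describes.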

      1. NUTCH-1344.patch
        0.8 kB
        Sebastian Nagel

        Activity

        Hudson added a comment -

        Integrated in Nutch-nutchgora #387 (See https://builds.apache.org/job/Nutch-nutchgora/387/)
        NUTCH-1344 BasicURLNormalizer to normalize https same as http - forgot to add committer (Revision 1401458)

        Result = FAILURE
        snagel :
        Files :

        • /nutch/branches/2.x/CHANGES.txt
        Hudson added a comment -

        Integrated in Nutch-nutchgora #375 (See https://builds.apache.org/job/Nutch-nutchgora/375/)
        NUTCH-1344 BasicURLNormalizer to normalize https same as http (Revision 1396800)

        Result = SUCCESS
        snagel :
        Files :

        • /nutch/branches/2.x/CHANGES.txt
        • /nutch/branches/2.x/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
        Hudson added a comment -

        Integrated in nutch-trunk-maven #449 (See https://builds.apache.org/job/nutch-trunk-maven/449/)
        NUTCH-1344 BasicURLNormalizer to normalize https same as http (Revision 1396801)

        Result = SUCCESS
        snagel :
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
        Sebastian Nagel added a comment -

        Committed to trunk (revision 1396801) and 2.x (revision 1396800).

        Julien Nioche added a comment -

        Good catch, Sebastian. Please commit to both trunk and 2.x.

        Markus Jelsma added a comment -

        I wouldn't know why. I think they should be treated equally.

        Sebastian Nagel added a comment -

        Is there any reason why https should be treated differently from http (and ftp)?


          People

          • Assignee:
            Unassigned
            Reporter:
            Sebastian Nagel
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:
