NUTCH-566

Sun's URL class has bug in creation of relative query URLs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8, 0.8.1, 0.9.0
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      Both Mac OS X and Linux (CentOS 4.5)

    • Patch Info:
      Patch Available

      Description

      I'm using 0.8.1, but this will affect all other versions as well.

      Relative links of the form "?blah" are resolved incorrectly. For example, with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link of "?id_entrep=111", Nutch will resolve this pair to the link
      "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers I tried will resolve the pair to "http://www.fleurie.org/entreprise.asp?id_entrep=111".

      I tracked this down to what could be called a bug in Sun's URL class. According to Sun's spec, they parse the relative URL according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with "?". Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for compatibility and also because the behavior makes more sense). Apparently even the people that wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do.

      For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query

      Sun's URL implementation, however, still implements RFC2396, as far as I can tell, and is out of step with the rest of the world.
      This breaks link extraction on a number of sites.
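
      To check the behavior on a given runtime, a minimal reproduction might look like the sketch below. The class and method names are hypothetical, and no output is asserted because the result depends on the JDK in use. The report above observed "http://www.fleurie.org/?id_entrep=111".

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; the name does not come from the original report.
class RelativeQueryDemo {
    // Resolves a relative link with the JDK's two-argument URL constructor.
    static URL jdkResolve(String base, String relative) throws MalformedURLException {
        return new URL(new URL(base), relative);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL resolved = jdkResolve("http://www.fleurie.org/entreprise.asp", "?id_entrep=111");
        // Browsers resolve this pair to
        // http://www.fleurie.org/entreprise.asp?id_entrep=111;
        // compare that with whatever the runtime prints here.
        System.out.println(resolved);
    }
}
```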

      I implemented a simple workaround, which I'm attaching. It is a static method that creates URLs and otherwise behaves exactly like new URL(URL base, String relativePath). I use it as a drop-in replacement for that constructor in DOMContentUtils, Javascript link extraction, etc. Obviously, it only matters where links are extracted. I haven't included the calling code from DOMContentUtils, etc. because my local versions are largely rewritten, but it should be pretty obvious.
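
      For readers without the attachment, the idea can be sketched roughly as follows. This is not the attached patch: the class name RelativeUrlResolver and the method name resolve are assumptions made for illustration, and only the "?" case is handled. The sketch prepends the last segment of the base path to a query-only link before handing it to the JDK constructor.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical names; a sketch of the workaround idea, not the attached patch.
class RelativeUrlResolver {
    // Drop-in replacement for new URL(base, target) that keeps the base
    // document when the relative link consists only of a query ("?...").
    static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (target.startsWith("?")) {
            // Last path segment of the base, e.g. "entreprise.asp";
            // empty if the base path is empty or ends with '/'.
            String basePath = base.getPath();
            String lastSegment = basePath.substring(basePath.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.fleurie.org/entreprise.asp");
        // prints http://www.fleurie.org/entreprise.asp?id_entrep=111
        System.out.println(resolve(base, "?id_entrep=111"));
    }
}
```

      All other relative links fall through unchanged to the two-argument constructor, so callers should see a difference only for links beginning with "?".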

      I put it in the org.apache.nutch.net directory, but feel free to move it if you think it belongs elsewhere!

          Activity

          Sebastian Nagel added a comment -

          Linked to duplicate issues NUTCH-797 and NUTCH-952.

          Sebastian Nagel added a comment -

          This was fixed by NUTCH-797 in version 1.4 (2.x will be patched soon). The problematic example (http://www.fleurie.org/entreprise.asp?id_entrep=111) is included in the unit test (o.a.n.util.TestURLUtil).

          Andrzej Bialecki added a comment -

          I agree that this should be put into a utility class. We already have one in trunk, org.apache.nutch.util.URLUtil. Could any of you provide an updated patch, relative to the current trunk?

          Doug Cook added a comment -

          Hi Doğacan.

          Thanks for following up. The issue has gotten a little more complicated since I first posted it.

          I recently found, while looking for something else, that this is a dup of NUTCH-436. A solution for that has been committed.

          Both solutions are incomplete, however. The problem with the fix for NUTCH-436 is that it is applied locally to DOMContentUtils, but the bug affects Nutch globally anywhere links are extracted. In this sense my solution is superior because it creates a central utility function for the workaround, and calls that from all the places links are extracted (DOMContentUtils is the most common, but there are also SWFParser, JSParseFilter, and TextParser).

          The patch for NUTCH-436 may be more complete in that it handles more cases than my simple fix. My fix handles the only case I've seen in practice (a relative link beginning with '?') and has the advantage of being simple, but the patch for NUTCH-436 also handles things like relative links beginning with ';'. I haven't had a chance to analyze them both to see which is more 'correct' (probably NUTCH-436) and merge the two solutions, if necessary.

          I've been so swamped trying to get my product launched (it's a 1-person company!) that I haven't had time to follow up on all the contributions I'd like to make. Usually I get a fix working just well enough to fit my local needs and then I have to move on to tackling the next bug. At this point I've got hundreds of local fixes and improvements, some of which are probably useful for the Nutch community... hopefully after I launch I'll have time both to contribute some goodies back and also to pick up all the improvements that have been made to the trunk since I branched my code...

          -Doug

          Doğacan Güney added a comment -

          I am going to commit this one, but I am not sure what needs to be updated besides parse-html and parse-js. Any suggestions?

          Doug Cook added a comment -

          Here's a static method to work around the problem.


            People

            • Assignee:
              Unassigned
              Reporter:
              Doug Cook
            • Votes:
              1
              Watchers:
              2
