Nutch / NUTCH-797

parse-tika is not properly constructing URLs when the target begins with a "?"

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1, nutchgora
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
      Also reproduces on RHEL with Java 1.4.2

    • Patch Info:
      Patch Available

      Description

      This is my first bug and patch on nutch, so apologies if I have not provided enough detail.

      In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this:

      <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>

      In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link and constructs a new URL from a base URL object built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a target of "?co=0&sk=0&p=2&pi=1".

      The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1

      because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining.

      While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part.

      I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes:
      Search.aspx?co=0&sk=0&p=2&pi=1

      The URL class then properly constructs the new url as:
      http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
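The workaround described above can be sketched in isolation. This is only an illustration, not the attached patch: the class and method names are hypothetical, and it uses java.net.URI rather than java.net.URL so that no checked exceptions are involved. Once the target has been rewritten to "Search.aspx?...", the last step is ordinary relative-path resolution.

```java
import java.net.URI;

// Hypothetical sketch of the "pure query target" workaround; not Nutch code.
final class PureQueryTargetSketch {

    // If the target is a pure query ("?..."), prepend the rightmost
    // segment of the base path so that the segment survives resolution.
    static String fixPureQueryTarget(String basePath, String target) {
        if (!target.startsWith("?")) {
            return target; // not a pure query, leave untouched
        }
        int lastSlash = basePath.lastIndexOf('/');
        String rightMost = lastSlash >= 0 ? basePath.substring(lastSlash + 1) : "";
        return rightMost + target;
    }

    // Resolves target against base after applying the workaround.
    static String resolve(String base, String target) {
        URI baseUri = URI.create(base);
        String path = baseUri.getRawPath() == null ? "" : baseUri.getRawPath();
        return baseUri.resolve(fixPureQueryTarget(path, target)).toString();
    }

    public static void main(String[] args) {
        // Base URL and target taken from the report above.
        System.out.println(resolve(
            "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0",
            "?co=0&sk=0&p=2&pi=1"));
    }
}
```

When the base path already ends with "/", the rightmost segment is empty and the target is passed through unchanged, so the sketch only alters the case described in this report.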

      If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way.

      Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects.

      Much thanks

      Here is the patch info:
      Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
      ===================================================================
      --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
      +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
      @@ -299,6 +299,50 @@
           return false;
         }
       
      +  private URL fixURL(URL base, String target) throws MalformedURLException
      +  {
      +    // handle params that are embedded into the base url - move them to target
      +    // so URL class constructs the new url class properly
      +    if (base.toString().indexOf(';') > 0)
      +      return fixEmbeddedParams(base, target);
      +
      +    // handle the case that there is a target that is a pure query.
      +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
      +    // URLs but I've seen this in numerous places, for example at
      +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
      +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
      +    // URL constructs the base+target combo as
      +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
      +    // dropping the Search.aspx target
      +    //
      +    // Browsers handle these just fine, they must have an exception similar to this
      +    if (target.startsWith("?"))
      +    {
      +      return fixPureQueryTargets(base, target);
      +    }
      +
      +    return new URL(base, target);
      +  }
      +
      +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
      +  {
      +    if (!target.startsWith("?"))
      +      return new URL(base, target);
      +
      +    String basePath = base.getPath();
      +    String baseRightMost="";
      +    int baseRightMostIdx = basePath.lastIndexOf("/");
      +    if (baseRightMostIdx != -1)
      +    {
      +      baseRightMost = basePath.substring(baseRightMostIdx+1);
      +    }
      +
      +    if (target.startsWith("?"))
      +      target = baseRightMost+target;
      +
      +    return new URL(base, target);
      +  }
      +
         /**
          * Handles cases where the url param information is encoded into the base
          * url as opposed to the target.
      @@ -400,8 +444,7 @@
         if (target != null && !noFollow && !post)
           try {
      -      URL url = (base.toString().indexOf(';') > 0) ?
      -        fixEmbeddedParams(base, target) : new URL(base, target);
      +      URL url = fixURL(base, target);
             outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
           }
           catch (MalformedURLException e) {

      1. NUTCH-797-2x.patch
        8 kB
        Sebastian Nagel
      2. test_nutch_797.html
        0.2 kB
        Sebastian Nagel
      3. NUTCH-797.patch
        9 kB
        Andrzej Bialecki
      4. pureQueryUrl-2.patch
        12 kB
        Andrzej Bialecki
      5. pureQueryUrl.patch
        3 kB
        Robert Hohman

          Activity

          Andrzej Bialecki added a comment -

          Thanks for reporting this, and providing a patch. An updated revision of the standard, RFC3986 section 5.4.1 example 7 follows the same reasoning. I'll fix this shortly.

          Andrzej Bialecki added a comment -

          Hm, actually the picture is more complicated than I thought - if we apply both methods (fixEmbeddedParams and fixPureQueryTargets) then some of the test cases from the RFC fail. However, all tests succeed if we only apply fixPureQueryTargets!

          Looking at the origin of the fixEmbeddedParams method (NUTCH-436), something must have been fixed in java.net.URL, because the test case mentioned in that issue now passes if we apply only fixPureQueryTargets. The same holds for the test cases in a near-duplicate issue, NUTCH-566.

          Consequently I'm going to remove fixEmbeddedParams. I added all tests from RFC3986 section 5.4.1, and they all pass now. I'll attach an updated patch shortly.
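For reference, the resolution rules that those RFC 3986 section 5.4.1 examples exercise can be sketched compactly. The following is a minimal illustration of the "transform references" algorithm (RFC 3986 section 5.2.2) together with remove_dot_segments (section 5.2.4), using the component regex from Appendix B. It is not the URLUtil code from the patch, and the class and method names are hypothetical. In the RFC's examples the base is "http://a/b/c/d;p?q", and the reference "?y" resolves to "http://a/b/c/d;p?y", i.e. the last path segment is kept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of RFC 3986 reference resolution; not Nutch code.
final class Rfc3986ResolveSketch {

    // Component-splitting regex from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern PARTS = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    static String resolve(String base, String ref) {
        Matcher b = PARTS.matcher(base);
        Matcher r = PARTS.matcher(ref);
        if (!b.matches() || !r.matches()) {
            throw new IllegalArgumentException("unparseable input");
        }
        String scheme = r.group(2), authority = r.group(4);
        String path = r.group(5), query = r.group(7), fragment = r.group(9);

        if (scheme != null || authority != null) {
            path = removeDotSegments(path);
        } else if (path.isEmpty()) {
            // Empty reference path: keep the whole base path, and the base
            // query too unless the reference brings its own ("?y" case).
            path = b.group(5);
            if (query == null) query = b.group(7);
        } else if (path.startsWith("/")) {
            path = removeDotSegments(path);
        } else {
            path = removeDotSegments(merge(b.group(4), b.group(5), path));
        }
        if (scheme == null) {
            if (authority == null) authority = b.group(4);
            scheme = b.group(2);
        }
        StringBuilder out = new StringBuilder();
        if (scheme != null) out.append(scheme).append(':');
        if (authority != null) out.append("//").append(authority);
        out.append(path);
        if (query != null) out.append('?').append(query);
        if (fragment != null) out.append('#').append(fragment);
        return out.toString();
    }

    // RFC 3986 section 5.2.3: replace the last base segment with the reference path.
    static String merge(String baseAuthority, String basePath, String refPath) {
        if (baseAuthority != null && basePath.isEmpty()) return "/" + refPath;
        return basePath.substring(0, basePath.lastIndexOf('/') + 1) + refPath;
    }

    // RFC 3986 section 5.2.4.
    static String removeDotSegments(String path) {
        StringBuilder out = new StringBuilder();
        String in = path;
        while (!in.isEmpty()) {
            if (in.startsWith("../")) in = in.substring(3);
            else if (in.startsWith("./")) in = in.substring(2);
            else if (in.startsWith("/./")) in = in.substring(2);
            else if (in.equals("/.")) in = "/";
            else if (in.startsWith("/../")) { in = in.substring(3); dropLastSegment(out); }
            else if (in.equals("/..")) { in = "/"; dropLastSegment(out); }
            else if (in.equals(".") || in.equals("..")) in = "";
            else {
                // Move the first segment (with its leading "/", if any) to the output.
                int next = in.indexOf('/', 1);
                if (next < 0) next = in.length();
                out.append(in, 0, next);
                in = in.substring(next);
            }
        }
        return out.toString();
    }

    private static void dropLastSegment(StringBuilder out) {
        int i = out.lastIndexOf("/");
        out.setLength(i < 0 ? 0 : i);
    }

    public static void main(String[] args) {
        String base = "http://a/b/c/d;p?q"; // base URI used in RFC 3986 section 5.4
        for (String ref : new String[] {"?y", "g", "../g", ";x", ""}) {
            System.out.println("\"" + ref + "\" -> " + resolve(base, ref));
        }
    }
}
```

Under these rules a pure-query reference never goes through the merge step, which is why no special case for a leading "?" is needed.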

          Andrzej Bialecki added a comment -

          Updated patch with some refactoring and unit tests. If no objections I'll commit this shortly.

          Ken Krugler added a comment -

          I thought this same issue (relative URL with leading '?') had been fixed in Tika. Or at least I reported it, and I thought Jukka rolled in code that would handle it. See TIKA-287, and the comment about "Note that special care must be taken to work around a known bug in the Java URL() class, when the relative URL is a query string and the base URL doesn't end with a '/'."

          Or is this the case of Nutch needing to implement similar link extraction support?

          Andrzej Bialecki added a comment -

          Unfortunately the way your fix was applied there is not reusable (private method in HtmlParser... ugh). So for the time being I think we'll go with our utility class ... which we should really move to crawler-commons anyway!

          Ken Krugler added a comment -

          Agreed re crawler-commons...feels like there's a beefy chunk of URL handling code that should go there.

          Robert Hohman added a comment -

          Makes sense, thanks for looking at this, guys.

          Jukka Zitting added a comment -

          Wouldn't it be easier for Nutch to pass the base URL as the CONTENT_LOCATION metadata to the Tika parser? Then Tika would automatically apply these fixes, as discussed in TIKA-287.

          Andrzej Bialecki added a comment -

          A few issues with this:

          • does this mean that the fixes would be applied to links found in other content types as well, not just html (the fixup code in TIKA-287 is located in HtmlParser)?
          • we need this also in other places, e.g. in the redirection handling code (both meta-refresh, javascript location.href and protocol-level redirect)
          • for a while we still need this in the parse-html plugin that does not use Tika.
          Jukka Zitting added a comment -

          I guess we need to apply the same logic also to other Tika parsers that may deal with relative URLs.

          Since we in any case need this functionality in Tika, would it be useful for Nutch if it was made available as a public utility class or method in tika-core? It would be great if we could avoid duplicating the code in different projects.

          Andrzej Bialecki added a comment -

          That's one option, at least until the crawler-commons produces any artifacts ... Eventually I think that this code and other related code (e.g. deciding which URL is canonical in presence of redirects, url normalization and filtering) should end up in the crawler-commons.

          Andrzej Bialecki added a comment -

          If there are no further comments I'm going to commit the current patch with a TODO to revisit this code if/when it's refactored to an external dependency.

          Markus Jelsma added a comment -

          Back on radar: has this ever been committed at all?

          Robert Hohman added a comment -

          Hi Markus - I am not sure if the committers committed it. I thought they were going to.

          We have moved off of Nutch, so I am a little out of touch with what the latest is.

          If you have any other questions, let me know.

          Markus Jelsma added a comment -

          We'll look into it. Thanks for reporting.

          Lewis John McGibbney added a comment -

          Hmm, I wonder what the scenario with this is? Andrzej (or any other Tika committers who might be watching), can you comment on whether this has been fixed in Tika 0.10? If this is the case, and we upgrade to Tika 0.10 as per NUTCH-1154, I think this issue can be closed and we will be one step closer to getting 1.4 out the door. Alternatively, if this is not the case, can someone similarly comment on how we can take this forward? It seems that most of the hard work has already been done!

          Andrzej Bialecki added a comment -

          The fixup code in Tika is still a private method in HtmlParser, so in this case the upgrade to Tika 0.10 won't help, we still have to apply the above patch.

          I'll commit this shortly.

          Andrzej Bialecki added a comment -

          Committed in rev. 1181747 to trunk. Nutchgora needs more work, so I'm leaving this open.

          Markus Jelsma added a comment -

          Andrzej, it looks like the fix for NUTCH-1115 is gone since this commit.

          Andrzej Bialecki added a comment -

          Uhh, sorry - I'll fix this in a moment.

          Andrzej Bialecki added a comment -

          I'm puzzled by the algorithm in fixEmbeddedParams (which was refactored into URLUtil), and I don't understand how it was ever supposed to work. If I enable this method then most of the test URLs in TestURLUtil fail, because they are not resolved according to the RFC.

          In your example in NUTCH-1115, what was the expected result of resolving the base url "http://www.funkybabes.nl/;ROOOWAN/fotoboek" and e.g. a target of "forumregels"?

          • http://www.funkybabes.nl/forumregels
          • http://www.funkybabes.nl/;ROOOWAN/forumregels
          • http://www.funkybabes.nl/forumregels;ROOOWAN
          • none of the above

          Markus Jelsma added a comment -

          I would expect http://www.funkybabes.nl/forumregels but IIRC you would get http://www.funkybabes.nl/;ROOOWAN/forumregels. Such a structure would get out of hand quickly. That's why I included the fix to disable the fixEmbeddedParams method. NUTCH-436 is the issue in which fixEmbeddedParams was introduced. I haven't seen real-life cases where this is used at all.

          Andrzej Bialecki added a comment -

          Well, I would expect http://www.funkybabes.nl/forumregels;ROOOWAN ... i.e. the embedded params from the base url would be transferred to the resolved url. FWIW, the RFC defines that a slash "/" is a valid sub-separator in the params sections, so it could be even argued that the value of the param in this case is "ROOOWAN/fotoboek".

          How about modifying the meaning of this option, so that it simply removes any embedded params from the base and discards them completely? This would satisfy the requirements of NUTCH-1115 and at least the use case would be clear - this option is for cleaning unlikely and probably buggy URLs.
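The "remove and discard" semantics proposed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the committed code: the class and method names are hypothetical, the params are taken to be everything from the first ';' up to the query or fragment (so "ROOOWAN/fotoboek" counts as one param value, as argued above), and java.net.URI is used for the final resolution step.

```java
import java.net.URI;

// Hypothetical sketch of "removeEmbeddedParams"; not the Nutch implementation.
final class RemoveEmbeddedParamsSketch {

    // Drops everything from the first ';' up to (but not including)
    // the query or fragment, discarding the embedded params completely.
    static String removeEmbeddedParams(String base) {
        int pathEnd = base.length();
        for (int i = 0; i < base.length(); i++) {
            char c = base.charAt(i);
            if (c == '?' || c == '#') {
                pathEnd = i;
                break;
            }
        }
        int semi = base.indexOf(';');
        if (semi < 0 || semi >= pathEnd) {
            return base; // no embedded params in the path part
        }
        return base.substring(0, semi) + base.substring(pathEnd);
    }

    // Resolves target against the cleaned base.
    static String resolve(String base, String target) {
        return URI.create(removeEmbeddedParams(base)).resolve(target).toString();
    }

    public static void main(String[] args) {
        // Base and target from the NUTCH-1115 example discussed above.
        System.out.println(resolve(
            "http://www.funkybabes.nl/;ROOOWAN/fotoboek", "forumregels"));
        // prints http://www.funkybabes.nl/forumregels
    }
}
```

With this reading, the NUTCH-1115 example yields the first of the candidate results discussed above, and URLs without a ';' in the path are left unchanged.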

          Markus Jelsma added a comment -

          Mmm, I think you are correct. It's a bit confusing indeed. If it's modified to simply remove them, it should definitely fix NUTCH-1115. There are quite a few buggy URLs that wouldn't enter the DB with them removed.

          I'll be happy to test any patches to the current trunk (the one without NUTCH-1115).

          Andrzej Bialecki added a comment -

          Tentative patch, which changes the meaning of "fixEmbeddedParams" to "removeEmbeddedParams".

          Markus Jelsma added a comment -

          Hm, it seems the parser.fix.embeddedparams switch doesn't have any effect anymore. I get the same output without embedded params. At least there are no embedded params, but it cannot be enabled either right now.

          Andrzej Bialecki added a comment -

          That's unexpected. I checked the patch and I can't see where the bug could be ... Did you make sure that your config is correct, and that the job actually sees the right value of this property in the config (check the job.xml via JobTracker)? TestDOMContentUtils indicates that it should work, so we need to make sure that the flag has the correct value.

          Markus Jelsma added a comment -

          This test was on a local instance. I tried both values for parser.fix.embeddedparams with:
          $ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek

          Is this how it should be implemented? I'm not sure. Embedded params are a bit puzzling.

          Hudson added a comment -

          Integrated in Nutch-trunk #1631 (See https://builds.apache.org/job/Nutch-trunk/1631/)
          NUTCH-797 Fix parse-tika and parse-html to use relative URL resolution per RFC-3986.

          ab : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181747
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
          • /nutch/trunk/src/plugin/parse-html/ivy.xml
          • /nutch/trunk/src/plugin/parse-html/plugin.xml
          • /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
          • /nutch/trunk/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
          • /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
          • /nutch/trunk/src/test/org/apache/nutch/util/TestURLUtil.java
          Hudson added a comment -

          Integrated in nutch-trunk-maven #3 (See https://builds.apache.org/job/nutch-trunk-maven/3/)
          NUTCH-797 Fix parse-tika and parse-html to use relative URL resolution per RFC-3986.

          ab : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1181747
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
          • /nutch/trunk/src/plugin/parse-html/ivy.xml
          • /nutch/trunk/src/plugin/parse-html/plugin.xml
          • /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
          • /nutch/trunk/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
          • /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
          • /nutch/trunk/src/test/org/apache/nutch/util/TestURLUtil.java
          Lewis John McGibbney added a comment -

          This has been committed but the issue is still open and marked as unresolved. I've just spent around 30 mins looking through the three open issues closely surrounding this problem area with constructing outlinks beginning with ?'s. I think that we need to have a close look to try and sort the three issues out.

          Lewis John McGibbney added a comment -

          Set and Classify

          Sebastian Nagel added a comment -

          Tested using parsechecker (cf. NUTCH-1743) with attached sample document:

          • fixed for trunk and parse-tika
          • still open for parse-html in 2.x

          Same applies to NUTCH-566 and NUTCH-952.

          Sebastian Nagel added a comment -

          Patch for 2.x:

          • port URLUtil.resolveURL() from 1.x (including unit test)
          • removed fixEmbeddedParams(): it's still in 1.x but unused (NUTCH-797 removed/deactivated NUTCH-1115)

            People

            • Assignee:
              Julien Nioche
              Reporter:
              Robert Hohman
            • Votes:
              0
              Watchers:
              1
