Droids / DROIDS-45

Fail to resolve outlink correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.2.0
    • Component/s: core
    • Labels:
      None

      Description

      I've encountered several cases in which outlinks are not extracted correctly. Most are caused by the use of URI.resolve().

      1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a> will be resolved to http://www.domain.comtest.html

      2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test with param</a> will be resolved to http://www.domain.com/?test=true, whereas a browser resolves it to http://www.domain.com/index.php?test=true

      3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws an exception, whereas a browser can resolve the URI. (Remark: I didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
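
The three cases above could be handled by a thin wrapper around URI.resolve(). What follows is only a sketch under stated assumptions, not the code from the attached patches: the class name OutlinkResolverSketch and its resolve method are hypothetical, and the base is assumed to be an absolute http or https URL.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not the code from the attached patches. */
final class OutlinkResolverSketch {

    /** Resolves href against baseUrl; assumes baseUrl is an absolute http(s) URL. */
    static String resolve(String baseUrl, String href) {
        // Case 3: drop surrounding whitespace such as a trailing line break.
        String link = href.trim();

        URI base = URI.create(baseUrl);
        String root = base.getScheme() + "://" + base.getRawAuthority();

        // Case 1: a base without a path is treated as if it ended in "/".
        String basePath = base.getRawPath();
        if (basePath == null || basePath.isEmpty()) {
            basePath = "/";
        }

        // Case 2: a query-only link keeps the base path and replaces the query.
        if (link.startsWith("?")) {
            return root + basePath + link;
        }

        // Everything else is delegated to URI.resolve() on the repaired base.
        String query = base.getRawQuery();
        String repaired = root + basePath + (query == null ? "" : "?" + query);
        return URI.create(repaired).resolve(link).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.domain.com", "test.html"));
        System.out.println(resolve("http://www.domain.com/index.php", "?test=true"));
        System.out.println(resolve("http://www.domain.com/", "http://www.yahoo.com\n"));
    }
}
```

With this sketch the three inputs from the list come out as http://www.domain.com/test.html, http://www.domain.com/index.php?test=true and http://www.yahoo.com. A link that is still not a valid URI after trimming makes URI.create() or URI.resolve() throw IllegalArgumentException, which a crawler would need to catch and log.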

      I suspect there are many different scenarios, many of them probably caused by non-standard usage (but a crawler has to handle non-standard usage in order to function). Obviously, we cannot cater for every case, so I suggest treating a resolve failure as a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.

      This issue is related to the LinkExtractor created in DROIDS-8.

      1. DROIDS-45c.patch
        22 kB
        Mingfai Ma
      2. DROIDS-45b.patch
        9 kB
        Mingfai Ma

          Activity

          Richard Frovarp added a comment -

          Our code does this relatively well.
          However, using the droids-tika module for parsing seems to handle everything very well. Let's let the Tika people worry about those problems.

          Mingfai Ma added a comment -

          Not sure whether a null path should be normalized to "/":

          assertEquals("http://www.apache.org/", normalizer.normalize("http://www.apache.org"));
          

          If a website behaves differently for a null path and a "/" path, this might cause problems.

          LinkNormalizer
            // apply patterns
            if (path != null && !"".equals(path)) {
                for (Pattern pattern : PATH_REPLACEMENTS.keySet()) {
                    path = pattern.matcher(path).replaceAll(PATH_REPLACEMENTS.get(pattern));
                }
            } else {
                path = "/";
            }
          

          Changing "/" to a null path is odd but may cause fewer problems. E.g. for http://www.apache.org, the server just redirects the request to "http://www.apache.org", and the fetching operation won't be affected. I tested a couple of popular websites, and they either redirect a null-path request to another URL or to the "/" path. One of the main functions of this normalization is to avoid duplicate links as much as possible.

          Mingfai Ma added a comment -

          Added a LinkNormalizer to normalize URLs, e.g.:

            assertEquals("http://www.apache.org/style.css", normalizer.normalize("http://www.apache.org/../style.css"));
            assertEquals("http://www.apache.org/style.css", normalizer.normalize("http://www.apache.org/../../style.css"));
            assertEquals("http://www.apache.org/style.css", normalizer.normalize("http://www.apache.org/../../../style.css"));
            assertEquals("http://www.apache.org/style.css", normalizer.normalize("http://www.apache.org/./style.css"));
          

          See the test case for more examples.

          The LinkResolver was changed to use the normalizer, along with some other changes such as removing the incorrect space escaping.
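
The assertions quoted in this comment cannot be met with java.net.URI.normalize() alone, because that method keeps leading ".." segments when there is nothing left to remove them against. One way to get the asserted results is to drop dot segments by hand. The following is a minimal sketch, assuming an absolute http or https URL; DotSegmentNormalizerSketch and its methods are hypothetical names and do not reproduce the LinkNormalizer from the patch.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; not the LinkNormalizer from the attached patch. */
final class DotSegmentNormalizerSketch {

    /** Removes "." and ".." segments from a path that is empty or starts with "/". */
    static String removeDotSegments(String path) {
        Deque<String> kept = new ArrayDeque<>();
        String[] segments = path.split("/", -1);
        // segments[0] is the empty string in front of the leading "/".
        for (int i = 1; i < segments.length; i++) {
            String segment = segments[i];
            boolean last = i == segments.length - 1;
            if (segment.equals(".")) {
                if (last) kept.addLast("");          // "/a/." becomes "/a/"
            } else if (segment.equals("..")) {
                if (!kept.isEmpty()) kept.removeLast();
                if (last) kept.addLast("");          // "/a/b/.." becomes "/a/"
            } else {
                kept.addLast(segment);
            }
        }
        // Note: an empty path comes out as "/".
        return "/" + String.join("/", kept);
    }

    /** Rebuilds the URL around the cleaned path; query and fragment are kept as-is. */
    static String normalize(String url) {
        URI uri = URI.create(url);
        String rawPath = uri.getRawPath() == null ? "" : uri.getRawPath();
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
        sb.append(removeDotSegments(rawPath));
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.apache.org/../../style.css"));
        System.out.println(normalize("http://www.apache.org/./style.css"));
    }
}
```

Both calls in main print http://www.apache.org/style.css. Extra ".." segments at the root are simply discarded, which is what browsers do and what the assertions expect.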

          Mingfai Ma added a comment -

          Handled an additional case. For:
          <link rel="stylesheet" href="./style/style.css" type="text/css" media="screen, projection"/>
          under www.apache.org

          it now resolves to:
          www.apache.org/style/style.css instead of www.apache.org/./style/style.css

          Also changed the test case to use JUnit 4.

          Mingfai Ma added a comment -

          Overlooked this task before.

          NoRobot also uses URI.resolve. As NoRobot doesn't depend on core, I didn't make any change there.

          Thorsten Scherler added a comment -

          Mingfai Ma, can you add a patch to integrate it, please?

          Mingfai Ma added a comment -

          Changed the API based on Thorsten's comment.

          Note that these two classes need further work before they can be put into Droids. The classes are not in a Droids package, they carry no license terms, and their style doesn't match the original LinkExtractor. They are attached as a base for a Droids implementation.

          Mingfai Ma added a comment -

          Fixed the case of "?x=y" for a base path that ends in a file name, e.g. http://www.apache.org/index.html with <a href="?x=y">x=y</a>.

          URI.resolve resolves it incorrectly, so it requires special handling.

          Mingfai Ma added a comment -

          There are some other cases:

          • mailto: , news:
          • URL parameters with spaces
          • Unicode characters

          I think this is still far from the full list of special scenarios.

          Attached is my implementation of some custom link transformations to handle more cases. The code could be moved to LinkExtractor if you think it's OK. I don't use a SAX parser, so I don't use LinkExtractor. It would be good if the URL/URI transformation and resolution could be refactored into a standalone class.

          Another thing is that I implemented some of the checks differently. I have not benchmarked it on a modern JDK, but I believe my approach, which uses indexOf and avoids regex, is slightly more efficient.
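
To illustrate the kind of indexOf-based checks mentioned in this comment, here is a hedged sketch that covers the three listed cases. It is not the attached implementation: LinkFilterSketch and all of its method names are hypothetical, the choice to crawl only http and https is an assumption, and the mail address in the example is made up.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch of regex-free link checks; not the attached code. */
final class LinkFilterSketch {

    /** Returns the lower-cased scheme of a link, or null when it has none. */
    static String schemeOf(String link) {
        int colon = link.indexOf(':');
        if (colon <= 0) return null;
        // A ':' that follows '/', '?' or '#' belongs to the path, query or fragment.
        for (int i = 0; i < colon; i++) {
            char c = link.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean other = (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.';
            if (!letter && !(i > 0 && other)) return null;
        }
        return link.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    /** Assumption: only relative links and http(s) links are worth fetching. */
    static boolean isCrawlable(String link) {
        String scheme = schemeOf(link);
        return scheme == null || scheme.equals("http") || scheme.equals("https");
    }

    /** Escapes literal spaces, which java.net.URI would otherwise reject. */
    static String escapeSpaces(String link) {
        return link.indexOf(' ') < 0 ? link : link.replace(" ", "%20");
    }

    /** Percent-encodes non-ASCII characters as UTF-8 via URI.toASCIIString(). */
    static String toAscii(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("mailto:dev@example.org"));   // false
        System.out.println(isCrawlable("page.html?time=10:30"));     // true
        System.out.println(escapeSpaces("search?q=a b"));            // search?q=a%20b
        System.out.println(toAscii("http://www.apache.org/caf\u00e9"));
    }
}
```

The scheme check walks only the characters in front of the first colon, so a colon inside a query string is not mistaken for a scheme separator.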

          Mingfai Ma added a comment -

          The LinkExtractor doesn't append '/' automatically, and I think it shouldn't, as a server may handle a URL with and without '/' differently. For a root domain URL it may be OK, but for deeper URLs we can't just assume that the last segment of the request path is a directory.

          Apache mod_dir should append a trailing slash, but unfortunately not all web servers on the internet have this feature enabled:
          http://httpd.apache.org/docs/2.2/mod/mod_dir.html
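
The policy described in this comment, adding "/" only when the URL has no path at all and leaving deeper URLs untouched, could look roughly like the sketch below. RootSlashSketch and addRootSlash are hypothetical names chosen for illustration; nothing here is taken from LinkExtractor.

```java
import java.net.URI;

/** Hypothetical sketch: append "/" only to URLs whose path is empty. */
final class RootSlashSketch {

    static String addRootSlash(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        // Leave relative or opaque URIs and every non-empty path untouched.
        if (uri.getScheme() == null || uri.getRawAuthority() == null
                || path == null || !path.isEmpty()) {
            return url;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append('/');
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(addRootSlash("http://www.apache.org"));      // http://www.apache.org/
        System.out.println(addRootSlash("http://www.apache.org/docs")); // unchanged
    }
}
```

Because "/docs" may or may not be a directory, the sketch never guesses for non-empty paths and leaves that decision to the server's redirect.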

          Thorsten Scherler added a comment -

          The URITesting test case will print out
          http://www.apache.orgindex.html

          However if you have
          URI newLink = new URI("http://www.apache.org/").resolve("/index.html");

          or

          URI newLink = new URI("http://www.apache.org/").resolve("index.html");

          it will return
          http://www.apache.org/index.html

          So I am not sure whether "base" (from LinkExtractor) has a trailing "/" or not.

          Thorsten Scherler added a comment -

          package testing;

          import java.net.URI;
          import java.net.URISyntaxException;

          import junit.framework.TestCase;

          public class URITesting extends TestCase {

              public void test() throws URISyntaxException {
                  URI newLink = new URI("http://www.apache.org").resolve("index.html");
                  System.out.println(newLink.toString());
              }

          }

          Mingfai Ma added a comment -

          By the way, I wonder whether one of the many Java crawlers has already encountered the same non-standard HTML usage or URI resolving issues. Two examples of outlink extraction:

          http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
          https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java

            People

            • Assignee: Unassigned
            • Reporter: Mingfai Ma