Droids / DROIDS-45

Fail to resolve outlink correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.2.0
    • Component/s: core
    • Labels:
      None

      Description

      I've encountered several cases where outlinks are not extracted correctly. Most are caused by the use of URI.resolve().

      1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a> will be resolved to http://www.domain.comtest.html

      2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test with param</a> will be resolved to http://www.domain.com/?test=true

      3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws an exception, whereas a browser resolves the link. (Remark: I didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
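The three cases above can be worked around with some pre-processing before calling URI.resolve(). The following is only a minimal sketch under assumptions: the class name OutlinkResolverSketch and its resolve() method are hypothetical and are not the LinkResolver attached to this issue. It assumes that trimming whitespace, giving an authority-only base an explicit "/" path, and appending a query-only reference to the base (minus its own query and fragment) match what a browser does for these inputs.

```java
import java.net.URI;

// Hypothetical helper for illustration only; NOT the LinkResolver attached to this issue.
final class OutlinkResolverSketch {

    // Resolves href against base, working around the three cases reported above.
    // Throws IllegalArgumentException if the cleaned-up link is still not a valid URI.
    static URI resolve(URI base, String href) {
        // Case 3: drop surrounding whitespace such as a trailing line break.
        String link = href.trim();

        // Case 1: a base like http://www.domain.com has an empty path;
        // give it an explicit root path before resolving.
        URI b = base;
        String path = b.getRawPath();
        if (b.getRawAuthority() != null && (path == null || path.isEmpty())) {
            String query = b.getRawQuery() == null ? "" : "?" + b.getRawQuery();
            b = URI.create(b.getScheme() + "://" + b.getRawAuthority() + "/" + query);
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // only the base query and fragment.
        if (link.startsWith("?")) {
            String s = b.toString();
            int end = s.length();
            for (char c : new char[] {'?', '#'}) {
                int i = s.indexOf(c);
                if (i >= 0 && i < end) {
                    end = i;
                }
            }
            return URI.create(s.substring(0, end) + link);
        }

        // Everything else is left to the standard resolver.
        return b.resolve(URI.create(link));
    }

    public static void main(String[] args) {
        // prints http://www.domain.com/test.html
        System.out.println(resolve(URI.create("http://www.domain.com"), "test.html"));
        // prints http://www.domain.com/index.php?test=true
        System.out.println(resolve(URI.create("http://www.domain.com/index.php"), "?test=true"));
        // prints http://www.yahoo.com
        System.out.println(resolve(URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

The sketch covers only the three reported inputs. Other non-standard links (for example, unescaped spaces inside the href) would still fail in URI.create() and would need their own handling.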

      I suspect there are many more such scenarios, most of them probably caused by non-standard usage (but a crawler has to handle non-standard usage in order to function). Obviously we cannot cater for every case, so I suggest treating a resolve failure as a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.

      This issue is related to the LinkExtractor created in DROIDS-8.

      1. DROIDS-45b.patch
        9 kB
        Mingfai Ma
      2. DROIDS-45c.patch
        22 kB
        Mingfai Ma

          Activity

          Mingfai Ma created issue -
          Thorsten Scherler made changes -
          Field Original Value New Value
          Link This issue is related to DROIDS-11 [ DROIDS-11 ]
          Mingfai Ma made changes -
          Attachment LinkResolver.java [ 12404473 ]
          Attachment LinkResolverTests.java [ 12404474 ]
          Thorsten Scherler made changes -
          Summary: "Fail to resovle outlink correctly" → "Fail to resolve outlink correctly"
          Mingfai Ma made changes -
          Attachment LinkResolverTests.java [ 12404564 ]
          Attachment LinkResolver.java [ 12404565 ]
          Mingfai Ma made changes -
          Attachment LinkResolver.java [ 12404473 ]
          Mingfai Ma made changes -
          Attachment LinkResolverTests.java [ 12404474 ]
          Mingfai Ma made changes -
          Attachment LinkResolver.java [ 12404648 ]
          Attachment LinkResolverTests.java [ 12404649 ]
          Mingfai Ma made changes -
          Attachment LinkResolver.java [ 12404565 ]
          Mingfai Ma made changes -
          Attachment LinkResolverTests.java [ 12404564 ]
          Mingfai Ma made changes -
          Attachment DROIDS-45.patch [ 12409389 ]
          Mingfai Ma made changes -
          Attachment DROIDS-45b.patch [ 12409490 ]
          Mingfai Ma made changes -
          Attachment DROIDS-45c.patch [ 12409689 ]
          Mingfai Ma made changes -
          Attachment DROIDS-45.patch [ 12409389 ]
          Mingfai Ma made changes -
          Attachment LinkResolver.java [ 12404648 ]
          Mingfai Ma made changes -
          Attachment LinkResolverTests.java [ 12404649 ]
          Richard Frovarp made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Fix Version/s 0.0.2 [ 12314984 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Mingfai Ma
            • Votes:
              0
              Watchers:
              0
