Tika / TIKA-287

HtmlParser should resolve relative paths in <a href="xxx"> elements

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: parser
    • Labels: None

    Description

      Currently, clients of the HtmlParser must keep track of the appropriate base URL themselves in order to resolve relative URLs in href="xxx" attributes.

      The parser should use the metadata RESOURCE_NAME_KEY value as the initial base URL.

      The parser should also watch for a <base> element in the <head> section, and use that to update the base URL.
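
      The two requirements above could be sketched roughly as follows. This is only an illustration, not the attached UrlUtils.java or the committed fix: the class name BaseTrackingHandlerSketch, its constructor argument (standing in for the RESOURCE_NAME_KEY metadata value), and the getLinks() accessor are all hypothetical. It uses only the JDK's SAX classes, does not check that <base> actually appears inside <head>, and does not include the query-string workaround.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tracks a base URL and resolves <a href> values against it.
public class BaseTrackingHandlerSketch extends DefaultHandler {
    private URL base;                                  // null if no usable base is known
    private final List<String> links = new ArrayList<>();

    // resourceName stands in for the RESOURCE_NAME_KEY metadata value (assumption).
    public BaseTrackingHandlerSketch(String resourceName) {
        this.base = resolveOrNull(resourceName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String href = atts.getValue("href");
        if (href == null) {
            return;
        }
        if ("base".equalsIgnoreCase(name)) {
            // A <base href="..."> element replaces the current base URL.
            URL updated = resolveOrNull(href);
            if (updated != null) {
                base = updated;
            }
        } else if ("a".equalsIgnoreCase(name)) {
            // Resolve the link; fall back to the raw value if that fails.
            URL resolved = resolveOrNull(href);
            links.add(resolved != null ? resolved.toString() : href);
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Resolves spec against the current base, or parses it as absolute if there is none.
    private URL resolveOrNull(String spec) {
        if (spec == null) {
            return null;
        }
        try {
            return base == null ? new URL(spec) : new URL(base, spec);
        } catch (MalformedURLException e) {
            return null;
        }
    }
}
```

      If the resource name is not an absolute URL (for example a bare file name), the sketch starts with no base and leaves relative links untouched until a <base> element supplies one.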

      Note that special care must be taken to work around a known bug in the java.net.URL class, which occurs when the relative URL consists only of a query string and the base URL doesn't end with a '/'.
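
      One possible shape for such a workaround is sketched below. It is an assumption about how the problem could be handled, not the contents of the attached UrlUtils.java: the class name QueryAwareResolverSketch and its resolve method are hypothetical. The idea is to avoid handing a query-only reference to the two-argument URL constructor, which in this situation can lose the last path segment of the base, and instead to append the query to the base path directly.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of a resolver that special-cases query-only relative URLs.
public class QueryAwareResolverSketch {

    public static String resolve(String baseUrl, String relative) {
        try {
            URL base = new URL(baseUrl);
            String target = relative.trim();
            if (target.startsWith("?")) {
                // Keep the base path as-is and replace only the query part.
                // Note: any user-info in the base URL is not carried over here.
                String path = base.getPath();
                if (path.isEmpty()) {
                    path = "/";
                }
                return new URL(base.getProtocol(), base.getHost(),
                        base.getPort(), path + target).toString();
            }
            // All other cases: rely on the standard resolution.
            return new URL(base, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(
                    "Cannot resolve '" + relative + "' against '" + baseUrl + "'", e);
        }
    }
}
```

      With this sketch, resolving "?x=1" against "http://example.com/path/file" keeps the "file" segment and only replaces the query.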

      Attachments

        1. UrlUtils.java (2 kB, Kenneth William Krugler)
        2. UrlUtilsTest.java (3 kB, Kenneth William Krugler)

            People

              Assignee: Jukka Zitting (jukkaz)
              Reporter: Kenneth William Krugler (kkrugler)