Tika / TIKA-463

HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels: None

    Description

      All of the HTML elements listed in the summary can carry URLs in their attributes, so HtmlParser should extract those links where possible.

      Elements that aren't valid in XHTML 1.0 may be harder to handle correctly.

      But if XHTML 1.0 is taken to mean the union of the "transitional and frameset" variants, then all of the above are valid and should therefore be emitted by the parser.
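
      To illustrate the idea, here is a minimal sketch of mapping element names to their URL-bearing attributes. It is not Tika's code and not taken from the attached patches: the class name LinkAttributes, the method linksFor, and the table contents are assumptions, with the attribute choices based on HTML 4.01. The map element is left out because, as far as I can tell, in HTML 4.01 its links are carried by the nested area elements rather than by map itself. Attribute names are assumed to arrive already lower-cased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LinkAttributes {
    // Hypothetical table (not Tika's): element name -> attributes that may hold a URL.
    private static final Map<String, List<String>> URL_ATTRIBUTES = Map.of(
        "a", List.of("href"),
        "img", List.of("src"),
        "object", List.of("data"),
        "frame", List.of("src"),
        "iframe", List.of("src"),
        "area", List.of("href"),
        "link", List.of("href"));

    // Returns the non-blank URL attribute values found on one start tag.
    // Unknown elements yield an empty list.
    static List<String> linksFor(String element, Map<String, String> attributes) {
        List<String> links = new ArrayList<>();
        List<String> names =
            URL_ATTRIBUTES.getOrDefault(element.toLowerCase(Locale.ROOT), List.of());
        for (String name : names) {
            String value = attributes.get(name);
            if (value != null && !value.isBlank()) {
                links.add(value.trim());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // An img element contributes its src; alt is ignored.
        System.out.println(linksFor("img", Map.of("src", "logo.png", "alt", "Logo")));
        // A p element has no URL-bearing attributes in this table.
        System.out.println(linksFor("p", Map.of("class", "intro")));
    }
}
```

      In a SAX-based parser, a lookup like this could be called from startElement for each tag. Whether an element outside the strict XHTML 1.0 set is then emitted is the open question raised above.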

      Attachments

        1. TIKA-463.patch
          21 kB
          Julien Nioche
        2. TIKA-463-1.patch
          8 kB
          Kenneth William Krugler
        3. TIKA-463-2.patch
          9 kB
          Kenneth William Krugler
        4. TIKA-463-3.patch
          13 kB
          Kenneth William Krugler

            People

              kkrugler Kenneth William Krugler