[TIKA-463] HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8
Component/s: parser
Labels: None

Description

All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract those links, if possible.

For elements that aren't valid in XHTML 1.0, it may be unclear what the right way to handle this is.

But if XHTML 1.0 means the union of the "transitional" and "frameset" variants, then all of the above are valid, and thus should be emitted by the parser.
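
To make the request concrete, here is a minimal sketch of the kind of link extraction described above. It is not the Tika HtmlParser code and not the attached patches; the class `LinkCollector`, its `URL_ATTRIBUTE` table, and the `collect` helper are hypothetical names used only for illustration. It assumes well-formed XHTML input and uses the JDK's SAX parser, and it looks at only the primary URL attribute of each element as defined in HTML 4.01.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of URL-attribute extraction; not Tika's implementation. */
class LinkCollector extends DefaultHandler {
    // Element name -> attribute carrying its primary URL (per HTML 4.01).
    // <map> has no URL attribute of its own; its <area> children carry the links.
    static final Map<String, String> URL_ATTRIBUTE = Map.of(
        "a", "href",
        "img", "src",
        "object", "data",
        "frame", "src",
        "iframe", "src",
        "area", "href",
        "link", "href");

    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace awareness localName is empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String attr = URL_ATTRIBUTE.get(name.toLowerCase(Locale.ROOT));
        if (attr == null) {
            return; // element is not in the table, so it carries no link we track
        }
        String value = atts.getValue(attr);
        if (value != null && !value.isEmpty()) {
            links.add(value);
        }
    }

    /** Parses a well-formed XHTML string and returns link values in document order. */
    static List<String> collect(String xhtml) {
        LinkCollector handler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
        return handler.links;
    }
}
```

Calling `LinkCollector.collect(...)` on a snippet containing `img`, `area`, and `iframe` elements returns their `src` and `href` values in document order. Real-world HTML is rarely well-formed, so an actual parser would sit behind a tag-soup style front end rather than a strict XML parser.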

Attachments

- TIKA-463.patch, 19/Jul/10 08:14, 21 kB, Julien Nioche
- TIKA-463-1.patch, 12/Aug/10 23:16, 8 kB, Kenneth William Krugler
- TIKA-463-2.patch, 16/Aug/10 18:32, 9 kB, Kenneth William Krugler
- TIKA-463-3.patch, 17/Aug/10 14:59, 13 kB, Kenneth William Krugler

Issue Links

is depended upon by: TIKA-460 HTMLHandler misses treatment of A elements (Resolved)

relates to: TIKA-457 HTMLParser gets an early </body> event (Resolved)

People

Assignee: Kenneth William Krugler

Reporter: Kenneth William Krugler

Votes: 0

Watchers: 1

Dates

Created: 12/Jul/10 20:15

Updated: 02/May/13 02:29

Resolved: 17/Aug/10 15:06