[TIKA-1938] HtmlParser drops <script> elements found inside <head> - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12
Fix Version/s: 1.14, 2.0.0
Component/s: parser
Labels:
None

Description

HtmlParser's HtmlHandler does not check for "SCRIPT" in startElement when parsing <head> (i.e. bodylevel == 0 && discardLevel ==0). This causes <script> elements found within <head> to be dropped altogether. They should be treated in the same manner as "LINK" elements.

Here is a sample test case that demonstrates the problem and can be run from within HtmlParserTest.java, although it could be generalized to check for <a>, <link> and <img> links by using the LinkContentHandler.

@Test
public void testScriptSrc() throws Exception {
    String url = "http://domain.com/logic.js";
    String scriptInBody =
            "<html><body><script src=\"" + url + "\"></script></body></html>";
    String scriptInHead =
            "<html><head><script src=\"" + url + "\"></script></head></html>";

    assertScriptLink(scriptInBody, url);
    assertScriptLink(scriptInHead, url);
}

private void assertScriptLink(String html, String url) throws Exception {
    // IdentityHtmlMapper needed to extract <script> tags
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");

    final List<String> links = new ArrayList<String>();
    new HtmlParser().parse(
            new ByteArrayInputStream(html.getBytes(UTF_8)),
            new DefaultHandler() {
                @Override
                public void startElement(
                        String u, String l, String name, Attributes atts) {
                    if (name.equals("script") && atts.getValue("", "src") != null) {
                        links.add(atts.getValue("", "src"));
                    }
                }
            },
            metadata,
            context);

    assertEquals(1, links.size());
    assertEquals(url, links.get(0));
}

Attachments

Activity

People

Assignee:: Kenneth William Krugler

Reporter:: Joseph Naegele

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 07/Apr/16 16:12

Updated:: 12/Apr/21 12:59

Resolved:: 13/May/16 14:50