[NUTCH-3001] protocol-selenium requires Content-Type header - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.20
Component/s: None
Labels: None

Description

It looks like protocol-selenium requires a Content-Type header to be present.

The logic seems to be: if the content type is HTML or XHTML, use Selenium; otherwise, just grab the bytes.

However, with the current logic, if the Content-Type header is null, nothing is pulled at all.

My guess is that the logic should be: if the content type is not null and is HTML or XHTML, use Selenium; otherwise, grab the bytes.

Right?

      String contentType = getHeader(Response.CONTENT_TYPE);

      // handle with Selenium only if content type in HTML or XHTML
      if (contentType != null) {
        if (contentType.contains("text/html")
            || contentType.contains("application/xhtml")) {
          readPlainContent(url);
        } else {
...
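
The proposed behavior can be sketched as a single null-safe check. This is only an illustration of the logic described above, not the code from the linked pull request: the class name `ContentTypeDispatch` and the method `useSelenium` are hypothetical and do not exist in Nutch.

```java
// Hypothetical helper illustrating the proposed decision; not part of Nutch.
class ContentTypeDispatch {

  /**
   * Returns true when the response should be handled with Selenium,
   * i.e. the Content-Type header is present and names HTML or XHTML.
   * A missing (null) header returns false, so the caller falls back
   * to reading the raw bytes instead of fetching nothing.
   */
  static boolean useSelenium(String contentType) {
    return contentType != null
        && (contentType.contains("text/html")
            || contentType.contains("application/xhtml"));
  }

  public static void main(String[] args) {
    System.out.println(useSelenium("text/html; charset=UTF-8"));
    System.out.println(useSelenium("application/pdf"));
    System.out.println(useSelenium(null));
  }
}
```

With a check of this shape, a response without a Content-Type header takes the same "grab the bytes" branch as any non-HTML response.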

Issue Links

links to

GitHub Pull Request #774

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 4

Dates

Created: 13/Sep/23 13:58

Updated: 13/Mar/24 14:51

Resolved: 13/Sep/23 18:58