[NUTCH-2042] parse-html increase chunk size used to detect charset - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.3, 1.10
Fix Version/s: 2.3.1, 1.12
Component/s: parser
Labels: None

Patch Info: Patch Available

Description

The chunk used to detect the encoding of a document is limited to 2000 bytes. Although it is definitely best practice to declare the character set at the top of the document, 2000 bytes are sometimes not enough: 20 longer <link> elements pointing to JavaScript and CSS libraries may "hide" the <meta> element containing the content type and encoding. The same problem was observed in TIKA-357 and solved there by increasing the buffer size to 8 kB.
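The effect described above can be reproduced with a minimal sketch. It is not the parse-html plugin's actual code: the class name `CharsetSniffer`, the method `sniff`, the simplified regular expression, and the example URLs are all assumptions made for illustration. The sketch only looks for a charset declaration in a <meta> element within the first `chunkSize` bytes of a page.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from Nutch or Tika.
class CharsetSniffer {

  // Simplified pattern: a <meta> tag that contains charset=<name>.
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta\\s+[^>]*charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Returns the declared charset name found within the first chunkSize
  // bytes, or null if no declaration is found in that range.
  static String sniff(byte[] content, int chunkSize) {
    int length = Math.min(content.length, chunkSize);
    // The declaration itself is ASCII, so an ASCII decode is enough here.
    String head = new String(content, 0, length, StandardCharsets.US_ASCII);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // Build a page whose <meta> element follows 20 long <link> elements.
    StringBuilder html = new StringBuilder("<html><head>\n");
    for (int i = 0; i < 20; i++) {
      html.append("<link rel=\"stylesheet\" type=\"text/css\" ")
          .append("href=\"https://example.org/static/assets/libs/some-library-")
          .append(i)
          .append("/dist/css/style.min.css\">\n");
    }
    html.append("<meta http-equiv=\"Content-Type\" ")
        .append("content=\"text/html; charset=ISO-8859-2\">\n</head></html>");
    byte[] page = html.toString().getBytes(StandardCharsets.US_ASCII);

    System.out.println("offset of <meta>: " + html.indexOf("<meta"));
    System.out.println("2000-byte chunk:  " + sniff(page, 2000));
    System.out.println("8192-byte chunk:  " + sniff(page, 8192));
  }
}
```

With this synthetic page the <meta> element starts beyond byte 2000, so the smaller chunk finds no declaration while the larger one does. A larger chunk only moves the limit; a page with even more markup before the declaration would still be missed.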

Attachments

NUTCH-2042-2x-v1.patch, 18/Jun/15 15:28, 2 kB, Sebastian Nagel
NUTCH-2042-trunk-v1.patch, 18/Jun/15 15:28, 2 kB, Sebastian Nagel
NUTCH-2042-trunk-v2.patch, 23/Jul/15 20:46, 2 kB, Sebastian Nagel

People

Assignee: Sebastian Nagel

Reporter: Sebastian Nagel

Votes: 0

Watchers: 3

Dates

Created: 18/Jun/15 15:16

Updated: 13/Mar/24 14:50

Resolved: 08/Dec/15 21:46