[NUTCH-153] TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8
Fix Version/s: 1.0.0
Component/s: fetcher
Labels: None
Environment: all

Description

If TextParser is given PostScript, it can take hours and then fail. This can be avoided with careful configuration, but if the server MIME type is wrong and the basename of the URL has no "file extension", then this parser will take a long time and fail every time.

Analysis: The real problem is in OutlinkExtractor.java, as reported in bug ~~NUTCH-150~~, but that patch cannot entirely address the problem, since the first call to the regular expression match() can take a long time despite quantifier limits.

Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the file.
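
The check described in the suggested fix could look roughly like the following. This is a minimal sketch, not the contents of the attached patch: the class name `PostScriptSniffer` and the method `looksLikePostScript` are hypothetical, and it assumes the raw content is available as a byte array and that the whole marker must fall within the first 40 characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Nutch and not the attached patch.
final class PostScriptSniffer {
    // Marker and window size come from the suggested fix in this issue.
    private static final String PS_MAGIC = "%!PS-Adobe";
    private static final int SNIFF_LENGTH = 40;

    /** Returns true if the PostScript marker appears within the first 40 characters. */
    static boolean looksLikePostScript(byte[] content) {
        if (content == null) {
            return false;
        }
        int n = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps every byte to exactly one char, so decoding
        // arbitrary binary input cannot fail or change the length.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }
}
```

A text parser could call such a check before doing any other work and reject the document immediately when it returns true.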

Experience has shown that, for safety, it is worth protecting against GIGO directly in TextParser for this case, even though the suggested fix is not a general solution. (A general solution would be a timeout on match().)
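
One common way to approximate a timeout on match() in Java is to wrap the input in a `CharSequence` that checks a deadline each time the regex engine reads a character. The sketch below illustrates that technique only; `TimedRegex`, `RegexTimeoutException`, and `findWithTimeout` are hypothetical names, and nothing here is taken from Nutch or from the attached patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a deadline-based regex timeout.
final class TimedRegex {
    /** Thrown when a match runs past its deadline. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that checks the deadline on every character access. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads input through charAt, so throwing
            // here aborts a match that has been running for too long.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded time limit");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs pattern.matcher(input).find(), aborting once timeoutMillis has elapsed. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(input, deadline));
        return matcher.find();
    }
}
```

A caller would catch `RegexTimeoutException` and treat the document as unparseable. The deadline check runs on every character read, which adds some overhead to each match.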

Attachments

TextParser.java.patch, 27/Dec/05 12:30, 1 kB, Paul Baclace

People

Assignee: Unassigned

Reporter: Paul Baclace

Votes: 1

Watchers: 1

Dates

Created: 27/Dec/05 12:29

Updated: 10/Apr/09 12:29

Resolved: 22/Sep/08 15:02