[NUTCH-1389] parsechecker and indexchecker to report truncated content - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: nutchgora, 1.5
Fix Version/s: 1.7, 2.2
Component/s: indexer, parser
Labels: None

Description

ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit.
Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats.
A hint that truncation (and not a broken plugin) is the possible reason would be useful.
See NUTCH-965 and ParseSegment.isTruncated(content).
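
The kind of check that would produce such a hint can be sketched as follows. This is a minimal, standalone illustration and not the Nutch code: the class `TruncationHint`, its `isTruncated` method, and the plain header map are hypothetical names used here, and the sketch assumes that truncation is detected by comparing a declared `Content-Length` with the number of bytes actually fetched.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch: illustrates a truncation check
// based on the declared Content-Length versus the fetched byte count.
class TruncationHint {

    /**
     * Returns true if the headers declare more bytes than were fetched.
     * Returns false when no usable Content-Length is present, because
     * truncation cannot be decided in that case.
     * Note: the lookup is case-sensitive in this sketch.
     */
    static boolean isTruncated(Map<String, String> headers, byte[] content) {
        String declared = headers.get("Content-Length");
        if (declared == null) {
            return false; // no declared length, nothing to compare against
        }
        long declaredLength;
        try {
            declaredLength = Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false; // malformed header, treat as unknown
        }
        return content.length < declaredLength;
    }

    public static void main(String[] args) {
        // Assumed scenario: a 1 MiB document fetched with a 64 KiB content limit.
        byte[] fetched = new byte[65536];
        Map<String, String> headers = Map.of("Content-Length", "1048576");

        if (isTruncated(headers, fetched)) {
            System.out.println("Content is truncated: fetched " + fetched.length
                + " of " + headers.get("Content-Length")
                + " bytes; parsing may fail.");
        }
    }
}
```

A checker tool could print a message like this before invoking the parser, so that a failed PDF parse is attributed to the content limit rather than to the parser plugin.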

Attachments

- NUTCH-1389-2x.patch, 26/Mar/13 21:27, 1 kB, Sebastian Nagel
- NUTCH-1389-trunk.patch, 26/Mar/13 21:27, 2 kB, Sebastian Nagel

People

Assignee: Sebastian Nagel
Reporter: Sebastian Nagel
Votes: 0
Watchers: 3

Dates

Created: 12/Jun/12 20:46
Updated: 22/May/13 03:53
Resolved: 27/Mar/13 21:35