[NUTCH-1258] MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: 1.5
Component/s: indexer
Labels: None

Patch Info: Patch Available

Description

The MoreIndexingFilter reads the Content-Type from parse metadata. However, that value is often unreliable because web developers can set it to anything they like. The filter must also be able to read the Content-Type field from content metadata, because that contains the type detected by Tika's Detector.
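
A rough sketch of the lookup described above, not the code from the attached patches. It assumes plain `Map<String, String>` objects as stand-ins for Nutch's metadata containers, and the class `ContentTypeResolver` and its methods are hypothetical names. Preferring the content metadata over the parse metadata is also an assumption; the issue only asks that both sources be readable.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: picks a Content-Type from two metadata sources.
class ContentTypeResolver {
    static final String CONTENT_TYPE = "Content-Type";

    // Assumption: the detected type in content metadata wins; parse metadata is the fallback.
    static String resolve(Map<String, String> parseMeta, Map<String, String> contentMeta) {
        String type = clean(contentMeta.get(CONTENT_TYPE));
        if (type == null) {
            type = clean(parseMeta.get(CONTENT_TYPE));
        }
        return type;
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the media type.
    // Returns null for missing or blank values.
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        int semi = raw.indexOf(';');
        String base = (semi >= 0 ? raw.substring(0, semi) : raw).trim();
        return base.isEmpty() ? null : base.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both sources present: the content metadata value is used.
        System.out.println(resolve(
            Map.of(CONTENT_TYPE, "foo/bar"),
            Map.of(CONTENT_TYPE, "text/html; charset=UTF-8"))); // prints "text/html"

        // Content metadata missing: falls back to parse metadata.
        System.out.println(resolve(
            Map.of(CONTENT_TYPE, "Application/PDF"),
            Map.of())); // prints "application/pdf"
    }
}
```

In a real indexing filter the two maps would be replaced by whatever metadata objects the filter receives, and the precedence could be made configurable.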

Attachments

NUTCH-1258-1.5-1.patch, 25/Jan/12 11:44, 2 kB, Markus Jelsma
NUTCH-1258-v2.patch, 13/Feb/12 11:55, 2 kB, Julien Nioche

Issue Links

relates to: NUTCH-1259 Store detected content type in crawldatum metadata (Closed)

People

Assignee: Markus Jelsma
Reporter: Markus Jelsma
Votes: 0
Watchers: 0

Dates

Created: 25/Jan/12 11:26
Updated: 22/May/13 03:54
Resolved: 01/Mar/12 15:38