Description
The MoreIndexingFilter reads the Content-Type from parse metadata. However, this value is often unreliable because web developers can set it to anything they like. The filter must also be able to read the Content-Type field from content metadata, because that contains the type detected by Tika's Detector.
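A rough sketch of the lookup order described above: prefer the detected type from content metadata, and fall back to the declared type from parse metadata. This is not the actual MoreIndexingFilter code. The class ContentTypeResolver, the method resolve, and the use of plain maps in place of Nutch's metadata classes are all assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch. Plain maps stand in for
// the content metadata and parse metadata objects.
public class ContentTypeResolver {
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the detected type from content metadata when present,
    // otherwise the declared type from parse metadata, otherwise null.
    static String resolve(Map<String, String> contentMeta,
                          Map<String, String> parseMeta) {
        String detected = clean(contentMeta.get(CONTENT_TYPE));
        if (detected != null) {
            return detected;
        }
        return clean(parseMeta.get(CONTENT_TYPE));
    }

    // Trims the value and treats blank strings as missing.
    private static String clean(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }
}
```

With this ordering, a page served with a bogus Content-Type header would still be indexed under the detected type, as long as detection produced a value.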
Attachments
Issue Links
- relates to NUTCH-1259 Store detected content type in crawldatum metadata (Closed)