We currently have several plugins, already distributed or proposed, which do very similar things:
NUTCH-809 to generate metadata fields in parse-metadata and index them
NUTCH-1005 to generate headings fields in parse-metadata and index them
NUTCH-422 to index configurable fields
NUTCH-855 to propagate metadata from the seeds to the outlinks and index them
NUTCH-940 to generate configurable static fields
All these plugins extract information from various sources and generate fields from it, which makes them largely redundant. Instead, this issue proposes to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata
and let the other plugins focus on parsing and extracting the values to index. This will make adding new fields simpler, since it relies on a stable common plugin instead of duplicating code across various plugins.
This plugin will replace index-extra (NUTCH-422) and will serve as a basis for further improvements.
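
To make the idea concrete, here is a minimal sketch of how a single generator could map configured rules onto the four sources listed above. It is only an illustration and deliberately does not use the Nutch plugin API: the class ConfigurableFieldSketch, the FieldRule type, the Source enum and the generateFields method are all hypothetical names, and plain maps stand in for the parse, content and crawldb metadata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Nutch itself.
public class ConfigurableFieldSketch {

    /** The four sources named in the proposal. */
    public enum Source { STATIC, PARSE_METADATA, CONTENT_METADATA, CRAWLDB_METADATA }

    /** One configured rule: target field, source, and lookup key (or literal value for STATIC). */
    public static final class FieldRule {
        final String fieldName;
        final Source source;
        final String keyOrValue;

        public FieldRule(String fieldName, Source source, String keyOrValue) {
            this.fieldName = fieldName;
            this.source = source;
            this.keyOrValue = keyOrValue;
        }
    }

    /** Resolves one rule against the available metadata; returns null when the key is absent. */
    static String resolve(FieldRule rule,
                          Map<String, String> parseMeta,
                          Map<String, String> contentMeta,
                          Map<String, String> crawlDbMeta) {
        switch (rule.source) {
            case STATIC:           return rule.keyOrValue;
            case PARSE_METADATA:   return parseMeta.get(rule.keyOrValue);
            case CONTENT_METADATA: return contentMeta.get(rule.keyOrValue);
            case CRAWLDB_METADATA: return crawlDbMeta.get(rule.keyOrValue);
            default:               return null;
        }
    }

    /** Builds the index fields for one document, skipping rules whose value is missing. */
    public static Map<String, String> generateFields(List<FieldRule> rules,
                                                     Map<String, String> parseMeta,
                                                     Map<String, String> contentMeta,
                                                     Map<String, String> crawlDbMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            String value = resolve(rule, parseMeta, contentMeta, crawlDbMeta);
            if (value != null) {
                fields.put(rule.fieldName, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Example rules; the field names and keys are made up for illustration.
        List<FieldRule> rules = new ArrayList<>();
        rules.add(new FieldRule("collection", Source.STATIC, "nutch"));
        rules.add(new FieldRule("heading", Source.PARSE_METADATA, "h1"));
        rules.add(new FieldRule("label", Source.CRAWLDB_METADATA, "seed_label"));

        Map<String, String> fields = generateFields(
                rules,
                Collections.singletonMap("h1", "Welcome"),
                Collections.<String, String>emptyMap(),
                Collections.singletonMap("seed_label", "news"));

        System.out.println(fields);
    }
}
```

With this shape, adding a new indexed field would only require a new rule in the configuration, while the plugins that parse or propagate metadata stay unchanged.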