Description
We currently have several plugins, already distributed or proposed, that do very similar things:
- parse-meta (NUTCH-809) to generate metadata fields in parse-metadata and index them
- headings (NUTCH-1005) to generate headings fields in parse-metadata and index them
- index-extra (NUTCH-422) to index configurable fields
- urlmeta (NUTCH-855) to propagate metadata from the seeds to the outlinks and index them
- index-static (NUTCH-940) to generate configurable static fields
All of these plugins extract information from various sources and generate fields from it, which makes them largely redundant. Instead, this issue proposes to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata
and let the other plugins focus on parsing and extracting the values to index. This will make adding new fields simpler, because they can rely on one stable common plugin instead of duplicating the code across various plugins.
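As a rough illustration of the idea, the sketch below maps configured rules to index fields, drawing each value from one of the four sources listed above. It is not Nutch code: the class, the `Source` enum, the `FieldRule` type, and the `generateFields` method are all hypothetical names, and the parse, content, and crawldb metadata are modelled as plain single-valued maps, which is a simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Nutch.
class ConfigurableFieldsSketch {

    /** The four value sources named in the proposal. */
    enum Source { STATIC, PARSE_META, CONTENT_META, CRAWLDB_META }

    /** One configured rule: which index field to fill, and from where. */
    static final class FieldRule {
        final String field;       // name of the index field to generate
        final Source source;      // where the value comes from
        final String keyOrValue;  // metadata key, or the literal for STATIC

        FieldRule(String field, Source source, String keyOrValue) {
            this.field = field;
            this.source = source;
            this.keyOrValue = keyOrValue;
        }
    }

    /** Builds the index fields for one document; missing values are skipped. */
    static Map<String, String> generateFields(List<FieldRule> rules,
                                              Map<String, String> parseMeta,
                                              Map<String, String> contentMeta,
                                              Map<String, String> crawlDbMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            String value;
            switch (rule.source) {
                case STATIC:       value = rule.keyOrValue; break;
                case PARSE_META:   value = parseMeta.get(rule.keyOrValue); break;
                case CONTENT_META: value = contentMeta.get(rule.keyOrValue); break;
                case CRAWLDB_META: value = crawlDbMeta.get(rule.keyOrValue); break;
                default:           value = null;
            }
            if (value != null) {
                fields.put(rule.field, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("h1", "Welcome");
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put("Content-Type", "text/html");
        Map<String, String> crawlDbMeta = new HashMap<>();
        crawlDbMeta.put("seed-category", "news");

        List<FieldRule> rules = Arrays.asList(
            new FieldRule("collection", Source.STATIC, "demo"),
            new FieldRule("heading", Source.PARSE_META, "h1"),
            new FieldRule("mime", Source.CONTENT_META, "Content-Type"),
            new FieldRule("category", Source.CRAWLDB_META, "seed-category"));

        System.out.println(generateFields(rules, parseMeta, contentMeta, crawlDbMeta));
    }
}
```

With this split, a parsing plugin would only need to write its values into the relevant metadata, and adding a new index field would mean adding a rule rather than writing new indexing code.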
This plugin will replace index-extra (NUTCH-422) and will serve as a basis for further improvements.
Attachments
1. urlmeta to delegate indexing to index-metadata | Open | Unassigned
2. parse-meta to delegate indexing to index-metadata | Open | Unassigned