NUTCH-1264: Configurable indexing plugin (index-metadata)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.5
    • Component/s: indexer
    • Labels:
      None

      Description

      Several plugins, already distributed or proposed, do very similar things:

      • parse-meta NUTCH-809 to generate metadata fields in parse-metadata and index them
      • headings NUTCH-1005 to generate headings fields in parse-metadata and index them
      • index-extra NUTCH-422 to index configurable fields
      • urlmeta NUTCH-855 to propagate metadata from the seeds to the outlinks and index them
      • index-static NUTCH-940 to generate configurable static fields

      All of these plugins extract information from various sources and generate index fields from it, so they are largely redundant. Instead, this issue proposes a single plugin that generates configurable fields from:

      • static values
      • parse metadata
      • content metadata
      • crawldb metadata

      and lets the other plugins focus on parsing and extracting the values to index. Adding new fields becomes simpler because they rely on one stable, common plugin instead of duplicating code across several plugins.
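
      The idea of one rule-driven field generator can be sketched as below. This is only an illustration and is not taken from the attached patches: the class, enum, and method names are hypothetical, and plain maps stand in for Nutch's parse, content, and crawldb metadata objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every name here is illustrative, not Nutch API.
class ConfigurableFieldSketch {

    // The four value sources listed in the issue description.
    enum Source { STATIC, PARSE_METADATA, CONTENT_METADATA, CRAWLDB_METADATA }

    // One configured rule: fill indexField from the given source.
    // For STATIC, keyOrValue is the literal value; otherwise it is the metadata key.
    static final class FieldRule {
        final String indexField;
        final Source source;
        final String keyOrValue;

        FieldRule(String indexField, Source source, String keyOrValue) {
            this.indexField = indexField;
            this.source = source;
            this.keyOrValue = keyOrValue;
        }
    }

    // Applies every rule in order and returns the generated index fields.
    // Rules whose metadata key is missing produce no field.
    static Map<String, String> buildFields(List<FieldRule> rules,
                                           Map<String, String> parseMeta,
                                           Map<String, String> contentMeta,
                                           Map<String, String> crawlDbMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (FieldRule rule : rules) {
            String value;
            switch (rule.source) {
                case STATIC:
                    value = rule.keyOrValue;
                    break;
                case PARSE_METADATA:
                    value = parseMeta.get(rule.keyOrValue);
                    break;
                case CONTENT_METADATA:
                    value = contentMeta.get(rule.keyOrValue);
                    break;
                case CRAWLDB_METADATA:
                    value = crawlDbMeta.get(rule.keyOrValue);
                    break;
                default:
                    value = null;
            }
            if (value != null) {
                fields.put(rule.indexField, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample values, for demonstration only.
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("description", "A sample page");
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put("Content-Language", "en");
        Map<String, String> crawlDbMeta = new HashMap<>();
        crawlDbMeta.put("seedLabel", "news");

        List<FieldRule> rules = new ArrayList<>();
        rules.add(new FieldRule("collection", Source.STATIC, "demo"));
        rules.add(new FieldRule("description", Source.PARSE_METADATA, "description"));
        rules.add(new FieldRule("lang", Source.CONTENT_METADATA, "Content-Language"));
        rules.add(new FieldRule("label", Source.CRAWLDB_METADATA, "seedLabel"));

        System.out.println(buildFields(rules, parseMeta, contentMeta, crawlDbMeta));
    }
}
```

      In this sketch, adding a new index field means adding a rule rather than writing a new plugin, which is the simplification the description argues for. How the rules would be expressed in Nutch configuration is left open here.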

      This plugin will replace index-extra NUTCH-422 and will serve as a basis for further improvements.

        Attachments

        1. NUTCH-1264-trunk-v2.patch
          11 kB
          Julien Nioche
        2. NUTCH-1264-trunk.patch
          12 kB
          Julien Nioche

            People

            • Assignee: Unassigned
            • Reporter: Julien Nioche (jnioche)