NUTCH-855

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: generator, indexer
    • Labels: None
    • Patch Info: Patch Available

    Description

      This plugin enhances the NUTCH-655 patch by doing two things:
      1. Meta tags supplied with your crawl URLs at injection time are propagated to the outlinks of those URLs.
      2. When you index your URLs, the meta tags you specified are indexed alongside them and can be queried directly, provided the rest of your setup is correct.

      The flat file of URLs you inject should, per NUTCH-655, be tab-delimited in the form:
      www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
      for example (fields are separated by tabs):
      http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
      http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
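
To make the line format concrete, here is a minimal Python sketch of how one such seed line breaks down into a URL and its key=value pairs. The function name `parse_seed_line` is hypothetical and is not part of Nutch; it only illustrates the tab-delimited layout described above and is not the injector's actual parsing code.

```python
def parse_seed_line(line):
    """Split one tab-delimited seed line into (url, metadata dict).

    Hypothetical helper for illustration only; not part of Nutch.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        # Ignore empty fields and anything that is not key=value.
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        metadata[key.strip()] = value.strip()
    return url, metadata


# Example using the first seed line from this issue, with real tabs.
url, meta = parse_seed_line(
    "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably"
)
```

A check like this can be useful before injection, since fields separated by spaces instead of tabs would end up as part of the URL rather than as metadata.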

      To activate this plugin, you must modify two properties in your nutch-site.xml:
      1. plugin.includes
      add: urlmeta
      to: <value>...</value>
      e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
      2. urlmeta.tags
      Insert a comma-delimited list of meta tags. Using the above example:
      <value>corp_owner, will_it_blend, genre</value>
      Note that you do not need to include every tag with every URL. However, you must list each tag in urlmeta.tags if you want it to be propagated and later indexed.
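
Put together, the two properties might look like the fragment below. This is a sketch that assumes the standard Hadoop-style `<configuration>`/`<property>` layout used by nutch-site.xml; the `...` in plugin.includes stands for whatever other plugins your installation already lists and is left elided as in the example above.

```xml
<configuration>
  <!-- Add urlmeta to the plugins you already include. -->
  <property>
    <name>plugin.includes</name>
    <value>urlmeta|parse-tika|scoring-opic|...</value>
  </property>
  <!-- Tags to propagate to outlinks and index (from the example seed lines). -->
  <property>
    <name>urlmeta.tags</name>
    <value>corp_owner, will_it_blend, genre</value>
  </property>
</configuration>
```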

      Attachments

        1. nutch-855.txt
          19 kB
          Scott Gonyea

          People

            Assignee: Chris A. Mattmann (chrismattmann)
            Reporter: Scott Gonyea (sgonyea)
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified