  Nutch / NUTCH-855

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: generator, indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
      1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
      2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs and can be queried directly, assuming the rest of your configuration is correct.

      The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
      www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
      for example (fields are separated by tab characters):
      http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
      http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
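
      As a rough illustration of the seed-line layout described above, the sketch below splits one line on tabs, treats the first field as the URL, and reads the remaining fields as key=value pairs. The class name SeedLine and its parse method are hypothetical and are not part of Nutch or of this patch; skipping fields that have no key is also an assumption made only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: it only illustrates the seed-line layout.
class SeedLine {
    final String url;
    final Map<String, String> meta;

    SeedLine(String url, Map<String, String> meta) {
        this.url = url;
        this.meta = meta;
    }

    // Split one seed line on tabs: the first field is the URL,
    // every following field is expected to be key=value.
    static SeedLine parse(String line) {
        String[] fields = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption for this sketch: ignore fields without a key
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new SeedLine(fields[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedLine s = parse("http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably");
        System.out.println(s.url + " " + s.meta);
    }
}
```

      Note that the separator must be a real tab character; a line whose fields are separated by spaces would be read by this sketch as a single URL field with no metadata.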

      To activate this plugin, you must modify two properties in your nutch-site.xml:
      1. plugin.includes
      add: urlmeta
      to: <value>...</value>
      e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
      2. urlmeta.tags
      Insert a comma-delimited list of metatags. Using the above example:
      <value>corp_owner, will_it_blend, genre</value>
      Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
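
      To make the note above concrete, here is a minimal sketch of how a comma-delimited urlmeta.tags value could be combined with the metadata a URL actually carries. The class UrlMetaTagFilter and its methods are hypothetical and do not reproduce the plugin's code; trimming whitespace around each tag is an assumption suggested by the spaces in the example value, not confirmed behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, NOT the urlmeta plugin's implementation.
class UrlMetaTagFilter {

    // Split a comma-delimited tag list, trimming whitespace (assumed) and dropping empties.
    static List<String> parseTags(String value) {
        List<String> tags = new ArrayList<>();
        for (String raw : value.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // Keep only the configured tags that this URL's metadata actually contains.
    static Map<String, String> select(Map<String, String> urlMeta, List<String> tags) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String tag : tags) {
            String value = urlMeta.get(tag);
            if (value != null) {
                selected.put(tag, value);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tags = parseTags("corp_owner, will_it_blend, genre");

        Map<String, String> slashdot = new LinkedHashMap<>();
        slashdot.put("corp_owner", "Geeknet");
        slashdot.put("will_it_blend", "indubitably");
        slashdot.put("unlisted_tag", "ignored"); // not in the tag list, so not selected

        System.out.println(select(slashdot, tags));
    }
}
```

      In this sketch a URL without a genre value simply yields no genre entry, while a key that is missing from the tag list is dropped even though the URL carries it.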

        Attachments

        1. nutch-855.txt
          19 kB
          Scott Gonyea

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              sgonyea Scott Gonyea
            • Votes:
              0
              Watchers:
              1

                Time Tracking

                Estimated:
                168h
                Remaining:
                168h
                Logged:
                Not Specified