Nutch / NUTCH-855

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: generator, indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
      1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
      2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

      The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
      www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
      or:
      http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
      http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
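
      To make the seed-file layout concrete, the following Java sketch splits one line into its URL and its key=value pairs. It only illustrates the format described above; it is not the code Nutch's injector uses, and the names SeedLine, url and meta are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Nutch.
final class SeedLine {
    // Returns the URL, i.e. the first tab-separated field of the line.
    static String url(String line) {
        return line.split("\t")[0];
    }

    // Returns the key=value fields that follow the URL, in file order.
    static Map<String, String> meta(String line) {
        Map<String, String> meta = new LinkedHashMap<>();
        String[] fields = line.split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) { // ignore fields that have no key before '='
                meta.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // The fields must be separated by real tab characters.
        String line = "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably";
        System.out.println(url(line));
        System.out.println(meta(line));
    }
}
```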

      To activate this plugin, you must modify two properties in your nutch-site.xml:
      1. plugin.includes
      add: urlmeta
      to: <value>...</value>
      e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
      2. urlmeta.tags
      Insert a comma-delimited list of metatags, with no white-space around the commas. Using the above example:
      <value>corp_owner,will_it_blend,genre</value>
      Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
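
      The rule in the note above (only tags listed in urlmeta.tags are carried from a page to its outlinks, and a URL need not carry every tag) can be sketched in plain Java. This is not the plugin's code: it uses ordinary maps in place of Nutch's data structures, and TagPropagation and propagate are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the propagation rule; not taken from the patch.
final class TagPropagation {
    // Builds the metadata for an outlink from its parent page's metadata,
    // keeping only the entries whose key is one of the configured tags.
    static Map<String, String> propagate(Map<String, String> parentMeta,
                                         List<String> configuredTags) {
        Map<String, String> outlinkMeta = new HashMap<>();
        for (String tag : configuredTags) {
            String value = parentMeta.get(tag);
            if (value != null) { // the parent URL may not carry this tag
                outlinkMeta.put(tag, value);
            }
        }
        return outlinkMeta;
    }

    public static void main(String[] args) {
        Map<String, String> parent = new HashMap<>();
        parent.put("corp_owner", "Geeknet");
        parent.put("unlisted_tag", "dropped");
        List<String> tags = Arrays.asList("corp_owner", "will_it_blend", "genre");
        System.out.println(propagate(parent, tags));
    }
}
```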

      1. nutch-855.txt
        19 kB
        Scott Gonyea

        Activity

        Chris A. Mattmann added a comment -

        updated the docs with your new comments Scott, in r983257. Thanks!

        Chris A. Mattmann added a comment -

        My preference is that rather than reopen issues (which is a real pain for JIRA and CHANGES.txt where they have already been marked resolved) just open a new issue and link it to this.

        I see that you reopened it I'm guessing b/c you'd like the description updated in the nutch-default.xml. I'll do that now.

        Scott Gonyea added a comment - edited

        If it wasn't clear from my prior comment, the property for urlmeta in nutch-site should look like:
        <property>
        <name>urlmeta.tags</name>
        <value>tags,are,sooo,web2.0,man</value>
        </property>

        It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

        <property>
        <name>urlmeta.tags</name>
        <value></value>
        <description>
        To be used in conjunction with features introduced in NUTCH-655, which allows
        for custom metatags to be injected alongside your crawl URLs. Specifying those
        custom tags here will allow for their propagation into a page's outlinks, as
        well as allow for them to be included as part of an index.
        Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
        white-space at their boundaries, if you are using Hadoop releases prior to 0.21.
        </description>
        </property>

        Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated timewaster. I'm looking out for you, long lost not-twin.

        Scott Gonyea added a comment - edited

        FYI for anyone who might use this:

        The "urlmeta.tags" must be comma-delimited, with no white-space to pad the boundaries.
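
        As an illustration of why padding matters: if the value is split on commas without trimming (an assumption about how the list is read, suggested by the note about Hadoop releases prior to 0.21), the padded names keep their leading space and no longer equal the keys supplied with the URLs. The sketch below uses only String.split; TagListSplit and its methods are hypothetical names, not Nutch or Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of padded versus unpadded tag lists; not Nutch or Hadoop code.
final class TagListSplit {
    // Splits on commas and keeps any surrounding white-space in each item.
    static List<String> splitUntrimmed(String value) {
        return Arrays.asList(value.split(","));
    }

    // Splits on commas and discards white-space around each item.
    static List<String> splitTrimmed(String value) {
        return Arrays.asList(value.trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        String padded = "corp_owner, will_it_blend, genre";
        // With untrimmed splitting, "genre" is stored as " genre" and is not found.
        System.out.println(splitUntrimmed(padded).contains("genre"));
        System.out.println(splitTrimmed(padded).contains("genre"));
    }
}
```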

        Chris A. Mattmann added a comment -
        • Applied to 1.2-branch in r979079. Cleaned up comments, removed author tags (Nutch decided a long time ago that the project would move away from author tags), cleaned up formatting. Patch doesn't apply to trunk or Nutchbase branch because LuceneWriter doesn't exist anymore for Nutch 2.0. If someone wants to port this to Nutchbase-ville, by all means, but if so, please open a new issue for it. Thanks very much, Scott!
        Scott Gonyea added a comment -

        Updated comments, revised patch is now available. It's more robust to the nefarious "null" and his NullPointerException cabal.

        Scott Gonyea added a comment -

        This is my revised patch, with some small bug fixes.


          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Scott Gonyea
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 168h
              Remaining: 168h
              Logged: Not Specified
