Nutch / NUTCH-855

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: generator, indexer
    • Labels: None
    • Patch Info: Patch Available

      Description

      This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
      1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
      2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

      The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
      www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
      or:
      http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
      http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
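The seed-line layout above can be sketched as a small parser. This is only an illustration of the tab-delimited format, not the Nutch injector's code; the function name `parse_seed_line` is hypothetical.

```python
def parse_seed_line(line):
    """Split one tab-delimited seed line into (url, metadata dict).

    Hypothetical helper that mirrors the documented layout:
    url<TAB>key1=value1<TAB>key2=value2...
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        if "=" not in field:
            continue  # skip anything that is not a key=value pair
        key, value = field.split("=", 1)  # split on the first "=" only
        metadata[key] = value
    return url, metadata


url, meta = parse_seed_line(
    "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably"
)
print(url)
print(meta)
```

Note that the fields must be separated by real tab characters; the example lines above show them rendered as spaces.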

      To activate this plugin, you must modify two properties in your nutch-site.xml:
      1. plugin.includes
      add: urlmeta
      to: <value>...</value>
      e.g.: <value>urlmeta|parse-tika|scoring-opic|...</value>
      2. urlmeta.tags
      Insert a comma-delimited list of metatags, with no whitespace around the commas. Using the above example:
      <value>corp_owner,will_it_blend,genre</value>
      Note that you do not need to include every tag with every URL. However, you must list each tag in urlmeta.tags if you want it to be propagated and later indexed.
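As a rough model of the rule described above (only tags listed in urlmeta.tags are carried along), the following sketch filters a URL's injected metadata against the configured list. It is an assumption-laden illustration, not the plugin's implementation; `select_propagated_tags` is a hypothetical name.

```python
def select_propagated_tags(seed_metadata, urlmeta_tags):
    """Keep only the metadata keys named in the urlmeta.tags value.

    seed_metadata: dict of key/value pairs injected with a URL.
    urlmeta_tags:  the comma-delimited property value, e.g. "a,b,c".
    """
    configured = urlmeta_tags.split(",")
    return {k: v for k, v in seed_metadata.items() if k in configured}


injected = {"corp_owner": "Geeknet", "will_it_blend": "indubitably"}
# "genre" is configured but absent from this URL, which is allowed.
print(select_propagated_tags(injected, "corp_owner,genre"))
```

In this model a tag that is injected but not configured (here `will_it_blend`) is simply dropped.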

      1. nutch-855.txt
        19 kB
        Scott Gonyea

        Activity

        Scott Gonyea created issue -
        Scott Gonyea made changes -
        Field Original Value New Value
        Attachment nutch-855 [ 12449523 ]
        Scott Gonyea made changes -
        Attachment nutch-855 [ 12449523 ]
        Scott Gonyea made changes -
        Attachment nutch-855.txt [ 12449524 ]
        Scott Gonyea made changes -
        Attachment nutch-855.txt [ 12449524 ]
        Scott Gonyea made changes -
        Attachment nutch-855.txt [ 12449526 ]
        Scott Gonyea added a comment -

        This is my revised patch, with some small bug fixes.

        Scott Gonyea made changes -
        Attachment nutch-855.txt [ 12449967 ]
        Scott Gonyea made changes -
        Attachment nutch-855.txt [ 12449526 ]
        Scott Gonyea added a comment -

        Updated comments, revised patch is now available. It's more robust to the nefarious "null" and his NullPointerException cabal.

        Scott Gonyea made changes -
        Fix Version/s 2.0 [ 12314893 ]
        Description (original value):
        This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
        1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
        2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs--and can be directly queried, assuming you have done everything else correctly.

        The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
        [www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
        or:
        http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
        http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller

        To activate this plugin, you must modify two properties in your nutch-sites.xml:
        1. plugin.includes
           from: index-(basic|anchor)
           to: index-(basic|anchor|urlmeta)
        2. urlmeta.tags
           Insert a comma-delimited list of metatags. Using the above example:
           <value>corp_owner, will_it_blend, genre</value>
           Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.
        Description (new value): the text shown under Description at the top of this issue.
        Chris A. Mattmann made changes -
        Assignee Chris A. Mattmann [ chrismattmann ]
        Chris A. Mattmann made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Chris A. Mattmann added a comment -
        • Applied to 1.2-branch in r979079. Cleaned up comments, removed author tags (Nutch decided a long time ago that the project would move away from author tags), cleaned up formatting. Patch doesn't apply to trunk or Nutchbase branch because LuceneWriter doesn't exist anymore for Nutch 2.0. If someone wants to port this to Nutchbase-ville, by all means, but if so, please open a new issue for it. Thanks very much, Scott!
        Chris A. Mattmann made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Fix Version/s 2.0 [ 12314893 ]
        Resolution Fixed [ 1 ]
        Scott Gonyea added a comment - - edited

        FYI for anyone who might use this:

        The "urlmeta.tags" must be comma-delimited, with no white-space to pad the boundaries.

        Scott Gonyea added a comment - - edited

        If it wasn't clear from my prior comment, the property for urlmeta in nutch-site.xml should look like:
        <property>
          <name>urlmeta.tags</name>
          <value>tags,are,sooo,web2.0,man</value>
        </property>

        It might be nice if someone updates the "nutch-default.xml" entry for "urlmeta.tags" to the following:

        <property>
          <name>urlmeta.tags</name>
          <value></value>
          <description>
            To be used in conjunction with features introduced in NUTCH-655, which allows
            for custom metatags to be injected alongside your crawl URLs. Specifying those
            custom tags here will allow for their propagation into a page's outlinks, as
            well as allow for them to be included as part of an index.
            Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
            white-space at their boundaries, if you are using Hadoop releases prior to 0.21.
          </description>
        </property>

        Unless, of course, Nutch-1.2 ships with Hadoop-0.21... Then it's a wash. I do think it's good to note that in there, as someone may stumble across that tidbit while troubleshooting some unrelated timewaster. I'm looking out for you, long lost not-twin.
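One plausible reading of the whitespace caveat, assuming the property value is split on commas without trimming each piece (an assumption; the thread does not show the parsing code), is sketched below. Both function names are hypothetical.

```python
def split_without_trim(value):
    """Split on commas and keep any surrounding whitespace (assumed older behavior)."""
    return value.split(",")


def split_with_trim(value):
    """Split on commas and strip whitespace from each tag."""
    return [tag.strip() for tag in value.split(",")]


padded = "corp_owner, will_it_blend, genre"

# Without trimming, the padded tags keep their leading space ...
print(split_without_trim(padded))
# ... so a lookup for the tag as injected with the URL fails.
print("will_it_blend" in split_without_trim(padded))  # False
print("will_it_blend" in split_with_trim(padded))     # True
```

Under that assumption, a padded tag such as " will_it_blend" never matches the key injected with the URL, which is why writing the list without spaces is the safe choice.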

        Scott Gonyea made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Chris A. Mattmann added a comment -

        My preference is that, rather than reopening issues (which is a real pain for JIRA and CHANGES.txt, where they have already been marked resolved), you just open a new issue and link it to this one.

        I see that you reopened it, I'm guessing because you'd like the description updated in nutch-default.xml. I'll do that now.

        Chris A. Mattmann made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Chris A. Mattmann added a comment -

        Updated the docs with your new comments, Scott, in r983257. Thanks!

        Lewis John McGibbney made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                Time In Source Status   Execution Times   Last Executer          Last Execution Date
        Open -> In Progress       10d 5h 12m              1                 Chris A. Mattmann      25/Jul/10 08:00
        In Progress -> Resolved   10h 50m                 1                 Chris A. Mattmann      25/Jul/10 18:51
        Resolved -> Reopened      12d 12h 8m              1                 Scott Gonyea           07/Aug/10 06:59
        Reopened -> Resolved      9h 34m                  1                 Chris A. Mattmann      07/Aug/10 16:33
        Resolved -> Closed        1018d 12h 19m           1                 Lewis John McGibbney   22/May/13 04:53

          People

          • Assignee: Chris A. Mattmann
          • Reporter: Scott Gonyea
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 168h
              Remaining: 168h
              Logged: Not Specified
