Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: injector
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The attached patch allows metadata to be injected into the crawlDB. The input file must contain tab-separated fields, with the URL in the first column. Metadata names and values are separated by '='. An input line might look like this:
      http://www.myurl.com \t categ=value1 \t categ2=value2

      This functionality can be useful for storing external knowledge and indexing it with a custom plugin.
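
      As a rough illustration of the input format described above, the following sketch splits one seed line into its URL and its metadata pairs. It is not the code from the attached patch: the class and method names (`SeedLine`, `parse`) are hypothetical, and trimming whitespace around each field and skipping fields without a name=value shape are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: parses one line of an injector seed file.
class SeedLine {
    final String url;
    final Map<String, String> metadata;

    SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    // Splits on tabs; the first field is the URL, the rest are name=value pairs.
    static SeedLine parse(String line) {
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: skip fields that are not name=value
            }
            metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return new SeedLine(url, metadata);
    }

    public static void main(String[] args) {
        SeedLine parsed = parse("http://www.myurl.com\tcateg=value1\tcateg2=value2");
        System.out.println(parsed.url + " " + parsed.metadata);
    }
}
```

      Splitting on the first '=' only means a value may itself contain '=' characters, which seems the safer reading of the format.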

      1. NUTCH-655.v2
        3 kB
        Julien Nioche
      2. Injector.patch
        2 kB
        Julien Nioche

        Issue Links

          Activity

          Scott Gonyea added a comment -

          Claus, see my patch: NUTCH-855

          Claus Schröter added a comment -

          Hi Julien, thanks for this patch.
          Is there any way for sub-URLs to inherit the metadata, or parts of it, while crawling?
          I fiddled around with a scoring filter, but with no success.

          Cheers
          Claus

          Julien Nioche added a comment -

          Committed revision 896539

          Julien Nioche added a comment -

          Good idea. I've made the modification and documented it in the javadoc:

          The URL files contain one URL per line, optionally followed by custom metadata separated by tabs with the metadata key separated from the corresponding value by '='.
          Note that some metadata keys are reserved :

          • <i>nutch.score</i> : allows to set a custom score for a specific URL <br>
          • <i>nutch.fetchInterval</i> : allows to set a custom fetch interval for a specific URL <br>
            e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
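
          The reserved keys described above could be handled along the following lines. This is only a sketch under stated assumptions, not the committed Injector code: the class `InjectedEntry`, its fields, the default score of 1.0 and the use of -1 for "no interval given" are all hypothetical. Malformed numbers are left to throw `NumberFormatException`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the committed Injector code.
class InjectedEntry {
    String url;
    float score = 1.0f;      // assumed default when nutch.score is absent
    int fetchInterval = -1;  // assumption: -1 means "not set on this line"
    final Map<String, String> userMetadata = new LinkedHashMap<>();

    static InjectedEntry parse(String line) {
        String[] fields = line.split("\t");
        InjectedEntry entry = new InjectedEntry();
        entry.url = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields that are not key=value
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            if (key.equals("nutch.score")) {
                entry.score = Float.parseFloat(value);         // reserved key
            } else if (key.equals("nutch.fetchInterval")) {
                entry.fetchInterval = Integer.parseInt(value); // reserved key
            } else {
                entry.userMetadata.put(key, value);            // custom metadata
            }
        }
        return entry;
    }

    public static void main(String[] args) {
        InjectedEntry e = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(e.url + " score=" + e.score
            + " interval=" + e.fetchInterval + " meta=" + e.userMetadata);
    }
}
```

          In this sketch the reserved keys are consumed rather than stored, so only `userType` ends up in the custom metadata for the example line. Whether the real Injector also keeps the reserved keys as metadata is not stated in this thread.
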
          Andrzej Bialecki added a comment -

          I'm not sure about the latest addition (the score option). If we go this route, then I suggest taking the last minor step and recognizing reserved metadata keys so that they can also do other useful things, like setting the fetch interval. I.e. define and recognize "nutch.score" and "nutch.fetchInterval", and document them properly somewhere (wiki? javadoc? cmd-line synopsis?).

          Julien Nioche added a comment -

          Any objections to committing this patch?

          Julien Nioche added a comment -

          Improved version of the patch, which allows custom scores to be specified for the URLs. A score is specified by simply giving a float value instead of a name=value pair, e.g.
          http://www.lemonde.fr/ label=newspaper 10.0
          http://www.lequipe.fr/ label=sports 2.0
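
          The rule in this version of the patch (a bare float is a score, anything with '=' is metadata) might be sketched as follows. The class `ScoredSeedLine` is hypothetical, the fields are assumed to be tab-separated as in the issue description, and silently ignoring tokens that are neither a pair nor a float is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the bare-float score format; not taken from the patch.
class ScoredSeedLine {
    String url;
    Float score; // stays null when the line carries no bare float
    final Map<String, String> metadata = new LinkedHashMap<>();

    static ScoredSeedLine parse(String line) {
        String[] fields = line.split("\t");
        ScoredSeedLine result = new ScoredSeedLine();
        result.url = fields[0].trim();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq > 0) {
                // name=value pair: store as metadata
                result.metadata.put(field.substring(0, eq), field.substring(eq + 1));
            } else {
                try {
                    result.score = Float.parseFloat(field); // bare float: the score
                } catch (NumberFormatException ex) {
                    // assumption: ignore tokens that are neither a pair nor a float
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ScoredSeedLine r = parse("http://www.lemonde.fr/\tlabel=newspaper\t10.0");
        System.out.println(r.url + " " + r.metadata + " score=" + r.score);
    }
}
```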

          Doğacan Güney added a comment -

          Moved to 1.1.

          Otis Gospodnetic added a comment -

          1.1 sounds good to me.

          Doğacan Güney added a comment -

          Is everyone OK with moving this issue to target 1.1 release?

          Julien Nioche added a comment -

          I agree that https://issues.apache.org/jira/browse/NUTCH-650 would provide a cleaner way of doing this but since it is a substantial change it might take some time before it is committed.

          Regarding https://issues.apache.org/jira/browse/NUTCH-628 we could also have a similar injector for hostDBs that could be used to store / update statistics or any other information about hosts without necessarily getting it from the crawlDB.

          Otis Gospodnetic added a comment - edited

          I think we need a generic way for keeping meta data about hosts ... I think I started that somewhere in JIRA a while back.... aha: NUTCH-628

          I'm mentioning this simply because we can probably use the same or very similar mechanism for keeping meta data about hosts and individual URLs.

          But it looks like NUTCH-650 may be the way of the future.

          Doğacan Güney added a comment -

          We may discuss if tab-separation is the best way to go, but +1 for the idea from me.

          Julien Nioche added a comment -

          Patch for injecting metadata into a crawlDB


            People

            • Assignee:
              Julien Nioche
              Reporter:
              Julien Nioche
            • Votes:
              0
              Watchers:
              2
