Nutch / NUTCH-1184

Fetcher to parse and follow Nth degree outlinks

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

      Description

      Fetcher improvements to parse and follow outlinks up to a specified depth. The number of outlinks to follow can be decreased by depth using a divisor. This patch introduces three new configuration directives:

      <property>
        <name>fetcher.follow.outlinks.depth</name>
        <value>-1</value>
        <description>(EXPERT) When fetcher.parse is true and this value is greater than 0, the fetcher will extract outlinks
        and follow them until the desired depth is reached. A value of 1 means all generated pages are fetched and their first-degree
        outlinks are fetched and parsed too. Be careful: this feature is itself agnostic of the state of the CrawlDB and does not
        know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle.
        It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URLs within the same
        domain. When disabled (false), the feature is likely to follow duplicates even when depth=1.
        A value of -1 or 0 disables this feature.
        </description>
      </property>
      
      <property>
        <name>fetcher.follow.outlinks.num.links</name>
        <value>4</value>
        <description>(EXPERT) The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
        the total number of pages to fetch. This works together with fetcher.follow.outlinks.depth.divisor: with the default settings, the
        number of outlinks followed at depth 1 is 8, not 4.
        </description>
      </property>
      
      <property>
        <name>fetcher.follow.outlinks.depth.divisor</name>
        <value>2</value>
        <description>(EXPERT) The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number
        of outlinks to follow as depth increases. The formula used is: outlinks = floor(divisor / depth * num.links). This limits
        the exponential growth of the fetch list.
        </description>
      </property>
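
      The divisor formula in the description above can be made concrete with a small sketch. This is not the code from the patch: the names OutlinkBudget, maxOutlinks and worstCaseFetches are hypothetical, depth is assumed to be 1-based (as the "depth 1 is 8" remark suggests), and the worst case assumes every page yields enough distinct outlinks and nothing is deduplicated.

```java
/** Illustrative only; hypothetical names, not taken from the NUTCH-1184 patch. */
final class OutlinkBudget {

    /**
     * Outlinks followed from a page at the given 1-based depth:
     * floor(divisor / depth * numLinks).
     */
    static int maxOutlinks(int depth, int numLinks, int divisor) {
        if (depth < 1 || numLinks < 0 || divisor < 0) {
            throw new IllegalArgumentException("depth must be >= 1, other values >= 0");
        }
        // For non-negative ints, integer division is the floor of the exact quotient,
        // and multiplying first avoids floating-point rounding.
        return (divisor * numLinks) / depth;
    }

    /**
     * Upper bound on pages fetched per generated page (the page itself included)
     * when outlinks are followed up to maxDepth. A maxDepth of 0 or less means
     * the feature is disabled, so only the generated page is fetched.
     */
    static long worstCaseFetches(int maxDepth, int numLinks, int divisor) {
        long total = 1;        // the generated page
        long pagesAtLevel = 1; // pages fetched at the previous depth
        for (int depth = 1; depth <= maxDepth; depth++) {
            pagesAtLevel *= maxOutlinks(depth, numLinks, divisor);
            total += pagesAtLevel;
        }
        return total;
    }

    public static void main(String[] args) {
        int numLinks = 4; // default of fetcher.follow.outlinks.num.links
        int divisor = 2;  // default of fetcher.follow.outlinks.depth.divisor
        for (int depth = 1; depth <= 4; depth++) {
            System.out.println("depth " + depth + ": follow up to "
                + maxOutlinks(depth, numLinks, divisor) + " outlinks per page, at most "
                + worstCaseFetches(depth, numLinks, divisor) + " fetches per generated page");
        }
    }
}
```

      With the default values this gives 8, 4, 2 and 2 outlinks per page at depths 1 to 4, and an upper bound of 41 fetches per generated page at depth 2, which is why high depth settings are discouraged.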
      

      Please, do not use this unless you know what you're doing. This feature does not consider the state of the CrawlDB nor does it consider generator settings such as limiting the number of pages per (domain|host|ip) queue. It is not polite to use this feature with high settings as it can fetch many pages from the same domain including duplicates.

      Also, this feature will not work if fetcher.parse is disabled. With parsing enabled, you might want to consider not storing downloaded content.
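
      The two preconditions stated above (parsing enabled and a depth greater than 0) can be sketched as a simple check. This is an assumption-laden illustration: java.util.Properties stands in for the Hadoop Configuration object that Nutch actually uses, the class and method names are hypothetical, and the fallback of "false" for fetcher.parse is an assumption about the shipped default.

```java
import java.util.Properties;

/** Illustrative only; hypothetical names, not taken from the NUTCH-1184 patch. */
final class FollowOutlinksCheck {

    /** True only when fetcher.parse is true and the configured depth is greater than 0. */
    static boolean isEnabled(Properties conf) {
        boolean parse = Boolean.parseBoolean(
            conf.getProperty("fetcher.parse", "false").trim());
        int depth = Integer.parseInt(
            conf.getProperty("fetcher.follow.outlinks.depth", "-1").trim());
        // Both -1 and 0 disable the feature, as the property description states.
        return parse && depth > 0;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("fetcher.parse", "true");
        conf.setProperty("fetcher.follow.outlinks.depth", "1");
        System.out.println("outlink following enabled: " + isEnabled(conf));
    }
}
```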

      1. NUTCH-1184-1.5-1.patch
        9 kB
        Markus Jelsma
      2. NUTCH-1184-1.5-2.patch
        8 kB
        Markus Jelsma
      3. NUTCH-1184-1.5-3.patch
        13 kB
        Markus Jelsma
      4. NUTCH-1184-1.5-4.patch
        13 kB
        Markus Jelsma
      5. NUTCH-1184-1.5-5.patch
        23 kB
        Markus Jelsma
      6. NUTCH-1184-1.5-5-ParseData.patch
        0.5 kB
        Markus Jelsma
      7. NUTCH-1184-1.5-9-ParseOutputFormat.patch
        6 kB
        Markus Jelsma
      8. NUTCH-1185-1.5-6.patch
        31 kB
        Markus Jelsma
      9. NUTCH-1185-1.5-7.patch
        31 kB
        Markus Jelsma
      10. NUTCH-1185-1.5-8.patch
        32 kB
        Markus Jelsma
      11. NUTCH-1185-1.5-9.patch
        33 kB
        Markus Jelsma

          Activity

          Hudson added a comment -

          Integrated in Nutch-trunk #1699 (See https://builds.apache.org/job/Nutch-trunk/1699/)
          Renamed FetcherStatus to FetcherOutlinks for the new outlinks section of NUTCH-1184
          NUTCH-1184 Fetcher to parse and follow Nth degree outlinks

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221194
          Files :

          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221181
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

          Hudson added a comment -

          Integrated in nutch-trunk-maven #69 (See https://builds.apache.org/job/nutch-trunk-maven/69/)
          Renamed FetcherStatus to FetcherOutlinks for the new outlinks section of NUTCH-1184
          NUTCH-1184 Fetcher to parse and follow Nth degree outlinks

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221194
          Files :

          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221181
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/nutch-default.xml
          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
          • /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

          Markus Jelsma added a comment -

          Tested once more and successfully committed for 1.5 in rev. 1221181.

          Julien Nioche added a comment -

          Just managed to have a look and haven't seen any reason not to commit (disclaimer I haven't compiled or tested the code)
          Thanks

          Julien

          Markus Jelsma added a comment -

          If there are no further objections, I will commit this one tomorrow.

          Markus Jelsma added a comment -

          Julien, how about this one? I'd like to have this one in before some other commits mess up my diffs, as happened with NUTCH-1139.

          Markus Jelsma added a comment - edited

          This patch fixes the issue described in NUTCH-1212. It replaces the ParseOutputFormat diff of 1.5-9.

          Markus Jelsma added a comment -

          New patch [9] solves an issue of NPE in filtering. It's now required to pass filters and normalizer instances.

          Markus Jelsma added a comment -

          Another patch logging whether this feature is enabled and the maximum number of outlinks to follow per generated page.
          This patch also includes a recommendation in nutch-default to set db.ignore.external.links to true.

          Markus Jelsma added a comment -

          This patch refactors filtering and parsing of outlinks to a static method in ParseOutputFormat.

          Markus Jelsma added a comment -

          New patch includes all involved files:

          • ParseData
          • ParseOutputFormat
          • Fetcher
          • nutch-default

          It also adds a divisor to control the number of outlinks selected by depth, and includes two new reporters for outlinks (detected and followed) plus a reporter for the number of downloaded bytes.

          Markus Jelsma added a comment -

          Agreed on a static method in ParseOutputFormat for filtering and normalizing outlinks. I'll try to produce a complete patch today.

          Ferdy Galema added a comment -

          Not sure, but my best bet would be parser related, mostly because the fetcher already is pretty heavy weight at the moment. So that is ParserOutputFormat (a static method) or maybe a dedicated utility/class in the parse package that is instantiated with filters/normalizers.

          Markus Jelsma added a comment -

          Ferdy, yes, skipping filtering and normalizing is implemented as well, but it seems to be missing from the patch too. The suggestion to use common code is good, but as always it's a question of which file to put the code in.

          Julien, I know 1.4 is not released yet, but current trunk was already prepared for 1.5 by Chris, so I assumed that he is not taking a snapshot from trunk anymore. If so, then 1.4 will get several patches from 1.5.

          I will hold this issue because it needs 1) a review and 2) a proper patch that includes all involved files plus config settings. And perhaps the refactoring of the outlink filtering code. Any suggestions on where to refactor that piece to?

          Thanks guys.

          Julien Nioche added a comment -

          Markus, can you hold it until 1.4 is released? This is a substantial change and we should take time to review it carefully. I'm sorry I haven't been more available for reviewing your contribs lately; you have been far too productive for me to follow.

          Ferdy Galema added a comment -

          Hi Markus,

          This functionality is very useful. A few notes:

          do not normalize and filter in ParseOutputFormat if fetcher.parse = true

          Did you implement this too? If so please note that the current patch (both 1.5-5 files) does not include this. If you didn't, I think this would be a nice addition. (Otherwise with fetcher.parse enabled normalizing and filtering will be done twice, right?). And a minor suggestion is to perhaps create a common method for the outlink iteration, to avoid duplicate code.

          Other than that, the changes look good.

          Markus Jelsma added a comment -

          Any comments? Objections? I'd like to push this in and mark the new config directives as expert.

          Markus Jelsma added a comment -

          Patch for ParseData was missing. This now has a setOutlinks method.

          Markus Jelsma added a comment -

          New patch adds fetcher.follow.outlinks.num.links setting that limits the number of items that are followed at any given depth and defaults to 4 for politeness.

          Markus Jelsma added a comment -

          New patch does not initialize maxOutlinkDepth in fetcher.

          Markus Jelsma added a comment -

          New patch fixes the todo's and incorporates NUTCH-1174.

          Markus Jelsma added a comment -

          Additional Todo's:

          • add setOutlinks to ParseData, fetcher can then put filtered and normalized outlinks
          • do not normalize and filter in ParseOutputFormat if fetcher.parse = true

          Markus Jelsma added a comment -

          New patch uses HashSet to deduplicate the outlinks.

          Todo:

          • add normalized and filtered outlinks to ParseData
          • better checks for this feature

          Markus Jelsma added a comment -

          OK, it seems fetched outlinks are added to the CrawlDB with status DB_FETCHED after all. I likely updated with the wrong segment.

          Markus Jelsma added a comment - edited

          Here's a first attempt. It introduces a new configuration directive, fetcher.follow.outlinks.depth, that determines how deep outlinks are to be processed. A value of -1 disables this feature; it also doesn't work if fetcher.parse is false. I had to add an outlinkDepth attribute to FetchItem; without it I cannot determine at which depth an item is.

          There's one problem: fetched outlinks are not yet picked up by updateDB. Anyone with an idea? I assume those outlinks do not write a record in CrawlFetch? Another problem is deduplication, or following URLs that are already fetched. This feature is agnostic to the state of the CrawlDB.

          Oh, this patch also adds a bytes_downloaded counter to Hadoop's Reporter.


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 0
