Nutch
NUTCH-762

Alternative Generator which can generate several segments in one parse of the crawlDB

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate, update) tend to take the largest part of the time. One solution is to keep such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then updating the DB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched, but in practice this approach does not work well, as we need to read the whole crawlDB as many times as we generate segments.

      The attached patch contains an implementation of a MultiGenerator which can generate several fetchlists while reading the crawlDB only once. The MultiGenerator also differs from the Generator in other respects:

      • can filter the URLs by score
      • normalisation is optional
      • IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale
      • can cap the number of URLs per host or domain (but not per IP)
      • can choose to partition by host, domain or IP

      Typically the same unit (e.g. domain) would be used for capping the URLs and for partitioning; however, since we cannot count the maximum number of URLs by IP, another unit must be chosen when partitioning by IP.
      We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers.

      The MultiGenerator is called via: nutch org.apache.nutch.crawl.MultiGenerator ...
      with the following options:
      MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

      where most parameters are similar to those of the default Generator, apart from:
      -noNorm: disables URL normalisation
      -topN: maximum number of URLs per segment
      -maxNumSegments: the actual number of segments generated may be less than this maximum if, for example, not enough URLs are available for fetching and they fit in fewer segments
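
      For illustration, a hypothetical invocation (the paths and numbers below are made up for this example, not taken from the patch) generating up to 5 fetchlists of at most 100000 URLs each and skipping normalisation:

      nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments -topN 100000 -maxNumSegments 5 -noNorm

      This would read the crawlDB once and write the resulting segments under crawl/segments; as noted above, fewer than 5 segments may be produced if fewer URLs are eligible for fetching.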

      Please give it a try and let me know what you think of it.

      Julien Nioche
      http://www.digitalpebble.com

      1. NUTCH-762-v2.patch
        73 kB
        Julien Nioche
      2. NUTCH-762-v3.patch
        75 kB
        Julien Nioche


          Activity

          Julien Nioche added a comment -

          Patch for the MultiGenerator

          Andrzej Bialecki added a comment -

          This class offers a strict superset of the current Generator functionality. Maintaining both tools would be cumbersome and error-prone. I propose to replace Generator with MultiGenerator (under the current name Generator).

          Jesse Hires added a comment -

          It would be handy if it output a list of the segments it generated, either one at a time or a list at the end of generating all of them. This would be very useful for automation scripts that rely on parsed output for further processing.

          Julien Nioche added a comment -

          Improved version of the patch:

          • fixed a few minor bugs
          • renamed Generator into OldGenerator
          • renamed MultiGenerator into Generator
          • fixed test classes to use new Generator
          • documented parameters in nutch-default.xml
          • added names of segments to the log to facilitate integration in scripts
          • replaced PartitionUrlByHost with the more generic URLPartitioner

          I decided to keep the old version for the time being but we might as well get rid of it altogether. The new version is now used in the Crawl class.

          Would be nice if people could give it a good try before we put it in 1.1

          Thanks

          Julien

          Andrzej Bialecki added a comment -

          It appears this class is not a strict superset - the generate.update.crawldb functionality is not there. This is a regression in useful functionality, so I think it needs to be added back.

          Julien Nioche added a comment -

          If I am not mistaken, the point of having generate.update.crawldb was to mark the URLs put in a fetchlist in order to be able to do another round of generation. This is not necessary now, as we can generate several segments without writing a new crawldb.
          Am I missing something?

          Andrzej Bialecki added a comment -

          For users generating just one segment at a time this is an unexpected loss of flexibility. You can't run this version of the Generator twice without first completing both fetching and updating of all segments from the previous run, because some of the same URLs would be generated in the next round. The point of generate.update.crawldb is to be able to freely interleave generate/update steps.

          E.g. the following scenario breaks in a non-obvious way:

          • generate 10 segments
          • fetch & update 8 of them
          • realize you need more rounds due to e.g. gone pages
          • generate additional 10 segments

          ...kaboom! Now the new segments partially overlap with the two unfetched segments from the previous generation, and you are going to fetch some URLs twice.
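
          As a sketch of this scenario (paths are illustrative, and it assumes the 'nutch generate' command maps to the new Generator with no generate.update.crawldb support):

          # round 1: one pass over the crawlDb, up to 10 fetchlists
          nutch generate crawl/crawldb crawl/segments -maxNumSegments 10
          # fetch + updatedb only 8 of the 10 segments, leaving 2 unfetched
          # round 2: the crawlDb was never updated for the 2 unfetched segments,
          # so some of their URLs are selected again and end up fetched twice
          nutch generate crawl/crawldb crawl/segments -maxNumSegments 10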

          Julien Nioche added a comment -

          OK, there was indeed an assumption that the generator would not need to be called again before an update. Am happy to add back generate.update.crawldb.

          Note that this version of the Generator also differs from the original version in that

          • IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale
          • it can cap the number of URLs per host or domain (but not per IP)

          We could allow more flexibility by counting per IP, again at the expense of performance. Not sure it is very useful in practice though. Since the way we count the URLs is now decoupled from the way we partition them, we can have a hybrid approach, e.g. count per domain THEN partition by IP.

          Any thoughts on whether or not we should reintroduce the counting per IP?

          Andrzej Bialecki added a comment -

          In my experience the IP-based fetching was only (rarely) needed when there was a large number of URLs from virtual hosts hosted at the same ISP. In other words, not a common case - others may have different experience depending on their typical crawl targets... IMHO we don't have to reimplement this.

          Julien Nioche added a comment -

          Yes, I came across that situation too on a large crawl where a single machine was used to host a whole range of unrelated domain names (needless to say the host of the domains was not very pleased). We can now handle such cases simply by partitioning by IP (and counting by domain).

          I will have a look at reintroducing generate.update.crawldb tomorrow.

          Julien Nioche added a comment -

          New patch which reintroduces the 'generator.update.crawldb' functionality.

          Andrzej Bialecki added a comment -

          I just noticed that the new Generator uses different config property names ("generator." vs. "generate."), and the old names are now marked with "(Deprecated)". However, this doesn't reflect reality - properties with the old names are simply ignored now, whereas "deprecated" implies that they should still work. For back-compat reasons I think they should still work - the current (admittedly awkward) prefix is good enough, and I think that changing it in a minor release would create confusion. I suggest reverting to the old names where appropriate, and adding new properties with the same prefix, i.e. "generate.".

          Julien Nioche added a comment -

          I just noticed that the new Generator uses different config property names ("generator." vs. "generate."), and the old names are now marked with "(Deprecated)". However, this doesn't reflect reality - properties with the old names are simply ignored now, whereas "deprecated" implies that they should still work.

          They will still work if we keep the old Generator as OldGenerator - which is what we assume in the patch. If we decide to get shot of the OldGenerator then yes, they should not be marked with "(Deprecated)"

          For back-compat reasons I think they should still work - the current (admittedly awkward) prefix is good enough, and I think that changing it in a minor release would create confusion. I suggest reverting to the old names where appropriate, and adding new properties with the same prefix, i.e. "generate.".

          The original assumption was that we'd keep both this version of the generator and the old one, in which case we could have used a different prefix for the properties. If we want to replace the old generator altogether - which I think would be a good option - then indeed we should discuss whether or not to align on the old prefix.

          I don't have strong feelings on whether or not to modify the prefix in a minor release.

          Andrzej Bialecki added a comment -

          If we want to replace the old generator altogether - which I think would be a good option

          I think this makes sense now, since the new Generator in your latest patch is a strict superset of the old one.

          I don't have strong feelings on whether or not to modify the prefix in a minor release.

          I do, see also here: http://en.wikipedia.org/wiki/Principle_of_least_astonishment

          IMHO it's all about breaking or not breaking existing installs after a minor upgrade. I suspect most users won't be aware of a subtle change between "generate." and "generator.", especially since the command-line of the new Generator is compatible with the old one. So they will try to use the new Generator while keeping their existing configs.

          Julien Nioche added a comment -

          The change of prefix also reflected that we now use two different parameters to specify how to count the URLs (host or domain) and the max number of URLs. We can of course maintain the old parameters as well for the sake of compatibility, except that generate.max.per.host.by.ip won't be of much use anymore as we don't count per IP.

          Have just noticed that 'crawl.gen.delay' is not documented in nutch-default.xml, and does not seem to be used outside the Generator. What is it supposed to be used for?

          Andrzej Bialecki added a comment -

          The change of prefix also reflected that we now use two different parameters to specify how to count the URLs (host or domain) and the max number of URLs. We can of course maintain the old parameters as well for the sake of compatibility, except that generate.max.per.host.by.ip won't be of much use anymore as we don't count per IP.

          Ok.

          Have just noticed that 'crawl.gen.delay' is not documented in nutch-default.xml, and does not seem to be used outside the Generator. What is it supposed to be used for?

          Ah, a bit of ancient magic... This value, expressed in days, defines how long we should keep the lock on records in the CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selection again. The default value is 7 days.
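
          As a sketch only (the wording and value below follow the "expressed in days, default 7 days" description above; the exact entry later committed to nutch-default.xml may differ), the property could be documented like this:

          <property>
            <name>crawl.gen.delay</name>
            <value>7</value>
            <description>Number of days a record selected for fetching stays locked
            in the CrawlDb. If the record is not updated within that period the lock
            expires and the URL becomes eligible for generation again.</description>
          </property>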

          Julien Nioche added a comment -

          Committed revision 926155

          Have reverted the prefix for params to 'generate.', added a description of crawl.gen.delay to nutch-default.xml, added a warning when the user specifies generate.max.per.host.by.ip, and the param generate.max.per.host is now supported.

          Thanks Andrzej for reviewing it.

          Hudson added a comment -

          Integrated in Nutch-trunk #1104 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1104/)
          fixed NPE introduced in
          : Generator can generate several segments in one parse of the crawlDB


            People

             • Assignee: Julien Nioche
             • Reporter: Julien Nioche
             • Votes: 2
             • Watchers: 3
