Nutch: NUTCH-1289

In distributed mode URLs are not partitioned

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

      Description

      In distributed mode, URLs are not partitioned to a specific machine, which means the politeness policy is voided.

      1. NUTCH-1289-v2.patch
        19 kB
        Ferdy Galema
      2. NUTCH-1289.patch
        3 kB
        Dan Rosher

        Activity

        Lewis John McGibbney added a comment -

        Hi Dan, thanks for opening this issue and for the patch. Are you using trunk at all? If so, is it possible to confirm whether this functionality is already running in trunk? If not, then we can get a patch cooked up.

        Markus Jelsma added a comment -

        In trunk, records of the same queue end up in the same fetch list, which corresponds to a single mapper.

        Lewis John McGibbney added a comment -

        Markus, what is your opinion as to which suits best? Or is it the case in Nutchgora that Dan's patch is more appropriate?

        Mathijs Homminga added a comment -

        Nice catch. The PartitionUrlByHost seems broken indeed.
        I would suggest that we use the existing o.a.n.crawl.URLPartitioner class, which supports three URL partition modes (host, domain, IP) and is used by the GeneratorJob too.

        Pros: support for different partition modes in the Fetcher + no duplicate code.
        Or is there a reason why the Fetcher has its own partition logic?

        The URLPartitioner class is a Partitioner<SelectorEntry, WebPage> rather than a Partitioner<IntWritable, FetchEntry>, but you could perhaps extract a method and use it from both classes, or create one URLPartitioner with two specific inner classes for the Generator and Fetcher.
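
The "extract a method" idea above could look roughly like the following. This is only a sketch under stated assumptions: `UrlPartitionSketch`, `Mode`, `partitionFor` and `naiveDomain` are hypothetical names, not the actual Nutch `URLPartitioner` API; the domain mode just keeps the last two host labels; and the IP mode is left out because it would need a DNS lookup.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical shared helper; NOT the real o.a.n.crawl.URLPartitioner. */
final class UrlPartitionSketch {

    /** Partition modes covered by this sketch (IP mode omitted: it needs DNS). */
    enum Mode { BY_HOST, BY_DOMAIN }

    private UrlPartitionSketch() {}

    /** Maps a URL to a partition index in [0, numPartitions). */
    static int partitionFor(String url, Mode mode, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        String key = url; // fallback key if no host can be extracted
        try {
            String host = URI.create(url).getHost();
            if (host != null) {
                host = host.toLowerCase(Locale.ROOT);
                key = (mode == Mode.BY_DOMAIN) ? naiveDomain(host) : host;
            }
        } catch (IllegalArgumentException e) {
            // Unparsable URL: keep the whole string as the key.
        }
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Assumption: "domain" is simply the last two labels of the host. */
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

With a helper of this shape, a generator-side and a fetcher-side partitioner could each pull the URL out of their own key or value type and delegate to the same method, so all URLs of one host (or domain) land in the same partition.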

        Ferdy Galema added a comment -

        This is a showstopper for the upcoming release. I will cook up a patch using your input and commit it ASAP.

        Ferdy Galema added a comment -

        Done with patch v2. It fixes the problem as described above. It also features a minor improvement: the partition code is skipped entirely when there is just one partition (for example, in local mode).

        It includes several tests, including the seed function, the different modes and signature partitioners.
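
The single-partition shortcut mentioned above might be sketched like this. It is an illustration only, not the code from the patch: `SinglePartitionShortCircuit` and its `keyHash` function are hypothetical stand-ins for whatever host, domain or IP hashing the real partitioner performs.

```java
import java.util.function.ToIntFunction;

/** Hypothetical wrapper showing the "skip when only one partition" idea. */
final class SinglePartitionShortCircuit {

    // Stand-in for the (possibly expensive) URL-to-hash computation.
    private final ToIntFunction<String> keyHash;

    SinglePartitionShortCircuit(ToIntFunction<String> keyHash) {
        this.keyHash = keyHash;
    }

    int getPartition(String url, int numPartitions) {
        if (numPartitions == 1) {
            // Only one possible answer, so no URL parsing or hashing is done.
            return 0;
        }
        // Mask the sign bit so the modulo result is never negative.
        return (keyHash.applyAsInt(url) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In local mode there is a single partition, so every record takes the early return and the hashing function is never invoked.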

        Ferdy Galema added a comment -

        Committed.

        Dan, could you verify this issue for closing?

        Dan Rosher added a comment -

        Hi Ferdy,

        Thanks for adding the tests, looks good to me,

        Cheers,
        Dan

        Hudson added a comment -

        Integrated in Nutch-nutchgora #184 (See https://builds.apache.org/job/Nutch-nutchgora/184/)
        NUTCH-1289 In distributed mode URL's are not partitioned (Revision 1297039)

        Result = SUCCESS
        ferdy :
        Files :

        • /nutch/branches/nutchgora/CHANGES.txt
        • /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/GeneratorJob.java
        • /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/URLPartitioner.java
        • /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java
        • /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/PartitionUrlByHost.java
        • /nutch/branches/nutchgora/src/test/org/apache/nutch/crawl/TestURLPartitioner.java

          People

          • Assignee: Unassigned
          • Reporter: Dan Rosher
          • Votes: 0
          • Watchers: 1
