Nutch / NUTCH-1347

fetcher politeness related to map-reduce

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels:

      Description

      When Nutch runs on Hadoop, each map task, following the map-reduce model, works only on its own data. Each fetcher map task therefore works with its own queues and knows nothing about the queues of other tasks, so it enforces the delay between successive requests and the maximum-concurrent-requests policy only on its own queues. A simple test showed that this is not a good politeness mechanism when there are multiple map tasks.
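      To illustrate the scope of the problem, here is a minimal sketch (not Nutch's actual code) of per-task politeness: the last-request timestamp for each host lives in task-local memory, so N parallel map tasks can each hit the same host once per crawl delay.

      import java.util.HashMap;
      import java.util.Map;

      // Sketch of a politeness check whose state is local to one map task's JVM.
      class LocalPolitenessQueue {
        private final Map<String, Long> lastRequest = new HashMap<>(); // task-local state only
        private final long crawlDelayMs;

        LocalPolitenessQueue(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

        synchronized boolean mayFetch(String host) {
          long now = System.currentTimeMillis();
          Long last = lastRequest.get(host);
          if (last != null && now - last < crawlDelayMs) return false; // too soon, locally
          lastRequest.put(host, now);
          return true; // but another map task may fetch the same host right now
        }
      }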

        Activity

        Julien Nioche added a comment -

        Not clear what the issue is. You can group URLs into a map input by host, domain or IP, and then into each queue based on the same criteria.
        BTW why not ask on the mailing list before filing a JIRA? You've opened quite a few - which is good - but you don't reply to comments or questions on them, which defeats the object.
        Thanks
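        For reference, the grouping Julien describes is driven by configuration; a hedged example of aligning the two criteria in conf/nutch-site.xml (property names as in Nutch 1.x defaults; values can be byHost, byDomain or byIP):

        <!-- Partition URLs into map tasks by host, so one host's URLs land in one task -->
        <property>
          <name>partition.url.mode</name>
          <value>byHost</value>
        </property>
        <!-- Queue URLs inside the fetcher by the same criterion -->
        <property>
          <name>fetcher.queue.mode</name>
          <value>byHost</value>
        </property>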

        behnam nikbakht added a comment -

        I do not understand your solution.
        When I simply add a logging line to the getFetchItem() method of the FetchItemQueue class, I see impolite requests to the same host:
        try {
          it = queue.remove(0);
          inProgress.add(it);
          // log the URL with a timestamp to check the spacing of requests per host
          System.out.println(it.url.toString() + "<<" + System.currentTimeMillis());

        We could multiply minCrawlDelay (or crawlDelay) and maxThreads by the number of map tasks, but there is no coordination between tasks, and the tasks do not receive equal numbers of URLs from each host.
        I also found a bug in the selector reduce task in the generate phase that results from this lack of coordination between tasks.
        For these problems I use a redis-server, which is a fast data store for maintaining (key, value) pairs.
        Redis maintains variables such as delay, maxThreads, ... for each host and can set them dynamically according to the rate of success and blocking for each host.
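        As a hedged sketch of that idea (not the reporter's actual code; it assumes the Jedis client and a Redis instance on localhost), a cross-task politeness gate can store a short-lived per-host lock whose TTL equals the crawl delay:

        import redis.clients.jedis.Jedis;
        import redis.clients.jedis.params.SetParams;

        // Every fetcher task, in any JVM, consults the same Redis key per host.
        public class RedisPolitenessGate {
          private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis endpoint

          /** Returns true if this task may fetch from the host now. */
          public boolean tryAcquire(String host, long crawlDelayMs) {
            // SET key value NX PX <delay>: succeeds only if no task has fetched
            // from this host within the last crawlDelayMs milliseconds.
            String reply = jedis.set("politeness:" + host, "locked",
                SetParams.setParams().nx().px(crawlDelayMs));
            return "OK".equals(reply);
          }
        }

        A task would call tryAcquire(host, delay) before taking an item from that host's queue and move on to another queue when it returns false, so the delay holds across all map tasks rather than within each one.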

        Julien Nioche added a comment -

        I do not understand your solution

        That's probably because you haven't really explained what the problem is. Are you seeing URLs from the same host being put in different queues?

        I also found a bug in the selector reduce task in the generate phase that results from this lack of coordination between tasks.

        Then please open a separate issue for this and include a clear description so that others can reproduce the problem or at least understand it.


          People

          • Assignee: Unassigned
          • Reporter: behnam nikbakht
          • Votes: 0
          • Watchers: 0
