  Nutch / NUTCH-1630

How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.1, 2.2, 2.2.1
    • Fix Version/s: 2.5
    • Component/s: None
    • Patch Info: Patch Available

    Description

      Problem Definition:
      When crawling, queue sizes are disproportionate, so fetching must wait a long time for the long-lasting queues after the shorter ones have finished. In practice, that can mean waiting a couple of days for some queues.

      Normally we cap queue size with generate.max.count, but that is a static value, while the number of URLs to be fetched grows with each depth. Giving every queue the same length does not mean all queues will finish around the same time. This problem has been raised by other users before [1], so we came up with a different approach.
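      For reference, the static limit mentioned above is the generate.max.count property in nutch-site.xml; a typical setting looks like this (the value 100 is just an illustrative choice):

```xml
<!-- nutch-site.xml: the static per-queue cap the proposal replaces -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of URLs per queue (per host, domain, or IP,
  depending on the queue mode). The default of -1 means no limit.</description>
</property>
```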

      Solution:
      Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our solution is applicable to all three modes.

      1- Define a "fetch workload of current queue" (FW) value for each queue, based on that queue's previous fetches. We calculate this as:
      FW = average response time at the previous depth * number of URLs in the current queue

      2- Calculate the harmonic mean [2] of all FWs to get the average workload of the current depth (AW).

      3- Get the length of each queue by dividing AW by that queue's previously known average response time:
      Queue length = AW / average response time

      Using this algorithm leads to a fetch phase where all queues finish around the same time.
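      The three steps above can be sketched as follows. This is a minimal illustration, not the attached patch: the class and method names (AdaptiveQueueSize, queueLengths) and the map-based inputs are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the proposed adaptive queue sizing (hypothetical names, not the actual patch). */
public class AdaptiveQueueSize {

    /**
     * avgResponseMs: average response time per queue from the previous depth (ms);
     * urlCounts: number of URLs queued for the current depth.
     */
    public static Map<String, Long> queueLengths(Map<String, Double> avgResponseMs,
                                                 Map<String, Integer> urlCounts) {
        // Step 1: fetch workload FW = avg response time * number of URLs in the queue
        Map<String, Double> fw = new LinkedHashMap<>();
        for (String q : urlCounts.keySet()) {
            fw.put(q, avgResponseMs.get(q) * urlCounts.get(q));
        }
        // Step 2: harmonic mean AW = n / sum(1 / FW_i); it dampens the few
        // very large workloads the footnote [2] mentions
        double invSum = 0.0;
        for (double w : fw.values()) {
            invSum += 1.0 / w;
        }
        double aw = fw.size() / invSum;
        // Step 3: queue length = AW / average response time of that queue
        Map<String, Long> lengths = new LinkedHashMap<>();
        for (String q : fw.keySet()) {
            lengths.put(q, Math.round(aw / avgResponseMs.get(q)));
        }
        return lengths;
    }
}
```

      For example, with two queues of 200 URLs each and average response times of 100 ms and 400 ms, the slow queue gets length 80 and the fast one 320, so each queue represents roughly the same total fetch time (about 32 seconds).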

      I will send my patch as soon as possible. Do you have any comments?

      [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
      [2] In our opinion, the harmonic mean is best in our case because our data has a few points that are much higher than the rest.

      Attachments

        1. NUTCH-1630.patch
          24 kB
          Talat Uyarer
        2. NUTCH-1630v2.patch
          35 kB
          Yasin Kılınç

          People

            Assignee: Unassigned
            Reporter: Talat Uyarer
            Votes: 2
            Watchers: 9

