  1. Nutch
  2. NUTCH-1196

Update job should impose an upper limit on the number of inlinks (nutchgora)


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Currently the nutchgora branch does not limit the number of inlinks in the update job. This results in nasty out-of-memory exceptions and timeouts once the crawl grows large. Nutch trunk already has a default limit of 10,000 inlinks, so I will implement this in nutchgora too. Nutch trunk uses a sorting mechanism in the reducer itself, but I will implement it using standard Hadoop components instead (which should be a bit faster). This means:

      The keys of the reducer will be a {url, score} tuple.
      Partitioning will be done by {url}.
      Sorting will be done by {url, score}.
      Finally, grouping will be done by {url} again.

      This ensures all identical urls end up in the same reducer, in order of score.
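      The partition/sort/group wiring described above is the classic Hadoop "secondary sort" pattern. A minimal sketch follows; the class names (UrlScorePair, SecondarySort) are illustrative and not taken from the attached patch, and plain Java methods stand in for Hadoop's WritableComparable, Partitioner, and grouping comparator:

```java
// Composite key: sort comparator uses (url, score); partitioning and
// grouping use url only, so one reduce() call sees all scores for a url.
class UrlScorePair implements Comparable<UrlScorePair> {
    final String url;
    final float score;

    UrlScorePair(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Sort by url ascending, then score descending (assumption: descending
    // score lets the reducer keep the top inlinks and cut off at the limit).
    public int compareTo(UrlScorePair o) {
        int c = url.compareTo(o.url);
        return c != 0 ? c : Float.compare(o.score, score);
    }
}

class SecondarySort {
    // Partition on the url alone, so every pair for a given url
    // is routed to the same reducer regardless of score.
    static int partition(UrlScorePair key, int numPartitions) {
        return (key.url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Group on the url alone, so consecutive sorted keys with the same
    // url are handed to a single reduce() invocation.
    static boolean sameGroup(UrlScorePair a, UrlScorePair b) {
        return a.url.equals(b.url);
    }
}
```

      Because the framework does the sorting during the shuffle, the reducer never has to buffer and sort all inlinks in memory; it can simply stop consuming values once it reaches the limit.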

      The patch should be ready by tomorrow. Please let me know if you have any comments or suggestions.

        Attachments

        1. NUTCH-1196.patch
          21 kB
          Ferdy
        2. NUTCH-1196-v2.patch
          21 kB
          Ferdy


            People

            • Assignee:
              Unassigned
            • Reporter:
              Ferdy (ferdy.g)
            • Votes:
              0
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: