Nutch / NUTCH-1772

Injector does not need merging if no pre-existing crawldb

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8
    • Fix Version/s: 1.9
    • Component/s: injector
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The injector currently works as follows:

      • MapReduce job 1 - Mapper: converts input lines into CrawlDatum objects, with normalisation and filtering
      • MapReduce job 1 - Reducer: identity reducer, so the output can still contain duplicates at this stage
      • MapReduce job 2 - Mapper: CrawlDbFilter on the existing crawldb (if any) + the output of the previous job
      • MapReduce job 2 - Reducer: deduplication

      If there is no existing crawldb (which will often be the case at injection time), we don't need the second MapReduce job: the output of job 1 can serve directly as the crawldb, provided that deduplication is done as part of its reduce step.
      If there is an existing crawldb, the reduce step of job 1 is not needed, and that job could run as map-only.
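
      The two cases above can be sketched in plain Java. This is only an illustration of the proposed decision, not the attached patch and not the Nutch or Hadoop API: the class InjectPlanSketch, the Plan type, and both methods are hypothetical names, URLs stand in for CrawlDatum entries, and keeping the first duplicate seen is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class InjectPlanSketch {

    /** Hypothetical description of how the two jobs would be configured. */
    public static final class Plan {
        /** True if job 1 runs a deduplicating reducer instead of an identity reducer. */
        public final boolean job1Deduplicates;
        /** True if job 2 (merge with the existing crawldb) has to run. */
        public final boolean runMergeJob;

        Plan(boolean job1Deduplicates, boolean runMergeJob) {
            this.job1Deduplicates = job1Deduplicates;
            this.runMergeJob = runMergeJob;
        }
    }

    /** Chooses the job layout from whether a crawldb already exists. */
    public static Plan plan(boolean crawlDbExists) {
        if (crawlDbExists) {
            // Job 2 deduplicates anyway, so job 1 can be map-only.
            return new Plan(false, true);
        }
        // No crawldb to merge with: job 1 deduplicates and its output is the crawldb.
        return new Plan(true, false);
    }

    /** Stand-in for the reduce step: one entry per URL (assumption: first one wins). */
    public static List<String> dedup(List<String> urls) {
        return new ArrayList<>(new LinkedHashSet<>(urls));
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://a.example/");

        Plan fresh = plan(false);
        System.out.println("job 1 deduplicates: " + fresh.job1Deduplicates);
        System.out.println("merge job runs: " + fresh.runMergeJob);
        if (fresh.job1Deduplicates) {
            System.out.println("crawldb entries: " + dedup(seeds));
        }
    }
}
```

      In a real Hadoop job, the map-only case would presumably correspond to configuring job 1 with zero reduce tasks; the sketch leaves that out to stay self-contained.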

        Attachments

        1. NUTCH-1772-Logging&ErrorHandling.patch
          26 kB
          Diaa
        2. NUTCH-1772.patch
          25 kB
          Julien Nioche

              People

              • Assignee:
                Unassigned
                Reporter:
                jnioche Julien Nioche
              • Votes:
                1
                Watchers:
                4
