Nutch / NUTCH-2230

Nutch doesn't index all URLs found

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.5
    • Component/s: generator
    • Labels: None
    • Environment: MongoDB with WiredTiger storage engine (3.2, but probably affects other versions as well)

    Description

      The initial query that the generator task runs against MongoDB does not force ordering by _id, so the documents come back in no guaranteed order. The ranges for the successive map-reduce queries are selected from that result, so the ranges come out wrong and some URLs are never picked up. The successive queries do appear to run in the correct order, since _id is always indexed, but they should also specify a sort explicitly, because no particular order is guaranteed otherwise. I didn't dig deep enough to see whether the root of the problem is in Nutch or in Gora, or whether it only affects MongoDB or could affect other databases as well.
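
To illustrate why a missing sort can drop URLs, here is a minimal sketch in plain Python. It is a simplified model, not the actual Nutch or Gora code: the helper names `split_ranges` and `fetch`, the chunk size, and the sample keys are all hypothetical, and it assumes range boundaries are chosen by taking every n-th key from the initial query result.

```python
def split_ranges(keys, chunk_size):
    """Pick every chunk_size-th key as a range start (hypothetical helper).

    Each range is (start, next_start); the last range has no upper bound,
    which is represented by None.
    """
    starts = keys[::chunk_size]
    return list(zip(starts, starts[1:] + [None]))


def fetch(keys, ranges):
    """Simulate the follow-up range queries: start <= key < end."""
    selected = set()
    for lo, hi in ranges:
        for key in keys:
            if key >= lo and (hi is None or key < hi):
                selected.add(key)
    return selected


# Keys as an unsorted initial query might return them (made-up order).
natural_order = ["d", "a", "f", "b", "e", "c"]

# Boundaries taken from the unsorted result: (d, f), (f, e), (e, None).
# The middle range is inverted and nothing covers keys below "d".
unsorted_hits = fetch(natural_order, split_ranges(natural_order, 2))

# Boundaries taken from the sorted result: (a, c), (c, e), (e, None).
sorted_hits = fetch(natural_order, split_ranges(sorted(natural_order), 2))

print(sorted(unsorted_hits))  # ['d', 'e', 'f']
print(sorted(sorted_hits))    # ['a', 'b', 'c', 'd', 'e', 'f']
```

In this model, sorting the initial result is the only change needed for the ranges to cover every key, which matches the suggestion above to sort explicitly by _id.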

          People

            Assignee: Unassigned
            Reporter: Aaron Cosand (acosand)