Nutch / NUTCH-2230

Nutch doesn't index all URLs found

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.5
    • Component/s: generator
    • Labels:
      None
    • Environment:

      MongoDB 3.2 with the WiredTiger storage engine (probably affects other versions as well)

      Description

      The initial query that the generator task runs against MongoDB doesn't force ordering by _id. As a result, the key ranges selected for the successive map-reduce queries are incorrect, which is why not all URLs found get indexed. The successive range queries do appear to return results in the correct order, since _id is always indexed, but they should also specify a sort explicitly, because no particular order is guaranteed otherwise. I didn't dig deep enough to see whether the root of the problem is in Nutch or Gora, or whether it affects only MongoDB or could affect other databases as well.
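
To illustrate the failure mode described above, here is a minimal, self-contained sketch in plain Python. It does not use Nutch, Gora, or a MongoDB driver, and every name in it (`split_boundaries`, `range_query`, `covered`, `IDS`, `UNORDERED_SCAN`) is hypothetical. It assumes that range boundaries are taken from the initial scan in whatever order the scan returns keys, which is an assumption about the mechanism and not something confirmed from the Nutch or Gora source.

```python
# Hypothetical model of the problem: range boundaries are sampled from an
# initial scan, then each range is fetched with a separate query.

IDS = list(range(12))  # stand-in for the _id values in the collection

# Stand-in for an initial scan that was NOT sorted by _id.
UNORDERED_SCAN = [7, 3, 11, 0, 9, 4, 1, 10, 6, 2, 8, 5]


def split_boundaries(scan, num_splits):
    """Take every n-th key of the initial scan, in scan order, as a range start."""
    step = max(1, len(scan) // num_splits)
    return [scan[i] for i in range(0, len(scan), step)][:num_splits]


def range_query(ids, start, end):
    """Emulate a follow-up query: start <= _id < end (end=None means unbounded)."""
    return sorted(k for k in ids if k >= start and (end is None or k < end))


def covered(ids, scan, num_splits):
    """Return every key fetched across all ranges derived from the scan."""
    bounds = split_boundaries(scan, num_splits)
    fetched = []
    for i, start in enumerate(bounds):
        # Each range ends where the next one starts; the last is open-ended.
        end = bounds[i + 1] if i + 1 < len(bounds) else None
        fetched.extend(range_query(ids, start, end))
    return fetched
```

With a sorted scan, `covered(IDS, sorted(IDS), 3)` returns every key exactly once. With `UNORDERED_SCAN`, the boundaries come out as 7, 9, 6, so one range is empty, keys 7 and 8 are fetched twice, and keys 0 through 5 are never fetched. In the MongoDB shell, the corresponding remedy would be to add an explicit sort such as `.sort({_id: 1})` to the initial query; where exactly that belongs in Nutch or Gora is not established here.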

            People

            • Assignee:
              Unassigned
            • Reporter:
              Aaron Cosand (acosand)
            • Votes:
              0
            • Watchers:
              2
