NUTCH-2456: Allow indexing pages/URLs not contained in CrawlDb

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: indexer
    • Labels: None

      Description

      If http.redirect.max is set to a positive value, the Fetcher follows redirects itself, creating a new CrawlDatum for the redirect target (which is not present in the CrawlDb).
      If the redirected URL is fetched and parsed, indexing hits a special case for it: its dbDatum is null. Because of the check at https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259 the document is not indexed, since it is assumed to have only inlinks (in fact it has everything except the dbDatum).
      I'm not sure what the correct fix is. It seems to me the condition should use AND instead of OR anyway, but I may not understand the original intent; it is clearly too strict as it stands.
      However, the code following that line assumes all four objects are non-null, so a patch would need to change more than just the condition (see the sketch below).
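
      A minimal sketch of the guard in question, assuming the four objects carried through IndexerMapReduce.reduce() (dbDatum, fetchDatum, parseText, parseData): the first method paraphrases the check at the linked line, the second shows one illustrative relaxation that tolerates a missing dbDatum, the case hit by redirect targets. The class and method names are hypothetical, and this is an illustration only, not necessarily the patch that was committed for 1.14.

      import org.apache.nutch.crawl.CrawlDatum;
      import org.apache.nutch.parse.ParseData;
      import org.apache.nutch.parse.ParseText;

      /** Sketch only: guards around the line linked above, not the committed fix. */
      class IndexGuardSketch {

        /** Current behaviour as described in this report: skip the key if any of
         *  the four objects is missing, on the assumption that only inlinks were
         *  collected for it. A redirect target with a null dbDatum is skipped too. */
        static boolean shouldSkipCurrent(CrawlDatum dbDatum, CrawlDatum fetchDatum,
            ParseText parseText, ParseData parseData) {
          return fetchDatum == null || dbDatum == null || parseText == null
              || parseData == null;
        }

        /** Hypothetical relaxation: a fetched and parsed redirect target has
         *  everything except a dbDatum, so require only the other three objects.
         *  The code after the guard would still need changes, since it currently
         *  assumes dbDatum is non-null. */
        static boolean shouldSkipRelaxed(CrawlDatum dbDatum, CrawlDatum fetchDatum,
            ParseText parseText, ParseData parseData) {
          return fetchDatum == null || parseText == null || parseData == null;
        }
      }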

               People

               • Assignee: Unassigned
               • Reporter: Yossi Tamari
               • Votes: 0
               • Watchers: 5
