  1. Nutch
  2. NUTCH-1658

Nutch mangles seed URLs and then reports on the mangled ones

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • None
    • 1.7
    • None
    • Environment: Ubuntu 12.04

    Description

      Note: I'm using Nutch to verify that each URI in a long list is good, so I use them all as seeds in a single-iteration crawl.

      Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled versions (which are no good) instead of the original ones (which are good). Two patterns have emerged from my tests:

      (1) If the query portion of the URI contains '//', it becomes '/', rendering the resource unfetchable. Example:

      https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0

      (2) If the URI has a trailing '.', it disappears, apparently rendering the resource unfetchable. Example:

      http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.

      Both of the above are known-good URIs. When they are used as seeds, Nutch 1.7 does not report on them; it reports instead on URIs that have been mangled as described above. In the '//' -> '/' case, Nutch reports that robot access is denied, which is probably true of the mangled URI. In the trailing '.' case, Nutch says there is no such resource, which is also true of the mangled URI, but that is not the question I was trying to get Nutch to answer.
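
      The two patterns above can be reproduced outside Nutch with a small sketch. The two regular expressions below are assumptions chosen only to mimic the reported symptoms; they are not taken from Nutch's configuration, and the names mangle and altered_seeds are hypothetical. The sketch lists which seeds such rewrites would change, which is one way to spot affected seeds before a crawl.

```python
import re

# Hypothetical rewrite rules that mimic the two symptoms in this report.
# They are assumptions for illustration, not Nutch's actual normalizer rules.
COLLAPSE_SLASHES = re.compile(r"(?<!:)/{2,}")  # '//' -> '/', except right after ':'
TRAILING_DOT = re.compile(r"\.$")              # drop a single trailing '.'


def mangle(url: str) -> str:
    """Apply the two assumed rewrites to a URL string."""
    url = COLLAPSE_SLASHES.sub("/", url)
    return TRAILING_DOT.sub("", url)


def altered_seeds(seeds):
    """Return (original, rewritten) pairs for seeds the rewrites would change."""
    return [(s, mangle(s)) for s in seeds if mangle(s) != s]


seeds = [
    # Pattern (1): '//' inside the query portion (URL from this report).
    "https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287"
    "&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/"
    "keywordSearchForms.html%3FshowingDetails=true&showingAll=false"
    "&sortProperty=agencyFormName&totalResults=1&keyword=apma"
    "&ascending=true&pageOffset=0",
    # Pattern (2): trailing '.' (URL from this report).
    "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
    # Control: a URL that neither assumed rule touches.
    "http://www.irs.gov/Individuals/",
]

for original, rewritten in altered_seeds(seeds):
    print(original)
    print("  ->", rewritten)
```

      If the rewrites in a given installation come from configurable normalizer rules, comparing each seed with its rewritten form in this way shows which seeds will be reported under a different URI.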

          People

            Assignee: Unassigned
            Reporter: Steve Newcomb
            Votes: 0
            Watchers: 1