  Nutch / NUTCH-243

Some meta-refresh urls get ignored due to matching regular expression


Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

    Description

      When a page with a meta-refresh tag is fetched, the target URL is taken at face value, with no cleanup applied. Some URLs, such as those generated by Struts, come back with a jsessionid or a query string. Examples are:

      http://www.somesite.com;jsessionid=3123123412ADBE3344...
      http://www.somesite.com?querystring=value

      RegexURLFilter rejects these URLs because they match the following exclusion rule (the leading "-" marks a pattern as an exclusion) in the regex-urlfilter.txt file:

      -[?*!@=]
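
To make the effect of this rule concrete, here is a minimal Java sketch that applies the same character class to the two example URLs. It assumes the rule is applied as an unanchored search (a match anywhere in the URL counts); the class name MetaRefreshFilterDemo and the helper isExcluded are hypothetical and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class MetaRefreshFilterDemo {
    // The rule from regex-urlfilter.txt without its leading '-' sign.
    // Inside a character class, ?, *, !, @ and = are all literal characters.
    private static final Pattern EXCLUDE = Pattern.compile("[?*!@=]");

    // Assumption: an exclusion rule fires when it matches anywhere in the URL.
    static boolean isExcluded(String url) {
        return EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            // Copied from the report, including the elided session id.
            "http://www.somesite.com;jsessionid=3123123412ADBE3344...",
            "http://www.somesite.com?querystring=value",
            // A plain URL for comparison.
            "http://www.somesite.com/index.html"
        };
        for (String url : urls) {
            System.out.println((isExcluded(url) ? "excluded  " : "accepted  ") + url);
        }
    }
}
```

Both reported URLs are excluded: the first contains "=", the second contains both "?" and "=". The plain URL contains none of the listed characters and is accepted.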

      Should these URLs be cleaned up so that they no longer match the filter rule above and can be processed, or should they continue to be ignored as they are now?
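
For illustration only, the "clean up" option could look roughly like the following Java sketch. It is one possible approach under stated assumptions, not the behaviour of Nutch or the resolution of this issue: it strips a ";jsessionid=" path parameter and then drops the query string. The class MetaRefreshCleanupSketch and the method clean are hypothetical names. Dropping the whole query string is a deliberately blunt choice that can merge distinct pages into one URL.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Nutch.
final class MetaRefreshCleanupSketch {
    // Same character class as the exclusion rule in regex-urlfilter.txt.
    private static final Pattern EXCLUDE = Pattern.compile("[?*!@=]");

    // Matches ";jsessionid=" and everything up to a '?' or '#', or to the end.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Assumption: removing the session id and the query string is acceptable.
    static String clean(String url) {
        String cleaned = JSESSIONID.matcher(url).replaceAll("");
        int queryStart = cleaned.indexOf('?');
        return queryStart >= 0 ? cleaned.substring(0, queryStart) : cleaned;
    }

    // True when the exclusion rule matches anywhere in the URL.
    static boolean isExcluded(String url) {
        return EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.somesite.com;jsessionid=3123123412ADBE3344...",
            "http://www.somesite.com?querystring=value"
        };
        for (String url : urls) {
            String cleaned = clean(url);
            System.out.println(url + " -> " + cleaned
                    + " (excluded: " + isExcluded(cleaned) + ")");
        }
    }
}
```

After cleaning, both example URLs reduce to http://www.somesite.com, which no longer contains any of the characters in the exclusion rule.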

            People

              Assignee: Unassigned
              Reporter: Dennis Kubes (musepwizard)