Nutch / NUTCH-286

Handling common error-pages as 404

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix

    Description

      Idea: Some software packages and scripts return "HTTP 200 OK" even though the requested page could not be found. An example I just found:
      http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
      That is a TYPO3 page explaining, in its standard layout and wording: "The requested page did not exist or was inaccessible."

      So my idea is that somebody could create a plugin that recognizes commonly used formulations for "page does not exist" etc. and treats such pages as 404 before they are fed into the Nutch index, even though the server responded with status 200 OK.
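
      The idea above could be sketched roughly as follows. This is only a minimal illustration in Python, not a Nutch plugin and not tied to any Nutch API: the function name `effective_status` and the phrase list are hypothetical, and only the first phrase is taken from the TYPO3 example in this issue.

```python
import re

# Hypothetical phrase list (an assumption for illustration).
# Only the first entry is quoted from the TYPO3 page mentioned in the issue.
SOFT_404_PATTERNS = [
    r"the requested page did not exist or was inaccessible",
    r"page (could )?not (be )?found",
    r"page does not exist",
]
_SOFT_404_RE = re.compile("|".join(SOFT_404_PATTERNS), re.IGNORECASE)


def effective_status(status: int, body_text: str) -> int:
    """Return 404 when a 200 response contains a known error phrase.

    Any other status code is passed through unchanged.
    """
    if status == 200 and _SOFT_404_RE.search(body_text):
        return 404
    return status
```

      A crawler could call such a check on the extracted page text before indexing and skip pages whose effective status is 404. Plain phrase matching can misfire on pages that merely discuss error messages, so the phrase list would need to be kept conservative.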

          People

            Assignee: Unassigned
            Reporter: Stefan Neufeind (neufeind)