  1. Nutch
  2. NUTCH-2215

Generator to restrict crawl to mime type

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • Patch Available

    Description

      In large crawls, a URL suffix filter alone cannot keep non-HTML content out, because URLs often hide the MIME type of the resource they point to. This patch makes the generator pass only records whose Content-Type matches a regex.
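
      The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in NUTCH-2215.patch: the class name `MimeTypeRestriction`, the `acceptUnknown` flag, and the handling of records with no Content-Type are all assumptions made for the example, and no Nutch API is used.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides whether a record may pass
// based on its Content-Type value.
class MimeTypeRestriction {
    private final Pattern allowed;
    private final boolean acceptUnknown;

    // regex: pattern the bare MIME type must fully match.
    // acceptUnknown: assumption for this sketch; whether records that have
    // no Content-Type yet (e.g. never fetched) are allowed through.
    MimeTypeRestriction(String regex, boolean acceptUnknown) {
        this.allowed = Pattern.compile(regex);
        this.acceptUnknown = acceptUnknown;
    }

    boolean accepts(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return acceptUnknown;
        }
        // Drop parameters such as "; charset=UTF-8" before matching.
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return allowed.matcher(mime).matches();
    }

    public static void main(String[] args) {
        MimeTypeRestriction r =
                new MimeTypeRestriction("text/html|application/xhtml\\+xml", true);
        System.out.println(r.accepts("text/html; charset=UTF-8"));
        System.out.println(r.accepts("application/pdf"));
    }
}
```

      Matching on the stored Content-Type rather than on the URL is what catches a PDF served from a URL with no `.pdf` suffix. Whether records without a known type should pass is a design choice; rejecting them would also exclude URLs that have never been fetched.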

      Attachments

        1. NUTCH-2215.patch
          5 kB
          Markus Jelsma
        2. NUTCH-2215.patch
          5 kB
          Markus Jelsma

        Issue Links

          Activity

            People

              Assignee:
              Unassigned
              Reporter:
              Markus Jelsma (markus17)
              Votes:
              0
              Watchers:
              2

              Dates

                Created:
                Updated:
                Resolved: