NUTCH-578: URL fetched with 403 is generated over and over again

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.9
    • Component/s: generator
    • Labels: None
    • Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware Server. I have checked out the most recent version of the trunk as of Nov 20, 2007.
    • Patch Info: Patch Available

    Description

      I have not changed the following parameter in the nutch-default.xml:

      <property>
      <name>db.fetch.retry.max</name>
      <value>3</value>
      <description>The maximum number of times a url that has encountered
      recoverable errors is generated for fetch.</description>
      </property>

      However, one URL on the site that I'm crawling, www.teachertube.com, keeps being generated over and over again for almost every segment (many more than 3 times):

      fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/

      This is a bug, right?

      Thanks.
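
      The behavior the reporter expects can be sketched as follows. This is a minimal, hypothetical model and not Nutch's actual code: the class `RetryLimitSketch`, its `Entry` record, and the methods `recordRecoverableFailure`, `shouldGenerate`, and `simulate` are invented for illustration. It assumes that the 403 is counted as a recoverable error and that `db.fetch.retry.max` caps how many times such a URL is generated.

```java
import java.util.HashMap;
import java.util.Map;

public class RetryLimitSketch {
    // Mirrors the default value of db.fetch.retry.max quoted in the description.
    static final int RETRY_MAX = 3;

    enum Status { UNFETCHED, GONE }

    // Hypothetical per-URL record; this is NOT Nutch's CrawlDatum.
    static final class Entry {
        Status status = Status.UNFETCHED;
        int retries = 0;
    }

    private final Map<String, Entry> db = new HashMap<>();

    void inject(String url) {
        db.putIfAbsent(url, new Entry());
    }

    // Count one recoverable failure; give up once the limit is reached.
    void recordRecoverableFailure(String url) {
        Entry e = db.get(url);
        if (e == null) {
            return;
        }
        e.retries++;
        if (e.retries >= RETRY_MAX) {
            e.status = Status.GONE;
        }
    }

    // A URL is eligible for the next segment only while it is not marked gone.
    boolean shouldGenerate(String url) {
        Entry e = db.get(url);
        return e != null && e.status == Status.UNFETCHED;
    }

    // Runs generate/fetch rounds in which every fetch fails (assumption),
    // and returns how many times the URL was generated.
    static int simulate(String url, int rounds) {
        RetryLimitSketch sketch = new RetryLimitSketch();
        sketch.inject(url);
        int generated = 0;
        for (int i = 0; i < rounds; i++) {
            if (sketch.shouldGenerate(url)) {
                generated++;
                sketch.recordRecoverableFailure(url);
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        System.out.println(simulate("http://www.teachertube.com/images/", 10));
    }
}
```

      Under these assumptions the URL stops being generated once the limit is reached, however many rounds follow. The report describes the opposite: the URL reappears in almost every segment.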

      Attachments

        1. urls.txt (0.1 kB, Nathaniel Powell)
        2. regex-normalize.xml (2 kB, Nathaniel Powell)
        3. nutch-site.xml (3 kB, Nathaniel Powell)
        4. NUTCH-578.patch (1 kB, Emmanuel Joke)
        5. NUTCH-578_v5.patch (1 kB, Sebastian Nagel)
        6. NUTCH-578_v4.patch (2 kB, Evgeniy Serykh)
        7. NUTCH-578_v3.patch (2 kB, Dmitry Lihachev)
        8. NUTCH-578_v2.patch (2 kB, Emmanuel Joke)
        9. crawl-urlfilter.txt (2 kB, Nathaniel Powell)

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Nathaniel Powell (npowell)
              Votes: 2
              Watchers: 4
