Nutch / NUTCH-1941

Optional rolling http.agent.name's

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.9
    • Fix Version/s: 1.10, 2.3.1
    • Component/s: fetcher, protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      In some scenarios, even when a crawler adheres to fetcher.crawl.delay, web admins can block the fetcher based merely on the crawler name.
      I propose the ability to configure rolling http.agent.name values, which could be substituted at a fixed interval, for example every 5 seconds. Successive requests to the same domain would then be sent with different http.agent.name values.
      This behavior should be off by default.
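
      The proposal above could be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class RollingAgentName, its constructor parameters, and the method nameAt are hypothetical names, and a simple round-robin over a fixed list is assumed in place of whatever selection strategy the patches use. The time is passed in explicitly so the rotation is easy to reason about.

```java
import java.util.List;

/**
 * Hypothetical helper illustrating rolling agent names.
 * Not part of Nutch; all names here are assumptions made for illustration.
 */
final class RollingAgentName {
    private final String defaultName;   // the regular http.agent.name value
    private final List<String> names;   // candidate agent names to rotate through
    private final long intervalMillis;  // how long each name stays active
    private final boolean enabled;      // rotation is opt-in, as the issue requests

    RollingAgentName(String defaultName, List<String> names,
                     long intervalMillis, boolean enabled) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("intervalMillis must be positive");
        }
        this.defaultName = defaultName;
        this.names = List.copyOf(names);
        this.intervalMillis = intervalMillis;
        this.enabled = enabled;
    }

    /** Returns the agent name to use at the given time in milliseconds. */
    String nameAt(long nowMillis) {
        // With rotation disabled (or nothing to rotate), keep the default name.
        if (!enabled || names.isEmpty()) {
            return defaultName;
        }
        // Each interval maps to one slot; slots wrap around the list.
        long slot = Math.floorMod(nowMillis / intervalMillis, (long) names.size());
        return names.get((int) slot);
    }

    /** Convenience overload using the system clock. */
    String current() {
        return nameAt(System.currentTimeMillis());
    }
}
```

      With a 5000 ms interval and three candidate names, requests made 5 seconds apart would carry different agent names, while a disabled instance always returns the default name.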

      Attachments

        1. nutch.patch
          46 kB
          Asitang Mishra
        2. NUTCH-1941-ver1.patch
          3 kB
          Asitang Mishra
        3. agent.names.txt
          199 kB
          Lewis John McGibbney
        4. NUTCH-1941-ITR2.patch
          3 kB
          Asitang Mishra
        5. NUTCH-1941-itr3.patch
          3 kB
          Asitang Mishra
        6. NUTCH-1941-itr4.patch
          3 kB
          Asitang Mishra
        7. NUTCH-1941-v5.patch
          4 kB
          Sebastian Nagel
        8. NUTCH-1941-ver6.patch
          7 kB
          Asitang Mishra
        9. NUTCH-1941-2x-v6.patch
          6 kB
          Sebastian Nagel

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Lewis John McGibbney (lewismc)
            Votes: 0
            Watchers: 6