Nutch / NUTCH-1941

Optional rolling http.agent.name's

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.9
    • Fix Version/s: 1.10, 2.3.1
    • Component/s: fetcher, protocol
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      In some scenarios, even when adhering to fetcher.crawl.delay, web admins can block your fetcher based merely on your crawler name.
      I propose the ability to configure rolling http.agent.name values that are substituted at a fixed interval, for example every 5 seconds. Successive requests to the same domain would then be sent with different http.agent.name values.
      This behavior should be off by default.
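
      A minimal sketch of the proposed behavior, for illustration only. It is not taken from any of the attached patches. The class name RollingAgentName, its constructor parameters, and the injected clock are all hypothetical; the sketch assumes the list of agent names is already loaded and that rotation is purely time-based, with a flag that leaves it off unless explicitly enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Nutch): selects an agent name
 * according to which fixed-length time slot the clock currently falls in.
 */
public class RollingAgentName {
    private final List<String> names;
    private final long intervalMillis;
    private final boolean enabled;
    private final LongSupplier clock; // returns the current time in milliseconds

    public RollingAgentName(List<String> names, long intervalMillis,
                            boolean enabled, LongSupplier clock) {
        if (names == null || names.isEmpty()) {
            throw new IllegalArgumentException("at least one agent name is required");
        }
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.names = new ArrayList<>(names); // defensive copy
        this.intervalMillis = intervalMillis;
        this.enabled = enabled;
        this.clock = clock;
    }

    /** Returns the agent name to send with a request made right now. */
    public String current() {
        if (!enabled) {
            // Rotation off: always use the first configured name.
            return names.get(0);
        }
        // Index of the current time slot, wrapped around the list size.
        long slot = Math.floorDiv(clock.getAsLong(), intervalMillis);
        return names.get((int) Math.floorMod(slot, (long) names.size()));
    }
}
```

      With a 5000 ms interval and System::currentTimeMillis as the clock, requests made more than 5 seconds apart would pick up a different name, while requests inside the same 5-second slot would share one. Passing enabled = false keeps the single-name behavior.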

        Attachments

        1. nutch.patch
          46 kB
          Asitang Mishra
        2. NUTCH-1941-ver1.patch
          3 kB
          Asitang Mishra
        3. agent.names.txt
          199 kB
          Lewis John McGibbney
        4. NUTCH-1941-ITR2.patch
          3 kB
          Asitang Mishra
        5. NUTCH-1941-itr3.patch
          3 kB
          Asitang Mishra
        6. NUTCH-1941-itr4.patch
          3 kB
          Asitang Mishra
        7. NUTCH-1941-v5.patch
          4 kB
          Sebastian Nagel
        8. NUTCH-1941-ver6.patch
          7 kB
          Asitang Mishra
        9. NUTCH-1941-2x-v6.patch
          6 kB
          Sebastian Nagel

            People

            • Assignee: Sebastian Nagel (snagel)
            • Reporter: Lewis John McGibbney (lewismc)
            • Votes: 0
            • Watchers: 6

              Dates

              • Created:
                Updated:
                Resolved: