Nutch / NUTCH-2398

Fetcher saving redirected robots.txt under redirect target URL

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

      Description

      NUTCH-2300 optionally lets the Fetcher store the robots.txt response (content and HTTP status). If '.../robots.txt' is redirected, the content of the redirect target is also stored, but with the redirect source URL as key. It should be stored under the redirect target URL instead; otherwise one of the two responses is overwritten in the segment's map file.
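
      The overwrite described above can be illustrated with a minimal sketch. It is not the Nutch code: a plain sorted map stands in for the segment's map file, only the HTTP status is stored as the value, and the class name, the `store` method, and the `www.example.com` redirect target are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class RobotsTxtKeySketch {

    /**
     * Stores the redirect response and the response of the redirect target
     * in a map keyed by URL (a stand-in for the segment's map file).
     * With keyByTarget == false both entries share the source URL as key,
     * so the second put replaces the first.
     */
    static Map<String, Integer> store(String sourceUrl, String targetUrl,
                                      boolean keyByTarget) {
        Map<String, Integer> segment = new TreeMap<>();
        // Redirect response (HTTP 301), always keyed by the source URL.
        segment.put(sourceUrl, 301);
        // Response of the redirect target (assumed to be HTTP 200 here).
        String key = keyByTarget ? targetUrl : sourceUrl;
        segment.put(key, 200);
        return segment;
    }

    public static void main(String[] args) {
        String source = "http://example.com/robots.txt";
        String target = "http://www.example.com/robots.txt"; // assumed target

        // Keyed by source URL: the 301 entry is overwritten by the 200 entry.
        System.out.println(store(source, target, false));
        // Keyed by target URL: both responses are kept.
        System.out.println(store(source, target, true));
    }
}
```

      Keying the second response by the redirect target keeps one record per URL actually fetched, which is the behaviour this issue asks for.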

        Activity

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel opened a new pull request #199: NUTCH-2398: Save content of redirected robots.txt under redirect target URL
        URL: https://github.com/apache/nutch/pull/199

        do not use original URL (http://example.com/robots.txt) to store both
        redirect response (HTTP 301) and response of redirect target

        See also commoncrawl/nutch#4.

        ----------------------------------------------------------------
        This is an automated message from the Apache Git Service.
        To respond to the message, please log on GitHub and use the
        URL above to go to the specific comment.

        For queries about this service, please contact Infrastructure at:
        users@infra.apache.org

        githubbot ASF GitHub Bot added a comment -

        sebastian-nagel closed pull request #199: NUTCH-2398: Save content of redirected robots.txt under redirect target URL
        URL: https://github.com/apache/nutch/pull/199

        wastl-nagel Sebastian Nagel added a comment -

        Committed to 1.x/master (2dc7472).

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Nutch-trunk #3434 (See https://builds.apache.org/job/Nutch-trunk/3434/)
        NUTCH-2398: Save content of redirected robots.txt under redirect target (snagel: https://github.com/apache/nutch/commit/620b85df36d0c802f333a56ca1ef7021a7935360)

        • (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java

          People

          • Assignee: Sebastian Nagel (wastl-nagel)
          • Reporter: Sebastian Nagel (wastl-nagel)
          • Votes: 0
          • Watchers: 3