Nutch / NUTCH-1008

Switch to crawler-commons version of robots.txt parsing code

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      The Bixo project has an improved version of Nutch's robots.txt parsing code.

      This was recently contributed to crawler-commons, in a format that should be independent of Bixo, Cascading, and even Hadoop.

      Nutch could switch to the crawler-commons parser and benefit from more robust parsing, better compliance with ad hoc extensions to the Robots Exclusion Protocol, and a wider community of users and developers maintaining that code.
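
      To make "robust parsing" and "ad hoc extensions" concrete, here is a minimal, self-contained sketch of a tolerant robots.txt parser. It is not the crawler-commons or Nutch implementation; the class name RobotsSketch, the Rules type, and the method names are hypothetical and exist only for illustration. It assumes simple prefix matching for Disallow, treats Crawl-delay as a representative ad hoc extension, and silently skips anything it does not understand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the crawler-commons or Nutch API.
public class RobotsSketch {

    /** Rules for one crawler: disallowed path prefixes and an optional crawl delay. */
    public static final class Rules {
        final List<String> disallow = new ArrayList<>();
        long crawlDelayMs = -1; // -1 means no Crawl-delay was given

        public boolean isAllowed(String path) {
            for (String prefix : disallow) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }

        public long getCrawlDelayMs() {
            return crawlDelayMs;
        }
    }

    /** Parses robots.txt content and returns the rules that apply to agentName. */
    public static Rules parse(String content, String agentName) {
        Rules specific = new Rules();  // rules from groups naming this agent
        Rules wildcard = new Rules();  // rules from "User-agent: *" groups
        boolean sawSpecific = false;
        Rules current = null;          // null: the current group does not apply to us
        boolean inAgentLines = false;  // consecutive User-agent lines share one group
        String agent = agentName.toLowerCase(Locale.ROOT);

        for (String raw : content.split("\\r?\\n|\\r")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || colon < 0) {
                continue; // tolerate blank and malformed lines
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    current = null; // a new group starts here
                    inAgentLines = true;
                }
                String v = value.toLowerCase(Locale.ROOT);
                if (!v.isEmpty() && agent.contains(v)) {
                    current = specific;
                    sawSpecific = true;
                } else if (v.equals("*") && current == null) {
                    current = wildcard;
                }
                continue;
            }
            inAgentLines = false;
            if (current == null) {
                continue; // directive belongs to a group for some other crawler
            }
            if (field.equals("disallow")) {
                if (!value.isEmpty()) { // an empty Disallow means "allow everything"
                    current.disallow.add(value);
                }
            } else if (field.equals("crawl-delay")) {
                try {
                    // Ad hoc extension: delay in (possibly fractional) seconds.
                    current.crawlDelayMs = Math.round(Double.parseDouble(value) * 1000);
                } catch (NumberFormatException e) {
                    // Ignore an unparseable delay rather than failing the whole file.
                }
            }
            // Any other field (Allow, Sitemap, typos, ...) is ignored in this sketch.
        }
        return sawSpecific ? specific : wildcard;
    }
}
```

      A caller would pass the fetched robots.txt body and its own agent name to RobotsSketch.parse(...), then consult isAllowed(path) and getCrawlDelayMs() before fetching. A production parser handles considerably more, such as Allow rules, wildcards in paths, and misspelled directives, which is the argument for sharing one well-tested implementation.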

            People

              Assignee: Unassigned
              Reporter: Kenneth William Krugler (kkrugler)
              Votes: 0
              Watchers: 1
