[NUTCH-1513] Support Robots.txt for Ftp urls - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.7, 2.2
Fix Version/s: 2.3, 1.8
Component/s: None
Labels:
- robots.txt

Description

As per [0], an FTP site can have a robots.txt file like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs.

In "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java":

  public RobotRules getRobotRules(Text url, CrawlDatum datum) {
    return EmptyRobotRules.RULES;
  }

It's not clear whether this was part of the design or is a bug.

[0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] : ftp://example.com/robots.txt
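
For illustration only, here is a minimal sketch of the kind of check the Ftp plugin skips today. It is not the attached patch and does not use any Nutch classes: the class `FtpRobotsSketch` and its methods `robotsUrl` and `isAllowed` are hypothetical names. It assumes the robots.txt content has already been fetched as a string, honors only `User-agent` and `Disallow` lines, and merges rules from every matching group, which is simpler than what a full robots parser would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: a deliberately small robots.txt check.
class FtpRobotsSketch {

  /** Builds the robots.txt location for an FTP host, e.g. ftp://example.com/robots.txt. */
  static String robotsUrl(String host) {
    return "ftp://" + host + "/robots.txt";
  }

  /**
   * Returns true if the given path may be fetched by the given agent.
   * Simplification: rules from all matching groups are merged, and only
   * prefix-based Disallow rules are supported (no Allow, no wildcards).
   */
  static boolean isAllowed(String robotsTxt, String agent, String path) {
    List<String> disallowed = new ArrayList<>();
    String agentLower = agent.toLowerCase(Locale.ROOT);
    boolean groupApplies = false;
    boolean prevWasAgent = false;

    for (String raw : robotsTxt.split("\\r?\\n")) {
      // Strip comments and surrounding whitespace.
      int hash = raw.indexOf('#');
      String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // blank or malformed line
      }
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();

      if (field.equals("user-agent")) {
        String valueLower = value.toLowerCase(Locale.ROOT);
        boolean match = value.equals("*")
            || (!value.isEmpty() && agentLower.contains(valueLower));
        // Consecutive User-agent lines belong to the same group.
        groupApplies = prevWasAgent ? (groupApplies || match) : match;
        prevWasAgent = true;
      } else {
        prevWasAgent = false;
        // An empty Disallow value means "nothing is disallowed".
        if (field.equals("disallow") && groupApplies && !value.isEmpty()) {
          disallowed.add(value);
        }
      }
    }

    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) {
        return false;
      }
    }
    return true;
  }
}
```

With a robots file containing `User-agent: *` followed by `Disallow: /private/`, `isAllowed` would reject `/private/file.txt` and accept `/pub/file.txt`, whereas returning `EmptyRobotRules.RULES` accepts both.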

Attachments

- NUTCH-1513.trunk.v2.patch (7 kB, Tejas Patil, 08/May/13 23:49)
- NUTCH-1513.trunk.patch (6 kB, Tejas Patil, 04/May/13 06:13)
- NUTCH-1513.2.x.v2.patch (6 kB, Tejas Patil, 08/May/13 23:49)

People

Assignee: Tejas Patil

Reporter: Tejas Patil

Votes: 0

Watchers: 5

Dates

Created: 04/Jan/13 08:56

Updated: 22/May/13 03:54

Resolved: 21/May/13 01:32