All Projects : Nutch : fetcher (Component)




Open Issues

49 unresolved issue(s).

Versions

(with open issues due to be fixed per version for this component)
   Improvement NUTCH-475 UNRESOLVED Adaptive crawl delay Major Open
   New Feature NUTCH-478 UNRESOLVED Add function for stopping FetcherThread gracefully Major Open
   New Feature NUTCH-207 UNRESOLVED Bandwidth target for fetcher rather than a thread count Major Open
   Bug NUTCH-496 UNRESOLVED ConcurrentModificationException can be thrown when getSorted() is called. Major Open
   Bug NUTCH-289 UNRESOLVED CrawlDatum should store IP address Major Open
   Improvement NUTCH-629 UNRESOLVED Detect slow and timeout servers and drop their URLs Major Open
   Bug NUTCH-755 UNRESOLVED DomainURLFilter crashes on malformed URL Major Open
   New Feature NUTCH-87 UNRESOLVED Efficient site-specific crawling for a large number of sites Major Open
   New Feature NUTCH-628 UNRESOLVED Host database to keep track of host-level information Major Open
   Bug NUTCH-283 UNRESOLVED If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads Major Open
   Bug NUTCH-709 UNRESOLVED JSParseFilter gets into an infinite loop and eats all the stack Major Open
   Improvement NUTCH-649 UNRESOLVED Log list of files found but not crawled. Major Open
   New Feature NUTCH-714 UNRESOLVED Need a SFTP and SCP Protocol Handler Major Open
   Bug NUTCH-424 UNRESOLVED NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4)) Major Open
   Improvement NUTCH-753 UNRESOLVED Prevent new Fetcher to retrieve the robots twice Major Open
   New Feature NUTCH-460 UNRESOLVED RDF parser plugin Major Open
   Bug NUTCH-644 UNRESOLVED RTF parser doesn't compile anymore Major Open
   Bug NUTCH-119 UNRESOLVED Regexp to extract outlinks incorrect Major Open
   Bug NUTCH-385 UNRESOLVED Server delay feature conflicts with maxThreadsPerHost Major Open
   Improvement NUTCH-751 UNRESOLVED Upgrade version of HttpClient Major Open
   New Feature NUTCH-185 UNRESOLVED XMLParser is configurable xml parser plugin. Major Open
   Bug NUTCH-719 UNRESOLVED fetchQueues.totalSize incorrect in Fetcher2 Major Open
   Bug NUTCH-414 UNRESOLVED parse-mp3 plugin concatenating previous tags for text field Major Open
   Improvement NUTCH-409 UNRESOLVED Add "short circuit" notion to filters to speedup mixed site/subsite crawling Minor Open
   Improvement NUTCH-740 UNRESOLVED Configuration option to override default language for fetched pages. Minor Open
   Improvement NUTCH-490 UNRESOLVED Extension point with filters for Neko HTML parser (with patch) Minor Open
   Improvement NUTCH-410 UNRESOLVED Faster RegexNormalize with more features Minor Open
   Improvement NUTCH-84 UNRESOLVED Fetcher for constrained crawls Minor Open
   Bug NUTCH-363 UNRESOLVED Fetcher normalizes everything at least twice Minor Open
   Improvement NUTCH-769 UNRESOLVED Fetcher to skip queues for URLS getting repeated exceptions Minor Open
   New Feature NUTCH-49 UNRESOLVED Flag for generate to fetch only new pages to complement the -refetchonly flag Minor Open
   Bug NUTCH-13 UNRESOLVED If dns points to 127.0.0.1, the url is also crawled Minor Open
   Improvement NUTCH-295 UNRESOLVED More description for fetcher.threads.fetch property Minor Open
   New Feature NUTCH-158 UNRESOLVED Process Sitemap data in text, rss or xml format as well as OAI-PMH Minor Open
   New Feature NUTCH-351 UNRESOLVED Protocol forward proxy Minor Open
   Improvement NUTCH-569 UNRESOLVED Protocol plugins should report progress to the fetcher Minor Open
   Bug NUTCH-98 UNRESOLVED RobotRulesParser interprets robots.txt incorrectly Minor Open
   Bug NUTCH-566 UNRESOLVED Sun's URL class has bug in creation of relative query URLs Minor Open
   Bug NUTCH-18 UNRESOLVED Windows servers include illegal characters in URLs Minor Open
   New Feature NUTCH-208 UNRESOLVED http: proxy exception list: Minor Open
   New Feature NUTCH-705 UNRESOLVED parse-rtf plugin Minor Open
   New Feature NUTCH-427 UNRESOLVED protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation. Minor Open
   Improvement NUTCH-658 UNRESOLVED Add Counter for # of doc fetched in Reporter Trivial Open
   Improvement NUTCH-113 UNRESOLVED Disable permanent DNS-to-IP caching for JVM 1.4 Trivial Open
   Improvement NUTCH-278 UNRESOLVED Fetcher-status might need clarification: kbit/s instead of kb/s shown Trivial Open
   Improvement NUTCH-182 UNRESOLVED Log when db.max configuration limits reached Trivial Open
   Improvement NUTCH-26 UNRESOLVED New Http Authentication mechanism Trivial Open
   Improvement NUTCH-100 UNRESOLVED New plugin urlfilter-db Trivial Open
   Improvement NUTCH-249 UNRESOLVED Blacklist/whitelist URL filtering Trivial Open
Unreleased 0.8.2: 1
Unreleased 1.1: 6
Unscheduled: 42

Component Summary

Open: 49 (27%)
Resolved: 5 (3%)
Closed: 126 (70%)

Open Issues

By Priority
Major: 23 (47%)
Minor: 19 (39%)
Trivial: 7 (14%)

By Assignee
Chris A. Mattmann: 2 (4%)
Dennis Kubes: 2 (4%)
Otis Gospodnetic: 2 (4%)
Sami Siren: 1 (2%)
Unassigned: 42 (86%)