Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version: 0.7.1
None
None
Environment: Linux SUSE 9.3
Description
I downloaded Nutch 0.7.1 and installed it.
Then I changed crawl-urlfilter.txt to accept apache.org.
Then I added a urllist.txt and tried crawling.
Apparently the URL was ignored, even though it matched the rule in crawl-urlfilter.txt.
guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
060115 141534 No FS indicated, using default:local
060115 141534 crawl started in: crawl-20060115141534
060115 141534 rootUrlFile = ../../urllist.txt
060115 141534 threads = 10
060115 141534 depth = 5
060115 141535 Created webdb at LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141535 Starting URL processing
060115 141535 Plugins: looking in: /home/guenter/workspace/lucene/nutch-0.7.1/plugins
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060115 141535 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
060115 141535 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
060115 141535 found resource crawl-urlfilter.txt at file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
..060115 141535 Added 0 pages
060115 141535 FetchListTool started
060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141535 Overall processing: Sorted NaN entries/second
060115 141535 FetchListTool completed
060115 141536 logging at INFO
060115 141537 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141537 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141537 Finishing update
060115 141537 Update finished
060115 141537 FetchListTool started
060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141537 Overall processing: Sorted NaN entries/second
060115 141537 FetchListTool completed
060115 141537 logging at INFO
060115 141538 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141538 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141538 Finishing update
060115 141538 Update finished
060115 141538 FetchListTool started
060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141538 Overall processing: Sorted NaN entries/second
060115 141538 FetchListTool completed
060115 141538 logging at INFO
060115 141539 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141539 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141539 Finishing update
060115 141539 Update finished
060115 141539 FetchListTool started
060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141540 Overall processing: Sorted NaN entries/second
060115 141540 FetchListTool completed
060115 141540 logging at INFO
060115 141541 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141541 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141541 Finishing update
060115 141541 Update finished
060115 141541 FetchListTool started
060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141541 Overall processing: Sorted NaN entries/second
060115 141541 FetchListTool completed
060115 141541 logging at INFO
060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141542 Finishing update
060115 141542 Update finished
060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141542 Sorting pages by url...
060115 141542 Getting updated scores and anchors from db...
060115 141542 Sorting updates by segment...
060115 141542 Updating segments...
060115 141542 Done updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141542 * Opening segment 20060115141535
060115 141542 * Indexing segment 20060115141535
060115 141542 * Optimizing index...
060115 141542 * Moving index to NFS if needed...
060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141543 * Opening segment 20060115141537
060115 141543 * Indexing segment 20060115141537
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141543 * Opening segment 20060115141538
060115 141543 * Indexing segment 20060115141538
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141543 * Opening segment 20060115141539
060115 141543 * Indexing segment 20060115141539
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141543 * Opening segment 20060115141541
060115 141543 * Indexing segment 20060115141541
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
060115 141543 done indexing
060115 141543 Reading url hashes...
060115 141543 Sorting url hashes...
060115 141543 Deleting url duplicates...
060115 141543 Deleted 0 url duplicates.
060115 141543 Reading content hashes...
060115 141543 Sorting content hashes...
060115 141543 Deleting content duplicates...
060115 141543 Deleted 0 content duplicates.
060115 141543 Duplicate deletion complete locally. Now returning to NFS...
060115 141543 DeleteDuplicates complete
060115 141543 Merging segment indexes...
060115 141543 crawl finished: crawl-20060115141534
guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin>
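
The "Added 0 pages" line in the log above means that no seed URL made it into the webdb, so every later step ran on empty data. One way to narrow this down is to check the seed URLs against the filter rules outside of Nutch. The following is only a rough sketch in Python, not Nutch's own code: it assumes the rule file uses "+regex" / "-regex" lines, that the first matching rule decides, that a URL matching no rule is rejected, and that patterns are searched for within the URL rather than matched against the whole string. The rules in EXAMPLE_RULES are hypothetical and are not the reporter's actual crawl-urlfilter.txt, and Python's regex dialect may differ from the one Nutch uses.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        # Store (include?, compiled pattern) in file order.
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; reject if none match."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# Hypothetical rule set for illustration only (an assumption, not the
# reporter's file).
EXAMPLE_RULES = r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept hosts in apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
"""

if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_RULES)
    for url in ("http://lucene.apache.org/", "http://lucene.apache.org"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

With these example rules, a seed written without a trailing slash does not match the accept rule and falls through to the final reject rule, so comparing the exact text of each line in urllist.txt against the rules may be a useful first check.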