Default installation seems to produce working entity of nutch

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.7.1
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None
    • Environment: Linux SUSE 9.3

    Description

      I downloaded Nutch 0.7.1 and installed it.
      I then edited crawl-urlfilter.txt to accept apache.org,
      added a urllist.txt, and started a crawl.
      The URL appears to have been ignored, even though it matches the rule in crawl-urlfilter.txt.
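
      To make it easier to check a seed URL against the filter rules outside of Nutch, here is a minimal Python sketch of how rules in the `+regex` / `-regex` format could be evaluated. It assumes the behaviour described in the comments of the stock crawl-urlfilter.txt: the first matching pattern decides, and a URL that matches no pattern is ignored. The rules and URLs below are hypothetical, since the attached files are not reproduced here, and `parse_rules` and `accepts` are illustrative names, not part of Nutch.

```python
import re

# Hypothetical rules in the '+regex' / '-regex' format (NOT the attached file).
RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# accept hosts in apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule wins; no match means the URL is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
for url in ("http://www.apache.org/", "http://www.apache.org", "ftp://apache.org/"):
    print(url, "->", "accepted" if accepts(rules, url) else "ignored")
```

      Under these assumed rules, a seed written without a trailing slash is ignored because the accept pattern ends in `/`. This only illustrates how a URL that looks like a match can still be filtered out; it is not a diagnosis of this report.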

      guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
      060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
      060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
      060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
      060115 141534 No FS indicated, using default:local
      060115 141534 crawl started in: crawl-20060115141534
      060115 141534 rootUrlFile = ../../urllist.txt
      060115 141534 threads = 10
      060115 141534 depth = 5
      060115 141535 Created webdb at LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141535 Starting URL processing
      060115 141535 Plugins: looking in: /home/guenter/workspace/lucene/nutch-0.7.1/plugins
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
      060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
      060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
      060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
      060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
      060115 141535 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
      060115 141535 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
      060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
      060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
      060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
      060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
      060115 141535 found resource crawl-urlfilter.txt at file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
      ..060115 141535 Added 0 pages
      060115 141535 FetchListTool started
      060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
      060115 141535 Overall processing: Sorted NaN entries/second
      060115 141535 FetchListTool completed
      060115 141536 logging at INFO
      060115 141537 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141537 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
      060115 141537 Finishing update
      060115 141537 Update finished
      060115 141537 FetchListTool started
      060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
      060115 141537 Overall processing: Sorted NaN entries/second
      060115 141537 FetchListTool completed
      060115 141537 logging at INFO
      060115 141538 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141538 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
      060115 141538 Finishing update
      060115 141538 Update finished
      060115 141538 FetchListTool started
      060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
      060115 141538 Overall processing: Sorted NaN entries/second
      060115 141538 FetchListTool completed
      060115 141538 logging at INFO
      060115 141539 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141539 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
      060115 141539 Finishing update
      060115 141539 Update finished
      060115 141539 FetchListTool started
      060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
      060115 141540 Overall processing: Sorted NaN entries/second
      060115 141540 FetchListTool completed
      060115 141540 logging at INFO
      060115 141541 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141541 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
      060115 141541 Finishing update
      060115 141541 Update finished
      060115 141541 FetchListTool started
      060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
      060115 141541 Overall processing: Sorted NaN entries/second
      060115 141541 FetchListTool completed
      060115 141541 logging at INFO
      060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141542 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
      060115 141542 Finishing update
      060115 141542 Update finished
      060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
      060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
      060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
      060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
      060115 141542 reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
      060115 141542 Sorting pages by url...
      060115 141542 Getting updated scores and anchors from db...
      060115 141542 Sorting updates by segment...
      060115 141542 Updating segments...
      060115 141542 Done updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
      060115 141542 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
      060115 141542 * Opening segment 20060115141535
      060115 141542 * Indexing segment 20060115141535
      060115 141542 * Optimizing index...
      060115 141542 * Moving index to NFS if needed...
      060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 s (NaN rec/s).
      060115 141543 done indexing
      060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
      060115 141543 * Opening segment 20060115141537
      060115 141543 * Indexing segment 20060115141537
      060115 141543 * Optimizing index...
      060115 141543 * Moving index to NFS if needed...
      060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 s (NaN rec/s).
      060115 141543 done indexing
      060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
      060115 141543 * Opening segment 20060115141538
      060115 141543 * Indexing segment 20060115141538
      060115 141543 * Optimizing index...
      060115 141543 * Moving index to NFS if needed...
      060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 s (NaN rec/s).
      060115 141543 done indexing
      060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
      060115 141543 * Opening segment 20060115141539
      060115 141543 * Indexing segment 20060115141539
      060115 141543 * Optimizing index...
      060115 141543 * Moving index to NFS if needed...
      060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
      060115 141543 done indexing
      060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
      060115 141543 * Opening segment 20060115141541
      060115 141543 * Indexing segment 20060115141541
      060115 141543 * Optimizing index...
      060115 141543 * Moving index to NFS if needed...
      060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
      060115 141543 done indexing
      060115 141543 Reading url hashes...
      060115 141543 Sorting url hashes...
      060115 141543 Deleting url duplicates...
      060115 141543 Deleted 0 url duplicates.
      060115 141543 Reading content hashes...
      060115 141543 Sorting content hashes...
      060115 141543 Deleting content duplicates...
      060115 141543 Deleted 0 content duplicates.
      060115 141543 Duplicate deletion complete locally. Now returning to NFS...
      060115 141543 DeleteDuplicates complete
      060115 141543 Merging segment indexes...
      060115 141543 crawl finished: crawl-20060115141534
      guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin>
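
      The key symptom in the output above is the "Added 0 pages" line, followed by fetch lists that each sort 0 entries. As a rough sketch, a small Python helper can pull those counts out of a saved log. The message formats are copied from the log above; the function name `summarize` and the sample string are illustrative assumptions.

```python
import re

# Message formats as they appear in the crawl log above.
ADDED_RE = re.compile(r"Added (\d+) pages")
SORTED_RE = re.compile(r"Sorted (\d+) entries")


def summarize(log_text):
    """Report how many pages were injected and the size of each fetch list."""
    added = ADDED_RE.search(log_text)
    return {
        "pages_added": int(added.group(1)) if added else None,
        # 'Sorted NaN entries/second' lines are skipped because NaN is not a digit run.
        "fetchlist_sizes": [int(n) for n in SORTED_RE.findall(log_text)],
    }


SAMPLE = """\
..060115 141535 Added 0 pages
060115 141535 FetchListTool started
060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141535 Overall processing: Sorted NaN entries/second
"""

summary = summarize(SAMPLE)
print(summary)
```

      A `pages_added` value of 0 means no seed URL reached the web database, so every later fetch, index, and dedup step has nothing to work on.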

      Attachments

        1. crawl-urlfilter.txt
          0.7 kB
          Matthias Günter
        2. urllist.txt
          0.0 kB
          Matthias Günter

          People

            Assignee: Unassigned
            Reporter: Matthias Günter (webcrawler)