[NUTCH-599] nutch crawl and index problem - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0
Component/s: None
Labels: None
Environment: hadoop-0.12.2, java jdk1.6.0

Description

First, I set the following rules in conf/crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# skip everything else
+.

Then I ran the command "bin/nutch crawl urls -dir crawled -depth 3". I can crawl http://guide.kanook.com, but I cannot crawl http://www.kapook.com, and on some sites not every page gets crawled. Why? Also, the index produced by the crawl does not seem to contain the segments file that Nutch search needs. It contains only:

-rw-r--r-- 1 nutch users 365 Jan 7 16:47 _0.fdt
-rw-r--r-- 1 nutch users 8 Jan 7 16:47 _0.fdx
-rw-r--r-- 1 nutch users 66 Jan 7 16:47 _0.fnm
-rw-r--r-- 1 nutch users 370 Jan 7 16:47 _0.frq
-rw-r--r-- 1 nutch users 9 Jan 7 16:47 _0.nrm
-rw-r--r-- 1 nutch users 611 Jan 7 16:47 _0.prx
-rw-r--r-- 1 nutch users 135 Jan 7 16:47 _0.tii
-rw-r--r-- 1 nutch users 10553 Jan 7 16:47 _0.tis
-rw-r--r-- 1 nutch users 0 Jan 7 16:47 index.done
-rw-r--r-- 1 nutch users 41 Jan 7 16:47 segments_2
-rw-r--r-- 1 nutch users 20 Jan 7 16:47 segments.gen

How can I solve this?
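
To see which URLs the filter rules in this report would let through, the rule file can be replayed outside Nutch. The following is a minimal sketch in Python, not Nutch's own code. It assumes the regex URL filter reads rules top to bottom, lets the first matching rule decide, and rejects a URL that matches no rule. Python's re module stands in for Java regular expressions here, and the loop-breaking rule is assumed to be the stock form -.*(/.+?)/.*?\1/.*?\1/. The function names parse_rules and accepts, and the sample URLs other than http://www.kapook.com/, are hypothetical.

```python
import re

# Rules taken from this report. The suffix rule is commented out in the
# report, so it is left out. The loop-breaking rule uses the stock form
# (an assumption, see the note above).
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times
-.*(/.+?)/.*?\1/.*?\1/
# last rule as set by the reporter
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL
    decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    samples = [
        "http://www.kapook.com/",
        "http://www.kapook.com/view.php?id=1",  # hypothetical query URL
        "ftp://example.com/file",               # hypothetical
        "http://example.com/a/b/a/b/a/b/",      # hypothetical looping path
    ]
    for url in samples:
        print("accept" if accepts(rules, url) else "reject", url)
```

Under these assumptions, any link containing ?, *, !, @ or = is dropped by the third rule before the final +. is reached, so pages reachable only through such links would not be fetched. The sketch only shows what the rules filter; it does not establish why a particular site was not crawled.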

People

Assignee: Dogacan Guney

Reporter: sudarat

Votes: 0

Watchers: 1

Dates

Created: 08/Jan/08 01:46

Updated: 10/Apr/09 12:29

Resolved: 08/Jan/08 07:44