Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 0.9.0
- Fix Version/s: None
- Component/s: None
- Environment: hadoop-0.12.2, java jdk1.6.0
Description
First, I set the following rules

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept everything else
+.
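The rules above are evaluated top to bottom and the first matching pattern decides: `-` rejects the URL, `+` accepts it. One way to sanity-check the filter outside Nutch is to replay the same regexes directly. Below is a minimal Python sketch; the patterns and their order are copied from the filter file above, while the `accepts` helper and the sample URLs are my own illustration (Nutch's own RegexURLFilter is Java, so this only approximates its matching behavior):

```python
import re

# Rules from conf/crawl-urlfilter.txt, in order.
# First matching rule wins: '-' rejects, '+' accepts.
RULES = [
    ('-', re.compile(r'^(file|ftp|mailto):')),
    ('-', re.compile(r'\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm'
                     r'|tgz|mov|MOV|exe|bmp|BMP)$')),
    ('-', re.compile(r'[?*!@=]')),
    ('-', re.compile(r'.*(/.+?)/.*?\1/.*?\1/')),
    ('+', re.compile(r'.')),
]

def accepts(url):
    """Return True if the filter would accept this URL."""
    for sign, pattern in RULES:
        if pattern.search(url):  # unanchored match, like the Java filter's find()
            return sign == '+'
    return False  # no rule matched: reject by default

print(accepts('http://www.kapook.com'))    # True: no '-' rule matches
print(accepts('http://example.com/a?b=c')) # False: rejected by [?*!@=]
```

Since http://www.kapook.com passes every rule here, the URL filter itself is probably not what blocks that site; the cause is more likely elsewhere (e.g. robots.txt or fetch errors).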
in conf/crawl-urlfilter.txt, and ran the command "bin/nutch crawl urls -dir crawled -depth 3". I can crawl http://guide.kanook.com, but I cannot crawl http://www.kapook.com; some web pages cannot be crawled at all. Why? Also, the index directory produced by the crawl does not contain the segments that Nutch search needs. It contains only:
-rw-r--r-- 1 nutch users   365 Jan  7 16:47 _0.fdt
-rw-r--r-- 1 nutch users     8 Jan  7 16:47 _0.fdx
-rw-r--r-- 1 nutch users    66 Jan  7 16:47 _0.fnm
-rw-r--r-- 1 nutch users   370 Jan  7 16:47 _0.frq
-rw-r--r-- 1 nutch users     9 Jan  7 16:47 _0.nrm
-rw-r--r-- 1 nutch users   611 Jan  7 16:47 _0.prx
-rw-r--r-- 1 nutch users   135 Jan  7 16:47 _0.tii
-rw-r--r-- 1 nutch users 10553 Jan  7 16:47 _0.tis
-rw-r--r-- 1 nutch users     0 Jan  7 16:47 index.done
-rw-r--r-- 1 nutch users    41 Jan  7 16:47 segments_2
-rw-r--r-- 1 nutch users    20 Jan  7 16:47 segments.gen
How can I solve this?