NUTCH-599: Nutch crawl and index problem


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Environment: hadoop-0.12.2, java jdk1.6.0

    Description

      First, I set the following

        # skip file:, ftp:, & mailto: urls
        -^(file|ftp|mailto):
        # skip image and other suffixes we can't yet parse
        #-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
        # skip URLs containing certain characters as probable queries, etc.
        -[?*!@=]
        # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
        -.*(/.+?)/.*?\1/.*?\1/
        # skip everything else
        +.

      in conf/crawl-urlfilter.txt and ran the command "bin/nutch crawl urls -dir crawled -depth 3". I can crawl http://guide.kanook.com, but I can't crawl http://www.kapook.com, and for some sites not all of the web pages get crawled. Why? Also, after the crawl the index directory does not have the segments file that Nutch search needs; it has only:

      -rw-r--r-- 1 nutch users 365 Jan 7 16:47 _0.fdt
      -rw-r--r-- 1 nutch users 8 Jan 7 16:47 _0.fdx
      -rw-r--r-- 1 nutch users 66 Jan 7 16:47 _0.fnm
      -rw-r--r-- 1 nutch users 370 Jan 7 16:47 _0.frq
      -rw-r--r-- 1 nutch users 9 Jan 7 16:47 _0.nrm
      -rw-r--r-- 1 nutch users 611 Jan 7 16:47 _0.prx
      -rw-r--r-- 1 nutch users 135 Jan 7 16:47 _0.tii
      -rw-r--r-- 1 nutch users 10553 Jan 7 16:47 _0.tis
      -rw-r--r-- 1 nutch users 0 Jan 7 16:47 index.done
      -rw-r--r-- 1 nutch users 41 Jan 7 16:47 segments_2
      -rw-r--r-- 1 nutch users 20 Jan 7 16:47 segments.gen

      How can I solve this?
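
One way to narrow this down is to check which URLs the filter rules above let through. The following is a minimal sketch, not Nutch code: it assumes the filter tries the rules from top to bottom, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is dropped. The helper names (`first_matching_rule`, `is_accepted`) and the two sample URLs with a path are hypothetical, made up for illustration.

```python
import re

# Active rules transcribed from the conf/crawl-urlfilter.txt quoted in this
# report. The suffix rule is left out because it is commented out there.
# Assumption: the loop rule is taken in its stock form,
#   -.*(/.+?)/.*?\1/.*?\1/
# Assumption: '-' rejects, '+' accepts, first matching rule wins.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]


def first_matching_rule(url, rules=RULES):
    """Return (sign, pattern) of the first rule found in the URL, or None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign, pattern
    return None


def is_accepted(url, rules=RULES):
    """A URL passes only if the first matching rule is a '+' rule."""
    hit = first_matching_rule(url, rules)
    return hit is not None and hit[0] == "+"


if __name__ == "__main__":
    samples = [
        "http://guide.kanook.com",
        "http://www.kapook.com",
        "http://www.kapook.com/page.php?id=1",  # hypothetical: query string
        "http://www.kapook.com/a/b/a/b/a/b/",   # hypothetical: repeating segment
    ]
    for url in samples:
        verdict = "accepted" if is_accepted(url) else "rejected"
        print(verdict, url, first_matching_rule(url))
```

Under these assumptions both seed URLs pass the filter, so the rules by themselves would not explain why http://www.kapook.com cannot be crawled. Any link containing `?` or `=` is rejected by the `-[?*!@=]` rule, though, which may be one reason some sites are only partly crawled.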


          People

            Assignee: Dogacan Guney (dogacan)
            Reporter: sudarat (jibjoice)
