Nutch / NUTCH-659

Help! No urls fetched for internal repository website

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None
    • Environment: Nutch 0.9, Tomcat 6.0.18, Java 1.6.0_10, CentOS 5.2

    Description

      I am new to Nutch and have set it up to search my company's internal websites. The version is nutch-2008-11-02_04-01-26.tar.

      The internal sites include several HTTP websites.

      There is also an SVN repository served over HTTPS, whose pages are XML structured with <dir> and <file> tags.

      Search on the HTTP websites works well.

      HTTPS itself works: some pages on the HTTP websites link to Word files in the SVN repository, and those files are indexed.

      However, Nutch does not index the SVN website itself. If I point the crawl only at the SVN website, the result is always: 0 urls fetched.

      My nutch-site.xml is as follows:

      <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      </property>

      # skip file:, ftp:, & mailto: urls
      -^(ftp|mailto):

      # accept hosts in MY.DOMAIN.NAME
      +^http://([a-z0-9]*\.)*smartlabs.com.au/
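
      The two rules above look like URL filter rules for the urlfilter-regex plugin. As a rough check of what they admit, here is a minimal Python sketch, not Nutch code. It assumes the filter behaves as first-match-wins with rejection when no rule matches, and it uses Python's re.search as a stand-in for Java's regex matching. The host names in the sample URLs are hypothetical. Under those assumptions the accept rule matches only http:// URLs, so an https:// URL on the same domain is rejected, which may be one reason for "0 urls fetched"; the issue itself does not confirm the cause.

```python
import re

# Rules copied from the report, in order: (sign, pattern).
# "+" accepts a URL, "-" rejects it.
RULES = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^http://([a-z0-9]*\.)*smartlabs.com.au/"),
]

# Hypothetical variant (not in the report) that also admits https.
RULES_WITH_HTTPS = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^https?://([a-z0-9]*\.)*smartlabs.com.au/"),
]


def accepts(url, rules=RULES):
    """Assumed semantics: the first rule whose pattern is found in the URL
    decides; a URL that matches no rule is rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical URLs on the reporter's domain.
    for url in (
        "http://www.smartlabs.com.au/index.html",
        "https://svn.smartlabs.com.au/repos/",
    ):
        print(url, accepts(url), accepts(url, RULES_WITH_HTTPS))
```

      If this reading is right, a seed URL that starts with https:// would be filtered out before fetching, while the variant pattern with `https?` would let it through.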

      Any help would be much appreciated. Thanks in advance.

          People

            Assignee: Unassigned
            Reporter: windflying Bryan
