  1. Nutch
  2. NUTCH-927

Sub pages are not getting crawled

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: nutchgora
    • Fix Version/s: None
    • Component/s: injector
    • Labels: None

    Description

      The goal of my program is to crawl all the pages of a site and fetch their contents. Fetching the information category by category works correctly, but the subpages are not crawled. That is, the next pages are reachable only through links at the bottom of each page, such as:

      <a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page >" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>.

      I am using the following script to crawl the site:
      $NUTCH_HOME/search/scripts/crawl.sh testcrawlreviews 5 & > crawl.log

      where 5 is the crawl depth.

      An excerpt of the script is shown below:

      cd $NUTCH_HOME
      bin/nutch inject $BASEDIR/crawldb urls
      bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments
      SEGMENT=`ls $BASEDIR/segments/ | tail -1`
      echo processing segment $SEGMENT
      bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
      bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
      done
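
      One thing that may be worth checking with a setup like this is whether the "Next Page" link survives URL filtering, since updatedb is run with -filter. The sketch below assumes the crawl uses the stock rule "-[?*!@=]" from conf/regex-urlfilter.txt unchanged; that is an assumption, not something stated in this report. The function name url_filter_verdict is hypothetical and only mimics that single rule; it does not call Nutch.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a URL contains any character that the
# stock rule "-[?*!@=]" in conf/regex-urlfilter.txt would reject.
# Assumption: the crawl uses that default rule unchanged.
url_filter_verdict() {
  case "$1" in
    *[?*!@=]*) echo "rejected" ;;   # URL contains ?, *, !, @ or =
    *)         echo "accepted" ;;   # none of those characters present
  esac
}

# The paginated link from the description, then the same page without a query string.
url_filter_verdict "http://reviews.logitech.com/7061/224/reviews.htm?page=2"
url_filter_verdict "http://reviews.logitech.com/7061/224/reviews.htm"
```

      If the first URL is reported as rejected under that assumption, the rule in regex-urlfilter.txt would need to be relaxed for the review host before links of the form ?page=N can enter the crawldb.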

          People

            Assignee: Unassigned
            Reporter: Rameez Raja (rameezraja)