Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
- Version: nutchgora
Description
The objective of my program is to crawl all the pages of a site and fetch their contents. Category-wise fetching of the information works correctly, but the sub-pages are not being crawled. That is, the next pages appear as links at the bottom of the webpage, such as:
<a href="http://reviews.logitech.com/7061/224/reviews.htm?page=2" title="Next Page >" name="BV_TrackingTag_Review_Display_NextPage">More Reviews for Z-5500 Digital 5.1 Speaker System</a>
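A likely cause is Nutch's default URL filter: the stock conf/regex-urlfilter.txt contains a rule (-[?*!@=]) that skips any URL containing query characters, which rejects the "?page=2" pagination links above. A minimal sketch of that rule's effect, assuming the default filter file is in use:

```python
import re

# Simplified model of the default skip rule in Nutch's
# conf/regex-urlfilter.txt ("-[?*!@=]"): reject URLs that
# contain any of these characters.
SKIP_QUERY_CHARS = re.compile(r"[?*!@=]")

def passes_default_filter(url):
    """Return True if the URL survives the query-character skip rule."""
    return SKIP_QUERY_CHARS.search(url) is None

# The pagination link from this report contains "?page=2",
# so the default filter drops it.
print(passes_default_filter(
    "http://reviews.logitech.com/7061/224/reviews.htm?page=2"))  # False
print(passes_default_filter(
    "http://reviews.logitech.com/7061/224/reviews.htm"))  # True
```

Commenting out or relaxing that line in regex-urlfilter.txt allows the "?page=N" links to be fetched.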
I am using the script below to crawl the site:
$NUTCH_HOME/search/scripts/crawl.sh testcrawlreviews 5 > crawl.log 2>&1 &
where 5 is the crawl depth.
Shown below is a snapshot of the script:
cd $NUTCH_HOME
bin/nutch inject $BASEDIR/crawldb urls
# One generate/fetch/updatedb round per depth level (5 here, matching the
# depth argument above); the loop opener is assumed, since the snippet
# ends with "done".
for depth in 1 2 3 4 5; do
  bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments
  SEGMENT=`ls $BASEDIR/segments/ | tail -1`
  echo processing segment $SEGMENT
  bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
  bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
done
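Whether the pagination URLs are being rejected by the configured filters can be checked directly from the command line. This is a sketch assuming a standard Nutch 1.x layout, where URLFilterChecker reads URLs on stdin and prints a "+" (accepted) or "-" (rejected) verdict per URL:

```
echo "http://reviews.logitech.com/7061/224/reviews.htm?page=2" | \
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```

A "-" verdict here would confirm that the URL filters, not the crawl depth, are preventing the sub-pages from being crawled.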