NUTCH-1199: unfetched URLs problem



    Description

We wrote a script to fetch the unfetched URLs:

      # first, dump the crawldb with readdb and extract the unfetched URLs into a text file:
      bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
      cat $SITE_DIR/tmp/dump_urls.txt/part-00000 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
      mkdir -p $SITE_DIR/tmp/unfetched_urls
      unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
      # the csv dump quotes each URL, so split on '"' and take field 2
      cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file

      unfetched_count=`cat $unfetched_urls_file | wc -l`
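
      As an aside, newer Nutch versions can filter the dump by status directly, which
      would replace the grep/awk extraction above. A minimal sketch, assuming your
      readdb supports the -status option (check the readdb usage message first):

      # dump only db_unfetched entries, skipping the manual extraction step
      bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_unfetched -format csv -status db_unfetched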
      # next, with the list of unfetched URLs in unfetched_urls.txt, use the freegen command
      # to create segments for these URLs; we cannot use the generate command because these
      # URLs were already generated previously
      # if all remaining URLs fit into a single iteration, generate one segment and finish
      if [[ $unfetched_count -lt $it_size ]]
      then
        echo "UNFETCHED $J , $unfetched_count URLs generated"
        ((J++))
        bin/nutch freegen $unfetched_urls_file $crawlseg
        # fetch, parse and update the crawldb from the newest segment
        s2=`ls -d $crawlseg/2* | tail -1`
        bin/nutch fetch $s2
        bin/nutch parse $s2
        bin/nutch updatedb $crawldb $s2
        echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
        get_new_links   # helper defined elsewhere in our crawl script
        exit
      fi
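
      After each pass it is worth checking that the number of unfetched entries actually
      went down, otherwise the loop can keep retrying permanently failing URLs. A small
      check using the standard readdb statistics output:

      # show the remaining db_unfetched count after the update
      bin/nutch readdb $crawldb -stats | grep db_unfetched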

      # if the number of URLs is greater than it_size, package them into batches of it_size
      ij=1
      while read line
      do
        let "ind = $ij / $it_size"
        # -p: the same batch directory is reused for it_size consecutive lines
        mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
        echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
        echo $ind   # debug: current batch index
        ((ij++))
        let "completed = $ij % $it_size"
        if [[ $completed -eq 0 ]]
        then
          echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
          ((J++))
          bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
          # finally fetch, parse and update the crawldb from the new segment
          s2=`ls -d $crawlseg/2* | tail -1`
          bin/nutch fetch $s2
          bin/nutch parse $s2
          # clear a stale crawldb lock left behind by a previously failed update
          rm -f $crawldb/.locked
          bin/nutch updatedb $crawldb $s2
          echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
        fi
      done < $unfetched_urls_file
      # note: a final batch smaller than it_size is left unprocessed by this loop
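
      The manual ij/ind bookkeeping above can also be replaced with coreutils split, which
      writes fixed-size chunks and leaves no unprocessed remainder. A sketch of the same
      per-batch cycle, assuming GNU split and the variables defined above:

      # split the URL list into chunks of it_size lines, one file per batch
      split -l $it_size $unfetched_urls_file $SITE_DIR/tmp/unfetched_urls/batch_
      for batch in $SITE_DIR/tmp/unfetched_urls/batch_*
      do
        bin/nutch freegen $batch $crawlseg
        s2=`ls -d $crawlseg/2* | tail -1`
        bin/nutch fetch $s2
        bin/nutch parse $s2
        bin/nutch updatedb $crawldb $s2
      done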

People

    Assignee: Unassigned
    Reporter: behnam nikbakht (behnam.nikbakht)
