NUTCH-1199: unfetched URLs problem



    Description

We wrote a script to fetch the unfetched URLs:

      # first, dump the crawldb with readdb and extract the unfetched URLs into a text file:
      bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
      cat $SITE_DIR/tmp/dump_urls.txt/part-00000 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
      mkdir -p $SITE_DIR/tmp/unfetched_urls
      unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
      # the csv dump quotes each URL, so split on '"' and take field 2
      cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file

      unfetched_count=`cat $unfetched_urls_file | wc -l`
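
      As an aside, newer Nutch versions can filter the dump by status directly, which
      would replace the grep/awk extraction above. A minimal sketch, assuming your
      readdb supports the -status option (check the readdb usage message first):

      # dump only db_unfetched entries, skipping the manual extraction step
      bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_unfetched -format csv -status db_unfetched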
      # next, with the list of unfetched URLs in unfetched_urls.txt, use the freegen command
      # to create segments for these URLs; we cannot use the generate command because these
      # URLs were already generated previously
      # if all remaining URLs fit into a single iteration, generate one segment and finish
      if [[ $unfetched_count -lt $it_size ]]
      then
        echo "UNFETCHED $J , $unfetched_count URLs generated"
        ((J++))
        bin/nutch freegen $unfetched_urls_file $crawlseg
        # fetch, parse and update the crawldb from the newest segment
        s2=`ls -d $crawlseg/2* | tail -1`
        bin/nutch fetch $s2
        bin/nutch parse $s2
        bin/nutch updatedb $crawldb $s2
        echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
        get_new_links   # helper defined elsewhere in our crawl script
        exit
      fi
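
      After each pass it is worth checking that the number of unfetched entries actually
      went down, otherwise the loop can keep retrying permanently failing URLs. A small
      check using the standard readdb statistics output:

      # show the remaining db_unfetched count after the update
      bin/nutch readdb $crawldb -stats | grep db_unfetched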

      # if the number of URLs is greater than it_size, package them into batches of it_size
      ij=1
      while read line
      do
        let "ind = $ij / $it_size"
        # -p: the same batch directory is reused for it_size consecutive lines
        mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
        echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
        echo $ind   # debug: current batch index
        ((ij++))
        let "completed = $ij % $it_size"
        if [[ $completed -eq 0 ]]
        then
          echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
          ((J++))
          bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
          # finally fetch, parse and update the crawldb from the new segment
          s2=`ls -d $crawlseg/2* | tail -1`
          bin/nutch fetch $s2
          bin/nutch parse $s2
          # clear a stale crawldb lock left behind by a previously failed update
          rm -f $crawldb/.locked
          bin/nutch updatedb $crawldb $s2
          echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
        fi
      done < $unfetched_urls_file
      # note: a final batch smaller than it_size is left unprocessed by this loop
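
      The manual ij/ind bookkeeping above can also be replaced with coreutils split, which
      writes fixed-size chunks and leaves no unprocessed remainder. A sketch of the same
      per-batch cycle, assuming GNU split and the variables defined above:

      # split the URL list into chunks of it_size lines, one file per batch
      split -l $it_size $unfetched_urls_file $SITE_DIR/tmp/unfetched_urls/batch_
      for batch in $SITE_DIR/tmp/unfetched_urls/batch_*
      do
        bin/nutch freegen $batch $crawlseg
        s2=`ls -d $crawlseg/2* | tail -1`
        bin/nutch fetch $s2
        bin/nutch parse $s2
        bin/nutch updatedb $crawldb $s2
      done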

People

    Assignee: Unassigned
    Reporter: behnam nikbakht (behnam.nikbakht)
