Description
This problem happens the second time I crawl a page.
First time:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time (re-fetch):
bin/nutch generate -topN 1000    --> the batch ID changes for all existing pages
bin/nutch fetch -all             --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
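The two crawl cycles above could be wrapped in a small script, roughly like the sketch below. It only replays the commands already listed. NUTCH, WAIT_SECONDS, and DRY_RUN are hypothetical variables I added for illustration and are not Nutch options. The pause between cycles is an assumption: it presumes db.fetch.interval.default is set to 60 seconds so the pages are due again for the second generate.

```shell
#!/bin/sh
# Sketch of the reproduction steps (two crawl cycles on the same URLs).
# NUTCH, WAIT_SECONDS and DRY_RUN are hypothetical knobs for this sketch only.
NUTCH="${NUTCH:-bin/nutch}"           # path to the nutch launcher
WAIT_SECONDS="${WAIT_SECONDS:-61}"    # assumed: just over a 60 s fetch interval
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

# Print a command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@" || exit 1
    fi
}

# One generate/fetch/parse/updatedb cycle, as in the report.
crawl_cycle() {
    run "$NUTCH" generate -topN 1000
    run "$NUTCH" fetch -all
    run "$NUTCH" parse -force -all
    run "$NUTCH" updatedb -all
}

# Inject once, crawl, wait for the pages to become due, then re-fetch.
reproduce() {
    run "$NUTCH" inject urls/
    crawl_cycle
    run sleep "$WAIT_SECONDS"
    crawl_cycle
}

reproduce
```

Run it with DRY_RUN=0 from the Nutch runtime directory to actually execute the commands, then compare the stored metadata of a page after the first and second cycle.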
To reproduce easily, add this to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute).</description>
</property>