  Nutch / NUTCH-2222

re-fetch deletes all metadata except _csh_ and _rs_

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4
    • Component/s: crawldb
    • Labels: None
    • Environment: CentOS 6, MongoDB 2.6 and MongoDB 3.0, and hbase-0.98.8-hadoop2

    Description

      This problem happens the second time I crawl a page. First time:

      bin/nutch inject urls/
      bin/nutch generate -topN 1000
      bin/nutch fetch -all
      bin/nutch parse -force -all
      bin/nutch updatedb -all
      

      Second time (re-fetch):

      bin/nutch generate -topN 1000   --> the batch ID changes for all existing pages
      bin/nutch fetch -all            --> *** metadata is deleted for all pages already crawled ***
      bin/nutch parse -force -all
      bin/nutch updatedb -all
      

      I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.

      It happens only if the page has not changed between the two fetches.

      To reproduce it easily, add the following to nutch-site.xml:

      <property>
        <name>db.fetch.interval.default</name>
        <value>60</value>
        <description>The default number of seconds between re-fetches of a page (1 minute)
        </description>
      </property>
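
      The two crawl rounds described above can be wrapped in a small script so the re-fetch is easy to repeat. This is only a sketch: it uses the same commands and flags as the report, while the NUTCH, WAIT_SECONDS, and RUN_REPRO variables and the function names are hypothetical helpers introduced here. The 90-second wait is an assumption chosen to exceed the 60-second db.fetch.interval.default set above.

```shell
#!/bin/sh
# Path to the Nutch launcher (hypothetical override; defaults to the path used in the report).
NUTCH="${NUTCH:-bin/nutch}"

# One crawl round, with the same commands and flags as in the report.
crawl_cycle() {
  "$NUTCH" generate -topN 1000 &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -force -all &&
  "$NUTCH" updatedb -all
}

# Full reproduction: inject, crawl once, wait until the pages are due again, re-crawl.
# WAIT_SECONDS is an assumed knob; it should exceed db.fetch.interval.default (60 s above).
reproduce() {
  "$NUTCH" inject urls/ &&
  crawl_cycle &&
  sleep "${WAIT_SECONDS:-90}" &&
  crawl_cycle
}

# Run only when explicitly requested, so the file can also be sourced.
if [ "${RUN_REPRO:-0}" = "1" ]; then
  reproduce
fi
```

      Running it with RUN_REPRO=1 from the Nutch runtime directory performs both rounds. Comparing the stored page record before and after the second round should show whether the metadata is lost, provided the page content has not changed in between.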
      

      Attachments

        1. TestReFetch.java
          6 kB
          Adnane B.
        2. index.html
          0.2 kB
          Adnane B.
        3. NUTCH-2222.patch
          0.5 kB
          Anas Laffet

          People

            Assignee: kamaci Furkan Kamaci
            Reporter: abenjell Adnane B.
            Votes: 1
            Watchers: 7