  1. Nutch
  2. NUTCH-2222

re-fetch deletes all metadata except _csh_ and _rs_

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4
    • Component/s: crawldb
    • Labels:
      None
    • Environment:

      CentOS 6, MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2

      Description

      This problem occurs the second time I crawl a page. First crawl:

      bin/nutch inject urls/
      bin/nutch generate -topN 1000
      bin/nutch fetch -all
      bin/nutch parse -force -all
      bin/nutch updatedb  -all
      

      Second time (re-fetch):

      bin/nutch generate -topN 1000 --> the batch ID changes for all existing pages
      bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
      bin/nutch parse -force -all
      bin/nutch updatedb  -all
      

      I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.

      It happens only if the page has not changed.

      To reproduce it easily, add the following to nutch-site.xml:

      <property>
        <name>db.fetch.interval.default</name>
        <value>60</value>
        <description>The default number of seconds between re-fetches of a page (1 minute).</description>
      </property>
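
The two crawl rounds above could be scripted roughly as follows. This is only a sketch built from the commands in this report. The `NUTCH` and `WAIT_SECONDS` variables and the `crawl_cycle` and `reproduce` function names are hypothetical, and the 61-second pause assumes the 60-second `db.fetch.interval.default` shown above, so that already-fetched pages are due again before the second round.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this report.
# NUTCH and WAIT_SECONDS are hypothetical knobs, not Nutch settings.
NUTCH="${NUTCH:-bin/nutch}"          # path to the nutch launcher (assumption)
WAIT_SECONDS="${WAIT_SECONDS:-61}"   # just over the 60 s fetch interval (assumption)

# One generate/fetch/parse/updatedb round, exactly as listed in the report.
crawl_cycle() {
    $NUTCH generate -topN 1000 &&
    $NUTCH fetch -all &&
    $NUTCH parse -force -all &&
    $NUTCH updatedb -all
}

# Inject the seed URLs, crawl once, wait for the fetch interval to
# elapse, then crawl again; the report says the metadata loss shows up
# during the second round's fetch when the page has not changed.
reproduce() {
    $NUTCH inject urls/ &&
    crawl_cycle &&
    sleep "$WAIT_SECONDS" &&
    crawl_cycle
}
```

Source the file from the Nutch runtime directory and call `reproduce`. Inspecting the stored page metadata after each round should then show whether fields other than _csh_ and _rs_ survive the re-fetch.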
      

        Attachments

        1. TestReFetch.java
          6 kB
          Adnane B.
        2. NUTCH-2222.patch
          0.5 kB
          Anas Laffet
        3. index.html
          0.2 kB
          Adnane B.

            People

            • Assignee:
              kamaci Furkan Kamaci
              Reporter:
              abenjell Adnane B.
            • Votes:
              1
              Watchers:
              8
