Nutch / NUTCH-1679

UpdateDb using batchId, link may override crawled page.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3.1
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The problem occurs with the HBase store; I am not sure about other stores.

      Suppose that in the first crawl cycle we crawl link A and get an outlink B.
      In the second cycle we crawl link B, which also has a link pointing to A.
      In the second updatedb we load only page B from the store, so A is added as a new link because updatedb doesn't know that A already exists in the store, and the new row overrides A.
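
      The sequence above can be reproduced with a toy model. This is only a sketch and not Nutch code: `Page`, the in-memory `store` map, and `updateDbWithBatch` are hypothetical stand-ins that mimic an updatedb job which loads only the rows of one batch and blindly writes a fresh row for every outlink target it has not loaded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the override; all names here are hypothetical, not Nutch classes.
class BatchOverrideDemo {
    static final String UNFETCHED = "unfetched";
    static final String FETCHED = "fetched";

    // Minimal stand-in for a stored row.
    static class Page {
        String status;
        String batchId;
        Page(String status, String batchId) {
            this.status = status;
            this.batchId = batchId;
        }
    }

    // Mimics updatedb restricted to one batch: only rows of that batch are loaded,
    // so any outlink target outside the batch looks new and is written blindly.
    static void updateDbWithBatch(Map<String, Page> store, String batchId,
                                  Map<String, List<String>> outlinks) {
        Map<String, Page> loaded = new HashMap<>();
        for (Map.Entry<String, Page> e : store.entrySet()) {
            if (batchId.equals(e.getValue().batchId)) {
                loaded.put(e.getKey(), e.getValue());
            }
        }
        for (String url : loaded.keySet()) {
            for (String target : outlinks.getOrDefault(url, List.of())) {
                if (!loaded.containsKey(target)) {
                    // Blind put: replaces whatever the store already holds for target.
                    store.put(target, new Page(UNFETCHED, null));
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Page> store = new HashMap<>();
        store.put("A", new Page(FETCHED, "batch-1")); // crawled in cycle 1
        store.put("B", new Page(FETCHED, "batch-2")); // crawled in cycle 2
        Map<String, List<String>> outlinks =
                Map.of("A", List.of("B"), "B", List.of("A"));

        updateDbWithBatch(store, "batch-2", outlinks);
        System.out.println("A status after updatedb: " + store.get("A").status);
    }
}
```

      In this model, page A ends up unfetched after the second updatedb even though it was fetched in the first cycle, which is the behaviour described in this report.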

      Until this is fixed, UpdateDb must be run without a batchId, or we must set additionsAllowed=false.

      Here is the code that creates the new page:

      page = new WebPage();
      schedule.initializeSchedule(url, page);
      page.setStatus(CrawlStatus.STATUS_UNFETCHED);
      try {
        scoringFilters.initialScore(url, page);
      } catch (ScoringFilterException e) {
        page.setScore(0.0f);
      }

      The new page overrides the old page's status, score, fetchTime, fetchInterval, retries, and metadata[CASH_KEY].

      • I think we can change something here so that a new page only updates one column, for example 'link'; if it really is a new page, we can initialize all of the above fields in the generator.
      • Or we add a checkAndPut operation to the store, so that when adding a new page we first check whether it already exists.
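
      The second option can be sketched as follows, using `ConcurrentHashMap.putIfAbsent` as a stand-in for a store-level check-and-put. This is an illustration under assumptions, not the attached patch: `PageStore`, `Page`, and `addIfAbsent` are hypothetical names, and whether the underlying store API can expose such an operation is not established in this report.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of "check, then put"; all names are hypothetical, not Nutch or Gora APIs.
class PageStore {
    static class Page {
        final String status;
        final float score;
        Page(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    private final ConcurrentMap<String, Page> rows = new ConcurrentHashMap<>();

    // Unconditional write, as used for pages that were actually fetched.
    void put(String url, Page page) {
        rows.put(url, page);
    }

    // Writes the page only if no row exists for url; returns true if it was written.
    boolean addIfAbsent(String url, Page page) {
        return rows.putIfAbsent(url, page) == null;
    }

    Page get(String url) {
        return rows.get(url);
    }

    public static void main(String[] args) {
        PageStore store = new PageStore();
        store.put("A", new Page("fetched", 1.5f));

        // A already exists, so the freshly initialized page must not replace it.
        boolean addedA = store.addIfAbsent("A", new Page("unfetched", 0.0f));
        // C is genuinely new, so it is written with its initial values.
        boolean addedC = store.addIfAbsent("C", new Page("unfetched", 0.0f));

        System.out.println(addedA + " " + store.get("A").status);
        System.out.println(addedC + " " + store.get("C").status);
    }
}
```

      The point of doing the check inside the store is that the existence test and the write happen as one atomic step, so updatedb does not need to load every row up front to avoid clobbering an existing page.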

        Attachments

        1. NUTCH-1679_4.patch
          3 kB
          Lewis John McGibbney
        2. NUTCH-1679_3.patch
          2 kB
          Alexander Kingson
        3. NUTCH-1679-2.patch
          4 kB
          Tien Nguyen Manh
        4. NUTCH-1679.patch
          3 kB
          Koen Smets

              People

              • Assignee:
                lewismc Lewis John McGibbney
                Reporter:
                tiennm Tien Nguyen Manh
              • Votes:
                5
                Watchers:
                12
