Nutch / NUTCH-907

DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None

    Description

      In Nutch 1.x it was easy to select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying the path where the data was stored. This let users run several disjoint crawls with different configurations on the same storage medium, each under its own path.

      This is no longer possible, because each DataStore instance maps 1:1 to a single set of crawl data.

      To support this functionality, the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify a particular crawl dataset. The Nutch API should then be extended to accept this "crawlId" value and use it to select one of possibly many existing crawl datasets.
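
      The prefixing idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names `CrawlStoreNames` and `CrawlStoreRegistry`, the `_` separator, and the base table name `webpage` are hypothetical and are not taken from the Gora or Nutch APIs or from the attached patches. An in-memory map stands in for the underlying storage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of the Gora or Nutch API.
final class CrawlStoreNames {
    private CrawlStoreNames() {}

    // Builds "<crawlId>_<baseName>"; falls back to baseName alone when
    // no crawlId is given, so a single default dataset keeps working.
    static String tableName(String crawlId, String baseName) {
        if (baseName == null || baseName.isEmpty()) {
            throw new IllegalArgumentException("baseName must not be empty");
        }
        if (crawlId == null || crawlId.isEmpty()) {
            return baseName;
        }
        return crawlId + "_" + baseName;
    }
}

// Hypothetical stand-in for the storage layer: one in-memory "table"
// per prefixed name, so different crawlIds never share rows.
final class CrawlStoreRegistry {
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    // Returns the table for the given crawlId, creating it on first use.
    Map<String, String> open(String crawlId) {
        String name = CrawlStoreNames.tableName(crawlId, "webpage");
        return tables.computeIfAbsent(name, k -> new HashMap<>());
    }

    public static void main(String[] args) {
        CrawlStoreRegistry registry = new CrawlStoreRegistry();
        registry.open("crawlA").put("http://example.com/", "fetched");
        // crawlB is a disjoint dataset and does not see crawlA's rows.
        System.out.println(registry.open("crawlB").isEmpty());
    }
}
```

      With this scheme, two crawls with different configurations share one storage backend but resolve to different table names, which mirrors the per-path separation that Nutch 1.x offered.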

      Attachments

        1. NUTCH-907.v2.patch
          50 kB
          Sertan Alkan
        2. NUTCH-907.patch
          46 kB
          Sertan Alkan

          People

            Assignee: Andrzej Bialecki
            Reporter: Andrzej Bialecki
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: