[NUTCH-907] DataStore API doesn't support multiple storage areas for multiple disjoint crawls - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: nutchgora
Component/s: None
Labels: None

Description

In Nutch 1.x it was easy to select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying the path where the data was stored. This let users run several disjoint crawls with different configs while still using the same storage medium, just under different paths.

This is no longer possible because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data.

To support this functionality, the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify a particular crawl dataset. The Nutch API should then be extended to accept this "crawlId" value and use it to select one of possibly many existing crawl datasets.
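
The prefixing idea can be illustrated with a minimal sketch. This is not the Gora or Nutch API and not the code in the attached patches: the class name `CrawlTableNames`, the method `tableNameFor`, the base table name "webpage", and the "_" separator are all assumptions made for illustration only.

```java
import java.util.Objects;

// Hypothetical helper: derives a per-crawl table name from a crawlId prefix.
// None of these names come from Gora or Nutch; they are illustrative only.
public class CrawlTableNames {

    // Returns baseName unchanged when no crawlId is given, so a single
    // default dataset keeps working; otherwise prepends "<crawlId>_".
    // The "_" separator is an assumption, not a documented convention.
    public static String tableNameFor(String crawlId, String baseName) {
        Objects.requireNonNull(baseName, "baseName");
        if (crawlId == null || crawlId.isEmpty()) {
            return baseName;
        }
        return crawlId + "_" + baseName;
    }

    public static void main(String[] args) {
        // Two disjoint crawls share one storage backend but use separate tables.
        System.out.println(tableNameFor("news", "webpage"));   // news_webpage
        System.out.println(tableNameFor("blogs", "webpage"));  // blogs_webpage
        // No crawlId: fall back to the unprefixed table.
        System.out.println(tableNameFor(null, "webpage"));     // webpage
    }
}
```

With a scheme along these lines, each crawl's data would live in its own table within the same backend, much as Nutch 1.x crawls lived under different paths.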

Attachments

- NUTCH-907.v2.patch, 05/Oct/10 13:36, 50 kB, Sertan Alkan
- NUTCH-907.patch, 27/Sep/10 12:27, 46 kB, Sertan Alkan

People

Assignee: Andrzej Bialecki

Reporter: Andrzej Bialecki

Votes: 0

Watchers: 1

Dates

Created: 15/Sep/10 15:00

Updated: 22/May/13 03:53

Resolved: 21/Oct/10 12:02