Nutch / NUTCH-907

DataStore API doesn't support multiple storage areas for multiple disjoint crawls

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None

      Description

      In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths.

      This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data.

      In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this "crawlId" value to select one of possibly many existing crawl datasets.

      Attachments

      1. NUTCH-907.v2.patch, 50 kB, Sertan Alkan
      2. NUTCH-907.patch, 46 kB, Sertan Alkan

        Activity

        Doğacan Güney added a comment -

        Gora already supports this somewhat. While creating a data store, you can optionally specify a table name:

        public static <D extends DataStore<K,T>, K, T extends Persistent>
            D createDataStore(Class<D> dataStoreClass, Class<K> keyClass,
                Class<T> persistent, String schemaName)

        We should be able to leverage that in Nutch to support different crawl datasets. If we extend Nutch's current API to allow names to be specified for crawls then Nutch can simply create tables prefixed with crawl names as Andrzej suggested. For example, a crawl dataset with name "foo" will have a table called "foo_webtable".
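
        [Editor's note] A minimal sketch of this idea, written against the createDataStore signature quoted above. The helper name (createWebStore), the Gora package names (which follow later Gora releases and may differ on the nutchgora branch), and the use of Nutch's WebPage class are illustrative assumptions, not code from this issue:

            import org.apache.gora.store.DataStore;
            import org.apache.gora.store.DataStoreFactory;
            import org.apache.nutch.storage.WebPage;

            public class CrawlStores {
              /** Opens the webtable for the given crawl, e.g. "foo" -> "foo_webtable". */
              public static <D extends DataStore<String, WebPage>> D createWebStore(
                  Class<D> dataStoreClass, String crawlId) {
                String schemaName = (crawlId == null || crawlId.isEmpty())
                    ? "webtable"               // no id: the shared default table
                    : crawlId + "_webtable";   // per-crawl table, as suggested above
                return DataStoreFactory.createDataStore(
                    dataStoreClass, String.class, WebPage.class, schemaName);
              }
            }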

        What do you think, Andrzej? I think Gora needs no extension here, but if people find the API awkward we can change Gora too.

        Andrzej Bialecki added a comment -

        That's very good news - in that case I'm fine with the Gora API as it is now; we should change Nutch to make use of this functionality.

        Hide
        Sertan Alkan added a comment -

        Here's a patch that allows Nutch to create different schemas based on the same schema definition. Some points about the patch:

        • To be able to prefix a schema name with a value, Nutch needs to know the default schema name defined in the gora mapping file (e.g. ...table=<name>...). Gora handles creation internally at the moment and doesn't expose this name to the outside. So, the patch introduces two new configuration options to pass the schema name to Nutch internals.
          • Nutch ignores the schema name setting in the gora mapping file; instead, the configuration option storage.schema tells Nutch which schema name to use to access the data store. This value defaults to webpage.
          • The storage.schema.id option defines the prefix to add to the schema name in storage.schema. By default this id is not provided, i.e. all jobs run on the webpage store as before (see the sketch after this list).
        • In addition to the configuration option, all jobs (injector, generator, fetcher, updatedb, indexer, benchmark and webtable reader) are modified to accept a schema id as an optional command line argument, -schemaId, which overrides the configuration option (-schemaId may seem an odd name but I am not big on naming things).
        • The patch also modifies the unit tests to use the same logic.
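
        [Editor's note] A minimal sketch of the name resolution described above, assuming an underscore separator as in the earlier "foo_webtable" example; the class and method names are illustrative:

            import org.apache.hadoop.conf.Configuration;

            public class SchemaNames {
              /** Resolves the schema name from storage.schema and storage.schema.id. */
              public static String resolve(Configuration conf) {
                String schema = conf.get("storage.schema", "webpage"); // base name, defaults to "webpage"
                String id = conf.get("storage.schema.id");             // optional per-crawl prefix
                return (id == null || id.isEmpty()) ? schema : id + "_" + schema;
              }
            }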

        All unit tests pass without a problem, and I have run a simple crawl (a) with the default configuration, (b) providing a schema id from the configuration, and (c) giving the id on the command line; the jobs seem to run well.

        Hide
        Andrzej Bialecki added a comment -

        Hi Sertan,

        Thanks for the patch, this looks very good! A few comments:

        • I'm not good at naming things either... schemaId is a little bit cryptic, though. If we didn't already use crawlId I would vote for that (and then rename the current crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId?
        • Since we now create multiple datasets, we need some way to manage them, i.e. at least list and delete (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue.
        • IndexerMapReduce.createIndexJob: I think it would be useful to pass the "datasetId" as a Job property; this way indexing filter plugins can use it to populate NutchDocument fields if needed (see the sketch below). FWIW, this may be a good idea to do in other jobs as well...
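
        [Editor's note] A sketch of the job-property idea from the last bullet. The property key "storage.dataset.id" and the surrounding job setup are illustrative assumptions, not code from the patch:

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.mapreduce.Job;

            public class IndexJobExample {
              public static Job createIndexJob(Configuration conf, String jobName,
                  String datasetId) throws IOException {
                Job job = new Job(conf, jobName);
                // Make the dataset id visible to indexing filter plugins at runtime
                // (the key name "storage.dataset.id" is hypothetical).
                job.getConfiguration().set("storage.dataset.id", datasetId);
                // ... the rest of the job setup (input, mapper, reducer) stays as before.
                return job;
              }
            }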
        Sertan Alkan added a comment - edited

        Hi Andrzej,

        Thanks for the review and the feedback.

        • Funny thing, I was actually going for datasetId for the name, but now that you mention it, I prefer to use crawlId for this and rename the old crawlId to batchId. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both all at once.
        • I agree that the arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
        • A utility to manage the datasets is a good idea; considering the current Gora architecture, though, I think we may need to add a client interface there somewhere. I've opened up an issue for this, and we can start thinking about the design there. We won't be able to write a generic utility in Nutch until we roll out a new version of Gora, though. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather go for a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) to the user.

        I'll modify the patch to reflect those two changes.

        Hide
        Sertan Alkan added a comment -

        Here's the modified version of the patch after Andrzej's review. The additions to the original patch are as follows:

        • The old crawlId option is renamed to batchId for convenience.
        • All jobs now accept an optional argument, -crawlId <id>, to prefix the schema. Jobs now keep this property in the configuration, allowing later use by, say, plugins (see the sketch after this list).
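
        [Editor's note] To illustrate the plugin use mentioned above, an indexing filter could read the id back from the job configuration along these lines. The property key "storage.crawl.id" and the document field name are assumptions, not taken from the patch:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.nutch.indexer.NutchDocument;

            public class CrawlIdField {
              /** Copies the crawl id (if any) from the job configuration into the document. */
              public static NutchDocument addCrawlId(NutchDocument doc, Configuration conf) {
                String crawlId = conf.get("storage.crawl.id"); // set by the job's -crawlId option (assumed key)
                if (crawlId != null && !crawlId.isEmpty()) {
                  doc.add("crawlId", crawlId); // expose the dataset name as an index field
                }
                return doc;
              }
            }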

        All unit tests pass, and again I have run a simple crawl without any problems. I have also tested the batchId option by generating two different sets of the injected URLs and running a fetch-parse cycle on those sets. The jobs recognize the correct batchId and select only the corresponding URLs.

        Like I said before, I prefer to leave the store manipulation utility out of this patch and handle it in a separate issue once we have that functionality in Gora. What do you think?

        Hide
        Andrzej Bialecki added a comment -

        Committed in rev. 1025963. Thank you, Sertan, for a high-quality patch and unit tests!

        Hide
        Sertan Alkan added a comment -

        Thanks, Andrzej. I've been waiting for this; I have a couple of use cases for exactly this functionality.


          People

          • Assignee: Andrzej Bialecki
          • Reporter: Andrzej Bialecki
          • Votes: 0
          • Watchers: 1
