Nutch / NUTCH-1290

crawlId not supported by all Tools

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      See also: https://issues.apache.org/jira/browse/NUTCH-907

      The StorageUtils class exposes a createDataStore method which uses the default schema for a persistent class specified in the Gora configuration.
      This method ignores Nutch's storage.schema property and the notion of a crawlId.

      Two tools use this method instead of the createWebStore method (which does support the storage.schema property and a crawlId):

      o.a.n.indexer.IndexerReducer (IndexerJob)
      o.a.n.util.domain.DomainStatistics

      I propose that these two start using the createWebStore method and that we remove the createDataStore method from StorageUtils.
      Also, these two tools should support the crawlId command line parameter.
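
The proposal above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class name, the crawl id property key ("storage.crawl.id"), the "-crawlId" flag spelling, and the "crawlId_schema" naming convention are all assumptions made for illustration; only the storage.schema property is named in this issue. A plain Map stands in for the real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: shows how a crawlId-aware store lookup could pick
// its schema name, in contrast to always using the Gora default schema.
public class CrawlIdSchemaSketch {

    // "storage.schema" is the property named in the issue.
    static final String SCHEMA_KEY = "storage.schema";
    // Assumed key for the crawl id; not confirmed by the issue text.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    // Returns the schema name to use: the configured schema (or the given
    // default), prefixed with the crawl id when one is set.
    static String resolveSchemaName(Map<String, String> conf, String defaultSchema) {
        String schema = conf.getOrDefault(SCHEMA_KEY, defaultSchema);
        String crawlId = conf.getOrDefault(CRAWL_ID_KEY, "");
        if (crawlId.isEmpty()) {
            return schema;
        }
        // Assumed naming convention: "<crawlId>_<schema>".
        return crawlId + "_" + schema;
    }

    // Copies the value following an assumed "-crawlId" flag into the config.
    static void applyCrawlIdArg(String[] args, Map<String, String> conf) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-crawlId".equals(args[i])) {
                conf.put(CRAWL_ID_KEY, args[i + 1]);
                i++; // skip the flag's value
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        applyCrawlIdArg(args, conf);
        System.out.println(resolveSchemaName(conf, "webpage"));
    }
}
```

A tool that resolves its schema this way would read and write the table belonging to the requested crawl, whereas a lookup that skips both properties always lands on the default schema.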

      1. NUTCH-1290.patch
        5 kB
        Mathijs Homminga

        Issue Links

          Activity

          Mathijs Homminga added a comment -

          Actually, the IndexerReducer is only used by the IndexerJob, which in turn is only implemented by the SolrIndexerJob at the moment.
          The SolrIndexerJob appears to support the crawlId, but since it uses the createDataStore method (instead of the createWebStore method), the crawlId ends up being ignored.

          Mathijs Homminga added a comment -

          This patch modifies the following files in order to support crawlId consistently:

          o.a.n.indexer.IndexerReducer
          o.a.n.storage.StorageUtils
          o.a.n.storage.WebTableCreator
          o.a.n.util.domain.DomainStatistics

          Ferdy Galema added a comment -

          The recent commit of NUTCH-882 incorporated changes that effectively supersede this issue.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Mathijs Homminga
            • Votes:
              0
            • Watchers:
              0
