Nutch / NUTCH-2793

CSV indexer does not work in distributed mode


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17
    • Fix Version/s: 1.20
    • Component/s: indexer, plugin
    • Labels: None

    Description

      The reasons are discussed in https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768 and the comments that follow it.

      To summarize, the indexer interface is not aware of tasks, so a writer cannot generate a unique output name per reducer.

      But this seems achievable, because IndexWriters initializes each writer with calls to two open functions:

      • The first passes the general configuration and a "name"
      • The second passes the indexer parameters

      https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214

      Fortunately, "name" is generated by calling getUniqueFile, which does exactly what we want:

      https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43
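
To illustrate why a task-derived name avoids collisions, here is a minimal sketch that does not depend on Hadoop. `uniqueName` is a hypothetical helper, and the `<base>-r-<5-digit partition>` pattern it builds is an assumption about what `getUniqueFile` produces, not a copy of its implementation.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class UniqueNameSketch {

    // Hypothetical helper: builds "<base>-<taskType>-<5-digit partition><extension>".
    // The pattern is assumed for illustration only.
    static String uniqueName(String base, char taskType, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s", base, taskType, partition, extension);
    }

    public static void main(String[] args) {
        Set<String> fixedNames = new LinkedHashSet<>();
        Set<String> perTaskNames = new LinkedHashSet<>();

        // Simulate three reducers choosing an output file name.
        for (int reducer = 0; reducer < 3; reducer++) {
            fixedNames.add("nutch.csv");                        // same name for every reducer
            perTaskNames.add(uniqueName("part", 'r', reducer, ""));
        }

        System.out.println(fixedNames);    // one shared name: [nutch.csv]
        System.out.println(perTaskNames);  // [part-r-00000, part-r-00001, part-r-00002]
    }
}
```

With a fixed name, all three simulated reducers collapse onto a single file name, while the per-task names stay distinct.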

      I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file name. This is a breaking change because it modifies the output name, but it allows the indexer to work in distributed mode.
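
A rough sketch of the idea, under stated assumptions: the writer remembers the "name" it receives in the first open call and uses it as the file name when the second open call supplies the output directory. `CsvWriterSketch`, the `Map`-based stand-ins for the configuration and parameter objects, and the `outpath` key with its default are all hypothetical and only mimic the two-call shape described above; this is not the actual CSVIndexWriter code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class CsvWriterSketch {

    // Falls back to the old fixed name if no per-task name was provided.
    private String name = "nutch.csv";
    private Path outputFile;

    // First open call: general configuration plus the per-task "name".
    // The Map is a stand-in for the real configuration type.
    public void open(Map<String, String> conf, String name) {
        this.name = name;
    }

    // Second open call: indexer parameters.
    // "outpath" and its default value are hypothetical here.
    public void open(Map<String, String> params) {
        Path dir = Paths.get(params.getOrDefault("outpath", "csvindexwriter"));
        this.outputFile = dir.resolve(name);
    }

    public Path getOutputFile() {
        return outputFile;
    }

    public static void main(String[] args) {
        CsvWriterSketch writer = new CsvWriterSketch();
        writer.open(Map.of(), "part-r-00001");       // name as it might arrive from the output format
        writer.open(Map.of("outpath", "out"));
        System.out.println(writer.getOutputFile());  // a path ending in part-r-00001
    }
}
```

Because each task would receive a different name, each reducer would resolve a different output file under the same directory.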

      A PR will follow once the ticket is created.

          People

            Assignee: Unassigned
            Reporter: Patrick Mézard (pmezard)
