Nutch / NUTCH-2616

Review routing of deletions by Exchange component

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.15
    • Component/s: indexer
    • Labels:
      None

      Description

      If the exchange component (NUTCH-2412) is enabled, it must also route deletions (404, etc.) to the configured index writers. Deletions are performed using only the document ID (URL); there is no NutchDocument (it is null), and this case needs to be handled to avoid an NPE in the Exchanges class or the exchange plugins.

      NUTCH-2412 has added a new delete method in the IndexWriters class:

      • delete(String, NutchDocument) is now called from the indexing job (bin/nutch index ... -deleteGone). However, the NutchDocument is always null in case of deletions, see IndexerMapReduce.DELETE_ACTION.
      • delete(String) is now a no-op but is still called from CleaningJob (bin/nutch clean ...)
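
      The failure mode can be illustrated with a small stand-alone sketch. The types below (Doc, FieldMatchExchange) are hypothetical stand-ins, not Nutch's actual NutchDocument or exchange plugin API; the sketch only shows why a null document needs a guard before any field is read.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a document; NOT Nutch's NutchDocument. */
class Doc {
    private final Map<String, String> fields = new HashMap<>();

    Doc add(String name, String value) {
        fields.put(name, value);
        return this;
    }

    String get(String name) {
        return fields.get(name);
    }
}

/** Hypothetical exchange rule that matches on a single field value. */
class FieldMatchExchange {
    private final String field;
    private final String expected;

    FieldMatchExchange(String field, String expected) {
        this.field = field;
        this.expected = expected;
    }

    /** Unguarded: throws NullPointerException when doc is null (a deletion). */
    boolean matchUnguarded(Doc doc) {
        return expected.equals(doc.get(field));
    }

    /** Guarded: a null document (a deletion) simply does not match. */
    boolean matchGuarded(Doc doc) {
        return doc != null && expected.equals(doc.get(field));
    }
}
```

      With the guard in place a deletion no longer fails, but it also matches no rule, so the question of which index writers should receive it remains open.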

      We could (Roannel Fernández Hernández, are there better options?):

      • send deletions to all index writers. This causes a certain overhead (could be critical if deletion lists are long).
      • pass a document containing only a single field (the document ID / URL) to the exchange component.
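
      The two options can be compared in a minimal sketch. DeletionRouter, its method names, and the "id" field name are assumptions made for illustration; they are not the IndexWriters or Exchanges API. Documents are modeled as plain field maps and routing rules as predicates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical router; illustrative only, not Nutch's actual classes. */
class DeletionRouter {
    // writer id -> rule deciding whether a document (field map) goes to that writer
    private final Map<String, Predicate<Map<String, String>>> rules = new LinkedHashMap<>();

    void addWriter(String writerId, Predicate<Map<String, String>> rule) {
        rules.put(writerId, rule);
    }

    /** Option 1: send the deletion to every configured writer, ignoring the rules. */
    List<String> routeDeletionToAll(String url) {
        return new ArrayList<>(rules.keySet());
    }

    /** Option 2: evaluate the rules against a document holding only the ID field. */
    List<String> routeDeletionById(String url) {
        // "id" is an assumed field name for the document ID / URL
        Map<String, String> idOnly = Collections.singletonMap("id", url);
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : rules.entrySet()) {
            if (e.getValue().test(idOnly)) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }
}
```

      In this sketch, option 1 costs one delete request per writer for every deleted URL. Option 2 reaches only writers whose rule can be decided from the ID alone; a rule that tests any other field sees no value for it and does not match, so that writer would not receive the deletion.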

              People

              • Assignee: Unassigned
              • Reporter: Sebastian Nagel (wastl-nagel)