  1. ManifoldCF
  2. CONNECTORS-1153

Documents crawled using ManifoldCF 1.6 or earlier are needlessly recrawled after upgrade to 1.7 or later

    Details

      Description

      After upgrading to ManifoldCF 1.7 or later, pre-existing documents are recrawled and re-indexed even if they have not changed in any way since their last pre-upgrade crawl. The impact can be significant for large ManifoldCF deployments with millions of static documents.

      There appear to be three contributing factors:
      1. In PipelineObjectWithVersions#buildAddPipeline and IncrementalIngester#checkFetchDocument, the empty transformation version stored for a legacy document differs from the initial value of "0+0!".
      2. PipelineObjectWithVersions#buildAddPipeline compares output versions incorrectly: oldOutputVersion is compared to a VersionContext object instead of the version string, which can be obtained by calling VersionContext#getVersionString. If IPipelineSpecification#getStageDescriptionString continues to return a VersionContext object, renaming the method could be useful.
      3. In PipelineObjectWithVersions#buildAddPipeline, a null value for newAuthorityNameString is not treated the same as an empty string, unlike in other methods.
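
      The three factors above can be illustrated with a minimal Java sketch. It is not the ManifoldCF code: VersionCompareSketch, its nested VersionContext stand-in, and every helper method are hypothetical names invented for illustration. Only the "0+0!" value and the getVersionString accessor come from the description above.

```java
// Sketch only: none of these types are the real ManifoldCF classes.
final class VersionCompareSketch {

    /** Hypothetical stand-in for VersionContext; only getVersionString() is modeled. */
    static final class VersionContext {
        private final String versionString;

        VersionContext(String versionString) {
            this.versionString = versionString;
        }

        String getVersionString() {
            return versionString;
        }
    }

    /** Initial transformation version quoted in the issue description. */
    static final String INITIAL_TRANSFORMATION_VERSION = "0+0!";

    /** Factor 1: map a legacy empty (or null) transformation version to the initial value. */
    static String normalizeTransformationVersion(String version) {
        return (version == null || version.isEmpty()) ? INITIAL_TRANSFORMATION_VERSION : version;
    }

    /** Factor 3: treat a null authority name the same as an empty string. */
    static String normalizeAuthorityName(String name) {
        return name == null ? "" : name;
    }

    /** Factor 2, as reported: a String never equals a non-String object, so this is always false. */
    static boolean outputUnchangedBuggy(String oldOutputVersion, VersionContext newOutput) {
        Object other = newOutput;
        return oldOutputVersion.equals(other);
    }

    /** Factor 2, corrected: compare against the version string held by the context. */
    static boolean outputUnchangedFixed(String oldOutputVersion, VersionContext newOutput) {
        return oldOutputVersion.equals(newOutput.getVersionString());
    }

    /** Returns true only if some normalized version actually differs. */
    static boolean needsReindex(String oldTransformationVersion, String newTransformationVersion,
                                String oldOutputVersion, VersionContext newOutput,
                                String oldAuthorityName, String newAuthorityName) {
        boolean transformationSame = normalizeTransformationVersion(oldTransformationVersion)
                .equals(normalizeTransformationVersion(newTransformationVersion));
        boolean outputSame = outputUnchangedFixed(oldOutputVersion, newOutput);
        boolean authoritySame = normalizeAuthorityName(oldAuthorityName)
                .equals(normalizeAuthorityName(newAuthorityName));
        return !(transformationSame && outputSame && authoritySame);
    }

    public static void main(String[] args) {
        VersionContext unchanged = new VersionContext("out-1");
        // The buggy comparison reports a change even though the version string is identical.
        System.out.println(outputUnchangedBuggy("out-1", unchanged));
        // A legacy record (empty transformation version, null authority) is not reindexed.
        System.out.println(needsReindex("", "0+0!", "out-1", unchanged, "", null));
    }
}
```

      Both calls in main print false: the object-to-string comparison never matches, while the normalized comparison treats the unchanged legacy record as up to date. Normalizing at the point of comparison, rather than rewriting stored records, is just one possible design and is an assumption of this sketch.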

            People

            • Assignee: Karl Wright (kwright@metacarta.com)
            • Reporter: Aeham Abushwashi (aeham.abushwashi)
