  1. ManifoldCF
  2. CONNECTORS-1118

Documents processed by the shared drive connector incur an unnecessary synchronisation hit


Details

    Description

      Each document processed by the shared drive connector is passed through SharedDriveConnector#checkInclude to verify whether it is eligible for ingestion. The calls made there to WorkerThread$ProcessActivity#checkMimeTypeIndexable and WorkerThread$ProcessActivity#checkLengthIndexable are unnecessarily costly because each one creates a fresh instance of IncrementalIngester$PipelineConnections. The constructor of IncrementalIngester$PipelineConnections can be very expensive because it loads the output connection objects, which in turn requires some locking (via ZK in a distributed environment).
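      One possible way to avoid the repeated construction is to build the pipeline connections once per activity and reuse them for both checks. The sketch below only illustrates that idea and is not the ManifoldCF code: the PipelineConnections and ProcessActivity classes here are simplified, hypothetical stand-ins, the indexability rules are invented for the example, and the lazy field assumes the activity is used by a single worker thread.

```java
import java.util.function.Supplier;

public class LazyPipelineSketch {

    /** Hypothetical stand-in for IncrementalIngester$PipelineConnections. */
    static final class PipelineConnections {
        // Counts constructions so the saving is visible; not part of any real API.
        static int constructed = 0;

        PipelineConnections() {
            // The real constructor loads output connections and takes locks.
            constructed++;
        }

        // Invented rules, for illustration only.
        boolean mimeTypeIndexable(String mimeType) {
            return !mimeType.startsWith("video/");
        }

        boolean lengthIndexable(long length) {
            return length <= 10_000_000L;
        }
    }

    /** Hypothetical stand-in for WorkerThread$ProcessActivity. */
    static final class ProcessActivity {
        private final Supplier<PipelineConnections> factory;
        private PipelineConnections cached; // built on first use, then reused

        ProcessActivity(Supplier<PipelineConnections> factory) {
            this.factory = factory;
        }

        // Not thread-safe: assumes one worker thread owns this activity.
        private PipelineConnections connections() {
            if (cached == null) {
                cached = factory.get();
            }
            return cached;
        }

        boolean checkMimeTypeIndexable(String mimeType) {
            return connections().mimeTypeIndexable(mimeType);
        }

        boolean checkLengthIndexable(long length) {
            return connections().lengthIndexable(length);
        }
    }

    public static void main(String[] args) {
        ProcessActivity activity = new ProcessActivity(PipelineConnections::new);
        // Simulate checkInclude being called for many documents.
        for (int i = 0; i < 1000; i++) {
            activity.checkMimeTypeIndexable("text/plain");
            activity.checkLengthIndexable(1024L);
        }
        System.out.println("constructions: " + PipelineConnections.constructed);
    }
}
```

      With this arrangement the 2000 checks in main share a single PipelineConnections instance instead of creating one per call. Whether caching is safe in the real code depends on how long the output connection configuration can be assumed stable, which this sketch does not address.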

      The other area of inefficiency is in WorkerThread$ProcessActivity#processDocumentReferences. This method creates new instances of PriorityCalculator using the less efficient three-argument constructor. This can be addressed using the same pattern that was implemented for CONNECTORS-1094.

      To highlight the impact of the above calls, I profiled an active worker thread for 40 minutes. During that window, it spent ~23 minutes in SharedDriveConnector#checkInclude and its callees, and a further 9 minutes creating instances of PriorityCalculator.

      I've seen the above issues when using the shared drive connector, but I think other connectors could be impacted too, depending on how they're implemented.

          People

            kwright@metacarta.com Karl Wright
            aeham.abushwashi Aeham Abushwashi