Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- ManifoldCF 1.7
- None
Description
In some cases, documents that are indexed may be virtual children of those that are queued. A good example of this is RSS feeds where the data being indexed all comes from the feed.
In order to implement this, the following changes would be required:
(1) IProcessActivity.ingestDocument() has a variant which allows you to include a virtual child document identifier in addition to the main document identifier.
(2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys – one for main (queued) document identifier, one for child virtual document identifier.
(3) IIncrementalIngester has two new methods: beginDocument() and endDocument(), both of which take a main (queued) document identifier as an argument.
(4) ingeststatus table has two additional columns: a state, and a child key.
(5) The flow is: at beginDocument() time, put all records relating to a document into a "processing" state. Documents that are seen have their state changed. Documents never encountered are deleted at the end.
(6) Incremental decisions not to update an output record STILL will require that the record be touched and its state set.
(7) DocumentIngest records for the entire set of children will be fetched when the document is queued.
(8) The getDocumentVersions() method must be modified to allow return of version strings for all children, although there can be "shortcuts" as well (where a single version string applies to all children).
(9) The decision about whether to refetch a document is based on the returned version strings and on those fetched by the stuffer thread.
(10) Similarly, processDocuments() receives version strings for all virtual children.
(11) There is no need to actively reset the state of document records on restart; the current logic should be robust enough to be able to generate the required deletions.
(12) Deleting a document deletes ALL child virtual documents. This happens within the incremental ingester.
(13) Requeuing interval must be computed across all children, taking the minimum, since there's no requirement that an ingeststatus record exist for the parent.
(14) All other logic, including making sure only one agent operates on a url at a time, is the same.
(15) Interrupting the delete phase is safe because next time the doc is processed the records will be removed.
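The state handling in steps (2), (5), (6), and (15) can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not ManifoldCF code: the class VirtualChildTracker, the State enum values, and the childKeys() helper are hypothetical, and a nested map stands in for the ingeststatus table with its proposed state and child key columns. The method names beginDocument(), addOrReplaceDocument(), and endDocument() follow the proposal, but their signatures here are simplified guesses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the ingeststatus table; not ManifoldCF code. */
class VirtualChildTracker {
  /** Hypothetical state values; the proposal only says a state column is added. */
  enum State { PROCESSING, SEEN }

  // main (queued) document identifier -> child key -> record state
  private final Map<String, Map<String, State>> records = new HashMap<>();

  /** Step (5): put every existing record for this parent into the "processing" state. */
  void beginDocument(String parentId) {
    Map<String, State> children = records.computeIfAbsent(parentId, k -> new TreeMap<>());
    children.replaceAll((childKey, state) -> State.PROCESSING);
  }

  /**
   * Steps (2) and (6): the record is keyed by parent and child, and it is touched
   * whether the child is actually re-indexed or skipped as unchanged.
   */
  void addOrReplaceDocument(String parentId, String childKey) {
    records.computeIfAbsent(parentId, k -> new TreeMap<>()).put(childKey, State.SEEN);
  }

  /** Step (5): delete children never encountered in this pass; returns the deleted keys. */
  List<String> endDocument(String parentId) {
    List<String> deleted = new ArrayList<>();
    Map<String, State> children = records.get(parentId);
    if (children == null) {
      return deleted;
    }
    children.entrySet().removeIf(entry -> {
      if (entry.getValue() == State.PROCESSING) {
        deleted.add(entry.getKey());
        return true;
      }
      return false;
    });
    return deleted;
  }

  /** Hypothetical helper: the child keys currently recorded for a parent. */
  List<String> childKeys(String parentId) {
    return new ArrayList<>(records.getOrDefault(parentId, new TreeMap<>()).keySet());
  }
}
```

In this model, a first pass over a feed that reports item1 and item2 deletes nothing, and a second pass that reports only item2 deletes item1 at endDocument(). If a pass is interrupted before endDocument(), the leftover records are simply marked "processing" again by the next beginDocument(), which mirrors the claims in steps (11) and (15) that no explicit reset is needed.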
Analysis:
- The critical thing is making the non-virtual case no worse.
- For a virtual child document, instead of one db access, there are two.
- For document records that are not changed, there are two additional writes that were not needed before.
- There's an additional index (or the document key index has another subfield).
- If the queries can be written so that the standard (no child document) case is treated specially, we may be able to avoid much of the impact: the cost would be only two index queries per document, each returning zero rows.
- If we handle the standard case using the same mechanism, the WorkerThread logic dealing with deletions can go away.
Summary:
- Additional database overhead in the non-virtual indexing case consists of one additional write and one additional zero-row query, OR two additional zero-row queries.
- Additional database overhead in the non-virtual skip case consists of two additional writes, OR two additional zero-row queries.
- The overhead is low but still significant, and it will impact overall framework performance.
- The up-sides are as follows: (a) handling an important but infrequent case better; (b) less connector involvement in indexing (e.g., IProcessActivity.deleteDocument() does nothing now, and can be deprecated).
Issue Links
- depends upon: CONNECTORS-990 Experiment with folding getDocumentVersions() and processDocuments() into one method (Resolved)