Details
- New Feature
- Status: Resolved
- Minor
- Resolution: Won't Fix
- 4.0-ALPHA
- None
Description
This is a spin-off of SOLR-2382.
Currently DIH requires users to retrieve, join and index all data for a full or delta update in one big step. This issue is to allow us to break this into individual steps. The idea is to have multiple "data-config.xml" files, some of which retrieve and cache data while others join and index data.
This is useful when Solr records are a conglomeration of several data sources. With this feature, each data source can be retrieved and cached separately. Once all data sources have been retrieved, they can be joined and indexed in a final step. When doing a delta update, only the data sources that changed need to have their caches updated (alternatively, frequently changing data can remain uncached while the more static data is cached). This is particularly useful because Lucene/Solr cannot do a true "update" operation. DIH Caches also provide a handy way to archive source data for which there is no stable system of record.
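As a rough illustration of the staged workflow described above, here is a minimal Java sketch. It does not use the DIH classes from this patch: the in-memory maps stand in for persistent caches, and every name in it (StagedCacheSketch, row, fullUpdate, deltaUpdate, joinCaches) is hypothetical. It only shows the idea of caching each source separately, refreshing just the source that changed, and joining everything in a final step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StagedCacheSketch {

    // Hypothetical helper: build one row from alternating column/value pairs.
    static Map<String, String> row(String... kv) {
        Map<String, String> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            r.put(kv[i], kv[i + 1]);
        }
        return r;
    }

    // "Full update": discard the old cache contents and store the new rows.
    static void fullUpdate(Map<String, Map<String, String>> cache,
                           Map<String, Map<String, String>> rows) {
        cache.clear();
        cache.putAll(rows);
    }

    // "Delta update": overwrite only the rows that changed, keep the rest.
    static void deltaUpdate(Map<String, Map<String, String>> cache,
                            Map<String, Map<String, String>> changedRows) {
        cache.putAll(changedRows);
    }

    // Final step: join the child cache to the root cache on the shared key
    // and produce one flat document per root row.
    static List<Map<String, String>> joinCaches(Map<String, Map<String, String>> root,
                                                Map<String, Map<String, String>> child) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : root.entrySet()) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", e.getKey());
            doc.putAll(e.getValue());
            Map<String, String> childRow = child.get(e.getKey());
            if (childRow != null) {
                doc.putAll(childRow);
            }
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> products = new LinkedHashMap<>();
        Map<String, Map<String, String>> prices = new LinkedHashMap<>();

        // Each source is retrieved and cached in its own step.
        fullUpdate(products, Map.of("1", row("name", "widget"), "2", row("name", "gadget")));
        fullUpdate(prices, Map.of("1", row("price", "9.99"), "2", row("price", "19.99")));

        // Later, only the price source changed, so only its cache is touched.
        deltaUpdate(prices, Map.of("2", row("price", "17.49")));

        // The join-and-index step reads both caches.
        for (Map<String, String> doc : joinCaches(products, prices)) {
            System.out.println(doc);
        }
    }
}
```

In the sketch, the static product data is never re-read during the delta, which is the saving the description is after.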
Implementation Details:
- The DIHCacheWriter allows us to write the final (root entity) DIH output to a DIHCache rather than to Solr. Caches can be created from scratch ("full-update") or existing caches can be modified ("delta-update").
- The DIHCacheProcessor is an Entity Processor that reads a DIHCache. This Entity Processor can be used for both Root Entities and Child Entities. Cached data can be read back, joined to other Entities and indexed.
- Both DIHCacheWriter and DIHCacheProcessor support partitioning. DIHCacheWriter can write to a partitioned cache while DIHCacheProcessor can read back a particular partition. This can be handy when indexing to multiple shards.
- This patch is 100% stand-alone from the rest of DIH. Users can patch and rebuild the DIH .jar file to include these classes, but doing so is unnecessary. To use this functionality, simply include the code here in the classpath (e.g., in SOLR_HOME/lib).
- In addition to this patch, a persistent cache implementation is required.
  - See SOLR-2948 for a DIH Cache Implementation built on Lucene (no additional dependencies).
  - See SOLR-2613 for a DIH Cache Implementation backed with BDB-JE (we use this in Production).
  - Other Cache Implementations (hopefully) will be developed in the future and become available for general use.
- This patch includes extensive unit tests. A MockDIHCache that supports persistence and delta updates facilitates the tests. Do not attempt to use MockDIHCache for anything other than testing or as a reference for developing your own DIHCache implementations.
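To make the partitioning point above more concrete, here is a small Java sketch of one way rows could be spread across cache partitions so that each shard's indexing run reads back only its own partition. The hash-modulo assignment is an assumption made for illustration; the patch may assign partitions differently, and the names PartitionSketch, partitionFor and writePartitioned are hypothetical, not part of DIH.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    // Assumed scheme: map a row's key to a partition number in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // "Writer" side: distribute keys into numPartitions separate lists,
    // each list standing in for one cache partition.
    static List<List<String>> writePartitioned(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("doc-1", "doc-2", "doc-3", "doc-4", "doc-5");
        List<List<String>> partitions = writePartitioned(keys, 2);

        // "Processor" side: a run that feeds one shard reads only its partition.
        for (int p = 0; p < partitions.size(); p++) {
            System.out.println("partition " + p + ": " + partitions.get(p));
        }
    }
}
```

Because the assignment depends only on the key, a later delta update sends a changed row to the same partition that already holds it.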
Attachments
Issue Links
- is depended upon by SOLR-2549 DIH LineEntityProcessor support for delimited & fixed-width files (Resolved)
- is required by SOLR-6144 DIH Cache backed with MapDB (Resolved)
- is superseded by SOLR-14783 Remove DIH from 9.0 (Closed)
Updated patch. Fixes a parameter-naming bug.