[SOLR-9240] Support parallel ETL with the topic expression - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Resolved
Affects Version/s: None
Fix Version/s: 6.2
Component/s: None
Labels:
None

Description

It would be useful for SolrCloud to support large scale Extract, Transform and Load work loads with streaming expressions. Instead of using MapReduce for ETL, the topic expression can be used which allows SolrCloud to be treated like a distributed message queue filled with data to be processed. The topic expression works in batches and supports retrieval of stored fields, so large scale text ETL will work well with this approach.

This ticket makes two changes to the topic() expression that makes parallel ETL possible:

1) Changes the topic expression so it can operate in parallel.
2) Adds the initialCheckpoint parameter to the topic expression so a topic can start pulling records from anywhere in the queue.

Daemons can be sent to worker nodes that each work on processing a partition of the data from the same topic. The daemon() function's natural behavior is perfect for iteratively calling a topic until all records in the topic have been processed.

The sample code below pulls all records from one collection and indexes them into another collection. A Transform function could be wrapped around the topic() to transform the records before loading. Custom functions can also be built to load the data in parallel to any outside system.

parallel(
         workerCollection, 
         workers="2", 
         sort="DaemonOp desc", 
         daemon(
                update(
                      updateCollection, 
                      batchSize=200, 
                      topic(
                            checkpointCollection,
                            topicCollection, 
                            q=*:*, 
                            id="topic1",
                            fl="id, to , from, body", 
                            partitionKeys="id",
                            initialCheckpoint="0")), 
                runInterval="1000", 
                id="daemon1"))

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

SOLR-9240.patch
11/Jul/16 17:59
22 kB
Joel Bernstein
SOLR-9240.patch
11/Jul/16 16:50
19 kB
Joel Bernstein

Issue Links

is related to

SOLR-9258 Optimizing, storing and deploying AI models with Streaming Expressions

Closed

relates to

SOLR-9299 Allow Streaming Expressions to use Analyzers

Resolved

SOLR-9103 Restore ability for users to add custom Streaming Expressions

Closed

supercedes

SOLR-8591 Add BatchStream to the Streaming API

Closed

Activity

People

Assignee:: Joel Bernstein

Reporter:: Joel Bernstein

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 21/Jun/16 16:37

Updated:: 26/Aug/16 13:59

Resolved:: 18/Jul/16 01:56