[SOLR-13494] Add DeepRandomStream implementation - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Resolved
Affects Version/s: None
Fix Version/s: 8.2
Component/s: streaming expressions
Labels: None

Description

Currently the random Streaming Expression performs a conventional distributed search: it retrieves the top N docs from each shard, and the aggregator node then selects the top N from the combined results. This technique slows down as the number of shards and/or N grows.

Selecting a distributed random sample does not require this behavior. Instead, each shard can return N/numShards docs, and the aggregator node can simply return all of the results without a merge step. This technique gets faster as more shards are added instead of slowing down.
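The per-shard strategy described above can be sketched roughly as follows. This is an illustration only and not the code in the attached patch: the class DeepSampleSketch, its methods, and the in-memory lists standing in for shards are all hypothetical, and handing the remainder of N/numShards to the first few shards is an assumption made so that the quotas add up to exactly N.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: in-memory lists stand in for shards.
class DeepSampleSketch {

    // Split n rows across numShards. Assumption: the first (n % numShards)
    // shards each take one extra row so the quotas sum to exactly n.
    static int[] perShardRows(int n, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        int[] rows = new int[numShards];
        int base = n / numShards;
        int extra = n % numShards;
        for (int i = 0; i < numShards; i++) {
            rows[i] = base + (i < extra ? 1 : 0);
        }
        return rows;
    }

    // Draw each shard's quota at random and concatenate the results.
    // Nothing is re-ranked or merged across shards.
    static <T> List<T> sample(List<List<T>> shards, int n, Random rnd) {
        int[] rows = perShardRows(n, shards.size());
        List<T> out = new ArrayList<>();
        for (int i = 0; i < shards.size(); i++) {
            List<T> copy = new ArrayList<>(shards.get(i));
            Collections.shuffle(copy, rnd);
            // A shard holding fewer docs than its quota contributes all of them.
            out.addAll(copy.subList(0, Math.min(rows[i], copy.size())));
        }
        return out;
    }
}
```

Because each shard only produces its own share, the work per shard shrinks as shards are added. Note that equal quotas give an approximately uniform sample only when the shards hold roughly the same number of docs.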

This ticket will allow the random Streaming Expression to use the strategy above when N reaches a certain threshold (e.g. 10000).

The DeepRandomStream class will implement the deep random sampling behavior.

The random Streaming Expression will switch between the RandomStream and DeepRandomStream depending on N.
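A minimal sketch of that switch, under stated assumptions: the class RandomStrategySketch, the enum, and the constant name are hypothetical, the value 10000 is the example threshold mentioned in this ticket, and treating "reaches" as greater than or equal is a guess rather than confirmed behavior.

```java
// Hypothetical sketch of choosing a sampling strategy by requested rows.
class RandomStrategySketch {

    // Assumption: the example threshold from the ticket description.
    static final int DEEP_THRESHOLD = 10_000;

    enum Strategy {
        TOP_N_MERGE,      // conventional distributed search (RandomStream)
        PER_SHARD_SAMPLE  // N/numShards from each shard (DeepRandomStream)
    }

    // Assumption: the switch happens once rows is at or above the threshold.
    static Strategy choose(int rows) {
        return rows >= DEEP_THRESHOLD ? Strategy.PER_SHARD_SAMPLE : Strategy.TOP_N_MERGE;
    }
}
```

Small samples keep the existing behavior, so only large requests take the new path.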

Performance

Local testing shows very fast random sampling with the new technique.

Selecting a random sample of 250,000 documents with two numeric fields and running a regression analysis on the sample set takes under a second. A screenshot showing the math expression code is attached.

Attachments

- SOLR-13494.patch, 28/May/19 01:05, 17 kB, Joel Bernstein
- SOLR-13494.patch, 28/May/19 17:39, 25 kB, Joel Bernstein
- Screen Shot 2019-05-28 at 4.50.54 PM.png, 28/May/19 20:51, 274 kB, Joel Bernstein

Issue Links

is related to: SOLR-10651 Streaming Expressions statistical functions library (Closed)

People

Assignee: Joel Bernstein

Reporter: Joel Bernstein

Votes: 0

Watchers: 3

Dates

Created: 28/May/19 00:27

Updated: 26/Jul/19 08:56

Resolved: 29/May/19 19:17