[SOLR-14614] Add Simplified Aggregation Interface to Streaming Expression - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.7.2, 8.4.1
Fix Version/s: None
Component/s: query, query parsers, streaming expressions
Labels: None

Description

For data analytics, the standard workflow is:

Find a pattern
Aggregate by certain dimensions
Compute metrics (such as count, sum, avg)
Sort by a dimension or metric
Look at the top-n results

This functionality has been available through many different Solr interfaces in the past, but only streaming expressions can deliver results in a scalable, performant, and stable manner on systems holding big-data volumes.

However, one barrier to entry is the query interface: streaming expressions are not simple enough to write.

To give an example of how involved a streaming expression has to get in order to work on large-scale systems, here is one that finds the top 10 cities where someone named Alex works, with the respective counts:

qt=/stream&aggregationMode=facet&expr=
select( top( rollup(sort(by%3D"city+asc",
   +plist( 
          select(facet(collection1,+q%3D"(*:*+AND+name:alex)",+buckets%3D"city",+bucketSizeLimit%3D"2010",+bucketSorts%3D"count(*)+desc",+count(*)),+city,+count(*)+as+Nj3bXa),

          select(facet(collection2,+q%3D"(*:*+AND+name:alex)",+buckets%3D"city",+bucketSizeLimit%3D"2010",+bucketSorts%3D"count(*)+desc",+count(*)),+city,+count(*)+as+Nj3bXa)
         )),
		+over%3D"city",+sum(Nj3bXa)),
	+n%3D"10",+sort%3D"sum(Nj3bXa)+desc"),
+city,+sum(Nj3bXa)+as+Nj3bXa)

This is a query on an alias with 2 collections behind it, representing 2 data partitions; partitioning of this kind is more or less a requirement in big data systems. This is one of the only ways to get information from billions of records in a matter of seconds, which is awesome in terms of capability and performance.

But one can see how involved the syntax gets in the current scheme, and that is a barrier to entry for new adopters.

This Jira tracks the work of creating a simplified analytics endpoint that augments streaming expressions.

A starting proposal is for the endpoint to take these query parameters:
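Part of what makes the request above hard to read is that it is URL-encoded ("+" for a space, "%3D" for "="). As a small illustration, the Python sketch below decodes one of the inner facet clauses, copied verbatim from the request, using only the standard library. It is only a reading aid and not part of the proposal.

```python
from urllib.parse import unquote_plus

# One inner clause of the request above, copied verbatim in its URL-encoded form.
encoded = (
    'select(facet(collection1,+q%3D"(*:*+AND+name:alex)",+buckets%3D"city",'
    '+bucketSizeLimit%3D"2010",+bucketSorts%3D"count(*)+desc",+count(*)),'
    '+city,+count(*)+as+Nj3bXa)'
)

# unquote_plus turns "+" into a space and percent-escapes such as "%3D" into "=".
decoded = unquote_plus(encoded)
print(decoded)
```

Even decoded, the clause still has to be repeated once per collection and then wrapped in plist, sort, rollup, top, and select.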

/analytics?action=aggregate&q=*:*&fq=name:alex&dimensions=city&metrics=count&sort=count&sortOrder=desc&limit=10

This is equivalent to the SQL that an analyst would write:

select city, count(*) from collection where name = 'alex'
group by city order by count(*) desc limit 10;

On the Solr side this would be translated to the best possible streaming expression using rollup, top, sort, plist, etc., all done transparently to the user.

Here's to making the power of streaming expressions simpler to use for all.
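To make the intended translation concrete, here is a minimal Python sketch that assembles an expression of the same shape as the example earlier in this description from the proposed parameters. Everything in it is an assumption made for illustration: the function names (facet_clause, build_expression), the explicit list of collections behind the alias, the bucket_size_limit argument, and the support for the count metric only. It is not how Solr would necessarily implement the endpoint.

```python
def facet_clause(collection, query, dimension, bucket_size_limit, alias):
    # Hypothetical helper: one per-collection facet, renaming count(*) to an alias.
    return (
        f'select(facet({collection}, q="{query}", buckets="{dimension}", '
        f'bucketSizeLimit="{bucket_size_limit}", bucketSorts="count(*) desc", count(*)), '
        f'{dimension}, count(*) as {alias})'
    )


def build_expression(collections, q, fq, dimension, limit, bucket_size_limit, alias="cnt"):
    # Hypothetical translation of q, fq, dimensions, metrics=count, sort=count,
    # sortOrder=desc and limit into a single streaming expression.
    query = f"({q} AND {fq})"
    clauses = ", ".join(
        facet_clause(c, query, dimension, bucket_size_limit, alias) for c in collections
    )
    # Run the per-collection facets in parallel, then sort so rollup can group them.
    merged = f'sort(by="{dimension} asc", plist({clauses}))'
    # Sum the partial counts for each value of the dimension.
    rolled = f'rollup({merged}, over="{dimension}", sum({alias}))'
    # Keep only the top-n buckets by total count.
    top = f'top({rolled}, n="{limit}", sort="sum({alias}) desc")'
    return f'select({top}, {dimension}, sum({alias}) as {alias})'


expr = build_expression(
    ["collection1", "collection2"], "*:*", "name:alex", "city",
    limit=10, bucket_size_limit=2010, alias="Nj3bXa",
)
print(expr)
```

With these arguments the sketch produces the decoded form of the example expression, which is the point of the proposal: the user supplies a handful of parameters and the nesting is generated for them.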

Issue Links

relates to: SOLR-15036 Use plist automatically for executing a facet expression against a collection alias backed by multiple collections (Closed)

People

Assignee: Unassigned

Reporter: Aroop

Votes: 1

Watchers: 6

Dates

Created: 01/Jul/20 17:30

Updated: 16/Nov/21 16:49