The test environment is set up with 100K hive_tables, each with 84 columns.
A basic search with the query parameter specified is executed. Results take 75 seconds to appear.
A similar test with a smaller data set (200 hive_tables, each with 81 columns) also resulted in less than ideal performance.
The Atlas Basic Search API uses graph.indexQuery to perform the search, which in turn runs the search against Solr.
There are 2 aspects that affect performance:
- Solr's default maximum result set size, used when no limit is specified, is 100K. In the test scenario, this returns the entire result set.
- Once the result set is returned, EntityDiscoveryService.searchUsingBasicQuery does a sequential scan to filter the data relevant to the query. The cost of this operation is proportional to the size of the result set.
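The second point can be illustrated with a minimal sketch. It is not the Atlas implementation: IndexHit, sampleHits, and filter are hypothetical stand-ins, and the only thing it shows is that a post-query scan examines every hit the index returns, however few of them are relevant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PostFilterScan {
    /** Hypothetical stand-in for one hit returned by the index query. */
    public static final class IndexHit {
        public final String typeName;
        public final String name;

        public IndexHit(String typeName, String name) {
            this.typeName = typeName;
            this.name = name;
        }
    }

    /** Number of hits examined by the most recent call to filter(). */
    public static int examined = 0;

    /** Sequentially scans all hits and keeps only the relevant ones. */
    public static List<IndexHit> filter(List<IndexHit> hits, Predicate<IndexHit> relevant) {
        examined = 0;
        List<IndexHit> kept = new ArrayList<>();
        for (IndexHit hit : hits) {
            examined++; // every returned hit is visited, relevant or not
            if (relevant.test(hit)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    /** Builds n synthetic hits, alternating between two type names. */
    public static List<IndexHit> sampleHits(int n) {
        List<IndexHit> hits = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String type = (i % 2 == 0) ? "hive_table" : "hive_column";
            hits.add(new IndexHit(type, "entity_" + i));
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int size : new int[] {1_000, 100_000}) {
            List<IndexHit> kept = filter(sampleHits(size), h -> h.typeName.equals("hive_table"));
            System.out.println("result set " + size + ": examined " + examined + ", kept " + kept.size());
        }
    }
}
```

The examined count always equals the size of the returned result set, so capping that size directly bounds the scan.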
The following changes will improve performance:
- Solr's max result set property is governed by atlas.graph.index.search.max-result-set-size. It makes sense to set this to a lower number.
- Modify Solr's configuration file, solrconfig.xml, to use FastLRUCache.
- Modify EntityDiscoveryService.searchUsingBasicQuery to form a query that takes additional parameters, so that the index returns fewer results to filter.
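As a rough sketch of the first and last changes, the snippet below pushes an extra filter into the index query string and clamps the requested limit to a configured maximum. It is an illustration under assumptions, not the actual change to EntityDiscoveryService: the class, both method names, and the typeName field name are hypothetical placeholders, and the real index field names in an Atlas deployment will differ.

```java
public class BoundedQuerySketch {
    /**
     * Combines the user's free-text query with an optional type filter,
     * using Lucene-style field:"value" syntax. The field name "typeName"
     * is a hypothetical placeholder, not the real Atlas index field.
     */
    public static String buildQuery(String freeText, String typeName) {
        String query = "(" + freeText + ")";
        if (typeName != null && !typeName.isEmpty()) {
            // Escape backslashes and double quotes inside the quoted phrase.
            String escaped = typeName.replace("\\", "\\\\").replace("\"", "\\\"");
            query += " AND typeName:\"" + escaped + "\"";
        }
        return query;
    }

    /**
     * Clamps the requested limit to the configured maximum. A missing or
     * non-positive request falls back to the maximum itself.
     */
    public static int clampLimit(int requested, int maxResultSetSize) {
        if (requested <= 0) {
            return maxResultSetSize;
        }
        return Math.min(requested, maxResultSetSize);
    }

    public static void main(String[] args) {
        // The maximum of 100 is an arbitrary illustrative value.
        System.out.println(buildQuery("sales", "hive_table"));
        System.out.println(clampLimit(0, 100));
    }
}
```

With the filter applied by the index and the limit bounded, the sequential scan in searchUsingBasicQuery would have far fewer hits to examine.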