[HIVE-14468] Implement Druid query based input format - ASF JIRA

The input format is responsible for generating the splits and creating the record readers.

For Timeseries, TopN, and GroupBy queries, we create a single split containing the broker address and the query. The record reader then submits the query to the broker, retrieves the results, parses them, and generates records.
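The single-split flow above can be sketched roughly as follows. This is an illustration in Python rather than the Hive implementation: `DruidQuerySplit`, `read_records`, and the injected `post_query` transport are hypothetical names, and the result shapes handled are the usual Druid JSON layouts for these three query types.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class DruidQuerySplit:
    """Hypothetical split: just the broker address and the JSON query."""
    broker_address: str
    query_json: str


def read_records(
    split: DruidQuerySplit,
    post_query: Callable[[str, str], str],
) -> Iterator[Dict]:
    """Submit the split's query and flatten the response into records.

    `post_query(broker_address, query_json)` is an assumed transport hook
    that returns the raw JSON response body as a string.
    """
    query_type = json.loads(split.query_json)["queryType"]
    results = json.loads(post_query(split.broker_address, split.query_json))
    for row in results:
        if query_type == "timeseries":
            # One result object per timestamp.
            yield {"timestamp": row["timestamp"], **row["result"]}
        elif query_type == "topN":
            # A list of result entries per timestamp.
            for entry in row["result"]:
                yield {"timestamp": row["timestamp"], **entry}
        elif query_type == "groupBy":
            # One event object per grouped row.
            yield {"timestamp": row["timestamp"], **row["event"]}
        else:
            raise ValueError(f"unsupported query type: {query_type}")
```

Injecting the transport keeps the sketch runnable without a live broker; a real reader would issue an HTTP POST to the broker instead.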

For Select queries, Druid has the concept of a threshold (limit), which in practice is used to retrieve the query results over multiple requests. Hence, we first emit a Druid Segment Metadata query to obtain the number of rows in the datasource. We then create (number of rows / default_threshold) splits, where default_threshold is a Hive configuration property defined as hive.druid.select.threshold. Each generated split contains the broker address and a Select JSON query restricted to a start and end date range (currently we assume a uniform distribution of records across the time dimension). The splits are handled independently by the record readers, each of which submits its query to the broker, retrieves the results, parses them, and generates records. This way we can parallelize the retrieval of results for these queries.
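As a rough sketch of the split computation described above, the following Python divides the datasource's time range into equal sub-intervals, one per split. `make_select_splits` is a hypothetical helper, not part of Hive. Rounding the split count up and the exact fields of the Select query are assumptions. The row count is taken as an input here, standing in for the result of the Segment Metadata query.

```python
import math
from datetime import datetime, timezone
from typing import Dict, List


def make_select_splits(
    broker_address: str,
    data_source: str,
    start: datetime,
    end: datetime,
    num_rows: int,
    threshold: int,
) -> List[Dict]:
    """Build one Select query per split over equal-width time ranges.

    Assumes records are uniformly distributed over [start, end), as the
    description does. Rounding up (ceil) is an assumption of this sketch.
    """
    num_splits = max(1, math.ceil(num_rows / threshold))
    step = (end - start) / num_splits
    splits = []
    for i in range(num_splits):
        lo = start + step * i
        # Pin the last boundary to `end` to avoid rounding drift.
        hi = end if i == num_splits - 1 else start + step * (i + 1)
        query = {
            "queryType": "select",
            "dataSource": data_source,
            "granularity": "all",
            "intervals": [f"{lo.isoformat()}/{hi.isoformat()}"],
            "pagingSpec": {"pagingIdentifiers": {}, "threshold": threshold},
        }
        splits.append({"broker_address": broker_address, "query": query})
    return splits
```

For example, 40,000 rows with a threshold of 10,000 over a four-day range would yield four splits of one day each. If the data is skewed in time, some splits would return far more rows than the threshold while others return few, which is the limitation the uniform-distribution assumption acknowledges.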

Is part of: HIVE-14217 Druid integration