Hive / HIVE-14217 Druid integration / HIVE-14468

Implement Druid query based input format

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: Druid integration
    • Labels: None

    Description

      The input format is responsible for generating the splits and creating the record readers.

      • For Timeseries, TopN, and GroupBy queries: create a single split containing the broker address and the query. The record reader then submits the query to the broker, retrieves the results, and parses them into records.
      • For Select queries: Druid has the concept of a threshold (limit) in Select queries, which is used to retrieve the query results across multiple requests. Hence, we first issue a Druid Segment Metadata query to obtain the number of rows in the datasource, and then create (number of rows / default_threshold) splits, where default_threshold is the Hive configuration property hive.druid.select.threshold. Each split contains the broker address and a Select JSON query restricted to a start and end date range (currently we assume a uniform distribution of records across the time dimension). The record readers handle the splits independently: each submits its query to the broker, retrieves the results, and parses them into records. This lets us parallelize the retrieval of results for these queries.
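
The split computation for Select queries could be sketched roughly as follows. This is only an illustration of the arithmetic described above, not the code in the patch: the class SelectSplitSketch, the Split holder, and computeSplits are hypothetical names, the overall time interval is assumed to be supplied by the caller, and rounding the split count up is an assumption (the description only says number of rows / default_threshold).

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class SelectSplitSketch {

    /** Hypothetical split descriptor: broker address plus a half-open [start, end) time range. */
    public static final class Split {
        public final String brokerAddress;
        public final Instant start;
        public final Instant end;

        Split(String brokerAddress, Instant start, Instant end) {
            this.brokerAddress = brokerAddress;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Divides [start, end) into roughly numRows / threshold equal-width ranges,
     * which matches the uniform-distribution assumption from the description.
     */
    public static List<Split> computeSplits(String brokerAddress, long numRows, long threshold,
                                            Instant start, Instant end) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        if (!start.isBefore(end)) {
            throw new IllegalArgumentException("start must precede end");
        }
        long startMs = start.toEpochMilli();
        long endMs = end.toEpochMilli();
        long totalMs = endMs - startMs;

        // Assumption: round up so that no split is expected to exceed the threshold.
        long wanted = (numRows + threshold - 1) / threshold;
        // Always emit at least one split, and never more splits than milliseconds in the range.
        long numSplits = Math.max(1, Math.min(wanted, totalMs));
        long step = totalMs / numSplits;

        List<Split> splits = new ArrayList<>();
        for (long i = 0; i < numSplits; i++) {
            long lo = startMs + i * step;
            // The last split absorbs any remainder left by the integer division.
            long hi = (i == numSplits - 1) ? endMs : lo + step;
            splits.add(new Split(brokerAddress, Instant.ofEpochMilli(lo), Instant.ofEpochMilli(hi)));
        }
        return splits;
    }
}
```

For example, 25,000 rows with a threshold of 10,000 over a three-day interval would give three one-day splits. When the row count does not exceed the threshold, the sketch produces a single split covering the whole interval, mirroring the single-split case used for Timeseries, TopN, and GroupBy queries.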

            People

              Assignee: jcamacho Jesús Camacho Rodríguez
              Reporter: jcamacho Jesús Camacho Rodríguez
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: