[GRIFFIN-335] Hive Connector: Ability to Use "group by" clause - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.6.0
Fix Version/s: None
Component/s: accuracy-batch
Labels:
- columns
- groupby
- hive

Description

Background:

Refer to https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334 and https://issues.apache.org/jira/browse/GRIFFIN-333.

If we can select specific columns, that opens the door to SQL-based aggregation, which further reduces the volume of data read from Hive sources.

Proposed Improvement:

I propose allowing the Hive connector to use SQL-based aggregations.

Suppose the source and target tables contain data like the following.

src:

------------------------
|employee_id   |country|
------------------------
|1             | NZ    |
|2             | DE    |
|3             | DE    |
|4             | NZ    |
|5             | DE    |
....
....
------------------------

tgt:

------------------------
|total_employee|country|
------------------------
|10            | NZ    |
|11            | DE    |
------------------------

Then we can run an `accuracy` check [ `"rule":"src.total_employee = tgt.total_employee and src.country = tgt.country "` ] directly, using the `columns` and `groupby` settings on the source table as shown below:

      {
         "name":"src",
         "connector":{
            "type":"hive",
            "config":{
               "database":"mydatabase",
               "table.name":"mytable",
               "columns": "count(*) total_employee, country",
               "groupby": "country",
               "where":""
            }
         }
      }
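To make the intent concrete, here is a minimal sketch of how a connector might turn the config above into a Hive query. It is an illustration only, not Griffin's actual implementation: the function name `build_hive_query` is hypothetical, and `columns` and `groupby` are the settings proposed in this issue, assumed to be optional and to fall back to `*` and no grouping when empty.

```python
def build_hive_query(config: dict) -> str:
    """Assemble a SELECT statement from a connector config (hypothetical helper)."""
    # Fall back to all columns when "columns" is missing or blank.
    columns = (config.get("columns") or "").strip() or "*"
    table = f'{config["database"]}.{config["table.name"]}'
    query = f"SELECT {columns} FROM {table}"

    # "where" is optional; an empty string means no filter.
    where = (config.get("where") or "").strip()
    if where:
        query += f" WHERE {where}"

    # "groupby" is the proposed setting; it is appended after any WHERE clause.
    groupby = (config.get("groupby") or "").strip()
    if groupby:
        query += f" GROUP BY {groupby}"
    return query


# Same values as the "src" connector config in this issue.
config = {
    "database": "mydatabase",
    "table.name": "mytable",
    "columns": "count(*) total_employee, country",
    "groupby": "country",
    "where": "",
}

print(build_hive_query(config))
# SELECT count(*) total_employee, country FROM mydatabase.mytable GROUP BY country
```

With a query like this, the source data frame would already have one row per country, so the rule can compare `total_employee` and `country` against the target table directly. The sketch concatenates the config strings as-is, so it assumes the config comes from a trusted source.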

Issue Links

is a clone of

GRIFFIN-333 JDBC Connector: Ability to Use "group by" clause

Open

People

Assignee: Unassigned

Reporter: Azhar

Votes: 1

Watchers: 3

Dates

Created: 12/Jul/20 03:46

Updated: 19/Jul/20 12:04