Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.17.0
-
None
-
None
Description
Consider the below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets (Parquet + DB).
Scenario 1- Parquet UNION ALL Parquet
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs
Scenario 2 - DB query UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs
Scenario 3 - Parquet UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 1 sec
UNION ALL execution time - 20 secs
Ideally the execution time should not be more than 6 secs.
Config-
HEAP memory - 16 GB
DRILL_MAX_DIRECT_MEMORY - 32 GB
2 Drillbits
Observation-
Observed that the query is distributed in 2 NODES when we are executing individual query or executing UNION ALL between same type datasets. But query is executing only on 1 NODE when we are executing UNION ALL between 2 types datasets (like Parquet UNION ALL DB). The Union query is not being parallelized i.e. split into multiple 'Minor Fragments'
Storage-
Storage is HDFS.
Parquet file size - 849 MB
Nature of query-
Both Parquet and DB query have some filter criteria. Those doesn't have sort or join.
Time taken-
SCAN | Total | |
Parquet | 2.018s | 5.419 sec |
DB | 0.146s | 0.257 sec |
Parquet UNION ALL DB | 15.632s | 20.729 sec |
The query we used-
SELECT column1, column2,...column25
FROM dfs.`root`.`Parquet`
WHERE column1 = <> AND column2 = <> AND column3 = <>
UNION ALL
SELECT column1, column2,...column25
FROM db.`Database`
WHERE column1 = <> AND column2 = <> AND column3 = <>
Below is the Plan after redacted names-
00-00 Screen
00-01 Project(Column1=[$0],...,Column25=[$24])
00-02 UnionExchange
01-01 UnionAll(all=[true])
01-03 Project (Column1=[$0],...,Column25=[$24])
01-05 SelectionVectorRemover
01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>, numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false, filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
01-02 Project(Column1=[$0],...,Column25=[$24])
01-04 Jdbc()
json-
{
"head" :
{ "version":1, "generator" :
,
"type" : "APACHE_DRILL_PHYSICAL","options" : [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
},
"graph":[
,
{"pop":"project","initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
:4194304,"outputRowCount":1},
{"pop":"parquet-scan", "cost":
{"memoryCost":4194304,"outputRowCount":1.2287038E7}
},
{"pop":"filter" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
{"memoryCost":4194304,"outputRowCount":41468.75}
},
{"pop":"selection-vector-remover" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
{memoryCost}
:4194304,"outputRowCount":41468.75 } ,
{"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
}
{"pop":"union-all" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
} ,
{"pop":"union-exchange" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
} ,
{"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
,"outputProj":true },
{"pop":"screen" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":
}
]
}