Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Duplicate
Description
When a user connects to the Spark Thrift Server to execute SQL, predicate pushdown (PPD) is not enabled for ORC tables. The query plan ends up with a MetastoreRelation, which does not support ORC PPD. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in HiveMetastoreCatalog, so that users benefit from PPD even when connecting through the Spark Thrift Server.
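To make the benefit of PPD concrete, here is a minimal toy sketch in plain Python, not Spark or ORC code. It assumes a reader that keeps min/max statistics per stripe and skips any stripe whose range cannot contain the filter value. The `Stripe` class, both functions, and the sample dates are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for an ORC stripe: rows of one column plus min/max stats."""
    rows: List[str]
    min_value: str = field(init=False)
    max_value: str = field(init=False)

    def __post_init__(self) -> None:
        # Assumes a non-empty stripe; ISO dates compare correctly as strings.
        self.min_value = min(self.rows)
        self.max_value = max(self.rows)


def count_equal_full_scan(stripes: List[Stripe], value: str) -> Tuple[int, int]:
    """No pushdown: read every row, then filter. Returns (matches, rows_read)."""
    matches = rows_read = 0
    for stripe in stripes:
        for row in stripe.rows:
            rows_read += 1
            if row == value:
                matches += 1
    return matches, rows_read


def count_equal_with_pushdown(stripes: List[Stripe], value: str) -> Tuple[int, int]:
    """Pushdown: skip stripes whose [min, max] range excludes the value."""
    matches = rows_read = 0
    for stripe in stripes:
        if value < stripe.min_value or value > stripe.max_value:
            continue  # the statistics prove no row in this stripe can match
        for row in stripe.rows:
            rows_read += 1
            if row == value:
                matches += 1
    return matches, rows_read


# Made-up sample data: three stripes of ship dates.
stripes = [
    Stripe(["1990-01-02", "1990-02-11", "1990-03-30"]),
    Stripe(["1990-04-01", "1990-04-18", "1990-04-18"]),
    Stripe(["1990-05-05", "1990-06-17", "1990-07-09"]),
]
full = count_equal_full_scan(stripes, "1990-04-18")
pushed = count_equal_with_pushdown(stripes, "1990-04-18")
```

Both calls return the same match count, but the pushdown variant reads only the one stripe whose range covers the date. A scan that never receives the filter has no such opportunity to skip data.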
For example, for the query

explain select count(1) from tpch_flat_orc_1000.lineitem where l_shipdate = '1990-04-18'

the current plan is:

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#17L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#20L])
      :     +- Project
      :        +- Filter (l_shipdate#11 = 1990-04-18)
      :           +- INPUT
      +- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, lineitem, None

It would be good to use OrcRelation instead so that PPD is applied to ORC, which reduces the runtime by a large margin. The resulting plan:
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#70L])
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#106L])
      :     +- Project
      :        +- Filter (_col10#64 = 1990-04-18)
      :           +- INPUT
      +- Scan OrcRelation[_col10#64] InputPaths: hdfs://nn:8020/apps/hive/warehouse/tpch_1000.db/lineitem, PushedFilters: [EqualTo(_col10,1990-04-18)]
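The two plans differ only in the scan node: the first uses HiveTableScan over a MetastoreRelation, the second uses an OrcRelation scan that lists PushedFilters. As a rough way to tell them apart in captured explain output, the following Python sketch searches the plan text for those markers. The function names are hypothetical, and the check assumes the plan text format shown in this issue, which may differ in other Spark versions.

```python
import re
from typing import Optional


def pushed_filter_text(plan_text: str) -> Optional[str]:
    """Return the text inside 'PushedFilters: [...]', or None if absent.

    Assumes the bracketed filter list is the last bracketed item on its line,
    as in the plan shown in this issue.
    """
    match = re.search(r"PushedFilters:\s*\[(.*)\]", plan_text)
    return match.group(1) if match else None


def uses_orc_pushdown(plan_text: str) -> bool:
    """True if the plan scans an OrcRelation and lists at least one pushed filter."""
    filters = pushed_filter_text(plan_text)
    return "OrcRelation" in plan_text and bool(filters)


# Scan lines copied from the two plans in this issue.
plan_before = "+- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, lineitem, None"
plan_after = (
    "+- Scan OrcRelation[_col10#64] InputPaths: "
    "hdfs://nn:8020/apps/hive/warehouse/tpch_1000.db/lineitem, "
    "PushedFilters: [EqualTo(_col10,1990-04-18)]"
)

before_ok = uses_orc_pushdown(plan_before)
after_ok = uses_orc_pushdown(plan_after)
after_filters = pushed_filter_text(plan_after)
```

Here `before_ok` is false because the MetastoreRelation scan carries no pushed filters, while `after_ok` is true and `after_filters` holds the `EqualTo` predicate that was handed to the ORC reader.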
Issue Links
- duplicates SPARK-14070 Use ORC data source for SQL queries on ORC tables (Resolved)