[HAWQ-1228] Use profile based on file format in HCatalog integration (HiveRC, HiveText profiles) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1.0.0-incubating
Component/s: PXF
Labels: None

Description

To leverage the changes introduced in HAWQ-1177, extend the optimization to the other Hive profiles. Additional information (e.g. DELIMITER) needs to be included in the user metadata.

Changes needed:

Enhance the Metadata API to add new attributes: outputFormats, outputParameters;
The Hive and HiveORC profiles should support only the GPDBWritable format;
The HiveText and HiveRC profiles should support both the TEXT and GPDBWritable formats;
Unify the HiveUserData data structures so that they are the same across all Hive profiles;
The Bridge should read fragments using the optimal profile, which it reads from the fragment information;
The optimal profile should be determined by the file's input format (org.apache.hadoop.hive.ql.io.orc.OrcInputFormat - HiveORC, org.apache.hadoop.hive.ql.io.RCFileInputFormat - HiveRC, org.apache.hadoop.mapred.TextInputFormat - HiveText);
The default profile is Hive;
If a Hive table uses org.apache.hadoop.mapred.TextInputFormat but also has complex types, the Hive profile should be used (this limitation should be addressed in HAWQ-1265);
If the table is homogeneous (all input files have the same output format), the Bridge uses the table's format. Otherwise, if the table is heterogeneous, GPDBWritable should be used;
Add a new property, outputFormat, to pxf-profiles-default.xml, which specifies the default output format of a profile.
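
The selection rules listed above can be sketched roughly as follows. This is only an illustration of the described behavior, not the code from the pull request: the class name HiveProfileChooser and its method names are hypothetical, and the only facts taken from this issue are the three input format class names, the profile names, and the TEXT/GPDBWritable rules.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating the rules in this issue; not part of the PXF API. */
final class HiveProfileChooser {
    static final String ORC_INPUT_FORMAT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat";
    static final String RC_INPUT_FORMAT = "org.apache.hadoop.hive.ql.io.RCFileInputFormat";
    static final String TEXT_INPUT_FORMAT = "org.apache.hadoop.mapred.TextInputFormat";

    private HiveProfileChooser() {}

    /** Picks the profile for a fragment from its input format class name. */
    static String optimalProfile(String inputFormat, boolean hasComplexTypes) {
        if (ORC_INPUT_FORMAT.equals(inputFormat)) {
            return "HiveORC";
        }
        if (RC_INPUT_FORMAT.equals(inputFormat)) {
            return "HiveRC";
        }
        if (TEXT_INPUT_FORMAT.equals(inputFormat)) {
            // Text tables with complex types fall back to Hive (see HAWQ-1265).
            return hasComplexTypes ? "Hive" : "HiveText";
        }
        // Any other input format uses the default profile.
        return "Hive";
    }

    /** Output formats each profile supports, as listed in this issue. */
    static List<String> supportedOutputFormats(String profile) {
        if ("HiveText".equals(profile) || "HiveRC".equals(profile)) {
            return Arrays.asList("TEXT", "GPDBWritable");
        }
        return Collections.singletonList("GPDBWritable");
    }

    /**
     * Homogeneous table (every fragment reports the same output format):
     * use that format. Heterogeneous or empty: use GPDBWritable.
     */
    static String tableOutputFormat(Collection<String> fragmentOutputFormats) {
        Set<String> distinct = new HashSet<>(fragmentOutputFormats);
        return distinct.size() == 1 ? distinct.iterator().next() : "GPDBWritable";
    }
}
```

In this sketch the complex-type fallback applies only to TextInputFormat tables, because that is the only case the description names; how the actual change treats other formats is not stated here.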

Issue Links

links to: GitHub Pull Request #1076

People

Assignee: Oleksandr Diachenko

Reporter: Oleksandr Diachenko

Votes: 0

Watchers: 2

Dates

Created: 20/Dec/16 22:23

Updated: 18/Feb/17 05:26

Resolved: 31/Jan/17 20:21