Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Restricting the number of output files is a very common use case. In Pig, users currently add an ORDER BY, GROUP BY or DISTINCT with the required parallelism before the STORE to achieve it. All of these operations add unnecessary processing overhead. Ideally, the STORE statement would support the PARALLEL clause directly, and the data would be partitioned in a simpler and more efficient manner.
This JIRA is Tez-specific and requires TEZ-3865, which describes in more detail how this can be done in Tez. We will also have to add APIs to StoreFunc implementations (HCatStorer, MultiStorage, etc.) to get the partition keys used to partition the data for the STORE statement.
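For illustration, a minimal Pig Latin sketch of the workaround described above, followed by the requested form. The relation names, paths, schema and the parallelism of 10 are assumptions made up for this example. The `STORE ... PARALLEL` line is the proposed syntax, not something Pig supports today, so it is left commented out.

```pig
-- Hypothetical input; path and schema are placeholders.
A = LOAD 'input' AS (id:chararray, value:int);

-- Current workaround: add a reduce-side operator solely to control
-- the number of output files. PARALLEL sets the number of reducers,
-- and each reducer writes one output file.
B = ORDER A BY id PARALLEL 10;
STORE B INTO 'output_sorted';

-- Proposed in this issue (not valid Pig today): let STORE carry the
-- parallelism itself, without an extra ORDER BY, GROUP BY or DISTINCT.
-- STORE A INTO 'output' PARALLEL 10;
```

The ORDER BY here does work that the job does not otherwise need (sampling and sorting), which is the overhead this issue aims to remove.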
Issue Links
- requires: TEZ-3865 A new vertex manager to partition data for STORE (Open)