Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
Some of our users have a CustomPartitioner with join or group by as they know their data and know the keys to partition on. Since mapreduce provides data sorted within a reducer, they rely on that to have the data ordered as well.
For eg:
partition = group mydata by (hour, sortkey1, sortkey2, sortkey3) using MyCustomPartitioner PARALLEL 24;
The custom partitioner sends hours 0-23 to partitions 0-23, which ensures that the data is also sorted without having to do a group by.
With HCatStorer, this pattern will be used more. i.e,
partition = group mydata by (hour) using MyCustomPartitioner PARALLEL 24;
store partition into 'mydb.mytable' using HCatStorer();
instead of
store mydata into 'mydb.mytable' using HCatStorer();
where hour is the partition. The extra groupby above is to avoid having 1 file created per partition instead of 24 files per partition and concatenating them later to save namespace.