[PIG-4504] Enable Secondary key sort feature in spark mode - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: spark-branch
Component/s: spark
Labels: None

Description

Background on secondary key sort:
The MapReduce framework automatically sorts the keys generated by mappers. Before the reducers start, all intermediate (key, value) pairs generated by the mappers are sorted by key, not by value. The values passed to each reducer are not sorted and can arrive in any order. To get sorted values, we turn each (key, value) pair into a ((key, value), null) pair. We call (key, value) the compound key: key is the first key and value is the secondary key. During the shuffle, pairs with the same first key are placed in the same partition by setting PartitionerClass in the JobConf. Pairs with the same first key but different secondary keys are sorted by secondary key by setting SortComparatorClass in the JobConf. Pairs with the same first key but different secondary keys are passed to the same reduce call by setting GroupingComparatorClass in the JobConf.
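
The compound-key scheme above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the three roles (partitioner, sort comparator, grouping comparator), not Pig's or Hadoop's actual code. The names SecondarySortSketch, CompoundKey, partition, SORT, sameGroup and shuffle are hypothetical and were chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    /** Hypothetical compound key: first = original key, second = original value. */
    public static final class CompoundKey {
        final String first;
        final int second;

        public CompoundKey(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Role of PartitionerClass: only the first key decides the partition.
    public static int partition(CompoundKey k, int numPartitions) {
        return (k.first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Role of SortComparatorClass: order by first key, then by secondary key.
    public static final Comparator<CompoundKey> SORT =
            Comparator.comparing((CompoundKey k) -> k.first)
                      .thenComparingInt(k -> k.second);

    // Role of GroupingComparatorClass: keys with the same first key form one group.
    public static boolean sameGroup(CompoundKey a, CompoundKey b) {
        return a.first.equals(b.first);
    }

    /** Simulates the shuffle: partition, sort, then group into reduce inputs. */
    public static Map<String, List<Integer>> shuffle(List<CompoundKey> mapOutput,
                                                     int numPartitions) {
        List<List<CompoundKey>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (CompoundKey k : mapOutput) {
            partitions.get(partition(k, numPartitions)).add(k);
        }

        Map<String, List<Integer>> reduceInputs = new LinkedHashMap<>();
        for (List<CompoundKey> p : partitions) {
            p.sort(SORT);
            CompoundKey prev = null;
            List<Integer> values = null;
            for (CompoundKey k : p) {
                // A new reduce call starts whenever the first key changes.
                if (prev == null || !sameGroup(prev, k)) {
                    values = new ArrayList<>();
                    reduceInputs.put(k.first, values);
                }
                values.add(k.second);
                prev = k;
            }
        }
        return reduceInputs;
    }

    public static void main(String[] args) {
        List<CompoundKey> mapOutput = Arrays.asList(
                new CompoundKey("a", 3), new CompoundKey("b", 9),
                new CompoundKey("a", 1), new CompoundKey("b", 4),
                new CompoundKey("a", 2));
        // Each first key maps to its values in ascending secondary-key order.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because the partition depends only on the first key, all values for one key land in the same partition, and the sort on the full compound key delivers them to the reduce call already ordered.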

How does Pig implement secondary key sort in MapReduce mode?
In MR mode, Pig implements secondary key sort by setting GroupingComparatorClass, PartitionerClass, and SortComparatorClass in JobControlCompiler#getJob.

An example that uses secondary key sort:
TestAccumulator#testAccumWithSort

Currently, the secondary key sort feature is not implemented in Spark mode.

Attachments

PIG-4504_2.patch (15/Apr/15 05:58, 34 kB, liyunzhang)
PIG-4504_3.patch (28/Apr/15 01:12, 34 kB, liyunzhang)
PIG-4504_4.patch (29/Apr/15 00:47, 34 kB, liyunzhang)
PIG-4504_5.patch (01/May/15 11:00, 42 kB, liyunzhang)
PIG-4504_6.patch (08/May/15 03:43, 50 kB, liyunzhang)
PIG-4504_7.patch (13/May/15 07:11, 49 kB, liyunzhang)
PIG-4504.patch (13/Apr/15 09:42, 27 kB, liyunzhang)
SecondaryKeySort_design_doc (1).docx (15/Apr/15 05:58, 21 kB, liyunzhang)
Why_need_split_PoLocalRearrange_POGlobalRearrange_POPackage_into_two_SparkNodes_in_sparkPlan.docx (15/Apr/15 00:47, 73 kB, liyunzhang)

Issue Links

is related to: SPARK-3655 Support sorting of values in addition to keys (i.e. secondary sort) (Resolved)

links to: review board

People

Assignee: liyunzhang

Reporter: liyunzhang

Votes: 0

Watchers: 3

Dates

Created: 13/Apr/15 06:53

Updated: 21/Jun/17 09:18

Resolved: 15/May/15 13:00