Pig / PIG-4059 Pig on Spark / PIG-4504

Enable Secondary key sort feature in spark mode

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None

    Description

      Background on secondary key sort:
      The MapReduce framework automatically sorts the keys emitted by mappers: before the reducers start, all intermediate (key, value) pairs are sorted by key, not by value. The values passed to each reducer are not sorted and may arrive in any order. Secondary key sort works around this by turning each (key, value) pair into ((key, value), null). Here (key, value) is the compound key; key is the first key and value is the secondary key. During the shuffle, three settings in the JobConf control how compound keys are handled. PartitionerClass sends pairs with the same first key to the same partition. SortComparatorClass sorts pairs with the same first key by their secondary key. GroupingComparatorClass passes pairs with the same first key but different secondary keys to the same reduce call.
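
      The three roles described above can be illustrated with a minimal, framework-free sketch. It does not use Hadoop or Pig classes; the names SecondarySortSketch, CompoundKey, partition, SORT, GROUP and shuffle are hypothetical and exist only for this illustration. It simulates a shuffle in plain Java collections under the assumption that keys are strings and values are integers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java simulation, not Pig's or Hadoop's implementation.
class SecondarySortSketch {

    /** Compound key: 'first' is the original key, 'second' is the original value. */
    static final class CompoundKey {
        final String first;
        final int second;

        CompoundKey(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Role of the partitioner: pick the partition from the first key only,
    // so all pairs sharing a first key land in the same partition.
    static int partition(CompoundKey k, int numPartitions) {
        return Math.floorMod(k.first.hashCode(), numPartitions);
    }

    // Role of the sort comparator: order by first key, then by secondary key.
    static final Comparator<CompoundKey> SORT =
            Comparator.comparing((CompoundKey k) -> k.first)
                      .thenComparingInt(k -> k.second);

    // Role of the grouping comparator: compound keys with the same first key
    // compare as equal, so they are handed to the same reduce call.
    static final Comparator<CompoundKey> GROUP =
            Comparator.comparing((CompoundKey k) -> k.first);

    /**
     * Simulates the shuffle. Returns one map per partition; each map entry
     * stands for one reduce call (first key -> secondary keys in sorted order).
     */
    static List<Map<String, List<Integer>>> shuffle(List<CompoundKey> mapOutput,
                                                    int numPartitions) {
        List<List<CompoundKey>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (CompoundKey k : mapOutput) {
            partitions.get(partition(k, numPartitions)).add(k);
        }

        List<Map<String, List<Integer>>> result = new ArrayList<>();
        for (List<CompoundKey> p : partitions) {
            p.sort(SORT);
            Map<String, List<Integer>> reduceCalls = new LinkedHashMap<>();
            CompoundKey prev = null;
            List<Integer> current = null;
            for (CompoundKey k : p) {
                // Start a new reduce call whenever the grouping comparator
                // reports a different group than the previous key.
                if (prev == null || GROUP.compare(prev, k) != 0) {
                    current = new ArrayList<>();
                    reduceCalls.put(k.first, current);
                }
                current.add(k.second);
                prev = k;
            }
            result.add(reduceCalls);
        }
        return result;
    }

    public static void main(String[] args) {
        List<CompoundKey> mapOutput = List.of(
                new CompoundKey("a", 3), new CompoundKey("b", 2),
                new CompoundKey("a", 1), new CompoundKey("b", 5),
                new CompoundKey("a", 2));
        // With a single partition, each reduce call sees its values already sorted.
        System.out.println(shuffle(mapOutput, 1));
    }
}
```

      Because the sort comparator looks at both parts of the compound key while the grouping comparator looks only at the first, each simulated reduce call receives its secondary keys in ascending order without sorting them itself.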

      How does Pig implement secondary key sort in MapReduce mode?
      In MR mode, Pig implements secondary key sort by setting GroupingComparatorClass, PartitionerClass and SortComparatorClass in JobControlCompiler#getJob.

      An example that uses secondary key sort:
      TestAccumulator#testAccumWithSort

      Currently, the secondary key sort feature is not implemented in Spark mode.

      Attachments

        1. PIG-4504_2.patch
          34 kB
          liyunzhang
        2. PIG-4504_3.patch
          34 kB
          liyunzhang
        3. PIG-4504_4.patch
          34 kB
          liyunzhang
        4. PIG-4504_5.patch
          42 kB
          liyunzhang
        5. PIG-4504_6.patch
          50 kB
          liyunzhang
        6. PIG-4504_7.patch
          49 kB
          liyunzhang
        7. PIG-4504.patch
          27 kB
          liyunzhang
        8. SecondaryKeySort_design_doc (1).docx
          21 kB
          liyunzhang
        9. Why_need_split_PoLocalRearrange_POGlobalRearrange_POPackage_into_two_SparkNodes_in_sparkPlan.docx
          73 kB
          liyunzhang

            People

              Assignee: kellyzly liyunzhang
              Reporter: kellyzly liyunzhang
              Votes: 0
              Watchers: 3