  1. Pig
  2. PIG-4344

Add a testcase with CustomPartitioner that tests ordering within a reducer

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Some of our users specify a custom partitioner with join or group by because they know their data and which keys to partition on. Since MapReduce delivers data sorted by key within each reducer, they also rely on that to get ordered output.

      For example:
      partition = group mydata by (hour, sortkey1, sortkey2, sortkey3) using MyCustomPartitioner PARALLEL 24;

      The custom partitioner sends hours 0-23 to partitions 0-23. Because each reducer then receives a single hour with its keys already sorted, the data also comes out ordered without any additional sorting step.
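
      A minimal sketch of the property such a test case would need to check, written as a plain Java simulation rather than against the Pig or Hadoop APIs. Everything here is an assumption made for illustration: the class name HourPartitionDemo, the partitionFor function standing in for MyCustomPartitioner, and the use of a two-field integer key (hour, sortkey) in place of the four-field key above. It routes each key to the partition matching its hour, sorts each partition by key as the shuffle would, and then checks that reading the partitions in order yields fully sorted data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HourPartitionDemo {
    // Hypothetical stand-in for MyCustomPartitioner: hour h goes to partition h.
    public static int partitionFor(int hour, int numPartitions) {
        return hour % numPartitions;
    }

    // Orders (hour, sortkey) keys field by field, mimicking the per-reducer key sort.
    public static final Comparator<int[]> KEY_ORDER =
            Comparator.<int[]>comparingInt(k -> k[0]).thenComparingInt(k -> k[1]);

    // Simulates the shuffle: assign every key to a partition, then sort each partition.
    public static List<List<int[]>> shuffle(List<int[]> keys, int numPartitions) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int[] key : keys) {
            partitions.get(partitionFor(key[0], numPartitions)).add(key);
        }
        for (List<int[]> partition : partitions) {
            partition.sort(KEY_ORDER);
        }
        return partitions;
    }

    // True if reading partitions 0..n-1 one after another never goes backwards in key order.
    public static boolean isGloballySorted(List<List<int[]>> partitions) {
        int[] previous = null;
        for (List<int[]> partition : partitions) {
            for (int[] key : partition) {
                if (previous != null && KEY_ORDER.compare(previous, key) > 0) {
                    return false;
                }
                previous = key;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Deterministic but unsorted input: hours in 0..23, sort keys in 0..49.
        List<int[]> keys = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            keys.add(new int[] {(i * 7) % 24, (i * 13) % 50});
        }
        List<List<int[]>> partitions = shuffle(keys, 24);
        System.out.println("globally sorted: " + isGloballySorted(partitions));
    }
}
```

      An actual test case for this issue would run a Pig script with a real Partitioner subclass and inspect the reducer output files; the simulation only shows which ordering property is being relied on.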

      With HCatStorer, this pattern will become more common, i.e.
      partition = group mydata by (hour) using MyCustomPartitioner PARALLEL 24;
      store partition into 'mydb.mytable' using HCatStorer();
      instead of
      store mydata into 'mydb.mytable' using HCatStorer();

      where hour is the partition. The extra group by ensures that one file is created per partition, rather than 24 files per partition that would have to be concatenated later to save namespace.

          People

            Assignee: Unassigned
            Reporter: Rohini Palaniswamy (rohini)
            Votes: 0
            Watchers: 1
