Pig / PIG-1264

Skewed join sampler misses out the key with the highest frequency

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0

    Description

      I am noticing two issues with the sampler used in skewed join:
      1. It does not allocate multiple reducers to the key with the highest frequency.
      2. It seems to be allocating the same number of reducers to every key (8 in this case).

      Query:

      a = load 'studenttab10k' using PigStorage() as (name, age, gpa);
      b = load 'votertab10k' as (name, age, registration, contributions);
      e = join a by name right, b by name using "skewed" parallel 8;
      store e into 'SkewedJoin_9.out';
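
      To confirm which join key is actually the most frequent, the key
      distribution of an input can be inspected directly. The script below is
      a minimal diagnostic sketch, not part of the original report: the
      aliases v, g, freq, ranked and top_keys are made up for illustration,
      and it assumes the same 'votertab10k' input and schema as the query
      above. The same check can be run against 'studenttab10k'.

```pig
-- Count how many tuples carry each join key (name).
v = load 'votertab10k' as (name, age, registration, contributions);
g = group v by name;
freq = foreach g generate group as name, COUNT(v) as cnt;

-- Rank keys by frequency and show the ten most frequent ones.
ranked = order freq by cnt desc;
top_keys = limit ranked 10;
dump top_keys;
```

      The key at the top of this listing is the one that would be expected
      to receive more than one reducer in the skewed join.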

          People

            Assignee: Richard Ding
            Reporter: Sriranjan Manjunath