Pig / PIG-460

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • 0.2.0

    Description

      Currently order by is done in three MR jobs:

      job 1: read data with whatever loader the user requests, store using BinStorage
      job 2: load using RandomSampleLoader, find quantiles
      job 3: load data again and sort

      It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it.

      If the logic in RandomSampleLoader were moved into an operator instead of living in a loader, it would no longer depend on the BinStorage format, and jobs 1 and 2 could be merged. On average, job 1 takes about 15% of the run time of an order by script.
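
To illustrate the proposed two-job flow, here is a minimal sketch in Python. It is not Pig's implementation. All names (`reservoir_sample`, `find_quantiles`, `partition_for`, `order_by`) are hypothetical, the `load` callable stands in for whatever loader the user requests, and reservoir sampling is an assumption, since this issue does not say how RandomSampleLoader picks its sample. The point is only that a sampler working on a stream of records, rather than on a storage format, can run directly on the loader's output.

```python
import random
from bisect import bisect_right


def reservoir_sample(records, k, rng):
    """Keep a uniform random sample of up to k records from any iterable."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from the sorted sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Map a key to the index of the range partition that contains it."""
    return bisect_right(boundaries, key)


def order_by(load, num_partitions, sample_size, seed=0):
    """Hypothetical two-pass order by: sample and find quantiles, then sort."""
    rng = random.Random(seed)
    # Pass 1 (jobs 1 and 2 merged): sample straight from the loader output.
    sample = reservoir_sample(load(), sample_size, rng)
    boundaries = find_quantiles(sample, num_partitions)
    # Pass 2 (job 3): load again, route records to range partitions, sort each.
    partitions = [[] for _ in range(num_partitions)]
    for record in load():
        partitions[partition_for(record, boundaries)].append(record)
    return [r for part in partitions for r in sorted(part)]
```

Because the partitions cover disjoint, increasing key ranges, concatenating the individually sorted partitions yields a globally sorted result. The sample only affects how evenly the records are spread across partitions, not correctness.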

      Attachments

        1. sampler2.patch
          23 kB
          Amir Youssefi
        2. sampler.patch
          22 kB
          Alan Gates

          People

            pkamath Pradeep Kamath
            gates Alan Gates
