Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4485

Can Pig disable RandomSampleLoader when doing "Order by"

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 0.13.0
    • None
    • None
    • None

    Description

      When reading parquet files with "order by":

      a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
      b = order a by col1 ;
      c = limit b 100 ;
      dump c
      

      Pig spawns a Sampler job always in the begining:

      Job Stats (time in seconds):
      JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
      job_1426804645147_1270	1	1	8	8	8	8	4	4	4	4	b	SAMPLER
      job_1426804645147_1271	1	1	10	10	10	10	4	4	4	4	b	ORDER_BY,COMBINER
      job_1426804645147_1272	1	1	2	2	2	2	4	4	4	4	b		hdfs:/tmp/temp-xxx/tmp-xxx,
      

      The issue is when reading lots of files, the first sampler job can take a long time to finish.

      The ask is:
      1. Is the sampler job a must to implement "order by"?
      2. If no, is there any way to disable RandomSampleLoader manually?

      Thanks.

      Attachments

        Activity

          People

            Unassigned Unassigned
            haozhu Hao Zhu
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: