Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
0.13.0
-
None
-
None
-
None
Description
When reading parquet files with "order by":
a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
b = order a by col1 ;
c = limit b 100 ;
dump c
Pig spawns a Sampler job always in the begining:
Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_1426804645147_1270 1 1 8 8 8 8 4 4 4 4 b SAMPLER job_1426804645147_1271 1 1 10 10 10 10 4 4 4 4 b ORDER_BY,COMBINER job_1426804645147_1272 1 1 2 2 2 2 4 4 4 4 b hdfs:/tmp/temp-xxx/tmp-xxx,
The issue is when reading lots of files, the first sampler job can take a long time to finish.
The ask is:
1. Is the sampler job a must to implement "order by"?
2. If no, is there any way to disable RandomSampleLoader manually?
Thanks.