[PIG-733] Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on large input - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.2.0
Fix Version/s: 0.3.0
Component/s: None
Labels: None

Description

Order by runs a sampling job that samples the input and creates a sorted list of sample items. Currently 100 items are sampled per map task, so if the input is large enough to produce many maps (say 50,000), the sample is big: 100 x 50,000 = 5,000,000 items. This sorted sample is stored on DFS. In the final map reduce job, the WeightedRangePartitioner in each map reads the samples file from DFS and computes the quantile boundaries and the weighted probabilities for repeating values. In queries with many maps (on the order of 50,000), the DFS read of the samples file fails with a "FileSystem closed" error. This seems to point to a DFS issue in which a big DFS file being read simultaneously by many DFS clients (in this case all the maps) causes the clients to be closed.

Independently of that DFS issue, loading the sample in each map of the final map reduce job and computing the quantile boundaries and weighted probabilities there is inefficient on the Pig side. We should do this computation once, through a FindQuantiles UDF, in the same map reduce job that produces the sorted samples. This way less data is written to DFS, and in the final map reduce job the WeightedRangePartitioner only needs to load the precomputed information.
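To make the proposed computation concrete, here is a minimal sketch in Python of deriving quantile boundaries and weighted probabilities from an already-sorted sample. It is not the code from the attached patch: the function name `find_quantiles`, its parameters, and the rule of cutting the sample into evenly sized slices are assumptions chosen for illustration only.

```python
from collections import defaultdict


def find_quantiles(sorted_samples, num_partitions):
    """Hypothetical helper: summarize a sorted sample for a range partitioner.

    Returns (boundaries, weights) where
      boundaries: num_partitions - 1 cut points taken at evenly spaced
                  positions of the sorted sample
      weights:    for each key whose samples span more than one slice,
                  a dict {partition_index: probability}
    """
    n = len(sorted_samples)
    if n == 0 or num_partitions < 1:
        raise ValueError("need a non-empty sample and at least one partition")

    # Cut points at positions n/p, 2n/p, ... of the sorted sample.
    boundaries = [sorted_samples[(i * n) // num_partitions]
                  for i in range(1, num_partitions)]

    # Count how many samples of each key land in each evenly sized slice.
    counts = defaultdict(lambda: defaultdict(int))
    for idx, key in enumerate(sorted_samples):
        counts[key][(idx * num_partitions) // n] += 1

    # Only keys that repeat across slices need a probability distribution.
    weights = {}
    for key, per_part in counts.items():
        if len(per_part) > 1:
            total = sum(per_part.values())
            weights[key] = {p: c / total for p, c in sorted(per_part.items())}
    return boundaries, weights


if __name__ == "__main__":
    boundaries, weights = find_quantiles([1, 2, 5, 5, 5, 5, 7, 9], 4)
    print(boundaries)  # [5, 5, 7]
    print(weights)     # {5: {1: 0.5, 2: 0.5}}
```

In this sketch the output grows with the number of partitions and the number of repeating keys rather than with the number of maps, which is the reason for computing it once instead of having every map re-read the full sample.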

Attachments

- PIG-733-v2.patch, 06/Apr/09 20:58, 28 kB, Pradeep Kamath
- PIG-733.patch, 01/Apr/09 23:33, 28 kB, Pradeep Kamath

People

Assignee: Pradeep Kamath

Reporter: Pradeep Kamath

Votes: 0

Watchers: 0

Dates

Created: 25/Mar/09 19:04

Updated: 24/Mar/10 22:10

Resolved: 10/Apr/09 02:36