Details
Description
PIG-4443 spills the input splits to disk if the serialized split size exceeds a threshold. This runs into a problem with relocalization when more than one vertex has a job.split file: on container reuse, if a job.split file is already present in the container, it is reused instead of being replaced, so the wrong data is read.
We need either a way to turn off relocalization, or a check of the source and timestamp so that the file is redownloaded during relocalization when they differ.
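A minimal sketch of the second option (compare source and timestamp, redownload on mismatch) might look like the following. It is not Tez or Pig code: the class SplitFileLocalizer, its methods, and the example paths are hypothetical names chosen for illustration, and the actual download is left as a placeholder comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: decides whether a localized file may be reused on container reuse. */
final class SplitFileLocalizer {

    /** Identity of a remote resource: where it came from and when it was last modified. */
    static final class ResourceKey {
        final String sourceUri;
        final long modificationTime;

        ResourceKey(String sourceUri, long modificationTime) {
            this.sourceUri = sourceUri;
            this.modificationTime = modificationTime;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ResourceKey)) return false;
            ResourceKey other = (ResourceKey) o;
            return modificationTime == other.modificationTime
                    && sourceUri.equals(other.sourceUri);
        }

        @Override
        public int hashCode() {
            return Objects.hash(sourceUri, modificationTime);
        }
    }

    // Local file name (e.g. "job.split") -> identity of the copy currently on disk.
    private final Map<String, ResourceKey> localized = new HashMap<>();
    private int downloads = 0;

    /**
     * Returns true if the file had to be (re)downloaded,
     * false if the existing local copy could be reused.
     */
    boolean localize(String localName, String sourceUri, long modificationTime) {
        ResourceKey wanted = new ResourceKey(sourceUri, modificationTime);
        ResourceKey present = localized.get(localName);
        if (wanted.equals(present)) {
            return false; // same source and timestamp: the local copy is the right one
        }
        // Placeholder: a real implementation would fetch sourceUri and
        // overwrite the local file here.
        localized.put(localName, wanted);
        downloads++;
        return true;
    }

    int downloadCount() {
        return downloads;
    }
}
```

With this check, a reused container that already holds a job.split from one vertex would fetch a fresh copy when a second vertex asks for a job.split with a different source path or a newer timestamp, and would skip the download only when both match.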
Attachments
Issue Links
- breaks: PIG-4443 Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits (Closed)